
Model Evaluation & Optimization

Learn how to measure performance and improve your models

25-30 minutes · Advanced Level · 8 Quiz Questions

Why Evaluation Matters

Building a model is only half the battle. You need to know: Is it actually good? How can you make it better? Model evaluation provides the answers through systematic measurement of performance.

Think of it like being a coach - you need metrics to track player performance, identify weaknesses, and develop training strategies. In ML, evaluation metrics serve the same purpose for your models.

Classification Metrics

For problems where you predict categories (spam/not spam, cat/dog, disease/healthy):

Essential Classification Metrics

📊 Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)

Percentage of correct predictions. Simple but can be misleading with imbalanced data.

🎯 Precision

Formula: TP / (TP + FP)

Of all positive predictions, how many were actually correct? Important when false positives are costly.

🔍 Recall (Sensitivity)

Formula: TP / (TP + FN)

Of all actual positives, how many did we correctly identify? Critical when missing positives is dangerous.

⚖️ F1-Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of precision and recall. Balances both metrics.
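scikit-learn provides these metrics directly. A minimal sketch, assuming scikit-learn is installed; the label lists below are invented placeholders, not data from this lesson:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual labels (1 = positive class)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

print("Accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of P and R
```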

🔢 MNIST Digit "8" Classification Matrix

Consider a model that classifies 1,000 MNIST test images as the digit "8" versus any other digit:

  • True Positives (TP): model said "8", and it actually was "8"
  • False Positives (FP): model said "8", but it was another digit
  • False Negatives (FN): model said another digit, but it actually was "8"
  • True Negatives (TN): model correctly identified other digits

| Actual \ Predicted | Other (0, 1, 2, 3, 4, 5, 6, 7, 9) | Digit "8" |
| --- | --- | --- |
| Other digits | 942 (TN) | 8 (FP) |
| Digit "8" | 3 (FN) | 47 (TP) |

Resulting metrics:

  • 📊 Overall Accuracy: 98.9% (correct predictions / all predictions)
  • 🎯 "8" Precision: 85.5% (when the model says "8", how often is it correct?)
  • 🔍 "8" Recall: 94.0% (of all actual "8"s, how many were found?)
  • ⚖️ F1-Score: 89.5% (balance of precision and recall)

💡 MNIST Insight: In digit recognition, high precision means when the model says "8", it's usually right. High recall means the model catches most of the actual "8"s in the dataset.
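The percentages above follow directly from the four counts in the matrix. A quick sketch in plain Python using the same TP/FP/FN/TN values:

```python
# Reproducing the metrics above from the raw confusion-matrix counts.
TP, FP, FN, TN = 47, 8, 3, 942

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # 0.989 -> 98.9%
precision = TP / (TP + FP)                                 # 0.855 -> 85.5%
recall    = TP / (TP + FN)                                 # 0.940 -> 94.0%
f1        = 2 * precision * recall / (precision + recall)  # 0.895 -> 89.5%

print(f"Accuracy {accuracy:.1%}, Precision {precision:.1%}, "
      f"Recall {recall:.1%}, F1 {f1:.1%}")
```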

Regression Metrics

For predicting continuous values (house prices, temperature, stock prices):

Mean Absolute Error (MAE)

Average absolute difference between predictions and actual values

Advantage: Easy to interpret, same units as target

Mean Squared Error (MSE)

Average squared difference - heavily penalizes large errors

Advantage: Smooth gradient for optimization

R² Score (Coefficient of Determination)

Proportion of variance in the target explained by the model. A perfect model scores 1, predicting the mean scores 0, and models worse than predicting the mean can score below 0.

Advantage: Scale-independent, easy to interpret
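All three regression metrics are available in scikit-learn. A minimal sketch, assuming scikit-learn is installed; the house prices are invented placeholders:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [250_000, 310_000, 180_000, 420_000]  # actual house prices
y_pred = [260_000, 295_000, 200_000, 405_000]  # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))  # average |error|, same units as target
print("MSE:", mean_squared_error(y_true, y_pred))   # average squared error, penalizes outliers
print("R2 :", r2_score(y_true, y_pred))             # fraction of variance explained
```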

Cross-Validation: Robust Evaluation

Never trust a model evaluated on just one split of data. Cross-validation provides more reliable performance estimates:

K-Fold Cross-Validation Process:

  1. Split data into k equal parts (folds)
  2. Train on k-1 folds, test on remaining fold
  3. Repeat k times, using each fold as test set once
  4. Average the k performance scores
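In scikit-learn, cross_val_score performs all four steps in one call. A minimal sketch, assuming scikit-learn is installed, using its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 runs the steps above: 5 folds, 5 train/test rounds, 5 scores.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean +/- std:", scores.mean(), scores.std())
```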

Hyperparameter Optimization

Fine-Tuning Your Model

Hyperparameters are settings you choose before training (learning rate, number of layers, etc.). Finding optimal values requires systematic search:

Common Hyperparameters:

  • Learning Rate: How big steps to take during optimization
  • Batch Size: Number of examples processed together
  • Number of Epochs: How many times to see the full dataset
  • Architecture: Number of layers, neurons per layer
  • Regularization: Techniques to prevent overfitting

Search Strategies:

  • Grid Search: Test all combinations of predefined values
  • Random Search: Randomly sample hyperparameter combinations
  • Bayesian Optimization: Use previous results to guide search
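Grid search and random search are both built into scikit-learn. A minimal sketch, assuming scikit-learn is installed; the parameter ranges are illustrative choices, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

# Grid search: tries every combination (3 x 3 = 9 candidates, each cross-validated)
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print("Grid search best:", grid.best_params_, grid.best_score_)

# Random search: samples a fixed number of combinations instead of all of them
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)
print("Random search best:", rand.best_params_, rand.best_score_)
```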

Preventing Overfitting

Overfitting occurs when models memorize training data but fail on new data. Prevention strategies:

🛡️ Regularization

L1/L2 Regularization: Add penalty for large weights

Dropout: Randomly ignore neurons during training

⏰ Early Stopping

Monitor validation performance and stop training when it stops improving

📊 Data Augmentation

Create variations of training examples to increase diversity

🔄 Cross-Validation

Use multiple train/validation splits to get robust estimates
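A minimal sketch of two of these ideas, assuming scikit-learn is installed: an L2 weight penalty plus early stopping on an internal validation split. Dropout and data augmentation typically live in deep-learning frameworks and are not shown here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    StandardScaler(),                  # scaling helps gradient-based training
    SGDClassifier(
        penalty="l2", alpha=0.001,     # L2 penalty discourages large weights
        early_stopping=True,           # hold out part of the training data...
        validation_fraction=0.1,       # ...and stop when its score stops improving
        n_iter_no_change=5,
        random_state=0,
    ),
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```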

Bias-Variance Tradeoff

Understanding this fundamental tradeoff helps make better modeling decisions:

  • High Bias: Model is too simple, misses patterns (underfitting)
  • High Variance: Model is too complex, memorizes noise (overfitting)
  • Sweet Spot: Balance complexity to minimize total error
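One way to see the tradeoff is to sweep a single complexity knob and compare training and validation scores. A minimal sketch with scikit-learn's validation_curve, using decision-tree depth as the knob; the depth range is an illustrative choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
depths = [1, 2, 4, 8, 16]

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Small depth: both scores low (high bias). Large depth: training score high
    # while validation score stalls or drops (high variance). Pick the depth between.
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
```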

Model Selection Best Practices

🎯 Systematic Approach

  1. Start Simple: Begin with basic models to establish baselines
  2. Add Complexity Gradually: Increase sophistication step by step
  3. Use Validation Sets: Never optimize on test data
  4. Consider Domain Knowledge: Let expertise guide feature selection
  5. Monitor Multiple Metrics: Don't optimize for just one number
  6. Document Everything: Track experiments for reproducibility
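A minimal sketch of step 1, assuming scikit-learn is installed: compare a candidate model against a trivial majority-class baseline before trusting its score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
model = LogisticRegression(max_iter=5000)

print("Baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("Model accuracy:   ", cross_val_score(model, X, y, cv=5).mean())
```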

Knowledge Check

Test your understanding of model evaluation and optimization

1. When is precision more important than recall?

A) When false negatives are costly
B) When false positives are costly
C) When the dataset is balanced
D) Never - recall is always more important

2. What does the F1-score represent?

A) The arithmetic mean of precision and recall
B) The harmonic mean of precision and recall
C) The maximum of precision and recall
D) The difference between precision and recall

3. What is the main advantage of k-fold cross-validation?

A) It trains models faster
B) It provides more robust performance estimates
C) It requires less data
D) It automatically optimizes hyperparameters

4. Which metric heavily penalizes large errors?

A) Mean Absolute Error (MAE)
B) Mean Squared Error (MSE)
C) Accuracy
D) R² score

5. What is overfitting?

A) Model performs poorly on both training and test data
B) Model performs well on training data but poorly on test data
C) Model is too simple to capture patterns
D) Model trains too quickly

6. Which technique randomly ignores neurons during training to prevent overfitting?

A) L2 regularization
B) Early stopping
C) Dropout
D) Data augmentation

7. In a confusion matrix, what does a False Positive represent?

A) Correctly predicted positive
B) Correctly predicted negative
C) Incorrectly predicted as positive
D) Incorrectly predicted as negative

8. What does an R² score of 0.85 mean?

A) 85% accuracy
B) 85% of variance is explained by the model
C) 15% error rate
D) 85% precision
