AI Core Concepts (Part 19): Model Evaluation

Model Evaluation is the process of assessing how well your machine learning or deep learning model performs on unseen data. This ensures your model generalizes beyond the training data and performs reliably in production.


1. Why Model Evaluation Matters

A high training score alone tells you little: a model can memorize its training data and still fail on new inputs. Evaluating on data the model has never seen is the only way to check that it generalizes and will perform reliably in production.

2. Key Concepts

| Term | Description |
| --- | --- |
| Training set | Used to train the model |
| Validation set | Used for tuning (cross-validation, early stopping) |
| Test set | Final performance check; never touched during training |
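
A quick way to carve out all three sets with scikit-learn is to split twice. The 60/20/20 proportions and the iris dataset here are illustrative, not prescriptive:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set first...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ...then split the remaining 80% into train and validation (0.25 * 0.8 = 20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```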

3. Common Evaluation Metrics

For Classification

| Metric | Use Case |
| --- | --- |
| Accuracy | Percentage of correct predictions |
| Precision | Correct positive predictions / all predicted positives |
| Recall | Correct positive predictions / all actual positives |
| F1 Score | Harmonic mean of precision and recall |
| ROC-AUC | Area under the ROC curve (binary classifiers) |
scikit-learn's `classification_report` prints precision, recall, and F1 for each class:

```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1]  # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions

print(classification_report(y_true, y_pred))
```
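
ROC-AUC is computed from predicted probabilities rather than hard labels; a minimal sketch with made-up scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]  # hypothetical probabilities for the positive class

print("ROC-AUC:", roc_auc_score(y_true, y_score))
```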

For Regression

| Metric | Description |
| --- | --- |
| MAE (Mean Absolute Error) | Average absolute difference between predictions and targets |
| MSE (Mean Squared Error) | Average squared difference; penalizes large errors |
| RMSE | Square root of MSE, in the same units as the target |
| R² Score | Proportion of variance explained by the model |
```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]  # actual target values
y_pred = [2.5, 0.0, 2.1, 7.8]  # model predictions

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE: {mse:.2f}, R²: {r2:.2f}")
```

4. Cross-Validation

Cross-validation estimates how well the model generalizes by training and scoring it on several different splits of the data. The most common form is k-fold cross-validation:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()

# Train and score the model on 5 different train/test splits
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validation scores:", scores)
print("Mean CV score:", scores.mean())
```

5. Evaluation for Deep Learning Models

Use validation accuracy and loss curves during training to spot overfitting.

```python
import matplotlib.pyplot as plt

# `history` is the History object Keras returns from fit();
# val_accuracy requires validation_data or validation_split to be set.
history = model.fit(...)

plt.plot(history.history['accuracy'], label='Train Acc')
plt.plot(history.history['val_accuracy'], label='Val Acc')
plt.legend()
plt.title("Training vs Validation Accuracy")
plt.show()
```
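
The same plot for loss is often more revealing: training loss falling while validation loss rises is the classic overfitting signature. This sketch assumes the same `history` object as above:

```python
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.legend()
plt.title("Training vs Validation Loss")
plt.show()
```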

Also consider early stopping on the validation loss, and a final check on the untouched test set once training is done.


6. Evaluation in NLP / LLMs

| Task | Metric |
| --- | --- |
| Text classification | Accuracy, F1 |
| Translation | BLEU score |
| Summarization | ROUGE score |
| Q&A / Chatbot | Human evals, BLEU, GPT-based grading |
The `rouge-score` package compares a generated text against a reference:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# score(target, prediction): the first argument is the reference text
scores = scorer.score("The cat sat on the mat.", "The cat is sitting on the mat.")

print(scores)
```
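
A BLEU sketch with NLTK (assumes the `nltk` package is installed; smoothing keeps short sentences from scoring zero):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of tokenized references
candidate = ["the", "cat", "is", "sitting", "on", "the", "mat"]

smooth = SmoothingFunction().method1
print("BLEU:", sentence_bleu(reference, candidate, smoothing_function=smooth))
```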

For generative tasks, automatic metrics often fall short. Human evaluation is still the gold standard.


7. A/B Testing in Production

Once a model is deployed, evaluate it with live metrics such as click-through rate, conversion, latency, and user feedback.

A/B tests help compare models in real environments: route a slice of live traffic to the new model, keep the rest on the current one, and compare the metrics between the two groups.
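
A minimal sketch of deciding an A/B test on conversion rate with a two-proportion z-test, assuming `statsmodels` is available (all counts below are made up):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical outcomes: conversions and impressions for model A vs model B
conversions = [120, 145]
impressions = [2400, 2400]

stat, p_value = proportions_ztest(conversions, impressions)
print(f"z = {stat:.2f}, p = {p_value:.3f}")  # small p suggests a real difference
```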


8. Tools for Evaluation

scikit-learn's `sklearn.metrics` module covers most classical metrics, TensorBoard visualizes training curves, and experiment trackers such as MLflow or Weights & Biases record metrics across runs.
