Classification Metrics
Suppose I build a machine learning model.
It gives predictions.
How do I know if my model is GOOD or BAD?
- By checking accuracy
- By seeing correct predictions
Is being correct most of the time always enough?
Why Accuracy Alone Fails
Imagine a hospital uses an AI model to detect a serious disease.
- Total patients = 1000
- Actually sick = 10
- Healthy = 990
Model says:
Everyone is healthy
How many predictions are correct?
- Correct = 990
- Accuracy = 99%
This model has 99% accuracy…
but it FAILED to detect even one sick patient.
So accuracy alone cannot tell us the full story.
High accuracy ≠ Good model
We don’t just want to know how many predictions were correct,
we want to know HOW the model is making mistakes.
This is why we need Classification Metrics.
WHY Classification Metrics Exist
Classification metrics help us:
- Measure model performance properly
- Understand different types of errors
- Decide whether the model is useful in real life
Why We Need Classification Metrics
- Accuracy alone can be misleading
- Different mistakes have different impact
- Some problems need confidence, some need coverage
- Real-world decisions depend on error type
To evaluate any machine learning model,
we use something called Performance Metrics.
Performance metrics are numerical measures
that tell us how well a model is performing.
Performance Metrics = Model Evaluation Measures
For classification problems, we use classification performance metrics.
What Exactly Do Classification Metrics Measure?
Classification metrics do NOT only count correctness.
They measure:
- How many predictions were correct
- How many positives were detected
- How many false alarms occurred
- How many real cases were missed
Classification metrics analyze model errors in detail
Before calculating any metric,
we must first understand what kinds of predictions a model makes.
For that, we use something called the Confusion Matrix.
All classification performance metrics
are derived from the confusion matrix.
First we understand how predictions are classified
then we measure how good those classifications are.
When we build a classification model, its job is simple:
- Predict YES or NO
- Predict Spam or Not Spam
- Predict Sick or Healthy
- Predict Pass or Fail
Once predictions are made, a natural question arises:
How good is our classification model?
At first glance, the answer seems easy:
“Just check how many predictions are correct.”
This idea leads to accuracy
But this is where the problem starts.
Spam Filter Example (Very Important)
Suppose your spam filter checks 100 emails.
At first glance:
- 95 emails are classified correctly
- Accuracy = 95%
Sounds great, right?
Now think carefully.
Case 1: Most Emails Are Spam
Out of 100 emails:
- 90 are actually spam
- 10 are legitimate
Model prediction:
- Marks all 100 emails as spam
Result:
- Spam emails caught → 90 (correct)
- Important emails blocked → 10 (wrong)
Accuracy = 90 / 100 = 90%
But ask yourself:
Would you use a spam filter that blocks important emails?
No.
Case 2: Very Few Spam Emails (More Realistic)
Out of 100 emails:
- 5 are spam
- 95 are legitimate
Model prediction:
- Again marks all 100 as spam
Result:
- Correct = 5
- Wrong = 95
Accuracy = 5 / 100 = 5%
Now the model is clearly terrible.
What Did We Learn?
Accuracy only answers:
“How many predictions were correct?”
But it does NOT answer:
- What kind of mistakes were made
- Whether those mistakes are acceptable
So the real question is:
What type of mistakes is the model making?
This is why we need other classification metrics.
Confusion Matrix – The Foundation of All Metrics
Before calculating any metric, we must first understand how predictions are classified.
This is done using the Confusion Matrix.
What Is a Confusion Matrix?
A confusion matrix is a 2×2 table that compares:
- Actual values (ground truth)
- Predicted values (model output)
It shows all possible outcomes of classification.
The Four Outcomes
| Prediction | Actual Positive | Actual Negative |
| --- | --- | --- |
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |
Let’s understand each clearly:
- True Positive (TP): Model says YES, reality is YES.
  Example: Sick person correctly identified
- True Negative (TN): Model says NO, reality is NO.
  Example: Healthy person correctly cleared
- False Positive (FP): Model says YES, reality is NO.
  Example: Healthy person marked sick (false alarm)
- False Negative (FN): Model says NO, reality is YES.
  Example: Sick person marked healthy (dangerous)
Every classification metric is calculated using TP, FP, FN, and TN.
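To make this concrete, here is a minimal Python sketch (assuming scikit-learn is installed; the labels are made up for illustration) that builds a confusion matrix and reads off the four counts:

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = positive/sick, 0 = negative/healthy)
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

# For binary labels, ravel() returns the four cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=2, FP=1, FN=1, TN=4
```

Every prediction falls into exactly one of these four cells, which is why all later metrics can be computed from them.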
Accuracy – The Simplest Metric
What Is Accuracy?
Accuracy measures:
How many predictions were correct overall
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Example: Student Pass/Fail Prediction
- Total students = 100
- Correct predictions = 87
Accuracy = 87%
Meaning:
The model is correct 87% of the time.
When Accuracy Works Well
Accuracy is useful when:
- Dataset is balanced
- Both types of errors are equally serious
When Accuracy Lies 🚨
Fraud Detection Example
- 1000 transactions
- Only 10 are fraud
- Model predicts all as normal
Accuracy = 99%
But:
- Fraud detected = 0
A 99% accurate model that catches zero fraud is useless.
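A quick sketch of this fraud scenario in plain Python (the numbers simply mirror the example above):

```python
# 1000 transactions: 10 fraudulent (1), 990 normal (0)
y_true = [1] * 10 + [0] * 990
# Model predicts every transaction as normal
y_pred = [0] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.0%}")        # 99%
print(f"Frauds caught: {frauds_caught}")  # 0
```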
Precision
Problem: Restaurant Review Sentiment
Classes:
- Positive → YES
- Negative → NO
Step 1: What the MODEL Predicted
“Model predicts 80 reviews as POSITIVE”
So for these 80 reviews:
- Model said YES (Positive)
Step 2: What is the REALITY for These 80 Reviews
Out of these 80 predicted-positive reviews:
- 70 are actually Positive
- 10 are actually Negative
Step 3: Now Assign TP / FP / TN / FN
Case A: 70 reviews
- Predicted: Positive (YES)
- Actual: Positive (YES)
True Positive (TP = 70)
Case B: 10 reviews
- Predicted: Positive (YES)
- Actual: Negative (NO)
This is a False Positive (FP = 10)
Why?
Because:
- Model said YES
- Reality was NO
This is a false alarm.
Step 4: Precision Calculation
Precision = TP / (TP + FP) = 70 / (70 + 10) = 70 / 80 = 87.5%
Interpretation (Very Important)
When the model says a review is Positive,
it is correct 87.5% of the time.
This means:
- The model’s positive predictions are mostly trustworthy
- Only a few false alarms exist
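The same calculation as a tiny sketch in plain Python, using the TP and FP counts from this example:

```python
tp, fp = 70, 10  # from the 80 reviews the model predicted as Positive

precision = tp / (tp + fp)
print(f"Precision: {precision:.1%}")  # 87.5%
```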
Another Numerical Example: Loan Approval (High-Risk Case)
Situation
A bank uses an ML model to approve loans.
Model Output
- Predicts 50 loan applications as Approved
Reality
- 40 applicants are actually good customers
- 10 applicants are risky and default later
So:
- TP = 40
- FP = 10
Precision
Precision = TP / (TP + FP) = 40 / (40 + 10) = 80%
Meaning in Real Life
When the model approves a loan,
there is an 80% chance the customer is trustworthy.
The 20% false approvals can cause:
- Financial loss
- Legal trouble
- Reputation damage
That’s why precision is critical in loan approval systems.
Where Precision Matters Most (Very Important)
Precision is crucial when false positives are costly.
Examples:
- Spam filters: Marking an important email as spam is dangerous
- Loan approval: Approving a bad loan causes financial loss
- Product recommendations: Showing irrelevant products annoys users
- Offensive content detection: Flagging innocent posts can cause serious issues
In all these cases:
It is better to say NO than to say YES incorrectly
100% Precision but Still a Bad Model
Step 1: Actual Situation (Reality)
Suppose we have 100 emails:
- 50 are actually spam
- 50 are legitimate
So there is a lot of spam in reality.
Step 2: Model’s Behaviour (Extreme & Lazy Model)
Now imagine a very conservative spam filter that behaves like this:
“I will mark an email as spam only if I am 100% sure.
Otherwise, I will say it is legitimate.”
So what does it do?
- It predicts ONLY 1 email as spam
- That email is actually spam
All other emails (99 emails):
- Are predicted as legitimate
- Even though many of them are spam
Step 3: Now Count TP, FP, FN, TN
For the ONE predicted spam email:
- Predicted: Spam
- Actual: Spam
True Positive (TP = 1)
For predicted spam emails:
- No wrong spam predictions
False Positive (FP = 0)
So far:
TP = 1
FP = 0
Step 4: Precision Calculation
Precision = TP / (TP + FP) = 1 / (1 + 0) = 100%
So mathematically:
Precision = 100%
And this is correct mathematically.
Then Why Is the Model BAD?
Because now look at what it MISSED.
Actual spam emails = 50
Spam detected = 1
So:
- 49 spam emails were missed
- Inbox is full of spam
- Users are unhappy
This means:
- Recall is extremely low
- Model is practically useless
Precision looks ONLY at predicted YES cases.
It completely ignores missed YES cases.
That’s why:
- Predicting very few YES → precision goes up
- But usefulness goes down
Precision answers:
“When the model says YES, is it correct?”
Precision is a classification metric that measures the correctness of positive predictions. It is defined as the ratio of true positives to the total number of predicted positives. Precision is important when false positive errors are costly.
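As a minimal sketch of the "lazy" spam filter above (assuming scikit-learn is available; 1 = spam, 0 = legitimate), perfect precision next to near-zero recall can be checked directly:

```python
from sklearn.metrics import precision_score, recall_score

# Reality: 50 spam (1), 50 legitimate (0) emails
y_true = [1] * 50 + [0] * 50
# Lazy model: flags only the first email (which really is spam), calls the rest legitimate
y_pred = [1] + [0] * 99

print(f"Precision: {precision_score(y_true, y_pred):.0%}")  # 100%
print(f"Recall:    {recall_score(y_true, y_pred):.0%}")     # 2%
```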
Final Takeaway
- High precision → Few false alarms
- Precision focuses on quality, not coverage
- Precision alone can be misleading
- Must be used together with Recall
Precision tells us how reliable YES predictions are.
Now let us see Recall, which tells us how many real YES cases were found.
What Is Recall?
Recall tells us how many actual YES cases the model was able to detect.
More clearly:
Out of all the cases that were actually positive,
how many did the model correctly identify?
Recall focuses on:
- Coverage of positives
- Missed cases
Recall Formula (With Meaning)
Recall = TP / (TP + FN)
Where:
- TP (True Positive) = Actual YES, predicted YES
- FN (False Negative) = Actual YES, predicted NO (missed case)
So:
Recall = detected positive cases / all actual positive cases
Numerical Example 1: Disease Detection (Classic Example)
Situation (Reality)
Suppose:
- 100 patients are actually sick
Model Prediction
- Model correctly identifies 85 sick patients
- Model misses 15 sick patients
So:
- TP = 85
- FN = 15
Recall Calculation
Recall = TP / (TP + FN) = 85 / (85 + 15) = 85%
Interpretation (Very Important)
The model detects 85% of the sick patients
but misses 15%, who may not receive treatment.
In medical problems:
Missing a sick patient can be dangerous or fatal.
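The same computation as a small sketch in plain Python, with the counts from this example:

```python
tp, fn = 85, 15  # sick patients found vs. sick patients missed

recall = tp / (tp + fn)
print(f"Recall: {recall:.0%}")  # 85%
```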
Numerical Example 2: Spam Filter (Easy to Visualize)
Reality
Out of 100 emails:
- 40 are actually spam
- 60 are legitimate
Model Prediction
- Correctly detects 30 spam emails
- Misses 10 spam emails (go to inbox)
So:
- TP = 30
- FN = 10
Recall Calculation
Recall = TP / (TP + FN) = 30 / (30 + 10) = 75%
Meaning
The spam filter catches 75% of spam,
but 25% of spam still reaches the inbox.
Low recall here means:
- Users still see a lot of spam
- Spam filter is weak
Where Recall Matters the MOST (Critical Section)
Recall is crucial when missing a positive case is very dangerous.
Examples:
- Medical diagnosis: Missing disease = patient suffers
- Cancer screening: Missing cancer = life-threatening
- Fraud detection: Missing fraud = financial loss
- Security systems: Missing threats = safety risk
In such systems:
Better to raise some false alarms
than to miss real danger.
Problem With Only High Recall (Very Important)
Now comes the most important learning.
Extreme Case: 100% Recall but Still a Bad Model
Imagine a medical test that:
- Predicts EVERYONE as sick
Reality:
- 100 patients are sick
- 9,900 are healthy
Model prediction:
- All 10,000 predicted sick
So:
- TP = 100
- FN = 0
Recall
Recall = TP / (TP + FN) = 100 / (100 + 0) = 100%
Perfect recall!
But what happened?
- 9,900 healthy people got false alarms
- Hospitals overflow
- Panic everywhere
So:
High recall alone does NOT mean a good model.
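A short sketch of the "everyone is sick" model (assuming scikit-learn; 1 = sick, 0 = healthy) makes the trade-off visible: recall is perfect while precision collapses:

```python
from sklearn.metrics import precision_score, recall_score

# Reality: 100 sick (1), 9,900 healthy (0)
y_true = [1] * 100 + [0] * 9900
# Model predicts everyone as sick
y_pred = [1] * 10000

print(f"Recall:    {recall_score(y_true, y_pred):.0%}")     # 100%
print(f"Precision: {precision_score(y_true, y_pred):.0%}")  # 1%
```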
Why Recall Alone Is Not Enough
Recall does NOT tell us:
- How many false alarms occurred
- Whether YES predictions are trustworthy
So a model can:
- Catch all positives
- But still be unusable
That’s why recall must be balanced with Precision.
Recall answers:
“Out of all actual YES cases, how many did we find?”
Recall is a classification metric that measures the ability of a model to identify all actual positive cases. It is defined as the ratio of true positives to the total number of actual positives. Recall is important when missing a positive case is costly or dangerous.
Final Takeaway
- High recall → Few missed cases
- Recall focuses on coverage, not accuracy of YES
- High recall alone can cause many false alarms
- Recall must be used with Precision
Precision tells us how reliable YES predictions are.
Recall tells us how many real YES cases were found.
Now let us combine both using F1-Score.
F1-Score – Balancing Precision and Recall
Why Do We Need F1-Score?
So far, we learned two important metrics:
- Precision → When the model says YES, is it correct?
- Recall → Did the model find all the YES cases?
Now, the questions are:
- What if a model has high precision but low recall?
- What if a model has high recall but low precision?
So we need ONE metric that:
- Considers both precision and recall
- Penalizes models that do well in only one
- Rewards balanced performance
This is why F1-Score exists.
What Is F1-Score? (In Simple Words)
F1-Score is a single number that tells how well a model balances precision and recall.
In other words:
A model gets a high F1-Score only if BOTH precision and recall are high.
If either one is low:
- F1-Score becomes low
F1-Score Formula
F1 = 2 × (Precision × Recall) / (Precision + Recall)
This is called the harmonic mean of precision and recall.
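As a minimal sketch, the formula translates directly into a small helper function (plain Python; the sample call previews the balanced example below):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f"{f1(0.90, 0.80):.1%}")  # ~84.7%
```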
Numerical Example 1: Balanced Model (Good Case)
Model Performance
- Precision = 90%
- Recall = 80%
F1-Score Calculation
F1 = 2 × (0.90 × 0.80) / (0.90 + 0.80) = 1.44 / 1.70 ≈ 0.847 ≈ 84.7%
Interpretation
The model is reliable (high precision)
AND
The model covers most positives (good recall)
So the F1-Score is high.
Numerical Example 2: High Precision, Low Recall (Bad Case)
Model Performance
- Precision = 95%
- Recall = 10%
This happens when:
- Model predicts YES very rarely
- Almost never gives false alarms
- But misses most real cases
Simple Average (MISLEADING)
Average = (95% + 10%) / 2 = 52.5%
Looks okay… but it’s NOT.
F1-Score
F1 = 2 × (0.95 × 0.10) / (0.95 + 0.10) = 0.19 / 1.05 ≈ 0.181 ≈ 18.1%
Interpretation
Even though precision is very high,
the model is terrible at finding real cases.
F1-Score exposes this clearly.
Numerical Example 3: High Recall, Low Precision (Also Bad)
Model Performance
- Precision = 20%
- Recall = 95%
This happens when:
- Model predicts almost everything as YES
- Catches nearly all positives
- But raises many false alarms
F1-Score
F1 = 2 × (0.20 × 0.95) / (0.20 + 0.95) = 0.38 / 1.15 ≈ 0.330 ≈ 33%
Interpretation
Catching everything is useless
if most predictions are wrong.
Again, F1-Score shows the truth.
Think of F1-Score like a balance scale:
- Precision on one side
- Recall on the other side
If one side is very low → balance breaks → low F1
Why Harmonic Mean Is Used (Very Important)
Simple Explanation
- Arithmetic mean allows cheating
- Harmonic mean does not
Example:
| Precision | Recall | Average | F1 |
| --- | --- | --- | --- |
| 95% | 10% | 52.5% | 18% |
| 90% | 90% | 90% | 90% |
F1-Score punishes imbalance and rewards balance.
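A quick sketch in plain Python that reproduces the two table rows, comparing the arithmetic average with the harmonic mean (F1):

```python
def harmonic_mean(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) else 0.0

for p, r in [(0.95, 0.10), (0.90, 0.90)]:
    average = (p + r) / 2
    print(f"P={p:.0%}, R={r:.0%} -> average={average:.1%}, F1={harmonic_mean(p, r):.1%}")
# P=95%, R=10% -> average=52.5%, F1=18.1%
# P=90%, R=90% -> average=90.0%, F1=90.0%
```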
Where F1-Score Is MOST Useful
F1-Score is ideal when:
- Dataset is imbalanced
- Both false positives and false negatives matter
- You need one number to compare models
Common Uses:
- NLP (spam detection, hate speech)
- Medical screening (general assessment)
- Fraud detection
- ML competitions
Where F1-Score Is NOT Ideal
F1-Score may not be the best choice when:
- Only precision matters (e.g., loan approval)
- Only recall matters (e.g., cancer screening)
In such cases:
Focus on the more important metric directly
F1-Score answers:
“Is the model good at both being correct and not missing important cases?”
F1-Score is a classification metric that combines precision and recall using their harmonic mean. It provides a balanced evaluation of a model by penalizing extreme values of precision or recall. It is especially useful for imbalanced datasets.
Final Takeaway
- Precision alone can be misleading
- Recall alone can be misleading
- F1-Score balances both
- High F1-Score = truly good classification model
Metrics Comparison Table
| Metric | Formula | What It Measures | Best Used When | Not Suitable When |
| --- | --- | --- | --- | --- |
| Accuracy | (TP + TN) / All | Overall correctness | Balanced data | Imbalanced data |
| Precision | TP / (TP + FP) | Quality of YES predictions | False positives are costly | Missing positives is costly |
| Recall | TP / (TP + FN) | Coverage of actual positives | False negatives are dangerous | False alarms are costly |
| F1-Score | 2PR / (P + R) | Balance of precision & recall | Both FP & FN matter | One metric is clearly more important |
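To tie the table together, here is a minimal end-to-end sketch (assuming scikit-learn; the labels are made up for illustration) computing all four metrics on the same set of predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.80
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.75
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")         # 0.75
```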
Summary
Accuracy – “How many predictions were correct?”
What Accuracy Tells Us
Accuracy measures:
Overall correctness of the model
It counts:
- Correct YES predictions
- Correct NO predictions
and ignores:
- Which class was more important
When Accuracy Is Good
Use accuracy when:
- Dataset is balanced
- Both types of errors are equally serious
Example:
- Student pass/fail prediction
- Balanced exam datasets
When Accuracy Fails
Accuracy is misleading when:
- One class dominates (95% NO, 5% YES)
- Errors have different costs
Key lesson:
High accuracy does NOT guarantee a good model.
Precision – “Can I trust YES predictions?”
What Precision Measures
Precision focuses on:
How correct the model’s YES predictions are
It answers:
“When the model says YES, is it right?”
When Precision Is Most Important
Use precision when:
- False positives are costly or harmful
Examples:
- Spam filters (don’t block real emails)
- Loan approval (don’t approve bad loans)
- Product recommendations (avoid irrelevant items)
When Precision Is Not Enough
Precision ignores:
- Missed positive cases
So:
- High precision does not mean high usefulness
Key lesson:
Precision = confidence in YES predictions.
Recall – “Did we find all the important cases?”
What Recall Measures
Recall focuses on:
How many actual YES cases the model found
It answers:
“Out of all real positives, how many did we catch?”
When Recall Is Critical
Use recall when:
- Missing a case is dangerous
Examples:
- Medical diagnosis
- Cancer screening
- Fraud detection
- Security systems
When Recall Alone Is Dangerous
Recall ignores:
- How many false alarms occurred
So:
- High recall can create panic and waste
Key lesson:
Recall = coverage of important cases.
F1-Score – “Is the model balanced?”
What F1-Score Measures
F1-Score combines:
- Precision (quality)
- Recall (coverage)
It gives:
One balanced score
When F1-Score Is Best
Use F1-Score when:
- Dataset is imbalanced
- Both FP and FN matter
- You need a single comparison metric
Examples:
- Spam detection
- NLP tasks
- ML competitions
When F1-Score Is Not Ideal
Avoid F1-Score when:
- One metric is clearly more important
(e.g., only recall matters in cancer screening)
Key lesson:
F1-Score punishes imbalance and rewards balance.
Applications Summary
| Application | Biggest Risk | Best Metric |
| --- | --- | --- |
| Cancer detection | Missing disease | Recall |
| Spam filter | Blocking real emails | Precision |
| Loan approval | Approving risky customer | Precision |
| Fraud detection | Missing fraud | Recall |
| Recommendation system | Poor quality suggestions | Precision |
| General ML evaluation | Balanced performance | F1-Score |
