Why do we use harmonic mean in F1 scores?

The harmonic mean is used in F1 scores to effectively balance precision and recall, ensuring that a model performs well on both metrics rather than excelling at one at the expense of the other.

Understanding the F1 Score

The F1 score is a widely used metric for evaluating classification models. It combines two fundamental metrics:

Precision: Measures the accuracy of positive predictions. It answers: "Of all items identified as positive, how many are actually positive?"
Recall: Measures the ability of a model to find all the positive samples. It answers: "Of all actual positive items, how many were correctly identified?"

Both precision and recall are ratios between 0 and 1. Each matters on its own, but a model can have high precision and low recall (e.g., it rarely predicts positive, but when it does, it is usually correct), or vice versa. The F1 score addresses this by providing a single metric that is high only when both are high.
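As a minimal sketch of these two definitions, assuming the counts of true positives, false positives, and false negatives are already known (the function name and the example counts are hypothetical, chosen only for illustration):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts.

    Assumption of this sketch: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical cautious model: 9 of its 10 positive predictions are correct,
# but it finds only 9 of the 100 actual positives.
precision, recall = precision_recall(tp=9, fp=1, fn=91)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

This is the "high precision, low recall" case described above: the model is right when it predicts positive but misses most of the positives.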

The Role of the Harmonic Mean

The harmonic mean is chosen for the F1 score for several key reasons:

Penalizes Imbalance Significantly: Unlike the arithmetic mean, the harmonic mean is dominated by the smaller of its inputs. It averages the reciprocals of precision and recall, and a small value has a large reciprocal. As a result, the F1 score can never exceed twice the lower of the two metrics, so a low value in either precision or recall pulls the overall score down sharply and rewards balance between the two.
Discourages Lopsided Performance: A model that does very well on one metric (e.g., 100% precision) but poorly on the other (e.g., 10% recall) receives a low F1 score (about 0.18 in this case). Raising the score therefore requires improving the weaker metric.
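Stricter Than the Geometric Mean: The geometric mean also penalizes imbalance, and both it and the F1 score are zero when either precision or recall is zero. For any pair of unequal nonzero values, however, the harmonic mean is lower than the geometric mean, which in turn is lower than the arithmetic mean, so the harmonic mean applies the strongest penalty of the three. With precision 1.0 and recall 0.1, for example, the geometric mean is about 0.32 while the harmonic mean is about 0.18.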

The formula for the F1 score clearly illustrates this:

$$ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

This formula is equivalent to the reciprocal of the arithmetic mean of the reciprocals of precision and recall. Because a small value has a large reciprocal, the lower of the two metrics dominates the result.
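To make the equivalence explicit, write $P$ for precision and $R$ for recall (shorthand introduced here, not part of the formula above). Expanding the reciprocal of the mean of the reciprocals gives:

$$ F1 = \left( \frac{1}{2} \left( \frac{1}{P} + \frac{1}{R} \right) \right)^{-1} = \left( \frac{P + R}{2 \cdot P \cdot R} \right)^{-1} = 2 \cdot \frac{P \cdot R}{P + R} $$

If $TP$, $FP$, and $FN$ denote the counts of true positives, false positives, and false negatives (again, labels assumed here), then $P = TP / (TP + FP)$ and $R = TP / (TP + FN)$, and the same expression reduces to:

$$ F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} $$

True negatives do not appear anywhere in this form.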

Comparison with Other Means

To illustrate the effect, consider two hypothetical models:

Model	Precision	Recall	Arithmetic Mean	Harmonic Mean (F1)
A	1.0	0.1	0.55	0.18
B	0.5	0.5	0.5	0.5

In this example, Model A has perfect precision but very poor recall. Model B has a balanced, moderate performance for both metrics.

The arithmetic mean suggests Model A is slightly better (0.55 vs. 0.5), which is misleading given Model A's severe deficiency in recall.
The harmonic mean (F1 score) correctly highlights that Model B, with its balanced performance, is superior when both precision and recall are important considerations. The drastic drop in F1 for Model A reflects the penalty for imbalance.
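The numbers in the table can be reproduced with a short sketch. The function names are illustrative, and the geometric mean is included only as an extra point of comparison:

```python
from math import sqrt


def arithmetic_mean(p: float, r: float) -> float:
    return (p + r) / 2


def geometric_mean(p: float, r: float) -> float:
    return sqrt(p * r)


def harmonic_mean(p: float, r: float) -> float:
    # Same expression as the F1 formula; this sketch returns 0.0 when both inputs are 0.
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Precision and recall values for the two hypothetical models in the table.
models = {"A": (1.0, 0.1), "B": (0.5, 0.5)}

for name, (p, r) in models.items():
    print(
        f"Model {name}: arithmetic={arithmetic_mean(p, r):.2f}, "
        f"geometric={geometric_mean(p, r):.2f}, "
        f"harmonic (F1)={harmonic_mean(p, r):.2f}"
    )
```

For Model B all three means equal 0.5, since the inputs are identical. For Model A the three diverge, with the harmonic mean the lowest.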

Practical Implications

The use of the harmonic mean in the F1 score is particularly beneficial in several real-world scenarios:

Imbalanced Datasets: When one class vastly outnumbers the other (e.g., rare-disease or fraud detection), a model can achieve high overall accuracy by simply predicting the majority class while missing nearly all instances of the minority class. The F1 score is computed only from true positives, false positives, and false negatives, so it is not inflated by the many correctly predicted majority-class examples. This makes it a more reliable metric than accuracy when the minority class is the positive class of interest.
Optimizing for Balanced Performance: If the costs of false positives and false negatives are roughly equal, or both must be kept low, the F1 score provides a single metric to guide model optimization. For instance, in medical diagnostics, both misdiagnosing a healthy person (false positive) and missing a disease (false negative) can have severe consequences.
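The imbalanced-dataset case can be sketched with made-up labels. The data, the 1%-positive split, and the function name below are all hypothetical, and the code treats undefined ratios as 0.0 by assumption:

```python
def accuracy_and_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Return (accuracy, F1) for binary labels, where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    # Assumption of this sketch: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return accuracy, f1


# Hypothetical dataset: 10 positives among 1000 samples.
y_true = [1] * 10 + [0] * 990
# A trivial model that always predicts the majority (negative) class.
y_pred = [0] * 1000

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

The always-negative model scores 0.99 on accuracy yet 0.0 on F1, because it never identifies a single positive instance.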

In essence, the harmonic mean ensures that a high F1 score is achievable only when a model is both accurate in its positive predictions and able to identify most of the relevant instances, which is why it is so widely adopted in machine learning.