## Prologue
A review of the most essential ranking evaluation metrics: the Receiver-Operating-Characteristic (ROC), Discounted Cumulative Gain (DCG), and Average Precision (AP).
This summary also presents my own evaluation metric, **F-measured Average Precision (FAP)**, for the ranked retrieval of *sets*. Use FAP when finding the right cutoff is more critical than a hit in the first position.
If you need a robust information-theoretic metric, use ROC. If concentrating hits in the top ranks is paramount, for example in information retrieval, prefer DCG. Finally, AP can provide an interpretable score to accompany DCG.
### Nomenclature
- $Q$ : the set of all **queries** being evaluated
- $f$ : the predictor/classifier **function** to rank results
- $I$ : the set of all **instances** for a query, separated into relevant cases $I_{pos}$ and irrelevant cases $I_{neg}$
- $R$ : the **result** subset of all instances that the predictor selected, i.e., $R \subset I$ and $r \in R$.
- $rel_i$ : the **relevance** of the i-th ranked result (either binary or real-valued and based on some external relevance grading approach)
- See [Discrete evaluation measures](Discrete%20evaluation%20measures.md) for the more basic variables
## [Receiver-Operating-Characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (ROC)
$\int_{x=0}^1 TPR(FPR^{-1}(x)) \, dx$
which is the Area Under the ROC Curve, with $TPR$ as the True Positive Rate, and $FPR$ as the False Positive Rate.

([Source](https://en.wikipedia.org/wiki/Receiver_operating_characteristic))
The AUC ROC represents the probability that a classifier ranks a randomly selected positive instance $p$ above a randomly selected negative one $n$. It is calculated over a set of discrete instances $I$ in the following form:
$\frac{\sum_{n \in I_{neg}} \sum_{p \in I_{pos}} \mathbb{1}[f(n) < f(p)]}{|I_{neg}| \times |I_{pos}|}$
which is the fraction of all possible pairings of negative and positive instances in which the classifier (a.k.a. predictor) function $f$ correctly prefers the positive example over the negative one.
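As a quick sanity check, this pairwise definition can be computed directly and compared against scikit-learn's implementation (a minimal sketch with made-up labels and scores; ties, which scikit-learn counts as half-correct pairs, are ignored here):
```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and classifier scores:
y_true = np.array([0, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.8, 0.25, 0.2, 0.9, 0.3])

pos = y_scores[y_true == 1]  # f(p) for all p in I_pos
neg = y_scores[y_true == 0]  # f(n) for all n in I_neg

# Fraction of (negative, positive) pairs the classifier orders correctly:
pairwise_auc = np.mean(pos[None, :] > neg[:, None])

assert np.isclose(pairwise_auc, roc_auc_score(y_true, y_scores))  # 8/9 here
```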
You can understand the ROC curve as performing a [statistical significance test](Power%20and%20Significance.md) for every ranked item by assigning positive items to the $H_1$ distribution and negative ones to $H_0$, with the current item effectively marking the position of the critical value, [as described for example by Alaa Tharwat](https://www.researchgate.net/figure/A-visualization-of-how-changing-the-threshold-changes-the-TP-TN-FP-and-FN-values_fig5_327148996).
You can calculate a *weighted* ROC by trapezoidal approximation of the ordered results:
$\sum_{i=1}^{|I|-1} w_i \frac{(FPR_{i+1} - FPR_i) \times (TPR_i + TPR_{i+1})}{2}$
where $TPR_i$ is the True Positive Rate at result $i$, and $FPR_i$ the corresponding False Positive Rate.
Because ROC is **scale-invariant**, a weight $w_i$ for every rank $i$ can be used to give the top ranks more importance. A common practice in information retrieval settings is to use exponentially decaying weights:
```python
from sklearn.metrics import roc_auc_score
import numpy as np
# Example test data:
y_true = np.array([0, 1, 1, 0, 1, 0])
# Hypothetical relevance scores:
y_scores = np.array([0.1, 0.8, 0.4, 0.2, 0.9, 0.3])
# Generate exponentially decaying sample weights by rank:
decay_factor = 0.8  # Adjust based on your preferences;
# lower factors lead to faster decay,
# making the top results even more important.
# Rank of each sample when ordered by descending score
# (the highest-scoring sample gets rank 0):
ranks = np.argsort(np.argsort(-y_scores))
# NB: sample_weights must stay aligned with y_true and y_scores, so the
# decay is assigned via each sample's rank, not its position in the array.
# Example weights with a factor of 0.5: [1., .5, .25, .125, ...] by rank.
sample_weights = decay_factor ** ranks
# Calculate the sample-weighted AUC ROC:
auc_roc = roc_auc_score(y_true, y_scores, sample_weight=sample_weights)
```
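With $w_i = 1$, the trapezoidal formula above can also be reproduced directly from the ROC curve points; `sklearn.metrics.auc` applies exactly this rule (a short sketch on the same toy data):
```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.8, 0.4, 0.2, 0.9, 0.3])

fpr, tpr, _ = roc_curve(y_true, y_scores)
# sklearn.metrics.auc sums (FPR_{i+1} - FPR_i) * (TPR_i + TPR_{i+1}) / 2:
unweighted_auc = auc(fpr, tpr)
assert np.isclose(unweighted_auc, roc_auc_score(y_true, y_scores))
```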
Because ROC considers all cells of the confusion matrix (including TN), it is **insensitive to changes in class distribution**. This makes ROC the most *robust* ranked evaluation metric among all options on this page, and an excellent pick if the class distributions of the individual result sets have high variance.
ROC curves are **invariant to the classification threshold**, meaning the amount of spread between the positive and negative instance distributions is not evaluated. That makes it hard to tune this metric to prefer, say, precision over recall (which is something the $\beta$ parameter of the F-score can do easily).
[ROC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) should be your **preferred ranking evaluation metric**. The *limitation* is that you need to be aware of the full set of negative instances $I_{neg}$ (and, obviously, $I_{pos}$, all relevant/positive ones, too). If you do not know or have access to the $I_{neg}$ set, use another evaluation strategy, such as AP.
## [Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (DCG$_k$)
$\sum_{i=1}^k \frac{2^{rel_i} - 1}{\log_2(i+1)}$
where $k$ is the max. number of ranks to take into consideration. $k$ can be some fixed number, such as $k=10$, the number of all instances $k = |I|$, or even the size of the result set $k = |R|$.
$rel_i$ is the [graded relevance](https://en.wikipedia.org/wiki/Relevance_(information_retrieval)#Evaluation) of the result at position $i$. If all relevant results are equally important, $rel_i$ can be binary: $1$ for all positive results, and therefore $2^{rel_i} - 1 = 1$ for positive cases, and $0$ for negative cases.
The denominator is $1$ for the first result, $1.585$ for the second, $2$ for the third, and so on. Therefore, DCG has a logarithmic rank discount built in.
This metric is frequently used for evaluating **search engine results** because it allows for granular relevance scoring of hits. In that domain, a common relevance grading mechanism is the click-through rate. $k$ often will be the number of results shown on the first page of search results.
DCG (and Normalized DCG below) does not penalize predictors for low-ranking negative results (At least if $rel_i = 0$ for those negative cases, which typically is the case.) For example, if two predictors return ranked result sets with relevance scores 1,1,1 and 1,1,1,0,0 respectively, both would be considered equal even though the latter contains two negative results (FAP - discussed later - would account for that case).
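As a minimal illustration of the formula and of the point above, with made-up binary relevance grades; note that `sklearn.metrics.dcg_score` uses linear gains ($rel_i$ rather than $2^{rel_i} - 1$), which coincides with the formula for binary relevance:
```python
import numpy as np
from sklearn.metrics import dcg_score

def dcg(relevances, k=None):
    """DCG_k: sum of (2^rel_i - 1) / log2(i + 1) over the top-k ranks."""
    rel = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, len(rel) + 1)
    return np.sum((2 ** rel - 1) / np.log2(ranks + 1))

# Trailing irrelevant results (rel_i = 0) do not change the score:
print(dcg([1, 1, 1]))        # ~2.13
print(dcg([1, 1, 1, 0, 0]))  # same ~2.13

# The same value via sklearn, ranking by hypothetical scores:
relevance = np.array([[1, 1, 1, 0, 0]])
ranking_scores = np.array([[0.9, 0.8, 0.7, 0.2, 0.1]])
print(dcg_score(relevance, ranking_scores))  # ~2.13 as well
```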
You should use [DCG](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.dcg_score.html) if $I$ cannot be enumerated and you can identify a meaningful $k$. If you have trouble finding a $k$, look at nDCG (next).
#### [Normalized DCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG) (nDCG)
To compare queries with different $k$ or differently sized instance sets, a DCG score can be _normalized_ by the **ideal** DCG score, $iDCG$.
$\frac{DCG_k}{iDCG_k}$
where $iDCG$ recalculates the DCG after reordering the results by decreasing relevance. Therefore, if the results already happen to be in that ideal order, $nDCG = 1$.
Normalized DCG with a $k$ set to the size of each result set ($k = |R|$) does not penalize for missing documents in the result. For example, if two systems return result sets of different sizes with scores 1,1,1 and 1,1,1,1,1, both would be considered equally good due to the normalization.
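A minimal sketch of the normalization for binary relevance (the `dcg` helper mirrors the formula above with $k = |R|$), which also reproduces the behavior just described:
```python
import numpy as np

def dcg(relevances):
    rel = np.asarray(relevances, dtype=float)
    return np.sum((2 ** rel - 1) / np.log2(np.arange(2, len(rel) + 2)))

def ndcg(relevances):
    """nDCG with k = |R|: DCG divided by the ideal DCG of the same result set."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Both all-relevant result sets score a perfect 1.0, regardless of their size:
print(ndcg([1, 1, 1]))        # 1.0
print(ndcg([1, 1, 1, 1, 1]))  # 1.0
# A relevant result ranked too low does lower the score:
print(ndcg([1, 0, 1, 1]))     # ~0.91
```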
[nDCG](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html) is best used with $k = |I|$ when you cannot pick a meaningful $k$ and you have fine-grained relevance ranking for your results:
- If there is a clear, fixed $k$ to pick, you can just use DCG$_k$.
- And if you don’t have graded relevance results, it might be better to use ROC or AUC PR.
## [(Mean) Reciprocal Rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) ([M]RR)
$\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$
where $rank_i$ is the rank of the first relevant item for the $i$-th query in $Q$.
Reciprocal rank is the right metric to pick if only the **first hit** (i.e., the first relevant result) in the prediction result sets matters. Think of it as the tiny sibling of DCG.
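For example, a minimal MRR computation over hypothetical per-query ranks of the first relevant hit:
```python
import numpy as np

# Rank of the first relevant result for each of four hypothetical queries
# (np.inf if a query returned no relevant result at all):
first_relevant_ranks = np.array([1, 3, 2, np.inf])

# Reciprocal rank per query (1/inf = 0 for queries without any hit):
reciprocal_ranks = 1.0 / first_relevant_ranks
mrr = reciprocal_ranks.mean()
print(mrr)  # (1 + 1/3 + 1/2 + 0) / 4 ~ 0.46
```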
## [Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (AP)
The aggregated Average Precision over multiple queries $Q$, known as Mean Average Precision (MAP), is
$\frac{\sum_{i=1}^{|Q|}AP(R_i)}{|Q|}$
where $R_i$ is the $i$th ranked result set from a set of queries $Q$. With that, the [Average Precision](https://en.wikipedia.org/wiki/Average_precision) $AP$ of such a result set $R$ is defined as:
$\frac{\sum_{i=1}^{|R|} w_i p_i \times \mathbb{1}[r_i \in I_{pos}]}{|I_{pos}|}$
where $p_i$ is the precision of the result at rank $i$ among all $|R|$ ranked results, and $\mathbb{1}$ is an indicator function that evaluates to $1$ if the result $r_i$ is in the set of relevant instances $I_{pos}$ (and to $0$ otherwise).
If you want to assign the top-ranked results more importance, you can set a weight $w_i$ for every rank $i$ (the default is $w_i = 1$). Just as discussed for ROC, you might want to use exponential decay for information retrieval tasks.
Unlike DCG, AP has a probabilistic interpretation and ranges from $0$ to $1$, unless you are using weights. A baseline classifier that cannot separate positives from negatives has an average precision equal to the fraction of positives among all instances, $\frac{|I_{pos}|}{|I|}$, while an ideal classifier has an average precision of $1$.
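A minimal sketch with made-up data that computes AP by the definition above (with $w_i = 1$) and checks it against `sklearn.metrics.average_precision_score`, including the uninformative baseline:
```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 1, 1, 0, 1, 0])
y_scores = np.array([0.5, 0.8, 0.4, 0.2, 0.9, 0.3])

# Precision at each rank, summed over the ranks that hold a hit,
# divided by |I_pos| (all instances are ranked here, so |I_pos| = 3):
order = np.argsort(-y_scores)                              # best score first
hits = y_true[order]                                       # 1 iff r_i in I_pos
precision_at_i = np.cumsum(hits) / np.arange(1, len(hits) + 1)
ap = np.sum(precision_at_i * hits) / y_true.sum()

assert np.isclose(ap, average_precision_score(y_true, y_scores))

# A constant-score (uninformative) classifier scores at the positive prevalence:
baseline = average_precision_score(y_true, np.zeros_like(y_scores))
print(ap, baseline)  # ~0.92 and |I_pos| / |I| = 3/6 = 0.5
```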
Like DCG, AP does not penalize for low-ranked negative results. For example, if two predictors return results with scores 1,1,1 and 1,1,1,0,0, both would have the same AP (FAP - explained later - would be able to differentiate between these two results.)
Unlike nDCG, AP does not suffer from different result set sizes - 1,1,1 and 1,1,1,1,1 would lead to different scores. (At least, unless you use AP@k with $k = 3$ here.) The *limitation* is that it must be possible to determine $|I_{pos}|$. And, unlike DCG, AP does not account for the relevance grading of results. (Graded Average Precision, GAP, [has been suggested](https://dl.acm.org/doi/abs/10.1145/1835449.1835550), but I know of no implementation in common ML libraries. The authors claim GAP performs better than [n]DCG when learning to rank from graded results.)
With the `sample_weight` parameter in [`sklearn.average_precision_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score), you can give every rank in the result a specific weight and put more emphasis on the top ranks. However, as you lose the probabilistic interpretability benefit with weighting, you might use DCG instead.
#### Area Under the interpolated Precision/Recall Curve (AUC iPR)
Plotting the Average Precision over the whole ranked result set helps understand this metric better. [Plotting iPR curves](https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html) allows you to visually compare two results with similar scores and possibly identify a preference.

An **interpolated** Precision/Recall (iPR) curve. Note the stepwise recovery, which is due to interpolation. ([Source](https://pub.towardsai.net/precision-recall-curve-26f9e7984add))
$AP$ *approximates* the area under the Precision/Recall curve (AUC PR) by using *interpolation*:

Average Precision represents the area under the **interpolated** PR curve (left), not the actual PR curve (right). The turquoise area under the iPR curve not covered by the pink area under the PR curve is the *overestimation bias*. The two curves represent a ranked protein normalization result on Article 12 of BioCreative II.5 (Orange dots: positive instances; grey dots: negative instances; light green area: hits; red area: errors; orange area: misses) (Source: My defense.)
However, unlike AP (i.e., the area under the interpolated PR curve), the area under the raw PR curve (AUC PR) has no probabilistic interpretability. Therefore, AP and DCG are sufficient measures to complement ROC.
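For reference, both areas can be computed side by side (a sketch with made-up data; note that scikit-learn's AP uses a step-wise summation rather than the TREC-style interpolated precision shown in the figure, so the gap between the two numbers only roughly illustrates the bias):
```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_scores = np.array([0.15, 0.8, 0.45, 0.2, 0.9, 0.35, 0.6, 0.3])

# Average Precision (a step-wise approximation of the area):
ap = average_precision_score(y_true, y_scores)

# Trapezoidal area under the raw precision/recall curve:
precision, recall, _ = precision_recall_curve(y_true, y_scores)
auc_pr = auc(recall, precision)

print(ap, auc_pr)  # the two estimates of the area generally differ
```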
It is also worth contrasting ROC and PR curves side-by-side to understand their differences:

As is easy to spot by comparing the DNN and XGBoost results, PR curves are much better at highlighting any issues with the top ranks. ([Source](https://www.researchgate.net/publication/358509219_Diabetes_mellitus_risk_prediction_in_the_presence_of_class_imbalance_using_flexible_machine_learning_methods))
## [F-measured Average Precision](https://repositorio.uam.es/bitstream/handle/10486/8883/47142_leitner_florian.pdf?sequence=1) (FAP)
In subsection 3.2.5 of [my PhD thesis](https://repositorio.uam.es/bitstream/handle/10486/8883/47142_leitner_florian.pdf?sequence=1), I propose the F-measured Average Precision metric, which is a combination of the [$F_\beta$-Score](Discrete%20evaluation%20measures.md) and the Average Precision (defined above):
$(1 + \beta^2) \frac{F_\beta \times AP}{\beta^2 F_\beta + AP}$
For the balanced F-measure ($\beta = 1$) this simplifies to:
$\frac{2F_1 \times AP}{F_1 + AP}$
Due to the F-measure term, the FAP score dynamically penalizes long result-list tails without relevant hits, while the AP term still takes the ranking performance into account. I.e., it "encourages" the predictor $f$ to pick a good cutoff when there are many negative instances.
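To make the definition concrete, here is a minimal sketch of FAP for binary relevance with $w_i = 1$, assuming the $\beta^2$-weighted harmonic-mean form given above (my own illustrative re-implementation, not the code from the thesis):
```python
import numpy as np

def fap_score(hits, n_pos_total, beta=1.0):
    """Sketch of FAP for a binary ranked result list (1 = relevant hit),
    given n_pos_total = |I_pos|, the total number of relevant instances."""
    hits = np.asarray(hits, dtype=float)
    ranks = np.arange(1, len(hits) + 1)
    # Average Precision of the result list (denominator |I_pos|, w_i = 1):
    ap = np.sum(np.cumsum(hits) / ranks * hits) / n_pos_total
    # Set-level precision, recall, and F_beta of the returned results:
    p, r = hits.sum() / len(hits), hits.sum() / n_pos_total
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0
    if f + ap == 0:
        return 0.0
    # The beta-weighted harmonic mean of F_beta and AP:
    return (1 + beta**2) * f * ap / (beta**2 * f + ap)

# FAP separates result lists that AP alone scores identically
# (assuming |I_pos| = 4 relevant instances exist in total):
print(fap_score([1, 1, 1], n_pos_total=4))        # ~0.80
print(fap_score([1, 1, 1, 0, 0], n_pos_total=4))  # ~0.71
```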

A comparison of F-Score, AP, and FAP via precision and recall (Source: My defense slide deck; unpublished.)
This makes FAP uniquely helpful in evaluating the ranked retrieval of **sets**. As shown in my thesis, combining the two metrics with the harmonic mean (as opposed to other aggregation strategies) lets you find a clearer cutoff point at peak performance if you evaluate FAP at each rank (FAP@k) of a result set (see the sketch below).
Unlike AP or DCG, FAP takes the overall quality of the result set into account, making long tails of irrelevant results unattractive (1,1,1 vs. 1,1,1,0,0). This makes FAP an ideal measure to **use in [bagging](https://en.wikipedia.org/wiki/Bootstrap_aggregating) or other ensemble learning setups** (as was the need in my thesis, and hence the reason for creating this metric).
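Reusing the `fap_score` sketch from above, FAP@k can be evaluated at every rank to locate that cutoff (again with a made-up result list):
```python
# A hypothetical ranked result list with |I_pos| = 3 relevant instances overall:
ranked_hits = [1, 1, 0, 1, 0, 0, 0]
fap_at_k = [fap_score(ranked_hits[:k], n_pos_total=3)
            for k in range(1, len(ranked_hits) + 1)]
best_k = int(np.argmax(fap_at_k)) + 1
print(best_k)  # 4 - the peak sits right after the last relevant result
```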
## Pearson & Spearman
Two more "traditional" ranking evaluation metrics are [Pearson's](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and, in particular, [Spearman's rank](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) [correlation coefficient](Measuring%20Correlations.md)