AUC versus AUROC #
AUC stands for Area Under the Curve, but which curve? In practice, when people say “AUC” in the context of binary classification, they almost always mean the AUROC (Area Under the Receiver Operating Characteristic curve). Using “AUC” alone is technically ambiguous; “AUROC” is the precise term. This distinction matters because there are other curves (e.g. precision-recall curves) with their own AUC values.
There are several equivalent interpretations of the AUROC.
- The probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
- The expected proportion of positives ranked before a uniformly drawn random negative.
- The expected true positive rate if the ranking is split just before a uniformly drawn random negative.
- The expected proportion of negatives ranked after a uniformly drawn random positive.
- The expected false positive rate if the ranking is split just after a uniformly drawn random positive.
The confusion matrix #
Assume a probabilistic binary classifier such as logistic regression. For a given decision threshold \(t\), every prediction falls into one of four buckets:
| | Predicted Negative | Predicted Positive | Total |
|---|---|---|---|
| Actually Negative | True Negative \((TN)\) | False Positive \((FP)\) | \(N\) |
| Actually Positive | False Negative \((FN)\) | True Positive \((TP)\) | \(P\) |
From these four cells we derive two key rates:
$$\text{TPR (True Positive Rate / Sensitivity / Recall)} = \frac{TP}{TP + FN}$$

$$\text{FPR (False Positive Rate)} = \frac{FP}{FP + TN}$$

Note that \(\text{FPR} = 1 - \text{Specificity}\).
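These rates can be read straight off scikit-learn's `confusion_matrix`; a minimal sketch with made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1])  # predictions at some threshold t

# for binary labels sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # sensitivity / recall
fpr = fp / (fp + tn)  # 1 - specificity
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")  # TPR=0.75, FPR=0.33
```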
Building the ROC curve #
A logistic regression model outputs a probability \(\hat{p} \in [0,1]\) for each observation. To classify, we pick a threshold \(t\) and predict “positive” when \(\hat{p} \geq t\).
- Sweep \(t\) over a grid, e.g. \(t \in \{0.00, 0.01, 0.02, \ldots, 1.00\}\).
- At each \(t\), compute FPR and TPR from the confusion matrix.
- Plot TPR (y-axis) vs. FPR (x-axis).
The resulting curve is the ROC curve. The AUROC is the area enclosed between this curve and the x-axis:

$$\text{AUROC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})$$

Key reference points #
| AUROC | Interpretation |
|---|---|
| \(1.0\) | Perfect classifier |
| \(0.5\) | Random / no-skill classifier (the diagonal line) |
| \(0.0\) | Perfectly wrong classifier (all predictions flipped) |
The dashed diagonal line \((\text{FPR} = \text{TPR})\) represents a random predictor and serves as the baseline.
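The threshold sweep described above can be sketched directly in NumPy. The scores below are synthetic (an assumption for illustration), and the area is accumulated by summing trapezoids between consecutive (FPR, TPR) points:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic scores: positives tend to score higher than negatives
y = np.array([1] * 50 + [0] * 50)
scores = np.clip(np.concatenate([rng.normal(0.6, 0.2, 50),
                                 rng.normal(0.4, 0.2, 50)]), 0, 1)

tpr_list, fpr_list = [], []
for t in np.linspace(0, 1, 101):                  # sweep the threshold grid
    pred = scores >= t
    tpr_list.append((pred & (y == 1)).sum() / (y == 1).sum())
    fpr_list.append((pred & (y == 0)).sum() / (y == 0).sum())

# reverse so FPR is ascending, then sum trapezoid areas
fpr_arr = np.array(fpr_list)[::-1]
tpr_arr = np.array(tpr_list)[::-1]
auc = np.sum(np.diff(fpr_arr) * (tpr_arr[:-1] + tpr_arr[1:]) / 2)
print(round(auc, 3))
```

With this degree of overlap between the two score distributions, the area comes out well above the 0.5 baseline but well below a perfect 1.0.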
The probabilistic interpretation #
This is the most powerful way to understand AUROC. AUROC equals the probability that a randomly drawn positive example is ranked higher (i.e., receives a higher predicted score) than a randomly drawn negative example.
Formally, let \(X_+\) be the score assigned to a random positive and \(X_-\) the score for a random negative:
$$\text{AUROC} = P(X_+ > X_-)$$

This means:
| AUROC | Interpretation |
|---|---|
| 1.0 | The model always ranks positives above negatives. |
| 0.5 | The model ranks them at random. |
| 0.8 | There is an 80% chance the model will rank a random positive above a random negative. |
AUROC is therefore purely a ranking metric: it does not depend on the absolute calibration of the predicted probabilities, only on their relative ordering.
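To see the ranking-only property concretely, here is a small check on synthetic data (an assumption for illustration): applying a strictly increasing transform to the scores destroys calibration but preserves their ordering, so the AUROC is unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y = rng.integers(0, 2, 200)
scores = rng.random(200) + 0.5 * y   # noisy but informative scores

# cubing is strictly increasing for non-negative scores:
# same ranking, very different "probabilities"
print(roc_auc_score(y, scores) == roc_auc_score(y, scores ** 3))  # True
```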
Computing the AUROC in Scikit-Learn #
Minimal example #
```python
import numpy as np
from sklearn import metrics

y_true = np.array(
    ['P', 'P', 'N', 'P', 'P',
     'P', 'N', 'N', 'P', 'N',
     'P', 'N', 'P', 'N', 'N',
     'N', 'P', 'N', 'P', 'N']
)
y_score = np.array(
    [0.9, 0.8, 0.7, 0.6, 0.55,
     0.51, 0.49, 0.43, 0.42, 0.39,
     0.33, 0.31, 0.23, 0.22, 0.19,
     0.15, 0.12, 0.11, 0.04, 0.01]
)
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label='P')
print(metrics.auc(fpr, tpr))  # 0.67999
```

This returns 0.67999, but you can also approximate it by simulation. Using the same data as above, draw random positive and negative examples and calculate the proportion of cases in which the positive has a greater score than the negative.
```python
pos = y_score[y_true == 'P']
neg = y_score[y_true == 'N']
rng = np.random.default_rng(33)
p = rng.choice(pos, size=50000) > rng.choice(neg, size=50000)
print(p.mean())  # 0.67916
```

And you get 0.67916. Quite close, isn't it? More information about the AUC can be found at sklearn.metrics.auc in the Scikit-Learn documentation.
Full end-to-end example with logistic regression #
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# generate synthetic binary dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_classes=2, random_state=0)

# train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# fit logistic regression
model = LogisticRegression()
model.fit(X_train, y_train)

# predicted probabilities for the positive class
preds = model.predict_proba(X_test)[:, 1]

# compute AUROC
print("AUROC:", roc_auc_score(y_test, preds))

# plot ROC curve
fpr, tpr, _ = roc_curve(y_test, preds)
plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, preds):.2f}")
plt.plot([0, 1], [0, 1], 'k--', label="Random (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
```

Figure 1 shows the area under the receiver operating characteristic curve. The dashed diagonal line represents the ROC curve of a random predictor, which has an AUROC of 0.5; it is commonly used as a baseline to judge whether the model is useful at all.
Figure 1 – Receiver operating characteristic (ROC) curve for a logistic regression classifier.
More information can be found at sklearn.metrics.roc_curve and sklearn.metrics.roc_auc_score at the Scikit-Learn documentation website.
Calculating the AUC directly on the breast cancer dataset #
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver="newton-cholesky", random_state=0)
clf.fit(X, y)
roc_auc_score(y, clf.predict_proba(X)[:, 1])  # 0.99
roc_auc_score(y, clf.decision_function(X))    # 0.99
```

Trapezoidal integration for AUC #
In practice, AUC is approximated numerically using the trapezoid rule: the area under the ROC curve is divided into trapezoids defined by consecutive (FPR, TPR) points, and their areas are summed:
$$\text{AUROC} \approx \sum_{i=1}^{n-1} \frac{(\text{FPR}_{i+1} - \text{FPR}_i)(\text{TPR}_i + \text{TPR}_{i+1})}{2}$$

Why not just use accuracy? #
Consider a heavily imbalanced dataset where 95% of samples belong to class 1. A naive model that always predicts class 1 achieves 95% accuracy yet has zero ability to distinguish between classes. AUROC exposes this failure:
```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

n, ratio = 10000, 0.95
y = np.array([0] * int((1 - ratio) * n) + [1] * int(ratio * n))

# model that always predicts class 1
y_proba_naive = np.ones(n)

print("Accuracy:", accuracy_score(y, y_proba_naive > 0.5))  # 0.95
print("AUROC:", roc_auc_score(y, y_proba_naive))            # 0.5 (random!)
```

AUROC = 0.5 correctly flags the naive model as no better than chance.
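Conversely, a model with genuine ranking skill can look identical on accuracy yet stand out on AUROC. The scores below are simulated (an assumption for illustration), with positives drawn slightly higher than negatives:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
n, ratio = 10000, 0.95
y = np.array([0] * int((1 - ratio) * n) + [1] * int(ratio * n))

# simulated model with some skill: positives score higher on average
y_proba = np.clip(rng.normal(0.7 + 0.1 * y, 0.1), 0, 1)

print("Accuracy:", accuracy_score(y, y_proba > 0.5))  # still ~0.95
print("AUROC:", roc_auc_score(y, y_proba))            # clearly above 0.5
```

Accuracy barely moves because almost everything is still predicted positive at the 0.5 threshold, but the AUROC reveals that this model, unlike the naive one, actually ranks positives above negatives.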
Choosing a threshold from the ROC curve #
The ROC curve also guides threshold selection.
| Priority | Use |
|---|---|
| Low FPR priority (e.g. legal systems, spam filters) | Pick a point far left on the curve (high \(t\), tolerating more missed positives). |
| High TPR priority (e.g. cancer screening) | Pick a point near the top of the curve (lower \(t\), accepting more false alarms). |
| Balanced trade-off | The point on the curve closest to \((0, 1)\). |
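The closest-to-\((0, 1)\) rule from the last row can be sketched as follows (synthetic data and model, both assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
fpr, tpr, thresholds = roc_curve(y, model.predict_proba(X)[:, 1])

# balanced trade-off: the (FPR, TPR) point closest to the top-left corner (0, 1)
i = np.argmin(np.hypot(fpr, 1 - tpr))
print(thresholds[i], (fpr[i], tpr[i]))
```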
The balanced trade-off in the last row of the table is closely related to Youden's J statistic, which selects the threshold maximizing the vertical distance between the ROC curve and the diagonal:

$$J = \text{TPR} - \text{FPR} = \text{Sensitivity} + \text{Specificity} - 1$$

Important caveats #
| Limitation | Detail |
|---|---|
| Calibration-insensitive | AUROC is identical if probabilities range 0.9–1.0 vs 0–1, as long as ranking order is preserved. |
| Not a measure of absolute performance | A high AUROC may still correspond to poor precision or negative predictive value, especially on imbalanced data. |
| Single-number oversimplification | Collapsing the full curve to one scalar ignores threshold-specific trade-offs. |
| Multiclass extension | For \(c\) classes, one common approach averages pairwise AUC over all \(\frac{c(c-1)}{2}\) pairs. |
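Scikit-learn implements this pairwise averaging directly via `multi_class="ovo"` in `roc_auc_score`; a sketch on the Iris dataset (chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)   # 3 classes -> 3 one-vs-one pairs
clf = LogisticRegression(max_iter=1000).fit(X, y)

# "ovo" averages the AUC over all c(c-1)/2 class pairs
print(roc_auc_score(y, clf.predict_proba(X), multi_class="ovo"))
```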
Quick reference cheat sheet #
$$\text{TPR} = \frac{TP}{TP+FN}, \quad \text{FPR} = \frac{FP}{FP+TN}$$

$$\text{AUROC} = \int_0^1 \text{TPR}(u)\, du = P(X_+ > X_-)$$

$$\text{AUC}_{\text{trapezoid}} \approx \sum_{i} \frac{(\Delta \text{FPR}_i)(\text{TPR}_i + \text{TPR}_{i+1})}{2}$$

Further reading #
- What does AUC stand for and what is it? – Cross Validated
- sklearn.metrics.roc_curve – Scikit-Learn documentation
- sklearn.metrics.roc_auc_score – Scikit-Learn documentation
- ROC and AUC – Google ML Crash Course
- ROC curves and AUC explained – Data School video explainer
- Receiver Operating Characteristic – Wikipedia