EstimatorReport: Get insights from any scikit-learn estimator#
This example shows how the skore.EstimatorReport class can be used to quickly get insights from any scikit-learn estimator.
Loading our dataset and defining our estimator#
First, we load a dataset from skrub. Our goal is to predict whether a company paid a physician, with the broader aim of detecting potential conflicts of interest.
from skrub.datasets import fetch_open_payments
dataset = fetch_open_payments()
df = dataset.X
y = dataset.y
Downloading 'open_payments' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/open_payments.zip (attempt 1/3)
from skrub import TableReport
TableReport(df)
Looking at the distribution of the target, we observe that this classification task is quite imbalanced. This means that we have to be careful when selecting a set of statistical metrics to evaluate the classification performance of our predictive model. In addition, the class labels are not encoded as the integers 0 and 1 but as the strings “allowed” and “disallowed”.
For our application, the label of interest is “allowed”.
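Several calls below take the positive label as an argument, and the custom metric further down also needs the negative one, so let's store both; the values are simply the two target strings shown above:
# positive and negative class labels, taken from the target values above
pos_label, neg_label = "allowed", "disallowed"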
Before training a predictive model, we need to split our dataset into a training and a validation set.
from skore import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in │
│ the scores obtained on the test set. To tackle this challenge we suggest to use │
│ skore's cross_validate function. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from │
│ its default value. In case of time-ordered events (even if they are independent), │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in │
│ order to ensure the evaluation process is really representative of your production │
│ release process. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
By the way, notice how skore’s train_test_split() automatically warns us about the class imbalance.
Now, we need to define a predictive model. Fortunately, skrub provides a convenient function (skrub.tabular_learner()) to get a strong baseline predictive model with a single line of code. Since its feature engineering is generic, it does not offer handcrafted, tailored features, but it is still a good starting point.
So let’s create a classifier for our task and fit it on the training set.
from skrub import tabular_learner
estimator = tabular_learner("classifier").fit(X_train, y_train)
estimator
Getting insights from our estimator#
Introducing the skore.EstimatorReport class#
Now, we would like to get some insights from our predictive model. One way is to use the skore.EstimatorReport class. Its constructor detects that our estimator is already fitted and will not fit it again.
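A minimal sketch of how such a report can be built, passing the already-fitted estimator together with the train and test splits (the keyword names are assumed from the skore API):
from skore import EstimatorReport

# the estimator is already fitted, so the report will not refit it
report = EstimatorReport(
    estimator, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test
)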
Once the report is created, we get an overview of the tools available for getting insights from our specific model on our specific task by calling the help() method.
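Calling it on the report gives the overview below:
report.help()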
╭──────────── Tools to diagnose estimator HistGradientBoostingClassifier ─────────────╮
│ EstimatorReport │
│ ├── .metrics │
│ │ ├── .accuracy(...) (↗︎) - Compute the accuracy score. │
│ │ ├── .brier_score(...) (↘︎) - Compute the Brier score. │
│ │ ├── .log_loss(...) (↘︎) - Compute the log loss. │
│ │ ├── .precision(...) (↗︎) - Compute the precision score. │
│ │ ├── .precision_recall(...) - Plot the precision-recall curve. │
│ │ ├── .recall(...) (↗︎) - Compute the recall score. │
│ │ ├── .roc(...) - Plot the ROC curve. │
│ │ ├── .roc_auc(...) (↗︎) - Compute the ROC AUC score. │
│ │ ├── .custom_metric(...) - Compute a custom metric. │
│ │ └── .report_metrics(...) - Report a set of metrics for our estimator. │
│ ├── .cache_predictions(...) - Cache estimator's predictions. │
│ ├── .clear_cache(...) - Clear the cache. │
│ └── Attributes │
│ ├── .X_test │
│ ├── .X_train │
│ ├── .y_test │
│ ├── .y_train │
│ ├── .estimator_ │
│ └── .estimator_name_ │
│ │
│ │
│ Legend: │
│ (↗︎) higher is better (↘︎) lower is better │
╰─────────────────────────────────────────────────────────────────────────────────────╯
Be aware that we can access the help for each individual sub-accessor. For instance:
report.metrics.help()
╭─────────────────────────── Available metrics methods ───────────────────────────╮
│ report.metrics │
│ ├── .accuracy(...) (↗︎) - Compute the accuracy score. │
│ ├── .brier_score(...) (↘︎) - Compute the Brier score. │
│ ├── .log_loss(...) (↘︎) - Compute the log loss. │
│ ├── .precision(...) (↗︎) - Compute the precision score. │
│ ├── .precision_recall(...) - Plot the precision-recall curve. │
│ ├── .recall(...) (↗︎) - Compute the recall score. │
│ ├── .roc(...) - Plot the ROC curve. │
│ ├── .roc_auc(...) (↗︎) - Compute the ROC AUC score. │
│ ├── .custom_metric(...) - Compute a custom metric. │
│ └── .report_metrics(...) - Report a set of metrics for our estimator. │
│ │
│ │
│ Legend: │
│ (↗︎) higher is better (↘︎) lower is better │
╰─────────────────────────────────────────────────────────────────────────────────╯
Metrics computation with aggressive caching#
At this point, we might be interested in having a first look at the statistical performance of our model on the validation set that we provided. We can access it by calling any of the metrics displayed above. Since we are greedy, we want several metrics at once, so we will use the report_metrics() method.
import time

start = time.time()
metric_report = report.metrics.report_metrics(pos_label=pos_label)
end = time.time()
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")
metric_report
Time taken to compute the metrics: 4.72 seconds
An interesting feature provided by the skore.EstimatorReport is the caching mechanism. Indeed, when we have a large enough dataset, computing the predictions for a model is not cheap anymore. For instance, on our smallish dataset, it took a couple of seconds to compute the metrics. The report caches the predictions, so if we are interested in computing a metric again, or an alternative metric that requires the same predictions, it will be faster. Let's check by requesting the same metrics report again.
start = time.time()
metric_report = report.metrics.report_metrics(pos_label=pos_label)
end = time.time()
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")
metric_report
Time taken to compute the metrics: 0.00 seconds
Since we obtain a pandas dataframe, we can also use the plotting interface of pandas.
import matplotlib.pyplot as plt
ax = metric_report.plot.barh()
ax.set_title("Metrics report")
plt.tight_layout()

Whenever we compute a metric, we check whether the predictions are available in the cache and reuse them if they are. For instance, let's compute the log loss.
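A minimal sketch of the timed call, reusing the cached predictions (the exact cell is not shown above):
start = time.time()
loss = report.metrics.log_loss()
end = time.time()
print(loss)
print(f"Time taken to compute the log loss: {end - start:.2f} seconds")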
0.12132732949290304
Time taken to compute the log loss: 0.03 seconds
We can show that, without the initial cache, it would have taken more time to compute the log loss.
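For instance, clearing the cache before the same call (clear_cache() is listed in the help output above):
report.clear_cache()
start = time.time()
loss = report.metrics.log_loss()
end = time.time()
print(loss)
print(f"Time taken to compute the log loss: {end - start:.2f} seconds")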
0.12132732949290304
Time taken to compute the log loss: 1.57 seconds
By default, the metrics are computed on the test set only. However, if a training set is provided, we can also compute the metrics on it by specifying the data_source parameter.
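For instance, assuming the value below is the log loss on the training set:
report.metrics.log_loss(data_source="train")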
0.09789401747509845
In the case where we are interested in computing the metrics on a completely new set of data, we can use the data_source="X_y" parameter. In addition, we need to provide the X and y parameters.
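A sketch of such a call, here passing the validation set again purely as an illustration:
start = time.time()
metric_report = report.metrics.report_metrics(
    data_source="X_y", X=X_test, y=y_test, pos_label=pos_label
)
end = time.time()
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")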
Time taken to compute the metrics: 4.90 seconds
As in the other case, we rely on the cache to avoid recomputing the predictions. Internally, we compute a hash of the input data to be sure that we can hit the cache in a consistent way.
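Repeating the exact same call should now hit the cache, since the hash of X and y matches:
start = time.time()
metric_report = report.metrics.report_metrics(
    data_source="X_y", X=X_test, y=y_test, pos_label=pos_label
)
end = time.time()
print(f"Time taken to compute the metrics: {end - start:.2f} seconds")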
Time taken to compute the metrics: 0.18 seconds
Note
In this last example, we rely on computing the hash of the input data. Therefore, there is a trade-off: the computation of the hash is not free and it might be faster to compute the predictions instead.
Be aware that we can also benefit from the caching mechanism with our own custom metrics. Skore only expects that we define our own metric function to take y_true and y_pred as the first two positional arguments; it can take any other arguments. Let's see an example.
def operational_decision_cost(y_true, y_pred, amount):
    mask_true_positive = (y_true == pos_label) & (y_pred == pos_label)
    mask_true_negative = (y_true == neg_label) & (y_pred == neg_label)
    mask_false_positive = (y_true == neg_label) & (y_pred == pos_label)
    mask_false_negative = (y_true == pos_label) & (y_pred == neg_label)
    fraudulent_refuse = mask_true_positive.sum() * 50
    fraudulent_accept = -amount[mask_false_negative].sum()
    legitimate_refuse = mask_false_positive.sum() * -5
    legitimate_accept = (amount[mask_true_negative] * 0.02).sum()
    return fraudulent_refuse + fraudulent_accept + legitimate_refuse + legitimate_accept
In our use case, we have an operational decision to make that translates the classification outcome into a cost. It translates the confusion matrix into a cost matrix based on an amount linked to each sample in the dataset, which is provided to us. Here, we randomly generate some amounts as an illustration.
import numpy as np
rng = np.random.default_rng(42)
amount = rng.integers(low=100, high=1000, size=len(y_test))
Let's first make sure that the predict method has been called and its result cached: we compute the accuracy metric, which triggers the predict call.
report.metrics.accuracy()
0.9526916802610114
We can now compute the cost of our operational decision.
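A sketch of the call via the custom_metric accessor, with the keyword names assumed from the skore documentation:
start = time.time()
cost = report.metrics.custom_metric(
    metric_function=operational_decision_cost,
    response_method="predict",
    amount=amount,
)
end = time.time()
print(cost)
print(f"Time taken to compute the cost: {end - start:.2f} seconds")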
-124349.77999999997
Time taken to compute the cost: 0.01 seconds
Let's now clear the cache and see how much slower the same computation is.
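Repeating the same call after clearing the cache (same assumptions as above):
report.clear_cache()
start = time.time()
cost = report.metrics.custom_metric(
    metric_function=operational_decision_cost,
    response_method="predict",
    amount=amount,
)
end = time.time()
print(cost)
print(f"Time taken to compute the cost: {end - start:.2f} seconds")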
-124349.77999999997
Time taken to compute the cost: 1.52 seconds
We observe that caching is working as expected. It is really handy because it means that we can compute some additional metrics without having to recompute the predictions.
It could happen that we are interested in providing several custom metrics that do not necessarily share the same parameters. In this more complex case, skore requires us to provide a scorer using the sklearn.metrics.make_scorer() function.
from sklearn.metrics import make_scorer, f1_score

f1_scorer = make_scorer(f1_score, response_method="predict", pos_label=pos_label)
operational_decision_cost_scorer = make_scorer(
    operational_decision_cost, response_method="predict", amount=amount
)
report.metrics.report_metrics(
    scoring=[f1_scorer, operational_decision_cost_scorer],
    scoring_names=["F1 Score", "Operational Decision Cost"],
)
Effortless one-liner plotting#
The skore.EstimatorReport class also provides a plotting interface that allows us to produce the most common plots out of the box. As for the metrics, only the plots that are meaningful for the provided estimator are available.
report.metrics.help()
╭─────────────────────────── Available metrics methods ───────────────────────────╮
│ report.metrics │
│ ├── .accuracy(...) (↗︎) - Compute the accuracy score. │
│ ├── .brier_score(...) (↘︎) - Compute the Brier score. │
│ ├── .log_loss(...) (↘︎) - Compute the log loss. │
│ ├── .precision(...) (↗︎) - Compute the precision score. │
│ ├── .precision_recall(...) - Plot the precision-recall curve. │
│ ├── .recall(...) (↗︎) - Compute the recall score. │
│ ├── .roc(...) - Plot the ROC curve. │
│ ├── .roc_auc(...) (↗︎) - Compute the ROC AUC score. │
│ ├── .custom_metric(...) - Compute a custom metric. │
│ └── .report_metrics(...) - Report a set of metrics for our estimator. │
│ │
│ │
│ Legend: │
│ (↗︎) higher is better (↘︎) lower is better │
╰─────────────────────────────────────────────────────────────────────────────────╯
Let’s start by plotting the ROC curve for our binary classification task.
display = report.metrics.roc(pos_label=pos_label)
display.plot()
plt.tight_layout()

The plot functionality is built upon the scikit-learn display objects. We return those displays (slightly modified to improve the UI) in case we want to tweak some of the plot properties. We can have a quick look at the available attributes and methods by calling the help() method or simply by printing the display.
skore.RocCurveDisplay(...)
╭─ RocCurveDisplay for HistGradientBoostingClassifier ─╮
│ display │
│ ├── Attributes │
│ │ ├── .ax_ │
│ │ ├── .chance_level_ │
│ │ ├── .figure_ │
│ │ └── .lines_ │
│ └── Methods │
│ └── .plot(...) - Plot visualization. │
╰──────────────────────────────────────────────────────╯
display.plot()
display.ax_.set_title("Example of a ROC curve")
display.figure_
plt.tight_layout()

Similarly to the metrics, we aggressively use caching to avoid recomputing the predictions of the model. We also cache the plot display object by detecting whether the input parameters are the same as in the previous call. Let's demonstrate the kind of performance gain we can get.
start = time.time()
# the predictions were already computed and cached in a previous call
display = report.metrics.roc(pos_label=pos_label)
display.plot()
plt.tight_layout()
end = time.time()
print(f"Time taken to compute the ROC curve: {end - start:.2f} seconds")

Time taken to compute the ROC curve: 0.04 seconds
Now, let's clear the cache and check whether we get a slowdown.
report.clear_cache()
start = time.time()
display = report.metrics.roc(pos_label=pos_label)
display.plot()
plt.tight_layout()
end = time.time()
print(f"Time taken to compute the ROC curve: {end - start:.2f} seconds")

Time taken to compute the ROC curve: 1.58 seconds
As expected, since we need to recompute the predictions, it takes more time.
Total running time of the script: (0 minutes 27.286 seconds)