train_test_split: get diagnostics when splitting your data#
This example illustrates the motivation and the use of skore's skore.train_test_split() to get assistance when developing ML/DS projects.
Train-test split in scikit-learn#
Scikit-learn has a function for splitting the data into train and test sets: sklearn.model_selection.train_test_split(). Its signature is the following:
sklearn.model_selection.train_test_split(
*arrays,
test_size=None,
train_size=None,
random_state=None,
shuffle=True,
stratify=None
)
where *arrays is a Python *args (it allows you to pass a varying number of positional arguments), and the scikit-learn doc indicates that it is a sequence of indexables with same length / shape[0].
Let us construct a design matrix X and target y to illustrate our point:
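For instance, with NumPy (a minimal sketch; any arrays of matching length would do):

```python
import numpy as np

# Design matrix with 5 samples and 2 features, and a matching target
X = np.arange(10).reshape(5, 2)
y = np.arange(5)
```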
X = array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
y = array([0, 1, 2, 3, 4])
In scikit-learn, the most common usage is the following:
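A typical call, consistent with the split shown below (test_size=0.2 and random_state=0 are assumptions, chosen to match the later examples):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)
y = np.arange(5)

# 80% train / 20% test, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```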
X_train = array([[0, 1],
[2, 3],
[6, 7],
[8, 9]])
y_train = array([0, 1, 3, 4])
X_test = array([[4, 5]])
y_test = array([2])
Notice the shuffling that is done by default.
In scikit-learn, the user cannot explicitly set the design matrix X and the target y. The following:
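For instance, a call of this shape (a sketch of the failing usage, reusing the X and y from above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)
y = np.arange(5)

# scikit-learn only accepts the arrays positionally, so keyword
# arguments named X and y are rejected with a TypeError
try:
    train_test_split(X=X, y=y, test_size=0.2, random_state=0)
except TypeError as exc:
    print(exc)
```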
would return:
TypeError: got an unexpected keyword argument 'X'
In general, in Python, keyword arguments are useful to prevent typos. For example, in the following, X and y are reversed:
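For instance, accidentally swapping the two positional arguments (a sketch; test_size=0.2 and random_state=0 as above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)
y = np.arange(5)

# Oops: y is passed where X was intended, and vice versa
X_train, X_test, y_train, y_test = train_test_split(
    y, X, test_size=0.2, random_state=0
)
```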
X_train = array([0, 1, 3, 4])
y_train = array([[0, 1],
[2, 3],
[6, 7],
[8, 9]])
X_test = array([2])
y_test = array([[4, 5]])
but Python will not catch this mistake for us. This is where skore comes in handy.
Train-test split in skore#
Skore has its own skore.train_test_split() that wraps scikit-learn's sklearn.model_selection.train_test_split().
Making the positional arguments for X and y explicit#
First of all, naturally, it can be used as a simple drop-in replacement for scikit-learn:
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from │
│ its default value. In case of time-ordered events (even if they are independent), │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in │
│ order to ensure the evaluation process is really representative of your production │
│ release process. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Note
The outputs of skore.train_test_split() are intentionally exactly the same as those of sklearn.model_selection.train_test_split(), so the user can use the skore version as a drop-in replacement for scikit-learn.
Contrary to scikit-learn, skore allows users to pass X and y explicitly as keyword arguments, making detection of potential issues easier:
X_train, X_test, y_train, y_test = skore.train_test_split(
X=X, y=y, test_size=0.2, random_state=0
)
X_train_explicit = X_train.copy()
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from │
│ its default value. In case of time-ordered events (even if they are independent), │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in │
│ order to ensure the evaluation process is really representative of your production │
│ release process. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Moreover, when passing X and y explicitly, the X's are always returned before the y's, even when they are inverted:
arr = X.copy()
arr_train, arr_test, X_train, X_test, y_train, y_test = skore.train_test_split(
arr, y=y, X=X, test_size=0.2, random_state=0
)
X_train_explicit_inverted = X_train.copy()
print("When expliciting, with the small typo, are the `X_train`'s still the same?")
print(np.allclose(X_train_explicit, X_train_explicit_inverted))
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from │
│ its default value. In case of time-ordered events (even if they are independent), │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in │
│ order to ensure the evaluation process is really representative of your production │
│ release process. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
When expliciting, with the small typo, are the `X_train`'s still the same?
True
Automatic diagnostics: raising methodological warnings#
In this section, we show how skore can provide methodological checks.
Class imbalance#
In machine learning, class imbalance (when the classes in a dataset are not equally represented) requires specific modelling. For example, in a dataset with a 95% majority class (class 1) and a 5% minority class (class 0), a dummy model that always predicts class 1 will have 95% accuracy, while being useless for identifying examples of class 0. Hence, it is important to detect class imbalance.
Suppose that we have imbalanced data:
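For instance, data of this shape would exhibit the imbalance (a sketch, assuming the 95/5 class ratio described above):

```python
import numpy as np

# 95% of samples in class 1, 5% in class 0
y = np.array([1] * 95 + [0] * 5)
X = np.random.default_rng(0).normal(size=(100, 2))

# The minority class is heavily under-represented
print(np.bincount(y))  # counts per class: 5 vs 95
```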
In that case, skore.train_test_split() raises a HighClassImbalanceWarning:
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in │
│ the scores obtained on the test set. To tackle this challenge we suggest to use │
│ skore's cross_validate function. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from │
│ its default value. In case of time-ordered events (even if they are independent), │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in │
│ order to ensure the evaluation process is really representative of your production │
│ release process. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Hence, skore recommends that users take this class imbalance, which they might have missed, into account in their modelling strategy.
Moreover, skore also detects class imbalance where a class has too few examples, raising a HighClassImbalanceTooFewExamplesWarning:
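A sketch of data that could trigger this: with a 25% test set, a moderately rare class can easily end up with fewer than 100 examples in the test set (the class sizes here are illustrative, not taken from the original snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 samples, only 200 in the minority class: roughly 50 of
# them would land in a 25% test set, i.e. fewer than 100
y = np.array([1] * 800 + [0] * 200)
X = rng.normal(size=(1000, 2))
```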
╭────────────────────── HighClassImbalanceTooFewExamplesWarning ───────────────────────╮
│ It seems that you have a classification problem with at least one class with fewer │
│ than 100 examples in the test set. In this case, using train_test_split may not be a │
│ good idea because of high variability in the scores obtained on the test set. We │
│ suggest three options to tackle this challenge: you can increase test_size, collect │
│ more data, or use skore's cross_validate function. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in │
│ the scores obtained on the test set. To tackle this challenge we suggest to use │
│ skore's cross_validate function. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from │
│ its default value. In case of time-ordered events (even if they are independent), │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in │
│ order to ensure the evaluation process is really representative of your production │
│ release process. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Shuffling without a random state#
For reproducible results across executions, skore recommends setting the random_state parameter when shuffling (remember that shuffle=True by default), raising a RandomStateUnsetWarning otherwise:
╭────────────────────────────── RandomStateUnsetWarning ───────────────────────────────╮
│ We recommend setting the parameter `random_state`. This will ensure the │
│ reproducibility of your work. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from │
│ its default value. In case of time-ordered events (even if they are independent), │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in │
│ order to ensure the evaluation process is really representative of your production │
│ release process. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
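The underlying reproducibility issue can be seen with scikit-learn directly: without random_state, two calls generally produce different splits (a sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(1000)

# No random_state: each call draws a fresh shuffle
train_a, test_a = train_test_split(data, test_size=0.2)
train_b, test_b = train_test_split(data, test_size=0.2)

# The two test sets almost surely differ
print(np.array_equal(test_a, test_b))
```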
Time series data#
Now, let us assume that we have time series data, i.e. the data is time-ordered:
import pandas as pd
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
X, y = dataset.X, dataset.y
X["date_first_hired"] = pd.to_datetime(X["date_first_hired"])
X.head(2)
We can observe that there is a date_first_hired column which is time-based. As one cannot shuffle time (time only moves in one direction: forward), we recommend using sklearn.model_selection.TimeSeriesSplit instead of sklearn.model_selection.train_test_split() (or skore.train_test_split()), which skore signals with a TimeBasedColumnWarning:
╭─────────────────────────────── TimeBasedColumnWarning ───────────────────────────────╮
│ We detected some time-based columns (column "date_first_hired") in your data. We │
│ recommend using scikit-learn's TimeSeriesSplit instead of train_test_split. │
│ Otherwise you might train on future data to predict the past, or get inflated model │
│ performance evaluation because natural drift will not be taken into account. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
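As a sketch of the recommended alternative, sklearn.model_selection.TimeSeriesSplit yields folds in which the training indices always precede the test indices, so the model never trains on the future (the toy array here is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(6, 2)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Each test fold comes strictly after its training fold in time
    print(train_idx, test_idx)
# [0 1 2] [3]
# [0 1 2 3] [4]
# [0 1 2 3 4] [5]
```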
Total running time of the script: (0 minutes 0.058 seconds)