.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/model_evaluation/plot_train_test_split.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_model_evaluation_plot_train_test_split.py:

.. _example_train_test_split:

============================================================
`train_test_split`: get diagnostics when splitting your data
============================================================

This example illustrates the motivation and the use of skore's
:func:`skore.train_test_split` to get assistance when developing ML/DS projects.

.. GENERATED FROM PYTHON SOURCE LINES 13-34

Train-test split in scikit-learn
================================

Scikit-learn has a function for splitting the data into train and test sets:
:func:`sklearn.model_selection.train_test_split`.
Its signature is the following:

.. code-block:: python

    sklearn.model_selection.train_test_split(
        *arrays,
        test_size=None,
        train_size=None,
        random_state=None,
        shuffle=True,
        stratify=None
    )

where ``*arrays`` is a Python ``*args`` (it allows you to pass a varying number
of positional arguments) and the scikit-learn doc indicates that it is
``a sequence of indexables with same length / shape[0]``.

.. GENERATED FROM PYTHON SOURCE LINES 36-37

Let us construct a design matrix ``X`` and target ``y`` to illustrate our point:

.. GENERATED FROM PYTHON SOURCE LINES 39-45

.. code-block:: Python

    import numpy as np

    X = np.arange(10).reshape((5, 2))
    y = np.arange(5)
    print(f"{X = }\n{y = }")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    X = array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    y = array([0, 1, 2, 3, 4])

.. GENERATED FROM PYTHON SOURCE LINES 46-47

In scikit-learn, the most common usage is the following:
.. GENERATED FROM PYTHON SOURCE LINES 49-56

.. code-block:: Python

    from sklearn.model_selection import train_test_split as sklearn_train_test_split

    X_train, X_test, y_train, y_test = sklearn_train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    print(f"{X_train = }\n{y_train = }\n{X_test = }\n{y_test = }")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    X_train = array([[0, 1],
           [2, 3],
           [6, 7],
           [8, 9]])
    y_train = array([0, 1, 3, 4])
    X_test = array([[4, 5]])
    y_test = array([2])

.. GENERATED FROM PYTHON SOURCE LINES 57-58

Notice the shuffling that is done by default.

.. GENERATED FROM PYTHON SOURCE LINES 60-76

In scikit-learn, the user cannot pass the design matrix ``X`` and the target
``y`` as keyword arguments. The following:

.. code-block:: python

    X_train, X_test, y_train, y_test = sklearn_train_test_split(
        X=X, y=y, test_size=0.2, random_state=0)

would return:

.. code-block:: python

    TypeError: got an unexpected keyword argument 'X'

In general, in Python, keyword arguments are useful to prevent typos. For
example, in the following, ``X`` and ``y`` are reversed:

.. GENERATED FROM PYTHON SOURCE LINES 78-83

.. code-block:: Python

    X_train, X_test, y_train, y_test = sklearn_train_test_split(
        y, X, test_size=0.2, random_state=0
    )
    print(f"{X_train = }\n{y_train = }\n{X_test = }\n{y_test = }")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    X_train = array([0, 1, 3, 4])
    y_train = array([[0, 1],
           [2, 3],
           [6, 7],
           [8, 9]])
    X_test = array([2])
    y_test = array([[4, 5]])

.. GENERATED FROM PYTHON SOURCE LINES 84-86

but Python will not catch this mistake for us.
This is where skore comes in handy.

.. GENERATED FROM PYTHON SOURCE LINES 88-90

Train-test split in skore
=========================

.. GENERATED FROM PYTHON SOURCE LINES 92-94

Skore has its own :func:`skore.train_test_split` that wraps scikit-learn's
:func:`sklearn.model_selection.train_test_split`.

.. GENERATED FROM PYTHON SOURCE LINES 96-99
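The typo-catching value of keyword arguments, which motivates skore's design,
can be seen with plain Python keyword-only parameters. The following is a toy
sketch (unrelated to either library's actual implementation): with ``X`` and
``y`` as keyword-only parameters, swapping them positionally is rejected
outright.

```python
def split(*, X, y, test_size=0.2):
    """Toy splitter with keyword-only X and y: a positional call
    (and hence a silent X/y swap) is impossible."""
    n_test = max(1, int(len(X) * test_size))
    return X[:-n_test], X[-n_test:], y[:-n_test], y[-n_test:]

X = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
y = [0, 1, 2, 3, 4]

X_train, X_test, y_train, y_test = split(X=X, y=y)
print(X_test, y_test)  # [[8, 9]] [4]

# Positional calls fail loudly instead of silently swapping X and y:
try:
    split(y, X)
except TypeError as e:
    print("caught:", type(e).__name__)  # caught: TypeError
```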
.. code-block:: Python

    X = np.arange(10_000).reshape((5_000, 2))
    y = [0] * 2_500 + [1] * 2_500

.. GENERATED FROM PYTHON SOURCE LINES 100-102

Passing ``X`` and ``y`` explicitly as keyword arguments
--------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 104-106

First of all, naturally, it can be used as a simple drop-in replacement for
scikit-learn:

.. GENERATED FROM PYTHON SOURCE LINES 108-114

.. code-block:: Python

    import skore

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X, y, test_size=0.2, random_state=0
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 115-120

.. note::
    The outputs of :func:`skore.train_test_split` are intentionally exactly the
    same as those of :func:`sklearn.model_selection.train_test_split`, so the
    user can use the skore version as a drop-in replacement for scikit-learn.

.. GENERATED FROM PYTHON SOURCE LINES 122-124

Contrary to scikit-learn, skore allows users to pass ``X`` and ``y``
explicitly, making the detection of potential issues easier:

.. GENERATED FROM PYTHON SOURCE LINES 126-131

.. code-block:: Python

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X=X, y=y, test_size=0.2, random_state=0
    )
    X_train_explicit = X_train.copy()

.. rst-class:: sphx-glr-script-out
.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 132-134

Moreover, when passing ``X`` and ``y`` explicitly, the ``X``'s are always
returned before the ``y``'s, even when they are inverted:

.. GENERATED FROM PYTHON SOURCE LINES 136-146

.. code-block:: Python

    arr = X.copy()
    arr_train, arr_test, X_train, X_test, y_train, y_test = skore.train_test_split(
        arr, y=y, X=X, test_size=0.2, random_state=0
    )
    X_train_explicit_inverted = X_train.copy()

    print("When expliciting, with the small typo, are the `X_train`'s still the same?")
    print(np.allclose(X_train_explicit, X_train_explicit_inverted))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
    When expliciting, with the small typo, are the `X_train`'s still the same?
    True

.. GENERATED FROM PYTHON SOURCE LINES 147-163

Automatic diagnostics: raising methodological warnings
------------------------------------------------------

In this section, we show how skore can provide methodological checks.

Class imbalance
^^^^^^^^^^^^^^^

In machine learning, class imbalance (when the classes in a dataset are not
equally represented) calls for specific modelling choices. For example, in a
dataset where 95% of the samples belong to the majority class (class ``1``)
and 5% to the minority class (class ``0``), a dummy model that always predicts
class ``1`` will reach 95% accuracy, while being useless for identifying
examples of class ``0``. Hence, it is important to detect class imbalance.

Suppose that we have imbalanced data:

.. GENERATED FROM PYTHON SOURCE LINES 165-168

.. code-block:: Python

    X = np.arange(10_000).reshape((5_000, 2))
    y = [0] * 4_000 + [1] * 1_000

.. GENERATED FROM PYTHON SOURCE LINES 169-170

In that case, :func:`skore.train_test_split` raises a ``HighClassImbalanceWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 172-176

.. code-block:: Python

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X=X, y=y, test_size=0.2, random_state=0
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
    │ It seems that you have a classification problem with a high class imbalance. In this │
    │ case, using train_test_split may not be a good idea because of high variability in   │
    │ the scores obtained on the test set. To tackle this challenge we suggest to use      │
    │ skore's cross_validate function.                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
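The accuracy trap that this warning guards against is easy to reproduce with a
few lines of plain Python, independently of skore: a dummy model that always
predicts the majority class scores high accuracy while never finding the
minority class (here mirroring the 4 000/1 000 labels above):

```python
# Imbalanced labels: 80% class 0, 20% class 1
y_true = [0] * 4_000 + [1] * 1_000
y_pred = [0] * 5_000  # dummy model: always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the minority class: fraction of true 1's predicted as 1
recall_minority = sum(
    p == 1 for t, p in zip(y_true, y_pred) if t == 1
) / y_true.count(1)

print(accuracy)         # 0.8
print(recall_minority)  # 0.0
```

High accuracy alone is therefore misleading on imbalanced data, which is why
skore surfaces the imbalance at split time.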
.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 177-179

Hence, skore warns users about the class imbalance, which they might have
missed, so that they can take it into account in their modelling strategy.

.. GENERATED FROM PYTHON SOURCE LINES 181-183

Moreover, skore also detects class imbalance when a class has too few samples,
with a ``HighClassImbalanceTooFewExamplesWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 183-191

.. code-block:: Python

    X = np.arange(400).reshape((200, 2))
    y = [0] * 150 + [1] * 50

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X=X, y=y, test_size=0.2, random_state=0
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭────────────────────── HighClassImbalanceTooFewExamplesWarning ───────────────────────╮
    │ It seems that you have a classification problem with at least one class with fewer   │
    │ than 100 examples in the test set. In this case, using train_test_split may not be a │
    │ good idea because of high variability in the scores obtained on the test set. We     │
    │ suggest three options to tackle this challenge: you can increase test_size, collect  │
    │ more data, or use skore's cross_validate function.                                   │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
.. code-block:: none

    ╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
    │ It seems that you have a classification problem with a high class imbalance. In this │
    │ case, using train_test_split may not be a good idea because of high variability in   │
    │ the scores obtained on the test set. To tackle this challenge we suggest to use      │
    │ skore's cross_validate function.                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 192-199

Shuffling without a random state
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For `reproducible results across executions `_, skore recommends setting the
``random_state`` parameter when shuffling (remember that ``shuffle=True`` by
default); otherwise, it raises a ``RandomStateUnsetWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 199-205

.. code-block:: Python

    X = np.arange(10_000).reshape((5_000, 2))
    y = [0] * 2_500 + [1] * 2_500

    X_train, X_test, y_train, y_test = skore.train_test_split(X=X, y=y, test_size=0.2)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭────────────────────────────── RandomStateUnsetWarning ───────────────────────────────╮
    │ We recommend setting the parameter `random_state`. This will ensure the              │
    │ reproducibility of your work.                                                        │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 206-208

Time series data
^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 210-213

Now, let us assume that we have `time series data `_: the data is somewhat
time-ordered:

.. GENERATED FROM PYTHON SOURCE LINES 215-223

.. code-block:: Python

    import pandas as pd
    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()
    X, y = dataset.X, dataset.y
    X["date_first_hired"] = pd.to_datetime(X["date_first_hired"])
    X.head(2)
.. code-block:: none

      gender department       department_name                                           division assignment_category      employee_position_title date_first_hired  year_first_hired
    0      F        POL  Department of Police  MSB Information Mgmt and Tech Division Records...    Fulltime-Regular  Office Services Coordinator       1986-09-22              1986
    1      M        POL  Department of Police        ISB Major Crimes Division Fugitive Section     Fulltime-Regular        Master Police Officer       1988-09-12              1988


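The employee-salaries table above contains a hire-date column, so its rows are
time-ordered. A split that respects time trains on the past and tests on the
most recent samples. Below is a minimal sketch of that idea with hypothetical
hire dates, in plain Python (not scikit-learn's ``TimeSeriesSplit`` API itself):

```python
from datetime import date

# Hypothetical (date_first_hired, salary) samples, in arbitrary order
samples = [
    (date(1988, 9, 12), 83_000),
    (date(1986, 9, 22), 69_000),
    (date(1995, 3, 1), 75_000),
    (date(1991, 6, 30), 71_000),
    (date(1999, 11, 5), 90_000),
]

# Sort chronologically, then hold out the most recent 20% as the test set
samples.sort(key=lambda s: s[0])
n_test = max(1, int(len(samples) * 0.2))
train, test = samples[:-n_test], samples[-n_test:]

print([d.year for d, _ in train])  # [1986, 1988, 1991, 1995]
print([d.year for d, _ in test])   # [1999]
```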
.. GENERATED FROM PYTHON SOURCE LINES 224-230

We can observe that there is a ``date_first_hired`` column which is time-based.
As one cannot shuffle time (time only moves in one direction: forward), we
recommend using :class:`sklearn.model_selection.TimeSeriesSplit` instead of
:func:`sklearn.model_selection.train_test_split` (or
:func:`skore.train_test_split`); skore flags this with a
``TimeBasedColumnWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 232-235

.. code-block:: Python

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X, y, random_state=0, shuffle=False
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭─────────────────────────────── TimeBasedColumnWarning ───────────────────────────────╮
    │ We detected some time-based columns (column "date_first_hired") in your data. We     │
    │ recommend using scikit-learn's TimeSeriesSplit instead of train_test_split.          │
    │ Otherwise you might train on future data to predict the past, or get inflated model  │
    │ performance evaluation because natural drift will not be taken into account.         │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.058 seconds)

.. _sphx_glr_download_auto_examples_model_evaluation_plot_train_test_split.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_train_test_split.ipynb `

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_train_test_split.py `

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: plot_train_test_split.zip `

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_