.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/model_evaluation/plot_train_test_split.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_model_evaluation_plot_train_test_split.py:

.. _example_train_test_split:

============================================================
`train_test_split`: get diagnostics when splitting your data
============================================================

This example illustrates the motivation and the use of skore's
:func:`skore.train_test_split` to get assistance when developing ML/DS projects.

.. GENERATED FROM PYTHON SOURCE LINES 13-34

Train-test split in scikit-learn
================================

Scikit-learn has a function for splitting the data into train and test sets:
:func:`sklearn.model_selection.train_test_split`.
Its signature is the following:

.. code-block:: python

    sklearn.model_selection.train_test_split(
        *arrays,
        test_size=None,
        train_size=None,
        random_state=None,
        shuffle=True,
        stratify=None
    )

where ``*arrays`` is a Python ``*args`` (it allows you to pass a varying number
of positional arguments) and the scikit-learn doc indicates that it is
``a sequence of indexables with same length / shape[0]``.

.. GENERATED FROM PYTHON SOURCE LINES 36-37

Let us construct a design matrix ``X`` and target ``y`` to illustrate our point:

.. GENERATED FROM PYTHON SOURCE LINES 39-45

.. code-block:: Python

    import numpy as np

    X = np.arange(10).reshape((5, 2))
    y = np.arange(5)
    print(f"{X = }\n{y = }")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    X = array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    y = array([0, 1, 2, 3, 4])

.. GENERATED FROM PYTHON SOURCE LINES 46-47

In scikit-learn, the most common usage is the following:
.. GENERATED FROM PYTHON SOURCE LINES 49-56

.. code-block:: Python

    from sklearn.model_selection import train_test_split as sklearn_train_test_split

    X_train, X_test, y_train, y_test = sklearn_train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    print(f"{X_train = }\n{y_train = }\n{X_test = }\n{y_test = }")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    X_train = array([[0, 1],
           [2, 3],
           [6, 7],
           [8, 9]])
    y_train = array([0, 1, 3, 4])
    X_test = array([[4, 5]])
    y_test = array([2])

.. GENERATED FROM PYTHON SOURCE LINES 57-58

Notice the shuffling that is done by default.

.. GENERATED FROM PYTHON SOURCE LINES 60-76

In scikit-learn, the user cannot pass the design matrix ``X`` and the target
``y`` as keyword arguments. The following:

.. code-block:: python

    X_train, X_test, y_train, y_test = sklearn_train_test_split(
        X=X, y=y, test_size=0.2, random_state=0)

would return:

.. code-block:: python

    TypeError: got an unexpected keyword argument 'X'

In general, in Python, keyword arguments are useful to prevent typos. For
example, in the following, ``X`` and ``y`` are reversed:

.. GENERATED FROM PYTHON SOURCE LINES 78-83

.. code-block:: Python

    X_train, X_test, y_train, y_test = sklearn_train_test_split(
        y, X, test_size=0.2, random_state=0
    )
    print(f"{X_train = }\n{y_train = }\n{X_test = }\n{y_test = }")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    X_train = array([0, 1, 3, 4])
    y_train = array([[0, 1],
           [2, 3],
           [6, 7],
           [8, 9]])
    X_test = array([2])
    y_test = array([[4, 5]])

.. GENERATED FROM PYTHON SOURCE LINES 84-86

but Python will not catch this mistake for us.
This is where skore comes in handy.

.. GENERATED FROM PYTHON SOURCE LINES 88-90

Train-test split in skore
=========================

.. GENERATED FROM PYTHON SOURCE LINES 92-94

Skore has its own :func:`skore.train_test_split` that wraps scikit-learn's
:func:`sklearn.model_selection.train_test_split`.

.. GENERATED FROM PYTHON SOURCE LINES 96-99
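The typo-catching value of keyword arguments, which motivates skore's design,
can be seen with plain Python keyword-only parameters. The following is a toy
sketch (unrelated to either library's actual implementation): with ``X`` and
``y`` as keyword-only parameters, swapping them positionally is rejected
outright.

```python
def split(*, X, y, test_size=0.2):
    """Toy splitter with keyword-only X and y: a positional call
    (and hence a silent X/y swap) is impossible."""
    n_test = max(1, int(len(X) * test_size))
    return X[:-n_test], X[-n_test:], y[:-n_test], y[-n_test:]

X = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
y = [0, 1, 2, 3, 4]

X_train, X_test, y_train, y_test = split(X=X, y=y)
print(X_test, y_test)  # [[8, 9]] [4]

# Positional calls fail loudly instead of silently swapping X and y:
try:
    split(y, X)
except TypeError as e:
    print("caught:", type(e).__name__)  # caught: TypeError
```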
.. code-block:: Python

    X = np.arange(10_000).reshape((5_000, 2))
    y = [0] * 2_500 + [1] * 2_500

.. GENERATED FROM PYTHON SOURCE LINES 100-102

Passing ``X`` and ``y`` explicitly as keyword arguments
--------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 104-106

First of all, naturally, it can be used as a simple drop-in replacement for
scikit-learn:

.. GENERATED FROM PYTHON SOURCE LINES 108-114

.. code-block:: Python

    import skore

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X, y, test_size=0.2, random_state=0
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 115-120

.. note::
    The outputs of :func:`skore.train_test_split` are intentionally exactly the
    same as those of :func:`sklearn.model_selection.train_test_split`, so the
    user can use the skore version as a drop-in replacement for scikit-learn.

.. GENERATED FROM PYTHON SOURCE LINES 122-124

Contrary to scikit-learn, skore allows users to pass ``X`` and ``y``
explicitly, making the detection of potential issues easier:

.. GENERATED FROM PYTHON SOURCE LINES 126-131

.. code-block:: Python

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X=X, y=y, test_size=0.2, random_state=0
    )
    X_train_explicit = X_train.copy()

.. rst-class:: sphx-glr-script-out
.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 132-134

Moreover, when passing ``X`` and ``y`` explicitly, the ``X``'s are always
returned before the ``y``'s, even when they are inverted:

.. GENERATED FROM PYTHON SOURCE LINES 136-146

.. code-block:: Python

    arr = X.copy()
    arr_train, arr_test, X_train, X_test, y_train, y_test = skore.train_test_split(
        arr, y=y, X=X, test_size=0.2, random_state=0
    )
    X_train_explicit_inverted = X_train.copy()

    print("When expliciting, with the small typo, are the `X_train`'s still the same?")
    print(np.allclose(X_train_explicit, X_train_explicit_inverted))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
    When expliciting, with the small typo, are the `X_train`'s still the same?
    True

.. GENERATED FROM PYTHON SOURCE LINES 147-163

Automatic diagnostics: raising methodological warnings
------------------------------------------------------

In this section, we show how skore can provide methodological checks.

Class imbalance
^^^^^^^^^^^^^^^

In machine learning, class imbalance (when the classes in a dataset are not
equally represented) calls for specific modelling choices. For example, in a
dataset where 95% of the samples belong to the majority class (class ``1``)
and 5% to the minority class (class ``0``), a dummy model that always predicts
class ``1`` will reach 95% accuracy, while being useless for identifying
examples of class ``0``. Hence, it is important to detect class imbalance.

Suppose that we have imbalanced data:

.. GENERATED FROM PYTHON SOURCE LINES 165-168

.. code-block:: Python

    X = np.arange(10_000).reshape((5_000, 2))
    y = [0] * 4_000 + [1] * 1_000

.. GENERATED FROM PYTHON SOURCE LINES 169-170

In that case, :func:`skore.train_test_split` raises a ``HighClassImbalanceWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 172-176

.. code-block:: Python

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X=X, y=y, test_size=0.2, random_state=0
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
    │ It seems that you have a classification problem with a high class imbalance. In this │
    │ case, using train_test_split may not be a good idea because of high variability in   │
    │ the scores obtained on the test set. To tackle this challenge we suggest to use      │
    │ skore's cross_validate function.                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
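The accuracy trap that this warning guards against is easy to reproduce with a
few lines of plain Python, independently of skore: a dummy model that always
predicts the majority class scores high accuracy while never finding the
minority class (here mirroring the 4 000/1 000 labels above):

```python
# Imbalanced labels: 80% class 0, 20% class 1
y_true = [0] * 4_000 + [1] * 1_000
y_pred = [0] * 5_000  # dummy model: always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the minority class: fraction of true 1's predicted as 1
recall_minority = sum(
    p == 1 for t, p in zip(y_true, y_pred) if t == 1
) / y_true.count(1)

print(accuracy)         # 0.8
print(recall_minority)  # 0.0
```

High accuracy alone is therefore misleading on imbalanced data, which is why
skore surfaces the imbalance at split time.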
.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 177-179

Hence, skore warns users about the class imbalance, which they might have
missed, so that they can take it into account in their modelling strategy.

.. GENERATED FROM PYTHON SOURCE LINES 181-183

Moreover, skore also detects class imbalance when a class has too few samples,
with a ``HighClassImbalanceTooFewExamplesWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 183-191

.. code-block:: Python

    X = np.arange(400).reshape((200, 2))
    y = [0] * 150 + [1] * 50

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X=X, y=y, test_size=0.2, random_state=0
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭────────────────────── HighClassImbalanceTooFewExamplesWarning ───────────────────────╮
    │ It seems that you have a classification problem with at least one class with fewer   │
    │ than 100 examples in the test set. In this case, using train_test_split may not be a │
    │ good idea because of high variability in the scores obtained on the test set. We     │
    │ suggest three options to tackle this challenge: you can increase test_size, collect  │
    │ more data, or use skore's cross_validate function.                                   │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
.. code-block:: none

    ╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
    │ It seems that you have a classification problem with a high class imbalance. In this │
    │ case, using train_test_split may not be a good idea because of high variability in   │
    │ the scores obtained on the test set. To tackle this challenge we suggest to use      │
    │ skore's cross_validate function.                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 192-199

Shuffling without a random state
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For `reproducible results across executions `_, skore recommends setting the
``random_state`` parameter when shuffling (remember that ``shuffle=True`` by
default); otherwise, it raises a ``RandomStateUnsetWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 199-205

.. code-block:: Python

    X = np.arange(10_000).reshape((5_000, 2))
    y = [0] * 2_500 + [1] * 2_500

    X_train, X_test, y_train, y_test = skore.train_test_split(X=X, y=y, test_size=0.2)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭────────────────────────────── RandomStateUnsetWarning ───────────────────────────────╮
    │ We recommend setting the parameter `random_state`. This will ensure the              │
    │ reproducibility of your work.                                                        │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯
.. code-block:: none

    ╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
    │ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
    │ its default value. In case of time-ordered events (even if they are independent),    │
    │ this will result in inflated model performance evaluation because natural drift will │
    │ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
    │ order to ensure the evaluation process is really representative of your production   │
    │ release process.                                                                     │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. GENERATED FROM PYTHON SOURCE LINES 206-208

Time series data
^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 210-213

Now, let us assume that we have `time series data `_: the data is somewhat
time-ordered:

.. GENERATED FROM PYTHON SOURCE LINES 215-223

.. code-block:: Python

    import pandas as pd
    from skrub.datasets import fetch_employee_salaries

    dataset = fetch_employee_salaries()
    X, y = dataset.X, dataset.y
    X["date_first_hired"] = pd.to_datetime(X["date_first_hired"])
    X.head(2)
.. code-block:: none

      gender department       department_name                                           division assignment_category      employee_position_title date_first_hired  year_first_hired
    0      F        POL  Department of Police  MSB Information Mgmt and Tech Division Records...    Fulltime-Regular  Office Services Coordinator       1986-09-22              1986
    1      M        POL  Department of Police        ISB Major Crimes Division Fugitive Section     Fulltime-Regular        Master Police Officer       1988-09-12              1988


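The employee-salaries table above contains a hire-date column, so its rows are
time-ordered. A split that respects time trains on the past and tests on the
most recent samples. Below is a minimal sketch of that idea with hypothetical
hire dates, in plain Python (not scikit-learn's ``TimeSeriesSplit`` API itself):

```python
from datetime import date

# Hypothetical (date_first_hired, salary) samples, in arbitrary order
samples = [
    (date(1988, 9, 12), 83_000),
    (date(1986, 9, 22), 69_000),
    (date(1995, 3, 1), 75_000),
    (date(1991, 6, 30), 71_000),
    (date(1999, 11, 5), 90_000),
]

# Sort chronologically, then hold out the most recent 20% as the test set
samples.sort(key=lambda s: s[0])
n_test = max(1, int(len(samples) * 0.2))
train, test = samples[:-n_test], samples[-n_test:]

print([d.year for d, _ in train])  # [1986, 1988, 1991, 1995]
print([d.year for d, _ in test])   # [1999]
```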
.. GENERATED FROM PYTHON SOURCE LINES 224-230

We can observe that there is a ``date_first_hired`` column which is time-based.
As one cannot shuffle time (time only moves in one direction: forward), we
recommend using :class:`sklearn.model_selection.TimeSeriesSplit` instead of
:func:`sklearn.model_selection.train_test_split` (or
:func:`skore.train_test_split`); skore flags this with a
``TimeBasedColumnWarning``:

.. GENERATED FROM PYTHON SOURCE LINES 232-235

.. code-block:: Python

    X_train, X_test, y_train, y_test = skore.train_test_split(
        X, y, random_state=0, shuffle=False
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    ╭─────────────────────────────── TimeBasedColumnWarning ───────────────────────────────╮
    │ We detected some time-based columns (column "date_first_hired") in your data. We     │
    │ recommend using scikit-learn's TimeSeriesSplit instead of train_test_split.          │
    │ Otherwise you might train on future data to predict the past, or get inflated model  │
    │ performance evaluation because natural drift will not be taken into account.         │
    ╰──────────────────────────────────────────────────────────────────────────────────────╯

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.058 seconds)

.. _sphx_glr_download_auto_examples_model_evaluation_plot_train_test_split.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_train_test_split.ipynb `

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_train_test_split.py `

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: plot_train_test_split.zip `

.. only:: html

    .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_