train_test_split

skore.train_test_split(*arrays, X=None, y=None, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None, project=None)

Split data into train and test sets.

This is a wrapper over scikit-learn’s sklearn.model_selection.train_test_split() helper function, enriching it with various warnings that can be saved in a skore Project.

The signature is fully compatible with sklearn’s train_test_split, and some keyword arguments are added to make the detection of issues more accurate. For instance, argument y has been added to pass the target explicitly, which makes it easier to detect issues with the target.

See the example "train_test_split: get diagnostics when splitting your data".

Parameters:
*arrays : sequence of indexables with same length / shape[0]

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

X : array-like, optional

If not None, will be appended to the list of arrays passed positionally.

y : array-like, optional

If not None, will be appended to the list of arrays passed positionally, after X. If None, it is assumed that the last array in arrays is y.

test_size : float or int, optional

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

train_size : float or int, optional

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int or numpy RandomState instance, optional

Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.

shuffle : bool, default=True

Whether to shuffle the data before splitting. If shuffle=False, then stratify must be None.

stratify : array-like, optional

If not None, data is split in a stratified fashion, using this as the class labels.

project : Project, optional

The project to save information into. If None, no information will be saved.
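The test_size and train_size rules above can be illustrated with a short sketch. It calls scikit-learn's sklearn.model_selection.train_test_split directly, on the assumption (stated in the description above) that the skore wrapper accepts the same arguments and splits the same way; the 20-sample arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 2 features.
X = np.arange(40).reshape((20, 2))
y = np.arange(20)

# Neither size given: test_size falls back to 0.25, i.e. 5 of 20 samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
default_sizes = (len(X_train), len(X_test))

# Integer test_size: an absolute number of test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=4, random_state=0)
int_sizes = (len(X_train), len(X_test))

# Only train_size given: the test set is the complement of the train set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0)
complement_sizes = (len(X_train), len(X_test))

print(default_sizes, int_sizes, complement_sizes)
```

The sample count is chosen so that every proportion maps to a whole number of samples; with other sizes, the rounding of fractional counts follows scikit-learn's rules.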

Returns:
splitting : list

List containing train-test split of inputs. The length of the list is twice the number of arrays passed, including the X and y keyword arguments. If arrays are passed positionally as well as through X and y, the output arrays are ordered as follows: first the arrays passed positionally, in the order they were passed, then X if it was passed, then y if it was passed.
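As a rough illustration of the "twice the number of arrays" rule, the sketch below splits three positional arrays and unpacks six outputs. It uses scikit-learn's function, assuming the skore wrapper behaves identically for positional arrays; the X= and y= keywords are specific to skore and are not exercised here. The array names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Three arrays with the same number of rows (illustrative data).
features = np.arange(12).reshape((6, 2))
sample_weights = np.ones(6)
target = np.array([0, 1, 0, 1, 0, 1])

splits = train_test_split(
    features, sample_weights, target, test_size=2, random_state=0)

# Each input contributes a (train, test) pair, in the order it was passed.
(features_train, features_test,
 weights_train, weights_test,
 target_train, target_test) = splits

print(len(splits))
```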

Examples

>>> import numpy as np
>>> from skore import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> # Drop-in replacement for sklearn train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,  
...     test_size=0.33, random_state=42)
>>> X_train  
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> # Explicit X and y, makes detection of problems easier
>>> X_train, X_test, y_train, y_test = train_test_split(X=X, y=y,  
...     test_size=0.33, random_state=42)
>>> X_train  
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> # Positional arrays are returned first, then X, then y,
>>> # whatever the order of the X and y keyword arguments
>>> arr = np.arange(10).reshape((5, 2))
>>> splits = train_test_split(  
...     arr, y=y, X=X, test_size=0.33, random_state=42)
>>> arr_train, arr_test, X_train, X_test, y_train, y_test = splits  
>>> X_train  
array([[4, 5],
       [0, 1],
       [6, 7]])
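
The stratify and shuffle parameters are not covered by the examples above. The following is a minimal sketch of both, again written against scikit-learn's train_test_split on the assumption that the skore wrapper forwards these arguments unchanged; the imbalanced target is invented for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced target: 15 samples of class 0, 5 of class 1.
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)

# Stratified split: both subsets keep the 75% / 25% class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
train_positive_rate = y_train.mean()
test_positive_rate = y_test.mean()

# shuffle=False: rows keep their order, so the test set is the tail
# of the data. stratify must be left as None in this mode.
X_head, X_tail = train_test_split(X, test_size=0.2, shuffle=False)

print(train_positive_rate, test_positive_rate, X_tail[0])
```

An unshuffled split is typically useful when row order carries meaning, for example with time-ordered data.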