train_test_split
- skore.train_test_split(*arrays, X=None, y=None, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None, project=None)
Perform train-test-split of data.
This is a wrapper over scikit-learn’s sklearn.model_selection.train_test_split() helper function, enriching it with various warnings that can be saved in a skore Project.
The signature is fully compatible with sklearn’s train_test_split, and some keyword arguments are added to make the detection of issues more accurate. For instance, argument y has been added to pass the target explicitly, which makes it easier to detect issues with the target.
See the train_test_split: get diagnostics when splitting your data example.
- Parameters:
- *arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- X : array-like, optional
If not None, will be appended to the list of arrays passed positionally.
- y : array-like, optional
If not None, will be appended to the list of arrays passed positionally, after X. If None, it is assumed that the last array in arrays is y.
- test_size : float or int, optional
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
- train_size : float or int, optional
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
- random_state : int or numpy RandomState instance, optional
Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
- shuffle : bool, default is True
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
- stratify : array-like, optional
If not None, data is split in a stratified fashion, using this as the class labels.
- project : Project, optional
The project to save information into. If None, no information will be saved.
- Returns:
- splitting : list
List containing train-test split of inputs. The length of the list is twice the number of arrays passed, including the X and y keyword arguments. If arrays are passed positionally as well as through X and y, the output arrays are ordered as follows: first the arrays passed positionally, in the order they were passed, then X if it was passed, then y if it was passed.
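The ordering rule described under Returns can be sketched in plain Python. The helper below, output_order, is hypothetical and not part of skore; it only builds labels for the returned list, assuming each input contributes its train part followed by its test part, as in the examples on this page.

```python
def output_order(n_positional, has_X=False, has_y=False):
    """Label the returned splits in the documented order.

    Hypothetical helper for illustration only; it is not part of skore.
    """
    # Positional arrays come first, in the order they were passed.
    names = [f"arrays[{i}]" for i in range(n_positional)]
    # Then X, then y, when they are given as keyword arguments.
    if has_X:
        names.append("X")
    if has_y:
        names.append("y")
    # Each input contributes a train part followed by a test part.
    return [f"{name}_{part}" for name in names for part in ("train", "test")]


# One positional array plus X and y as keywords gives six outputs.
print(output_order(1, has_X=True, has_y=True))
# ['arrays[0]_train', 'arrays[0]_test', 'X_train', 'X_test', 'y_train', 'y_test']
```

The length of the result is always twice the number of inputs, which matches the description of the splitting list.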
Examples
>>> import numpy as np
>>> from skore import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)

>>> # Drop-in replacement for sklearn train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...     test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])

>>> # Explicit X and y, makes detection of problems easier
>>> X_train, X_test, y_train, y_test = train_test_split(X=X, y=y,
...     test_size=0.33, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])

>>> # When passing X and y explicitly, X is returned before y
>>> arr = np.arange(10).reshape((5, 2))
>>> splits = train_test_split(
...     arr, y=y, X=X, test_size=0.33, random_state=42)
>>> arr_train, arr_test, X_train, X_test, y_train, y_test = splits
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
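The interaction between test_size and train_size described under Parameters can be sketched as follows. The function resolve_split_sizes is a hypothetical helper, not part of skore or scikit-learn. The rounding of float sizes (up for the test set, down for the train set) is an assumption modelled on scikit-learn's behaviour and is not stated on this page.

```python
import math


def resolve_split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) following the documented defaults.

    Hypothetical helper for illustration only.
    """
    # If both sizes are None, the test size defaults to 0.25.
    if test_size is None and train_size is None:
        test_size = 0.25

    def to_count(size, rounding):
        # Floats are proportions of the dataset; ints are absolute counts.
        if isinstance(size, float):
            return rounding(size * n_samples)
        return size  # an int, or None when not provided

    # Assumption: round the test count up and the train count down.
    n_test = to_count(test_size, math.ceil)
    n_train = to_count(train_size, math.floor)

    # A missing size is the complement of the one that was given.
    if n_train is None:
        n_train = n_samples - n_test
    elif n_test is None:
        n_test = n_samples - n_train
    return n_train, n_test


print(resolve_split_sizes(5, test_size=0.33))  # (3, 2)
print(resolve_split_sizes(100))                # (75, 25)
print(resolve_split_sizes(10, test_size=3))    # (7, 3)
```

With 5 samples and test_size=0.33, the sketch yields 3 training rows, which agrees with the three rows of X_train shown in the examples above.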
Gallery examples

EstimatorReport: Get insights from any scikit-learn estimator

train_test_split: get diagnostics when splitting your data