Skip to content

empml.cv

Object Description
KFold Standard K-Fold cross-validation with random shuffling.
StratifiedKFold Stratified K-Fold that preserves class distribution across folds.
GroupKFold Group K-Fold that prevents data leakage between groups.
LeaveOneGroupOut Leave-One-Group-Out cross-validation.
TimeSeriesSplit Time series cross-validation generator that splits data based on date ranges.
TrainTestSplit Single train-test split with random shuffling.

KFold

Standard K-Fold cross-validation with random shuffling.

Methods

def __init__(self, n_splits: int = 5, random_state: int = None):
    pass

def split(self, lf: pl.LazyFrame, row_id: str) -> List[Tuple[np.array]]:
    """
    Generate k-fold train/validation splits.

    Parameters
    ----------
    lf : pl.LazyFrame
        Input data to split.
    row_id : str
        Column name containing unique row identifiers.

    Returns
    -------
    List[Tuple[np.ndarray, np.ndarray]]
        List of (train_indices, validation_indices) tuples for each fold.
    """
    pass

StratifiedKFold

Stratified K-Fold that preserves class distribution across folds.

Methods

def __init__(self, target_col: str, n_splits: int = 5, random_state: int = None):
    pass

def split(self, lf: pl.LazyFrame, row_id: str) -> List[Tuple[np.array]]:
    """
    Generate stratified k-fold train/validation splits.

    Parameters
    ----------
    lf : pl.LazyFrame
        Input data to split.
    row_id : str
        Column name containing unique row identifiers.

    Returns
    -------
    List[Tuple[np.ndarray, np.ndarray]]
        List of (train_indices, validation_indices) tuples for each fold.
    """
    pass

GroupKFold

Group K-Fold that prevents data leakage between groups.

Methods

def __init__(self, group_col: str, n_splits: int = 5, random_state: int = None):
    pass

def split(self, lf: pl.LazyFrame, row_id: str) -> List[Tuple[np.array]]:
    """
    Generate group-aware k-fold train/validation splits.

    Parameters
    ----------
    lf : pl.LazyFrame
        Input data to split.
    row_id : str
        Column name containing unique row identifiers.

    Returns
    -------
    List[Tuple[np.ndarray, np.ndarray]]
        List of (train_indices, validation_indices) tuples for each fold.
    """
    pass

LeaveOneGroupOut

Leave-One-Group-Out cross-validation.

Methods

def __init__(self, group_col: str):
    pass

def split(self, lf: pl.LazyFrame, row_id: str) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Generate leave-one-group-out train/validation splits.

    Parameters
    ----------
    lf : pl.LazyFrame
        Input data to split.
    row_id : str
        Column name containing unique row identifiers.

    Returns
    -------
    List[Tuple[np.ndarray, np.ndarray]]
        List of (train_indices, validation_indices) tuples, one per group.
    """
    pass

TimeSeriesSplit

Time series cross-validation generator that splits data based on date ranges.

Methods

def __init__(self, windows: List[Tuple[str, str, str, str]], date_col: str):
    """
    Initialize the TimeSeriesSplit cross-validator.

    Parameters
    ----------
    windows : List[Tuple[str, str, str, str]]
        List of date range tuples defining train/validation splits for each fold.
        Each tuple contains (train_start, train_end, val_start, val_end).
    date_col : str
        Name of the date/timestamp column in the dataset.
    """
    pass

def split(self, lf: pl.LazyFrame, row_id: str) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Generate train/validation row indices for each fold based on date windows.

    Parameters
    ----------
    lf : pl.LazyFrame
        Input data containing the date column and row identifier.
    row_id : str
        Name of the column containing unique row identifiers.

    Returns
    -------
    List[Tuple[np.ndarray, np.ndarray]]
        List of tuples, one per fold, where each tuple contains:
        - train_indices: numpy array of row IDs for training
        - val_indices: numpy array of row IDs for validation

    Notes
    -----
    - If date_col is not already datetime type, it will be automatically converted
      from string format using polars' str.to_datetime() method.
    - Date filtering uses inclusive start (>=) and exclusive end (<) boundaries.
    """
    pass

TrainTestSplit

Single train-test split with random shuffling.

Methods

def __init__(self, test_size: float = 0.2, random_state: int = None):
    pass

def split(self, lf: pl.LazyFrame, row_id: str) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Generate single train/test split.

    Parameters
    ----------
    lf : pl.LazyFrame
        Input data to split.
    row_id : str
        Column name containing unique row identifiers.

    Returns
    -------
    List[Tuple[np.ndarray, np.ndarray]]
        Single-element list containing (train_indices, test_indices) tuple.
    """
    pass