empml.transformers¶

Object	Description
`Identity`	Pass-through transformer.
`AvgFeatures`	Compute mean across multiple features row-wise.
`MaxFeatures`	Compute max across multiple features row-wise.
`MinFeatures`	Compute min across multiple features row-wise.
`StdFeatures`	Compute standard deviation across multiple features row-wise.
`MedianFeatures`	Compute median across multiple features row-wise.
`ModuleFeatures`	Compute Euclidean norm (module) of two features.
`InteractionFeatures`	Create pairwise multiplication features from feature pairs.
`MeanTargetEncoder`	Target encoding using mean of target variable.
`StdTargetEncoder`	Target encoding using standard deviation of target variable.
`MaxTargetEncoder`	Target encoding using max of target variable.
`MinTargetEncoder`	Target encoding using min of target variable.
`MedianTargetEncoder`	Target encoding using median of target variable.
`KurtTargetEncoder`	Target encoding using kurtosis of target variable.
`SkewTargetEncoder`	Target encoding using skewness of target variable.
`OrdinalEncoder`	Encode categorical features as ordinal integers.
`DummyEncoder`	One-hot encode categorical features.
`FrequencyEncoder`	Encode categorical features by their frequency or proportion.
`StandardScaler`	Standardize features by removing the mean and scaling to unit variance.
`MinMaxScaler`	Scale features to [0, 1] range using min-max normalization.
`RobustScaler`	Scale features using median and interquartile range (IQR).
`Log1pFeatures`	Apply log(1+x) transformation.
`Expm1Features`	Apply exp(x) - 1 transformation.
`PowerFeatures`	Apply power transformation.
`InverseFeatures`	Apply inverse transformation (1/x).
`QuantileBinning`	Discretize continuous features into quantile-based bins.
`RankFeatures`	Convert features to percentile rank based on training distribution.
`SimpleImputer`	Impute missing values using mean or median.
`FillNulls`	Fill null and NaN values with a constant.
`GenerateLags`	Generate lagged features for time series data.
`KMeansCluster`	Assign cluster labels using KMeans on selected features.
`PCATransformer`	Reduce dimensionality using Principal Component Analysis.

Identity¶

Pass-through transformer that returns data unchanged. No parameters required.

AvgFeatures¶

Compute mean across multiple features row-wise.

Methods¶

def __init__(self, features: list[str], new_feature: str):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to average.
    new_feature : str
        Name of the output column.
    """
    pass

MaxFeatures¶

Compute max across multiple features row-wise.

Methods¶

def __init__(self, features: list[str], new_feature: str):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to compute max over.
    new_feature : str
        Name of the output column.
    """
    pass

MinFeatures¶

Compute min across multiple features row-wise.

Methods¶

def __init__(self, features: list[str], new_feature: str):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to compute min over.
    new_feature : str
        Name of the output column.
    """
    pass

StdFeatures¶

Compute standard deviation across multiple features row-wise.

Methods¶

def __init__(self, features: list[str], new_feature: str):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to compute std over.
    new_feature : str
        Name of the output column.
    """
    pass

MedianFeatures¶

Compute median across multiple features row-wise.

Methods¶

def __init__(self, features: list[str], new_feature: str):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to compute median over.
    new_feature : str
        Name of the output column.
    """
    pass
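
The row-wise aggregators above (`AvgFeatures`, `MaxFeatures`, `MinFeatures`, `StdFeatures`, `MedianFeatures`) all reduce several columns to one value per row. The following is a minimal standard-library sketch of that idea, not the library's implementation. The helper `row_aggregates` and the output keys are hypothetical, and the use of the sample standard deviation is an assumption, since the documentation does not say which variant is used.

```python
import statistics

def row_aggregates(row: dict, features: list[str]) -> dict:
    """Reduce the selected columns of a single row to summary statistics."""
    values = [row[f] for f in features]
    return {
        "avg": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
        # Assumption: sample standard deviation (divides by n - 1).
        "std": statistics.stdev(values),
        "median": statistics.median(values),
    }

rows = [
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": 4.0, "b": 4.0, "c": 10.0},
]
aggregated = [row_aggregates(r, ["a", "b", "c"]) for r in rows]
print(aggregated[0])
```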

ModuleFeatures¶

Compute Euclidean norm (module) of two features: sqrt(f1^2 + f2^2).

Methods¶

def __init__(self, features: tuple[str, str], new_feature: str):
    """
    Parameters:
    -----------
    features : tuple[str, str]
        Tuple of two column names.
    new_feature : str
        Name of the output column.
    """
    pass
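
As an illustration of the formula sqrt(f1^2 + f2^2), here is a small standard-library sketch. The helper name `module_feature` is hypothetical and is not part of the library.

```python
import math

def module_feature(f1: float, f2: float) -> float:
    """Euclidean norm of a pair of values: sqrt(f1^2 + f2^2)."""
    return math.hypot(f1, f2)

pairs = [(3.0, 4.0), (5.0, 12.0), (0.0, 0.0)]
norms = [module_feature(a, b) for a, b in pairs]
print(norms)  # [5.0, 13.0, 0.0]
```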

InteractionFeatures¶

Create pairwise multiplication features from feature pairs. For each pair (f1, f2), creates a column f1 * f2.

Methods¶

def __init__(self, feature_pairs: list[tuple[str, str]], separator: str = '_x_'):
    """
    Parameters:
    -----------
    feature_pairs : list[tuple[str, str]]
        List of (col1, col2) tuples to multiply.
    separator : str
        String between feature names in the output column name (default: '_x_').
        Output column name: '{col1}{separator}{col2}'.
    """
    pass
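
The naming rule '{col1}{separator}{col2}' and the pairwise product can be sketched in plain Python as follows. The helper `interaction_columns` is hypothetical and operates on a list of dicts rather than a DataFrame, so it only illustrates the behavior described above.

```python
def interaction_columns(rows, feature_pairs, separator="_x_"):
    """Add one product column per (col1, col2) pair to every row."""
    out = []
    for row in rows:
        new_row = dict(row)  # keep the original columns
        for col1, col2 in feature_pairs:
            new_row[f"{col1}{separator}{col2}"] = row[col1] * row[col2]
        out.append(new_row)
    return out

rows = [{"a": 2, "b": 3, "c": 4}]
result = interaction_columns(rows, [("a", "b"), ("a", "c")])
print(result[0])  # {'a': 2, 'b': 3, 'c': 4, 'a_x_b': 6, 'a_x_c': 8}
```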

MeanTargetEncoder¶

Encode categorical features with the mean of a target variable. Unseen categories during transform are filled with the global mean.

Methods¶

def __init__(
    self,
    features: list[str],
    encoder_col: str,
    prefix: str = 'mean_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    """
    Parameters:
    -----------
    features : list[str]
        Categorical columns to encode.
    encoder_col : str
        Target column to aggregate.
    prefix : str
        Prefix for encoded column names (default: 'mean_').
    suffix : str
        Suffix for encoded column names (default: '_encoded').
    replace_original : bool
        If True, drop original columns and use their names
        for encoded columns, ignoring prefix/suffix (default: False).
    """
    pass
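
To make the fit/transform behavior concrete, here is a minimal sketch of mean target encoding with a global-mean fallback for unseen categories. The function names are hypothetical, and computing the global mean over all training rows (rather than over the per-category means) is an assumption.

```python
from collections import defaultdict

def fit_mean_encoding(categories, targets):
    """Learn the per-category target mean and the global fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    mapping = {cat: sums[cat] / counts[cat] for cat in sums}
    # Assumption: the fallback is the mean of the target over all training rows.
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean

def transform_mean_encoding(categories, mapping, global_mean):
    """Replace each category with its learned mean; unseen ones get the fallback."""
    return [mapping.get(cat, global_mean) for cat in categories]

mapping, global_mean = fit_mean_encoding(["a", "a", "b", "b"], [1.0, 3.0, 10.0, 20.0])
encoded = transform_mean_encoding(["a", "b", "z"], mapping, global_mean)
print(encoded)  # [2.0, 15.0, 8.5]
```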

StdTargetEncoder¶

Encode categorical features with the standard deviation of a target variable. Takes the same parameters as MeanTargetEncoder; only the default prefix differs ('std_').

Methods¶

def __init__(
    self,
    features: list[str],
    encoder_col: str,
    prefix: str = 'std_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    pass

MaxTargetEncoder¶

Encode categorical features with the max of a target variable. Takes the same parameters as MeanTargetEncoder; only the default prefix differs ('max_').

Methods¶

def __init__(
    self,
    features: list[str],
    encoder_col: str,
    prefix: str = 'max_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    pass

MinTargetEncoder¶

Encode categorical features with the min of a target variable. Takes the same parameters as MeanTargetEncoder; only the default prefix differs ('min_').

Methods¶

def __init__(
    self,
    features: list[str],
    encoder_col: str,
    prefix: str = 'min_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    pass

MedianTargetEncoder¶

Encode categorical features with the median of a target variable. Takes the same parameters as MeanTargetEncoder; only the default prefix differs ('median_').

Methods¶

def __init__(
    self,
    features: list[str],
    encoder_col: str,
    prefix: str = 'median_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    pass

KurtTargetEncoder¶

Encode categorical features with the kurtosis of a target variable. Takes the same parameters as MeanTargetEncoder; only the default prefix differs ('kurt_').

Methods¶

def __init__(
    self,
    features: list[str],
    encoder_col: str,
    prefix: str = 'kurt_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    pass

SkewTargetEncoder¶

Encode categorical features with the skewness of a target variable. Takes the same parameters as MeanTargetEncoder; only the default prefix differs ('skew_').

Methods¶

def __init__(
    self,
    features: list[str],
    encoder_col: str,
    prefix: str = 'skew_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    pass

OrdinalEncoder¶

Encode categorical features with ordinal integers based on sorted order. Null values are encoded as -99, unknown categories (not seen during fit) as -9999.

Methods¶

def __init__(
    self,
    features: list[str],
    suffix: str = '_ordinal_encoded',
    replace_original: bool = False
):
    """
    Parameters:
    -----------
    features : list[str]
        Categorical columns to encode.
    suffix : str
        Suffix for encoded column names (default: '_ordinal_encoded').
    replace_original : bool
        If True, drop original columns and use their names
        for encoded columns, ignoring suffix (default: False).
    """
    pass
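
The sentinel handling described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, and starting the integer codes at 0 is an assumption, since the documentation specifies only the sorted order and the two sentinel values.

```python
NULL_CODE = -99       # code for null values, as documented
UNKNOWN_CODE = -9999  # code for categories not seen during fit, as documented

def fit_ordinal(values):
    """Map each non-null category to its position in sorted order."""
    categories = sorted({v for v in values if v is not None})
    # Assumption: codes start at 0.
    return {cat: code for code, cat in enumerate(categories)}

def transform_ordinal(values, mapping):
    """Encode values, using sentinels for nulls and unknown categories."""
    return [
        NULL_CODE if v is None else mapping.get(v, UNKNOWN_CODE)
        for v in values
    ]

mapping = fit_ordinal(["red", "blue", None, "green", "blue"])
codes = transform_ordinal(["green", None, "purple", "blue"], mapping)
print(codes)  # [1, -99, -9999, 0]
```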

DummyEncoder¶

One-hot encode categorical features. Creates binary columns for each category, plus dedicated columns for null and unknown values.

Methods¶

def __init__(self, features: list[str]):
    """
    Parameters:
    -----------
    features : list[str]
        Categorical columns to encode.
    """
    pass

FrequencyEncoder¶

Encode categorical features by their frequency (count) or proportion. Unseen categories during transform are filled with 0.

Methods¶

def __init__(
    self,
    features: list[str],
    normalize: bool = True,
    prefix: str = 'freq_',
    suffix: str = '_encoded',
    replace_original: bool = False
):
    """
    Parameters:
    -----------
    features : list[str]
        Categorical columns to encode.
    normalize : bool
        If True, encode as proportion (0-1); if False, as raw count (default: True).
    prefix : str
        Prefix for encoded column names (default: 'freq_').
    suffix : str
        Suffix for encoded column names (default: '_encoded').
    replace_original : bool
        If True, drop original columns and use their names
        for encoded columns, ignoring prefix/suffix (default: False).
    """
    pass
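
The difference between `normalize=True` and `normalize=False`, and the fill value of 0 for unseen categories, can be illustrated with a short standard-library sketch. The function names are hypothetical and are not part of the library.

```python
from collections import Counter

def fit_frequency(values, normalize=True):
    """Learn each category's proportion (normalize=True) or raw count."""
    counts = Counter(values)
    if normalize:
        total = len(values)
        return {cat: n / total for cat, n in counts.items()}
    return dict(counts)

def transform_frequency(values, mapping):
    """Encode values; categories not seen during fit become 0."""
    return [mapping.get(v, 0) for v in values]

train = ["a", "a", "a", "b"]
proportions = transform_frequency(["a", "b", "z"], fit_frequency(train, normalize=True))
counts = transform_frequency(["a", "b", "z"], fit_frequency(train, normalize=False))
print(proportions)  # [0.75, 0.25, 0]
print(counts)       # [3, 1, 0]
```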

StandardScaler¶

Standardize features by removing the mean and scaling to unit variance (z-score normalization). Handles zero standard deviation by returning 0.

Methods¶

def __init__(self, features: list[str], suffix: str = ''):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to standardize.
    suffix : str
        Suffix for scaled column names (default: '', which overwrites the original column).
    """
    pass
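
A minimal sketch of z-score standardization with the zero-standard-deviation guard might look like this. The function names are hypothetical, and the choice of the sample standard deviation is an assumption, since the documentation does not state which variant is used.

```python
import statistics

def fit_standard(values):
    """Learn the mean and standard deviation from training values."""
    # Assumption: sample standard deviation (divides by n - 1).
    return statistics.fmean(values), statistics.stdev(values)

def transform_standard(values, mean, std):
    """Apply (x - mean) / std, returning 0 when std is zero."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

mean, std = fit_standard([1.0, 2.0, 3.0])
scaled = transform_standard([1.0, 2.0, 3.0], mean, std)
constant = transform_standard([5.0, 5.0, 5.0], *fit_standard([5.0, 5.0, 5.0]))
print(scaled)    # [-1.0, 0.0, 1.0]
print(constant)  # [0.0, 0.0, 0.0]
```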

MinMaxScaler¶

Scale features to [0, 1] range using min-max normalization. Handles zero range by returning 0.

Methods¶

def __init__(self, features: list[str], suffix: str = ''):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to scale.
    suffix : str
        Suffix for scaled column names (default: '', which overwrites the original column).
    """
    pass
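
Min-max normalization with the zero-range guard can be sketched as follows. The function names are hypothetical, and the example only illustrates the formula (x - min) / (max - min) with statistics learned from training data.

```python
def fit_minmax(values):
    """Learn the minimum and maximum from training values."""
    return min(values), max(values)

def transform_minmax(values, lo, hi):
    """Apply (x - lo) / (hi - lo), returning 0 when the range is zero."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

lo, hi = fit_minmax([10.0, 20.0, 30.0])
scaled = transform_minmax([10.0, 15.0, 30.0], lo, hi)
flat = transform_minmax([4.0, 4.0], *fit_minmax([4.0, 4.0]))
print(scaled)  # [0.0, 0.25, 1.0]
print(flat)    # [0.0, 0.0]
```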

RobustScaler¶

Scale features using median and interquartile range (IQR): (x - median) / (Q75 - Q25). Less sensitive to outliers than StandardScaler. Handles zero IQR by returning 0.

Methods¶

def __init__(self, features: list[str], suffix: str = ''):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to scale.
    suffix : str
        Suffix for scaled column names (default: '', which overwrites the original column).
    """
    pass
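
The formula (x - median) / (Q75 - Q25) can be illustrated with the standard library. The function names are hypothetical, and the quantile interpolation method (`method="inclusive"`) is an assumption; the library may compute quartiles differently, which would shift results slightly on small samples.

```python
import statistics

def fit_robust(values):
    """Learn the median and interquartile range from training values."""
    # Assumption: linear interpolation that includes the endpoints.
    q25, median, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return median, q75 - q25

def transform_robust(values, median, iqr):
    """Apply (x - median) / IQR, returning 0 when the IQR is zero."""
    if iqr == 0:
        return [0.0 for _ in values]
    return [(v - median) / iqr for v in values]

median, iqr = fit_robust([1, 2, 3, 4, 5])
scaled = transform_robust([1, 3, 5, 11], median, iqr)
print(scaled)  # [-1.0, 0.0, 1.0, 4.0]
```

Note that the value 11 maps to 4.0, even though it lies far outside the training data. A single extreme value in the training set would likewise have little effect on the median or the IQR, which is what makes this scaling less sensitive to outliers.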

Log1pFeatures¶

Apply log(1+x) transformation to features.

Methods¶

def __init__(self, features: list[str], suffix: str = ''):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to transform.
    suffix : str
        Suffix for output column names (default: '', which overwrites the original column).
    """
    pass

Expm1Features¶

Apply exp(x) - 1 transformation to features. This is the inverse of the log(1+x) transformation applied by Log1pFeatures.

Methods¶

def __init__(self, features: list[str], suffix: str = ''):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to transform.
    suffix : str
        Suffix for output column names (default: '', which overwrites the original column).
    """
    pass
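
The name `expm1` conventionally denotes exp(x) - 1, which undoes log(1 + x). A short check using Python's `math` module (not the library itself) shows the relationship:

```python
import math

values = [0.0, 0.5, 2.0]

# expm1(x) computes exp(x) - 1, with better precision near zero.
expm1_values = [math.expm1(v) for v in values]

# log1p(y) computes log(1 + y), so it recovers the original inputs.
round_trip = [math.log1p(e) for e in expm1_values]

print(expm1_values[0])  # 0.0
print(all(math.isclose(a, b) for a, b in zip(values, round_trip)))  # True
```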

PowerFeatures¶

Raise features to a specified power.

Methods¶

def __init__(self, features: list[str], suffix: str = '', power: float = 2):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to transform.
    suffix : str
        Suffix for output column names (default: '', which overwrites the original column).
    power : float
        Exponent for power transformation (default: 2).
    """
    pass

InverseFeatures¶

Apply inverse (1/x) transformation to features.

Methods¶

def __init__(self, features: list[str], suffix: str = ''):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to transform.
    suffix : str
        Suffix for output column names (default: '', which overwrites the original column).
    """
    pass

QuantileBinning¶

Discretize continuous features into quantile-based bins. During fit, computes quantile bin edges. During transform, assigns bin indices (0 to num_bins-1).

Methods¶

def __init__(
    self,
    features: list[str],
    num_bins: int = 10,
    suffix: str = '_qbin',
    labels: list[str] | None = None
):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to discretize.
    num_bins : int
        Number of quantile bins (default: 10).
    suffix : str
        Suffix for binned column names (default: '_qbin').
    labels : list[str] | None
        Optional string labels for bins; length must equal num_bins.

    Raises:
    -------
    ValueError
        If labels length does not match num_bins.
    """
    pass
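
The fit and transform steps can be sketched with the standard library: compute `num_bins - 1` interior edges from the training data, then look up each value's bin. The function names are hypothetical, and both the quantile interpolation method and the treatment of values that fall exactly on an edge are assumptions.

```python
import bisect
import statistics

def fit_bin_edges(values, num_bins=10):
    """Compute the num_bins - 1 interior quantile edges from training values."""
    # Assumption: linear interpolation that includes the endpoints.
    return statistics.quantiles(values, n=num_bins, method="inclusive")

def transform_bins(values, edges):
    """Assign each value a bin index between 0 and len(edges)."""
    # Assumption: a value equal to an edge falls into the higher bin.
    return [bisect.bisect_right(edges, v) for v in values]

edges = fit_bin_edges([1, 2, 3, 4, 5, 6, 7, 8], num_bins=4)
bins = transform_bins([1, 3, 5, 8, 100], edges)
print(edges)  # [2.75, 4.5, 6.25]
print(bins)   # [0, 1, 2, 3, 3]
```

In this sketch, a value beyond the training range (100) lands in the last bin, so indices stay within 0 to num_bins-1.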

RankFeatures¶

Convert features to percentile rank [0, 1] based on the training distribution.

Methods¶

def __init__(
    self,
    features: list[str],
    suffix: str = '_rank',
    method: Literal['average', 'min', 'max', 'dense'] = 'average'
):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to rank.
    suffix : str
        Suffix for ranked column names (default: '_rank').
    method : str
        Ranking method (default: 'average'):
        - 'average': average of min and max rank positions
        - 'min': lowest rank position
        - 'max': highest rank position
        - 'dense': like 'min', but ranks increase by exactly 1 between distinct values
    """
    pass

SimpleImputer¶

Impute missing values (null and NaN) using mean or median strategy.

Methods¶

def __init__(self, features: list[str], strategy: str = 'mean'):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to impute.
    strategy : str
        Imputation strategy: 'mean' or 'median' (default: 'mean').

    Raises:
    -------
    ValueError
        If strategy is not 'mean' or 'median'.
    """
    pass
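
A minimal sketch of the imputation logic, treating both `None` and NaN as missing, might look like this. The function names are hypothetical, and computing the fill value from the observed (non-missing) training values only is an assumption.

```python
import math
import statistics

def is_missing(value):
    """Treat both None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fit_fill_value(values, strategy="mean"):
    """Learn the fill value from the observed training values."""
    if strategy not in ("mean", "median"):
        raise ValueError(f"strategy must be 'mean' or 'median', got {strategy!r}")
    observed = [v for v in values if not is_missing(v)]
    if strategy == "mean":
        return statistics.fmean(observed)
    return statistics.median(observed)

def impute(values, fill_value):
    """Replace missing entries with the learned fill value."""
    return [fill_value if is_missing(v) else v for v in values]

train = [1.0, None, 3.0, float("nan"), 8.0]
mean_filled = impute(train, fit_fill_value(train, "mean"))
median_filled = impute(train, fit_fill_value(train, "median"))
print(mean_filled)    # [1.0, 4.0, 3.0, 4.0, 8.0]
print(median_filled)  # [1.0, 3.0, 3.0, 3.0, 8.0]
```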

FillNulls¶

Fill null and NaN values with a constant.

Methods¶

def __init__(self, features: list[str], value: float = -9999):
    """
    Parameters:
    -----------
    features : list[str]
        Columns to fill.
    value : float
        Constant to use for filling (default: -9999).
    """
    pass

GenerateLags¶

Generate lagged features for time series data by joining shifted dates.

Methods¶

def __init__(
    self,
    ts_index: str,
    date_col: str,
    lag_col: str,
    lag_frequency: str = 'days',
    lag_min: int = 1,
    lag_max: int = 1,
    lag_step: int = 1
):
    """
    Parameters:
    -----------
    ts_index : str
        Column for time series identifier (e.g., entity ID).
    date_col : str
        Date/time column (must be Polars Date or Datetime type).
    lag_col : str
        Column to lag.
    lag_frequency : str
        Time unit for lags (default: 'days').
        Options: 'weeks', 'days', 'hours', 'minutes', 'seconds',
        'milliseconds', 'microseconds', 'nanoseconds'.
    lag_min : int
        Minimum lag period (default: 1).
    lag_max : int
        Maximum lag period (default: 1).
    lag_step : int
        Step size between lags (default: 1).

    Raises:
    -------
    ValueError
        If lag_frequency is not a valid time unit.
    """
    pass
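
Joining on shifted dates differs from shifting rows by position: a lag is looked up by calendar offset, so gaps in the series produce missing values instead of silently borrowing an older row. The sketch below illustrates this with the standard library for a daily frequency. The function `add_lags`, the tuple layout, and the output column names (`value_lag_{k}`) are hypothetical, and treating `lag_max` as inclusive is an assumption.

```python
from datetime import date, timedelta

def add_lags(rows, lag_min=1, lag_max=1, lag_step=1):
    """rows: (series_id, date, value) tuples. Returns dicts with lag columns."""
    # Index values by (series, date) so each lag is a lookup on a shifted date.
    lookup = {(sid, d): v for sid, d, v in rows}
    out = []
    for sid, d, v in rows:
        record = {"id": sid, "date": d, "value": v}
        # Assumption: lag_max is inclusive.
        for k in range(lag_min, lag_max + 1, lag_step):
            record[f"value_lag_{k}"] = lookup.get((sid, d - timedelta(days=k)))
        out.append(record)
    return out

rows = [
    ("A", date(2024, 1, 1), 10),
    ("A", date(2024, 1, 2), 20),
    ("A", date(2024, 1, 4), 40),  # 2024-01-03 is missing for series A
    ("B", date(2024, 1, 2), 7),
]
lagged = add_lags(rows, lag_min=1, lag_max=2)
print(lagged[2]["value_lag_1"], lagged[2]["value_lag_2"])  # None 20
```

For the row dated 2024-01-04, the 1-day lag is missing because series A has no observation on 2024-01-03, while the 2-day lag finds the value from 2024-01-02. Lookups are keyed by series, so series B never picks up values from series A.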

KMeansCluster¶

Assign cluster labels using KMeans on selected features. Uses scikit-learn's KMeans internally. Missing values are imputed with column means before clustering.

Methods¶

def __init__(
    self,
    features: list[str],
    num_clusters: int = 8,
    new_feature: str = 'kmeans_cluster',
    random_state: int = 42
):
    """
    Parameters:
    -----------
    features : list[str]
        Numeric columns to use for clustering.
    num_clusters : int
        Number of clusters (default: 8).
    new_feature : str
        Name of the output cluster column (default: 'kmeans_cluster').
    random_state : int
        Random seed for reproducibility (default: 42).
    """
    pass

PCATransformer¶

Reduce dimensionality using Principal Component Analysis. Uses scikit-learn's PCA internally. Missing values are imputed with column means before PCA.

Methods¶

def __init__(
    self,
    features: list[str],
    n_components: int = 2,
    prefix: str = 'pc_'
):
    """
    Parameters:
    -----------
    features : list[str]
        Numeric columns to use for PCA.
    n_components : int
        Number of principal components to keep (default: 2).
    prefix : str
        Prefix for output column names (default: 'pc_').
        Columns are named '{prefix}0', '{prefix}1', etc.
    """
    pass