empml.transformers¶
| Object | Description |
|---|---|
Identity |
Pass-through transformer. |
AvgFeatures |
Compute mean across multiple features row-wise. |
MaxFeatures |
Compute max across multiple features row-wise. |
MinFeatures |
Compute min across multiple features row-wise. |
StdFeatures |
Compute standard deviation across multiple features row-wise. |
MedianFeatures |
Compute median across multiple features row-wise. |
ModuleFeatures |
Compute Euclidean norm (module) of two features. |
InteractionFeatures |
Create pairwise multiplication features from feature pairs. |
MeanTargetEncoder |
Target encoding using mean of target variable. |
StdTargetEncoder |
Target encoding using standard deviation of target variable. |
MaxTargetEncoder |
Target encoding using max of target variable. |
MinTargetEncoder |
Target encoding using min of target variable. |
MedianTargetEncoder |
Target encoding using median of target variable. |
KurtTargetEncoder |
Target encoding using kurtosis of target variable. |
SkewTargetEncoder |
Target encoding using skewness of target variable. |
OrdinalEncoder |
Encode categorical features as ordinal integers. |
DummyEncoder |
One-hot encode categorical features. |
FrequencyEncoder |
Encode categorical features by their frequency or proportion. |
StandardScaler |
Standardize features by removing the mean and scaling to unit variance. |
MinMaxScaler |
Scale features to [0, 1] range using min-max normalization. |
RobustScaler |
Scale features using median and interquartile range (IQR). |
Log1pFeatures |
Apply log(1+x) transformation. |
Expm1Features |
Apply exp(x-1) transformation. |
PowerFeatures |
Apply power transformation. |
InverseFeatures |
Apply inverse transformation (1/x). |
QuantileBinning |
Discretize continuous features into quantile-based bins. |
RankFeatures |
Convert features to percentile rank based on training distribution. |
SimpleImputer |
Impute missing values using mean or median. |
FillNulls |
Fill null and NaN values with a constant. |
GenerateLags |
Generate lagged features for time series data. |
KMeansCluster |
Assign cluster labels using KMeans on selected features. |
PCATransformer |
Reduce dimensionality using Principal Component Analysis. |
Identity¶
Pass-through transformer that returns data unchanged. No parameters required.
AvgFeatures¶
Compute mean across multiple features row-wise.
Methods¶
def __init__(self, features: list[str], new_feature: str):
"""
Parameters:
-----------
features : list[str]
Columns to average.
new_feature : str
Name of the output column.
"""
pass
MaxFeatures¶
Compute max across multiple features row-wise.
Methods¶
def __init__(self, features: list[str], new_feature: str):
"""
Parameters:
-----------
features : list[str]
Columns to compute max over.
new_feature : str
Name of the output column.
"""
pass
MinFeatures¶
Compute min across multiple features row-wise.
Methods¶
def __init__(self, features: list[str], new_feature: str):
"""
Parameters:
-----------
features : list[str]
Columns to compute min over.
new_feature : str
Name of the output column.
"""
pass
StdFeatures¶
Compute standard deviation across multiple features row-wise.
Methods¶
def __init__(self, features: list[str], new_feature: str):
"""
Parameters:
-----------
features : list[str]
Columns to compute std over.
new_feature : str
Name of the output column.
"""
pass
MedianFeatures¶
Compute median across multiple features row-wise.
Methods¶
def __init__(self, features: list[str], new_feature: str):
"""
Parameters:
-----------
features : list[str]
Columns to compute median over.
new_feature : str
Name of the output column.
"""
pass
ModuleFeatures¶
Compute Euclidean norm (module) of two features: sqrt(f1^2 + f2^2).
Methods¶
def __init__(self, features: tuple[str, str], new_feature: str):
"""
Parameters:
-----------
features : tuple[str, str]
Tuple of two column names.
new_feature : str
Name of the output column.
"""
pass
InteractionFeatures¶
Create pairwise multiplication features from feature pairs. For each pair (f1, f2), creates a column f1 * f2.
Methods¶
def __init__(self, feature_pairs: list[tuple[str, str]], separator: str = '_x_'):
"""
Parameters:
-----------
feature_pairs : list[tuple[str, str]]
List of (col1, col2) tuples to multiply.
separator : str
String between feature names in the output column name (default: '_x_').
Output column name: '{col1}{separator}{col2}'.
"""
pass
MeanTargetEncoder¶
Encode categorical features with the mean of a target variable. Unseen categories during transform are filled with the global mean.
Methods¶
def __init__(
self,
features: list[str],
encoder_col: str,
prefix: str = 'mean_',
suffix: str = '_encoded',
replace_original: bool = False
):
"""
Parameters:
-----------
features : list[str]
Categorical columns to encode.
encoder_col : str
Target column to aggregate.
prefix : str
Prefix for encoded column names (default: 'mean_').
suffix : str
Suffix for encoded column names (default: '_encoded').
replace_original : bool
If True, drop original columns and use their names
for encoded columns, ignoring prefix/suffix (default: False).
"""
pass
StdTargetEncoder¶
Encode categorical features with the standard deviation of a target variable. Same interface as MeanTargetEncoder.
Methods¶
def __init__(
self,
features: list[str],
encoder_col: str,
prefix: str = 'std_',
suffix: str = '_encoded',
replace_original: bool = False
):
pass
MaxTargetEncoder¶
Encode categorical features with the max of a target variable. Same interface as MeanTargetEncoder.
Methods¶
def __init__(
self,
features: list[str],
encoder_col: str,
prefix: str = 'max_',
suffix: str = '_encoded',
replace_original: bool = False
):
pass
MinTargetEncoder¶
Encode categorical features with the min of a target variable. Same interface as MeanTargetEncoder.
Methods¶
def __init__(
self,
features: list[str],
encoder_col: str,
prefix: str = 'min_',
suffix: str = '_encoded',
replace_original: bool = False
):
pass
MedianTargetEncoder¶
Encode categorical features with the median of a target variable. Same interface as MeanTargetEncoder.
Methods¶
def __init__(
self,
features: list[str],
encoder_col: str,
prefix: str = 'median_',
suffix: str = '_encoded',
replace_original: bool = False
):
pass
KurtTargetEncoder¶
Encode categorical features with the kurtosis of a target variable. Same interface as MeanTargetEncoder.
Methods¶
def __init__(
self,
features: list[str],
encoder_col: str,
prefix: str = 'kurt_',
suffix: str = '_encoded',
replace_original: bool = False
):
pass
SkewTargetEncoder¶
Encode categorical features with the skewness of a target variable. Same interface as MeanTargetEncoder.
Methods¶
def __init__(
self,
features: list[str],
encoder_col: str,
prefix: str = 'skew_',
suffix: str = '_encoded',
replace_original: bool = False
):
pass
OrdinalEncoder¶
Encode categorical features with ordinal integers based on sorted order. Null values are encoded as -99, unknown categories (not seen during fit) as -9999.
Methods¶
def __init__(
self,
features: list[str],
suffix: str = '_ordinal_encoded',
replace_original: bool = False
):
"""
Parameters:
-----------
features : list[str]
Categorical columns to encode.
suffix : str
Suffix for encoded column names (default: '_ordinal_encoded').
replace_original : bool
If True, drop original columns and use their names
for encoded columns, ignoring suffix (default: False).
"""
pass
DummyEncoder¶
One-hot encode categorical features. Creates binary columns for each category, plus dedicated columns for null and unknown values.
Methods¶
def __init__(self, features: list[str]):
"""
Parameters:
-----------
features : list[str]
Categorical columns to encode.
"""
pass
FrequencyEncoder¶
Encode categorical features by their frequency (count) or proportion. Unseen categories during transform are filled with 0.
Methods¶
def __init__(
self,
features: list[str],
normalize: bool = True,
prefix: str = 'freq_',
suffix: str = '_encoded',
replace_original: bool = False
):
"""
Parameters:
-----------
features : list[str]
Categorical columns to encode.
normalize : bool
If True, encode as proportion (0-1); if False, as raw count (default: True).
prefix : str
Prefix for encoded column names (default: 'freq_').
suffix : str
Suffix for encoded column names (default: '_encoded').
replace_original : bool
If True, drop original columns and use their names
for encoded columns, ignoring prefix/suffix (default: False).
"""
pass
StandardScaler¶
Standardize features by removing the mean and scaling to unit variance (z-score normalization). Handles zero standard deviation by returning 0.
Methods¶
def __init__(self, features: list[str], suffix: str = ''):
"""
Parameters:
-----------
features : list[str]
Columns to standardize.
suffix : str
Suffix for scaled column names (default: '' overwrites original).
"""
pass
MinMaxScaler¶
Scale features to [0, 1] range using min-max normalization. Handles zero range by returning 0.
Methods¶
def __init__(self, features: list[str], suffix: str = ''):
"""
Parameters:
-----------
features : list[str]
Columns to scale.
suffix : str
Suffix for scaled column names (default: '' overwrites original).
"""
pass
RobustScaler¶
Scale features using median and interquartile range (IQR): (x - median) / (Q75 - Q25). Less sensitive to outliers than StandardScaler. Handles zero IQR by returning 0.
Methods¶
def __init__(self, features: list[str], suffix: str = ''):
"""
Parameters:
-----------
features : list[str]
Columns to scale.
suffix : str
Suffix for scaled column names (default: '' overwrites original).
"""
pass
Log1pFeatures¶
Apply log(1+x) transformation to features.
Methods¶
def __init__(self, features: list[str], suffix: str = ''):
"""
Parameters:
-----------
features : list[str]
Columns to transform.
suffix : str
Suffix for output column names (default: '' overwrites original).
"""
pass
Expm1Features¶
Apply exp(x-1) transformation to features.
Methods¶
def __init__(self, features: list[str], suffix: str = ''):
"""
Parameters:
-----------
features : list[str]
Columns to transform.
suffix : str
Suffix for output column names (default: '' overwrites original).
"""
pass
PowerFeatures¶
Raise features to a specified power.
Methods¶
def __init__(self, features: list[str], suffix: str = '', power: float = 2):
"""
Parameters:
-----------
features : list[str]
Columns to transform.
suffix : str
Suffix for output column names (default: '' overwrites original).
power : float
Exponent for power transformation (default: 2).
"""
pass
InverseFeatures¶
Apply inverse (1/x) transformation to features.
Methods¶
def __init__(self, features: list[str], suffix: str = ''):
"""
Parameters:
-----------
features : list[str]
Columns to transform.
suffix : str
Suffix for output column names (default: '' overwrites original).
"""
pass
QuantileBinning¶
Discretize continuous features into quantile-based bins. During fit, computes quantile bin edges. During transform, assigns bin indices (0 to num_bins-1).
Methods¶
def __init__(
self,
features: list[str],
num_bins: int = 10,
suffix: str = '_qbin',
labels: list[str] | None = None
):
"""
Parameters:
-----------
features : list[str]
Columns to discretize.
num_bins : int
Number of quantile bins (default: 10).
suffix : str
Suffix for binned column names (default: '_qbin').
labels : list[str] | None
Optional string labels for bins; length must equal num_bins.
Raises:
-------
ValueError
If labels length does not match num_bins.
"""
pass
RankFeatures¶
Convert features to percentile rank [0, 1] based on the training distribution.
Methods¶
def __init__(
self,
features: list[str],
suffix: str = '_rank',
method: Literal['average', 'min', 'max', 'dense'] = 'average'
):
"""
Parameters:
-----------
features : list[str]
Columns to rank.
suffix : str
Suffix for ranked column names (default: '_rank').
method : str
Ranking method (default: 'average'):
- 'average': average of min and max rank positions
- 'min': lowest rank position
- 'max': highest rank position
- 'dense': like 'min' but ranks always increase by 1
"""
pass
SimpleImputer¶
Impute missing values (null and NaN) using mean or median strategy.
Methods¶
def __init__(self, features: list[str], strategy: str = 'mean'):
"""
Parameters:
-----------
features : list[str]
Columns to impute.
strategy : str
Imputation strategy: 'mean' or 'median' (default: 'mean').
Raises:
-------
ValueError
If strategy is not 'mean' or 'median'.
"""
pass
FillNulls¶
Fill null and NaN values with a constant.
Methods¶
def __init__(self, features: list[str], value: float = -9999):
"""
Parameters:
-----------
features : list[str]
Columns to fill.
value : float
Constant to use for filling (default: -9999).
"""
pass
GenerateLags¶
Generate lagged features for time series data by joining shifted dates.
Methods¶
def __init__(
self,
ts_index: str,
date_col: str,
lag_col: str,
lag_frequency: str = 'days',
lag_min: int = 1,
lag_max: int = 1,
lag_step: int = 1
):
"""
Parameters:
-----------
ts_index : str
Column for time series identifier (e.g., entity ID).
date_col : str
Date/time column (must be Polars Date or Datetime type).
lag_col : str
Column to lag.
lag_frequency : str
Time unit for lags (default: 'days').
Options: 'weeks', 'days', 'hours', 'minutes', 'seconds',
'milliseconds', 'microseconds', 'nanoseconds'.
lag_min : int
Minimum lag period (default: 1).
lag_max : int
Maximum lag period (default: 1).
lag_step : int
Step size between lags (default: 1).
Raises:
-------
ValueError
If lag_frequency is not a valid time unit.
"""
pass
KMeansCluster¶
Assign cluster labels using KMeans on selected features. Uses scikit-learn's KMeans internally. Missing values are imputed with column means before clustering.
Methods¶
def __init__(
self,
features: list[str],
num_clusters: int = 8,
new_feature: str = 'kmeans_cluster',
random_state: int = 42
):
"""
Parameters:
-----------
features : list[str]
Numeric columns to use for clustering.
num_clusters : int
Number of clusters (default: 8).
new_feature : str
Name of the output cluster column (default: 'kmeans_cluster').
random_state : int
Random seed for reproducibility (default: 42).
"""
pass
PCATransformer¶
Reduce dimensionality using Principal Component Analysis. Uses scikit-learn's PCA internally. Missing values are imputed with column means before PCA.
Methods¶
def __init__(
self,
features: list[str],
n_components: int = 2,
prefix: str = 'pc_'
):
"""
Parameters:
-----------
features : list[str]
Numeric columns to use for PCA.
n_components : int
Number of principal components to keep (default: 2).
prefix : str
Prefix for output column names (default: 'pc_').
Columns are named '{prefix}0', '{prefix}1', etc.
"""
pass