Transformers

The library includes 44 data transformers covering scaling, encoding, text vectorization, image transforms, feature selection, imputation, and class-imbalance correction. All of them implement the Transformer interface.

Transformer Interface

All transformers implement Pml\Interfaces\Transformer:

interface Transformer
{
    public function fit(Dataset $dataset): void;
    public function transform(Dataset $dataset): Dataset;
    public function fitTransform(Dataset $dataset): Dataset;  // default: fit + transform
}

Transformers that implement Stateful additionally support getStateDict() / loadStateDict() for SafeTensors checkpoint I/O.

Scalers & Normalizers

Namespace: Pml\Transformers\

Class	Parameters	Description
`StandardScaler`	—	Subtracts the mean and divides by the standard deviation, so each feature has μ=0 and σ=1. The most common default.
`ZScaleStandardizer`	`center=true`	Same as StandardScaler but optionally skips centering (for sparse data).
`MinMaxScaler`	`min=0.0, max=1.0`	Scales each feature to [min, max]. Sensitive to outliers.
`MaxAbsScaler`	—	Divides by the maximum absolute value. Preserves sparsity.
`RobustScaler`	—	Centers by median, scales by IQR. Robust to outliers.
`L1Normalizer`	—	Normalizes each sample to unit L1 norm.
`L2Normalizer`	—	Normalizes each sample to unit L2 norm.
`PowerTransformer`	—	Yeo-Johnson or Box-Cox power transform to make features more Gaussian-like. Uses C-level fit.
`QuantileTransformer`	`nQuantiles=1000`	Maps to uniform or normal distribution via quantile landmarks. C-accelerated.
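
The table notes that min-max scaling is sensitive to outliers while median/IQR scaling is robust to them. The sketch below illustrates that difference on a single feature column using plain PHP arrays. The function names (`quantile`, `minMaxScale`, `robustScale`) are hypothetical helpers written for this illustration, not part of the Pml API, and the linear-interpolation quantile is an assumption about how the quartiles might be computed.

```php
<?php
declare(strict_types=1);

/** Linearly interpolated quantile of an already sorted list, q in [0, 1]. */
function quantile(array $sorted, float $q): float
{
    $pos = $q * (count($sorted) - 1);
    $lo = (int) floor($pos);
    $hi = (int) ceil($pos);

    return $sorted[$lo] + ($pos - $lo) * ($sorted[$hi] - $sorted[$lo]);
}

/** Scale one feature column to [0, 1] using its minimum and maximum. */
function minMaxScale(array $column): array
{
    $min = min($column);
    $range = max($column) - $min;

    return array_map(
        fn (float $v): float => $range > 0.0 ? ($v - $min) / $range : 0.0,
        $column
    );
}

/** Center one feature column by its median and scale by its interquartile range. */
function robustScale(array $column): array
{
    $sorted = $column;
    sort($sorted);
    $median = quantile($sorted, 0.5);
    $iqr = quantile($sorted, 0.75) - quantile($sorted, 0.25);

    return array_map(
        fn (float $v): float => $iqr > 0.0 ? ($v - $median) / $iqr : 0.0,
        $column
    );
}

$feature = [1.0, 2.0, 3.0, 4.0, 100.0]; // 100.0 is an outlier

$minMax = minMaxScale($feature);
$robust = robustScale($feature);
```

With this input, min-max scaling squeezes the four ordinary values into roughly the bottom 3% of the [0, 1] range, because the outlier sets the maximum. Robust scaling maps them to -1.0, -0.5, 0.0, and 0.5 and leaves only the outlier far from zero (48.5).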

Encoding

Class	Parameters	Description
`OrdinalEncoder`	`columns=[]`	Maps string categories to integer codes per column. Deterministic sort order.
`OneHotLabelEncoder`	—	Expands integer class labels to binary one-hot vectors. For multi-class output encoding.
`CategoricalEncoder`	—	Encodes string feature columns as integer indices. Stores vocabulary per column.
`BooleanConverter`	—	Converts boolean/string boolean values (`'true'/'false'`, `'yes'/'no'`) to 0.0/1.0.
`NumericStringConverter`	—	Parses numeric string columns (`'3.14'`) to float. NaN for non-parsable values.
`IntervalDiscretizer`	`bins=5`	Bins continuous features into equal-width intervals. Returns integer bin indices.
`TargetEncoder`	—	Replaces categorical levels with the target mean (smoothed). Must call `fit()` with labels present.
`PolynomialExpander`	`degree=2`	Adds polynomial and interaction features.
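
To make the idea of a smoothed target mean concrete, here is a minimal sketch in plain PHP. It assumes additive smoothing toward the global mean with a weight `$m`. The source does not state which smoothing formula `TargetEncoder` uses, so the formula, the function names, and the fallback for unseen categories are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Learn a smoothed target mean per category.
 * $m is a hypothetical smoothing weight: larger values pull rare
 * categories more strongly toward the global mean.
 *
 * @param list<string> $categories
 * @param list<float>  $targets
 * @return array{global: float, levels: array<string, float>}
 */
function fitTargetEncoding(array $categories, array $targets, float $m = 1.0): array
{
    $globalMean = array_sum($targets) / count($targets);
    $sums = [];
    $counts = [];
    foreach ($categories as $i => $category) {
        $sums[$category] = ($sums[$category] ?? 0.0) + $targets[$i];
        $counts[$category] = ($counts[$category] ?? 0) + 1;
    }

    $levels = [];
    foreach ($sums as $category => $sum) {
        $levels[$category] = ($sum + $m * $globalMean) / ($counts[$category] + $m);
    }

    return ['global' => $globalMean, 'levels' => $levels];
}

/** Replace each category with its learned value; unseen categories get the global mean. */
function applyTargetEncoding(array $categories, array $encoding): array
{
    return array_map(
        fn (string $c): float => $encoding['levels'][$c] ?? $encoding['global'],
        $categories
    );
}

$cities = ['paris', 'paris', 'paris', 'oslo'];
$labels = [1.0, 1.0, 0.0, 0.0];

$encoding = fitTargetEncoding($cities, $labels, 1.0);
$encoded = applyTargetEncoding(['paris', 'oslo', 'lima'], $encoding);
```

Here the global mean is 0.5, so `paris` encodes to (2 + 0.5) / (3 + 1) = 0.625, `oslo` to (0 + 0.5) / (1 + 1) = 0.25, and the unseen `lima` falls back to 0.5. Because the encoding is learned from labels, it should be fitted on training data only.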

Text Vectorizers

Class	Parameters	Description
`WordCountVectorizer`	`maxFeatures=null, textColumn='text'`	Bag-of-words counts. Builds a vocabulary on `fit()`, outputs sparse-like count matrix.
`TfIdfTransformer`	—	Transforms raw term counts to TF-IDF weights. Apply after WordCountVectorizer.
`BM25Transformer`	`k1=1.2, b=0.75`	BM25 ranking weighting. Often preferred over TF-IDF for information retrieval tasks because it saturates term frequency and normalizes for document length.
`TokenHashingVectorizer`	—	Hashing trick vectorizer — no vocabulary stored. Fixed output dim, collision-tolerant.
`TextNormalizer`	—	Lowercases, removes punctuation, trims whitespace. ASCII-only.
`MultibyteTextNormalizer`	`stripDiacritics=false`	Unicode-aware text normalization. Optionally strips diacritics (accents).
`HtmlStripper`	—	Strips HTML tags from text columns.
`StopWordFilter`	`stopWords: string[]`	Removes specified stop words from tokenized text.
`RegexFilter`	—	Removes tokens matching a regex pattern from text columns.
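
The step from raw term counts to TF-IDF weights can be sketched as follows. This is an illustration in plain PHP rather than the library's implementation: the function name `tfIdf` is hypothetical, and the smoothed inverse document frequency used here, ln((1 + N) / (1 + df)) + 1, is one common variant that `TfIdfTransformer` may or may not use.

```php
<?php
declare(strict_types=1);

/**
 * Re-weight a document-term count matrix with TF-IDF.
 * Assumed variant: idf = ln((1 + N) / (1 + df)) + 1.
 *
 * @param list<list<int>> $counts rows are documents, columns are vocabulary terms
 * @return list<list<float>>
 */
function tfIdf(array $counts): array
{
    $nDocs = count($counts);
    $idf = [];
    foreach (array_keys($counts[0]) as $j) {
        // Document frequency: number of documents that contain term j.
        $df = count(array_filter(array_column($counts, $j), fn (int $c): bool => $c > 0));
        $idf[$j] = log((1 + $nDocs) / (1 + $df)) + 1.0;
    }

    return array_map(
        fn (array $row): array => array_map(
            fn (int $c, float $w): float => $c * $w,
            $row,
            $idf
        ),
        $counts
    );
}

// Columns correspond to the vocabulary ['good', 'movie', 'bad'].
$counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 3],
];

$weights = tfIdf($counts);
```

The term `movie` appears in every document, so its weight stays at its raw count, while the rarer `good` and `bad` are boosted relative to it.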

Image Transformers

Class	Parameters	Description
`ImageResizer`	`targetWidth, targetHeight`	Bilinear resize for image datasets stored as pixel rows.
`ImageCropper`	`xOffset, yOffset, width, height`	Fixed-region crop from image tensor.
`ImageRotator`	—	Random rotation augmentation (configurable angle range).
`ImageVectorizer`	`maxPixelValue=255.0`	Flattens image tensors to 1-D feature vectors and scales pixel values to [0, 1].
`ColorSpaceConverter`	—	Converts between RGB, HSV, LAB color spaces.

Feature Selection & Projection

Class	Parameters	Description
`VarianceThreshold`	`minVariance=1e-4`	Removes features whose variance falls below the threshold. Fast unsupervised selection.
`SelectKBest`	`k=10`	Selects K features with highest ANOVA F-score (supervised). Requires labeled dataset.
`PrincipalComponentAnalysis`	`nComponents: int`	PCA via SVD. Projects features to the top-k principal components.
`LinearDiscriminantAnalysis`	`nComponents=null`	Supervised dimensionality reduction maximizing class separability. nComponents ≤ C−1, where C is the number of classes.
`TruncatedSVD`	`nComponents=2`	Latent Semantic Analysis — SVD without mean centering. Suitable for sparse matrices.
`GaussianRandomProjector`	`nComponents=100`	Random Gaussian projection for fast approximate dimensionality reduction (Johnson-Lindenstrauss).
`SparseRandomProjector`	—	Sparse sign random projection — memory-efficient alternative to Gaussian projection.
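
As a small illustration of the simplest selector in the table, the sketch below drops low-variance columns from plain PHP arrays. The helper names (`selectByVariance`, `keepColumns`) are hypothetical, and the use of population variance is an assumption. Only the default threshold of 1e-4 is taken from the table.

```php
<?php
declare(strict_types=1);

/**
 * Return the indices of feature columns whose population variance
 * is at least $minVariance.
 */
function selectByVariance(array $rows, float $minVariance = 1e-4): array
{
    $n = count($rows);
    $kept = [];
    foreach (array_keys($rows[0]) as $j) {
        $column = array_column($rows, $j);
        $mean = array_sum($column) / $n;
        $variance = array_sum(
            array_map(fn (float $v): float => ($v - $mean) ** 2, $column)
        ) / $n;
        if ($variance >= $minVariance) {
            $kept[] = $j;
        }
    }

    return $kept;
}

/** Keep only the given column indices in every row. */
function keepColumns(array $rows, array $indices): array
{
    return array_map(
        fn (array $row): array => array_map(fn (int $j): float => $row[$j], $indices),
        $rows
    );
}

$train = [
    [1.0, 5.0, 0.10],
    [1.0, 7.0, 0.10],
    [1.0, 9.0, 0.11],
];

$kept = selectByVariance($train);      // indices learned from training rows
$reduced = keepColumns($train, $kept); // the same indices would be applied to test rows
```

The first column is constant and the third varies by less than the threshold, so only the middle column survives.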

Imputers

Class	Parameters	Description
`Imputer`	—	Simple imputer: fills missing values with the column mean, median, most frequent value, or a constant, depending on the configured strategy.
`KNNImputer`	`k=5`	Imputes missing values using the mean of the K nearest neighbors that have the value observed. Typically more accurate than simple imputation, but slower.
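
A minimal sketch of mean imputation on plain PHP arrays is shown below, with NaN standing in for a missing value. The function names are hypothetical and the 0.0 fallback for a column with no observed values is an assumption. The library's `Imputer` may handle that case differently.

```php
<?php
declare(strict_types=1);

/** Mean of the observed (non-NaN) values in each column of the training rows. */
function fitColumnMeans(array $rows): array
{
    $means = [];
    foreach (array_keys($rows[0]) as $j) {
        $observed = array_filter(
            array_column($rows, $j),
            fn (float $v): bool => !is_nan($v)
        );
        // A column with no observed values has no mean; this sketch falls back to 0.0.
        $means[$j] = $observed === [] ? 0.0 : array_sum($observed) / count($observed);
    }

    return $means;
}

/** Replace every NaN with the stored mean of its column. */
function imputeWithMeans(array $rows, array $means): array
{
    return array_map(
        fn (array $row): array => array_map(
            fn (float $v, float $m): float => is_nan($v) ? $m : $v,
            $row,
            $means
        ),
        $rows
    );
}

$train = [[1.0, NAN], [3.0, 4.0], [NAN, 8.0]];

$means = fitColumnMeans($train); // [2.0, 6.0]
$filledTrain = imputeWithMeans($train, $means);
$filledTest = imputeWithMeans([[NAN, NAN]], $means);
```

The test row is filled with the means learned from the training rows, which mirrors the fit-then-transform split used by the other stateful transformers.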

Class Imbalance

Class	Parameters	Description
`SMOTE`	`amount=100, k=5`	Synthetic Minority Over-sampling. Generates synthetic minority-class samples by interpolating between K nearest neighbors.
`TomekLinks`	—	Under-sampling by removing majority-class samples that form Tomek links (near-boundary pairs).
`NeighborhoodClearing`	`k=3`	Edited Nearest Neighbors — removes borderline samples. Cleans decision boundary.
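
The interpolation step behind SMOTE can be sketched as follows. This is a simplified illustration on plain PHP arrays, not the library's implementation: the function names are hypothetical, the number of synthetic samples is passed directly as `$count` rather than through the `amount` parameter from the table, and neighbors are found by brute-force Euclidean distance.

```php
<?php
declare(strict_types=1);

/** Squared Euclidean distance between two feature vectors. */
function squaredDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $j => $value) {
        $sum += ($value - $b[$j]) ** 2;
    }

    return $sum;
}

/**
 * Generate $count synthetic samples from minority-class rows.
 * Each synthetic point lies on the segment between a randomly chosen
 * minority sample and one of its $k nearest minority neighbors.
 * Requires at least two minority rows.
 */
function smoteSamples(array $minority, int $count, int $k = 5): array
{
    $synthetic = [];
    for ($s = 0; $s < $count; $s++) {
        $i = mt_rand(0, count($minority) - 1);
        $base = $minority[$i];

        // Rank the other minority samples by distance to the base sample.
        $others = $minority;
        unset($others[$i]);
        usort(
            $others,
            fn (array $a, array $b): int =>
                squaredDistance($base, $a) <=> squaredDistance($base, $b)
        );
        $neighbors = array_slice($others, 0, $k);

        $neighbor = $neighbors[mt_rand(0, count($neighbors) - 1)];
        $gap = mt_rand() / mt_getrandmax(); // uniform in [0, 1]
        $synthetic[] = array_map(
            fn (float $x, float $y): float => $x + $gap * ($y - $x),
            $base,
            $neighbor
        );
    }

    return $synthetic;
}

mt_srand(42); // fixed seed so the sketch is repeatable
$minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8]];
$synthetic = smoteSamples($minority, 6, 2);
```

Every synthetic point is a convex combination of two existing minority samples, so it stays inside the region those samples already occupy. Resampling of this kind is meant for training data only.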

Other Transformers

Class	Parameters	Description
`LambdaFunction`	`fn: callable`	Wraps any PHP callable as a stateless transformer. Useful for custom feature engineering steps.

Pipeline Example

use Pml\Pipeline;
use Pml\Transformers\StandardScaler;
use Pml\Transformers\Imputer;
use Pml\Transformers\OrdinalEncoder;
use Pml\Transformers\SelectKBest;
use Pml\Transformers\SMOTE;
use Pml\Estimators\Classifiers\GBDTClassifier;

$pipeline = new Pipeline([
    new OrdinalEncoder(columns: ['gender', 'city']),
    new Imputer(),                   // fill NaNs with column mean
    new StandardScaler(),
    new SelectKBest(k: 20),          // keep top-20 ANOVA features
    new SMOTE(amount: 150, k: 5),    // oversample minority class
    new GBDTClassifier(nEstimators: 200),
]);

$pipeline->train($trainSet);
$preds = $pipeline->predict($testSet);

// Text classification pipeline
use Pml\Transformers\TextNormalizer;
use Pml\Transformers\StopWordFilter;
use Pml\Transformers\WordCountVectorizer;
use Pml\Transformers\TfIdfTransformer;
use Pml\Estimators\Classifiers\MultinomialNB;

$pipeline = new Pipeline([
    new TextNormalizer(),
    new StopWordFilter(['the', 'a', 'is', 'are']),
    new WordCountVectorizer(maxFeatures: 10_000),
    new TfIdfTransformer(),
    new MultinomialNB(alpha: 0.1),
]);

$pipeline->train($trainSet);
$sentiment = $pipeline->predict($testSet);

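Stateless vs Stateful Transformers

Scalers, encoders, and vectorizers are stateful: they compute their parameters on fit() and store them. Call fit() on the training data only, then call transform() on both the training and the test data. The fitTransform() shortcut does both in one call and should be used on training data only. Stateless transformers such as LambdaFunction have no parameters to learn.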
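
The fit-on-training-data rule can be sketched with a small stand-in class. `ArrayStandardScaler` below is hypothetical: it mirrors the fit() / transform() / fitTransform() method names of the Transformer interface but works on plain PHP arrays instead of a Pml Dataset, and it assumes population standard deviation.

```php
<?php
declare(strict_types=1);

/** Stand-in for a stateful transformer; operates on plain arrays, not Pml datasets. */
final class ArrayStandardScaler
{
    private array $mean = [];
    private array $std = [];

    /** Learn the per-feature mean and standard deviation. */
    public function fit(array $rows): void
    {
        $n = count($rows);
        foreach (array_keys($rows[0]) as $j) {
            $column = array_column($rows, $j);
            $mu = array_sum($column) / $n;
            $variance = array_sum(
                array_map(fn (float $v): float => ($v - $mu) ** 2, $column)
            ) / $n;
            $this->mean[$j] = $mu;
            // Constant features get a divisor of 1.0 to avoid division by zero.
            $this->std[$j] = $variance > 0.0 ? sqrt($variance) : 1.0;
        }
    }

    /** Apply the stored statistics without updating them. */
    public function transform(array $rows): array
    {
        return array_map(
            fn (array $row): array => array_map(
                fn (float $v, float $m, float $s): float => ($v - $m) / $s,
                $row,
                $this->mean,
                $this->std
            ),
            $rows
        );
    }

    public function fitTransform(array $rows): array
    {
        $this->fit($rows);

        return $this->transform($rows);
    }
}

$train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]];
$test = [[4.0, 20.0]];

$scaler = new ArrayStandardScaler();
$trainScaled = $scaler->fitTransform($train); // statistics come from training rows only
$testScaled = $scaler->transform($test);      // reuses the stored mean and std
```

Calling fit() or fitTransform() on the test rows instead would replace the stored statistics with ones computed from test data, which leaks information about the test set into preprocessing.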
