Transformers

44 data transformers covering scaling, encoding, text vectorization, image transforms, feature selection, imputation, and class-imbalance correction. All implement the Transformer interface.

Transformer Interface

All transformers implement Pml\Interfaces\Transformer:

interface Transformer
{
    public function fit(Dataset $dataset): void;
    public function transform(Dataset $dataset): Dataset;
    public function fitTransform(Dataset $dataset): Dataset;  // default: fit + transform
}

Transformers that implement Stateful additionally support getStateDict() / loadStateDict() for SafeTensors checkpoint I/O.

Scalers & Normalizers

Namespace: Pml\Transformers\

| Class | Parameters | Description |
|---|---|---|
| StandardScaler | | Subtracts the mean and divides by the standard deviation, so each feature has μ=0, σ=1. Most common default. |
| ZScaleStandardizer | center=true | Same as StandardScaler, but centering can be skipped, which suits sparse data. |
| MinMaxScaler | min=0.0, max=1.0 | Scales each feature to [min, max]. Sensitive to outliers. |
| MaxAbsScaler | | Divides by the maximum absolute value. Preserves sparsity. |
| RobustScaler | | Centers by the median and scales by the interquartile range (IQR). Robust to outliers. |
| L1Normalizer | | Normalizes each sample to unit L1 norm. |
| L2Normalizer | | Normalizes each sample to unit L2 norm. |
| PowerTransformer | | Yeo-Johnson or Box-Cox power transform to make features more Gaussian-like. Uses C-level fit. |
| QuantileTransformer | nQuantiles=1000 | Maps features to a uniform or normal distribution via quantile landmarks. C-accelerated. |
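
To make the difference between standardization and min-max scaling concrete, here is a minimal sketch in plain PHP. It works on a single feature column held in a plain array and does not use the PML classes. The helper names `standardizeColumn` and `minMaxColumn` are hypothetical, and the use of the population standard deviation is an assumption, since the page does not say which variant StandardScaler uses.

```php
<?php
declare(strict_types=1);

/**
 * Standardize one feature column: subtract the mean, divide by the
 * population standard deviation (assumed variant). A constant column
 * is mapped to zeros to avoid division by zero.
 */
function standardizeColumn(array $column): array
{
    $n = count($column);
    $mean = array_sum($column) / $n;

    $sumSq = 0.0;
    foreach ($column as $value) {
        $sumSq += ($value - $mean) ** 2;
    }
    $std = sqrt($sumSq / $n);

    return array_map(
        fn (float $value): float => $std > 0.0 ? ($value - $mean) / $std : 0.0,
        $column
    );
}

/** Rescale one feature column linearly so it spans [$min, $max]. */
function minMaxColumn(array $column, float $min = 0.0, float $max = 1.0): array
{
    $lo = min($column);
    $range = max($column) - $lo;

    return array_map(
        fn (float $value): float => $range > 0.0
            ? $min + ($value - $lo) / $range * ($max - $min)
            : $min,
        $column
    );
}

// One outlier (100.0) squeezes the other min-max values toward 0.
$column = [1.0, 2.0, 3.0, 4.0, 100.0];
$z = standardizeColumn($column);
$m = minMaxColumn($column);
```

In this example the outlier dominates the min-max range, which is the sensitivity noted in the MinMaxScaler row.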

Encoding

| Class | Parameters | Description |
|---|---|---|
| OrdinalEncoder | columns=[] | Maps string categories to integer codes per column. Deterministic sort order. |
| OneHotLabelEncoder | | Expands integer class labels to binary one-hot vectors. For multi-class output encoding. |
| CategoricalEncoder | | Encodes string feature columns as integer indices. Stores a vocabulary per column. |
| BooleanConverter | | Converts booleans and boolean-like strings ('true'/'false', 'yes'/'no') to 0.0/1.0. |
| NumericStringConverter | | Parses numeric string columns ('3.14') to float. Non-parsable values become NaN. |
| IntervalDiscretizer | bins=5 | Bins continuous features into equal-width intervals. Returns integer bin indices. |
| TargetEncoder | | Replaces categorical levels with the (smoothed) target mean. fit() must be called with labels present. |
| PolynomialExpander | degree=2 | Adds polynomial and interaction features. |
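
The idea behind ordinal and one-hot encoding can be sketched in a few lines of plain PHP. This is an illustration on plain arrays, not the PML implementation. `fitOrdinal` and `oneHot` are hypothetical helper names, and sorting the categories as strings is only one way to get the deterministic ordering the table mentions.

```php
<?php
declare(strict_types=1);

/**
 * Build a category => integer code map from the observed values.
 * Sorting first makes the codes independent of input order.
 */
function fitOrdinal(array $values): array
{
    $categories = array_values(array_unique($values));
    sort($categories, SORT_STRING);

    return array_flip($categories);
}

/** Expand an integer class label into a one-hot vector. */
function oneHot(int $label, int $numClasses): array
{
    $row = array_fill(0, $numClasses, 0.0);
    $row[$label] = 1.0;

    return $row;
}

$cities = ['paris', 'oslo', 'paris', 'lima'];

// Fit: lima => 0, oslo => 1, paris => 2
$vocab = fitOrdinal($cities);

// Transform: look each value up in the fitted vocabulary.
$codes = array_map(fn (string $city): int => $vocab[$city], $cities);

// One-hot expansion of the integer codes.
$encoded = array_map(
    fn (int $code): array => oneHot($code, count($vocab)),
    $codes
);
```

The sketch does not handle categories that appear only at transform time; how PML treats unseen categories is not stated on this page.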

Text Vectorizers

| Class | Parameters | Description |
|---|---|---|
| WordCountVectorizer | maxFeatures=null, textColumn='text' | Bag-of-words counts. Builds a vocabulary on fit() and outputs a sparse-like count matrix. |
| TfIdfTransformer | | Transforms raw term counts to TF-IDF weights. Apply after WordCountVectorizer. |
| BM25Transformer | k1=1.2, b=0.75 | BM25 ranking weighting. Often a better fit than TF-IDF for information retrieval tasks. |
| TokenHashingVectorizer | | Hashing-trick vectorizer: no vocabulary is stored. Fixed output dimension, collision-tolerant. |
| TextNormalizer | | Lowercases, removes punctuation, trims whitespace. ASCII-only. |
| MultibyteTextNormalizer | stripDiacritics=false | Unicode-aware text normalization. Optionally strips diacritics (accents). |
| HtmlStripper | | Strips HTML tags from text columns. |
| StopWordFilter | stopWords: string[] | Removes the specified stop words from tokenized text. |
| RegexFilter | | Removes tokens matching a regex pattern from text columns. |
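
The two-step flow of counting words and then reweighting the counts can be sketched in plain PHP as follows. This is not the PML implementation: `countVectorize` and `tfIdf` are hypothetical helpers, tokenization is a simple whitespace split, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is an assumed variant, since the page does not state which formula TfIdfTransformer uses.

```php
<?php
declare(strict_types=1);

/** Build a sorted vocabulary and a document-by-term count matrix. */
function countVectorize(array $docs): array
{
    $tokenized = array_map(
        fn (string $doc): array => preg_split('/\s+/', strtolower(trim($doc)), -1, PREG_SPLIT_NO_EMPTY),
        $docs
    );

    $vocab = array_values(array_unique(array_merge(...$tokenized)));
    sort($vocab, SORT_STRING);
    $index = array_flip($vocab);

    $matrix = [];
    foreach ($tokenized as $tokens) {
        $row = array_fill(0, count($vocab), 0);
        foreach ($tokens as $token) {
            $row[$index[$token]]++;
        }
        $matrix[] = $row;
    }

    return [$vocab, $matrix];
}

/** Reweight raw counts by a smoothed inverse document frequency. */
function tfIdf(array $counts): array
{
    $nDocs = count($counts);
    $nTerms = count($counts[0]);

    $idf = [];
    for ($j = 0; $j < $nTerms; $j++) {
        $df = 0; // number of documents containing term $j
        foreach ($counts as $row) {
            if ($row[$j] > 0) {
                $df++;
            }
        }
        $idf[$j] = log((1 + $nDocs) / (1 + $df)) + 1.0;
    }

    return array_map(
        fn (array $row): array => array_map(
            fn (int $count, float $weight): float => $count * $weight,
            $row,
            $idf
        ),
        $counts
    );
}

$docs = ['the cat sat', 'the dog sat', 'the cat ran'];
[$vocab, $counts] = countVectorize($docs);
$weights = tfIdf($counts);
```

Terms that occur in every document, such as "the" here, keep the lowest weight, while terms unique to one document are weighted up.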

Image Transformers

| Class | Parameters | Description |
|---|---|---|
| ImageResizer | targetWidth, targetHeight | Bilinear resize for image datasets stored as pixel rows. |
| ImageCropper | xOffset, yOffset, width, height | Fixed-region crop from an image tensor. |
| ImageRotator | | Random rotation augmentation (configurable angle range). |
| ImageVectorizer | maxPixelValue=255.0 | Flattens image tensors to 1-D feature vectors and scales pixel values to [0, 1]. |
| ColorSpaceConverter | | Converts between RGB, HSV, and LAB color spaces. |

Feature Selection & Projection

| Class | Parameters | Description |
|---|---|---|
| VarianceThreshold | minVariance=1e-4 | Removes features whose variance falls below the threshold. Fast unsupervised selection. |
| SelectKBest | k=10 | Selects the K features with the highest ANOVA F-score (supervised). Requires a labeled dataset. |
| PrincipalComponentAnalysis | nComponents: int | PCA via SVD. Projects features onto the top-k principal components. |
| LinearDiscriminantAnalysis | nComponents=null | Supervised dimensionality reduction maximizing class separability. nComponents ≤ C−1, where C is the number of classes. |
| TruncatedSVD | nComponents=2 | Latent Semantic Analysis: SVD without mean centering. Suitable for sparse matrices. |
| GaussianRandomProjector | nComponents=100 | Random Gaussian projection for fast approximate dimensionality reduction (Johnson-Lindenstrauss). |
| SparseRandomProjector | | Sparse sign random projection. A memory-efficient alternative to Gaussian projection. |
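
Variance-based selection is the simplest method in this table, and a short plain-PHP sketch shows the mechanics. It operates on plain nested arrays rather than PML datasets. `columnVariances` and `keepHighVariance` are hypothetical helper names, and the population variance is an assumption.

```php
<?php
declare(strict_types=1);

/** Population variance of every column in a row-major sample matrix. */
function columnVariances(array $samples): array
{
    $n = count($samples);
    $variances = [];

    foreach (array_keys($samples[0]) as $j) {
        $column = array_column($samples, $j);
        $mean = array_sum($column) / $n;

        $sumSq = 0.0;
        foreach ($column as $value) {
            $sumSq += ($value - $mean) ** 2;
        }
        $variances[$j] = $sumSq / $n;
    }

    return $variances;
}

/** Drop every column whose variance falls below $minVariance. */
function keepHighVariance(array $samples, float $minVariance = 1e-4): array
{
    // Indices of the columns that survive the threshold.
    $keep = array_keys(array_filter(
        columnVariances($samples),
        fn (float $variance): bool => $variance >= $minVariance
    ));

    return array_map(
        fn (array $row): array => array_map(fn (int $j) => $row[$j], $keep),
        $samples
    );
}

// Column 0 is constant, so it carries no information and is removed.
$samples = [
    [1.0, 5.0, 0.1],
    [1.0, 6.0, 0.9],
    [1.0, 7.0, 0.5],
];
$reduced = keepHighVariance($samples);
```

Because no labels are involved, the same column indices can be computed once on the training set and reused for any later data.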

Imputers

| Class | Parameters | Description |
|---|---|---|
| Imputer | | Simple imputer: fills missing values with the column mean, median, most frequent value, or a constant, depending on the configured strategy. |
| KNNImputer | k=5 | Imputes missing values using the mean of the K nearest (non-missing) neighbors. More accurate, but slower. |
Class Imbalance

| Class | Parameters | Description |
|---|---|---|
| SMOTE | amount=100, k=5 | Synthetic Minority Over-sampling. Generates synthetic minority-class samples by interpolating between K nearest neighbors. |
| TomekLinks | | Under-sampling by removing majority-class samples that form Tomek links (near-boundary pairs). |
| NeighborhoodClearing | k=3 | Edited Nearest Neighbors: removes borderline samples to clean the decision boundary. |
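
The interpolation step at the heart of SMOTE can be sketched in plain PHP. This shows how a single synthetic sample is produced from one minority point and one of its nearest minority neighbors. It is not the PML implementation: the helper names are hypothetical, Euclidean distance is assumed, and how the `amount` parameter maps to the number of generated samples is not covered.

```php
<?php
declare(strict_types=1);

/** Squared Euclidean distance between two equally sized vectors. */
function squaredDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return $sum;
}

/** Indices of the $k minority samples closest to sample $index. */
function nearestNeighbors(array $minority, int $index, int $k): array
{
    $distances = [];
    foreach ($minority as $j => $candidate) {
        if ($j !== $index) {
            $distances[$j] = squaredDistance($minority[$index], $candidate);
        }
    }
    asort($distances); // ascending, keys preserved

    return array_slice(array_keys($distances), 0, $k);
}

/** Point on the segment from $a to $b; $gap = 0 gives $a, 1 gives $b. */
function interpolate(array $a, array $b, float $gap): array
{
    return array_map(
        fn (float $x, float $y): float => $x + $gap * ($y - $x),
        $a,
        $b
    );
}

$minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 3.0], [9.0, 9.0]];

// Two nearest minority neighbors of sample 0.
$neighbors = nearestNeighbors($minority, 0, 2);

// Pick one neighbor and a random position along the connecting segment.
$pick = $neighbors[mt_rand(0, count($neighbors) - 1)];
$gap = mt_rand() / mt_getrandmax();
$synthetic = interpolate($minority[0], $minority[$pick], $gap);
```

Every synthetic point lies on a line segment between two existing minority samples, so the distant sample at [9.0, 9.0] never influences points generated around sample 0 when k=2.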

Other Transformers

| Class | Parameters | Description |
|---|---|---|
| LambdaFunction | fn: callable | Wraps any PHP callable as a stateless transformer. Useful for custom feature engineering steps. |

Pipeline Example

use Pml\Pipeline;
use Pml\Transformers\StandardScaler;
use Pml\Transformers\Imputer;
use Pml\Transformers\OrdinalEncoder;
use Pml\Transformers\SelectKBest;
use Pml\Transformers\SMOTE;
use Pml\Estimators\Classifiers\GBDTClassifier;

$pipeline = new Pipeline([
    new OrdinalEncoder(columns: ['gender', 'city']),
    new Imputer(),                        // fill NaNs with column mean
    new StandardScaler(),
    new SelectKBest(k: 20),              // keep top-20 ANOVA features
    new SMOTE(amount: 150, k: 5),       // oversample minority class
    new GBDTClassifier(nEstimators: 200),
]);

$pipeline->train($trainSet);
$preds = $pipeline->predict($testSet);

// Text classification pipeline
use Pml\Transformers\TextNormalizer;
use Pml\Transformers\StopWordFilter;
use Pml\Transformers\WordCountVectorizer;
use Pml\Transformers\TfIdfTransformer;
use Pml\Estimators\Classifiers\MultinomialNB;

$pipeline = new Pipeline([
    new TextNormalizer(),
    new StopWordFilter(['the', 'a', 'is', 'are']),
    new WordCountVectorizer(maxFeatures: 10_000),
    new TfIdfTransformer(),
    new MultinomialNB(alpha: 0.1),
]);

$pipeline->train($trainSet);
$sentiment = $pipeline->predict($testSet);
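
Stateless vs Stateful Transformers

Stateful transformers such as scalers, encoders, and vectorizers compute their parameters on fit() and store them. Call fit() on the training data only, then transform() on the test data, so that test-set statistics never leak into the fitted parameters. The fitTransform() shortcut does both in one call and should likewise be used on training data only. Stateless transformers such as LambdaFunction have no parameters to learn.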
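
The fit-on-train, transform-on-test rule can be illustrated with a tiny self-contained class. `MeanCenterer` is a hypothetical stand-in written for this sketch. It mirrors the fit() / transform() / fitTransform() method names of the Transformer interface but works on plain arrays instead of PML Dataset objects.

```php
<?php
declare(strict_types=1);

/** Hypothetical stateful transformer: subtracts per-column training means. */
final class MeanCenterer
{
    /** @var float[]|null Column means learned by fit(). */
    private ?array $means = null;

    public function fit(array $samples): void
    {
        $n = count($samples);
        $this->means = [];
        foreach (array_keys($samples[0]) as $j) {
            $this->means[$j] = array_sum(array_column($samples, $j)) / $n;
        }
    }

    public function transform(array $samples): array
    {
        if ($this->means === null) {
            throw new LogicException('Call fit() before transform().');
        }
        $means = $this->means;

        return array_map(
            fn (array $row): array => array_map(
                fn (float $value, float $mean): float => $value - $mean,
                $row,
                $means
            ),
            $samples
        );
    }

    public function fitTransform(array $samples): array
    {
        $this->fit($samples);

        return $this->transform($samples);
    }
}

$train = [[1.0, 10.0], [3.0, 30.0]];
$test = [[2.0, 25.0]];

$centerer = new MeanCenterer();
$trainOut = $centerer->fitTransform($train); // learns means [2.0, 20.0]
$testOut = $centerer->transform($test);      // reuses the training means
```

Calling fitTransform() on the test set instead would recompute the means from test data, which is the leakage this rule is meant to prevent.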