Transformers
44 data transformers covering scaling, encoding, text vectorization, image transforms,
feature selection, imputation, and class-imbalance correction.
All implement the Transformer interface.
Transformer Interface
All transformers implement Pml\Interfaces\Transformer:
```php
interface Transformer
{
    public function fit(Dataset $dataset): void;
    public function transform(Dataset $dataset): Dataset;
    public function fitTransform(Dataset $dataset): Dataset; // default: fit + transform
}
```
Transformers that implement Stateful additionally support getStateDict() / loadStateDict() for SafeTensors checkpoint I/O.
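
The fit/transform contract and the state-dict round trip can be sketched without the library. This is a minimal illustration, not Pml code: `MeanCenterer` is a hypothetical class, plain arrays stand in for `Dataset`, and the `getStateDict()` / `loadStateDict()` signatures shown here are assumptions.

```php
<?php
// Hypothetical stand-in: plain arrays replace Pml's Dataset so the sketch runs on its own.
final class MeanCenterer
{
    /** @var float[] per-column means learned by fit() */
    private array $means = [];

    public function fit(array $rows): void
    {
        $this->means = [];
        foreach (array_keys($rows[0]) as $j) {
            $this->means[$j] = array_sum(array_column($rows, $j)) / count($rows);
        }
    }

    public function transform(array $rows): array
    {
        // Subtract the stored mean of each column from every row.
        return array_map(
            fn(array $row) => array_map(fn($value, $mean) => $value - $mean, $row, $this->means),
            $rows
        );
    }

    public function fitTransform(array $rows): array
    {
        $this->fit($rows);
        return $this->transform($rows);
    }

    // Assumed shape of the state dict: a name => values map.
    public function getStateDict(): array
    {
        return ['means' => $this->means];
    }

    public function loadStateDict(array $state): void
    {
        $this->means = $state['means'];
    }
}

$train = [[1.0, 10.0], [3.0, 30.0]];
$centerer = new MeanCenterer();
$centered = $centerer->fitTransform($train);

// Restore the fitted state into a fresh instance and apply it to unseen rows.
$restored = new MeanCenterer();
$restored->loadStateDict($centerer->getStateDict());
$unseen = $restored->transform([[2.0, 25.0]]);
```

The restored instance transforms new rows without refitting, which is the property checkpointing relies on.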
Scalers & Normalizers
Namespace: Pml\Transformers\
| Class | Parameters | Description |
|---|---|---|
| StandardScaler | — | Subtracts the mean and divides by the standard deviation, giving μ=0 and σ=1 per feature. The most common default. |
| ZScaleStandardizer | center=true | Same as StandardScaler, but centering can be switched off (useful for sparse data). |
| MinMaxScaler | min=0.0, max=1.0 | Scales each feature to [min, max]. Sensitive to outliers. |
| MaxAbsScaler | — | Divides by the maximum absolute value. Preserves sparsity. |
| RobustScaler | — | Centers by the median and scales by the interquartile range (IQR). Robust to outliers. |
| L1Normalizer | — | Normalizes each sample to unit L1 norm. |
| L2Normalizer | — | Normalizes each sample to unit L2 norm. |
| PowerTransformer | — | Yeo-Johnson or Box-Cox power transform that makes features more Gaussian-like. The fit runs at the C level. |
| QuantileTransformer | nQuantiles=1000 | Maps each feature to a uniform or normal distribution via quantile landmarks. C-accelerated. |
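
To make the scaling formulas concrete, here is a small plain-PHP sketch applied to one feature column containing an outlier. The helper names (`standardize`, `minMax`, `maxAbs`) are hypothetical and are not Pml functions; the use of the population standard deviation is an assumption.

```php
<?php
// Plain-PHP illustration of the formulas; these are not the Pml implementations.
// Each helper assumes the column is not constant (otherwise it divides by zero).

function standardize(array $x): array
{
    $n = count($x);
    $mean = array_sum($x) / $n;
    // Population variance (divide by n); a library may use n - 1 instead.
    $var = array_sum(array_map(fn($v) => ($v - $mean) ** 2, $x)) / $n;
    $std = sqrt($var);
    return array_map(fn($v) => ($v - $mean) / $std, $x);
}

function minMax(array $x, float $min = 0.0, float $max = 1.0): array
{
    $lo = min($x);
    $hi = max($x);
    return array_map(fn($v) => $min + ($v - $lo) / ($hi - $lo) * ($max - $min), $x);
}

function maxAbs(array $x): array
{
    $m = max(array_map('abs', $x));
    return array_map(fn($v) => $v / $m, $x);
}

$feature = [1.0, 2.0, 3.0, 4.0, 100.0]; // 100.0 is an outlier

$z  = standardize($feature); // zero mean, unit variance
$mm = minMax($feature);      // the outlier maps to 1.0, the rest crowd near 0.0
$ma = maxAbs($feature);      // zeros stay zero, so sparsity is preserved
```

With min-max scaling the four ordinary values all land below 0.04, which is what "sensitive to outliers" means in practice.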
Encoding
| Class | Parameters | Description |
|---|---|---|
| OrdinalEncoder | columns=[] | Maps string categories to integer codes per column. Deterministic sort order. |
| OneHotLabelEncoder | — | Expands integer class labels to binary one-hot vectors. For multi-class output encoding. |
| CategoricalEncoder | — | Encodes string feature columns as integer indices. Stores a vocabulary per column. |
| BooleanConverter | — | Converts boolean and string-boolean values ('true'/'false', 'yes'/'no') to 0.0/1.0. |
| NumericStringConverter | — | Parses numeric string columns ('3.14') to float. Non-parsable values become NaN. |
| IntervalDiscretizer | bins=5 | Bins continuous features into equal-width intervals. Returns integer bin indices. |
| TargetEncoder | — | Replaces categorical levels with the smoothed target mean. fit() must be called on a labeled dataset. |
| PolynomialExpander | degree=2 | Adds polynomial and interaction features. |
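
The difference between ordinal codes and one-hot vectors is easiest to see on a tiny column. The sketch below is plain PHP with hypothetical helpers (`fitOrdinal`, `oneHot`), not the Pml encoders; it assumes every value seen at transform time was present when the vocabulary was built.

```php
<?php
// Build a category => code map with a deterministic (sorted) order.
function fitOrdinal(array $values): array
{
    $categories = array_values(array_unique($values));
    sort($categories, SORT_STRING);
    return array_flip($categories);
}

// Expand an integer code into a one-hot vector of length $nClasses.
function oneHot(int $code, int $nClasses): array
{
    $vec = array_fill(0, $nClasses, 0.0);
    $vec[$code] = 1.0;
    return $vec;
}

$cities = ['paris', 'berlin', 'paris', 'oslo'];

$vocab = fitOrdinal($cities);                       // berlin => 0, oslo => 1, paris => 2
$codes = array_map(fn($c) => $vocab[$c], $cities);  // one integer per row
$hot   = oneHot($codes[0], count($vocab));          // 'paris' as a one-hot vector
```

Sorting before assigning codes is what makes the mapping reproducible regardless of row order.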
Text Vectorizers
| Class | Parameters | Description |
|---|---|---|
| WordCountVectorizer | maxFeatures=null, textColumn='text' | Bag-of-words counts. Builds a vocabulary on fit() and outputs a sparse-like count matrix. |
| TfIdfTransformer | — | Transforms raw term counts to TF-IDF weights. Apply after WordCountVectorizer. |
| BM25Transformer | k1=1.2, b=0.75 | BM25 ranking weights. Often preferred over TF-IDF for information retrieval because it saturates term frequency (k1) and normalizes for document length (b). |
| TokenHashingVectorizer | — | Hashing-trick vectorizer: no vocabulary is stored. Fixed output dimension, tolerant of collisions. |
| TextNormalizer | — | Lowercases, removes punctuation, trims whitespace. ASCII-only. |
| MultibyteTextNormalizer | stripDiacritics=false | Unicode-aware text normalization. Optionally strips diacritics (accents). |
| HtmlStripper | — | Strips HTML tags from text columns. |
| StopWordFilter | stopWords: string[] | Removes the specified stop words from tokenized text. |
| RegexFilter | — | Removes tokens matching a regex pattern from text columns. |
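
The count-then-reweight flow (WordCountVectorizer followed by TfIdfTransformer) can be illustrated in plain PHP. This is a sketch, not the Pml implementation: whitespace tokenization and the unsmoothed `log(N / df)` IDF are assumptions, and libraries commonly use a smoothed variant.

```php
<?php
$docs = ['the cat sat', 'the dog sat', 'the cat ran'];

// Bag-of-words: build a sorted vocabulary, then one count vector per document.
$tokenized = array_map(fn($d) => explode(' ', $d), $docs);
$vocab = array_values(array_unique(array_merge(...$tokenized)));
sort($vocab);

$counts = [];
foreach ($tokenized as $tokens) {
    $tf = array_count_values($tokens);
    $counts[] = array_map(fn($term) => $tf[$term] ?? 0, $vocab);
}

// Inverse document frequency, plain log(N / df) form (an assumption).
$n = count($docs);
$idf = [];
foreach ($vocab as $j => $term) {
    $df = count(array_filter($counts, fn($row) => $row[$j] > 0));
    $idf[$j] = log($n / $df);
}

// TF-IDF: scale each count by the IDF of its term.
$tfidf = array_map(
    fn($row) => array_map(fn($c, $w) => $c * $w, $row, $idf),
    $counts
);
```

A term that appears in every document ('the') gets an IDF of zero here, so it contributes nothing to any document vector.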
Image Transformers
| Class | Parameters | Description |
|---|---|---|
| ImageResizer | targetWidth, targetHeight | Bilinear resize for image datasets stored as pixel rows. |
| ImageCropper | xOffset, yOffset, width, height | Fixed-region crop from an image tensor. |
| ImageRotator | — | Random rotation augmentation (configurable angle range). |
| ImageVectorizer | maxPixelValue=255.0 | Flattens image tensors to 1-D feature vectors and scales pixel values to [0, 1]. |
| ColorSpaceConverter | — | Converts between RGB, HSV, and LAB color spaces. |
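
As an illustration of what flattening and pixel scaling involve, the following plain-PHP sketch mirrors the ImageVectorizer description. `vectorizeImage` is a hypothetical helper and the nested-array image layout is an assumption, not the Pml tensor format.

```php
<?php
// Flatten a nested pixel array (any depth) and scale values into [0, 1].
function vectorizeImage(array $image, float $maxPixelValue = 255.0): array
{
    $flat = [];
    array_walk_recursive($image, function ($pixel) use (&$flat, $maxPixelValue) {
        $flat[] = $pixel / $maxPixelValue;
    });
    return $flat;
}

// A 2x2 grayscale image stored as rows of pixel intensities.
$image = [
    [0, 255],
    [51, 102],
];

$vector = vectorizeImage($image); // four values in row-major order
```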
Feature Selection & Projection
| Class | Parameters | Description |
|---|---|---|
| VarianceThreshold | minVariance=1e-4 | Removes features whose variance falls below the threshold. Fast unsupervised selection. |
| SelectKBest | k=10 | Selects the K features with the highest ANOVA F-score (supervised). Requires a labeled dataset. |
| PrincipalComponentAnalysis | nComponents: int | PCA via SVD. Projects features onto the top-k principal components. |
| LinearDiscriminantAnalysis | nComponents=null | Supervised dimensionality reduction maximizing class separability. nComponents ≤ C−1, where C is the number of classes. |
| TruncatedSVD | nComponents=2 | Latent Semantic Analysis: SVD without mean centering. Suitable for sparse matrices. |
| GaussianRandomProjector | nComponents=100 | Random Gaussian projection for fast approximate dimensionality reduction (Johnson-Lindenstrauss). |
| SparseRandomProjector | — | Sparse sign random projection. A memory-efficient alternative to Gaussian projection. |
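
Variance-based selection is the simplest of these to show end to end. The sketch below is plain PHP with hypothetical helpers (`selectByVariance`, `keepColumns`); it uses the population variance, which is an assumption about how the threshold is computed.

```php
<?php
// Return the indices of columns whose variance is at least $minVariance.
function selectByVariance(array $rows, float $minVariance = 1e-4): array
{
    $n = count($rows);
    $keep = [];
    foreach (array_keys($rows[0]) as $j) {
        $col = array_column($rows, $j);
        $mean = array_sum($col) / $n;
        $var = array_sum(array_map(fn($v) => ($v - $mean) ** 2, $col)) / $n;
        if ($var >= $minVariance) {
            $keep[] = $j;
        }
    }
    return $keep;
}

// Project every row onto the kept column indices.
function keepColumns(array $rows, array $keep): array
{
    return array_map(
        fn($row) => array_map(fn($j) => $row[$j], $keep),
        $rows
    );
}

$rows = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 1.0],
    [1.0, 9.0, 0.0],
];

$keep    = selectByVariance($rows);   // column 0 is constant and is dropped
$reduced = keepColumns($rows, $keep);
```

The kept indices are computed once and can then be applied to any later dataset with the same columns.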
Imputers
| Class | Parameters | Description |
|---|---|---|
| Imputer | — | Simple imputer: fills missing values with the column mean, median, most frequent value, or a constant, depending on the configured strategy. |
| KNNImputer | k=5 | Imputes missing values using the mean of the K nearest (non-missing) neighbors. Typically more accurate than simple imputation, but slower. |
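
Mean imputation, the simplest strategy listed above, can be sketched in plain PHP as follows. The helpers `fitColumnMeans` and `imputeMeans` are hypothetical; the sketch assumes missing values are encoded as NaN and that every column has at least one observed value.

```php
<?php
// Learn one mean per column, ignoring NaN entries.
function fitColumnMeans(array $rows): array
{
    $means = [];
    foreach (array_keys($rows[0]) as $j) {
        $present = array_filter(array_column($rows, $j), fn($v) => !is_nan($v));
        $means[$j] = array_sum($present) / count($present);
    }
    return $means;
}

// Replace each NaN with the stored mean of its column.
function imputeMeans(array $rows, array $means): array
{
    return array_map(
        fn($row) => array_map(fn($v, $m) => is_nan($v) ? $m : $v, $row, $means),
        $rows
    );
}

$train = [
    [1.0, NAN],
    [3.0, 4.0],
    [NAN, 8.0],
];

$means  = fitColumnMeans($train);       // learned from observed values only
$filled = imputeMeans($train, $means);
```

Keeping the fit and the fill as separate steps lets the means learned on training data be reused on later data.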
Class Imbalance
| Class | Parameters | Description |
|---|---|---|
| SMOTE | amount=100, k=5 | Synthetic Minority Over-sampling Technique. Generates synthetic minority-class samples by interpolating between a sample and one of its K nearest neighbors. |
| TomekLinks | — | Under-sampling by removing majority-class samples that form Tomek links (near-boundary pairs). |
| NeighborhoodClearing | k=3 | Edited Nearest Neighbors: removes borderline samples to clean the decision boundary. |
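
The interpolation step at the heart of SMOTE can be sketched in a few lines of plain PHP. `smoteSample` and `squaredDistance` are hypothetical helpers, and this is the textbook procedure under simple assumptions (Euclidean distance, brute-force neighbor search), not the Pml implementation.

```php
<?php
function squaredDistance(array $a, array $b): float
{
    return array_sum(array_map(fn($x, $y) => ($x - $y) ** 2, $a, $b));
}

// Create one synthetic point from the minority sample at $index.
function smoteSample(array $minority, int $index, int $k): array
{
    $sample = $minority[$index];

    // Candidate neighbors: every other minority sample.
    $others = $minority;
    unset($others[$index]);
    usort($others, fn($p, $q) => squaredDistance($sample, $p) <=> squaredDistance($sample, $q));
    $neighbors = array_slice($others, 0, $k);

    // Pick one of the k nearest at random and step a random fraction toward it.
    $neighbor = $neighbors[array_rand($neighbors)];
    $gap = mt_rand() / mt_getrandmax(); // uniform in [0, 1]

    return array_map(fn($a, $b) => $a + $gap * ($b - $a), $sample, $neighbor);
}

$minority = [[1.0, 1.0], [2.0, 2.0], [1.0, 2.0], [8.0, 8.0]];

$synthetic = smoteSample($minority, 0, 2); // lies on a segment from [1, 1] to a near neighbor
```

Because the new point lies on the line segment between two real minority samples, it stays inside the region the minority class already occupies.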
Other Transformers
| Class | Parameters | Description |
|---|---|---|
| LambdaFunction | fn: callable | Wraps any PHP callable as a stateless transformer. Useful for custom feature engineering steps. |
Pipeline Example
```php
// Tabular classification pipeline
use Pml\Pipeline;
use Pml\Transformers\StandardScaler;
use Pml\Transformers\Imputer;
use Pml\Transformers\OrdinalEncoder;
use Pml\Transformers\SelectKBest;
use Pml\Transformers\SMOTE;
use Pml\Estimators\Classifiers\GBDTClassifier;

$pipeline = new Pipeline([
    new OrdinalEncoder(columns: ['gender', 'city']),
    new Imputer(),                   // fill NaNs with column mean
    new StandardScaler(),
    new SelectKBest(k: 20),          // keep top-20 ANOVA features
    new SMOTE(amount: 150, k: 5),    // oversample minority class
    new GBDTClassifier(nEstimators: 200),
]);

$pipeline->train($trainSet);
$preds = $pipeline->predict($testSet);
```

```php
// Text classification pipeline
use Pml\Pipeline;
use Pml\Transformers\TextNormalizer;
use Pml\Transformers\StopWordFilter;
use Pml\Transformers\WordCountVectorizer;
use Pml\Transformers\TfIdfTransformer;
use Pml\Estimators\Classifiers\MultinomialNB;

$pipeline = new Pipeline([
    new TextNormalizer(),
    new StopWordFilter(['the', 'a', 'is', 'are']),
    new WordCountVectorizer(maxFeatures: 10_000),
    new TfIdfTransformer(),
    new MultinomialNB(alpha: 0.1),
]);

$pipeline->train($trainSet);
$sentiment = $pipeline->predict($testSet);
```
Stateless vs Stateful Transformers
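Stateful transformers (scalers, encoders, and vectorizers) compute their parameters on fit() and store them. Stateless transformers, such as LambdaFunction, have no parameters to learn.
Call fit() on the training data only, then transform() on the test data using those fitted parameters. Fitting on test data leaks information about it into the model.
The fitTransform() shortcut does both in one call and should be used on training data only.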
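
The train-only fitting rule can be made concrete with a min-max example in plain PHP. `fitMinMax` and `applyMinMax` are hypothetical helpers standing in for a stateful scaler's fit() and transform(); they are not Pml functions.

```php
<?php
// "fit": learn the range from the training column only.
function fitMinMax(array $column): array
{
    return ['lo' => min($column), 'hi' => max($column)];
}

// "transform": apply a previously learned range to any column.
function applyMinMax(array $column, array $params): array
{
    $range = $params['hi'] - $params['lo'];
    return array_map(fn($v) => ($v - $params['lo']) / $range, $column);
}

$train = [10.0, 20.0, 30.0];
$test  = [25.0, 40.0];

$params      = fitMinMax($train);            // parameters come from training data only
$trainScaled = applyMinMax($train, $params);
$testScaled  = applyMinMax($test, $params);  // test values may fall outside [0, 1]
```

A test value above the training maximum scales to more than 1. That is expected: refitting on the test set to force it back into [0, 1] would be the leakage the rule above warns against.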