Pipeline, Cross-Validation & Autograd
The Pipeline chains transformers with a terminal estimator.
Six cross-validation strategies for model evaluation.
An autograd engine for custom gradient computations outside of Sequential.
Pipeline
Pml\Pipeline chains zero or more Transformer steps followed by a terminal
Learner. On train(), transformers are fit on the training data in order;
on predict(), the same transformers are applied before calling the estimator.
Pipeline implements Learner and Persistable, so it can be saved and loaded atomically.
$pipeline = new Pipeline(
    transformers: [
        new Imputer(),
        new StandardScaler(),
        new SelectKBest(k: 15),
    ],
    estimator: new GBDTClassifier(nEstimators: 200),
);
train() fits each transformer in sequence (each on the output of the previous), then trains the estimator on the final transformed dataset. Extra $args are forwarded to the estimator's train().
predict() applies each transformer's transform() (not fit()) in sequence, then calls the estimator's predict().
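The fit-on-train, transform-on-predict contract described above can be sketched in plain PHP. This is an illustration only, not Pml's implementation: MeanCenterer, SketchPipeline, and the closure standing in for a trained estimator are hypothetical names, and the data are plain arrays rather than Pml datasets.

```php
<?php
// Hypothetical transformer: learns the mean in fit(), subtracts it in transform().
final class MeanCenterer
{
    private float $mean = 0.0;

    public function fit(array $values): void
    {
        $this->mean = array_sum($values) / count($values);
    }

    public function transform(array $values): array
    {
        return array_map(fn(float $v): float => $v - $this->mean, $values);
    }
}

// Hypothetical pipeline: fits transformers only in train(), reuses them in predict().
final class SketchPipeline
{
    /** @param MeanCenterer[] $transformers */
    public function __construct(private array $transformers, private \Closure $estimator) {}

    public function train(array $values): void
    {
        foreach ($this->transformers as $t) {
            $t->fit($values);                 // fit on the output of the previous step
            $values = $t->transform($values); // pass the transformed data onward
        }
    }

    public function predict(array $values): array
    {
        foreach ($this->transformers as $t) {
            $values = $t->transform($values); // transform only, never refit
        }
        return array_map($this->estimator, $values);
    }
}

$pipeline = new SketchPipeline(
    [new MeanCenterer()],
    fn(float $v): string => $v >= 0.0 ? 'high' : 'low', // stand-in for a trained estimator
);
$pipeline->train([1.0, 2.0, 3.0]);         // learned mean is 2.0
$labels = $pipeline->predict([10.0, 1.0]); // centered with the training mean: 8.0, -1.0
```

Because predict() reuses the mean learned during train(), statistics of the prediction data never influence the preprocessing.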
| Method | Description |
|---|---|
| trained(): bool | Whether train() has been called |
| save(string $dir): void | Serializes all transformer parameters and estimator weights to $dir/ |
| load(string $dir): static | Restores a saved pipeline |
train() — not predict().
Pipeline handles this correctly: transformers receive the accumulated dataset including new rows.
Cross-Validation
Namespace: Pml\CrossValidation\
| Class | Constructor | Description |
|---|---|---|
| KFold | int $k = 5 | Shuffled K-fold. Returns average metric across folds. |
| StratifiedKFold | int $k = 5 | Stratified K-fold. Preserves class proportions in each fold. Preferred for classification. |
| HoldOut | float $ratio = 0.2 | Single train/test split. Fastest; use for large datasets. |
| MonteCarlo | int $rounds = 5, float $ratio = 0.2 | Repeated random splits (Monte Carlo cross-validation). More stable than single holdout. |
| LeavePOut | int $p = 1 | Leave-P-Out. Exhaustive but O(N^p); only practical for tiny datasets or LOO (p=1). |
| KFoldOOF | int $k = 5 | Generates out-of-fold predictions for stacking. Returns [$oofPreds, $scores]. |
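To make the stratification idea concrete, here is a minimal plain-PHP sketch of assigning samples to folds so that each fold keeps roughly the same class proportions. The function name stratifiedFolds is hypothetical and this is not Pml's implementation; it deals the samples of each class round-robin across folds and omits shuffling so the result is deterministic.

```php
<?php
/**
 * Assign sample indices to $k folds, class by class, in round-robin order.
 *
 * @param array<int, string|int> $labels one label per sample
 * @return array<int, int[]> fold number => sample indices
 */
function stratifiedFolds(array $labels, int $k): array
{
    // Group sample indices by class label.
    $byClass = [];
    foreach ($labels as $i => $label) {
        $byClass[$label][] = $i;
    }

    $folds = array_fill(0, $k, []);
    $next = 0; // keep rotating across classes so fold sizes stay balanced
    foreach ($byClass as $indices) {
        foreach ($indices as $i) {
            $folds[$next % $k][] = $i;
            $next++;
        }
    }
    return $folds;
}

$labels = ['a', 'a', 'a', 'a', 'b', 'b'];
$folds = stratifiedFolds($labels, 2);
// Each fold serves once as the test set; the remaining folds form the training set.
```

With four 'a' and two 'b' samples and two folds, each fold receives two 'a' samples and one 'b' sample, matching the 2:1 ratio of the full set.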
use Pml\CrossValidation\StratifiedKFold;

$cv = new StratifiedKFold(k: 5);
$scores = $cv->test(new GBDTClassifier(), $dataset, metric: 'accuracy');

// Mean and population standard deviation of the per-fold scores
$mean = array_sum($scores) / count($scores);
$variance = array_sum(array_map(fn($s) => ($s - $mean) ** 2, $scores)) / count($scores);
printf("CV Accuracy: %.3f ± %.3f\n", $mean, sqrt($variance));
// Out-of-fold predictions for stacking
use Pml\CrossValidation\KFoldOOF;
$oof = new KFoldOOF(k: 5);
[$oofPreds, $scores] = $oof->run(
    fn() => new GBDTClassifier(nEstimators: 200),
    $dataset
);
// $oofPreds: Tensor [N] — predictions for every sample in the dataset
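What "out-of-fold" means can be shown with a small plain-PHP sketch: every sample is predicted by a model that was fit without it. The function oofMeanPredictions is a hypothetical name, the "model" is simply the mean of the training targets, and folds are assigned by index modulo k without shuffling; none of this reflects how KFoldOOF is implemented.

```php
<?php
/**
 * Out-of-fold predictions with a trivial "predict the training mean" model.
 *
 * @param float[] $targets
 * @return float[] one prediction per sample, made by a model that never saw that sample
 */
function oofMeanPredictions(array $targets, int $k): array
{
    $oof = array_fill(0, count($targets), 0.0);

    for ($fold = 0; $fold < $k; $fold++) {
        $train = [];
        $held = [];
        foreach ($targets as $i => $y) {
            // Sample $i belongs to fold ($i mod $k); unshuffled for clarity.
            if ($i % $k === $fold) {
                $held[] = $i;
            } else {
                $train[] = $y;
            }
        }
        $prediction = array_sum($train) / count($train); // "fit" on the other folds
        foreach ($held as $i) {
            $oof[$i] = $prediction;                       // predict the held-out fold
        }
    }
    return $oof;
}

$oof = oofMeanPredictions([1.0, 2.0, 3.0, 4.0], 2);
```

Since no prediction is made by a model that saw its own target, a vector like this can be used as an input feature for a meta-learner without leaking labels.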
Dataset Generators
Namespace: Pml\Datasets\Generators\. Useful for testing and prototyping.
| Generator | Description |
|---|---|
| Blob | Gaussian blobs. Multiple clusters, configurable separation. |
| Circle | Concentric circles. Binary classification benchmark. |
| HalfMoon | Two interleaved half-moon shapes. Classic non-linear benchmark. |
| Hyperplane | Linearly separable data. For regression or linear classifier validation. |
| SwissRoll | 3-D Swiss roll manifold. For manifold learning and t-SNE demos. |
| Agglomerate | Agglomerated clusters with variable density. |
use Pml\Datasets\Generators\HalfMoon;
$gen = new HalfMoon(n: 1000, noise: 0.1);
$ds = $gen->generate(); // Dataset with 1000 labeled samples
Ensemble Methods
BootstrapAggregator
use Pml\BootstrapAggregator;
$bag = new BootstrapAggregator(
    base: new DecisionTreeClassifier(maxDepth: 10),
    nEstimators: 50,
    ratio: 0.8, // bootstrap sample size
);
$bag->train($dataset);
$preds = $bag->predict($testSet);
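The two ingredients of bagging, drawing a bootstrap sample per estimator and combining the estimators' votes, can be sketched in plain PHP. This assumes that ratio is the fraction of the dataset drawn with replacement for each estimator, which the source does not spell out; bootstrapIndices and majorityVote are hypothetical helpers, not part of Pml.

```php
<?php
/** Draw round($ratio * $n) row indices uniformly with replacement. */
function bootstrapIndices(int $n, float $ratio): array
{
    $size = max(1, (int) round($ratio * $n));
    $indices = [];
    for ($i = 0; $i < $size; $i++) {
        $indices[] = random_int(0, $n - 1); // the same row may be drawn more than once
    }
    return $indices;
}

/** Combine per-estimator class votes for one sample by majority. */
function majorityVote(array $votes): string
{
    $counts = array_count_values($votes);
    arsort($counts); // most frequent label first
    return (string) array_key_first($counts);
}

$sample = bootstrapIndices(10, 0.8);           // 8 indices in the range 0..9
$winner = majorityVote(['cat', 'dog', 'cat']); // votes from three estimators
```

Each base estimator would be trained on its own index sample, and the ensemble prediction for a row is the label most estimators agree on.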
StackingRegressor
use Pml\Ensemble\StackingRegressor;
$stacker = new StackingRegressor(
    estimators: [
        new GBDTRegressor(nEstimators: 100),
        new Ridge(alpha: 1.0),
        new KNNRegressor(k: 10),
    ],
    metaEstimator: new LinearRegression(),
    folds: 5,
);
$stacker->train($dataset);
$preds = $stacker->predict($testSet);
Autograd
Pml\Autograd\Variable wraps a Tensor and builds a dynamic compute graph
via a C-backed Tape. Use it for custom gradient computations outside of Sequential.
The tracked operations are add, mul, matmul, and relu, with backward() for backpropagation. For full training loops, use Sequential, which has a complete hand-written backward pass in C.
use Pml\Autograd\Variable;
use Pml\Tensor;
$x = new Variable(Tensor::fromArray([[1.0, 2.0]]), requiresGrad: true);
$W = new Variable(Tensor::randomNormal([2, 4]), requiresGrad: true);
$y = $x->matmul($W)->relu();
$y->backward();
$dW = $W->grad(); // dL/dW
$dx = $x->grad(); // dL/dx
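For reference, the gradients of an expression like the one above can be worked out by hand on plain arrays. The sketch below assumes that calling backward() on a non-scalar output is equivalent to differentiating L = sum(y), which the source does not confirm, and it uses a fixed weight matrix instead of random values so the numbers are reproducible. It does not use Pml at all.

```php
<?php
// Manual gradients of L = sum(relu(x W)) for a 1x2 input and a 2x4 weight matrix.
$x = [1.0, 2.0];
$W = [
    [1.0, -1.0,  0.5, -2.0],
    [0.5, -1.0, -1.0,  2.0],
];

// Forward: z = x W, then record where ReLU lets gradient through (z > 0).
$mask = [];
foreach (array_keys($W[0]) as $j) {
    $z = 0.0;
    foreach ($x as $i => $xi) {
        $z += $xi * $W[$i][$j];
    }
    $mask[$j] = $z > 0.0 ? 1.0 : 0.0;
}

// Backward: dL/dW[i][j] = x[i] * mask[j], dL/dx[i] = sum_j mask[j] * W[i][j].
$dW = [];
$dx = [];
foreach ($x as $i => $xi) {
    $dx[$i] = 0.0;
    foreach ($mask as $j => $m) {
        $dW[$i][$j] = $xi * $m;
        $dx[$i] += $m * $W[$i][$j];
    }
}
```

Here z = [2, -3, -1.5, 2], so only the first and last columns are active; columns where ReLU clamps the output to zero receive zero gradient.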
| Method | Description |
|---|---|
| add(Variable $other): Variable | Elementwise addition, tracked |
| mul(Variable $other): Variable | Elementwise multiply, tracked |
| matmul(Variable $other): Variable | Matrix multiply, tracked |
| relu(): Variable | ReLU activation, tracked |
| backward(): void | Backprop from this node through the tape |
| data(): Tensor | Forward value tensor |
| grad(): ?Tensor | Gradient tensor (populated after backward) |
| requiresGrad(): bool | Whether gradients are tracked for this Variable |