Pipeline, Cross-Validation & Autograd

This page covers the Pipeline, which chains transformers with a terminal estimator; six cross-validation strategies for model evaluation; and an autograd engine for custom gradient computations outside of Sequential.

Pipeline

Pml\Pipeline chains zero or more Transformer steps followed by a terminal Learner. On train(), transformers are fit on the training data in order; on predict(), the same transformers are applied before calling the estimator. Implements Learner and Persistable — it can be saved and loaded atomically.

__construct (Transformer[] $transformers, Learner $estimator)
use Pml\Pipeline;

$pipeline = new Pipeline(
    transformers: [
        new Imputer(),
        new StandardScaler(),
        new SelectKBest(k: 15),
    ],
    estimator: new GBDTClassifier(nEstimators: 200),
);
train (Dataset $dataset, mixed ...$args): void

Fits each transformer in sequence (each on the output of the previous), then trains the estimator on the final transformed dataset. Extra $args are forwarded to the estimator's train().

predict (Dataset $dataset): Tensor

Applies each transformer's transform() (NOT fit()) in sequence, then calls the estimator's predict().
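
The train()/predict() contract described above can be sketched in plain PHP. This is only an illustration under stated assumptions, not Pml's implementation: `SketchTransformer`, `MeanCenter`, and `SketchPipeline` are hypothetical names, and they operate on plain arrays instead of Dataset and Tensor objects.

```php
<?php
// Hypothetical stand-ins for illustration only; these are NOT Pml classes.
interface SketchTransformer
{
    public function fit(array $rows): void;
    public function transform(array $rows): array;
}

// Centers a single numeric column on the mean learned during fit().
final class MeanCenter implements SketchTransformer
{
    private float $mean = 0.0;

    public function fit(array $rows): void
    {
        $this->mean = array_sum($rows) / count($rows);
    }

    public function transform(array $rows): array
    {
        return array_map(fn(float $v): float => $v - $this->mean, $rows);
    }
}

final class SketchPipeline
{
    /** @param SketchTransformer[] $steps */
    public function __construct(private array $steps) {}

    // Training path: fit each step on the output of the previous one.
    public function train(array $rows): array
    {
        foreach ($this->steps as $step) {
            $step->fit($rows);
            $rows = $step->transform($rows);
        }
        return $rows; // what the estimator would be trained on
    }

    // Prediction path: transform only; fitted statistics are reused, never re-fit.
    public function transformForPredict(array $rows): array
    {
        foreach ($this->steps as $step) {
            $rows = $step->transform($rows);
        }
        return $rows; // what the estimator would predict on
    }
}

$pipe  = new SketchPipeline([new MeanCenter()]);
$train = $pipe->train([1.0, 2.0, 3.0]);          // mean 2.0 is learned here
$test  = $pipe->transformForPredict([10.0]);     // centered with the training mean
```

Because the test value is centered with the mean learned from the training data, no statistics from the prediction set leak into the transformers.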

| Method | Description |
| --- | --- |
| trained(): bool | Whether train() has been called |
| save(string $dir): void | Serializes all transformer parameters and estimator weights to $dir/ |
| load(string $dir): static | Restores a saved pipeline |

SMOTE in a Pipeline. Some transformers (SMOTE, TomekLinks) change the number of rows. They must come before the estimator and run only during train(), not during predict(). Pipeline handles this correctly: each subsequent step receives the accumulated dataset, including any new rows.
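
To make the train-only behaviour of row-changing steps concrete, here is a small sketch. It assumes nothing about Pml's internals: `oversampleMinority` is a hypothetical helper, and it uses naive duplication as a much simpler stand-in for SMOTE, which synthesizes new points instead of copying existing ones.

```php
<?php
// Hypothetical helper, NOT part of Pml: naive oversampling by duplication
// (a simpler stand-in for SMOTE, which interpolates new synthetic rows).
function oversampleMinority(array $rows, array $labels): array
{
    $counts   = array_count_values($labels);
    $majority = max($counts);
    foreach ($counts as $class => $count) {
        // Indices of the original rows belonging to this class.
        $idx = array_keys($labels, $class);
        for ($i = $count; $i < $majority; $i++) {
            $pick     = $idx[($i - $count) % $count]; // cycle through existing rows
            $rows[]   = $rows[$pick];
            $labels[] = $class;
        }
    }
    return [$rows, $labels];
}

// Training path: the resampler runs, so the estimator sees extra rows.
[$trainRows, $trainLabels] = oversampleMinority([[0.1], [0.2], [0.9]], ['a', 'a', 'b']);

// Prediction path: the resampler is skipped, so two input rows yield two predictions.
$predictRows = [[0.5], [0.7]];
```

Skipping the resampler at prediction time matters because the caller expects exactly one prediction per input row.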

Cross-Validation

Namespace: Pml\CrossValidation\

| Class | Constructor | Description |
| --- | --- | --- |
| KFold | int $k = 5 | Shuffled K-fold. Returns average metric across folds. |
| StratifiedKFold | int $k = 5 | Stratified K-fold. Preserves class proportions in each fold. Preferred for classification. |
| HoldOut | float $ratio = 0.2 | Single train/test split. Fastest; use for large datasets. |
| MonteCarlo | int $rounds = 5, float $ratio = 0.2 | Repeated random splits (Monte Carlo cross-validation). More stable than single holdout. |
| LeavePOut | int $p = 1 | Leave-P-Out. Exhaustive but O(N^p); only practical for tiny datasets or leave-one-out (p = 1). |
| KFoldOOF | int $k = 5 | Generates out-of-fold predictions for stacking. Returns [$oofPreds, $scores]. |
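
To see why Leave-P-Out is only practical for tiny datasets, it helps to count the splits: holding out p of N samples can be done in C(N, p) ways, and each split means one model fit. The function below is a hypothetical helper written for this illustration (not part of Pml); it assumes 0 <= p <= n and values small enough to fit in a PHP integer.

```php
<?php
// Number of distinct ways to hold out $p samples from $n: the binomial
// coefficient C(n, p). Assumes 0 <= $p <= $n. Hypothetical helper, NOT Pml code.
function leavePOutSplits(int $n, int $p): int
{
    $p = min($p, $n - $p);          // C(n, p) == C(n, n - p)
    $result = 1;
    for ($i = 1; $i <= $p; $i++) {
        // Multiply before dividing; the running product is always divisible by $i.
        $result = intdiv($result * ($n - $p + $i), $i);
    }
    return $result;
}

echo leavePOutSplits(100, 1), "\n"; // 100
echo leavePOutSplits(100, 2), "\n"; // 4950
echo leavePOutSplits(100, 3), "\n"; // 161700
```

Even at N = 100, moving from p = 1 to p = 3 raises the number of model fits from 100 to 161,700.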
use Pml\CrossValidation\StratifiedKFold;

$cv     = new StratifiedKFold(k: 5);
$scores = $cv->test(new GBDTClassifier(), $dataset, metric: 'accuracy');

// Mean and (population) standard deviation of the per-fold scores
$mean = array_sum($scores) / count($scores);
$std  = sqrt(array_sum(array_map(fn($s) => ($s - $mean) ** 2, $scores)) / count($scores));

printf("CV Accuracy: %.3f ± %.3f\n", $mean, $std);
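
The idea behind stratification, keeping class proportions roughly equal in every fold, can be sketched as follows. This is one simple way to do it, not necessarily how StratifiedKFold is implemented: `stratifiedFolds` and its `$seed` parameter are hypothetical, and labels are plain strings instead of a Dataset.

```php
<?php
// Assign each sample index to one of $k folds so that every class is spread
// evenly across folds. Hypothetical helper, NOT Pml's implementation.
function stratifiedFolds(array $labels, int $k, int $seed = 42): array
{
    mt_srand($seed);
    // Group sample indices by class label.
    $byClass = [];
    foreach ($labels as $i => $label) {
        $byClass[$label][] = $i;
    }
    $folds = array_fill(0, $k, []);
    $next  = 0; // running counter so fold sizes stay balanced across classes
    foreach ($byClass as $indices) {
        shuffle($indices);                     // randomize order within the class
        foreach ($indices as $i) {
            $folds[$next++ % $k][] = $i;       // deal round-robin into folds
        }
    }
    return $folds;
}

// 8 negative and 4 positive samples, split into 4 folds.
$labels = array_merge(array_fill(0, 8, 'neg'), array_fill(0, 4, 'pos'));
$folds  = stratifiedFolds($labels, 4);
```

With these inputs every fold holds three samples, two negative and one positive, which matches the 2:1 class ratio of the full dataset.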
// Out-of-fold predictions for stacking
use Pml\CrossValidation\KFoldOOF;

$oof = new KFoldOOF(k: 5);
[$oofPreds, $scores] = $oof->run(
    fn() => new GBDTClassifier(nEstimators: 200),
    $dataset
);
// $oofPreds: Tensor [N], one prediction per sample in the dataset, each made
// by the model trained on the folds that did not contain that sample
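
How out-of-fold predictions are assembled can be shown with a deliberately trivial model that predicts the mean of its training targets. This is a sketch under simplifying assumptions (no shuffling, fold membership decided by index modulo k, at least two folds), and `oofMeanPredictions` is a hypothetical function, not KFoldOOF itself.

```php
<?php
// Out-of-fold predictions with a trivial "model": it predicts the mean of its
// training targets. Hypothetical sketch, NOT Pml's KFoldOOF.
function oofMeanPredictions(array $targets, int $k): array
{
    $n   = count($targets);
    $oof = array_fill(0, $n, null);
    for ($fold = 0; $fold < $k; $fold++) {
        // Sample $i is held out in fold ($i % $k); train on everything else.
        $train = [];
        foreach ($targets as $i => $y) {
            if ($i % $k !== $fold) {
                $train[] = $y;
            }
        }
        $prediction = array_sum($train) / count($train);
        // Fill in predictions only for the held-out samples of this fold.
        for ($i = $fold; $i < $n; $i += $k) {
            $oof[$i] = $prediction;
        }
    }
    return $oof;
}

$oof = oofMeanPredictions([1.0, 2.0, 3.0, 4.0], 2);
// Samples 0 and 2 are predicted from targets {2, 4}; samples 1 and 3 from {1, 3}.
```

No sample's prediction is influenced by its own target, which is what makes these predictions safe to use as training features for a stacking meta-estimator.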

Dataset Generators

Namespace: Pml\Datasets\Generators\. Useful for testing and prototyping.

| Generator | Description |
| --- | --- |
| Blob | Gaussian blobs. Multiple clusters, configurable separation. |
| Circle | Concentric circles. Binary classification benchmark. |
| HalfMoon | Two interleaved half-moon shapes. Classic non-linear benchmark. |
| Hyperplane | Linearly separable data. For regression or linear classifier validation. |
| SwissRoll | 3-D Swiss roll manifold. For manifold learning and t-SNE demos. |
| Agglomerate | Agglomerated clusters with variable density. |
use Pml\Datasets\Generators\HalfMoon;

$gen = new HalfMoon(n: 1000, noise: 0.1);
$ds  = $gen->generate();   // Dataset with 1000 labeled samples

Ensemble Methods

BootstrapAggregator

use Pml\BootstrapAggregator;

$bag = new BootstrapAggregator(
    base:        new DecisionTreeClassifier(maxDepth: 10),
    nEstimators: 50,
    ratio:       0.8,  // size of each bootstrap sample as a fraction of the training set
);
$bag->train($dataset);
$preds = $bag->predict($testSet);
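
The following sketch shows the two ingredients of bagging in plain PHP. It rests on assumptions the page does not confirm: that `ratio` is a fraction of the training-set size, that sampling is done with replacement, and that classifier outputs are combined by majority vote. `bootstrapIndices` and `majorityVote` are hypothetical helpers, not Pml functions.

```php
<?php
// Draw one bootstrap sample: round($ratio * $n) indices chosen with replacement.
// Hypothetical helper for illustration; NOT Pml's implementation.
function bootstrapIndices(int $n, float $ratio, int $seed): array
{
    mt_srand($seed);
    $size    = max(1, (int) round($ratio * $n));
    $indices = [];
    for ($i = 0; $i < $size; $i++) {
        $indices[] = mt_rand(0, $n - 1);   // the same index may be drawn repeatedly
    }
    return $indices;
}

// Majority vote over the class labels predicted by the individual estimators.
function majorityVote(array $votes): string
{
    $counts = array_count_values($votes);
    arsort($counts);                       // most frequent label first
    return (string) array_key_first($counts);
}

$sample = bootstrapIndices(10, 0.8, seed: 1);   // 8 indices in the range 0..9
$label  = majorityVote(['cat', 'dog', 'cat']);  // 'cat'
```

Each base estimator would be trained on its own index sample, so the ensemble averages out the variance of the individual trees.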

StackingRegressor

use Pml\Ensemble\StackingRegressor;

$stacker = new StackingRegressor(
    estimators: [
        new GBDTRegressor(nEstimators: 100),
        new Ridge(alpha: 1.0),
        new KNNRegressor(k: 10),
    ],
    metaEstimator: new LinearRegression(),
    folds: 5,
);
$stacker->train($dataset);
$preds = $stacker->predict($testSet);

Autograd

Pml\Autograd\Variable wraps a Tensor and builds a dynamic compute graph via a C-backed Tape. Use it for custom gradient computations outside of Sequential.

Partial implementation. The autograd engine currently supports four tracked operations (add, mul, matmul, relu) plus backward(). For full training loops, use Sequential, which has a complete hand-written backward pass in C.
use Pml\Autograd\Variable;
use Pml\Tensor;

$x = new Variable(Tensor::fromArray([[1.0, 2.0]]), requiresGrad: true);
$W = new Variable(Tensor::randomNormal([2, 4]),   requiresGrad: true);

$y = $x->matmul($W)->relu();
$y->backward();

$dW = $W->grad();   // dL/dW
$dx = $x->grad();   // dL/dx

| Method | Description |
| --- | --- |
| add(Variable $other): Variable | Elementwise addition, tracked |
| mul(Variable $other): Variable | Elementwise multiply, tracked |
| matmul(Variable $other): Variable | Matrix multiply, tracked |
| relu(): Variable | ReLU activation, tracked |
| backward(): void | Backprop from this node through the tape |
| data(): Tensor | Forward value tensor |
| grad(): ?Tensor | Gradient tensor (populated after backward) |
| requiresGrad(): bool | Whether gradients are tracked for this Variable |
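
For intuition about what a tracked operation and backward() do, here is a tiny scalar reverse-mode autograd in plain PHP. It is a teaching sketch only: the `Scalar` class is hypothetical and unrelated to Pml\Autograd\Variable, which works on tensors through a C-backed tape. It also propagates gradients by recursing along every path, where a real engine would walk the graph once in reverse topological order.

```php
<?php
// A tiny scalar reverse-mode autograd, for illustration only. Hypothetical
// class, NOT Pml\Autograd\Variable.
final class Scalar
{
    public float $grad = 0.0;
    /** @var array<int, array{0: Scalar, 1: float}> parent node and local derivative */
    private array $parents = [];

    public function __construct(public float $data) {}

    public function add(Scalar $o): Scalar
    {
        $out = new Scalar($this->data + $o->data);
        $out->parents = [[$this, 1.0], [$o, 1.0]];
        return $out;
    }

    public function mul(Scalar $o): Scalar
    {
        $out = new Scalar($this->data * $o->data);
        $out->parents = [[$this, $o->data], [$o, $this->data]];
        return $out;
    }

    public function relu(): Scalar
    {
        $out = new Scalar(max(0.0, $this->data));
        $out->parents = [[$this, $this->data > 0 ? 1.0 : 0.0]];
        return $out;
    }

    // Chain rule: add the upstream gradient here, then pass it on to each
    // parent scaled by the local derivative recorded in the forward pass.
    public function backward(float $upstream = 1.0): void
    {
        $this->grad += $upstream;
        foreach ($this->parents as [$parent, $local]) {
            $parent->backward($upstream * $local);
        }
    }
}

$x = new Scalar(2.0);
$w = new Scalar(-3.0);
$b = new Scalar(7.0);

$y = $x->mul($w)->add($b)->relu();   // relu(2 * -3 + 7) = 1
$y->backward();
// $x->grad is w = -3, $w->grad is x = 2, $b->grad is 1 (the ReLU is active).
```

Each operation records its inputs and local derivatives during the forward pass, which is the information a tape needs to replay the graph backwards.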