Pipeline, Cross-Validation & Autograd

This page covers Pipeline, which chains transformers with a terminal estimator; six cross-validation strategies for model evaluation; synthetic dataset generators; ensemble methods; and an autograd engine for custom gradient computations outside of Sequential.

Pipeline

Pml\Pipeline chains zero or more Transformer steps followed by a terminal Learner. On train(), the transformers are fit on the training data in order; on predict(), the same fitted transformers are applied before the estimator is called. Pipeline implements Learner and Persistable, so it can be saved and loaded atomically.

__construct (Transformer[] $transformers, Learner $estimator)

use Pml\Pipeline;
// Imports for the transformer and estimator classes are omitted here.

$pipeline = new Pipeline(
    transformers: [
        new Imputer(),
        new StandardScaler(),
        new SelectKBest(k: 15),
    ],
    estimator: new GBDTClassifier(nEstimators: 200),
);

train (Dataset $dataset, mixed ...$args): void

Fits each transformer in sequence (each on the output of the previous), then trains the estimator on the final transformed dataset. Extra $args are forwarded to the estimator's train().

predict (Dataset $dataset): Tensor

Applies each transformer's transform() (NOT fit()) in sequence, then calls the estimator's predict().
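
The train()/predict() contract described above can be illustrated with a minimal, self-contained sketch. It does not use any Pml classes: `SketchScaler` and `SketchPipeline` are hypothetical stand-ins that work on plain arrays, and the estimator is left out. The point is only that statistics learned during train() are reused, not re-estimated, during predict().

```php
<?php
// Hypothetical stand-ins, not Pml classes: they operate on plain PHP arrays.

final class SketchScaler
{
    private float $mean = 0.0;
    private float $std  = 1.0;

    /** Learn the mean and standard deviation from the training values. */
    public function fit(array $values): void
    {
        $n = count($values);
        $this->mean = array_sum($values) / $n;
        $var = array_sum(array_map(fn($v) => ($v - $this->mean) ** 2, $values)) / $n;
        $this->std = $var > 0.0 ? sqrt($var) : 1.0;
    }

    /** Apply the statistics learned in fit(); never re-estimates them. */
    public function transform(array $values): array
    {
        return array_map(fn($v) => ($v - $this->mean) / $this->std, $values);
    }
}

final class SketchPipeline
{
    /** @param SketchScaler[] $transformers */
    public function __construct(private array $transformers) {}

    public function train(array $values): array
    {
        foreach ($this->transformers as $t) {
            $t->fit($values);                  // fit on the previous step's output
            $values = $t->transform($values);
        }
        return $values;                        // what the estimator would be trained on
    }

    public function predict(array $values): array
    {
        foreach ($this->transformers as $t) {
            $values = $t->transform($values);  // transform only, no fit
        }
        return $values;                        // what the estimator would predict on
    }
}

$pipeline = new SketchPipeline([new SketchScaler()]);
$trainOut = $pipeline->train([2.0, 4.0, 6.0]);   // mean and std are learned here
$testOut  = $pipeline->predict([4.0, 10.0]);     // scaled with the training mean and std
```

Because predict() reuses the training mean of 4, the test value 4.0 maps to zero even though it is not the mean of the test batch.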

Method	Description
`trained(): bool`	Whether `train()` has been called
`save(string $dir): void`	Serializes all transformer parameters and estimator weights to `$dir/`
`load(string $dir): static`	Restores a saved pipeline

SMOTE in a Pipeline: Some transformers, such as SMOTE and TomekLinks, change the number of rows. They must come before the estimator and only fire during train(), not predict(). Pipeline handles this correctly: each later step receives the accumulated dataset, including any new rows.
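
As a rough illustration of a row-changing step that runs only at training time, the sketch below balances classes by repeating minority rows. This is plain duplication, not SMOTE (which synthesizes new samples by interpolation), and `oversampleMinority` and `applySteps` are hypothetical helpers rather than Pml functions.

```php
<?php
// Hypothetical sketch, not Pml code. Plain duplication stands in for SMOTE.

/**
 * Balance a labelled dataset by repeating rows of the smaller classes.
 * @param array<int, array{0: float, 1: int}> $rows  [feature, label] pairs
 */
function oversampleMinority(array $rows): array
{
    $byLabel = [];
    foreach ($rows as $row) {
        $byLabel[$row[1]][] = $row;
    }
    $target = max(array_map('count', $byLabel));   // size of the largest class
    $out = [];
    foreach ($byLabel as $group) {
        for ($i = 0; $i < $target; $i++) {
            $out[] = $group[$i % count($group)];   // cycle through the group
        }
    }
    return $out;
}

/** Row-changing steps run only when training; prediction input is untouched. */
function applySteps(array $rows, bool $training): array
{
    return $training ? oversampleMinority($rows) : $rows;
}

$rows = [[0.1, 0], [0.2, 0], [0.3, 0], [0.9, 1]];

$trainRows   = applySteps($rows, training: true);    // classes balanced
$predictRows = applySteps($rows, training: false);   // unchanged
```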

Cross-Validation

Namespace: Pml\CrossValidation\

Class	Constructor	Description
`KFold`	`int $k = 5`	Shuffled K-fold. Returns average metric across folds.
`StratifiedKFold`	`int $k = 5`	Stratified K-fold — preserves class proportions in each fold. Preferred for classification.
`HoldOut`	`float $ratio = 0.2`	Single train/test split. Fastest — use for large datasets.
`MonteCarlo`	`int $rounds = 5, float $ratio = 0.2`	Repeated random splits (Monte Carlo cross-validation). More stable than single holdout.
`LeavePOut`	`int $p = 1`	Leave-P-Out. Exhaustive but O(N^p) — only practical for tiny datasets or LOO (p=1).
`KFoldOOF`	`int $k = 5`	Generates out-of-fold predictions for stacking. Returns `[$oofPreds, $scores]`.
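
To make the difference between plain and stratified K-fold concrete, here is a small sketch of stratified fold assignment on bare label arrays. `stratifiedFolds` is a hypothetical helper, not part of Pml, and it skips shuffling so the result is deterministic; the library's own splitting logic may differ.

```php
<?php
// Hypothetical helper, not part of Pml. No shuffling, so the result is deterministic.

/**
 * Assign sample indices to $k folds so that each fold keeps roughly the
 * same class proportions as the full label list.
 * @param  array<int, int|string> $labels
 * @return array<int, int[]>      fold number => sample indices
 */
function stratifiedFolds(array $labels, int $k): array
{
    // Group sample indices by class label.
    $byClass = [];
    foreach ($labels as $index => $label) {
        $byClass[$label][] = $index;
    }

    // Deal each class's indices round-robin across the folds.
    $folds = array_fill(0, $k, []);
    $next  = 0;
    foreach ($byClass as $indices) {
        foreach ($indices as $index) {
            $folds[$next % $k][] = $index;
            $next++;
        }
    }
    return $folds;
}

$labels = ['a', 'a', 'a', 'a', 'b', 'b'];   // 4 samples of class a, 2 of class b
$folds  = stratifiedFolds($labels, 2);

foreach ($folds as $f => $testIdx) {
    // Every index outside the test fold is used for training.
    $trainIdx = array_values(array_diff(array_keys($labels), $testIdx));
    printf("fold %d: test=[%s] train=[%s]\n", $f, implode(',', $testIdx), implode(',', $trainIdx));
}
```

Each of the two folds ends up with two samples of class a and one of class b, matching the 2:1 ratio of the full dataset.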

use Pml\CrossValidation\StratifiedKFold;

$cv     = new StratifiedKFold(k: 5);
$scores = $cv->test(new GBDTClassifier(), $dataset, metric: 'accuracy');

$mean = array_sum($scores) / count($scores);
// Population standard deviation of the per-fold scores
$std  = sqrt(array_sum(array_map(fn($s) => ($s - $mean) ** 2, $scores)) / count($scores));

printf("CV Accuracy: %.3f ± %.3f\n", $mean, $std);

// Out-of-fold predictions for stacking
use Pml\CrossValidation\KFoldOOF;

$oof = new KFoldOOF(k: 5);
[$oofPreds, $scores] = $oof->run(
    fn() => new GBDTClassifier(nEstimators: 200),
    $dataset
);
// $oofPreds: Tensor [N] — predictions for every sample in the dataset
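
The idea behind out-of-fold predictions is that every sample is predicted by a model that never saw it during training. The sketch below shows this with a deliberately trivial "model" (the mean of the training targets) on plain arrays; `outOfFoldMeans` is a hypothetical function, and the fold assignment by index modulo k is an assumption made for brevity.

```php
<?php
// Hypothetical sketch, not Pml code: a "model" here is just the mean of its training targets.

/**
 * @param  float[] $targets
 * @return float[] one out-of-fold prediction per sample
 */
function outOfFoldMeans(array $targets, int $k): array
{
    $n   = count($targets);
    $oof = array_fill(0, $n, 0.0);

    for ($fold = 0; $fold < $k; $fold++) {
        // Sample i belongs to fold (i mod k); the rest form the training set.
        $train = [];
        $held  = [];
        foreach ($targets as $i => $y) {
            if ($i % $k === $fold) {
                $held[] = $i;
            } else {
                $train[] = $y;
            }
        }

        // "Train" a fresh model on the other folds only.
        $prediction = array_sum($train) / count($train);

        // Predict the held-out samples; each index is written exactly once.
        foreach ($held as $i) {
            $oof[$i] = $prediction;
        }
    }
    return $oof;
}

$oofPreds = outOfFoldMeans([1.0, 2.0, 3.0, 4.0], 2);
```

Since no prediction depends on its own target, the resulting vector can be used as a meta-feature for stacking without leaking labels.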

Dataset Generators

Namespace: Pml\Datasets\Generators\. Useful for testing and prototyping.

Generator	Description
`Blob`	Gaussian blobs. Multiple clusters, configurable separation.
`Circle`	Concentric circles. Binary classification benchmark.
`HalfMoon`	Two interleaved half-moon shapes. Classic non-linear benchmark.
`Hyperplane`	Linearly separable data. For regression or linear classifier validation.
`SwissRoll`	3-D Swiss roll manifold. For manifold learning and t-SNE demos.
`Agglomerate`	Agglomerated clusters with variable density.

use Pml\Datasets\Generators\HalfMoon;

$gen = new HalfMoon(n: 1000, noise: 0.1);
$ds  = $gen->generate();   // Dataset with 1000 labeled samples

Ensemble Methods

BootstrapAggregator

use Pml\BootstrapAggregator;

$bag = new BootstrapAggregator(
    base:        new DecisionTreeClassifier(maxDepth: 10),
    nEstimators: 50,
    ratio:       0.8,  // bootstrap sample size as a fraction of the training set
);
$bag->train($dataset);
$preds = $bag->predict($testSet);
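
For readers unfamiliar with bagging, the two building blocks can be sketched in a few lines: drawing a bootstrap sample and combining the base models' votes. `bootstrapSample` and `majorityVote` are hypothetical helpers, not Pml functions, and the sketch assumes that `ratio` means a fraction of the training rows drawn with replacement.

```php
<?php
// Hypothetical helpers, not Pml code.

/** Draw round($ratio * n) rows with replacement (assumed meaning of "ratio"). */
function bootstrapSample(array $rows, float $ratio): array
{
    $n    = count($rows);
    $size = max(1, (int) round($ratio * $n));
    $sample = [];
    for ($i = 0; $i < $size; $i++) {
        $sample[] = $rows[random_int(0, $n - 1)];
    }
    return $sample;
}

/** Combine per-model class predictions for one sample by majority vote. */
function majorityVote(array $votes): string
{
    $counts = array_count_values($votes);
    arsort($counts);                         // most frequent label first
    return (string) array_key_first($counts);
}

$rows   = range(1, 10);
$sample = bootstrapSample($rows, 0.8);            // 8 rows, duplicates possible
$label  = majorityVote(['cat', 'dog', 'cat']);    // the label with the most votes
```

Each base model would be trained on its own sample; for regression, the votes would typically be averaged instead.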

StackingRegressor

use Pml\Ensemble\StackingRegressor;

$stacker = new StackingRegressor(
    estimators: [
        new GBDTRegressor(nEstimators: 100),
        new Ridge(alpha: 1.0),
        new KNNRegressor(k: 10),
    ],
    metaEstimator: new LinearRegression(),
    folds: 5,
);
$stacker->train($dataset);
$preds = $stacker->predict($testSet);

Autograd

Pml\Autograd\Variable wraps a Tensor and builds a dynamic compute graph via a C-backed Tape. Use it for custom gradient computations outside of Sequential.

Partial implementation. The autograd engine currently supports four tracked operations (add, mul, matmul, and relu) plus backward(). For full training loops, use Sequential, which has a complete hand-written backward pass in C.

use Pml\Autograd\Variable;
use Pml\Tensor;

$x = new Variable(Tensor::fromArray([[1.0, 2.0]]), requiresGrad: true);
$W = new Variable(Tensor::randomNormal([2, 4]),   requiresGrad: true);

$y = $x->matmul($W)->relu();
$y->backward();

$dW = $W->grad();   // dL/dW
$dx = $x->grad();   // dL/dx
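
To show what the tape has to compute for this graph, the following sketch derives the same gradients by hand on plain arrays, without Pml. It assumes the loss is L = sum(y), meaning the upstream gradient is all ones; whether backward() on a non-scalar Variable seeds the gradient this way is an assumption, not something stated above. The weights are fixed rather than random so the numbers can be checked.

```php
<?php
// Hand-computed gradients for y = relu(x W), using plain arrays rather than Pml tensors.
// Assumption: the loss is L = sum(y), i.e. the upstream gradient is all ones.

$x = [1.0, 2.0];                       // shape [1, 2], stored as a flat row
$W = [[1.0, -1.0, 0.5,  0.0],          // shape [2, 4]
      [0.5, -2.0, 1.0, -1.0]];

// Forward pass: z = x W, y = relu(z).
$z = [];
for ($j = 0; $j < 4; $j++) {
    $z[$j] = $x[0] * $W[0][$j] + $x[1] * $W[1][$j];
}
$y = array_map(fn($v) => max(0.0, $v), $z);

// Backward pass. ReLU passes the gradient only where z > 0.
$dz = array_map(fn($v) => $v > 0.0 ? 1.0 : 0.0, $z);

// dL/dW = x^T dz (outer product), dL/dx = dz W^T.
$dW = [];
$dx = [0.0, 0.0];
for ($i = 0; $i < 2; $i++) {
    for ($j = 0; $j < 4; $j++) {
        $dW[$i][$j] = $x[$i] * $dz[$j];
        $dx[$i]    += $dz[$j] * $W[$i][$j];
    }
}
```

Columns of W whose pre-activation is negative receive zero gradient, which is why only two of the four columns of `$dW` are non-zero here.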

Method	Description
`add(Variable $other): Variable`	Elementwise addition, tracked
`mul(Variable $other): Variable`	Elementwise multiply, tracked
`matmul(Variable $other): Variable`	Matrix multiply, tracked
`relu(): Variable`	ReLU activation, tracked
`backward(): void`	Backprop from this node through the tape
`data(): Tensor`	Forward value tensor
`grad(): ?Tensor`	Gradient tensor (populated after backward)
`requiresGrad(): bool`	Whether gradients are tracked for this Variable
