Neural Networks

This page covers the Sequential model container, 29 layer types, 9 optimizers, 5 loss functions, INT8 quantization, and pre-made activation and initializer modules. All weight math runs in C via FFI; PHP only passes layer handles around.

Sequential

Pml\NeuralNetwork\Sequential is the primary model container. Implements TrainableWithOptions, Persistable, and Verbose.

__construct (Layer[] $layers, Loss $lossFn, Optimizer $optimizer)

$model = new Sequential(
    layers:    [new Dense(784, 256), new ReLU(), new Dense(256, 10), new Softmax()],
    lossFn:    new CategoricalCrossEntropy(),
    optimizer: new AdamW(lr: 3e-4),
);

train (Dataset $dataset, mixed ...$options): void

Full training loop with optional validation, early stopping, and gradient clipping.

Option	Type	Default	Description
`epochs`	int	10	Number of full passes over the dataset
`batchSize`	int	32	Mini-batch size
`validation`	Dataset	null	Held-out validation set for loss monitoring
`patience`	int	0	Early stopping patience epochs (0 = disabled)
`minDelta`	float	1e-4	Minimum improvement to reset patience counter
`clipGradNorm`	float	0.0	Global gradient-norm clip threshold (0 = disabled). Recommended 1.0–5.0 for RNN/LSTM.

$model->train($train,
    epochs:       50,
    batchSize:    128,
    validation:   $val,
    patience:     5,
    clipGradNorm: 1.0,
);
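
The interaction between `patience` and `minDelta` can be sketched in plain PHP. This is a minimal illustration, not PML's internal code: it assumes an epoch counts as an improvement only when it beats the best validation loss so far by more than `minDelta`, and the function name `epochsBeforeEarlyStop` is hypothetical.

```php
<?php
// Sketch of patience-based early stopping.
// Assumption: "improvement" means best - loss > minDelta; PML's exact rule may differ.

/**
 * @param float[] $valLosses validation loss recorded after each epoch
 * @return int number of epochs that run before training stops
 */
function epochsBeforeEarlyStop(array $valLosses, int $patience, float $minDelta = 1e-4): int
{
    $best = INF;
    $stale = 0; // consecutive epochs without a sufficient improvement

    foreach ($valLosses as $i => $loss) {
        if ($best - $loss > $minDelta) {
            $best = $loss;
            $stale = 0; // improvement resets the patience counter
        } else {
            $stale++;
        }
        // patience = 0 means early stopping is disabled
        if ($patience > 0 && $stale >= $patience) {
            return $i + 1;
        }
    }
    return count($valLosses);
}

$losses = [0.90, 0.70, 0.65, 0.66, 0.65, 0.651, 0.70, 0.64];
echo epochsBeforeEarlyStop($losses, patience: 3), PHP_EOL; // prints 6
```

With these losses the best value (0.65) is reached at epoch 3, the next three epochs fail to beat it by more than `minDelta`, and training stops after epoch 6 even though epoch 8 would have been better. A larger `patience` trades extra epochs for a lower chance of stopping early.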

stepOnBatch (Dataset $batch, float $clipGradNorm = 0.0): float

Single forward + backward + optimizer step on one pre-made batch. Returns scalar loss. Used for streaming training loops that manage their own epoch logic.
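
To make the `clipGradNorm` parameter concrete, here is a plain-PHP sketch of global gradient-norm clipping. It operates on ordinary arrays rather than PML tensors, and `clipGlobalNorm` is a hypothetical name; the library performs the equivalent work in C.

```php
<?php
/**
 * Scales all gradients so that their combined L2 norm does not exceed $maxNorm.
 * Illustrative only: uses PHP arrays, not PML Tensor objects.
 *
 * @param array<string, float[]> $grads parameter name => flat gradient values
 * @return array<string, float[]>
 */
function clipGlobalNorm(array $grads, float $maxNorm): array
{
    if ($maxNorm <= 0.0) {
        return $grads; // 0 = clipping disabled
    }

    // The norm is computed over every gradient value of every parameter.
    $sumSq = 0.0;
    foreach ($grads as $g) {
        foreach ($g as $v) {
            $sumSq += $v * $v;
        }
    }
    $norm = sqrt($sumSq);
    if ($norm <= $maxNorm) {
        return $grads; // already within the threshold
    }

    // One shared scale factor keeps the gradient direction unchanged.
    $scale = $maxNorm / $norm;
    return array_map(
        fn(array $g): array => array_map(fn(float $v): float => $v * $scale, $g),
        $grads
    );
}

$clipped = clipGlobalNorm(['w' => [3.0, 0.0], 'b' => [4.0]], 1.0);
print_r($clipped); // global norm 5.0 is rescaled to 1.0
```

Because a single factor is applied to all parameters, the update direction is preserved and only its magnitude is capped.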

predict (Dataset $dataset): Tensor

Sets all layers to inference mode (disables Dropout, freezes BatchNorm), runs forward pass, then restores training mode. Requires trained() === true.

quantize (int $groupSize = 32): void

Quantizes all Dense layers to INT8 block format. After this call, weight memory drops by roughly 4×, decoding is faster, and backward() is disabled. See INT8 Quantization.

Method	Description
`forward(Tensor $x): Tensor`	Runs a forward pass without toggling training mode. Used internally by custom training loops.
`backward(Tensor $grad): void`	Backpropagation through all layers in reverse.
`add(Layer $layer): void`	Append a layer to the end of the stack.
`getLayers(): Layer[]`	Read-only access to the layer list.
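`getOptimizer(): Optimizer`	Returns the optimizer the model was constructed with.
`getLoss(): Loss`	Returns the loss function the model was constructed with.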
`trained(): bool`	True after `train()` or `markTrained()` completes.
`markTrained(): void`	Marks model trained without calling `train()`. Use after streaming training loops.
`setLogger(LoggerInterface): void`	PSR-3 logger for per-epoch loss output.
`save(string $dir): void`	Saves config.json + model.safetensors to directory.
`load(string $dir): static`	Restores model from directory. Weights are mmap'd.
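
The `stepOnBatch()` and `markTrained()` entries are meant to be combined in a caller-managed loop. The sketch below shows one possible shape for such a loop. To stay self-contained it takes the training step as a callable instead of a PML model; `runStreamingEpochs` and the stand-in step are hypothetical.

```php
<?php
/**
 * Runs $epochs passes over a stream of batches, calling $step once per batch.
 *
 * @param callable $step    performs one training step and returns the batch loss
 * @param callable $batches factory that returns a fresh iterable of batches per epoch
 * @return float[] mean loss per epoch
 */
function runStreamingEpochs(callable $step, callable $batches, int $epochs): array
{
    $history = [];
    for ($epoch = 0; $epoch < $epochs; $epoch++) {
        $total = 0.0;
        $count = 0;
        foreach ($batches() as $batch) {
            $total += $step($batch);
            $count++;
        }
        $history[] = $count > 0 ? $total / $count : NAN;
    }
    return $history;
}

// Stand-in step for demonstration only: the "loss" is simply the batch mean.
$fakeStep = fn(array $batch): float => array_sum($batch) / count($batch);

// A generator factory, so each epoch re-reads the stream from the start.
$stream = function (): Generator {
    yield [1.0, 3.0];
    yield [2.0, 6.0];
};

print_r(runStreamingEpochs($fakeStep, $stream, 2));
```

With a real model the step would presumably be `fn(Dataset $b): float => $model->stepOnBatch($b, clipGradNorm: 1.0)`, followed by `$model->markTrained()` once the loop finishes so that `predict()` accepts the model.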

Layer Interface

interface Layer
{
    public function forward(Tensor $input): Tensor;
    public function backward(Tensor $gradient): Tensor;
    public function getParameters(): array;   // name → Tensor
    public function getGradients(): array;    // name → Tensor
    public function getConfig(): array;       // for ModelStore serialization
    public static function fromConfig(array $config): static;
}

Additional optional interfaces a layer may implement:

Interface	Methods	Description
`HasTrainingMode`	`setTraining(bool)`	Layers that behave differently during training vs inference (Dropout, BatchNorm)
`Stateful`	`getStateDict(prefix)` · `loadStateDict(dict, prefix)`	SafeTensors checkpoint I/O
`Quantizable`	`quantize(int)` · `isQuantized()`	INT8 block quantization (Dense only)

Layers Reference

Linear & Core

Layer	Constructor	Description
`Dense`	`int $inputDim, int $outputDim, bool $useBias = true`	Fully-connected layer. `Y = XW^T + b`. He initialization. Implements Stateful + Quantizable.
`Embedding`	`int $vocabSize, int $embedDim`	Token embedding lookup table. Forward: gather rows by integer indices. Backward: scatter gradients.
`Flatten`	—	Reshapes input to [batch, D]. No parameters.
`Reshape`	`array $targetShape`	Arbitrary reshape (excluding batch dimension).
`Squeeze`	—	Removes all size-1 dimensions from the tensor.
`MLP`	`int $inputSize, array $hiddenSizes, int $outputSize, string $activation = 'relu'`	Pre-made multi-layer perceptron block (Dense + Activation stack). Useful inside transformer blocks.

Normalization

Layer	Constructor	Description
`BatchNormalization`	`int $features, float $momentum = 0.9, float $eps = 1e-5`	Batch normalization for 2-D tensors [batch, features]. Trainable scale+shift. Running stats for inference. Stateful.
`BatchNorm2D`	—	Batch normalization for 4-D tensors [N, C, H, W]. Used with Conv2D.
`LayerNorm`	`int $dim, float $eps = 1e-5`	Layer normalization over the last dimension. Used in Transformers. Stateful.

Regularization

Layer	Constructor	Description
`Dropout`	`float $rate = 0.5`	Drops activations randomly during training. No-op at inference. Implements HasTrainingMode.
`Noise`	`float $stddev = 0.1`	Additive Gaussian noise during training. Improves robustness.

Convolutional

Layer	Constructor	Description
`Conv2D`	`int $inC, int $outC, int $kernelSize, int $stride = 1, int $padding = 0, bool $useBias = true`	2-D convolution. OpenBLAS im2col GEMM. Stateful. Backward computes dX, dW, db.
`DepthwiseConv2D`	—	Depthwise separable convolution. Roughly 8–9× fewer FLOPs than a regular Conv2D for 3×3 kernels. Used in MobileNet.
`GlobalAveragePooling2D`	—	Reduces [N, C, H, W] to [N, C] by averaging over spatial dims.
`InvertedResidual`	—	MobileNetV2/V3 inverted residual block: expand → depthwise → project + residual.
`SEBlock`	`int $channels, int $reduction = 4`	Squeeze-and-Excitation block. Channel attention: global pool → FC → ReLU → FC → Sigmoid → scale.

Activation Layers

Layer	Notes
`ReLU`	max(0, x). Zero parameters. AVX2 vectorized.
`Sigmoid`	σ(x) = 1/(1+e⁻ˣ). AVX2 vectorized.
`Tanh`	tanh(x). AVX2 vectorized.
`Softmax`	Numerically stable softmax over last axis.
`Gelu`	Gaussian Error Linear Unit. Used in BERT/GPT.
`Swish`	x · σ(βx). Smooth, non-monotonic. Default β=1.
`HardSigmoid`	Piecewise linear σ approximation. Fast inference.
`HardSwish`	Hard approximation to Swish. Used in MobileNetV3.
`PReLU`	Parametric ReLU — learnable negative slope α.
`Activation`	Wraps any `ActivationFunction` object as a layer.
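
"Numerically stable" in the Softmax row usually refers to subtracting the row maximum before exponentiating. The following plain-PHP sketch shows that trick on a single vector; it assumes nothing about the C kernel beyond that idea, and `stableSoftmax` is a hypothetical name.

```php
<?php
/**
 * Softmax over one vector, shifted by the maximum so exp() cannot overflow.
 *
 * @param float[] $logits
 * @return float[] probabilities summing to 1
 */
function stableSoftmax(array $logits): array
{
    $max = max($logits);
    // exp($z - $max) is always in (0, 1], even for very large logits.
    $exps = array_map(fn(float $z): float => exp($z - $max), $logits);
    $sum = array_sum($exps);
    return array_map(fn(float $e): float => $e / $sum, $exps);
}

// A naive exp(1000.0) overflows to INF; the shifted version does not.
$p = stableSoftmax([1000.0, 1001.0, 1002.0]);
printf("%.4f %.4f %.4f\n", ...$p); // 0.0900 0.2447 0.6652
```

The shift does not change the result, because softmax is invariant to adding a constant to every input.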

Recurrent Layers

Layer	Constructor	Description
`LSTM`	`int $inputSize, int $hiddenSize`	Long Short-Term Memory. C-level gated cell. Input: [batch, seq, inputSize] → [batch, seq, hiddenSize]. Stateful.
`RNN`	`int $inputSize, int $hiddenSize`	Vanilla Elman RNN. Faster than LSTM, worse for long sequences.
`Mamba`	`int $dModel, int $dState = 16`	State Space Model block (Mamba-style SSM). Linear-time sequence modeling alternative to attention.

Attention

Layer	Constructor	Description
`CausalSelfAttention`	`int $dModel, int $nHeads`	Multi-head causal self-attention with three forward paths. Training (`$kv = null`): full O(T²) causal attention; activations are cached for backward. Prefill (`$kv` set, T > 1): full attention, and the KV-cache is populated. Decode (`$kv` set, T = 1): appends one token to the cache and attends over the cached keys and values, so each step costs O(T) instead of recomputing the full O(T²) attention. Signature: `forward(Tensor $x, ?KVCache $kv = null, int $layerIdx = 0): Tensor`
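
The decode path can be illustrated with a single-head sketch in plain PHP. It is a simplified model of the idea, not PML's kernel: there are no projections or multiple heads, the cache is a plain array rather than a `KVCache` object, and `decodeStep` is a hypothetical name.

```php
<?php
/**
 * One single-head decode step: append the new key/value to the cache,
 * then attend the new query over every cached position.
 * Causality is implicit because the cache only holds past and current tokens.
 *
 * @param float[] $q new token's query
 * @param float[] $k new token's key
 * @param float[] $v new token's value
 * @param array   $cache ['k' => float[][], 'v' => float[][]], modified in place
 * @return float[] attention output for the new token
 */
function decodeStep(array $q, array $k, array $v, array &$cache): array
{
    $cache['k'][] = $k;
    $cache['v'][] = $v;

    // Scaled dot-product scores against every cached key.
    $scale = 1.0 / sqrt(count($q));
    $scores = [];
    foreach ($cache['k'] as $key) {
        $dot = 0.0;
        foreach ($q as $i => $qi) {
            $dot += $qi * $key[$i];
        }
        $scores[] = $dot * $scale;
    }

    // Numerically stable softmax over the scores.
    $max = max($scores);
    $weights = array_map(fn(float $s): float => exp($s - $max), $scores);
    $sum = array_sum($weights);

    // Weighted sum of the cached values.
    $out = array_fill(0, count($v), 0.0);
    foreach ($weights as $t => $w) {
        foreach ($cache['v'][$t] as $i => $vi) {
            $out[$i] += ($w / $sum) * $vi;
        }
    }
    return $out;
}

$cache = ['k' => [], 'v' => []];
decodeStep([1.0, 0.0], [1.0, 0.0], [10.0, 0.0], $cache);
// A zero query scores every position equally, so the output is the mean of the values.
$out = decodeStep([0.0, 0.0], [0.0, 1.0], [0.0, 20.0], $cache);
print_r($out); // [5, 10]
```

Each call touches every cached position once, which is why per-token decode cost grows with the cache length while avoiding recomputation of keys and values for earlier tokens.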

Optimizers

Namespace: Pml\NeuralNetwork\Optimizers\. All implement Optimizer::step(Layer[] $layers): void.

Optimizer	Constructor	Notes
`Adam`	`lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8`	Adaptive moment estimation. A common general-purpose default.
`AdamW`	`lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weightDecay=1e-2`	Adam with decoupled weight decay. Preferred over Adam + L2 reg for transformers.
`AdaMax`	`lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8`	Adam variant using infinity-norm. More stable for embeddings.
`AdaGrad`	`lr=0.01, eps=1e-8`	Accumulates squared gradients. Good for sparse features. Learning rate decays over time.
`RMSProp`	`lr=0.001, decay=0.9, eps=1e-8`	Leaky AdaGrad — maintains moving average of squared gradients.
`SGD`	`lr=0.01`	Pure stochastic gradient descent. Fast, simple.
`Momentum`	`lr=0.01, momentum=0.9`	SGD with momentum. Faster convergence on ravines.
`Cyclical`	—	Cyclical learning rate — oscillates between min and max LR. Can escape local minima.
`StepDecay`	—	Decays learning rate by a factor every N steps.

Optimizer step is fused in C. Each optimizer calls a single FFI function (e.g., tensor_fused_adam_step) per parameter tensor — no intermediate allocation, all math in-place in C.
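
For reference, the arithmetic a fused AdamW step performs per parameter can be written out in plain PHP. This sketch follows the standard AdamW update with bias correction and the default hyperparameters from the table; it is not the body of any PML FFI function, and `adamwStep` and the `$state` layout are hypothetical.

```php
<?php
/**
 * One AdamW update over a flat parameter array, in place.
 * $state holds 'm' and 'v' (moment estimates) and 't' (step count).
 */
function adamwStep(
    array &$params,
    array $grads,
    array &$state,
    float $lr = 1e-3,
    float $beta1 = 0.9,
    float $beta2 = 0.999,
    float $eps = 1e-8,
    float $weightDecay = 1e-2
): void {
    $state['t'] = ($state['t'] ?? 0) + 1;
    $t = $state['t'];

    foreach ($params as $i => $p) {
        $g = $grads[$i];

        // Exponential moving averages of the gradient and its square.
        $m = $beta1 * ($state['m'][$i] ?? 0.0) + (1 - $beta1) * $g;
        $v = $beta2 * ($state['v'][$i] ?? 0.0) + (1 - $beta2) * $g * $g;
        $state['m'][$i] = $m;
        $state['v'][$i] = $v;

        // Bias correction compensates for the zero-initialized moments.
        $mHat = $m / (1 - $beta1 ** $t);
        $vHat = $v / (1 - $beta2 ** $t);

        // Decoupled weight decay: applied to the parameter, not mixed into the gradient.
        $params[$i] = $p - $lr * ($mHat / (sqrt($vHat) + $eps) + $weightDecay * $p);
    }
}

$params = [1.0];
$state = [];
adamwStep($params, [0.5], $state);
printf("%.5f\n", $params[0]);
```

Dropping the `$weightDecay * $p` term gives plain Adam; adding the decay to `$g` instead would give Adam with L2 regularization, which is the variant AdamW is designed to replace.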

Loss Functions

Namespace: Pml\Losses\. All implement Loss: compute(Tensor $preds, Tensor $labels): float + differentiate(...): Tensor.

Loss	Use Case	Notes
`CategoricalCrossEntropy`	Multi-class classification	Labels as class indices or one-hot. Numerically stable log computation in C.
`BinaryCrossEntropy`	Binary classification	Fused BCE + gradient via `tensor_fused_bce_loss_and_grad`. Avoids two passes.
`MeanSquaredError`	Regression	Σ(y_hat − y)² / N. Gradient: 2(y_hat − y) / N.
`Hinge`	SVM-style	max(0, 1 − y · y_hat). Sparse gradient.
`Huber`	Robust regression	Quadratic near zero, linear for large errors. Configurable δ.
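
The "quadratic near zero, linear for large errors" behaviour of Huber can be written out directly. The sketch below uses the textbook definition averaged over N samples; the default δ of 1.0, the averaging, and the name `huber` are assumptions rather than documented PML behaviour.

```php
<?php
/**
 * Mean Huber loss and its gradient with respect to the predictions.
 *
 * @param float[] $preds
 * @param float[] $labels
 * @return array [float $loss, float[] $gradient]
 */
function huber(array $preds, array $labels, float $delta = 1.0): array
{
    $n = count($preds);
    $loss = 0.0;
    $grad = [];

    foreach ($preds as $i => $yHat) {
        $r = $yHat - $labels[$i]; // residual
        if (abs($r) <= $delta) {
            // Quadratic region: behaves like (half) squared error.
            $loss += 0.5 * $r * $r;
            $grad[] = $r / $n;
        } else {
            // Linear region: gradient magnitude is capped at delta.
            $loss += $delta * (abs($r) - 0.5 * $delta);
            $grad[] = $delta * ($r <=> 0.0) / $n;
        }
    }
    return [$loss / $n, $grad];
}

[$loss, $grad] = huber([0.5, 3.0], [0.0, 0.0]);
printf("loss=%.4f grad=[%.2f, %.2f]\n", $loss, $grad[0], $grad[1]); // loss=1.3125 grad=[0.25, 0.50]
```

The residual of 3.0 contributes a gradient of the same size as a residual of 1.0 would, which is what makes the loss robust to outliers.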

INT8 Quantization

PML supports symmetric INT8 block quantization (Q8_0-class) for Dense layers. Each group of 32 weights shares one float32 scale factor. The forward kernel uses AVX2 fused int8→fp32 dot products — no temporary fp32 weight allocation per inference call.
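
Symmetric block quantization as described above can be sketched in plain PHP. This is an illustration of the general scheme (scale = max |w| / 127 per group, values rounded to integers in [−127, 127]); the exact rounding and storage layout PML uses are not documented here, and both function names are hypothetical.

```php
<?php
/**
 * Symmetric per-group INT8 quantization of a flat weight array.
 *
 * @param float[] $weights
 * @return array ['q' => int[], 'scales' => float[]]
 */
function quantizeBlocks(array $weights, int $groupSize = 32): array
{
    $q = [];
    $scales = [];
    foreach (array_chunk($weights, $groupSize) as $group) {
        // One scale per group, chosen so the largest magnitude maps to 127.
        $maxAbs = max(array_map('abs', $group));
        $scale = $maxAbs > 0.0 ? $maxAbs / 127.0 : 1.0;
        $scales[] = $scale;
        foreach ($group as $w) {
            $q[] = (int) round($w / $scale);
        }
    }
    return ['q' => $q, 'scales' => $scales];
}

/**
 * Reconstructs approximate fp32 weights from INT8 values and group scales.
 *
 * @param int[]   $q
 * @param float[] $scales
 * @return float[]
 */
function dequantizeBlocks(array $q, array $scales, int $groupSize = 32): array
{
    $out = [];
    foreach ($q as $i => $v) {
        $out[] = $v * $scales[intdiv($i, $groupSize)];
    }
    return $out;
}

$w = [0.5, -1.27, 0.01, 0.0];
$packed = quantizeBlocks($w, groupSize: 4);
$restored = dequantizeBlocks($packed['q'], $packed['scales'], groupSize: 4);
print_r($packed['q']); // [50, -127, 1, 0]
```

The round-trip error per weight is at most half a scale step, so groups with one large outlier lose more precision on their small weights. Smaller groups limit that effect at the cost of storing more scales.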

Memory impact

7B model: ~28 GB fp32 → ~7.2 GB INT8.
Compression ratio typically 3.9–4.0×.
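
A back-of-the-envelope estimate shows how group size and scale width affect these numbers. The sketch assumes the quantized format stores exactly one byte per weight plus one scale per group and nothing else; PML's real layout may differ, so the documented figures will not necessarily match this simple model. Both function names are hypothetical.

```php
<?php
/**
 * Estimated fp32-to-INT8 compression ratio for block-quantized weights,
 * counting one byte per weight plus one scale per group.
 */
function compressionRatio(int $groupSize, int $scaleBytes = 4): float
{
    $fp32Bytes = 4 * $groupSize;
    $int8Bytes = $groupSize + $scaleBytes;
    return $fp32Bytes / $int8Bytes;
}

/** Estimated INT8 weight size in GB (10^9 bytes) for a given parameter count. */
function int8SizeGb(float $numParams, int $groupSize, int $scaleBytes = 4): float
{
    return $numParams * ($groupSize + $scaleBytes) / $groupSize / 1e9;
}

foreach ([32, 64, 128] as $g) {
    printf(
        "groupSize=%d ratio=%.3f size(7B)=%.3f GB\n",
        $g,
        compressionRatio($g),
        int8SizeGb(7e9, $g)
    );
}
```

Under these assumptions the per-group scale is the only overhead, so the ratio approaches 4× as the group size grows or the scale gets narrower, and falls further below 4× for small groups.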

Throughput

~4× decode token throughput improvement vs fp32 on CPU (memory-bandwidth bound).

Accuracy

Perplexity degrades <1% for groupSize=32. Larger models degrade less.

// Train in fp32
$model->train($dataset, epochs: 20);
$model->save('ckpt/mnist');

// Load and quantize for deployment
$model = Sequential::load('ckpt/mnist');
$model->quantize(groupSize: 32);   // all Dense layers → INT8

// Inference now uses roughly 4× less weight memory and decodes faster
$preds = $model->predict($testSet);

// Count how many Dense layers were actually quantized
$quantized = 0;
foreach ($model->getLayers() as $layer) {
    if ($layer instanceof Dense && $layer->isQuantized()) {
        $quantized++;
    }
}
echo "Quantized Dense layers: {$quantized}\n";

Quantization is irreversible for a loaded model. After quantize(), backward() throws LogicException. Quantized checkpoints can still be saved (getStateDict() dequantizes for export).

Initializers

Namespace: Pml\NeuralNetwork\Initializers\. Used by layers that accept a custom initializer.

Initializer	Formula	Notes
`He`	N(0, √(2/fan_in))	Default for ReLU networks
`Xavier1`	U(−√(6/(fan_in+fan_out)), +√(6/(fan_in+fan_out)))	Glorot uniform, suited to tanh
`Xavier2`	N(0, √(2/(fan_in+fan_out)))	Glorot normal
`LeCun`	N(0, √(1/fan_in))	For SELU networks
`Normal`	N(mean, stddev)	Custom Gaussian
`Uniform`	U(min, max)	Custom uniform
`Constant`	fill(value)	Fixed value (e.g., for bias init)
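
The He and Glorot-uniform formulas can be turned into samplers with a few lines of plain PHP. The sketch reads N(0, s) in the table as a Gaussian with standard deviation s, draws normals with the Box-Muller transform, and returns nested arrays instead of tensors; all function names are hypothetical.

```php
<?php
/** Draws one sample from N(0, 1) using the Box-Muller transform. */
function standardNormal(): float
{
    $u1 = (mt_rand() + 1) / (mt_getrandmax() + 1); // in (0, 1], avoids log(0)
    $u2 = mt_rand() / mt_getrandmax();
    return sqrt(-2.0 * log($u1)) * cos(2.0 * M_PI * $u2);
}

/** He normal: stddev = sqrt(2 / fan_in). Returns a [fanOut][fanIn] matrix. */
function heNormal(int $fanIn, int $fanOut): array
{
    $std = sqrt(2.0 / $fanIn);
    $w = [];
    for ($o = 0; $o < $fanOut; $o++) {
        for ($i = 0; $i < $fanIn; $i++) {
            $w[$o][$i] = $std * standardNormal();
        }
    }
    return $w;
}

/** Glorot uniform: U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out)). */
function xavierUniform(int $fanIn, int $fanOut): array
{
    $limit = sqrt(6.0 / ($fanIn + $fanOut));
    $w = [];
    for ($o = 0; $o < $fanOut; $o++) {
        for ($i = 0; $i < $fanIn; $i++) {
            $w[$o][$i] = (2.0 * mt_rand() / mt_getrandmax() - 1.0) * $limit;
        }
    }
    return $w;
}

mt_srand(42); // fixed seed so the example is repeatable
$w = heNormal(784, 256);
printf("rows=%d cols=%d target stddev=%.4f\n", count($w), count($w[0]), sqrt(2.0 / 784));
```

Both schemes shrink the weight scale as the layer gets wider, which keeps activation variance roughly constant from layer to layer at the start of training.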

Activation Functions

Namespace: Pml\NeuralNetwork\ActivationFunctions\. Used with the Activation layer wrapper.

Activation	Formula
`ELU`	x if x > 0, else α(eˣ − 1)
`GELU`	x·Φ(x) — Gaussian CDF approximation
`LeakyReLU`	max(αx, x) — fixed negative slope
`SELU`	λ·x if x > 0, else λ·α(eˣ − 1); self-normalizing
`SiLU`	x·σ(x) — Swish with β=1
`SoftPlus`	log(1 + eˣ) — smooth ReLU
`Softsign`	x / (1 + |x|)
`ThresholdedReLU`	x if x > θ, else 0
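
As a quick reference, several of the formulas above translate directly into scalar PHP functions. These are plain-PHP illustrations rather than the library's `ActivationFunction` classes; the default α and θ values are assumptions, and the SELU constants are the standard published ones.

```php
<?php
// Standard SELU constants.
const SELU_ALPHA = 1.6732632423543772;
const SELU_LAMBDA = 1.0507009873554805;

function elu(float $x, float $alpha = 1.0): float
{
    return $x > 0 ? $x : $alpha * (exp($x) - 1.0);
}

function selu(float $x): float
{
    // Scaled ELU: the same piecewise form, multiplied by lambda.
    return SELU_LAMBDA * ($x > 0 ? $x : SELU_ALPHA * (exp($x) - 1.0));
}

function leakyRelu(float $x, float $alpha = 0.01): float
{
    return max($alpha * $x, $x); // valid for 0 < alpha < 1
}

function silu(float $x): float
{
    return $x / (1.0 + exp(-$x)); // x * sigmoid(x)
}

function softPlus(float $x): float
{
    // Rewritten form of log(1 + e^x) that does not overflow for large x.
    return max($x, 0.0) + log1p(exp(-abs($x)));
}

function softsign(float $x): float
{
    return $x / (1.0 + abs($x));
}

function thresholdedRelu(float $x, float $theta = 1.0): float
{
    return $x > $theta ? $x : 0.0;
}

printf("selu(-1)=%.4f softplus(0)=%.4f softsign(1)=%.2f\n", selu(-1.0), softPlus(0.0), softsign(1.0));
```

The `softPlus` rewrite is mathematically identical to log(1 + eˣ) but avoids evaluating `exp()` on large positive inputs.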

SLM Utilities

Namespace: Pml\SLM\. Small Language Model utilities.

Class	Description
`BpeTrainer`	Trains a BPE vocabulary from raw text corpus. Produces a `tokenizer.json` compatible with `Pml\Inference\Tokenizer`.
`TrainableEmbedding`	Embedding layer with positional encoding. Combines token + positional embeddings in one forward call. Stateful.
