Neural Networks

The Sequential model container, 29 layer types, 9 optimizers, 5 loss functions, INT8 quantization, and pre-made activation/initializer modules. All weight math runs in C via FFI; PHP only moves layer handles.

Sequential

Pml\NeuralNetwork\Sequential is the primary model container. It implements TrainableWithOptions, Persistable, and Verbose.

__construct (Layer[] $layers, Loss $lossFn, Optimizer $optimizer)

$model = new Sequential(
    layers:    [new Dense(784, 256), new ReLU(), new Dense(256, 10), new Softmax()],
    lossFn:    new CategoricalCrossEntropy(),
    optimizer: new AdamW(lr: 3e-4),
);

train (Dataset $dataset, mixed ...$options): void

Full training loop with optional validation, early stopping, and gradient clipping.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| epochs | int | 10 | Number of full passes over the dataset |
| batchSize | int | 32 | Mini-batch size |
| validation | Dataset | null | Held-out validation set for loss monitoring |
| patience | int | 0 | Early stopping patience in epochs (0 = disabled) |
| minDelta | float | 1e-4 | Minimum improvement needed to reset the patience counter |
| clipGradNorm | float | 0.0 | Global gradient-norm clip threshold (0 = disabled). Recommended 1.0–5.0 for RNN/LSTM. |

$model->train($train,
    epochs:       50,
    batchSize:    128,
    validation:   $val,
    patience:     5,
    clipGradNorm: 1.0,
);
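
How patience and minDelta interact can be sketched in plain PHP. This is an illustration under an assumed rule: an epoch counts as an improvement only when the validation loss drops by more than minDelta. The helper earlyStopEpoch is hypothetical and not part of PML; the library's exact stopping rule may differ.

```php
<?php
/**
 * Returns the 1-based epoch at which training would stop, given the
 * validation loss observed after each epoch (assumed semantics).
 *
 * @param float[] $valLosses validation loss per epoch
 */
function earlyStopEpoch(array $valLosses, int $patience, float $minDelta = 1e-4): int
{
    $best = INF;
    $stale = 0; // epochs since the last sufficient improvement

    foreach (array_values($valLosses) as $i => $loss) {
        if ($best - $loss > $minDelta) {
            $best = $loss;
            $stale = 0;      // a real improvement resets the patience counter
        } else {
            $stale++;
        }
        if ($patience > 0 && $stale >= $patience) {
            return $i + 1;   // stop after this epoch
        }
    }
    return count($valLosses); // patience never ran out
}

$losses = [1.0, 0.8, 0.79, 0.795, 0.79, 0.80];
$stoppedAt = earlyStopEpoch($losses, 2);
```

With patience 2 the run above stops once two consecutive epochs fail to beat the best loss of 0.79; with patience 0 it always runs every epoch.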

stepOnBatch (Dataset $batch, float $clipGradNorm = 0.0): float

Single forward + backward + optimizer step on one pre-made batch. Returns scalar loss. Used for streaming training loops that manage their own epoch logic.
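
The effect of a non-zero clipGradNorm can be illustrated with a plain-PHP sketch of global-norm clipping. It assumes the usual definition (one L2 norm computed over all gradients together, then a single rescale); clipGlobalNorm and the array layout are hypothetical, since PML performs this step in C on its own tensors.

```php
<?php
/**
 * Scales all gradients in place so their combined L2 norm is at most $maxNorm.
 *
 * @param array<string, float[]> $grads parameter name => flat gradient values
 * @return float the global norm before clipping
 */
function clipGlobalNorm(array &$grads, float $maxNorm): float
{
    $sumSq = 0.0;
    foreach ($grads as $values) {
        foreach ($values as $value) {
            $sumSq += $value * $value;
        }
    }
    $norm = sqrt($sumSq);

    // A threshold of 0 disables clipping, matching the documented default.
    if ($maxNorm > 0.0 && $norm > $maxNorm) {
        $scale = $maxNorm / $norm;
        foreach ($grads as &$g) {
            foreach ($g as &$v) {
                $v *= $scale;
            }
            unset($v);
        }
        unset($g);
    }
    return $norm;
}

$grads = ['W' => [3.0, 0.0], 'b' => [4.0]];
$norm = clipGlobalNorm($grads, 1.0); // norm before clipping is 5.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.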

predict (Dataset $dataset): Tensor

Sets all layers to inference mode (disables Dropout, freezes BatchNorm), runs forward pass, then restores training mode. Requires trained() === true.

quantize (int $groupSize = 32): void

Quantizes all Dense layers to INT8 block format. After this call: 4× less weight memory, faster decode, backward() disabled. See INT8 Quantization.

| Method | Description |
| --- | --- |
| forward(Tensor $x): Tensor | Run a forward pass without flipping the training mode. Used internally by custom training loops. |
| backward(Tensor $grad): void | Backpropagation through all layers in reverse. |
| add(Layer $layer): void | Append a layer to the end of the stack. |
| getLayers(): Layer[] | Read-only access to the layer list. |
| getOptimizer(): Optimizer | Returns the model's optimizer. |
| getLoss(): Loss | Returns the model's loss function. |
| trained(): bool | True after train() or markTrained() completes. |
| markTrained(): void | Marks the model as trained without calling train(). Use after streaming training loops. |
| setLogger(LoggerInterface): void | PSR-3 logger for per-epoch loss output. |
| save(string $dir): void | Saves config.json + model.safetensors to the directory. |
| load(string $dir): static | Restores a model from the directory. Weights are mmap'd. |

Layer Interface

interface Layer
{
    public function forward(Tensor $input): Tensor;
    public function backward(Tensor $gradient): Tensor;
    public function getParameters(): array;   // name → Tensor
    public function getGradients(): array;    // name → Tensor
    public function getConfig(): array;       // for ModelStore serialization
    public static function fromConfig(array $config): static;
}

Additional optional interfaces a layer may implement:

| Interface | Methods | Description |
| --- | --- | --- |
| HasTrainingMode | setTraining(bool) | Layers that behave differently during training vs inference (Dropout, BatchNorm) |
| Stateful | getStateDict(prefix) · loadStateDict(dict, prefix) | SafeTensors checkpoint I/O |
| Quantizable | quantize(int) · isQuantized() | INT8 block quantization (Dense only) |

Layers Reference

Linear & Core

| Layer | Constructor | Description |
| --- | --- | --- |
| Dense | int $inputDim, int $outputDim, bool $useBias = true | Fully-connected layer. Y = XW^T + b. He initialization. Implements Stateful + Quantizable. |
| Embedding | int $vocabSize, int $embedDim | Token embedding lookup table. Forward: gather rows by integer indices. Backward: scatter gradients. |
| Flatten | | Reshapes input to [batch, D]. No parameters. |
| Reshape | array $targetShape | Arbitrary reshape (excluding batch dimension). |
| Squeeze | | Removes all size-1 dimensions from the tensor. |
| MLP | int $inputSize, array $hiddenSizes, int $outputSize, string $activation = 'relu' | Pre-made multi-layer perceptron block (Dense + Activation stack). Useful inside transformer blocks. |

Normalization

| Layer | Constructor | Description |
| --- | --- | --- |
| BatchNormalization | int $features, float $momentum = 0.9, float $eps = 1e-5 | Batch normalization for 2-D tensors [batch, features]. Trainable scale+shift. Running stats for inference. Stateful. |
| BatchNorm2D | | Batch normalization for 4-D tensors [N, C, H, W]. Used with Conv2D. |
| LayerNorm | int $dim, float $eps = 1e-5 | Layer normalization over the last dimension. Used in Transformers. Stateful. |
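
What "normalization over the last dimension" means can be sketched for a single feature vector in plain PHP. The function layerNorm and its gamma/beta arguments are illustrative assumptions that follow the standard LayerNorm definition (biased variance, learned scale and shift); they are not PML's API, which operates on tensors in C.

```php
<?php
/**
 * Normalizes one feature vector to zero mean and unit variance,
 * then applies a per-feature scale (gamma) and shift (beta).
 *
 * @param float[] $x
 * @param float[] $gamma
 * @param float[] $beta
 * @return float[]
 */
function layerNorm(array $x, array $gamma, array $beta, float $eps = 1e-5): array
{
    $n = count($x);
    $mean = array_sum($x) / $n;

    $var = 0.0;
    foreach ($x as $v) {
        $var += ($v - $mean) ** 2;
    }
    $var /= $n; // biased variance, as in the usual LayerNorm definition

    $invStd = 1.0 / sqrt($var + $eps);
    $out = [];
    foreach ($x as $i => $v) {
        $out[$i] = $gamma[$i] * ($v - $mean) * $invStd + $beta[$i];
    }
    return $out;
}

$y = layerNorm([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]);
```

Unlike batch normalization, the statistics here come from one sample only, so the result does not depend on batch size or on training vs inference mode.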

Regularization

| Layer | Constructor | Description |
| --- | --- | --- |
| Dropout | float $rate = 0.5 | Drops activations randomly during training. No-op at inference. Implements HasTrainingMode. |
| Noise | float $stddev = 0.1 | Additive Gaussian noise during training. Improves robustness. |

Convolutional

| Layer | Constructor | Description |
| --- | --- | --- |
| Conv2D | int $inC, int $outC, int $kernelSize, int $stride = 1, int $padding = 0, bool $useBias = true | 2-D convolution. OpenBLAS im2col GEMM. Stateful. Backward computes dX, dW, db. |
| DepthwiseConv2D | | Depthwise separable convolution. ~9× fewer FLOPs than regular Conv2D. Used in MobileNet. |
| GlobalAveragePooling2D | | Reduces [N, C, H, W] to [N, C] by averaging over spatial dims. |
| InvertedResidual | | MobileNetV2/V3 inverted residual block: expand → depthwise → project + residual. |
| SEBlock | int $channels, int $reduction = 4 | Squeeze-and-Excitation block. Channel attention: global pool → FC → ReLU → FC → Sigmoid → scale. |

Activation Layers

| Layer | Notes |
| --- | --- |
| ReLU | max(0, x). Zero parameters. AVX2 vectorized. |
| Sigmoid | σ(x) = 1/(1+e⁻ˣ). AVX2 vectorized. |
| Tanh | tanh(x). AVX2 vectorized. |
| Softmax | Numerically stable softmax over last axis. |
| Gelu | Gaussian Error Linear Unit. Used in BERT/GPT. |
| Swish | x · σ(βx). Smooth, non-monotonic. Default β=1. |
| HardSigmoid | Piecewise linear σ approximation. Fast inference. |
| HardSwish | Hard approximation to Swish. Used in MobileNetV3. |
| PReLU | Parametric ReLU with a learnable negative slope α. |
| Activation | Wraps any ActivationFunction object as a layer. |
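
The "numerically stable" qualifier on Softmax usually refers to subtracting the row maximum before exponentiating, which leaves the result unchanged but keeps exp() from overflowing. A minimal plain-PHP sketch of that trick follows; the softmax function here is illustrative and assumes this is the technique meant, since the layer itself runs in C.

```php
<?php
/**
 * Softmax over one vector of logits, using the max-subtraction trick.
 *
 * @param float[] $logits
 * @return float[] probabilities that sum to 1
 */
function softmax(array $logits): array
{
    $max = max($logits);
    // exp($z - $max) is at most 1, so it cannot overflow.
    $exps = array_map(fn(float $z): float => exp($z - $max), $logits);
    $sum = array_sum($exps);
    return array_map(fn(float $e): float => $e / $sum, $exps);
}

// A naive exp(1000.0) would overflow to INF; the shifted version stays finite.
$p = softmax([1000.0, 1001.0, 1002.0]);
```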

Recurrent Layers

| Layer | Constructor | Description |
| --- | --- | --- |
| LSTM | int $inputSize, int $hiddenSize | Long Short-Term Memory. C-level gated cell. Input: [batch, seq, inputSize] → [batch, seq, hiddenSize]. Stateful. |
| RNN | int $inputSize, int $hiddenSize | Vanilla Elman RNN. Faster than LSTM, worse for long sequences. |
| Mamba | int $dModel, int $dState = 16 | State Space Model block (Mamba-style SSM). Linear-time sequence modeling alternative to attention. |

Attention

| Layer | Constructor | Description |
| --- | --- | --- |
| CausalSelfAttention | int $dModel, int $nHeads | Multi-head causal self-attention with three forward paths. |

Forward paths:
Training ($kv = null): full O(T²) causal attention; caches activations for backward.
Prefill ($kv, T > 1): full attention, then populates the KV-cache.
Decode ($kv, T = 1): appends one token to the cache, then attends with a single query over the cached keys and values (O(T) work per step instead of O(T²)).

Signature: forward(Tensor $x, ?KVCache $kv = null, int $layerIdx = 0): Tensor
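
The relationship between the full causal path and the cached decode path can be checked with a small single-head sketch in plain PHP. Everything here is a simplification made for illustration: there are no Q/K/V projections and no head splitting, and the helpers dot, attend and causalAttention are hypothetical names, not PML functions.

```php
<?php
function dot(array $a, array $b): float
{
    $s = 0.0;
    foreach ($a as $i => $v) {
        $s += $v * $b[$i];
    }
    return $s;
}

/** One query attends over the given keys/values: softmax(q·k / sqrt(d)) weighted sum of v. */
function attend(array $q, array $keys, array $values): array
{
    $scale = 1.0 / sqrt(count($q));
    $scores = array_map(fn(array $k): float => dot($q, $k) * $scale, $keys);
    $max = max($scores);
    $w = array_map(fn(float $s): float => exp($s - $max), $scores);
    $sum = array_sum($w);

    $out = array_fill(0, count($values[0]), 0.0);
    foreach ($values as $t => $v) {
        foreach ($v as $j => $vj) {
            $out[$j] += ($w[$t] / $sum) * $vj;
        }
    }
    return $out;
}

/** Full causal path: position t may only see positions 0..t. */
function causalAttention(array $q, array $k, array $v): array
{
    $out = [];
    foreach ($q as $t => $qt) {
        $out[] = attend($qt, array_slice($k, 0, $t + 1), array_slice($v, 0, $t + 1));
    }
    return $out;
}

$q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];
$k = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]];
$v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];

$full = causalAttention($q, $k, $v);

// Decode path: the cache already holds keys/values for tokens 0 and 1.
$cacheK = array_slice($k, 0, 2);
$cacheV = array_slice($v, 0, 2);
$cacheK[] = $k[2];          // append the new token
$cacheV[] = $v[2];
$decoded = attend($q[2], $cacheK, $cacheV); // one query, no recomputation of earlier rows
```

The decoded vector equals the last row of the full causal result, which is why the cache can replace recomputation during generation.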

Optimizers

Namespace: Pml\NeuralNetwork\Optimizers\. All implement Optimizer::step(Layer[] $layers): void.

| Optimizer | Constructor | Notes |
| --- | --- | --- |
| Adam | lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8 | Adaptive moment estimation. Best general-purpose optimizer. |
| AdamW | lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weightDecay=1e-2 | Adam with decoupled weight decay. Preferred over Adam + L2 reg for transformers. |
| AdaMax | lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8 | Adam variant using infinity-norm. More stable for embeddings. |
| AdaGrad | lr=0.01, eps=1e-8 | Accumulates squared gradients. Good for sparse features. Learning rate decays over time. |
| RMSProp | lr=0.001, decay=0.9, eps=1e-8 | Leaky AdaGrad: maintains a moving average of squared gradients. |
| SGD | lr=0.01 | Pure stochastic gradient descent. Fast, simple. |
| Momentum | lr=0.01, momentum=0.9 | SGD with momentum. Faster convergence on ravines. |
| Cyclical | | Cyclical learning rate that oscillates between min and max LR. Can escape local minima. |
| StepDecay | | Decays learning rate by a factor every N steps. |

Optimizer step is fused in C. Each optimizer calls a single FFI function (e.g., tensor_fused_adam_step) per parameter tensor, with no intermediate allocation and all math done in place in C.
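
For reference, the arithmetic that a fused Adam step performs per element can be written out in plain PHP. This sketch follows the standard Adam update with bias correction and uses the defaults from the table above; adamStep is a hypothetical helper, and whether tensor_fused_adam_step matches it term for term is an assumption.

```php
<?php
/**
 * One in-place Adam update over a flat parameter array.
 *
 * @param float[] $w parameters (updated in place)
 * @param float[] $g gradients
 * @param float[] $m first-moment estimates (updated in place)
 * @param float[] $v second-moment estimates (updated in place)
 * @param int     $t 1-based step counter
 */
function adamStep(
    array &$w, array $g, array &$m, array &$v, int $t,
    float $lr = 0.001, float $beta1 = 0.9, float $beta2 = 0.999, float $eps = 1e-8
): void {
    $c1 = 1.0 - $beta1 ** $t; // bias correction for the first moment
    $c2 = 1.0 - $beta2 ** $t; // bias correction for the second moment

    foreach (array_keys($w) as $i) {
        $m[$i] = $beta1 * $m[$i] + (1.0 - $beta1) * $g[$i];
        $v[$i] = $beta2 * $v[$i] + (1.0 - $beta2) * $g[$i] * $g[$i];
        $w[$i] -= $lr * ($m[$i] / $c1) / (sqrt($v[$i] / $c2) + $eps);
    }
}

$w = [1.0, -2.0];
$m = [0.0, 0.0];
$v = [0.0, 0.0];
adamStep($w, [0.5, -0.25], $m, $v, 1);
```

On the first step the corrected moments reduce to g and g², so each weight moves by roughly lr in the direction opposite its gradient, regardless of gradient magnitude.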

Loss Functions

Namespace: Pml\Losses\. All implement Loss, which declares compute(Tensor $preds, Tensor $labels): float and differentiate(...): Tensor.

| Loss | Use Case | Notes |
| --- | --- | --- |
| CategoricalCrossEntropy | Multi-class classification | Labels as class indices or one-hot. Numerically stable log computation in C. |
| BinaryCrossEntropy | Binary classification | Fused BCE + gradient via tensor_fused_bce_loss_and_grad. Avoids two passes. |
| MeanSquaredError | Regression | Σ(y_hat − y)² / N. Gradient: 2(y_hat − y) / N. |
| Hinge | SVM-style | max(0, 1 − y · y_hat). Sparse gradient. |
| Huber | Robust regression | Quadratic near zero, linear for large errors. Configurable δ. |
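
The "quadratic near zero, linear for large errors" behaviour of Huber can be made concrete with a plain-PHP sketch that returns both the mean loss and its gradient. The function huber, the default δ = 1.0, and the averaging over N are assumptions for illustration; the table above does not state PML's default δ or its reduction.

```php
<?php
/**
 * Mean Huber loss and its gradient with respect to the predictions.
 *
 * @param float[] $preds
 * @param float[] $labels
 * @return array{0: float, 1: float[]} [loss, gradient]
 */
function huber(array $preds, array $labels, float $delta = 1.0): array
{
    $n = count($preds);
    $loss = 0.0;
    $grad = [];

    foreach ($preds as $i => $p) {
        $r = $p - $labels[$i]; // residual
        if (abs($r) <= $delta) {
            $loss += 0.5 * $r * $r;                      // quadratic region
            $grad[$i] = $r / $n;
        } else {
            $loss += $delta * (abs($r) - 0.5 * $delta);  // linear region
            $grad[$i] = $delta * ($r <=> 0.0) / $n;      // gradient magnitude capped at delta
        }
    }
    return [$loss / $n, $grad];
}

[$loss, $grad] = huber([0.5, 3.0], [0.0, 0.0], 1.0);
```

The second sample has a residual of 3.0, yet its gradient is capped at δ/N, which is what makes the loss less sensitive to outliers than MeanSquaredError.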

INT8 Quantization

PML supports symmetric INT8 block quantization (Q8_0-class) for Dense layers. Each group of 32 weights shares one float32 scale factor. The forward kernel uses AVX2 fused int8→fp32 dot products — no temporary fp32 weight allocation per inference call.
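
The scheme described above can be sketched in plain PHP to show the data layout and the scale-once-per-group dot product. This is a slow reference illustration only: quantizeInt8 and dotInt8 are hypothetical helpers, the scale rule max|w| / 127 is the common symmetric choice and is assumed here, and the real kernel is AVX2 code in C.

```php
<?php
/**
 * Symmetric per-group INT8 quantization: each group stores int8 values
 * plus one scale, so that w ≈ q * scale.
 *
 * @param float[] $weights
 * @return array{q: int[][], scales: float[]}
 */
function quantizeInt8(array $weights, int $groupSize = 32): array
{
    $q = [];
    $scales = [];
    foreach (array_chunk($weights, $groupSize) as $group) {
        $maxAbs = max(array_map('abs', $group));
        $scale = $maxAbs > 0.0 ? $maxAbs / 127.0 : 1.0; // avoid dividing by zero
        $scales[] = $scale;
        $q[] = array_map(fn(float $w): int => (int) round($w / $scale), $group);
    }
    return ['q' => $q, 'scales' => $scales];
}

/** Dot product against quantized weights without rebuilding the fp32 weights. */
function dotInt8(array $x, array $quant, int $groupSize = 32): float
{
    $sum = 0.0;
    foreach ($quant['q'] as $g => $group) {
        $acc = 0.0;
        foreach ($group as $j => $qv) {
            $acc += $qv * $x[$g * $groupSize + $j];
        }
        $sum += $acc * $quant['scales'][$g]; // apply the scale once per group
    }
    return $sum;
}

$w = [];
$x = [];
for ($i = 0; $i < 64; $i++) {
    $w[] = 0.1 * sin($i);
    $x[] = cos($i);
}

$quant = quantizeInt8($w);
$exact = 0.0;
foreach ($w as $i => $wi) {
    $exact += $wi * $x[$i];
}
$approx = dotInt8($x, $quant);
```

Each weight is off by at most half a quantization step (scale / 2), so the error of the dot product is bounded by that amount times the sum of |x|.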

Memory impact

7B model: ~28 GB fp32 → ~7.2 GB INT8.
Compression ratio typically 3.9–4.0×.

Throughput

~4× decode token throughput improvement vs fp32 on CPU (memory-bandwidth bound).

Accuracy

Perplexity degrades <1% for groupSize=32. Larger models degrade less.

// Train in fp32
$model->train($dataset, epochs: 20);
$model->save('ckpt/mnist');

// Load and quantize for deployment
$model = Sequential::load('ckpt/mnist');
$model->quantize(groupSize: 32);   // all Dense layers → INT8

// Inference is now 4× faster / uses 4× less memory
$preds = $model->predict($testSet);

// Count the Dense layers that were quantized
$quantized = 0;
foreach ($model->getLayers() as $layer) {
    if ($layer instanceof Dense && $layer->isQuantized()) {
        $quantized++;
    }
}

Quantization is irreversible for a loaded model. After quantize(), backward() throws LogicException. Quantized checkpoints can still be saved (getStateDict() dequantizes for export).

Initializers

Namespace: Pml\NeuralNetwork\Initializers\. Used by layers that accept a custom initializer.

| Initializer | Formula | Notes |
| --- | --- | --- |
| He | N(0, √(2/fan_in)) | Default for ReLU networks |
| Xavier1 | U(−√(6/(fan_in+fan_out)), +√(...)) | Glorot uniform, for tanh |
| Xavier2 | N(0, √(2/(fan_in+fan_out))) | Glorot normal |
| LeCun | N(0, √(1/fan_in)) | For SELU networks |
| Normal | N(mean, stddev) | Custom Gaussian |
| Uniform | U(min, max) | Custom uniform |
| Constant | fill(value) | Fixed value (e.g., for bias init) |
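
Two of the formulas above can be sketched in plain PHP, reading the second argument of N(·, ·) as a standard deviation (as the Normal row does). The helpers randn, heNormal and xavierUniform are hypothetical, the [fanOut][fanIn] weight layout is an assumption, and PML's initializers fill tensors in C rather than PHP arrays.

```php
<?php
/** One sample from N(0, 1) via the Box-Muller transform. */
function randn(): float
{
    // Keep $u1 strictly inside (0, 1) so log() stays finite.
    $u1 = (mt_rand() + 1) / (mt_getrandmax() + 2);
    $u2 = mt_rand() / mt_getrandmax();
    return sqrt(-2.0 * log($u1)) * cos(2.0 * M_PI * $u2);
}

/** He normal: zero mean, stddev = sqrt(2 / fan_in). */
function heNormal(int $fanIn, int $fanOut): array
{
    $std = sqrt(2.0 / $fanIn);
    $w = [];
    for ($o = 0; $o < $fanOut; $o++) {
        for ($i = 0; $i < $fanIn; $i++) {
            $w[$o][$i] = $std * randn();
        }
    }
    return $w;
}

/** Glorot/Xavier uniform: U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out)). */
function xavierUniform(int $fanIn, int $fanOut): array
{
    $limit = sqrt(6.0 / ($fanIn + $fanOut));
    $w = [];
    for ($o = 0; $o < $fanOut; $o++) {
        for ($i = 0; $i < $fanIn; $i++) {
            $w[$o][$i] = $limit * (2.0 * mt_rand() / mt_getrandmax() - 1.0);
        }
    }
    return $w;
}

mt_srand(7); // fixed seed so the example is repeatable
$he = heNormal(200, 50);
$xavier = xavierUniform(4, 6);
```

He scales by fan_in only, which keeps activation variance roughly constant through ReLU layers; Xavier balances fan_in and fan_out, which suits symmetric activations such as tanh.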

Activation Functions

Namespace: Pml\NeuralNetwork\ActivationFunctions\. Used with the Activation layer wrapper.

| Activation | Formula |
| --- | --- |
| ELU | x if x > 0, else α(eˣ − 1) |
| GELU | x·Φ(x), using a Gaussian CDF approximation |
| LeakyReLU | max(αx, x), with a fixed negative slope α |
| SELU | λ·x if x > 0, else λ·α(eˣ − 1); self-normalizing |
| SiLU | x·σ(x), i.e. Swish with β=1 |
| SoftPlus | log(1 + eˣ), a smooth ReLU |
| Softsign | x / (1 + \|x\|) |
| ThresholdedReLU | x if x > θ, else 0 |
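
A few of these formulas are easy to get subtly wrong, so here is a plain-PHP sketch of ELU, SELU and SoftPlus as scalar functions. The SELU constants are the standard published values and are assumed to match PML's; the function names are illustrative and do not correspond to the ActivationFunctions classes.

```php
<?php
// Standard SELU constants (assumed to match the library's values).
const SELU_LAMBDA = 1.0507009873554805;
const SELU_ALPHA  = 1.6732632423543772;

/** ELU: identity for positive inputs, saturating exponential for negative ones. */
function elu(float $x, float $alpha = 1.0): float
{
    return $x > 0.0 ? $x : $alpha * (exp($x) - 1.0);
}

/** SELU is ELU with a fixed alpha, scaled by lambda on both branches. */
function selu(float $x): float
{
    return SELU_LAMBDA * elu($x, SELU_ALPHA);
}

/** SoftPlus: log(1 + e^x), rewritten so that large x does not overflow exp(). */
function softplus(float $x): float
{
    return max($x, 0.0) + log1p(exp(-abs($x)));
}

$values = [selu(1.0), selu(-1000.0), softplus(0.0), softplus(1000.0)];
```

For large negative inputs SELU saturates at −λ·α, roughly −1.758, and the rewritten SoftPlus returns x itself once eˣ dominates.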

SLM Utilities

Namespace: Pml\SLM\. Small Language Model utilities.

| Class | Description |
| --- | --- |
| BpeTrainer | Trains a BPE vocabulary from raw text corpus. Produces a tokenizer.json compatible with Pml\Inference\Tokenizer. |
| TrainableEmbedding | Embedding layer with positional encoding. Combines token + positional embeddings in one forward call. Stateful. |