Neural Networks
The Sequential model container, 29 layer types, 9 optimizers, 5 loss functions,
INT8 quantization, and pre-made activation/initializer modules.
All weight math runs in C via FFI; PHP only moves layer handles.
Sequential
Pml\NeuralNetwork\Sequential is the primary model container.
Implements TrainableWithOptions, Persistable, and Verbose.
$model = new Sequential(
    layers: [new Dense(784, 256), new ReLU(), new Dense(256, 10), new Softmax()],
    lossFn: new CategoricalCrossEntropy(),
    optimizer: new AdamW(lr: 3e-4),
);
Full training loop with optional validation, early stopping, and gradient clipping.
| Option | Type | Default | Description |
|---|---|---|---|
| epochs | int | 10 | Number of full passes over the dataset |
| batchSize | int | 32 | Mini-batch size |
| validation | Dataset | null | Held-out validation set for loss monitoring |
| patience | int | 0 | Number of epochs without improvement before early stopping (0 = disabled) |
| minDelta | float | 1e-4 | Minimum improvement required to reset the patience counter |
| clipGradNorm | float | 0.0 | Global gradient-norm clip threshold (0 = disabled). Recommended 1.0–5.0 for RNN/LSTM. |
$model->train($train,
    epochs: 50,
    batchSize: 128,
    validation: $val,
    patience: 5,
    clipGradNorm: 1.0,
);
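The patience, minDelta, and clipGradNorm options can be illustrated in plain PHP. This is a minimal sketch of the usual semantics, not PML's C implementation: the function names `earlyStopEpoch` and `clipGlobalNorm` are hypothetical, and the exact comparison PML uses for "improvement" is an assumption.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of the Pml API.

/**
 * Returns the 0-based epoch at which training would stop early,
 * or null if the patience counter never runs out.
 */
function earlyStopEpoch(array $valLosses, int $patience, float $minDelta = 1e-4): ?int
{
    if ($patience <= 0) {
        return null; // 0 = early stopping disabled
    }
    $best = INF;
    $wait = 0;
    foreach ($valLosses as $epoch => $loss) {
        if ($best - $loss > $minDelta) {
            $best = $loss; // improved by more than minDelta: reset the counter
            $wait = 0;
        } elseif (++$wait >= $patience) {
            return $epoch;
        }
    }
    return null;
}

/**
 * Rescales all gradients so that their combined (global) L2 norm
 * does not exceed $maxNorm. $grads holds one flat array per parameter tensor.
 */
function clipGlobalNorm(array $grads, float $maxNorm): array
{
    if ($maxNorm <= 0.0) {
        return $grads; // 0 = clipping disabled
    }
    $sumSq = 0.0;
    foreach ($grads as $g) {
        foreach ($g as $v) {
            $sumSq += $v * $v;
        }
    }
    $norm = sqrt($sumSq);
    if ($norm <= $maxNorm) {
        return $grads;
    }
    $scale = $maxNorm / $norm;
    return array_map(
        fn(array $g): array => array_map(fn($v) => $v * $scale, $g),
        $grads
    );
}
```

Clipping by the global norm keeps the direction of the full gradient vector intact, which is why a single threshold can be shared across all layers.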
Single forward + backward + optimizer step on one pre-made batch. Returns scalar loss. Used for streaming training loops that manage their own epoch logic.
Sets all layers to inference mode (disables Dropout, freezes BatchNorm), runs forward pass, then restores training mode. Requires trained() === true.
Quantizes all Dense layers to INT8 block format. After this call: 4× less weight memory, faster decode, backward() disabled. See INT8 Quantization.
| Method | Description |
|---|---|
| forward(Tensor $x): Tensor | Runs a forward pass without switching training mode. Intended for custom training loops. |
| backward(Tensor $grad): void | Backpropagation through all layers in reverse order. |
| add(Layer $layer): void | Appends a layer to the end of the stack. |
| getLayers(): Layer[] | Read-only access to the layer list. |
| getOptimizer(): Optimizer | Returns the model's optimizer. |
| getLoss(): Loss | Returns the model's loss function. |
| trained(): bool | True after train() or markTrained() completes. |
| markTrained(): void | Marks the model as trained without calling train(). Use after streaming training loops. |
| setLogger(LoggerInterface): void | PSR-3 logger for per-epoch loss output. |
| save(string $dir): void | Saves config.json and model.safetensors to the directory. |
| load(string $dir): static | Restores the model from the directory. Weights are mmap'd. |
Layer Interface
interface Layer
{
    public function forward(Tensor $input): Tensor;
    public function backward(Tensor $gradient): Tensor;
    public function getParameters(): array; // name → Tensor
    public function getGradients(): array;  // name → Tensor
    public function getConfig(): array;     // for ModelStore serialization
    public static function fromConfig(array $config): static;
}
Additional optional interfaces a layer may implement:
| Interface | Methods | Description |
|---|---|---|
| HasTrainingMode | setTraining(bool) | Layers that behave differently during training vs inference (Dropout, BatchNorm) |
| Stateful | getStateDict(prefix) · loadStateDict(dict, prefix) | SafeTensors checkpoint I/O |
| Quantizable | quantize(int) · isQuantized() | INT8 block quantization (Dense only) |
Layers Reference
Linear & Core
| Layer | Constructor | Description |
|---|---|---|
| Dense | int $inputDim, int $outputDim, bool $useBias = true | Fully-connected layer. Y = XW^T + b. He initialization. Implements Stateful + Quantizable. |
| Embedding | int $vocabSize, int $embedDim | Token embedding lookup table. Forward: gather rows by integer indices. Backward: scatter gradients. |
| Flatten | — | Reshapes input to [batch, D]. No parameters. |
| Reshape | array $targetShape | Arbitrary reshape (excluding batch dimension). |
| Squeeze | — | Removes all size-1 dimensions from the tensor. |
| MLP | int $inputSize, array $hiddenSizes, int $outputSize, string $activation = 'relu' | Pre-made multi-layer perceptron block (Dense + Activation stack). Useful inside transformer blocks. |
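The Dense formula Y = XW^T + b can be written out on nested PHP arrays. The sketch below is for illustration only: `denseForward` is a hypothetical name, and the real layer runs this math in C on Tensor handles. It assumes W is stored as [outputDim][inputDim], which is what the W^T in the formula implies.

```php
<?php
/**
 * Illustrative Y = X·W^T + b on plain arrays (not the Pml implementation).
 *
 * @param float[][]    $x [batch][inputDim]
 * @param float[][]    $w [outputDim][inputDim]
 * @param float[]|null $b [outputDim], or null when the bias is disabled
 * @return float[][]   [batch][outputDim]
 */
function denseForward(array $x, array $w, ?array $b = null): array
{
    $y = [];
    foreach ($x as $i => $row) {
        foreach ($w as $j => $wRow) {
            $sum = $b[$j] ?? 0.0; // start from the bias, or 0 without one
            foreach ($row as $k => $v) {
                $sum += $v * $wRow[$k]; // dot product of input row and weight row
            }
            $y[$i][$j] = $sum;
        }
    }
    return $y;
}
```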
Normalization
| Layer | Constructor | Description |
|---|---|---|
| BatchNormalization | int $features, float $momentum = 0.9, float $eps = 1e-5 | Batch normalization for 2-D tensors [batch, features]. Trainable scale+shift. Running stats for inference. Stateful. |
| BatchNorm2D | — | Batch normalization for 4-D tensors [N, C, H, W]. Used with Conv2D. |
| LayerNorm | int $dim, float $eps = 1e-5 | Layer normalization over the last dimension. Used in Transformers. Stateful. |
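Normalizing "over the last dimension" means each row is standardized with its own mean and variance. The following plain-PHP sketch shows only that step; the names `layerNormRow` and `layerNorm` are hypothetical, and any learnable scale and shift the real layer may apply are deliberately left out.

```php
<?php
/** Standardizes one feature vector: (x - mean) / sqrt(var + eps). */
function layerNormRow(array $x, float $eps = 1e-5): array
{
    $n = count($x);
    $mean = array_sum($x) / $n;
    $var = 0.0;
    foreach ($x as $v) {
        $var += ($v - $mean) ** 2;
    }
    $var /= $n; // biased (population) variance
    $inv = 1.0 / sqrt($var + $eps);
    return array_map(fn($v) => ($v - $mean) * $inv, $x);
}

/** Applies layerNormRow independently to every row of a [batch][dim] array. */
function layerNorm(array $batch, float $eps = 1e-5): array
{
    return array_map(fn(array $row): array => layerNormRow($row, $eps), $batch);
}
```

Unlike batch normalization, nothing here depends on the other rows in the batch, so the computation is the same during training and inference.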
Regularization
| Layer | Constructor | Description |
|---|---|---|
| Dropout | float $rate = 0.5 | Drops activations randomly during training. No-op at inference. Implements HasTrainingMode. |
| Noise | float $stddev = 0.1 | Additive Gaussian noise during training. Improves robustness. |
Convolutional
| Layer | Constructor | Description |
|---|---|---|
| Conv2D | int $inC, int $outC, int $kernelSize, int $stride = 1, int $padding = 0, bool $useBias = true | 2-D convolution. OpenBLAS im2col GEMM. Stateful. Backward computes dX, dW, db. |
| DepthwiseConv2D | — | Depthwise separable convolution. Roughly 8–9× fewer FLOPs than a regular Conv2D with 3×3 kernels. Used in MobileNet. |
| GlobalAveragePooling2D | — | Reduces [N, C, H, W] to [N, C] by averaging over spatial dims. |
| InvertedResidual | — | MobileNetV2/V3 inverted residual block: expand → depthwise → project + residual. |
| SEBlock | int $channels, int $reduction = 4 | Squeeze-and-Excitation block. Channel attention: global pool → FC → ReLU → FC → Sigmoid → scale. |
Activation Layers
| Layer | Notes |
|---|---|
| ReLU | max(0, x). Zero parameters. AVX2 vectorized. |
| Sigmoid | σ(x) = 1/(1+e⁻ˣ). AVX2 vectorized. |
| Tanh | tanh(x). AVX2 vectorized. |
| Softmax | Numerically stable softmax over last axis. |
| Gelu | Gaussian Error Linear Unit. Used in BERT/GPT. |
| Swish | x · σ(βx). Smooth, non-monotonic. Default β=1. |
| HardSigmoid | Piecewise linear σ approximation. Fast inference. |
| HardSwish | Hard approximation to Swish. Used in MobileNetV3. |
| PReLU | Parametric ReLU with a learnable negative slope α. |
| Activation | Wraps any ActivationFunction object as a layer. |
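"Numerically stable" softmax usually refers to subtracting the row maximum before exponentiating, which leaves the result unchanged but prevents overflow. A minimal sketch on a plain PHP array follows; `stableSoftmax` is a hypothetical name, and that the C kernel uses this exact trick is an assumption.

```php
<?php
/**
 * Softmax over one vector of logits, computed as exp(z - max(z)) / sum.
 * Shifting by the maximum keeps every exponent <= 0, so exp() cannot overflow.
 */
function stableSoftmax(array $logits): array
{
    $max = max($logits);
    $exps = array_map(fn($z) => exp($z - $max), $logits);
    $sum = array_sum($exps);
    return array_map(fn($e) => $e / $sum, $exps);
}
```

A naive exp(1000.0) evaluates to INF in PHP, while the shifted form handles such logits without trouble.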
Recurrent Layers
| Layer | Constructor | Description |
|---|---|---|
| LSTM | int $inputSize, int $hiddenSize | Long Short-Term Memory. C-level gated cell. Input: [batch, seq, inputSize] → [batch, seq, hiddenSize]. Stateful. |
| RNN | int $inputSize, int $hiddenSize | Vanilla Elman RNN. Faster than LSTM, worse for long sequences. |
| Mamba | int $dModel, int $dState = 16 | State Space Model block (Mamba-style SSM). Linear-time sequence modeling alternative to attention. |
Attention
| Layer | Constructor | Description |
|---|---|---|
| CausalSelfAttention | int $dModel, int $nHeads | Multi-head causal self-attention with three forward paths, selected by the $kv argument and the sequence length T. |

Signature: forward(Tensor $x, ?KVCache $kv = null, int $layerIdx = 0): Tensor
• Training ($kv = null): full O(T²) causal attention; caches activations for backward.
• Prefill ($kv set, T > 1): full attention, and populates the KV-cache.
• Decode ($kv set, T = 1): appends one token to the cache and attends with a single query over the cached keys and values, avoiding the O(T²) recomputation.
Optimizers
Namespace: Pml\NeuralNetwork\Optimizers\. All implement Optimizer::step(Layer[] $layers): void.
| Optimizer | Constructor | Notes |
|---|---|---|
| Adam | lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8 | Adaptive moment estimation. A strong general-purpose default. |
| AdamW | lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weightDecay=1e-2 | Adam with decoupled weight decay. Preferred over Adam + L2 regularization for transformers. |
| AdaMax | lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8 | Adam variant using the infinity norm. More stable for embeddings. |
| AdaGrad | lr=0.01, eps=1e-8 | Accumulates squared gradients. Good for sparse features. Effective learning rate decays over time. |
| RMSProp | lr=0.001, decay=0.9, eps=1e-8 | Leaky AdaGrad: maintains a moving average of squared gradients. |
| SGD | lr=0.01 | Pure stochastic gradient descent. Fast, simple. |
| Momentum | lr=0.01, momentum=0.9 | SGD with momentum. Faster convergence in ravines. |
| Cyclical | — | Cyclical learning rate: oscillates between min and max LR. Can escape local minima. |
| StepDecay | — | Decays the learning rate by a factor every N steps. |
A fused C kernel (tensor_fused_adam_step) runs once per parameter tensor: no intermediate allocation, all math in place in C.
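For reference, the per-element math of an AdamW update can be sketched in plain PHP. This is not the source of tensor_fused_adam_step, and `adamWStep` is a hypothetical name. The defaults mirror the AdamW row in the table above, and setting weightDecay to 0 gives the plain Adam update.

```php
<?php
/**
 * One in-place AdamW step over a flat parameter array (illustrative only).
 *
 * @param float[] $w parameters, updated in place
 * @param float[] $g gradients for this step
 * @param float[] $m first-moment estimates, updated in place
 * @param float[] $v second-moment estimates, updated in place
 * @param int     $t 1-based step counter, used for bias correction
 */
function adamWStep(
    array &$w, array $g, array &$m, array &$v, int $t,
    float $lr = 1e-3, float $beta1 = 0.9, float $beta2 = 0.999,
    float $eps = 1e-8, float $weightDecay = 1e-2
): void {
    $c1 = 1.0 - $beta1 ** $t; // bias-correction terms
    $c2 = 1.0 - $beta2 ** $t;
    foreach ($w as $i => $wi) {
        $m[$i] = $beta1 * $m[$i] + (1.0 - $beta1) * $g[$i];
        $v[$i] = $beta2 * $v[$i] + (1.0 - $beta2) * $g[$i] ** 2;
        $mHat = $m[$i] / $c1;
        $vHat = $v[$i] / $c2;
        // Decoupled decay: the weight-decay term is added outside the
        // adaptive ratio rather than being folded into the gradient.
        $w[$i] = $wi - $lr * ($mHat / (sqrt($vHat) + $eps) + $weightDecay * $wi);
    }
}
```

Because everything is updated element by element, the whole step fits into a single pass over the tensor, which is what makes a fused kernel possible.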
Loss Functions
Namespace: Pml\Losses\. All implement Loss: compute(Tensor $preds, Tensor $labels): float + differentiate(...): Tensor.
| Loss | Use Case | Notes |
|---|---|---|
| CategoricalCrossEntropy | Multi-class classification | Labels as class indices or one-hot. Numerically stable log computation in C. |
| BinaryCrossEntropy | Binary classification | Fused BCE + gradient via tensor_fused_bce_loss_and_grad. Avoids two passes. |
| MeanSquaredError | Regression | Σ(y_hat − y)² / N. Gradient per element: 2(y_hat − y) / N. |
| Hinge | SVM-style | max(0, 1 − y · y_hat). Sparse gradient. |
| Huber | Robust regression | Quadratic near zero, linear for large errors. Configurable δ. |
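The Huber row can be made concrete with the standard definition: 0.5·r² for |r| ≤ δ and δ·(|r| − 0.5·δ) otherwise, where r is the residual. The sketch below uses plain arrays and averages over N. The function names are hypothetical, and the averaging is an assumption made to match the MeanSquaredError row.

```php
<?php
/** Mean Huber loss over paired predictions and targets (illustrative). */
function huberLoss(array $pred, array $target, float $delta = 1.0): float
{
    $total = 0.0;
    foreach ($pred as $i => $p) {
        $r = abs($p - $target[$i]);
        $total += ($r <= $delta)
            ? 0.5 * $r * $r                  // quadratic region near zero
            : $delta * ($r - 0.5 * $delta);  // linear region for large errors
    }
    return $total / count($pred);
}

/** Gradient of huberLoss with respect to each prediction. */
function huberGrad(array $pred, array $target, float $delta = 1.0): array
{
    $n = count($pred);
    $grad = [];
    foreach ($pred as $i => $p) {
        $r = $p - $target[$i];
        // Inside the quadratic region the gradient is r; outside it is
        // clamped to ±delta, which limits the influence of outliers.
        $g = (abs($r) <= $delta) ? $r : $delta * ($r > 0 ? 1.0 : -1.0);
        $grad[$i] = $g / $n;
    }
    return $grad;
}
```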
INT8 Quantization
PML supports symmetric INT8 block quantization (Q8_0-class) for Dense layers.
Each group of 32 weights shares one float32 scale factor. The forward kernel uses AVX2 fused
int8→fp32 dot products — no temporary fp32 weight allocation per inference call.
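The block scheme described above can be sketched in plain PHP: each group of weights gets one scale chosen so that the largest magnitude maps to ±127. The names `quantizeQ8` and `dequantizeQ8` are hypothetical, and the rounding mode and scale formula are assumptions based on common Q8_0-style formats rather than PML's C kernel.

```php
<?php
/**
 * Symmetric INT8 block quantization of a flat weight array (illustrative).
 * Returns one ['scale' => float, 'q' => int[]] entry per group.
 */
function quantizeQ8(array $weights, int $groupSize = 32): array
{
    $blocks = [];
    foreach (array_chunk($weights, $groupSize) as $group) {
        $maxAbs = max(array_map('abs', $group));
        // Symmetric: no zero point, the largest magnitude maps to ±127.
        $scale = $maxAbs > 0.0 ? $maxAbs / 127.0 : 1.0;
        $q = array_map(fn($w): int => (int) round($w / $scale), $group);
        $blocks[] = ['scale' => $scale, 'q' => $q];
    }
    return $blocks;
}

/** Reconstructs approximate float weights from the quantized blocks. */
function dequantizeQ8(array $blocks): array
{
    $out = [];
    foreach ($blocks as $block) {
        foreach ($block['q'] as $q) {
            $out[] = $q * $block['scale'];
        }
    }
    return $out;
}
```

With round-to-nearest, the reconstruction error of any weight is at most half of its block's scale, so small groups keep the error tied to the local weight range.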
Memory impact
7B model: ~28 GB fp32 → ~7.2 GB INT8.
Compression ratio typically 3.9–4.0×.
Throughput
~4× decode token throughput improvement vs fp32 on CPU (memory-bandwidth bound).
Accuracy
Perplexity degrades <1% for groupSize=32. Larger models degrade less.
// Train in fp32
$model->train($dataset, epochs: 20);
$model->save('ckpt/mnist');

// Load and quantize for deployment
$model = Sequential::load('ckpt/mnist');
$model->quantize(groupSize: 32); // all Dense layers → INT8

// Inference is now 4× faster / uses 4× less memory
$preds = $model->predict($testSet);

// Check compression
foreach ($model->getLayers() as $layer) {
    if ($layer instanceof Dense && $layer->isQuantized()) {
        // Dense layer is quantized
    }
}
After quantize(), backward() throws LogicException.
Quantized checkpoints can still be saved (getStateDict() dequantizes for export).
Initializers
Namespace: Pml\NeuralNetwork\Initializers\. Used by layers that accept a custom initializer.
| Initializer | Formula | Notes |
|---|---|---|
| He | N(0, √(2/fan_in)) | Default for ReLU networks |
| Xavier1 | U(−√(6/(fan_in+fan_out)), +√(6/(fan_in+fan_out))) | Glorot uniform, for tanh |
| Xavier2 | N(0, √(2/(fan_in+fan_out))) | Glorot normal |
| LeCun | N(0, √(1/fan_in)) | For SELU networks |
| Normal | N(mean, stddev) | Custom Gaussian |
| Uniform | U(min, max) | Custom uniform |
| Constant | fill(value) | Fixed value (e.g., for bias init) |
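As an illustration of the He row, the sketch below draws a weight matrix from N(0, σ) with σ = √(2/fan_in), reading the second argument of N as a standard deviation. `heNormal` is a hypothetical name, and the Box-Muller sampler on top of mt_rand() is used only to keep the example self-contained. It says nothing about how PML generates its random numbers.

```php
<?php
/**
 * Samples a [fanOut][fanIn] weight matrix with He-normal initialization.
 * Illustrative only; uses PHP's Mersenne Twister and a Box-Muller transform.
 */
function heNormal(int $fanIn, int $fanOut): array
{
    $std = sqrt(2.0 / $fanIn);
    $w = [];
    for ($r = 0; $r < $fanOut; $r++) {
        for ($c = 0; $c < $fanIn; $c++) {
            // u1 lies in (0, 1] so that log(u1) is always finite.
            $u1 = (mt_rand() + 1) / (mt_getrandmax() + 1);
            $u2 = mt_rand() / mt_getrandmax();
            $z = sqrt(-2.0 * log($u1)) * cos(2.0 * M_PI * $u2); // ~ N(0, 1)
            $w[$r][$c] = $std * $z;
        }
    }
    return $w;
}
```

Scaling by √(2/fan_in) keeps the variance of ReLU activations roughly constant from layer to layer, which is why it is the default for ReLU networks.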
Activation Functions
Namespace: Pml\NeuralNetwork\ActivationFunctions\. Used with the Activation layer wrapper.
| Activation | Formula |
|---|---|
| ELU | x if x > 0, else α(eˣ − 1) |
| GELU | x·Φ(x), using an approximation of the Gaussian CDF Φ |
| LeakyReLU | max(αx, x), with a fixed negative slope α |
| SELU | λ·x if x > 0, else λ·α(eˣ − 1); self-normalizing |
| SiLU | x·σ(x), i.e. Swish with β=1 |
| SoftPlus | log(1 + eˣ), a smooth ReLU |
| Softsign | x / (1 + \|x\|) |
| ThresholdedReLU | x if x > θ, else 0 |
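Several of these formulas translate directly into scalar PHP functions. The sketch below is illustrative rather than PML's implementation: the function names are hypothetical, the SELU constants are the standard published values, and the default slope of 0.01 for LeakyReLU is an assumption.

```php
<?php
// Standard SELU constants (Klambauer et al.); assumed, not read from Pml.
const SELU_LAMBDA = 1.0507009873554805;
const SELU_ALPHA = 1.6732632423543772;

/** ELU: identity for positive inputs, α(eˣ − 1) otherwise. */
function elu(float $x, float $alpha = 1.0): float
{
    return $x > 0.0 ? $x : $alpha * (exp($x) - 1.0);
}

/** SELU: a scaled ELU with fixed λ and α. */
function selu(float $x): float
{
    return SELU_LAMBDA * elu($x, SELU_ALPHA);
}

/** LeakyReLU: max(αx, x), valid for 0 < α < 1. */
function leakyRelu(float $x, float $alpha = 0.01): float
{
    return max($alpha * $x, $x);
}

/** SoftPlus, rewritten as max(x, 0) + log(1 + e^(−|x|)) to avoid overflow. */
function softplus(float $x): float
{
    return max($x, 0.0) + log(1.0 + exp(-abs($x)));
}
```

The rewritten SoftPlus is algebraically identical to log(1 + eˣ) but never exponentiates a large positive number.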
SLM Utilities
Namespace: Pml\SLM\. Small Language Model utilities.
| Class | Description |
|---|---|
| BpeTrainer | Trains a BPE vocabulary from a raw text corpus. Produces a tokenizer.json compatible with Pml\Inference\Tokenizer. |
| TrainableEmbedding | Embedding layer with positional encoding. Combines token + positional embeddings in one forward call. Stateful. |