Tensor

The fundamental data structure of PML. A Tensor is a PHP object that holds a pointer to a C-allocated, 64-byte aligned, contiguous float32 (or int32/int64) buffer. All math is executed in C. PHP does not copy the raw bytes unless you convert explicitly, for example with fromArray() or toFlatArray().

Memory Model

Each Tensor wraps a TensorC* pointer. The C struct holds a float* (64-byte aligned via posix_memalign), shape info, stride info, total element count, and a reference-count field used for views. PHP's __destruct() decrements the ref-count; memory is freed when it reaches zero.
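The ref-count lifecycle described above can be mimicked in plain PHP. This is only a sketch: SharedBuffer and Handle are hypothetical stand-ins for the C struct and the Tensor object, and clearing an array stands in for free() on the C side.

```php
<?php
// Hypothetical stand-in for the C-side struct: a buffer plus a ref-count field.
final class SharedBuffer
{
    public int $refCount = 0;
    public bool $freed = false;

    public function __construct(public array $data) {}
}

// Hypothetical handle playing the role of a Tensor or a view of one.
final class Handle
{
    public function __construct(private SharedBuffer $buf)
    {
        $this->buf->refCount++;            // every handle holds one reference
    }

    public function view(): self
    {
        return new self($this->buf);       // same buffer, no data copied
    }

    public function __destruct()
    {
        if (--$this->buf->refCount === 0) {
            $this->buf->data  = [];        // stands in for free() on the C side
            $this->buf->freed = true;
        }
    }
}

$buf  = new SharedBuffer([1.0, 2.0, 3.0]);
$base = new Handle($buf);
$view = $base->view();

unset($base);                              // the view is still alive, so the buffer survives
$aliveAfterBase = !$buf->freed;
unset($view);                              // last handle gone: buffer released
$freedAfterView = $buf->freed;
```

The point of the sketch is that a view can outlive the tensor it was created from, because the buffer is released only when the last handle is destroyed.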

Zero-Copy View Rule: view(), row(), col(), slice(), and transpose() return views; they do not allocate new C memory. Operations that require contiguous memory (BLAS calls, saves) will call contiguous() internally or throw.

Creation

zeros static (int ...$shape): Tensor

Allocates a zero-filled tensor with the given shape.

$t = Tensor::zeros(3, 4);   // [3, 4] of 0.0

ones static (int ...$shape): Tensor

Allocates a one-filled tensor.

$t = Tensor::ones(8);

randomNormal static (array $shape, float $mean = 0.0, float $stddev = 1.0, ?CData $arena = null): Tensor

Samples from N(mean, stddev²). Used by Dense, Embedding, and most initializers. Can allocate from an Arena.

$W = Tensor::randomNormal([512, 256], 0.0, sqrt(2.0/256));

randomUniform static (array $shape, float $min = 0.0, float $max = 1.0, ?CData $arena = null): Tensor

Uniform random in [min, max).

range static (float $start, float $end, float $step = 1.0): Tensor

1-D arange, equivalent to NumPy np.arange(start, end, step): values begin at $start and stop before $end.

linspace static (float $start, float $end, int $steps): Tensor

Evenly spaced values including both endpoints.
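The difference between the two generators is easiest to see on plain PHP arrays. The helpers below (arangeSketch, linspaceSketch) are hypothetical reference functions written for this page, not part of PML, and they assume a non-zero step.

```php
<?php
// Half-open range, mirroring np.arange(start, end, step) for a non-zero step.
function arangeSketch(float $start, float $end, float $step = 1.0): array
{
    $n = max(0, (int) ceil(($end - $start) / $step));
    $out = [];
    for ($i = 0; $i < $n; $i++) {
        $out[] = $start + $i * $step;
    }
    return $out;
}

// $steps evenly spaced values with both endpoints included.
function linspaceSketch(float $start, float $end, int $steps): array
{
    if ($steps === 1) {
        return [$start];
    }
    $out = [];
    for ($i = 0; $i < $steps; $i++) {
        $out[] = $start + ($end - $start) * $i / ($steps - 1);
    }
    return $out;
}

$a = arangeSketch(0.0, 5.0, 2.0);    // [0.0, 2.0, 4.0], the end value is excluded
$l = linspaceSketch(0.0, 1.0, 5);    // [0.0, 0.25, 0.5, 0.75, 1.0], both ends included
```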

fromArray static (array $data, int $dtype = DTYPE_FLOAT32): Tensor

Copies a nested PHP array into a new C tensor. Supports 1-D and 2-D arrays. Do not use on large datasets — prefer Dataset::fromCsv() for files.

$t = Tensor::fromArray([[1.0, 2.0], [3.0, 4.0]]);

emptyLike static (Tensor $t): Tensor

Allocates an uninitialised tensor with the same shape and dtype as $t. Faster than zeros when the buffer will be fully overwritten.

Properties & Shape

Method	Returns	Description
`shape()`	`int[]`	Array of dimension sizes, e.g. `[64, 128]`
`ndim()`	`int`	Number of dimensions
`size()`	`int`	Total element count (product of all dims)
`dtype()`	`int`	`DTYPE_FLOAT32`, `DTYPE_INT32`, or `DTYPE_INT64`
`isContiguous()`	`bool`	True if strides match a row-major layout

Data-type constants

Tensor::DTYPE_FLOAT32   // 0 — default, 32-bit float
Tensor::DTYPE_INT32     // 1 — 32-bit integer
Tensor::DTYPE_INT64     // 2 — 64-bit integer (token ids)

Indexing & Views

View operations are O(1): they change strides and offset, not data. If you then need contiguous data (e.g., for a BLAS call), use contiguous(), which copies only when the tensor is not already contiguous.

Method	Returns	Description
`view()`	`self`	View with same data pointer, increments ref-count
`row(int $r)`	`self`	View of one row of a 2-D tensor → 1-D
`col(int $c)`	`self`	View of one column (non-contiguous stride)
`slice(int $axis, int $start, int $len)`	`self`	Sub-tensor slice along an axis
`sliceStep(int $axis, int $start, int $end, int $step)`	`self`	Strided slice (like Python `a[start:end:step]`)
`copy()`	`self`	Deep copy; allocates new C memory
`contiguous()`	`self`	Returns self if already contiguous, else copies to contiguous buffer
`fill(float $val)`	`self`	Fills all elements in-place, returns self
`copyFrom(Tensor $src)`	`void`	Overwrites this tensor's data with `$src` (shapes must match)
`toFlatArray()`	`float[]`	Copies C memory into a PHP array; avoid in hot paths
`buffer()`	`CData`	Raw `float*` CData pointer for zero-copy FFI interop
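To make "change strides and offset, not data" concrete, here is a minimal sketch of a 2-D strided view over flat row-major storage. The array-based view descriptor and the helper names (at, transposeView, isContiguousView) are assumptions made for illustration; PML keeps the equivalent fields in its C struct.

```php
<?php
// Element (i, j) of a 2-D view lives at offset + i*strides[0] + j*strides[1].
function at(SplFixedArray $data, array $view, int $i, int $j): float
{
    return $data[$view['offset'] + $i * $view['strides'][0] + $j * $view['strides'][1]];
}

// Transposing swaps shape and strides; the storage is untouched.
function transposeView(array $v): array
{
    return [
        'offset'  => $v['offset'],
        'shape'   => [$v['shape'][1], $v['shape'][0]],
        'strides' => [$v['strides'][1], $v['strides'][0]],
    ];
}

// A 2-D view is contiguous when its strides equal the row-major strides of its shape.
function isContiguousView(array $v): bool
{
    return $v['strides'] === [$v['shape'][1], 1];
}

$data = SplFixedArray::fromArray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]); // 2x3 matrix
$base = ['offset' => 0, 'shape' => [2, 3], 'strides' => [3, 1]];
$t    = transposeView($base);                                     // 3x2 view, same storage

$corner = at($data, $t, 2, 1);   // reads the same element as base (1, 2)
```

In this model the transposed view fails the contiguity check, which is the situation where a BLAS-bound operation would need contiguous() first.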

Arithmetic (returns new Tensor)

Method	Operation	Notes
`add(Tensor $b)`	A + B	Broadcasting supported
`sub(Tensor $b)`	A − B
`mul(Tensor $b)`	A ⊙ B	Element-wise (Hadamard)
`div(Tensor $b)`	A ÷ B
`pow(Tensor $b)`	A^B	Element-wise power
`addScalar(float $v)`	A + v
`mulScalar(float $v)`	A × v
`clip(float $min, float $max)`	clamp(A)
`lessScalar(float $v)`	A < v → {0,1}	Boolean mask as float
`greaterScalar(float $v)`	A > v → {0,1}

In-place Operations

All in-place methods modify the tensor and return $this for chaining. No heap allocation.

Method	Operation
`addInplace(Tensor $b)`	A += B
`subInplace(Tensor $b)`	A -= B
`mulInplace(Tensor $b)`	A *= B
`divInplace(Tensor $b)`	A /= B
`addScalarInplace(float $v)`	A += v
`mulScalarInplace(float $v)`	A *= v
`clampInplace(float $lo, float $hi)`	A = clamp(A, lo, hi)
`expInplace()`	A = exp(A)
`logInplace()`	A = log(A)
`sqrtInplace()`	A = sqrt(A)
`sigmoidInplace()`	A = σ(A)
`tanhInplace()`	A = tanh(A)
`reluInplace()`	A = max(0, A)
`rowSoftmaxInplace()`	Softmax over last axis, numerically stable
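"Numerically stable" usually means subtracting each row's maximum before exponentiating, so that exp() never overflows. The sketch below applies that idea in place to a PHP array of rows. It is an illustration of the technique, not PML's C kernel, and rowSoftmaxSketch is a hypothetical name.

```php
<?php
// In-place, numerically stable softmax over each row of a 2-D PHP array.
function rowSoftmaxSketch(array &$rows): void
{
    foreach ($rows as &$row) {
        $max = max($row);            // subtracting the max keeps exp() from overflowing
        $sum = 0.0;
        foreach ($row as &$x) {
            $x = exp($x - $max);
            $sum += $x;
        }
        unset($x);
        foreach ($row as &$x) {
            $x /= $sum;              // normalise so the row sums to 1
        }
        unset($x);
    }
    unset($row);
}

$logits = [[1000.0, 1000.0], [0.0, log(3.0)]];
rowSoftmaxSketch($logits);           // a naive exp(1000.0) would overflow to INF
```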

Unary / Element-wise (returns new Tensor)

Math

sqrt() · square() · abs() · sign() · exp() · log() · log1p() · round() · floor() · ceil()

Trig

sin() · cos() · tan() · asin() · acos() · atan()

Activations

sigmoid() · tanh() · relu() · gelu() · silu() · elu() · selu() · softplus() · mish()

Norm

softmax(int $axis) · logSoftmax(int $axis) · layerNorm(Tensor $g, Tensor $b, float $eps)

Reductions

Method	Returns	Description
`sum()`	`float`	Sum of all elements
`mean()`	`float`	Mean of all elements
`max()`	`float`	Global maximum
`min()`	`float`	Global minimum
`sumAxis(int $axis)`	`Tensor`	Reduces along axis, removes that dim
`meanAxis(int $axis)`	`Tensor`	Mean along axis
`maxAxis(int $axis)`	`Tensor`	Max along axis
`minAxis(int $axis)`	`Tensor`	Min along axis
`argmax()`	`int`	Global argmax (flat index)
`argmaxAxis(int $axis)`	`Tensor`	Argmax along axis (int32 result)
`argmin()`	`int`	Global argmin
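`std(bool $bessel = true)`	`float`	Standard deviation (with `$bessel`, Bessel's correction: divide by N − 1 instead of N)
`variance(bool $bessel = true)`	`float`	Variance (same `$bessel` convention as `std()`)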
`normL2()`	`float`	L2 norm
`sumAxisInto(Tensor $src, int $axis)`	`void`	Zero-allocation sum: reduces `$src` along `$axis` and writes the result into this pre-allocated tensor
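Two of the reductions above are worth spelling out on plain arrays: the effect of the `$bessel` flag and the way an axis reduction drops a dimension. The sketch assumes `$bessel = true` means dividing by N − 1 (the usual meaning of Bessel's correction); the function names are hypothetical and operate on PHP arrays, not Tensors.

```php
<?php
// Variance of a flat list; $bessel = true divides by N - 1 (sample variance).
function varianceSketch(array $xs, bool $bessel = true): float
{
    $n    = count($xs);
    $mean = array_sum($xs) / $n;
    $ss   = 0.0;
    foreach ($xs as $x) {
        $ss += ($x - $mean) ** 2;
    }
    return $ss / ($bessel ? $n - 1 : $n);
}

// Sum a 2-D array along one axis; the reduced axis disappears from the result.
function sumAxisSketch(array $m, int $axis): array
{
    if ($axis === 1) {
        return array_map('array_sum', $m);     // one value per row
    }
    $out = array_fill(0, count($m[0]), 0.0);
    foreach ($m as $row) {
        foreach ($row as $j => $v) {
            $out[$j] += $v;                    // one value per column
        }
    }
    return $out;
}

$sample = varianceSketch([1.0, 2.0, 3.0, 4.0]);          // 5/3
$popul  = varianceSketch([1.0, 2.0, 3.0, 4.0], false);   // 1.25
$cols   = sumAxisSketch([[1.0, 2.0], [3.0, 4.0]], 0);    // [4.0, 6.0]
$rows   = sumAxisSketch([[1.0, 2.0], [3.0, 4.0]], 1);    // [3.0, 7.0]
```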

Linear Algebra

matmul (Tensor $b, bool $transposeA = false, bool $transposeB = false): Tensor

General matrix multiply via OpenBLAS SGEMM. Supports batched matmul for 3-D tensors (batch × M × K) × (batch × K × N).

$C = $A->matmul($B);              // [M,K] × [K,N] → [M,N]
$C = $A->matmul($B, true);       // A^T × B

matmulInto (Tensor $a, Tensor $b, bool $transposeA = false, bool $transposeB = false): void

Zero-allocation GEMM: writes result into $this (pre-allocated buffer). Used by Dense::backward() to reuse gradient buffers across steps.

linear (Tensor $W, ?Tensor $bias = null): Tensor

Fused X @ W^T + bias in one BLAS call. Used by Dense::forward(). Shape: X[batch, in] × W[out, in]^T → [batch, out].
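The shape convention is the part most often gotten wrong: W is stored as [out, in], so each output is a dot product between an input row and a weight row. A plain, unfused reference on PHP arrays might look like this (linearSketch is a hypothetical helper, not the library's kernel):

```php
<?php
// Reference (unfused) Y = X @ W^T + bias for X[batch, in], W[out, in], bias[out].
function linearSketch(array $X, array $W, ?array $bias = null): array
{
    $Y = [];
    foreach ($X as $b => $x) {
        foreach ($W as $o => $w) {
            $acc = $bias[$o] ?? 0.0;           // bias is optional
            foreach ($w as $i => $wi) {
                $acc += $x[$i] * $wi;          // row of X dotted with row of W
            }
            $Y[$b][$o] = $acc;
        }
    }
    return $Y;
}

$X = [[1.0, 2.0, 3.0]];                     // batch = 1, in = 3
$W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]];    // out = 2, in = 3
$Y = linearSketch($X, $W, [0.5, -1.0]);     // [[1.5, 4.0]], shape [batch, out]
```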

pairwiseSqL2 static (Tensor $A, Tensor $B): Tensor [N, M]

Squared Euclidean distance matrix between every pair of rows. Used by KNN, DBSCAN, KMeans. O(NM·D) via BLAS.
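One common way to phrase this computation so that a matrix multiply does the heavy lifting is the expansion ‖a − b‖² = ‖a‖² + ‖b‖² − 2·a·b, where the a·b terms together form A @ B^T. Whether PML uses exactly this expansion is an assumption; the sketch below only shows the identity on PHP arrays, with a hypothetical function name.

```php
<?php
// Squared distances via ||a||^2 + ||b||^2 - 2 a.b. The a.b terms form A @ B^T,
// which is the part a BLAS-backed implementation could hand to GEMM.
function pairwiseSqL2Sketch(array $A, array $B): array
{
    $sq    = fn(array $v): float => array_sum(array_map(fn($x) => $x * $x, $v));
    $normA = array_map($sq, $A);
    $normB = array_map($sq, $B);
    $D = [];
    foreach ($A as $i => $a) {
        foreach ($B as $j => $b) {
            $dot = 0.0;
            foreach ($a as $k => $ak) {
                $dot += $ak * $b[$k];
            }
            // Clamp at zero: rounding can push near-identical rows slightly negative.
            $D[$i][$j] = max(0.0, $normA[$i] + $normB[$j] - 2.0 * $dot);
        }
    }
    return $D;
}

$D = pairwiseSqL2Sketch([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [6.0, 8.0]]);
// $D is [[0.0, 100.0], [25.0, 25.0]], shape [N, M] = [2, 2]
```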

ridgeSolve static (Tensor $X, Tensor $y, float $lambda = 1.0): Tensor [D]

Closed-form Ridge regression: W = (X^T X + λI)⁻¹ X^T y via LAPACKE SGESV.
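The closed form amounts to solving the linear system (X^T X + λI) w = X^T y. As a rough illustration, the sketch below builds that system from PHP arrays and solves it with Gaussian elimination and partial pivoting, standing in for the LAPACK call. ridgeSolveSketch is a hypothetical name and the code makes no attempt at LAPACK-level robustness.

```php
<?php
// Solve (X^T X + lambda I) w = X^T y with Gaussian elimination (partial pivoting).
function ridgeSolveSketch(array $X, array $y, float $lambda = 1.0): array
{
    $d = count($X[0]);
    // Build the augmented system [X^T X + lambda I | X^T y].
    $M = [];
    for ($i = 0; $i < $d; $i++) {
        for ($j = 0; $j < $d; $j++) {
            $M[$i][$j] = ($i === $j) ? $lambda : 0.0;
        }
        $M[$i][$d] = 0.0;
    }
    foreach ($X as $n => $row) {
        for ($i = 0; $i < $d; $i++) {
            for ($j = 0; $j < $d; $j++) {
                $M[$i][$j] += $row[$i] * $row[$j];
            }
            $M[$i][$d] += $row[$i] * $y[$n];
        }
    }
    // Forward elimination: pick the largest pivot in each column, then eliminate below it.
    for ($c = 0; $c < $d; $c++) {
        $p = $c;
        for ($r = $c + 1; $r < $d; $r++) {
            if (abs($M[$r][$c]) > abs($M[$p][$c])) {
                $p = $r;
            }
        }
        [$M[$c], $M[$p]] = [$M[$p], $M[$c]];
        for ($r = $c + 1; $r < $d; $r++) {
            $f = $M[$r][$c] / $M[$c][$c];
            for ($k = $c; $k <= $d; $k++) {
                $M[$r][$k] -= $f * $M[$c][$k];
            }
        }
    }
    // Back substitution.
    $w = array_fill(0, $d, 0.0);
    for ($i = $d - 1; $i >= 0; $i--) {
        $s = $M[$i][$d];
        for ($j = $i + 1; $j < $d; $j++) {
            $s -= $M[$i][$j] * $w[$j];
        }
        $w[$i] = $s / $M[$i][$i];
    }
    return $w;
}

$X  = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];
$y  = [1.0, 2.0, 3.0];
$w0 = ridgeSolveSketch($X, $y, 0.0);   // lambda = 0 reduces to least squares: about [1, 2]
$w1 = ridgeSolveSketch($X, $y, 1.0);   // lambda = 1 shrinks the weights: about [0.875, 1.375]
```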

Method	Description
`svd(): array`	Returns `[U, S, Vt]` via LAPACKE SGESVD. Used by PCA, TruncatedSVD, LDA.
`dot(Tensor $b): float`	1-D inner product
`outer(Tensor $b): Tensor`	Outer product → 2-D matrix
`norm(int $ord = 2): float`	Vector or matrix norm
`addRelu(Tensor $b): Tensor`	Fused (A + B) then ReLU — one kernel, no intermediate tensor
`mulAdd(Tensor $B, Tensor $C): Tensor`	Fused A⊙B + C (FMA)

Shape Ops

Method	Description
`reshape(int ...$shape)`	Returns view with new shape (must have same element count)
`flatten()`	Reshape to 1-D
`expandDims(int $axis)`	Insert size-1 dimension at `$axis`
`squeeze()`	Remove all size-1 dimensions
`transpose()`	2-D matrix transpose (view, no copy)
`transposeNd(array $axes)`	N-D permutation, e.g. `[1,0,2]`
`swapaxes(int $a, int $b)`	Swap two axes
`concat(Tensor $b, int $axis = 0)`	Concatenate two tensors along an axis
`stack(array $tensors, int $axis = 0)`	Stack list of tensors into new dimension
`split(int $chunks, int $axis = 0)`	Split into N equal chunks along axis
`pad(array $padding, float $val = 0.0)`	Zero/constant pad on any axis

Fused Kernels

These are hand-written C kernels that combine multiple operations to avoid intermediate tensors:

Method	Description
`fusedBceLossAndGrad(Tensor $preds, Tensor $targets, ?Tensor $grads)`	Binary cross-entropy loss + gradient in one pass
`fusedAdamStep(Tensor $p, Tensor $g, Tensor $m, Tensor $v, float $lr, ...)`	Adam optimizer step — no intermediate allocation
`fusedAdamWStep(...)`	AdamW (decoupled weight decay)
`fusedSgdStep(Tensor $p, Tensor $g, float $lr)`	SGD weight update
`fusedRmsPropStep(...)`	RMSProp update
`fusedAdaGradStep(...)`	AdaGrad update
`configureThreading(int $omp, int $blas = 1)`	Set OpenMP and OpenBLAS thread counts at runtime
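To show what "one pass, no intermediate allocation" means for an optimizer, here is a sketch of a textbook Adam update that touches each parameter once and updates all three buffers in the same loop. The hyperparameters beta1, beta2, eps and the step counter t are the usual Adam quantities; the table above elides the remaining arguments of fusedAdamStep, so their names, order, and defaults here are assumptions, as is the function name.

```php
<?php
// One Adam update over flat parameter arrays, done in a single loop.
// beta1, beta2, eps and t are standard Adam quantities; how PML names or
// orders them is not documented on this page, so treat them as assumptions.
function adamStepSketch(
    array &$p, array $g, array &$m, array &$v,
    float $lr, int $t, float $beta1 = 0.9, float $beta2 = 0.999, float $eps = 1e-8
): void {
    $c1 = 1.0 - $beta1 ** $t;   // bias correction for the first moment
    $c2 = 1.0 - $beta2 ** $t;   // bias correction for the second moment
    foreach ($p as $i => $_) {
        $m[$i] = $beta1 * $m[$i] + (1.0 - $beta1) * $g[$i];
        $v[$i] = $beta2 * $v[$i] + (1.0 - $beta2) * $g[$i] * $g[$i];
        $p[$i] -= $lr * ($m[$i] / $c1) / (sqrt($v[$i] / $c2) + $eps);
    }
}

$p = [1.0, -1.0];
$m = [0.0, 0.0];
$v = [0.0, 0.0];
adamStepSketch($p, [0.5, -2.0], $m, $v, lr: 0.1, t: 1);
// On the first step the update is close to lr * sign(g), so $p is about [0.9, -0.9].
```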

Serialization

Method	Description
`save(string $filepath)`	Write tensor to a raw binary file (`.tensor`)
`load(string $filepath)`	Read tensor from raw binary file
`saveSafetensors(string $path, string $header, Tensor[] $tensors)`	Write HuggingFace SafeTensors format (multiple tensors)
`datasetFromCsv(string $path, int $labelCol, bool $header)`	mmap CSV → [Tensor $X, Tensor $y]. Never loads full file into PHP heap.

Arena — Memory Pool

Arena is a bump-pointer memory pool backed by a single malloc call. Allocating from an Arena is O(1) with no per-object overhead. All tensors allocated from an Arena are freed at once when reset() or __destruct() is called — ideal for per-batch scratch memory.

__construct (int $capacityBytes = 32 * 1024 * 1024)

Pre-allocates the arena buffer. Default is 32 MB. Choose a size that fits your largest batch.

tensor (array $shape, int $dtype = DTYPE_FLOAT32): Tensor

Bump-allocates a tensor from the pool. O(1), no malloc. The returned Tensor is only valid until reset().

reset (): void

Resets the bump pointer to the start. All previously allocated tensors are invalid after this call.

use Pml\Lib\Arena;
use Pml\Tensor;

$arena = new Arena(64 * 1024 * 1024); // 64 MB pool

foreach ($batches as $batch) {
    // Scratch tensors allocated from the pool — no malloc
    $tmp1 = Tensor::randomNormal([128, 256], arena: $arena->ptr());
    $tmp2 = $arena->tensor([128, 256]);

    // ... compute ...

    $arena->reset(); // reclaim all scratch memory instantly; $tmp1 and $tmp2 are invalid after this
}
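The bump-pointer idea itself fits in a few lines. The sketch below models an arena that hands out byte offsets instead of real pointers. BumpArenaSketch is hypothetical, and the 64-byte alignment is an assumption borrowed from the Tensor buffer alignment described earlier on this page.

```php
<?php
// Hypothetical bump allocator over a fixed capacity. It hands out byte offsets
// instead of real pointers and is not PML's implementation.
final class BumpArenaSketch
{
    private int $top = 0;

    public function __construct(private int $capacity, private int $align = 64) {}

    // O(1): round the cursor up to the alignment, then advance it.
    public function alloc(int $bytes): int
    {
        $start = intdiv($this->top + $this->align - 1, $this->align) * $this->align;
        if ($start + $bytes > $this->capacity) {
            throw new OverflowException('arena exhausted');
        }
        $this->top = $start + $bytes;
        return $start;
    }

    // Everything handed out so far becomes invalid at once.
    public function reset(): void
    {
        $this->top = 0;
    }

    public function used(): int
    {
        return $this->top;
    }
}

$arena = new BumpArenaSketch(1024);
$a = $arena->alloc(100);     // offset 0
$b = $arena->alloc(4 * 16);  // offset 128, the next 64-byte boundary after byte 100
$arena->reset();
$c = $arena->alloc(8);       // offset 0 again: the space is reused
```

Because reset() only moves the cursor, any offset obtained before it now aliases whatever is allocated next, which is why tensors from an Arena must not be used after reset().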

QuantizedTensor (INT8 Block Quantization)

QuantizedTensor wraps a QuantizedWeight* C struct: an INT8 weight matrix with per-group-of-32 float32 scale factors (Q8_0-class format). Compression approaches 4× vs fp32; the exact ratio depends on the group size, because each group also stores its scale. The forward kernel uses AVX2 fused int8→fp32 dot products, so no temporary weight buffer is allocated.

fromTensor static (Tensor $w, int $groupSize = 32): self

Quantizes a fp32 weight matrix to INT8. Each group of $groupSize elements shares a scale factor. Smaller groups → less quantization error, more scale overhead.

$qw = QuantizedTensor::fromTensor($weights, groupSize: 32);
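For intuition about the trade-off between group size and error, here is a sketch of symmetric per-group INT8 quantization on a flat PHP array. It assumes the common scheme scale = max|x| / 127 per group. The exact rounding and scale format used by PML are not specified on this page, so the function names and details are illustrative only.

```php
<?php
// Symmetric per-group INT8 quantization: each group stores round(x / scale)
// with scale = max|x| / 127. This mirrors Q8_0-style schemes in spirit only.
function quantizeGroupsSketch(array $w, int $groupSize = 32): array
{
    $q = [];
    $scales = [];
    foreach (array_chunk($w, $groupSize) as $group) {
        $amax  = max(array_map('abs', $group));
        $scale = $amax > 0.0 ? $amax / 127.0 : 1.0;   // avoid dividing by zero
        $scales[] = $scale;
        foreach ($group as $x) {
            $q[] = (int) round($x / $scale);          // always within [-127, 127]
        }
    }
    return [$q, $scales];
}

// Dequantize: multiply each int8 value by the scale of the group it belongs to.
function dequantizeGroupsSketch(array $q, array $scales, int $groupSize = 32): array
{
    $out = [];
    foreach ($q as $i => $qi) {
        $out[] = $qi * $scales[intdiv($i, $groupSize)];
    }
    return $out;
}

$w = [0.5, -1.0, 0.25, 0.0, 10.0, -20.0, 5.0, 2.5];
[$q, $s] = quantizeGroupsSketch($w, 4);   // two groups, two scales
$back    = dequantizeGroupsSketch($q, $s, 4);
```

In this scheme the round-trip error of each element is at most half of its group's scale, so a group containing one large outlier loses precision on all of its small values. Smaller groups limit how far that effect spreads.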

linear (Tensor $X, ?Tensor $bias = null): Tensor

Quantized linear forward: Y = X @ W^T + bias. Hot path for LLM decode — AVX2 fused, OpenMP-parallel over rows. No temporary fp32 weight buffer.

toTensor (): Tensor

Dequantizes back to fp32. Use only for checkpoint export — not in decode hot path.

Method	Returns	Description
`rows()`	`int`	Output dimension
`cols()`	`int`	Input dimension
`groupSize()`	`int`	Elements per quantization group
`numGroups()`	`int`	Total number of groups
`memoryBytes()`	`int`	Bytes used by data + scales (excludes struct header)
`compressionRatio()`	`float`	fp32 bytes ÷ INT8 bytes (typically ~3.9×)
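The ratio can be estimated by hand from the layout described above: one byte per weight plus one scale per group, compared with four bytes per weight in fp32. The sketch below does that arithmetic under the assumption of a fixed scale size per group and no struct header. compressionRatioSketch and its `$scaleBytes` parameter are hypothetical, and the value PML reports depends on its internal layout, which this page does not spell out.

```php
<?php
// Estimated fp32-to-INT8 ratio for a rows x cols matrix, assuming one scale of
// $scaleBytes per group and ignoring any struct header.
function compressionRatioSketch(int $rows, int $cols, int $groupSize, int $scaleBytes = 4): float
{
    $elements  = $rows * $cols;
    $fp32Bytes = 4 * $elements;
    $int8Bytes = $elements + intdiv($elements, $groupSize) * $scaleBytes;
    return $fp32Bytes / $int8Bytes;
}

$r32  = compressionRatioSketch(4096, 4096, 32);    // 128 / 36, about 3.56
$r256 = compressionRatioSketch(4096, 4096, 256);   // 1024 / 260, about 3.94
```

In this model the ratio is 4g / (g + s) for group size g and scale size s in bytes, so it climbs toward 4× as groups get larger or scales get smaller.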

use Pml\QuantizedTensor;
use Pml\Tensor;

$W   = Tensor::randomNormal([4096, 4096]);   // 64 MB fp32
$qW  = QuantizedTensor::fromTensor($W, 32);   // ~16 MB INT8
$W   = null;                                  // free fp32

printf("Compression: %.2f×\n", $qW->compressionRatio());
// → Compression: 3.94×

$X   = Tensor::randomNormal([1, 4096]);
$out = $qW->linear($X);                       // AVX2 fused kernel

Backward not supported: Once a layer is quantized it is permanently in inference mode. Calling backward() on a quantized Dense layer throws LogicException.
