Tensor

The fundamental data structure of PML. A Tensor is a PHP object that holds a pointer to a C-allocated, 64-byte aligned, contiguous float32 (or int32/int64) buffer. All math is executed in C, and PHP does not copy the raw bytes unless you explicitly convert with fromArray() or toFlatArray().

Memory Model

Each Tensor wraps a TensorC* pointer. The C struct holds a float* (64-byte aligned via posix_memalign), shape info, stride info, total element count, and a reference-count field used for views. PHP's __destruct() decrements the ref-count; memory is freed when it reaches zero.

Zero-Copy View Rule: view(), row(), col(), slice(), and transpose() return views; they do not allocate new C memory. Operations that require contiguous memory (BLAS calls, saves) will call contiguous() internally or throw.
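To make the view rule concrete, the following pure-PHP sketch models a tensor as a flat buffer plus per-dimension strides and an offset. It is only an illustration: elementAt() is a hypothetical helper, not part of PML, and strides are counted in elements rather than bytes.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not a PML function): reads one element of a strided
// view by turning a multi-dimensional index into a flat buffer position.
function elementAt(array $buffer, array $strides, int $offset, int ...$index): float
{
    $flat = $offset;
    foreach ($index as $dim => $i) {
        $flat += $i * $strides[$dim];
    }
    return $buffer[$flat];
}

// One shared buffer holding a 2x3 row-major matrix: [[1, 2, 3], [4, 5, 6]].
$buffer = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];

// Original layout: shape [2, 3], strides [3, 1].
$original = elementAt($buffer, [3, 1], 0, 1, 2);   // row 1, col 2

// Transposed view: shape [3, 2], strides swapped to [1, 3]; no data moves.
$transposed = elementAt($buffer, [1, 3], 0, 2, 1); // row 2, col 1

// Column 1 as a 1-D view: offset 1, stride 3 (non-contiguous).
$column = elementAt($buffer, [3], 1, 1);           // second entry of column 1

printf("%.1f %.1f %.1f\n", $original, $transposed, $column);
```

This prints 6.0 6.0 5.0: all three reads hit the same six-element buffer, which is why such views can be O(1) and why a stride of 3 on the column view makes it non-contiguous.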

Creation

zeros static (int ...$shape): Tensor

Allocates a zero-filled tensor with the given shape.

$t = Tensor::zeros(3, 4);   // [3, 4] of 0.0

ones static (int ...$shape): Tensor

Allocates a one-filled tensor.

$t = Tensor::ones(8);

randomNormal static (array $shape, float $mean = 0.0, float $stddev = 1.0, ?CData $arena = null): Tensor

Samples from N(mean, stddev²). Used by Dense, Embedding, and most initializers. Can allocate from an Arena.

$W = Tensor::randomNormal([512, 256], 0.0, sqrt(2.0/256));

randomUniform static (array $shape, float $min = 0.0, float $max = 1.0, ?CData $arena = null): Tensor

Uniform random in [min, max).

range static (float $start, float $end, float $step = 1.0): Tensor

1-D arange, equivalent to NumPy np.arange(start, end, step).

linspace static (float $start, float $end, int $steps): Tensor

Evenly spaced values including both endpoints.
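The difference between range (half-open, like np.arange) and linspace (both endpoints included) can be sketched in plain PHP. The helpers rangeValues() and linspaceValues() are hypothetical reference functions written for this illustration; they are not PML methods and make no claim about how the C kernels are written.

```php
<?php
declare(strict_types=1);

// Reference values for a half-open range [start, end), as in np.arange.
function rangeValues(float $start, float $end, float $step = 1.0): array
{
    if ($step == 0.0) {
        throw new InvalidArgumentException('step must be non-zero');
    }
    $count = (int) max(0, ceil(($end - $start) / $step));
    $out = [];
    for ($i = 0; $i < $count; $i++) {
        $out[] = $start + $i * $step;
    }
    return $out;
}

// Reference values for $steps evenly spaced points including both endpoints.
function linspaceValues(float $start, float $end, int $steps): array
{
    if ($steps < 1) {
        throw new InvalidArgumentException('steps must be at least 1');
    }
    if ($steps === 1) {
        return [$start];
    }
    $delta = ($end - $start) / ($steps - 1);
    $out = [];
    for ($i = 0; $i < $steps; $i++) {
        $out[] = $start + $i * $delta;
    }
    return $out;
}

$r = rangeValues(0.0, 5.0, 2.0);    // end is excluded
$l = linspaceValues(0.0, 1.0, 5);   // end is included
```

Here $r holds 0, 2, 4 (5 is never reached), while $l holds 0, 0.25, 0.5, 0.75, 1.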

fromArray static (array $data, int $dtype = DTYPE_FLOAT32): Tensor

Copies a nested PHP array into a new C tensor. Supports 1-D and 2-D arrays. Do not use on large datasets — prefer Dataset::fromCsv() for files.

$t = Tensor::fromArray([[1.0, 2.0], [3.0, 4.0]]);

emptyLike static (Tensor $t): Tensor

Allocates an uninitialised tensor with the same shape and dtype as $t. Faster than zeros when the buffer will be fully overwritten.

Properties & Shape

| Method | Returns | Description |
|---|---|---|
| shape() | int[] | Array of dimension sizes, e.g. [64, 128] |
| ndim() | int | Number of dimensions |
| size() | int | Total element count (product of all dims) |
| dtype() | int | DTYPE_FLOAT32, DTYPE_INT32, or DTYPE_INT64 |
| isContiguous() | bool | True if strides match a row-major layout |
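What "strides match a row-major layout" means can be shown with a small pure-PHP check. This is a sketch under the assumption that strides are measured in elements; rowMajorStrides() and looksContiguous() are hypothetical names used only here.

```php
<?php
declare(strict_types=1);

// Strides (in elements) that a densely packed row-major tensor would have.
function rowMajorStrides(array $shape): array
{
    $strides = [];
    $running = 1;
    for ($d = count($shape) - 1; $d >= 0; $d--) {
        $strides[$d] = $running;   // the last axis is the fastest-moving one
        $running *= $shape[$d];
    }
    ksort($strides);               // restore ascending dimension order
    return $strides;
}

// A layout is treated as contiguous when its strides equal the packed ones.
function looksContiguous(array $shape, array $strides): bool
{
    return $strides === rowMajorStrides($shape);
}

$packed     = rowMajorStrides([64, 128]);           // [128, 1]
$transposed = looksContiguous([128, 64], [1, 128]); // strides of a transposed view
```

With these inputs $packed is [128, 1] and $transposed is false, which matches the expectation that a transposed view needs contiguous() before a BLAS call.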

Data-type constants

Tensor::DTYPE_FLOAT32   // 0 — default, 32-bit float
Tensor::DTYPE_INT32     // 1 — 32-bit integer
Tensor::DTYPE_INT64     // 2 — 64-bit integer (token ids)

Indexing & Views

All view operations are O(1): they change strides and offset, not data. If you then need contiguous data (e.g., for a BLAS call) use contiguous().

| Method | Returns | Description |
|---|---|---|
| view() | self | View with same data pointer; increments ref-count |
| row(int $r) | self | View of one row of a 2-D tensor (1-D result) |
| col(int $c) | self | View of one column (non-contiguous stride) |
| slice(int $axis, int $start, int $len) | self | Sub-tensor slice along an axis |
| sliceStep(int $axis, int $start, int $end, int $step) | self | Strided slice (like Python a[::2]) |
| copy() | self | Deep copy; allocates new C memory |
| contiguous() | self | Returns self if already contiguous, else copies to a contiguous buffer |
| fill(float $val) | self | Fills all elements in place; returns self |
| copyFrom(Tensor $src) | void | Overwrites this tensor's data with $src (shapes must match) |
| toFlatArray() | float[] | Copies C memory into a PHP array; avoid in hot paths |
| buffer() | CData | Raw float* CData pointer for zero-copy FFI interop |

Arithmetic (returns new Tensor)

| Method | Operation | Notes |
|---|---|---|
| add(Tensor $b) | A + B | Broadcasting supported |
| sub(Tensor $b) | A − B | |
| mul(Tensor $b) | A ⊙ B | Element-wise (Hadamard) |
| div(Tensor $b) | A ÷ B | |
| pow(Tensor $b) | A^B | Element-wise power |
| addScalar(float $v) | A + v | |
| mulScalar(float $v) | A × v | |
| clip(float $min, float $max) | clamp(A, min, max) | |
| lessScalar(float $v) | A < v → {0,1} | Boolean mask as float |
| greaterScalar(float $v) | A > v → {0,1} | |
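The table only states that broadcasting is supported for add(). As a rough guide, and assuming PML follows the NumPy-style rule (shapes are aligned from the trailing axis, and a size-1 axis stretches to match), the resulting shape could be worked out as in the sketch below. The function broadcastShape() is hypothetical, and the rule itself is an assumption that should be checked against the library.

```php
<?php
declare(strict_types=1);

// Assumed NumPy-style broadcasting: align shapes on the right, then each
// pair of sizes must be equal or one of them must be 1.
function broadcastShape(array $a, array $b): array
{
    $n = max(count($a), count($b));
    $a = array_pad($a, -$n, 1);   // negative length pads on the left with 1s
    $b = array_pad($b, -$n, 1);

    $out = [];
    for ($d = 0; $d < $n; $d++) {
        if ($a[$d] === $b[$d] || $b[$d] === 1) {
            $out[] = $a[$d];
        } elseif ($a[$d] === 1) {
            $out[] = $b[$d];
        } else {
            throw new InvalidArgumentException(
                "Axis $d is incompatible: {$a[$d]} vs {$b[$d]}"
            );
        }
    }
    return $out;
}

$biasAdd = broadcastShape([64, 128], [128]);   // per-feature bias on a batch
$mixed   = broadcastShape([8, 1, 4], [3, 1]);
```

Under that assumption $biasAdd is [64, 128] and $mixed is [8, 3, 4], while shapes such as [64, 128] and [64] would be rejected.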

In-place Operations

All in-place methods modify the tensor and return $this for chaining. No heap allocation.

| Method | Operation |
|---|---|
| addInplace(Tensor $b) | A += B |
| subInplace(Tensor $b) | A -= B |
| mulInplace(Tensor $b) | A *= B |
| divInplace(Tensor $b) | A /= B |
| addScalarInplace(float $v) | A += v |
| mulScalarInplace(float $v) | A *= v |
| clampInplace(float $lo, float $hi) | A = clamp(A, lo, hi) |
| expInplace() | A = exp(A) |
| logInplace() | A = log(A) |
| sqrtInplace() | A = sqrt(A) |
| sigmoidInplace() | A = σ(A) |
| tanhInplace() | A = tanh(A) |
| reluInplace() | A = max(0, A) |
| rowSoftmaxInplace() | Softmax over last axis, numerically stable |
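"Numerically stable" for softmax usually refers to subtracting the row maximum before exponentiating, so that exp() never overflows. The pure-PHP sketch below shows that standard trick on a single row; stableSoftmax() is a hypothetical reference function, and whether the C kernel uses exactly this formulation is an assumption.

```php
<?php
declare(strict_types=1);

// Softmax of one row using the max-subtraction trick: exp(x - max) is at
// most 1, so large logits cannot overflow to INF.
function stableSoftmax(array $row): array
{
    $max  = max($row);
    $exps = array_map(fn(float $x): float => exp($x - $max), $row);
    $sum  = array_sum($exps);
    return array_map(fn(float $e): float => $e / $sum, $exps);
}

// Logits this large make a naive exp($x) overflow to INF in float64.
$probs = stableSoftmax([1000.0, 1001.0, 1002.0]);
$total = array_sum($probs);
```

The probabilities sum to 1 (up to rounding) and keep the ordering of the logits, whereas a naive implementation would divide INF by INF and produce NAN.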

Unary / Element-wise (returns new Tensor)

Math

sqrt() · square() · abs() · sign() · exp() · log() · log1p() · round() · floor() · ceil()

Trig

sin() · cos() · tan() · asin() · acos() · atan()

Activations

sigmoid() · tanh() · relu() · gelu() · silu() · elu() · selu() · softplus() · mish()

Norm

softmax(int $axis) · logSoftmax(int $axis) · layerNorm(Tensor $g, Tensor $b, float $eps)

Reductions

| Method | Returns | Description |
|---|---|---|
| sum() | float | Sum of all elements |
| mean() | float | Mean of all elements |
| max() | float | Global maximum |
| min() | float | Global minimum |
| sumAxis(int $axis) | Tensor | Reduces along axis, removes that dim |
| meanAxis(int $axis) | Tensor | Mean along axis |
| maxAxis(int $axis) | Tensor | Max along axis |
| minAxis(int $axis) | Tensor | Min along axis |
| argmax() | int | Global argmax (flat index) |
| argmaxAxis(int $axis) | Tensor | Argmax along axis (int32 result) |
| argmin() | int | Global argmin |
| std(bool $bessel = true) | float | Standard deviation |
| variance(bool $bessel = true) | float | Variance |
| normL2() | float | L2 norm |
| sumAxisInto(Tensor $src, int $axis) | void | Zero-allocation sum: writes result into pre-allocated tensor |
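The $bessel flag on std() and variance() presumably toggles Bessel's correction, i.e. dividing by n − 1 (sample estimate) instead of n (population value). That reading is an assumption based on the parameter name. A minimal pure-PHP reference, with the hypothetical name varianceOf(), looks like this:

```php
<?php
declare(strict_types=1);

// Variance of a list of floats; with $bessel = true the sum of squared
// deviations is divided by n - 1 instead of n.
function varianceOf(array $x, bool $bessel = true): float
{
    $n = count($x);
    if ($n < ($bessel ? 2 : 1)) {
        throw new InvalidArgumentException('not enough elements');
    }
    $mean = array_sum($x) / $n;
    $ss = 0.0;
    foreach ($x as $v) {
        $ss += ($v - $mean) ** 2;
    }
    return $ss / ($bessel ? $n - 1 : $n);
}

$data       = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
$population = varianceOf($data, false);  // 32 / 8
$sample     = varianceOf($data, true);   // 32 / 7
```

For this data the mean is 5 and the squared deviations sum to 32, so the population variance is 4 and the Bessel-corrected value is 32/7 ≈ 4.571.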

Linear Algebra

matmul (Tensor $b, bool $transposeA = false, bool $transposeB = false): Tensor

General matrix multiply via OpenBLAS SGEMM. Supports batched matmul for 3-D tensors (batch × M × K) × (batch × K × N).

$C = $A->matmul($B);              // [M,K] × [K,N] → [M,N]
$C = $A->matmul($B, true);       // A^T × B

matmulInto (Tensor $a, Tensor $b, bool $transposeA = false, bool $transposeB = false): void

Zero-allocation GEMM: writes result into $this (pre-allocated buffer). Used by Dense::backward() to reuse gradient buffers across steps.

linear (Tensor $W, ?Tensor $bias = null): Tensor

Fused X @ W^T + bias in one BLAS call. Used by Dense::forward(). Shape: X[batch, in] × W[out, in]^T → [batch, out].
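The shape convention X[batch, in] × W[out, in]^T → [batch, out] is easy to get backwards, so here is a naive pure-PHP reference of the same arithmetic on nested arrays. linearReference() is a hypothetical function for checking results on tiny inputs; it says nothing about how the fused BLAS path is implemented.

```php
<?php
declare(strict_types=1);

// Naive Y = X @ W^T + bias. Each row of $W holds the weights of one output
// unit, so Y[n][o] is the dot product of X[n] with W[o], plus bias[o].
function linearReference(array $X, array $W, ?array $bias = null): array
{
    $Y = [];
    foreach ($X as $n => $x) {
        foreach ($W as $o => $w) {
            $acc = $bias[$o] ?? 0.0;
            foreach ($x as $k => $xk) {
                $acc += $xk * $w[$k];
            }
            $Y[$n][$o] = $acc;
        }
    }
    return $Y;
}

$X = [[1.0, 2.0, 3.0]];                    // batch = 1, in = 3
$W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]];   // out = 2, in = 3
$b = [0.5, -1.0];

$Y = linearReference($X, $W, $b);          // shape [1, 2]
```

The result is [[1.5, 4.0]]: the first output picks out x₀ plus 0.5, the second sums x₁ and x₂ and subtracts 1.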

pairwiseSqL2 static (Tensor $A, Tensor $B): Tensor [N, M]

Squared Euclidean distance matrix between every pair of rows. Used by KNN, DBSCAN, KMeans. O(NM·D) via BLAS.
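A distance matrix can be computed "via BLAS" because of the expansion ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b, where the a·b terms for all pairs form one matrix product A Bᵀ. The sketch below applies that identity in plain PHP; pairwiseSqL2Reference() is a hypothetical helper, and it is an assumption that PML's kernel uses this exact decomposition.

```php
<?php
declare(strict_types=1);

// Squared Euclidean distances between rows of $A (N x D) and $B (M x D),
// using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b.
function pairwiseSqL2Reference(array $A, array $B): array
{
    $dot = fn(array $u, array $v): float =>
        array_sum(array_map(fn(float $x, float $y): float => $x * $y, $u, $v));

    $normA = array_map(fn(array $a): float => $dot($a, $a), $A);
    $normB = array_map(fn(array $b): float => $dot($b, $b), $B);

    $D = [];
    foreach ($A as $i => $a) {
        foreach ($B as $j => $b) {
            // Clamp at zero: rounding can make near-identical rows slightly negative.
            $D[$i][$j] = max(0.0, $normA[$i] + $normB[$j] - 2.0 * $dot($a, $b));
        }
    }
    return $D;
}

$A = [[0.0, 0.0], [3.0, 4.0]];
$B = [[0.0, 0.0], [3.0, 0.0]];
$D = pairwiseSqL2Reference($A, $B);   // shape [2, 2]
```

For these points $D is [[0, 9], [25, 16]], matching the distances computed by hand.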

ridgeSolve static (Tensor $X, Tensor $y, float $lambda = 1.0): Tensor [D]

Closed-form Ridge regression: W = (X^T X + λI)⁻¹ X^T y via LAPACKE SGESV.
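With a single feature (D = 1) the closed form collapses to w = Σxᵢyᵢ / (Σxᵢ² + λ), which makes the shrinking effect of λ easy to see. The following pure-PHP sketch evaluates that special case; ridgeSlope() is a hypothetical helper for illustration and does not call LAPACK.

```php
<?php
declare(strict_types=1);

// One-feature ridge regression: (X^T X + lambda I)^-1 X^T y reduces to
// sum(x*y) / (sum(x*x) + lambda) when X has a single column.
function ridgeSlope(array $x, array $y, float $lambda = 1.0): float
{
    $xy = 0.0;
    $xx = 0.0;
    foreach ($x as $i => $xi) {
        $xy += $xi * $y[$i];
        $xx += $xi * $xi;
    }
    if ($xx + $lambda == 0.0) {
        throw new InvalidArgumentException('system is singular');
    }
    return $xy / ($xx + $lambda);
}

$x = [1.0, 2.0, 3.0];
$y = [2.0, 4.0, 6.0];                      // exactly y = 2x

$unregularised = ridgeSlope($x, $y, 0.0);  // 28 / 14
$shrunk        = ridgeSlope($x, $y, 14.0); // 28 / 28
```

With λ = 0 the slope is the least-squares value 2; with λ = 14 it is pulled down to 1, since the penalty doubles the denominator.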

| Method | Description |
|---|---|
| svd(): array | Returns [U, S, Vt] via LAPACKE SGESVD. Used by PCA, TruncatedSVD, LDA. |
| dot(Tensor $b): float | 1-D inner product |
| outer(Tensor $b): Tensor | Outer product → 2-D matrix |
| norm(int $ord = 2): float | Vector or matrix norm |
| addRelu(Tensor $b): Tensor | Fused (A + B) then ReLU: one kernel, no intermediate tensor |
| mulAdd(Tensor $B, Tensor $C): Tensor | Fused A⊙B + C (FMA) |

Shape Ops

| Method | Description |
|---|---|
| reshape(int ...$shape) | Returns view with new shape (must have same element count) |
| flatten() | Reshape to 1-D |
| expandDims(int $axis) | Insert size-1 dimension at $axis |
| squeeze() | Remove all size-1 dimensions |
| transpose() | 2-D matrix transpose (view, no copy) |
| transposeNd(array $axes) | N-D permutation, e.g. [1,0,2] |
| swapaxes(int $a, int $b) | Swap two axes |
| concat(Tensor $b, int $axis = 0) | Concatenate two tensors along an axis |
| stack(array $tensors, int $axis = 0) | Stack list of tensors into new dimension |
| split(int $chunks, int $axis = 0) | Split into N equal chunks along axis |
| pad(array $padding, float $val = 0.0) | Zero/constant pad on any axis |

Fused Kernels

These are hand-written C kernels that combine multiple operations to avoid intermediate tensors:

| Method | Description |
|---|---|
| fusedBceLossAndGrad(Tensor $preds, Tensor $targets, ?Tensor $grads) | Binary cross-entropy loss + gradient in one pass |
| fusedAdamStep(Tensor $p, Tensor $g, Tensor $m, Tensor $v, float $lr, ...) | Adam optimizer step with no intermediate allocation |
| fusedAdamWStep(...) | AdamW (decoupled weight decay) |
| fusedSgdStep(Tensor $p, Tensor $g, float $lr) | SGD weight update |
| fusedRmsPropStep(...) | RMSProp update |
| fusedAdaGradStep(...) | AdaGrad update |
| configureThreading(int $omp, int $blas = 1) | Set OpenMP and OpenBLAS thread counts at runtime |
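To show what a fused Adam step combines into one pass, here is the textbook Adam update written as a single loop over plain PHP arrays. This is a reference sketch, not PML's kernel: adamStepReference() is hypothetical, and the parameters $t, $beta1, $beta2 and $eps are the usual Adam hyperparameters assumed for illustration, since the remaining arguments of fusedAdamStep are not listed here.

```php
<?php
declare(strict_types=1);

// Textbook Adam update in one loop: moment updates, bias correction and the
// parameter step are done per element without building temporary arrays.
function adamStepReference(
    array &$p, array $g, array &$m, array &$v,
    float $lr, int $t,
    float $beta1 = 0.9, float $beta2 = 0.999, float $eps = 1e-8
): void {
    $c1 = 1.0 - $beta1 ** $t;   // bias correction for the first moment
    $c2 = 1.0 - $beta2 ** $t;   // bias correction for the second moment
    foreach ($g as $i => $gi) {
        $m[$i] = $beta1 * $m[$i] + (1.0 - $beta1) * $gi;
        $v[$i] = $beta2 * $v[$i] + (1.0 - $beta2) * $gi * $gi;
        $p[$i] -= $lr * ($m[$i] / $c1) / (sqrt($v[$i] / $c2) + $eps);
    }
}

$p = [1.0, -2.0];     // parameters
$g = [0.5, -0.25];    // gradients
$m = [0.0, 0.0];      // first-moment state
$v = [0.0, 0.0];      // second-moment state

adamStepReference($p, $g, $m, $v, 0.1, 1);   // first step, lr = 0.1
```

On the first step the bias-corrected ratio is g / |g|, so each parameter moves by roughly lr against the sign of its gradient: $p ends up close to [0.9, -1.9].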

Serialization

| Method | Description |
|---|---|
| save(string $filepath) | Write tensor to a raw binary file (.tensor) |
| load(string $filepath) | Read tensor from raw binary file |
| saveSafetensors(string $path, string $header, Tensor[] $tensors) | Write HuggingFace SafeTensors format (multiple tensors) |
| datasetFromCsv(string $path, int $labelCol, bool $header) | mmap CSV → [Tensor $X, Tensor $y]. Never loads the full file into the PHP heap. |

Arena — Memory Pool

Arena is a bump-pointer memory pool backed by a single malloc call. Allocating from an Arena is O(1) with no per-object overhead. All tensors allocated from an Arena are freed at once when reset() or __destruct() is called — ideal for per-batch scratch memory.

__construct (int $capacityBytes = 32 * 1024 * 1024)

Pre-allocates the arena buffer. Default is 32 MB. Choose a size that fits your largest batch.

tensor (array $shape, int $dtype = DTYPE_FLOAT32): Tensor

Bump-allocates a tensor from the pool. O(1), no malloc. The returned Tensor is only valid until reset().

reset (): void

Resets the bump pointer to the start. All previously allocated tensors are invalid after this call.

use Pml\Lib\Arena;
use Pml\Tensor;

$arena = new Arena(64 * 1024 * 1024); // 64 MB pool

foreach ($batches as $batch) {
    // Scratch tensors allocated from the pool — no malloc
    $tmp1 = Tensor::randomNormal([128, 256], arena: $arena->ptr());
    $tmp2 = $arena->tensor([128, 256]);

    // ... compute ...

    $arena->reset(); // reclaim all scratch memory instantly
}
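The bump-pointer idea behind Arena can be modelled in a few lines of PHP that hand out byte offsets instead of real memory. ToyArena is a hypothetical class for illustration only; the 64-byte alignment of each allocation is an assumption carried over from the Tensor memory model, not a documented Arena property.

```php
<?php
declare(strict_types=1);

// Toy bump allocator: it tracks a single offset into a fixed-size pool.
// Allocation rounds the offset up to the alignment and advances it;
// reset() rewinds it, which is what invalidates earlier allocations.
final class ToyArena
{
    private int $offset = 0;

    public function __construct(
        private int $capacity,
        private int $align = 64   // assumed alignment, in bytes
    ) {}

    /** Returns the byte offset of the new block inside the pool. */
    public function alloc(int $bytes): int
    {
        $start = intdiv($this->offset + $this->align - 1, $this->align) * $this->align;
        if ($start + $bytes > $this->capacity) {
            throw new OverflowException('arena exhausted');
        }
        $this->offset = $start + $bytes;
        return $start;
    }

    public function reset(): void
    {
        $this->offset = 0;
    }
}

$arena  = new ToyArena(1024);
$first  = $arena->alloc(100);   // starts at 0
$second = $arena->alloc(100);   // rounded up to the next multiple of 64
$arena->reset();
$reused = $arena->alloc(10);    // same region as $first again
```

Here $first is 0, $second is 128 and $reused is 0 again. The last value is the reason tensors obtained before reset() must not be touched afterwards: new allocations overlap their storage.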

QuantizedTensor (INT8 Block Quantization)

QuantizedTensor wraps a QuantizedWeight* C struct: an INT8 weight matrix with per-group-of-32 float32 scale factors (Q8_0-class format). Compression is ~4× vs fp32. The forward kernel uses AVX2 fused int8→fp32 dot products — no temporary weight allocation.

fromTensor static (Tensor $w, int $groupSize = 32): self

Quantizes a fp32 weight matrix to INT8. Each group of $groupSize elements shares a scale factor. Smaller groups → less quantization error, more scale overhead.

$qw = QuantizedTensor::fromTensor($weights, groupSize: 32);

linear (Tensor $X, ?Tensor $bias = null): Tensor

Quantized linear forward: Y = X @ W^T + bias. Hot path for LLM decode — AVX2 fused, OpenMP-parallel over rows. No temporary fp32 weight buffer.

toTensor (): Tensor

Dequantizes back to fp32. Use only for checkpoint export — not in decode hot path.
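The quantize/dequantize round trip for one group can be sketched in plain PHP. This follows the common Q8_0-style scheme (scale = max|x| / 127, values rounded to the nearest integer); quantizeGroup() and dequantizeGroup() are hypothetical helpers, and the exact scaling and rounding used by PML's C code is an assumption.

```php
<?php
declare(strict_types=1);

// Quantize one group of floats to signed 8-bit integers with a shared scale.
function quantizeGroup(array $group): array
{
    $maxAbs = max(array_map('abs', $group));
    $scale  = $maxAbs > 0.0 ? $maxAbs / 127.0 : 1.0;   // avoid dividing by zero
    $q = array_map(fn(float $x): int => (int) round($x / $scale), $group);
    return ['scale' => $scale, 'q' => $q];
}

// Recover approximate floats: each value is its int8 code times the scale.
function dequantizeGroup(array $packed): array
{
    return array_map(fn(int $q): float => $q * $packed['scale'], $packed['q']);
}

$group    = [0.5, -1.0, 0.25, 0.0];
$packed   = quantizeGroup($group);
$restored = dequantizeGroup($packed);

// Largest absolute round-trip error within the group.
$maxError = max(array_map(
    fn(float $a, float $b): float => abs($a - $b),
    $group,
    $restored
));
```

The largest-magnitude element maps to ±127, and the round-trip error of every element is bounded by half the scale. That bound explains the trade-off: a smaller group usually has a smaller max|x|, hence a finer scale and less error, at the cost of storing more scales.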

| Method | Returns | Description |
|---|---|---|
| rows() | int | Output dimension |
| cols() | int | Input dimension |
| groupSize() | int | Elements per quantization group |
| numGroups() | int | Total number of groups |
| memoryBytes() | int | Bytes used by data + scales (excludes struct header) |
| compressionRatio() | float | fp32 bytes ÷ INT8 bytes (typically ~3.9×) |

use Pml\QuantizedTensor;
use Pml\Tensor;

$W   = Tensor::randomNormal([4096, 4096]);   // 64 MB fp32
$qW  = QuantizedTensor::fromTensor($W, 32);   // ~16 MB INT8
$W   = null;                                  // free fp32

printf("Compression: %.2f×\n", $qW->compressionRatio());
// → Compression: 3.94×

$X   = Tensor::randomNormal([1, 4096]);
$out = $qW->linear($X);                       // AVX2 fused kernel

Backward not supported: once a layer is quantized it is permanently in inference mode. Calling backward() on a quantized Dense layer throws LogicException.