Tensor
The fundamental data structure of PML. A Tensor is a PHP object that holds a pointer to a
C-allocated, 64-byte aligned, contiguous float32 (or int32/int64) buffer.
All math is executed in C. PHP never copies the raw bytes unless you explicitly convert to or from PHP arrays with fromArray() or toFlatArray().
Memory Model
Each Tensor wraps a TensorC* pointer. The C struct holds a
float* (64-byte aligned via posix_memalign), shape info,
stride info, total element count, and a reference-count field used for views.
PHP's __destruct() decrements the ref-count; memory is freed when it reaches zero.
view(), row(), col(), slice(), and
transpose() return views — they do not allocate new C memory.
Operations that require contiguous memory (BLAS calls, saves) will call
contiguous() internally or throw.
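The view mechanism described above can be illustrated without any C code. The following is a minimal pure-PHP sketch, not PML's implementation: StridedView is a hypothetical class, and an SplFixedArray stands in for the shared C buffer. It shows how a transpose only swaps shape and strides, why writes through a view are visible in the parent, and how a contiguity check falls out of the strides.

```php
<?php
declare(strict_types=1);

// Hypothetical illustration class; it is not part of PML.
final class StridedView
{
    public function __construct(
        private SplFixedArray $buffer, // shared storage, never copied by views
        private array $shape,          // [rows, cols]
        private array $strides,        // elements to skip per step along each axis
        private int $offset = 0
    ) {}

    /** Packs a nested PHP array into a row-major buffer. */
    public static function fromRows(array $rows): self
    {
        $r = count($rows);
        $c = count($rows[0]);
        $buf = new SplFixedArray($r * $c);
        foreach ($rows as $i => $row) {
            foreach ($row as $j => $v) {
                $buf[$i * $c + $j] = (float) $v;
            }
        }
        return new self($buf, [$r, $c], [$c, 1]);
    }

    private function index(int $i, int $j): int
    {
        return $this->offset + $i * $this->strides[0] + $j * $this->strides[1];
    }

    public function get(int $i, int $j): float
    {
        return $this->buffer[$this->index($i, $j)];
    }

    public function set(int $i, int $j, float $v): void
    {
        $this->buffer[$this->index($i, $j)] = $v;
    }

    /** Same buffer, reversed shape and strides: no element is moved. */
    public function transpose(): self
    {
        return new self(
            $this->buffer,
            array_reverse($this->shape),
            array_reverse($this->strides),
            $this->offset
        );
    }

    /** True when the strides describe a packed row-major layout. */
    public function isContiguous(): bool
    {
        return $this->strides === [$this->shape[1], 1];
    }
}

$m = StridedView::fromRows([[1, 2, 3], [4, 5, 6]]);
$t = $m->transpose();      // shape [3, 2], strides [1, 3]
$t->set(2, 1, 60.0);       // writes through to $m at (1, 2)
```

A transposed view reports a non-contiguous layout, which is the situation in which a library would need to copy before handing the buffer to BLAS.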
Creation
Allocates a zero-filled tensor with the given shape.
$t = Tensor::zeros(3, 4); // [3, 4] of 0.0
Allocates a one-filled tensor.
$t = Tensor::ones(8);
Samples from N(mean, stddev²). Used by Dense, Embedding, and most initializers. Can allocate from an Arena.
$W = Tensor::randomNormal([512, 256], 0.0, sqrt(2.0/256));
Uniform random in [min, max).
1-D arange, equivalent to NumPy np.arange(start, end, step).
Evenly spaced values including both endpoints.
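The difference between the two range constructors above is the treatment of the end value: the arange form is half-open, while the evenly spaced form includes both endpoints. A minimal pure-PHP sketch of those semantics follows; arangeSketch() and linspaceSketch() are hypothetical helper names chosen for illustration and are not PML methods.

```php
<?php
declare(strict_types=1);

/** Half-open range [start, end) in increments of step, as np.arange does. */
function arangeSketch(float $start, float $end, float $step = 1.0): array
{
    if ($step == 0.0) {
        throw new InvalidArgumentException('step must be non-zero');
    }
    $n = max(0, (int) ceil(($end - $start) / $step));
    $out = [];
    for ($i = 0; $i < $n; $i++) {
        $out[] = $start + $i * $step; // multiply instead of accumulating to limit drift
    }
    return $out;
}

/** $num evenly spaced values over the closed interval [start, end]. */
function linspaceSketch(float $start, float $end, int $num): array
{
    if ($num < 1) {
        throw new InvalidArgumentException('num must be at least 1');
    }
    if ($num === 1) {
        return [$start];
    }
    $step = ($end - $start) / ($num - 1);
    $out = [];
    for ($i = 0; $i < $num; $i++) {
        $out[] = $start + $i * $step;
    }
    $out[$num - 1] = $end; // pin the last value so the endpoint is exact
    return $out;
}

$halfOpen = arangeSketch(0.0, 5.0, 2.0);   // stops before 5.0
$closed   = linspaceSketch(0.0, 1.0, 5);   // includes 1.0
```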
Copies a nested PHP array into a new C tensor. Supports 1-D and 2-D arrays. Do not use on large datasets — prefer Dataset::fromCsv() for files.
$t = Tensor::fromArray([[1.0, 2.0], [3.0, 4.0]]);
Allocates an uninitialised tensor with the same shape and dtype as $t. Faster than zeros when the buffer will be fully overwritten.
Properties & Shape
| Method | Returns | Description |
|---|---|---|
| shape() | int[] | Array of dimension sizes, e.g. [64, 128] |
| ndim() | int | Number of dimensions |
| size() | int | Total element count (product of all dims) |
| dtype() | int | DTYPE_FLOAT32, DTYPE_INT32, or DTYPE_INT64 |
| isContiguous() | bool | True if strides match a row-major layout |
Data-type constants
Tensor::DTYPE_FLOAT32 // 0 — default, 32-bit float
Tensor::DTYPE_INT32 // 1 — 32-bit integer
Tensor::DTYPE_INT64 // 2 — 64-bit integer (token ids)
Indexing & Views
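Views share the parent tensor's C buffer and may be non-contiguous. When a packed row-major copy is required, call contiguous().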
| Method | Returns | Description |
|---|---|---|
| view() | self | View with same data pointer, increments ref-count |
| row(int $r) | self | View of one row of a 2-D tensor → 1-D |
| col(int $c) | self | View of one column (non-contiguous stride) |
| slice(int $axis, int $start, int $len) | self | Sub-tensor slice along an axis |
| sliceStep(int $axis, int $start, int $end, int $step) | self | Strided slice (like Python a[::2]) |
| copy() | self | Deep copy; allocates new C memory |
| contiguous() | self | Returns self if already contiguous, else copies to a contiguous buffer |
| fill(float $val) | self | Fills all elements in-place, returns self |
| copyFrom(Tensor $src) | void | Overwrites this tensor's data with $src (shapes must match) |
| toFlatArray() | float[] | Copies C memory into a PHP array; avoid in hot paths |
| buffer() | CData | Raw float* CData pointer for zero-copy FFI interop |
Arithmetic (returns new Tensor)
| Method | Operation | Notes |
|---|---|---|
| add(Tensor $b) | A + B | Broadcasting supported |
| sub(Tensor $b) | A − B | |
| mul(Tensor $b) | A ⊙ B | Element-wise (Hadamard) |
| div(Tensor $b) | A ÷ B | |
| pow(Tensor $b) | A^B | Element-wise power |
| addScalar(float $v) | A + v | |
| mulScalar(float $v) | A × v | |
| clip(float $min, float $max) | clamp(A) | |
| lessScalar(float $v) | A < v → {0,1} | Boolean mask as float |
| greaterScalar(float $v) | A > v → {0,1} | Boolean mask as float |
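The table notes that add() supports broadcasting but does not spell out the rule. The sketch below assumes NumPy-style broadcasting (shapes are right-aligned, and a dimension of size 1 stretches to match the other operand); whether PML follows exactly this rule is an assumption, and broadcastShapeSketch() is a hypothetical helper, not a PML function.

```php
<?php
declare(strict_types=1);

/**
 * Result shape of a NumPy-style broadcast of two shapes.
 * Hypothetical helper for illustration only.
 */
function broadcastShapeSketch(array $a, array $b): array
{
    $n = max(count($a), count($b));
    // Right-align the shapes by left-padding the shorter one with 1s.
    $a = array_pad($a, -$n, 1);
    $b = array_pad($b, -$n, 1);

    $out = [];
    for ($i = 0; $i < $n; $i++) {
        if ($a[$i] === $b[$i] || $b[$i] === 1) {
            $out[] = $a[$i];
        } elseif ($a[$i] === 1) {
            $out[] = $b[$i];
        } else {
            throw new InvalidArgumentException(
                "Incompatible dimensions {$a[$i]} and {$b[$i]} at axis {$i}"
            );
        }
    }
    return $out;
}

// A [64, 128] activation plus a [128] bias broadcasts the bias over every row.
$shape = broadcastShapeSketch([64, 128], [128]);
```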
In-place Operations
All in-place methods modify the tensor and return $this for chaining. No heap allocation.
| Method | Operation |
|---|---|
| addInplace(Tensor $b) | A += B |
| subInplace(Tensor $b) | A -= B |
| mulInplace(Tensor $b) | A *= B |
| divInplace(Tensor $b) | A /= B |
| addScalarInplace(float $v) | A += v |
| mulScalarInplace(float $v) | A *= v |
| clampInplace(float $lo, float $hi) | A = clamp(A, lo, hi) |
| expInplace() | A = exp(A) |
| logInplace() | A = log(A) |
| sqrtInplace() | A = sqrt(A) |
| sigmoidInplace() | A = σ(A) |
| tanhInplace() | A = tanh(A) |
| reluInplace() | A = max(0, A) |
| rowSoftmaxInplace() | Softmax over last axis, numerically stable |
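"Numerically stable" softmax usually means subtracting each row's maximum before exponentiating, so that exp() never overflows. The following pure-PHP sketch shows that technique on nested arrays; it illustrates the idea only and assumes nothing about how PML's C kernel is written. rowSoftmaxInplaceSketch() is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * In-place softmax over the last axis of a 2-D nested array.
 * Subtracting the row maximum leaves the result unchanged mathematically
 * but keeps every exponent at or below zero.
 */
function rowSoftmaxInplaceSketch(array &$rows): void
{
    foreach ($rows as &$row) {
        $max = max($row);
        $sum = 0.0;
        foreach ($row as &$v) {
            $v = exp($v - $max);
            $sum += $v;
        }
        unset($v);
        foreach ($row as &$v) {
            $v /= $sum;
        }
        unset($v);
    }
    unset($row);
}

// Logits of 1000 would overflow a naive exp(); here they stay finite.
$logits = [[1000.0, 1000.0], [0.0, log(3.0)]];
rowSoftmaxInplaceSketch($logits);
```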
Unary / Element-wise (returns new Tensor)
| Group | Methods |
|---|---|
| Math | sqrt() · square() · abs() · sign() · exp() · log() · log1p() · round() · floor() · ceil() |
| Trig | sin() · cos() · tan() · asin() · acos() · atan() |
| Activations | sigmoid() · tanh() · relu() · gelu() · silu() · elu() · selu() · softplus() · mish() |
| Norm | softmax(int $axis) · logSoftmax(int $axis) · layerNorm(Tensor $g, Tensor $b, float $eps) |
Reductions
| Method | Returns | Description |
|---|---|---|
| sum() | float | Sum of all elements |
| mean() | float | Mean of all elements |
| max() | float | Global maximum |
| min() | float | Global minimum |
| sumAxis(int $axis) | Tensor | Reduces along axis, removes that dim |
| meanAxis(int $axis) | Tensor | Mean along axis |
| maxAxis(int $axis) | Tensor | Max along axis |
| minAxis(int $axis) | Tensor | Min along axis |
| argmax() | int | Global argmax (flat index) |
| argmaxAxis(int $axis) | Tensor | Argmax along axis (int32 result) |
| argmin() | int | Global argmin |
| std(bool $bessel = true) | float | Standard deviation |
| variance(bool $bessel = true) | float | Variance |
| normL2() | float | L2 norm |
| sumAxisInto(Tensor $src, int $axis) | void | Zero-allocation sum: writes result into pre-allocated tensor |
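The $bessel flag on std() and variance() presumably selects Bessel's correction, that is, dividing by N − 1 (sample estimate) instead of N (population value). That reading is an assumption based on the parameter name. A minimal pure-PHP sketch of the two variants, using hypothetical helper names:

```php
<?php
declare(strict_types=1);

/** Variance of a flat list; divides by N - 1 when $bessel is true, else by N. */
function varianceSketch(array $x, bool $bessel = true): float
{
    $n = count($x);
    if ($n === 0 || ($bessel && $n < 2)) {
        throw new InvalidArgumentException('not enough elements');
    }
    $mean = array_sum($x) / $n;
    $ss = 0.0;
    foreach ($x as $v) {
        $ss += ($v - $mean) ** 2; // sum of squared deviations
    }
    return $ss / ($bessel ? $n - 1 : $n);
}

/** Standard deviation is the square root of the chosen variance. */
function stdSketch(array $x, bool $bessel = true): float
{
    return sqrt(varianceSketch($x, $bessel));
}

$data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
$sample     = varianceSketch($data);         // divides by 7
$population = varianceSketch($data, false);  // divides by 8
```

The corrected estimate is always the larger of the two, and the gap shrinks as the element count grows.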
Linear Algebra
General matrix multiply via OpenBLAS SGEMM. Supports batched matmul for 3-D tensors (batch × M × K) × (batch × K × N).
$C = $A->matmul($B); // [M,K] × [K,N] → [M,N]
$C = $A->matmul($B, true); // A^T × B
Zero-allocation GEMM: writes result into $this (pre-allocated buffer). Used by Dense::backward() to reuse gradient buffers across steps.
Fused X @ W^T + bias in one BLAS call. Used by Dense::forward(). Shape: X[batch, in] × W[out, in]^T → [batch, out].
Squared Euclidean distance matrix between every pair of rows. Used by KNN, DBSCAN, KMeans. O(NM·D) via BLAS.
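A distance matrix can be computed through BLAS by expanding ‖x − y‖² = ‖x‖² + ‖y‖² − 2·x·y, which turns the pairwise term into one matrix product. The pure-PHP sketch below shows that identity on nested arrays. It is an assumption that PML uses this exact expansion, and dotSketch() and pairwiseSqDistSketch() are hypothetical names.

```php
<?php
declare(strict_types=1);

/** Inner product of two equal-length lists. */
function dotSketch(array $a, array $b): float
{
    $s = 0.0;
    foreach ($a as $k => $v) {
        $s += $v * $b[$k];
    }
    return $s;
}

/**
 * Squared Euclidean distance between every row of $X (N rows) and
 * every row of $Y (M rows). Returns an N x M nested array.
 */
function pairwiseSqDistSketch(array $X, array $Y): array
{
    // Row norms are computed once: O(N*D + M*D).
    $xn = array_map(fn(array $r): float => dotSketch($r, $r), $X);
    $yn = array_map(fn(array $r): float => dotSketch($r, $r), $Y);

    $D = [];
    foreach ($X as $i => $x) {
        foreach ($Y as $j => $y) {
            // The cross term is the part a GEMM call would compute in bulk.
            $d = $xn[$i] + $yn[$j] - 2.0 * dotSketch($x, $y);
            $D[$i][$j] = max(0.0, $d); // clamp tiny negatives from rounding
        }
    }
    return $D;
}

$D = pairwiseSqDistSketch([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [1.0, 1.0]]);
```

The clamp matters in float32 especially, where cancellation can push the distance between near-identical rows slightly below zero.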
Closed-form Ridge regression: W = (X^T X + λI)⁻¹ X^T y via LAPACKE SGESV.
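The closed form above amounts to building the normal equations (XᵀX + λI) W = Xᵀy and solving the linear system rather than forming an explicit inverse. The pure-PHP sketch below does this with Gaussian elimination and partial pivoting for a single target column. It stands in for the LAPACK solve for illustration only, and ridgeSolveSketch() is a hypothetical name.

```php
<?php
declare(strict_types=1);

/** Solves (X^T X + lambda I) w = X^T y for w. $X is N x D, $y has N entries. */
function ridgeSolveSketch(array $X, array $y, float $lambda): array
{
    $n = count($X);
    $d = count($X[0]);

    // Build A = X^T X + lambda I and b = X^T y.
    $A = array_fill(0, $d, array_fill(0, $d, 0.0));
    $b = array_fill(0, $d, 0.0);
    for ($k = 0; $k < $n; $k++) {
        for ($i = 0; $i < $d; $i++) {
            $b[$i] += $X[$k][$i] * $y[$k];
            for ($j = 0; $j < $d; $j++) {
                $A[$i][$j] += $X[$k][$i] * $X[$k][$j];
            }
        }
    }
    for ($i = 0; $i < $d; $i++) {
        $A[$i][$i] += $lambda;
    }

    // Forward elimination with partial pivoting.
    for ($col = 0; $col < $d; $col++) {
        $pivot = $col;
        for ($r = $col + 1; $r < $d; $r++) {
            if (abs($A[$r][$col]) > abs($A[$pivot][$col])) {
                $pivot = $r;
            }
        }
        if (abs($A[$pivot][$col]) < 1e-12) {
            throw new RuntimeException('matrix is singular; increase lambda');
        }
        [$A[$col], $A[$pivot]] = [$A[$pivot], $A[$col]];
        [$b[$col], $b[$pivot]] = [$b[$pivot], $b[$col]];
        for ($r = $col + 1; $r < $d; $r++) {
            $f = $A[$r][$col] / $A[$col][$col];
            for ($c = $col; $c < $d; $c++) {
                $A[$r][$c] -= $f * $A[$col][$c];
            }
            $b[$r] -= $f * $b[$col];
        }
    }

    // Back substitution.
    $w = array_fill(0, $d, 0.0);
    for ($i = $d - 1; $i >= 0; $i--) {
        $s = $b[$i];
        for ($c = $i + 1; $c < $d; $c++) {
            $s -= $A[$i][$c] * $w[$c];
        }
        $w[$i] = $s / $A[$i][$i];
    }
    return $w;
}

// With lambda = 0 this reduces to ordinary least squares.
$w = ridgeSolveSketch([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0], 0.0);
```

A positive λ shrinks the weights toward zero and also makes the system solvable when XᵀX alone is singular.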
| Method | Description |
|---|---|
| svd(): array | Returns [U, S, Vt] via LAPACKE SGESVD. Used by PCA, TruncatedSVD, LDA. |
| dot(Tensor $b): float | 1-D inner product |
| outer(Tensor $b): Tensor | Outer product → 2-D matrix |
| norm(int $ord = 2): float | Vector or matrix norm |
| addRelu(Tensor $b): Tensor | Fused (A + B) then ReLU: one kernel, no intermediate tensor |
| mulAdd(Tensor $B, Tensor $C): Tensor | Fused A⊙B + C (FMA) |
Shape Ops
| Method | Description |
|---|---|
| reshape(int ...$shape) | Returns view with new shape (must have same element count) |
| flatten() | Reshape to 1-D |
| expandDims(int $axis) | Insert size-1 dimension at $axis |
| squeeze() | Remove all size-1 dimensions |
| transpose() | 2-D matrix transpose (view, no copy) |
| transposeNd(array $axes) | N-D permutation, e.g. [1,0,2] |
| swapaxes(int $a, int $b) | Swap two axes |
| concat(Tensor $b, int $axis = 0) | Concatenate two tensors along an axis |
| stack(array $tensors, int $axis = 0) | Stack list of tensors into new dimension |
| split(int $chunks, int $axis = 0) | Split into N equal chunks along axis |
| pad(array $padding, float $val = 0.0) | Zero/constant pad on any axis |
Fused Kernels
These are hand-written C kernels that combine multiple operations to avoid intermediate tensors:
| Method | Description |
|---|---|
| fusedBceLossAndGrad(Tensor $preds, Tensor $targets, ?Tensor $grads) | Binary cross-entropy loss + gradient in one pass |
| fusedAdamStep(Tensor $p, Tensor $g, Tensor $m, Tensor $v, float $lr, ...) | Adam optimizer step; no intermediate allocation |
| fusedAdamWStep(...) | AdamW (decoupled weight decay) |
| fusedSgdStep(Tensor $p, Tensor $g, float $lr) | SGD weight update |
| fusedRmsPropStep(...) | RMSProp update |
| fusedAdaGradStep(...) | AdaGrad update |
| configureThreading(int $omp, int $blas = 1) | Set OpenMP and OpenBLAS thread counts at runtime |
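To make "fused" concrete, the sketch below performs a standard Adam update in a single loop: both moment estimates and the parameter are updated per element, so no temporary array for the bias-corrected moments is ever built. It is a pure-PHP illustration of the textbook algorithm, not PML's kernel. The parameters after $lr are elided in the table above, so $t, $b1, $b2 and $eps here are assumptions, as is the name adamStepSketch().

```php
<?php
declare(strict_types=1);

/**
 * One Adam step over flat parameter arrays, updated in place.
 * $t is the 1-based step counter used for bias correction.
 */
function adamStepSketch(
    array &$p,
    array $g,
    array &$m,
    array &$v,
    int $t,
    float $lr = 1e-3,
    float $b1 = 0.9,
    float $b2 = 0.999,
    float $eps = 1e-8
): void {
    // Bias-correction factors depend only on the step, so hoist them.
    $c1 = 1.0 - $b1 ** $t;
    $c2 = 1.0 - $b2 ** $t;

    foreach ($g as $i => $grad) {
        $m[$i] = $b1 * $m[$i] + (1.0 - $b1) * $grad;          // first moment
        $v[$i] = $b2 * $v[$i] + (1.0 - $b2) * $grad * $grad;  // second moment
        $p[$i] -= $lr * ($m[$i] / $c1) / (sqrt($v[$i] / $c2) + $eps);
    }
}

$p = [1.0, 1.0];
$m = [0.0, 0.0];
$v = [0.0, 0.0];
adamStepSketch($p, [0.5, -2.0], $m, $v, 1);
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly $lr regardless of gradient magnitude.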
Serialization
| Method | Description |
|---|---|
| save(string $filepath) | Write tensor to a raw binary file (.tensor) |
| load(string $filepath) | Read tensor from raw binary file |
| saveSafetensors(string $path, string $header, Tensor[] $tensors) | Write HuggingFace SafeTensors format (multiple tensors) |
| datasetFromCsv(string $path, int $labelCol, bool $header) | mmap CSV → [Tensor $X, Tensor $y]. Never loads full file into PHP heap. |
Arena — Memory Pool
Arena is a bump-pointer memory pool backed by a single malloc call.
Allocating from an Arena is O(1) with no per-object overhead.
All tensors allocated from an Arena are freed at once when reset() or
__destruct() is called — ideal for per-batch scratch memory.
Pre-allocates the arena buffer. Default is 32 MB. Choose a size that fits your largest batch.
Bump-allocates a tensor from the pool. O(1), no malloc. The returned Tensor is only valid until reset().
Resets the bump pointer to the start. All previously allocated tensors are invalid after this call.
use Pml\Lib\Arena;
use Pml\Tensor;
$arena = new Arena(64 * 1024 * 1024); // 64 MB pool
foreach ($batches as $batch) {
    // Scratch tensors allocated from the pool: no malloc
    $tmp1 = Tensor::randomNormal([128, 256], arena: $arena->ptr());
    $tmp2 = $arena->tensor([128, 256]);
    // ... compute ...
    $arena->reset(); // reclaim all scratch memory instantly
}
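The bump-pointer idea behind the Arena is small enough to show in full. The sketch below tracks only byte offsets into an imaginary buffer, so it runs without any native memory. BumpArenaSketch is a hypothetical class, and the 64-byte alignment of each block is an assumption borrowed from the Tensor buffer alignment described earlier.

```php
<?php
declare(strict_types=1);

// Hypothetical illustration of a bump allocator; it hands out offsets, not memory.
final class BumpArenaSketch
{
    private int $offset = 0;

    public function __construct(
        private int $capacity,
        private int $align = 64
    ) {}

    /** Returns the start offset of a block of $bytes; O(1), no search. */
    public function alloc(int $bytes): int
    {
        // Round the current offset up to the next multiple of the alignment.
        $start = intdiv($this->offset + $this->align - 1, $this->align) * $this->align;
        if ($start + $bytes > $this->capacity) {
            throw new OverflowException('arena exhausted');
        }
        $this->offset = $start + $bytes;
        return $start;
    }

    /** Frees everything at once by rewinding the pointer. */
    public function reset(): void
    {
        $this->offset = 0;
    }

    public function used(): int
    {
        return $this->offset;
    }
}

$arena = new BumpArenaSketch(1024);
$first  = $arena->alloc(100); // starts at the beginning of the pool
$second = $arena->alloc(10);  // starts at the next 64-byte boundary
$arena->reset();              // both blocks are now invalid
```

Because reset() only rewinds an integer, anything still pointing into the pool afterwards refers to memory that the next batch will overwrite, which is why tensors from an arena must not outlive the reset.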
QuantizedTensor (INT8 Block Quantization)
QuantizedTensor wraps a QuantizedWeight* C struct: an INT8 weight matrix
with per-group-of-32 float32 scale factors (Q8_0-class format). Compression is ~4× vs fp32.
The forward kernel uses AVX2 fused int8→fp32 dot products — no temporary weight allocation.
Quantizes a fp32 weight matrix to INT8. Each group of $groupSize elements shares a scale factor. Smaller groups → less quantization error, more scale overhead.
$qw = QuantizedTensor::fromTensor($weights, groupSize: 32);
Quantized linear forward: Y = X @ W^T + bias. Hot path for LLM decode — AVX2 fused, OpenMP-parallel over rows. No temporary fp32 weight buffer.
Dequantizes back to fp32. Use only for checkpoint export — not in decode hot path.
| Method | Returns | Description |
|---|---|---|
| rows() | int | Output dimension |
| cols() | int | Input dimension |
| groupSize() | int | Elements per quantization group |
| numGroups() | int | Total number of groups |
| memoryBytes() | int | Bytes used by data + scales (excludes struct header) |
| compressionRatio() | float | fp32 bytes ÷ INT8 bytes (typically ~3.9×) |
use Pml\QuantizedTensor;
use Pml\Tensor;
$W = Tensor::randomNormal([4096, 4096]); // 64 MB fp32
$qW = QuantizedTensor::fromTensor($W, 32); // ~16 MB INT8
$W = null; // free fp32
printf("Compression: %.2f×\n", $qW->compressionRatio());
// → Compression: 3.94×
$X = Tensor::randomNormal([1, 4096]);
$out = $qW->linear($X); // AVX2 fused kernel
Quantized weights are inference-only: backward() on a quantized Dense layer throws a LogicException.
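Block quantization as described in this section can be sketched in pure PHP. The version below assumes a symmetric absmax scheme (each group's scale is its largest magnitude divided by 127), which is a common choice for Q8_0-style formats but is not confirmed by this page. All function names are hypothetical. The last helper works out the storage ratio for a given group size and scale width. Under the stated layout of float32 scales and groups of 32, counting data plus scales gives 128 ÷ 36 ≈ 3.56, so the figure a library reports depends on how it stores and counts the scales.

```php
<?php
declare(strict_types=1);

/** Quantizes a flat list to int8 codes plus one scale per group. */
function quantizeGroupsSketch(array $w, int $groupSize = 32): array
{
    $q = [];
    $scales = [];
    foreach (array_chunk($w, $groupSize) as $group) {
        $absMax = max(array_map('abs', $group));
        $scale = $absMax > 0.0 ? $absMax / 127.0 : 1.0; // avoid dividing by zero
        $scales[] = $scale;
        foreach ($group as $x) {
            // Round to nearest code and clamp to the symmetric int8 range.
            $q[] = (int) max(-127, min(127, round($x / $scale)));
        }
    }
    return [$q, $scales];
}

/** Reconstructs approximate fp values from codes and per-group scales. */
function dequantizeGroupsSketch(array $q, array $scales, int $groupSize = 32): array
{
    $out = [];
    foreach ($q as $i => $code) {
        $out[] = $code * $scales[intdiv($i, $groupSize)];
    }
    return $out;
}

/** fp32 bytes divided by (int8 data + scale) bytes for one group. */
function compressionRatioSketch(int $groupSize, int $scaleBytes = 4): float
{
    return (4 * $groupSize) / ($groupSize + $scaleBytes);
}

$w = array_map(fn(int $i): float => sin($i), range(0, 63));
[$q, $scales] = quantizeGroupsSketch($w, 32);
$approx = dequantizeGroupsSketch($q, $scales, 32);
```

The reconstruction error of each element is bounded by half its group's scale, which is why smaller groups reduce error at the cost of storing more scales.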