Performance
This framework is built around C-level compute kernels and zero-copy data movement.
Why it is fast
- C computation: All heavy math is implemented in `src/Lib/*.c` and exposed through FFI.
- OpenBLAS / LAPACKE: Linear algebra operations such as matrix multiplication and decompositions use optimized native libraries.
- AVX2 and fused kernels: The backend includes fused kernels for common deep learning patterns such as `linear`, `addRelu`, and `mulAdd`.
- Zero-copy views: Tensor slicing and batching create views instead of copying data when possible.
- SafeTensors mmap: Model weights can be loaded directly from disk without copying into the PHP heap.
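The fused kernels are the main win here: one C call replaces several elementwise passes, so the intermediate results never have to be materialized in memory. A minimal sketch, assuming `addRelu` is exposed as an instance method taking the tensor to add (the exact signature is not confirmed by this page):

```php
use Pml\Tensor;

$w = Tensor::randomNormal([256, 256]);
$b = Tensor::randomNormal([256, 256]);

// One fused C kernel: relu(w + b) in a single memory pass,
// instead of an add that allocates a temporary followed by a relu.
$h = $w->addRelu($b);
```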
CPU optimizations
- OpenMP: The C runtime is compiled with `-fopenmp` and exposes `tensor_configure_threading()`.
- BLAS threading: `configureThreading()` lets you control BLAS and OpenMP thread counts independently.
- Vectorized kernels: Fused operations use AVX2 instructions for fewer memory passes.
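A sketch of configuring the two thread pools at startup. The two-argument split (BLAS threads, then OpenMP threads) is an assumption based on the description above; check the API reference for the actual parameter order:

```php
use Pml\Tensor;

// Assumed signature: configureThreading(int $blasThreads, int $ompThreads).
// Keep the product at or below your physical core count so the BLAS pool
// and the OpenMP pool do not oversubscribe each other.
Tensor::configureThreading(4, 2);
```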
Zero-copy design
- `Tensor::slice()` and `Dataset::slice()` use C views that share the same underlying buffer.
- Views retain a `parent` reference to prevent the original memory from being freed.
- `SafeTensorsIO::load()` returns mmap-backed tensors that are not copied into process memory.
- `Dataset::batches()` yields zero-copy tensor slices.
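A sketch of the view semantics described above. The `slice(start, length)` argument convention is an assumption for illustration:

```php
use Pml\Tensor;

$t = Tensor::randomNormal([1000, 64]);

// A view of the first 100 rows: shares $t's buffer, copies nothing.
// (Assumed convention: slice(start, length) along the first axis.)
$head = $t->slice(0, 100);

// Because the view holds a parent reference to $t, the underlying
// buffer stays alive even if $t itself goes out of scope first.
unset($t);
$sum = $head->sum(); // still valid; reads the shared buffer
```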
Data ingestion performance
- `Dataset::fromCSV()` uses `tensor_dataset_from_csv()` for numeric CSVs, bypassing PHP array allocation.
- For mixed-type CSVs, `Dataset::load()` uses the ETL-mode C DataFrame to parse and transform data.
- `Tensor::fromArray()` packs nested PHP arrays into a binary string and copies them in a single FFI boundary crossing.
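A sketch of the two ingestion paths, assuming both loaders take a file path as their first argument (any further options are not documented here):

```php
use Pml\Dataset;

// Numeric-only CSV: parsed entirely in C via tensor_dataset_from_csv(),
// so no intermediate PHP arrays are allocated.
$train = Dataset::fromCSV('data/train_numeric.csv');

// Mixed-type CSV (strings, categoricals, numbers): routed through the
// ETL-mode C DataFrame, which parses and transforms before tensorization.
$raw = Dataset::load('data/raw_mixed.csv');
```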
Practical tips
- Call `Tensor::configureThreading()` once at startup to avoid thread oversubscription.
- Prefer `Dataset::randomize()` over manual PHP shuffling.
- Avoid `toArray()` or `toFlatArray()` in hot loops.
- Use `Tensor::copyFrom()` and `matmulInto()` to reuse pre-allocated buffers.
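The last tip is worth a sketch: in a tight loop, allocating a fresh output tensor per iteration dominates the cost. Assuming `matmulInto` writes its result into a caller-supplied tensor and `Tensor::zeros` exists as a constructor (neither signature is confirmed by this page):

```php
use Pml\Tensor;

$a = Tensor::randomNormal([512, 512]);
$b = Tensor::randomNormal([512, 512]);

// Allocate the output buffer once, outside the hot loop.
$out = Tensor::zeros([512, 512]);

for ($i = 0; $i < 100; $i++) {
    // Assumed: $a->matmulInto($b, $out) stores a·b in $out
    // instead of allocating a new result tensor each iteration.
    $a->matmulInto($b, $out);
}
```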
Example
```php
use Pml\Tensor;

// Configure BLAS and OpenMP thread counts once, at startup.
Tensor::configureThreading(8, 2);

$x = Tensor::randomNormal([1024, 1024]);
$y = Tensor::randomNormal([1024, 1024]);

// Dispatches to the OpenBLAS-backed C matmul kernel.
$z = $x->matmul($y);
```