PML / Inference & LLM
Inference

Inference & LLM

Run LLaMA-style language models directly in PHP. The C inference engine (inference.c) implements Grouped Query Attention (GQA), RoPE, RMSNorm, SwiGLU MLP, and autoregressive decode with a multi-head KV-cache. The KVCache class eliminates O(T²) recompute — each decode step is O(1) in sequence length.

Architecture

PHP call: $session->step($tokenId, $pos) ↓ 1 FFI crossing inference.c: inf_forward() ├─ Embedding lookup (token → dModel vector) ├─ For each layer (nLayers): │ ├─ RMSNorm(x) │ ├─ CausalSelfAttention (GQA): │ │ ├─ Q = x @ Wq K = x @ Wk V = x @ Wv │ │ ├─ RoPE: Q, K rotated by position $pos │ │ ├─ KV-cache: append K,V; attend with all cached K,V │ │ │ └─ Milakov online-softmax (O(head_dim) working mem) │ │ └─ out = attn_out @ Wo │ ├─ Residual: x += out │ ├─ RMSNorm(x) │ ├─ SwiGLU MLP: x2 = silu(x @ Wg) ⊙ (x @ Wu); out = x2 @ Wd │ └─ Residual: x += out └─ Final RMSNorm → logits = x @ Wout [vocabSize]
GQA (Grouped Query Attention) When nKvHeads < nHeads, K/V heads are shared across Q head groups (LLaMA-3, Mistral-style). The C kernel handles the grouping natively.

InferenceSession

The main entry point for LLM inference. Loads weights from a SafeTensors file, sets up the KV-cache, and exposes token-step and generation APIs.

load static (string $safetensorsPath, ModelConfig $config): self

Loads model weights from a SafeTensors file and initialises the C inference context. Weights are mmap'd — no full copy into RAM.

$cfg     = ModelConfig::llama3_8b();
$session = InferenceSession::load('models/llama3-8b.safetensors', $cfg);
loadFile static (string $modelDir): self

Convenience loader: reads $modelDir/config.json for ModelConfig and $modelDir/model.safetensors for weights.

step (int $tokenId, int $pos): Tensor Tensor [vocabSize]

Single-token forward pass with KV-cache. Called in a loop for autoregressive decode. One FFI crossing per token.

forward (array $tokens): Tensor Tensor [vocabSize]

Full prefill forward pass: processes all prompt tokens in one C call (batch attention), returns logits for the last token. Populates the KV-cache.

generate (array $promptIds, int $maxNewTokens = 256, float $temperature = 0.0, float $topP = 0.9): Generator

Autoregressive generation. Yields one token ID per step. Stops at EOS or $maxNewTokens.

foreach ($session->generate($promptIds, maxNewTokens: 512, temperature: 0.8) as $id) {
    echo $tok->decode([$id]);
    flush();
}
generateIds (array $promptIds, int $maxNewTokens = 256, float $temperature = 0.0, float $topP = 0.9): array

Collects all generated token IDs into an array (non-streaming).

chat (array $messages, int $maxNewTokens = 512, float $temperature = 0.0): string

Chat-template generation. Messages format: [['role' => 'user', 'content' => '...'], ...]. Applies LLaMA-3 / ChatML template, generates and decodes response.

MethodDescription
sampleGreedy(Tensor $logits): intArgmax sampling — deterministic
sample(Tensor $logits, float $temp, float $topP): intTemperature + top-p nucleus sampling
resetKv(): voidClears KV-cache for a new conversation
getWeight(string $name): ?TensorAccess a named weight tensor by SafeTensors key
parseConfig(string $jsonPath): ModelConfigParse config.json to ModelConfig

KVCache

Pml\KVCache wraps a C MultiKVCache* struct that manages nLayers × nHeads key/value cache slots. Eliminates the O(T²) attention recompute during autoregressive decode.

FFI crossings per decode step: 2 total — one mkvca_append (add K,V for new token) and one mkvca_attend (Milakov online-softmax attention over all cached tokens). OpenMP parallelizes over heads inside C.
__construct (int $nLayers, int $nHeads, int $maxSeqLen, int $headDim)

Allocates the multi-head cache. Memory: nLayers × nHeads × maxSeqLen × 2 × headDim × 4 bytes.

$kv = new KVCache(
    nLayers:   32,
    nHeads:    8,    // nKvHeads for GQA
    maxSeqLen: 4096,
    headDim:   128,
);
printf("Cache memory: %.1f MB\n", $kv->memoryBytes() / 1e6);
MethodSignatureDescription
prefill(int $layerIdx, Tensor $k, Tensor $v): voidPopulate cache with T prompt tokens (K: [T, nH, hd], V: same)
append(int $layerIdx, Tensor $k, Tensor $v): voidAppend one decode token (K: [nH, hd], V: same)
attend(int $layerIdx, Tensor $q): TensorCompute attention over all cached K,V for query Q [nH, 1, hd]. Returns [1, nH, hd].
reset(): voidClear all caches (start new generation)
memoryBytes(): intTotal RAM used by K+V buffers
nLayers/nHeads/maxSeqLen/headDimConstruction parameters

Memory estimate

// LLaMA-3 8B: 32 layers, 8 KV-heads, 4096 max context, 128 head-dim
$kv = new KVCache(32, 8, 4096, 128);
// = 32 × 8 × 4096 × 2 × 128 × 4 bytes = 1,073,741,824 bytes = 1.07 GB

// Reduce context to 2048 for lower memory:
$kv = new KVCache(32, 8, 2048, 128);
// = 536 MB

Tokenizer (BPE)

Pml\Inference\Tokenizer wraps the C BPE tokenizer (tokenizer.c). Supports HuggingFace tokenizer.json format and raw vocabulary + merges files. Zero-copy: the PHP Tokenizer holds a C Tokenizer* pointer.

fromJson static (string $path): self

Load from a HuggingFace tokenizer.json file. Compatible with LLaMA, Mistral, Gemma, SmolLM2.

$tok = Tokenizer::fromJson('models/llama3/tokenizer.json');
fromFiles static (string $vocabPath, string $mergesPath): self

Load from separate vocabulary (vocab.json) and BPE merges (merges.txt) files.

encode (string $text, bool $addBos = false): int[]

BPE-encodes a string to token IDs. Optionally prepends the BOS token.

$ids = $tok->encode("Hello world", addBos: true);
// → [1, 22557, 1879]  (example)
encodeBatch (array $texts, bool $addBos = false, int $maxLen = 0): Tensor Tensor [N, maxLen] int64

Encode a batch of strings. Pads to the longest sequence (or $maxLen). Returns an int64 Tensor for batch inference.

decode (array $ids, bool $skipSpecial = true): string

Converts token IDs back to a string. Skips special tokens (BOS, EOS, PAD) unless $skipSpecial = false.

MethodReturnsDescription
vocabSize(): intintTotal vocabulary size
bosId(): intintBeginning-of-sequence token ID
eosId(): intintEnd-of-sequence token ID
padId(): intintPadding token ID
unkId(): intintUnknown token ID
idToStr(int $id): ?stringstringDecode single token to string
strToId(string $s): intintEncode single token (exact match in vocab)
isSpecial(int $id): boolboolWhether token is a special token

ModelConfig

Pml\Inference\ModelConfig is a pure value object describing the model architecture. Passed to InferenceSession::load(). Pre-built configs are available as static factory methods.

Pre-built configs

FactoryParametersNotes
ModelConfig::llama3_8b() 32L · 32H · 8KVH · dModel=4096 · vocab=128k LLaMA-3 8B (Meta). GQA, RoPE 500k, RMSNorm.
ModelConfig::mistral_7b() 32L · 32H · 8KVH · dModel=4096 · vocab=32k Mistral 7B v0.3. Sliding window attention (not yet implemented in C).
ModelConfig::smollm2_135m() 30L · 9H · 3KVH · dModel=576 · vocab=49k SmolLM2-135M (HuggingFace). Fits in ~270 MB fp32.

Constructor properties

PropertyTypeDescription
archintARCH_LLAMA = 0 (currently only supported arch)
vocabSizeintVocabulary size
nLayersintNumber of transformer layers
nHeadsintNumber of Q attention heads
nKvHeadsintNumber of KV heads (GQA: nKvHeads < nHeads)
dModelintHidden state dimension
dFfintFeed-forward intermediate dimension
maxSeqLenintMaximum context length
rmsEpsfloatRMSNorm epsilon (1e-5)
ropeBasefloatRoPE theta base (10000 or 500000 for LLaMA-3)

Generation Examples

Streaming generation

use Pml\Inference\{InferenceSession, Tokenizer, ModelConfig};

$tok     = Tokenizer::fromJson('models/smollm2/tokenizer.json');
$session = InferenceSession::load('models/smollm2/model.safetensors', ModelConfig::smollm2_135m());

$prompt  = "def fibonacci(n):\n";
$ids     = $tok->encode($prompt, addBos: true);

// Stream token by token
foreach ($session->generate($ids, maxNewTokens: 200, temperature: 0.7, topP: 0.9) as $tokenId) {
    if ($tokenId === $tok->eosId()) break;
    echo $tok->decode([$tokenId]);
    flush();
}

Chat interface

$response = $session->chat([
    ['role' => 'system',    'content' => 'You are a helpful assistant.'],
    ['role' => 'user',      'content' => 'What is gradient descent?'],
], maxNewTokens: 256, temperature: 0.8);

echo $response;

INT8-quantized LLM

// Load weights and quantize all Dense layers to INT8
$session = InferenceSession::load('models/llama3-8b.safetensors', ModelConfig::llama3_8b());
// $session internally uses Sequential — call quantize() on its model:
$session->getModel()->quantize(groupSize: 32);
// fp32 weights freed immediately → ~7 GB instead of ~28 GB

foreach ($session->generate($ids, maxNewTokens: 500) as $id) {
    echo $tok->decode([$id]);
}

Performance Notes

Without KV-cache

O(T²) per decode step. Unusable above ~256 tokens context. 1–3 t/s for 7B models.

With KV-cache

O(1) per decode step. 4096 token context is practical. 3–6 t/s for 7B fp32 on 8-core CPU.

KV-cache + INT8

~8–12 t/s for 7B. 7 GB total RAM (4× reduction). Recommended for production.

Next step: fp16

fp16 weights + AVX-512 fp16 FMA would give another ~2× decode speedup and halve weight RAM again.

Thread tuning For decode-heavy workloads, set OPENBLAS_NUM_THREADS=1 and rely on OMP_NUM_THREADS=N for the OpenMP attention/MLP loops. BLAS thrashing on multi-thread decode is counterproductive.