Inference

Inference & LLM

Run LLaMA-style language models directly in PHP. The C inference engine (inference.c) implements Grouped Query Attention (GQA), RoPE, RMSNorm, SwiGLU MLP, and autoregressive decode with a multi-head KV-cache. The KVCache class eliminates O(T²) recompute — each decode step is O(1) in sequence length.

Architecture

PHP call: $session->step($tokenId, $pos) ↓ 1 FFI crossing inference.c: inf_forward() ├─ Embedding lookup (token → dModel vector) ├─ For each layer (nLayers): │ ├─ RMSNorm(x) │ ├─ CausalSelfAttention (GQA): │ │ ├─ Q = x @ Wq K = x @ Wk V = x @ Wv │ │ ├─ RoPE: Q, K rotated by position $pos │ │ ├─ KV-cache: append K,V; attend with all cached K,V │ │ │ └─ Milakov online-softmax (O(head_dim) working mem) │ │ └─ out = attn_out @ Wo │ ├─ Residual: x += out │ ├─ RMSNorm(x) │ ├─ SwiGLU MLP: x2 = silu(x @ Wg) ⊙ (x @ Wu); out = x2 @ Wd │ └─ Residual: x += out └─ Final RMSNorm → logits = x @ Wout [vocabSize]

GQA (Grouped Query Attention) When nKvHeads < nHeads, K/V heads are shared across Q head groups (LLaMA-3, Mistral-style). The C kernel handles the grouping natively.

InferenceSession

The main entry point for LLM inference. Loads weights from a SafeTensors file, sets up the KV-cache, and exposes token-step and generation APIs.

load static (string $safetensorsPath, ModelConfig $config): self

Loads model weights from a SafeTensors file and initialises the C inference context. Weights are mmap'd — no full copy into RAM.

$cfg     = ModelConfig::llama3_8b();
$session = InferenceSession::load('models/llama3-8b.safetensors', $cfg);

loadFile static (string $modelDir): self

Convenience loader: reads $modelDir/config.json for ModelConfig and $modelDir/model.safetensors for weights.

step (int $tokenId, int $pos): Tensor Tensor [vocabSize]

Single-token forward pass with KV-cache. Called in a loop for autoregressive decode. One FFI crossing per token.

forward (array $tokens): Tensor Tensor [vocabSize]

Full prefill forward pass: processes all prompt tokens in one C call (batch attention), returns logits for the last token. Populates the KV-cache.

generate (array $promptIds, int $maxNewTokens = 256, float $temperature = 0.0, float $topP = 0.9): Generator

Autoregressive generation. Yields one token ID per step. Stops at EOS or $maxNewTokens.

foreach ($session->generate($promptIds, maxNewTokens: 512, temperature: 0.8) as $id) {
    echo $tok->decode([$id]);
    flush();
}

generateIds (array $promptIds, int $maxNewTokens = 256, float $temperature = 0.0, float $topP = 0.9): array

Collects all generated token IDs into an array (non-streaming).

chat (array $messages, int $maxNewTokens = 512, float $temperature = 0.0): string

Chat-template generation. Messages format: [['role' => 'user', 'content' => '...'], ...]. Applies LLaMA-3 / ChatML template, generates and decodes response.

Method	Description
`sampleGreedy(Tensor $logits): int`	Argmax sampling — deterministic
`sample(Tensor $logits, float $temp, float $topP): int`	Temperature + top-p nucleus sampling
`resetKv(): void`	Clears KV-cache for a new conversation
`getWeight(string $name): ?Tensor`	Access a named weight tensor by SafeTensors key
`parseConfig(string $jsonPath): ModelConfig`	Parse `config.json` to ModelConfig

KVCache

Pml\KVCache wraps a C MultiKVCache* struct that manages nLayers × nHeads key/value cache slots. Eliminates the O(T²) attention recompute during autoregressive decode.

FFI crossings per decode step: 2 total — one mkvca_append (add K,V for new token) and one mkvca_attend (Milakov online-softmax attention over all cached tokens). OpenMP parallelizes over heads inside C.

__construct (int $nLayers, int $nHeads, int $maxSeqLen, int $headDim)

Allocates the multi-head cache. Memory: nLayers × nHeads × maxSeqLen × 2 × headDim × 4 bytes.

$kv = new KVCache(
    nLayers:   32,
    nHeads:    8,    // nKvHeads for GQA
    maxSeqLen: 4096,
    headDim:   128,
);
printf("Cache memory: %.1f MB\n", $kv->memoryBytes() / 1e6);

Method	Signature	Description
`prefill`	`(int $layerIdx, Tensor $k, Tensor $v): void`	Populate cache with T prompt tokens (K: [T, nH, hd], V: same)
`append`	`(int $layerIdx, Tensor $k, Tensor $v): void`	Append one decode token (K: [nH, hd], V: same)
`attend`	`(int $layerIdx, Tensor $q): Tensor`	Compute attention over all cached K,V for query Q [nH, 1, hd]. Returns [1, nH, hd].
`reset`	`(): void`	Clear all caches (start new generation)
`memoryBytes`	`(): int`	Total RAM used by K+V buffers
`nLayers/nHeads/maxSeqLen/headDim`	—	Construction parameters

Memory estimate

// LLaMA-3 8B: 32 layers, 8 KV-heads, 4096 max context, 128 head-dim
$kv = new KVCache(32, 8, 4096, 128);
// = 32 × 8 × 4096 × 2 × 128 × 4 bytes = 1,073,741,824 bytes = 1.07 GB

// Reduce context to 2048 for lower memory:
$kv = new KVCache(32, 8, 2048, 128);
// = 536 MB

Tokenizer (BPE)

Pml\Inference\Tokenizer wraps the C BPE tokenizer (tokenizer.c). Supports HuggingFace tokenizer.json format and raw vocabulary + merges files. Zero-copy: the PHP Tokenizer holds a C Tokenizer* pointer.

fromJson static (string $path): self

Load from a HuggingFace tokenizer.json file. Compatible with LLaMA, Mistral, Gemma, SmolLM2.

$tok = Tokenizer::fromJson('models/llama3/tokenizer.json');

fromFiles static (string $vocabPath, string $mergesPath): self

Load from separate vocabulary (vocab.json) and BPE merges (merges.txt) files.

encode (string $text, bool $addBos = false): int[]

BPE-encodes a string to token IDs. Optionally prepends the BOS token.

$ids = $tok->encode("Hello world", addBos: true);
// → [1, 22557, 1879]  (example)

encodeBatch (array $texts, bool $addBos = false, int $maxLen = 0): Tensor Tensor [N, maxLen] int64

Encode a batch of strings. Pads to the longest sequence (or $maxLen). Returns an int64 Tensor for batch inference.

decode (array $ids, bool $skipSpecial = true): string

Converts token IDs back to a string. Skips special tokens (BOS, EOS, PAD) unless $skipSpecial = false.

Method	Returns	Description
`vocabSize(): int`	int	Total vocabulary size
`bosId(): int`	int	Beginning-of-sequence token ID
`eosId(): int`	int	End-of-sequence token ID
`padId(): int`	int	Padding token ID
`unkId(): int`	int	Unknown token ID
`idToStr(int $id): ?string`	string	Decode single token to string
`strToId(string $s): int`	int	Encode single token (exact match in vocab)
`isSpecial(int $id): bool`	bool	Whether token is a special token

ModelConfig

Pml\Inference\ModelConfig is a pure value object describing the model architecture. Passed to InferenceSession::load(). Pre-built configs are available as static factory methods.

Pre-built configs

Factory	Parameters	Notes
`ModelConfig::llama3_8b()`	32L · 32H · 8KVH · dModel=4096 · vocab=128k	LLaMA-3 8B (Meta). GQA, RoPE 500k, RMSNorm.
`ModelConfig::mistral_7b()`	32L · 32H · 8KVH · dModel=4096 · vocab=32k	Mistral 7B v0.3. Sliding window attention (not yet implemented in C).
`ModelConfig::smollm2_135m()`	30L · 9H · 3KVH · dModel=576 · vocab=49k	SmolLM2-135M (HuggingFace). Fits in ~270 MB fp32.

Constructor properties

Property	Type	Description
`arch`	int	`ARCH_LLAMA` = 0 (currently only supported arch)
`vocabSize`	int	Vocabulary size
`nLayers`	int	Number of transformer layers
`nHeads`	int	Number of Q attention heads
`nKvHeads`	int	Number of KV heads (GQA: nKvHeads < nHeads)
`dModel`	int	Hidden state dimension
`dFf`	int	Feed-forward intermediate dimension
`maxSeqLen`	int	Maximum context length
`rmsEps`	float	RMSNorm epsilon (1e-5)
`ropeBase`	float	RoPE theta base (10000 or 500000 for LLaMA-3)

Generation Examples

Streaming generation

use Pml\Inference\{InferenceSession, Tokenizer, ModelConfig};

$tok     = Tokenizer::fromJson('models/smollm2/tokenizer.json');
$session = InferenceSession::load('models/smollm2/model.safetensors', ModelConfig::smollm2_135m());

$prompt  = "def fibonacci(n):\n";
$ids     = $tok->encode($prompt, addBos: true);

// Stream token by token
foreach ($session->generate($ids, maxNewTokens: 200, temperature: 0.7, topP: 0.9) as $tokenId) {
    if ($tokenId === $tok->eosId()) break;
    echo $tok->decode([$tokenId]);
    flush();
}

Chat interface

$response = $session->chat([
    ['role' => 'system',    'content' => 'You are a helpful assistant.'],
    ['role' => 'user',      'content' => 'What is gradient descent?'],
], maxNewTokens: 256, temperature: 0.8);

echo $response;

INT8-quantized LLM

// Load weights and quantize all Dense layers to INT8
$session = InferenceSession::load('models/llama3-8b.safetensors', ModelConfig::llama3_8b());
// $session internally uses Sequential — call quantize() on its model:
$session->getModel()->quantize(groupSize: 32);
// fp32 weights freed immediately → ~7 GB instead of ~28 GB

foreach ($session->generate($ids, maxNewTokens: 500) as $id) {
    echo $tok->decode([$id]);
}

Performance Notes

Without KV-cache

O(T²) per decode step. Unusable above ~256 tokens context. 1–3 t/s for 7B models.

With KV-cache

O(1) per decode step. 4096 token context is practical. 3–6 t/s for 7B fp32 on 8-core CPU.

KV-cache + INT8

~8–12 t/s for 7B. 7 GB total RAM (4× reduction). Recommended for production.

Next step: fp16

fp16 weights + AVX-512 fp16 FMA would give another ~2× decode speedup and halve weight RAM again.

Thread tuning For decode-heavy workloads, set OPENBLAS_NUM_THREADS=1 and rely on OMP_NUM_THREADS=N for the OpenMP attention/MLP loops. BLAS thrashing on multi-thread decode is counterproductive.

Inference & LLM

On this page

Architecture

InferenceSession

KVCache

Memory estimate

Tokenizer (BPE)

ModelConfig

Pre-built configs

Constructor properties

Generation Examples

Streaming generation

Chat interface

INT8-quantized LLM

Performance Notes

Without KV-cache

With KV-cache

KV-cache + INT8

Next step: fp16