Inference & LLM
Run LLaMA-style language models directly in PHP. The C inference engine
(inference.c) implements Grouped Query Attention (GQA), RoPE,
RMSNorm, SwiGLU MLP, and autoregressive decode with a multi-head KV-cache.
The KVCache class eliminates O(T²) recompute — each decode step
is O(1) in sequence length.
On this page
Architecture
nKvHeads < nHeads, K/V heads are shared across Q head groups
(LLaMA-3, Mistral-style). The C kernel handles the grouping natively.
InferenceSession
The main entry point for LLM inference. Loads weights from a SafeTensors file, sets up the KV-cache, and exposes token-step and generation APIs.
Loads model weights from a SafeTensors file and initialises the C inference context. Weights are mmap'd — no full copy into RAM.
$cfg = ModelConfig::llama3_8b();
$session = InferenceSession::load('models/llama3-8b.safetensors', $cfg);
Convenience loader: reads $modelDir/config.json for ModelConfig and $modelDir/model.safetensors for weights.
Single-token forward pass with KV-cache. Called in a loop for autoregressive decode. One FFI crossing per token.
Full prefill forward pass: processes all prompt tokens in one C call (batch attention), returns logits for the last token. Populates the KV-cache.
Autoregressive generation. Yields one token ID per step. Stops at EOS or $maxNewTokens.
foreach ($session->generate($promptIds, maxNewTokens: 512, temperature: 0.8) as $id) {
echo $tok->decode([$id]);
flush();
}
Collects all generated token IDs into an array (non-streaming).
Chat-template generation. Messages format: [['role' => 'user', 'content' => '...'], ...]. Applies LLaMA-3 / ChatML template, generates and decodes response.
| Method | Description |
|---|---|
sampleGreedy(Tensor $logits): int | Argmax sampling — deterministic |
sample(Tensor $logits, float $temp, float $topP): int | Temperature + top-p nucleus sampling |
resetKv(): void | Clears KV-cache for a new conversation |
getWeight(string $name): ?Tensor | Access a named weight tensor by SafeTensors key |
parseConfig(string $jsonPath): ModelConfig | Parse config.json to ModelConfig |
KVCache
Pml\KVCache wraps a C MultiKVCache* struct that manages
nLayers × nHeads key/value cache slots.
Eliminates the O(T²) attention recompute during autoregressive decode.
mkvca_append (add K,V for new token)
and one mkvca_attend (Milakov online-softmax attention over all cached tokens).
OpenMP parallelizes over heads inside C.
Allocates the multi-head cache. Memory: nLayers × nHeads × maxSeqLen × 2 × headDim × 4 bytes.
$kv = new KVCache(
nLayers: 32,
nHeads: 8, // nKvHeads for GQA
maxSeqLen: 4096,
headDim: 128,
);
printf("Cache memory: %.1f MB\n", $kv->memoryBytes() / 1e6);
| Method | Signature | Description |
|---|---|---|
prefill | (int $layerIdx, Tensor $k, Tensor $v): void | Populate cache with T prompt tokens (K: [T, nH, hd], V: same) |
append | (int $layerIdx, Tensor $k, Tensor $v): void | Append one decode token (K: [nH, hd], V: same) |
attend | (int $layerIdx, Tensor $q): Tensor | Compute attention over all cached K,V for query Q [nH, 1, hd]. Returns [1, nH, hd]. |
reset | (): void | Clear all caches (start new generation) |
memoryBytes | (): int | Total RAM used by K+V buffers |
nLayers/nHeads/maxSeqLen/headDim | — | Construction parameters |
Memory estimate
// LLaMA-3 8B: 32 layers, 8 KV-heads, 4096 max context, 128 head-dim
$kv = new KVCache(32, 8, 4096, 128);
// = 32 × 8 × 4096 × 2 × 128 × 4 bytes = 1,073,741,824 bytes = 1.07 GB
// Reduce context to 2048 for lower memory:
$kv = new KVCache(32, 8, 2048, 128);
// = 536 MB
Tokenizer (BPE)
Pml\Inference\Tokenizer wraps the C BPE tokenizer (tokenizer.c).
Supports HuggingFace tokenizer.json format and raw vocabulary + merges files.
Zero-copy: the PHP Tokenizer holds a C Tokenizer* pointer.
Load from a HuggingFace tokenizer.json file. Compatible with LLaMA, Mistral, Gemma, SmolLM2.
$tok = Tokenizer::fromJson('models/llama3/tokenizer.json');
Load from separate vocabulary (vocab.json) and BPE merges (merges.txt) files.
BPE-encodes a string to token IDs. Optionally prepends the BOS token.
$ids = $tok->encode("Hello world", addBos: true);
// → [1, 22557, 1879] (example)
Encode a batch of strings. Pads to the longest sequence (or $maxLen). Returns an int64 Tensor for batch inference.
Converts token IDs back to a string. Skips special tokens (BOS, EOS, PAD) unless $skipSpecial = false.
| Method | Returns | Description |
|---|---|---|
vocabSize(): int | int | Total vocabulary size |
bosId(): int | int | Beginning-of-sequence token ID |
eosId(): int | int | End-of-sequence token ID |
padId(): int | int | Padding token ID |
unkId(): int | int | Unknown token ID |
idToStr(int $id): ?string | string | Decode single token to string |
strToId(string $s): int | int | Encode single token (exact match in vocab) |
isSpecial(int $id): bool | bool | Whether token is a special token |
ModelConfig
Pml\Inference\ModelConfig is a pure value object describing the model architecture.
Passed to InferenceSession::load(). Pre-built configs are available as static factory methods.
Pre-built configs
| Factory | Parameters | Notes |
|---|---|---|
ModelConfig::llama3_8b() |
32L · 32H · 8KVH · dModel=4096 · vocab=128k | LLaMA-3 8B (Meta). GQA, RoPE 500k, RMSNorm. |
ModelConfig::mistral_7b() |
32L · 32H · 8KVH · dModel=4096 · vocab=32k | Mistral 7B v0.3. Sliding window attention (not yet implemented in C). |
ModelConfig::smollm2_135m() |
30L · 9H · 3KVH · dModel=576 · vocab=49k | SmolLM2-135M (HuggingFace). Fits in ~270 MB fp32. |
Constructor properties
| Property | Type | Description |
|---|---|---|
arch | int | ARCH_LLAMA = 0 (currently only supported arch) |
vocabSize | int | Vocabulary size |
nLayers | int | Number of transformer layers |
nHeads | int | Number of Q attention heads |
nKvHeads | int | Number of KV heads (GQA: nKvHeads < nHeads) |
dModel | int | Hidden state dimension |
dFf | int | Feed-forward intermediate dimension |
maxSeqLen | int | Maximum context length |
rmsEps | float | RMSNorm epsilon (1e-5) |
ropeBase | float | RoPE theta base (10000 or 500000 for LLaMA-3) |
Generation Examples
Streaming generation
use Pml\Inference\{InferenceSession, Tokenizer, ModelConfig};
$tok = Tokenizer::fromJson('models/smollm2/tokenizer.json');
$session = InferenceSession::load('models/smollm2/model.safetensors', ModelConfig::smollm2_135m());
$prompt = "def fibonacci(n):\n";
$ids = $tok->encode($prompt, addBos: true);
// Stream token by token
foreach ($session->generate($ids, maxNewTokens: 200, temperature: 0.7, topP: 0.9) as $tokenId) {
if ($tokenId === $tok->eosId()) break;
echo $tok->decode([$tokenId]);
flush();
}
Chat interface
$response = $session->chat([
['role' => 'system', 'content' => 'You are a helpful assistant.'],
['role' => 'user', 'content' => 'What is gradient descent?'],
], maxNewTokens: 256, temperature: 0.8);
echo $response;
INT8-quantized LLM
// Load weights and quantize all Dense layers to INT8
$session = InferenceSession::load('models/llama3-8b.safetensors', ModelConfig::llama3_8b());
// $session internally uses Sequential — call quantize() on its model:
$session->getModel()->quantize(groupSize: 32);
// fp32 weights freed immediately → ~7 GB instead of ~28 GB
foreach ($session->generate($ids, maxNewTokens: 500) as $id) {
echo $tok->decode([$id]);
}
Performance Notes
Without KV-cache
O(T²) per decode step. Unusable above ~256 tokens context. 1–3 t/s for 7B models.
With KV-cache
O(1) per decode step. 4096 token context is practical. 3–6 t/s for 7B fp32 on 8-core CPU.
KV-cache + INT8
~8–12 t/s for 7B. 7 GB total RAM (4× reduction). Recommended for production.
Next step: fp16
fp16 weights + AVX-512 fp16 FMA would give another ~2× decode speedup and halve weight RAM again.
OPENBLAS_NUM_THREADS=1 and rely on
OMP_NUM_THREADS=N for the OpenMP attention/MLP loops.
BLAS thrashing on multi-thread decode is counterproductive.