Universal GPU Support via Vulkan
The planned GPU backend replaces OpenBLAS + AVX2 with Vulkan compute shaders. No CUDA. No ROCm. No Metal. One C backend for every GPU that supports Vulkan — NVIDIA, AMD, Intel, Mali, Adreno, Apple M-series (via MoltenVK).
Why Vulkan
Cross-vendor GPU support
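CUDA only runs on NVIDIA. ROCm only on AMD. Metal only on Apple. Vulkan covers all three (natively on NVIDIA and AMD, through MoltenVK on Apple), plus Intel, Qualcomm Adreno, and ARM Mali. Browsers expose WebGPU rather than Vulkan, so a WebAssembly build would need a separate WebGPU backend.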
Stable ABI
Vulkan exposes a stable C ABI via vulkan.h.
No Python runtime, no driver SDK installation required.
Ships as a single libgpu.so loaded by TensorEngine.
FFI-compatible
The existing FFI boundary stays identical.
PHP calls the same functions (tensor_matmul, tensor_softmax).
The C layer decides which backend executes the kernel.
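One way the C layer could choose a backend behind a fixed FFI signature is a table of function pointers. The sketch below is illustrative only: `tensor_backend`, `tensor_set_backend`, and `demo_tensor_matmul` are hypothetical names, and the argument list is an assumption, not the real `tensor_matmul` signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend table: one function pointer per tensor operator. */
typedef struct {
    const char *name;
    void (*matmul)(const float *a, const float *b, float *c,
                   size_t m, size_t n, size_t k);
} tensor_backend;

/* Plain CPU path, standing in for the OpenBLAS call. Row-major c = a * b. */
static void cpu_matmul(const float *a, const float *b, float *c,
                       size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

static const tensor_backend CPU_BACKEND = { "cpu", cpu_matmul };

/* Active backend; a Vulkan backend would install its own table here. */
static const tensor_backend *g_backend = &CPU_BACKEND;

void tensor_set_backend(const tensor_backend *b) {
    g_backend = b ? b : &CPU_BACKEND;   /* NULL falls back to CPU */
}

/* FFI-visible entry point: same signature whichever backend is active. */
void demo_tensor_matmul(const float *a, const float *b, float *c,
                        size_t m, size_t n, size_t k) {
    assert(g_backend != NULL && g_backend->matmul != NULL);
    g_backend->matmul(a, b, c, m, n, k);
}
```

With this shape, PHP keeps calling one symbol through FFI, and swapping the table is the only change needed to redirect the work.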
Expected speedup
Decoding with a 7B model: CPU ~1–3 tokens per second (t/s), GPU (RTX 4090) ~80–120 t/s. Dense matmul: 30–100× faster than AVX2 OpenBLAS on a mid-range GPU.
Proposed Architecture
GPU Memory Model
Tensors that are moved to the GPU will have two representations: a host copy in CPU RAM and a device copy in GPU memory.
Key design decisions:
- Lazy transfer — data is uploaded to GPU only when first needed by a GPU kernel. PHP code never triggers a transfer explicitly.
- Unified weight memory for LLMs — model weights reside permanently on GPU after the first forward pass. No round-trip per token.
- Activation tensors — intermediate activations are created on GPU, freed after the backward pass. Never touch CPU RAM.
- Gradient synchronization — optimizer step runs on GPU. Updated weights stay on GPU. Only loss scalar and metrics cross the PCIe bus.
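The lazy-transfer rule can be sketched as a tensor that tracks which copy is stale and only copies on demand. This is a minimal illustration under stated assumptions: `lazy_tensor` and its functions are hypothetical names, and the "device" buffer is ordinary heap memory standing in for a VkBuffer so the sketch runs without a GPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical dual-representation tensor. */
typedef struct {
    float *host;        /* CPU copy, owned by the caller */
    float *device;      /* simulated GPU copy, NULL until first GPU use */
    size_t count;       /* number of floats */
    bool host_dirty;    /* host holds newer data than device */
    bool device_dirty;  /* device holds newer data than host */
    int uploads;        /* transfer counter, for illustration only */
} lazy_tensor;

lazy_tensor lazy_tensor_from_host(float *data, size_t count) {
    lazy_tensor t = { data, NULL, count, true, false, 0 };
    return t;
}

/* Called by a GPU kernel wrapper just before it needs the data. */
float *lazy_tensor_device_ptr(lazy_tensor *t) {
    if (t->device == NULL) {
        t->device = malloc(t->count * sizeof(float));
        if (t->device == NULL) return NULL;
    }
    if (t->host_dirty) {                 /* upload only when stale */
        memcpy(t->device, t->host, t->count * sizeof(float));
        t->host_dirty = false;
        t->uploads++;
    }
    return t->device;
}

/* Called when CPU code reads the tensor; downloads only if needed. */
float *lazy_tensor_host_ptr(lazy_tensor *t) {
    if (t->device_dirty) {
        memcpy(t->host, t->device, t->count * sizeof(float));
        t->device_dirty = false;
    }
    return t->host;
}

void lazy_tensor_free(lazy_tensor *t) {
    free(t->device);
    t->device = NULL;
}
```

Weights that are never modified on the CPU would be uploaded once and then reused, which is the behaviour the "no round-trip per token" item describes.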
SPIR-V Compute Kernels
Each C tensor operator (tensor_matmul, tensor_softmax, etc.) will have a corresponding GLSL compute shader that compiles to SPIR-V via glslc.
SPIR-V binaries are embedded into libgpu.so at build time.
Example: GEMM kernel (matmul.comp)
#version 450
// Tiled GEMM: each workgroup handles a 16×16 output tile
layout(local_size_x = 16, local_size_y = 16) in;
layout(set = 0, binding = 0) readonly buffer MatA { float A[]; };
layout(set = 0, binding = 1) readonly buffer MatB { float B[]; };
// C is read for the beta term, so it must not be declared writeonly
layout(set = 0, binding = 2) buffer MatC { float C[]; };
layout(push_constant) uniform PushConstants {
    uint M; uint N; uint K;
    float alpha; float beta;
};
shared float tileA[16][16];
shared float tileB[16][16];
void main() {
    uint row = gl_GlobalInvocationID.x;
    uint col = gl_GlobalInvocationID.y;
    uint lx = gl_LocalInvocationID.x;
    uint ly = gl_LocalInvocationID.y;
    float acc = 0.0;
    for (uint t = 0; t < (K + 15) / 16; t++) {
        // Cooperative load of one tile of A and one tile of B, zero-padded at the edges
        tileA[lx][ly] = (row < M && t*16+ly < K) ? A[row * K + t*16 + ly] : 0.0;
        tileB[lx][ly] = (t*16+lx < K && col < N) ? B[(t*16+lx) * N + col] : 0.0;
        barrier();  // tiles are fully written before any invocation reads them
        for (uint k = 0; k < 16; k++)
            acc += tileA[lx][k] * tileB[k][ly];
        barrier();  // all reads finish before the next iteration overwrites the tiles
    }
    if (row < M && col < N)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
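To validate the shader's output, a CPU reference that computes the same expression, C = alpha·A·B + beta·C on row-major buffers, is useful. The following is a sketch for testing purposes only; `gemm_reference` and `max_abs_diff` are hypothetical helper names, not part of the existing C layer.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16  /* matches local_size_x / local_size_y in the shader */

/* Row-major C = alpha * A(MxK) * B(KxN) + beta * C(MxN). */
void gemm_reference(const float *A, const float *B, float *C,
                    size_t M, size_t N, size_t K,
                    float alpha, float beta) {
    for (size_t row = 0; row < M; row++) {
        for (size_t col = 0; col < N; col++) {
            float acc = 0.0f;
            /* Walk K in tiles, in the same order as the shader's outer loop. */
            for (size_t t = 0; t < (K + TILE - 1) / TILE; t++) {
                for (size_t k = 0; k < TILE; k++) {
                    size_t p = t * TILE + k;
                    if (p < K)   /* the shader pads this region with zeros */
                        acc += A[row * K + p] * B[p * N + col];
                }
            }
            C[row * N + col] = alpha * acc + beta * C[row * N + col];
        }
    }
}

/* Largest absolute element-wise difference between two buffers. */
float max_abs_diff(const float *x, const float *y, size_t n) {
    float worst = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

A GPU result would then be compared against this buffer with a small tolerance, since the two paths need not round identically.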
Priority kernel list
| Kernel | Maps to C function | Priority |
|---|---|---|
| matmul.spv | tensor_matmul | Critical |
| linear.spv | tensor_linear | Critical |
| qlinear.spv | qweight_linear | Critical |
| attention.spv | mkvca_attend | Critical |
| softmax.spv | tensor_softmax | High |
| layer_norm.spv | tensor_layer_norm | High |
| adam_step.spv | tensor_fused_adam_step | High |
| adamw_step.spv | tensor_fused_adamw_step | High |
| elementwise.spv | add/mul/relu/sigmoid/etc. | Medium |
| reduce.spv | sum/mean/max along axis | Medium |
| conv2d.spv | tensor_conv2d | Medium |
| embedding.spv | gather/scatter | Medium |
Dispatch Strategy
Vulkan compute shaders are dispatched via vkCmdDispatch().
The workgroup size is tuned per operator:
| Operator class | Workgroup layout | Tiling / strategy |
|---|---|---|
| GEMM | local_size = (16, 16, 1) | 16×16 output tiles, shared-memory tiling for L1 reuse |
| Attention (per head) | local_size = (32, 1, 1) | 32 threads per row of Q; online softmax reduction in shared memory |
| Element-wise | local_size = (256, 1, 1) | 1 invocation per element, vectorized 4-float loads |
| Reduction | local_size = (256, 1, 1) | Tree-reduction in shared memory, two-pass for large arrays |
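The Reduction row describes a tree reduction followed by a second pass over the per-workgroup partial sums. The C sketch below simulates that scheme on the CPU to show the data flow; it assumes a sum reduction, and `workgroup_sum` and `two_pass_sum` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define WG 256  /* workgroup size from the Reduction row */

/* Simulates one workgroup: load up to WG values into "shared memory",
 * then halve the active stride each step, as a GPU tree reduction does. */
static float workgroup_sum(const float *in, size_t n) {
    float shared[WG];
    for (size_t i = 0; i < WG; i++)
        shared[i] = i < n ? in[i] : 0.0f;     /* pad the tail with zeros */
    for (size_t stride = WG / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; i++)   /* runs in parallel on a GPU */
            shared[i] += shared[i + stride];
    return shared[0];
}

/* Sums n floats. scratch must hold at least ceil(n / WG) floats. */
float two_pass_sum(const float *in, size_t n, float *scratch) {
    /* Pass 1: one partial sum per workgroup-sized chunk. */
    size_t groups = (n + WG - 1) / WG;
    for (size_t g = 0; g < groups; g++) {
        size_t start = g * WG;
        size_t len = n - start < WG ? n - start : WG;
        scratch[g] = workgroup_sum(in + start, len);
    }
    /* Pass 2: reduce the partials, repeating while more than one remains. */
    while (groups > 1) {
        size_t next = (groups + WG - 1) / WG;
        for (size_t g = 0; g < next; g++) {
            size_t start = g * WG;
            size_t len = groups - start < WG ? groups - start : WG;
            scratch[g] = workgroup_sum(scratch + start, len);
        }
        groups = next;
    }
    return groups == 1 ? scratch[0] : 0.0f;
}
```

For inputs up to 256 × 256 elements the loop in pass 2 runs once, which corresponds to the two-pass case named in the table.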
For batched GEMM (LLM prefill), workgroups are dispatched in a 3-D grid:
dispatchX = ceil(M/16), dispatchY = ceil(N/16), dispatchZ = batch.
This maps batch → Z and keeps the tile loop identical to single-batch.
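The grid arithmetic is a pair of ceiling divisions. A small helper, sketched here with the hypothetical names `dispatch_dims` and `gemm_dispatch_dims`, could compute the three group counts that are then passed to vkCmdDispatch():

```c
#include <assert.h>
#include <stdint.h>

/* Workgroup counts for one vkCmdDispatch() call. */
typedef struct { uint32_t x, y, z; } dispatch_dims;

/* Ceiling division without the overflow risk of (a + b - 1) / b. */
static uint32_t ceil_div(uint32_t a, uint32_t b) {
    return a / b + (a % b != 0 ? 1u : 0u);
}

/* 16x16 output tiles in X/Y, one Z layer per batch entry. */
dispatch_dims gemm_dispatch_dims(uint32_t M, uint32_t N, uint32_t batch) {
    dispatch_dims d = { ceil_div(M, 16), ceil_div(N, 16), batch };
    return d;
}
```

For M = 100 and N = 33 this yields a 7 × 3 grid, so the last tile in each dimension is only partly filled; the bounds checks in the shader handle that remainder.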
PHP API — Unchanged
The entire PHP API is 100% backward-compatible with the CPU backend.
The GPU backend is selected at TensorEngine initialization time by an environment variable or config flag.
All Tensor, Sequential, InferenceSession, etc. calls remain identical.
// No PHP changes required — backend selected by env var
// PML_BACKEND=vulkan php train.php
// PML_BACKEND=cpu php train.php (default)
$model->train($dataset, epochs: 20); // same call — GPU if available
$preds = $model->predict($test); // same call
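On the C side, reading the PML_BACKEND variable at initialization could look like the sketch below. The enum and function names are hypothetical, and falling back to the CPU for unset or unrecognized values is an assumption made here, in line with `cpu` being the default.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { BACKEND_CPU = 0, BACKEND_VULKAN = 1 } backend_kind;

/* Parses a PML_BACKEND value. NULL, empty, or unknown values
 * fall back to the CPU backend (assumed behaviour). */
backend_kind backend_from_string(const char *value) {
    if (value != NULL && strcmp(value, "vulkan") == 0)
        return BACKEND_VULKAN;
    return BACKEND_CPU;
}

/* Reads the environment once, e.g. during TensorEngine initialization. */
backend_kind backend_from_env(void) {
    return backend_from_string(getenv("PML_BACKEND"));
}
```

A real initializer would also need to drop back to the CPU when `vulkan` is requested but no usable device is found, which is what "GPU if available" implies.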
Optionally, explicit device placement will be exposed:
// Future API (not yet implemented)
Tensor::setDefaultDevice('gpu');
$t = Tensor::randomNormal([4096, 4096]); // allocated on GPU VRAM
$t->to('cpu'); // explicit download
Implementation Phases
Phase 1 — Vulkan foundation + GEMM (3–4 weeks)
- Add gpu_backend.c: VkInstance, VkPhysicalDevice selection, VkDevice, compute queue
- GPU tensor allocation via VkBuffer + VkDeviceMemory (host-visible + device-local)
- Lazy CPU→GPU upload and GPU→CPU download
- SPIR-V GEMM kernel with shared-memory tiling (16×16)
- Route tensor_matmul through GPU when both tensors are on-device
- Benchmark: validate correctness vs OpenBLAS, measure speedup
Phase 2 — LLM inference kernels (2–3 weeks)
- GPU attention: mkvca_attend → Vulkan online-softmax attention kernel
- GPU layer-norm, RMSNorm, SwiGLU (combined activation + gate MLP)
- GPU RoPE encoding
- GPU INT8 quantized linear (qweight_linear)
- End-to-end benchmark: LLaMA-3 8B token decode t/s on RTX 3090 / 4090
Phase 3 — Training kernels (2–3 weeks)
- GPU Adam / AdamW fused step kernel
- GPU softmax cross-entropy backward
- GPU Conv2D forward + backward
- GPU BatchNorm forward + backward
- End-to-end benchmark: Sequential::train() on MNIST/CIFAR-10
Phase 4 — Platform breadth (4–6 weeks)
- MoltenVK integration for Apple M-series (macOS / iOS)
- fp16 kernel variants (VK_KHR_shader_float16_int8 for arithmetic, VK_KHR_16bit_storage for buffers) to halve memory use and roughly double throughput
- Multi-GPU sharding via VkQueue per device
- Memory pool: VkDeviceMemory slab allocator to reduce allocation overhead
- Async GPU-CPU overlap: prefetch next batch to GPU while current batch computes
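The memory-pool item in Phase 4 amounts to sub-allocating many tensors from one large VkDeviceMemory block instead of calling the driver for each. The sketch below shows only the offset bookkeeping, using a simple bump-style scheme rather than a full slab allocator with free lists; `mem_pool` and its functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tracks sub-allocations inside one large device memory block. */
typedef struct {
    size_t capacity;  /* size of the backing block in bytes */
    size_t offset;    /* first free byte */
} mem_pool;

/* Returns the byte offset of the new sub-allocation, or SIZE_MAX if the
 * pool is full. alignment must be a power of two; with Vulkan it would
 * come from VkMemoryRequirements. */
size_t pool_alloc(mem_pool *p, size_t size, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t aligned = (p->offset + alignment - 1) & ~(alignment - 1);
    if (aligned > p->capacity || size > p->capacity - aligned)
        return SIZE_MAX;
    p->offset = aligned + size;
    return aligned;
}

/* Releases everything at once, e.g. after a backward pass. */
void pool_reset(mem_pool *p) {
    p->offset = 0;
}
```

Resetting the whole pool in one step suits activation tensors, which the memory model frees together after the backward pass; long-lived weights would need a separate pool or a proper free list.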
Expected Performance
| Operation | CPU (8-core AVX2) | GPU (RTX 3080, Vulkan fp32) | GPU (RTX 4090, Vulkan fp16) |
|---|---|---|---|
| Dense 4096×4096 GEMM | ~18 ms | ~0.6 ms | ~0.15 ms |
| LLaMA-3 8B decode (1 token) | ~280 ms (3.6 t/s) | ~12 ms (83 t/s) | ~5 ms (200 t/s) |
| Sequential train 1 epoch MNIST (60k, bs=256) | ~4.2 s | ~0.3 s | ~0.12 s |
| Conv2D 224×224 ResNet-50 fwd | ~120 ms | ~4 ms | ~1.5 ms |
Estimates based on operator FLOP counts and hardware peak throughput (TFLOPS). Actual results depend on memory bandwidth, kernel launch overhead, and occupancy.
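A FLOP-count estimate of this kind can be reproduced with two small functions. The sketch assumes the usual 2·M·N·K operation count for a dense GEMM and treats peak throughput as an input; the function names are hypothetical, and the result is a lower bound that ignores memory bandwidth and launch overhead.

```c
#include <assert.h>

/* Floating-point operations in a dense MxK by KxN matrix product:
 * one multiply and one add per inner-product term. */
double gemm_flops(double M, double N, double K) {
    return 2.0 * M * N * K;
}

/* Lower bound on run time in milliseconds at a given peak throughput.
 * peak_tflops is supplied by the caller, e.g. from a vendor spec sheet. */
double min_time_ms(double flops, double peak_tflops) {
    assert(peak_tflops > 0.0);
    return flops / (peak_tflops * 1e12) * 1e3;
}
```

For a 4096×4096×4096 GEMM the count is about 1.37 × 10^11 FLOPs, so a hypothetical device sustaining 10 TFLOPS could not finish in less than roughly 13.7 ms.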
Current Limitations to Resolve First
Before the Vulkan backend can be integrated, these CPU-side items should be addressed:
| Issue | Impact on GPU backend | Priority |
|---|---|---|
| fp16 storage not implemented | GPU backend will want to keep weights in fp16 to halve VRAM. Current TensorC only stores fp32. | High |
| tensor.c is a 1500-line monolith | Must split into modules (tensor_math.c, tensor_nn.c, tensor_io.c) before adding a parallel GPU dispatch path. | Medium |
| No mmap weight loading for LLM weights | GPU backend needs vkMapMemory zero-copy path; currently SafeTensors does a full memcpy. | Medium |
| LSTM/attention not in single C call | GPU dispatch requires all LSTM state in one kernel; currently split over multiple PHP-driven steps. | Low-medium |
| Flash Attention not implemented | Standard attention allocates O(T²) scores. GPU backend should use Flash Attention 2 / 3 to keep memory O(T). | Medium |