Roadmap

Universal GPU Support via Vulkan

The planned GPU backend moves the work currently done by OpenBLAS + AVX2 onto Vulkan compute shaders, with the CPU path kept as the default fallback. No CUDA. No ROCm. No Metal. One C backend for every GPU that supports Vulkan: NVIDIA, AMD, Intel, Mali, Adreno, and Apple M-series (via MoltenVK).

Why Vulkan

Cross-vendor GPU support

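CUDA runs only on NVIDIA, ROCm only on AMD, and Metal only on Apple. Vulkan covers all three (Apple through MoltenVK), plus Intel, Qualcomm Adreno, and ARM Mali. Browser targets are a separate matter: browsers expose WebGPU rather than Vulkan, so the SPIR-V kernels would not run there unchanged.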

Stable ABI

Vulkan exposes a stable C ABI via vulkan.h. The backend needs no Python runtime and no vendor SDK installation, and it ships as a single libgpu.so loaded by TensorEngine.

FFI-compatible

The existing FFI boundary stays identical. PHP calls the same functions (tensor_matmul, tensor_softmax). The C layer decides which backend executes the kernel.
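One way the C layer could choose a backend behind a fixed entry point is a table of function pointers. The sketch below is illustrative only: `pml_backend`, `pml_set_backend`, and `tensor_matmul_example` are hypothetical names, and the real `tensor_matmul` signature is not shown in this document, so the parameter list here is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical backend table: one function pointer per hot-path operator. */
typedef struct {
    const char *name;
    void (*matmul)(const float *a, const float *b, float *c,
                   size_t m, size_t n, size_t k);
} pml_backend;

/* Plain CPU fallback: C[m x n] = A[m x k] * B[k x n], row-major. */
static void cpu_matmul(const float *a, const float *b, float *c,
                       size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

static const pml_backend cpu_backend = { "cpu", cpu_matmul };
static const pml_backend *active_backend = &cpu_backend;

/* A GPU backend would install its own table here at init time. */
void pml_set_backend(const pml_backend *backend) {
    assert(backend != NULL && backend->matmul != NULL);
    active_backend = backend;
}

/* The FFI-visible entry point keeps one signature regardless of backend. */
void tensor_matmul_example(const float *a, const float *b, float *c,
                           size_t m, size_t n, size_t k) {
    active_backend->matmul(a, b, c, m, n, k);
}
```

Because PHP only ever sees the entry point, swapping the table changes where the kernel runs without touching the FFI declarations.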

Expected speedup

Decoding with a 7B model: CPU ~1–3 tokens/s, GPU (RTX 4090) ~80–120 tokens/s. Dense matmul: 30–100× faster than AVX2 OpenBLAS on a mid-range GPU.

Proposed Architecture

PHP (unchanged)
 └─ Tensor::matmul(), Sequential::train(), etc.
     │
     │ FFI — same signatures as today
     ▼
src/Lib/gpu_backend.c (new file)
 ├─ gpu_init()          — VkInstance, VkDevice, VkQueue
 ├─ gpu_tensor_alloc()  — VkBuffer + VkDeviceMemory
 ├─ gpu_upload()        — CPU → GPU (VkCommandBuffer copy)
 ├─ gpu_download()      — GPU → CPU (sync download)
 └─ gpu_dispatch()      — load SPIR-V, vkCmdDispatch

src/Lib/kernels/ (GLSL compute shaders → SPIR-V)
 ├─ matmul.comp      → matmul.spv
 ├─ softmax.comp     → softmax.spv
 ├─ relu.comp        → relu.spv
 ├─ layer_norm.comp  → layer_norm.spv
 ├─ attention.comp   → attention.spv
 ├─ adam_step.comp   → adam_step.spv
 └─ ... (one .spv per hot-path operator)

src/Lib/tensor.c (modified)
 ├─ #ifdef PML_GPU_VULKAN
 │      call gpu_backend.c functions
 │  #else
 │      existing AVX2 / OpenBLAS implementation
 │  #endif

GPU Memory Model

Tensors that are moved to the GPU will have two representations:

TensorC struct (C)

float*         data_cpu;   // host-side buffer (nullable when on GPU)
VkBuffer       vk_buffer;  // device buffer handle
VkDeviceMemory vk_memory;  // device memory
uint8_t        on_gpu;     // 1 if resident on GPU
uint32_t       dirty_cpu;  // 1 if CPU copy is stale (needs download)

Key design decisions:

Lazy transfer — data is uploaded to GPU only when first needed by a GPU kernel. PHP code never triggers a transfer explicitly.
Unified weight memory for LLMs — model weights reside permanently on GPU after the first forward pass. No round-trip per token.
Activation tensors — intermediate activations are created on GPU, freed after the backward pass. Never touch CPU RAM.
Gradient synchronization — optimizer step runs on GPU. Updated weights stay on GPU. Only loss scalar and metrics cross the PCIe bus.
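The lazy-transfer rule can be sketched without a GPU by letting a second heap buffer stand in for device memory. Everything below is an assumption made for illustration: `lazy_tensor`, `ensure_on_gpu`, `gpu_scale`, and `read_on_cpu` are hypothetical names, and the `uploads`/`downloads` counters exist only to make the transfer count visible.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the tensor struct; a malloc'd buffer plays the role of
   VkBuffer/VkDeviceMemory so the sketch runs without a GPU. */
typedef struct {
    float  *data_cpu;   /* host copy */
    float  *data_gpu;   /* mock device copy (hypothetical) */
    size_t  count;
    uint8_t on_gpu;     /* 1 once a device copy exists */
    uint8_t dirty_cpu;  /* 1 if the host copy is stale */
    int     uploads;    /* transfer counters, for illustration only */
    int     downloads;
} lazy_tensor;

/* Called by every GPU kernel wrapper before it touches the tensor. */
void ensure_on_gpu(lazy_tensor *t) {
    if (t->on_gpu) return;                  /* already resident: no transfer */
    t->data_gpu = malloc(t->count * sizeof(float));
    assert(t->data_gpu != NULL);
    memcpy(t->data_gpu, t->data_cpu, t->count * sizeof(float));
    t->on_gpu = 1;
    t->uploads++;
}

/* A mock kernel: scales the device copy and marks the host copy stale. */
void gpu_scale(lazy_tensor *t, float s) {
    ensure_on_gpu(t);
    for (size_t i = 0; i < t->count; i++) t->data_gpu[i] *= s;
    t->dirty_cpu = 1;
}

/* Called only when host code actually reads the values. */
const float *read_on_cpu(lazy_tensor *t) {
    if (t->on_gpu && t->dirty_cpu) {
        memcpy(t->data_cpu, t->data_gpu, t->count * sizeof(float));
        t->dirty_cpu = 0;
        t->downloads++;
    }
    return t->data_cpu;
}
```

Two consecutive kernels cost one upload and no download; the copy back happens only on the first host read. The sketch does not handle host-side writes after upload, which would need a matching "device copy is stale" flag.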

SPIR-V Compute Kernels

Each C tensor operator (tensor_matmul, tensor_softmax, etc.) will have a corresponding GLSL compute shader that compiles to SPIR-V via glslc. SPIR-V binaries are embedded into libgpu.so at build time.

Example: GEMM kernel (matmul.comp)

#version 450
// Tiled GEMM: each workgroup handles a 16×16 output tile
layout(local_size_x = 16, local_size_y = 16) in;

layout(set = 0, binding = 0) readonly buffer MatA { float A[]; };
layout(set = 0, binding = 1) readonly buffer MatB { float B[]; };
layout(set = 0, binding = 2) buffer MatC { float C[]; };  // read-write: the beta * C term reads C

layout(push_constant) uniform PushConstants {
    uint M; uint N; uint K;
    float alpha; float beta;
};

shared float tileA[16][16];
shared float tileB[16][16];

void main() {
    uint row = gl_GlobalInvocationID.x;
    uint col = gl_GlobalInvocationID.y;
    float acc = 0.0;

    for (uint t = 0; t < (K + 15) / 16; t++) {
        tileA[gl_LocalInvocationID.x][gl_LocalInvocationID.y] =
            (row < M && t*16+gl_LocalInvocationID.y < K)
            ? A[row * K + t*16 + gl_LocalInvocationID.y] : 0.0;
        tileB[gl_LocalInvocationID.x][gl_LocalInvocationID.y] =
            (t*16+gl_LocalInvocationID.x < K && col < N)
            ? B[(t*16+gl_LocalInvocationID.x) * N + col] : 0.0;
        barrier();
        for (uint k = 0; k < 16; k++)
            acc += tileA[gl_LocalInvocationID.x][k]
                 * tileB[k][gl_LocalInvocationID.y];
        barrier();
    }
    if (row < M && col < N)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
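For checking the shader's output, a CPU reference with the same arithmetic is handy. The following is a minimal sketch, assuming row-major storage as in the shader's indexing; `gemm_reference` is a hypothetical helper, not an existing function of the library. It walks K in tiles of 16 so the padding behaviour at the edges matches the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16

/* CPU reference with the same math as the shader:
   C = alpha * (A x B) + beta * C, row-major, K walked in tiles of 16. */
void gemm_reference(const float *A, const float *B, float *C,
                    size_t M, size_t N, size_t K,
                    float alpha, float beta) {
    assert(A != NULL && B != NULL && C != NULL);
    size_t tiles = (K + TILE - 1) / TILE;   /* same count as the shader loop */
    for (size_t row = 0; row < M; row++) {
        for (size_t col = 0; col < N; col++) {
            float acc = 0.0f;
            for (size_t t = 0; t < tiles; t++) {
                for (size_t k = 0; k < TILE; k++) {
                    size_t kk = t * TILE + k;
                    if (kk < K)             /* the shader pads with 0.0 here */
                        acc += A[row * K + kk] * B[kk * N + col];
                }
            }
            C[row * N + col] = alpha * acc + beta * C[row * N + col];
        }
    }
}
```

Sizes that are not multiples of 16 (for example K = 17) are the interesting test cases, since they exercise the zero-padded last tile.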

Priority kernel list

Kernel	Maps to C function	Priority
`matmul.spv`	`tensor_matmul`	Critical
`linear.spv`	`tensor_linear`	Critical
`qlinear.spv`	`qweight_linear`	Critical
`attention.spv`	`mkvca_attend`	Critical
`softmax.spv`	`tensor_softmax`	High
`layer_norm.spv`	`tensor_layer_norm`	High
`adam_step.spv`	`tensor_fused_adam_step`	High
`adamw_step.spv`	`tensor_fused_adamw_step`	High
`elementwise.spv`	add/mul/relu/sigmoid/etc.	Medium
`reduce.spv`	sum/mean/max along axis	Medium
`conv2d.spv`	`tensor_conv2d`	Medium
`embedding.spv`	gather/scatter	Medium

Dispatch Strategy

Vulkan compute shaders are dispatched via vkCmdDispatch(). The workgroup size is tuned per operator:

Operator class	Workgroup layout	Tiling / reduction strategy
GEMM	`local_size = (16, 16, 1)`	16×16 output tiles, shared-memory tiling for L1 reuse
Attention (per head)	`local_size = (32, 1, 1)`	32 threads per row of Q — online softmax reduction in shared memory
Element-wise	`local_size = (256, 1, 1)`	1 invocation per element, vectorized 4-float loads
Reduction	`local_size = (256, 1, 1)`	Tree-reduction in shared memory, two-pass for large arrays

For batched GEMM (LLM prefill), workgroups are dispatched in a 3-D grid: dispatchX = ceil(M/16), dispatchY = ceil(N/16), dispatchZ = batch. This maps batch → Z and keeps the tile loop identical to single-batch.
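The grid arithmetic above is a ceiling division per dimension. A small sketch, with `dispatch_grid` and `gemm_grid` as hypothetical names; the three results are the values that would be passed as the group counts of `vkCmdDispatch`:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t x, y, z; } dispatch_grid;

/* Ceiling division written to avoid overflow for large a. */
static uint32_t ceil_div(uint32_t a, uint32_t b) {
    assert(b != 0);
    return a / b + (a % b != 0 ? 1u : 0u);
}

/* Workgroup counts for a batched GEMM with 16x16 output tiles:
   X covers rows (M), Y covers columns (N), Z covers the batch. */
dispatch_grid gemm_grid(uint32_t M, uint32_t N, uint32_t batch) {
    dispatch_grid g = { ceil_div(M, 16), ceil_div(N, 16), batch };
    return g;
}
```

A real implementation would also need to check each count against the device's `maxComputeWorkGroupCount` limit and split the dispatch if a dimension exceeds it.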

PHP API — Unchanged

The entire PHP API is 100% backward-compatible with the CPU backend. The GPU backend is selected at TensorEngine initialization time by an environment variable or config flag. All Tensor, Sequential, InferenceSession, etc. calls remain identical.

// No PHP changes required — backend selected by env var
// PML_BACKEND=vulkan php train.php
// PML_BACKEND=cpu php train.php  (default)

$model->train($dataset, epochs: 20);  // same call — GPU if available
$preds = $model->predict($test);      // same call
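On the C side, the environment-variable selection could look roughly like the sketch below. The `PML_BACKEND` values come from the comments above; `pml_backend_kind`, `parse_backend`, and `select_backend` are hypothetical names, and falling back to the CPU when the GPU is unusable is an assumption based on the "GPU if available" wording.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { PML_BACKEND_CPU, PML_BACKEND_VULKAN } pml_backend_kind;

/* Map the PML_BACKEND value to a backend; unknown or missing means CPU. */
pml_backend_kind parse_backend(const char *value) {
    if (value != NULL && strcmp(value, "vulkan") == 0)
        return PML_BACKEND_VULKAN;
    return PML_BACKEND_CPU;
}

/* gpu_usable is whatever the (hypothetical) GPU init step reported:
   1 if a Vulkan device was found, 0 otherwise. */
pml_backend_kind select_backend(int gpu_usable) {
    assert(gpu_usable == 0 || gpu_usable == 1);
    pml_backend_kind wanted = parse_backend(getenv("PML_BACKEND"));
    if (wanted == PML_BACKEND_VULKAN && !gpu_usable)
        return PML_BACKEND_CPU;   /* fall back instead of failing */
    return wanted;
}
```

Treating unknown values as CPU keeps a misspelled variable from breaking an existing script, at the cost of silently ignoring the typo; logging the chosen backend at start-up would make that visible.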

Optionally, explicit device placement will be exposed:

// Future API (not yet implemented)
Tensor::setDefaultDevice('gpu');
$t = Tensor::randomNormal([4096, 4096]);  // allocated on GPU VRAM
$t->to('cpu');                                // explicit download

Implementation Phases

Phase 1 — Vulkan foundation + GEMM (3–4 weeks)

Add gpu_backend.c: VkInstance, VkPhysicalDevice selection, VkDevice, compute queue
GPU tensor allocation via VkBuffer + VkDeviceMemory (host-visible staging memory plus device-local storage)
Lazy CPU→GPU upload and GPU→CPU download
SPIR-V GEMM kernel with shared-memory tiling (16×16)
Route tensor_matmul through GPU when both tensors are on-device
Benchmark: validate correctness vs OpenBLAS, measure speedup

Phase 2 — LLM inference kernels (2–3 weeks)

GPU attention: mkvca_attend → Vulkan online-softmax attention kernel
GPU layer-norm, RMSNorm, SwiGLU (combined activation + gate MLP)
GPU RoPE encoding
GPU INT8 quantized linear (qweight_linear)
End-to-end benchmark: LLaMA-3 8B token decode t/s on RTX 3090 / 4090

Phase 3 — Training kernels (2–3 weeks)

GPU Adam / AdamW fused step kernel
GPU softmax cross-entropy backward
GPU Conv2D forward + backward
GPU BatchNorm forward + backward
End-to-end benchmark: Sequential::train() on MNIST/CIFAR-10

Phase 4 — Platform breadth (4–6 weeks)

MoltenVK integration for Apple M-series (macOS / iOS)
fp16 kernel variants (VK_KHR_shader_float16_int8 for arithmetic, VK_KHR_16bit_storage for fp16 buffers) for about 2× memory savings and throughput
Multi-GPU sharding via VkQueue per device
Memory pool: VkDeviceMemory slab allocator to reduce allocation overhead
Async GPU-CPU overlap: prefetch next batch to GPU while current batch computes
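The memory-pool item can be illustrated with the simplest form of sub-allocation: one large block carved up by a bump pointer. This is a deliberately reduced stand-in for a full slab allocator (no size classes, no per-block free), and `slab`, `slab_alloc`, and `slab_reset` are hypothetical names. The returned offsets play the role of the offset used when binding a buffer to a shared VkDeviceMemory.

```c
#include <assert.h>
#include <stdint.h>

/* One large device allocation, carved into aligned sub-ranges. */
typedef struct {
    uint64_t capacity;  /* size of the block in bytes */
    uint64_t used;      /* bump pointer */
} slab;

#define SLAB_NO_SPACE UINT64_MAX

/* Returns the offset of a range of `size` bytes aligned to `align`
   (a power of two), or SLAB_NO_SPACE if the block is full. */
uint64_t slab_alloc(slab *s, uint64_t size, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uint64_t start = (s->used + align - 1) & ~(align - 1);
    if (start > s->capacity || size > s->capacity - start)
        return SLAB_NO_SPACE;
    s->used = start + size;
    return start;
}

/* Activations freed after the backward pass can be released all at once. */
void slab_reset(slab *s) { s->used = 0; }
```

Resetting the whole block once per training step fits the activation lifetime described in the memory model; long-lived weights would live in a separate block that is never reset.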

Expected Performance

Operation	CPU (8-core AVX2)	GPU (RTX 3080, Vulkan fp32)	GPU (RTX 4090, Vulkan fp16)
Dense 4096×4096 GEMM	~18 ms	~0.6 ms	~0.15 ms
LLaMA-3 8B decode (1 token)	~280 ms (3.6 t/s)	~12 ms (83 t/s)	~5 ms (200 t/s)
Sequential train 1 epoch MNIST (60k, bs=256)	~4.2 s	~0.3 s	~0.12 s
Conv2D 224×224 ResNet-50 fwd	~120 ms	~4 ms	~1.5 ms

These figures are projections derived from operator FLOP counts and hardware peak throughput (TFLOPS), not measurements. Actual results depend on memory bandwidth, kernel launch overhead, and occupancy.
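A FLOP-count estimate of this kind reduces to one division. The sketch below assumes a full M×K by K×N product and takes the sustained throughput as an input chosen by the reader; `gemm_flops` and `gemm_min_ms` are hypothetical helpers, and no hardware figure is built in.

```c
#include <assert.h>
#include <stdint.h>

/* FLOPs in an M x K by K x N matrix multiply: one multiply and one add
   per inner-product term. */
uint64_t gemm_flops(uint64_t M, uint64_t N, uint64_t K) {
    return 2u * M * N * K;
}

/* Lower bound on runtime in milliseconds at a given sustained throughput.
   `tflops` is supplied by the caller, not a measured value. */
double gemm_min_ms(uint64_t M, uint64_t N, uint64_t K, double tflops) {
    assert(tflops > 0.0);
    return (double)gemm_flops(M, N, K) / (tflops * 1e12) * 1e3;
}
```

For a square 4096×4096×4096 product the count is about 137 GFLOP, so at a hypothetical sustained 10 TFLOPS the compute-bound floor is about 13.7 ms. Running the same calculation with a device's published peak is a quick sanity check on any projected timing.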

Current Limitations to Resolve First

Before the Vulkan backend can be integrated, these CPU-side items should be addressed:

Issue	Impact on GPU backend	Priority
fp16 storage not implemented	GPU backend will want to keep weights in fp16 to halve VRAM. Current TensorC only stores fp32.	High
`tensor.c` is a 1500-line monolith	Must split into modules (`tensor_math.c`, `tensor_nn.c`, `tensor_io.c`) before adding a parallel GPU dispatch path.	Medium
No mmap weight loading for LLM weights	GPU backend needs `vkMapMemory` zero-copy path; currently SafeTensors does a full memcpy.	Medium
LSTM/attention not in single C call	GPU dispatch requires all LSTM state in one kernel; currently split over multiple PHP-driven steps.	Low-medium
Flash Attention not implemented	Standard attention allocates O(T²) scores. GPU backend should use Flash Attention 2 / 3 to keep memory O(T).	Medium

Bottom line

The Vulkan backend adds roughly 2,000 lines of C (backend + kernels) and zero changes to PHP. The FFI boundary already exists and is the correct abstraction point. The hard part is kernel tuning, not architecture.
