Roadmap

Universal GPU Support via Vulkan

The planned GPU backend replaces OpenBLAS + AVX2 with Vulkan compute shaders. No CUDA. No ROCm. No Metal. One C backend for every GPU that supports Vulkan — NVIDIA, AMD, Intel, Mali, Adreno, Apple M-series (via MoltenVK).

Why Vulkan

Cross-vendor GPU support

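CUDA runs only on NVIDIA, ROCm only on AMD, and Metal only on Apple. Vulkan covers all three (Apple through the MoltenVK translation layer), plus Intel, Qualcomm Adreno, and ARM Mali. Browsers are a separate case: WebGPU is its own API and does not expose Vulkan, so a browser target would need a dedicated backend.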

Stable ABI

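Vulkan exposes a stable C ABI via vulkan.h. There is no Python runtime and no vendor SDK to install; at run time only the system Vulkan loader and a Vulkan-capable GPU driver need to be present. The backend ships as a single libgpu.so loaded by TensorEngine.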

FFI-compatible

The existing FFI boundary stays identical. PHP calls the same functions (tensor_matmul, tensor_softmax). The C layer decides which backend executes the kernel.
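One way the C layer could keep the FFI symbol fixed while swapping the implementation is a function pointer that is set once at start-up. The sketch below is only an illustration: the `tensor_matmul` signature, `matmul_fn`, `matmul_cpu`, and `backend_set_matmul` are assumptions and not PML's actual declarations, and the CPU path is a naive loop standing in for OpenBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: hypothetical signature; the real tensor_matmul may differ. */
typedef void (*matmul_fn)(const float *a, const float *b, float *c,
                          size_t m, size_t n, size_t k);

/* Stand-in for the existing AVX2 / OpenBLAS path: naive row-major GEMM. */
static void matmul_cpu(const float *a, const float *b, float *c,
                       size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* Active implementation; a GPU build would replace it during init. */
static matmul_fn active_matmul = matmul_cpu;

/* Install a different backend, e.g. a Vulkan-backed function. */
void backend_set_matmul(matmul_fn fn) {
    assert(fn != NULL);
    active_matmul = fn;
}

/* The symbol PHP binds through FFI: its signature never changes. */
void tensor_matmul(const float *a, const float *b, float *c,
                   size_t m, size_t n, size_t k) {
    active_matmul(a, b, c, m, n, k);
}
```

With this shape, PHP code that calls `tensor_matmul` cannot tell which backend ran; only the pointer behind the symbol changes.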

Expected speedup

Token generation with a 7B model: CPU ~1–3 tokens/s (t/s), GPU (RTX 4090) ~80–120 t/s. Dense matmul: 30–100× faster than AVX2 OpenBLAS on a mid-range GPU.

Proposed Architecture

PHP (unchanged)
└─ Tensor::matmul(), Sequential::train(), etc.
      │
      │  FFI — same signatures as today
      ▼
src/Lib/gpu_backend.c (new file)
├─ gpu_init()          — VkInstance, VkDevice, VkQueue
├─ gpu_tensor_alloc()  — VkBuffer + VkDeviceMemory
├─ gpu_upload()        — CPU → GPU (VkCommandBuffer copy)
├─ gpu_download()      — GPU → CPU (sync download)
└─ gpu_dispatch()      — load SPIR-V, vkCmdDispatch

src/Lib/kernels/ (GLSL compute shaders → SPIR-V)
├─ matmul.comp      → matmul.spv
├─ softmax.comp     → softmax.spv
├─ relu.comp        → relu.spv
├─ layer_norm.comp  → layer_norm.spv
├─ attention.comp   → attention.spv
├─ adam_step.comp   → adam_step.spv
└─ ... (one .spv per hot-path operator)

src/Lib/tensor.c (modified)
├─ #ifdef PML_GPU_VULKAN
│      call gpu_backend.c functions
│  #else
│      existing AVX2 / OpenBLAS implementation
│  #endif

GPU Memory Model

Tensors that are moved to the GPU will have two representations:

TensorC struct (C)

float*         data_cpu;   // host-side buffer (nullable when on GPU)
VkBuffer       vk_buffer;  // device buffer handle
VkDeviceMemory vk_memory;  // device memory
uint8_t        on_gpu;     // 1 if resident on GPU
uint32_t       dirty_cpu;  // 1 if CPU copy is stale (needs download)
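The `dirty_cpu` flag suggests a lazy download: the host copy is refreshed only when someone actually reads it. The following is a minimal sketch of that idea under stated assumptions. `TensorSketch`, `tensor_mark_gpu_written`, and `tensor_host_data` are hypothetical names, and the device buffer is simulated with ordinary heap memory so the example runs without Vulkan.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for TensorC. `device` replaces the VkBuffer /
 * VkDeviceMemory pair with plain memory so the sketch runs anywhere. */
typedef struct {
    float   *data_cpu;   /* host-side buffer; NULL until first download */
    float   *device;     /* ASSUMPTION: simulated device buffer */
    size_t   count;      /* number of float elements */
    uint8_t  on_gpu;     /* 1 if resident on GPU */
    uint32_t dirty_cpu;  /* 1 if the CPU copy is stale */
    unsigned downloads;  /* instrumentation: number of copies performed */
} TensorSketch;

/* Called after a (simulated) kernel wrote to the device buffer. */
void tensor_mark_gpu_written(TensorSketch *t) {
    assert(t->on_gpu);
    t->dirty_cpu = 1;
}

/* Return a host pointer that is valid to read, downloading only if stale. */
const float *tensor_host_data(TensorSketch *t) {
    if (t->on_gpu && (t->dirty_cpu || t->data_cpu == NULL)) {
        if (t->data_cpu == NULL)
            t->data_cpu = malloc(t->count * sizeof(float));
        assert(t->data_cpu != NULL);
        /* A real backend would record a buffer copy and wait for it. */
        memcpy(t->data_cpu, t->device, t->count * sizeof(float));
        t->dirty_cpu = 0;
        t->downloads++;
    }
    return t->data_cpu;
}
```

Repeated reads between kernel launches then cost one copy rather than one copy per read.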

Key design decisions:

SPIR-V Compute Kernels

Each C tensor operator (tensor_matmul, tensor_softmax, etc.) will have a corresponding GLSL compute shader that compiles to SPIR-V via glslc. SPIR-V binaries are embedded into libgpu.so at build time.
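Embedding could be as simple as generated `uint32_t` arrays plus a lookup table. The sketch below assumes that layout; `EmbeddedKernel` and `kernel_find` are hypothetical names, and the blobs contain only the five-word SPIR-V header (starting with the magic number 0x07230203), not real shaders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPIRV_MAGIC 0x07230203u  /* first word of every SPIR-V module */

typedef struct {
    const char     *name;   /* operator name used for lookup */
    const uint32_t *words;  /* SPIR-V module as 32-bit words */
    size_t          count;  /* number of words */
} EmbeddedKernel;

/* Placeholder blobs: header only (magic, version 1.0, generator, bound,
 * schema). A build step would generate real arrays from glslc output. */
static const uint32_t matmul_spv[]  = { SPIRV_MAGIC, 0x00010000u, 0, 1, 0 };
static const uint32_t softmax_spv[] = { SPIRV_MAGIC, 0x00010000u, 0, 1, 0 };

static const EmbeddedKernel kernels[] = {
    { "matmul",  matmul_spv,  sizeof matmul_spv  / sizeof matmul_spv[0]  },
    { "softmax", softmax_spv, sizeof softmax_spv / sizeof softmax_spv[0] },
};

/* Find an embedded module by name; NULL if absent or obviously malformed. */
const EmbeddedKernel *kernel_find(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < sizeof kernels / sizeof kernels[0]; i++) {
        const EmbeddedKernel *k = &kernels[i];
        if (strcmp(k->name, name) == 0)
            return (k->count >= 5 && k->words[0] == SPIRV_MAGIC) ? k : NULL;
    }
    return NULL;
}
```

When creating a shader module from such a blob, note that `VkShaderModuleCreateInfo.codeSize` is measured in bytes, so it would be `count * sizeof(uint32_t)`.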

Example: GEMM kernel (matmul.comp)

#version 450
// Tiled GEMM: each workgroup handles a 16×16 output tile
layout(local_size_x = 16, local_size_y = 16) in;

layout(set = 0, binding = 0) readonly buffer MatA { float A[]; };
layout(set = 0, binding = 1) readonly buffer MatB { float B[]; };
// C is read as well as written (the beta term), so it cannot be writeonly.
layout(set = 0, binding = 2) buffer MatC { float C[]; };

layout(push_constant) uniform PushConstants {
    uint M; uint N; uint K;
    float alpha; float beta;
};

shared float tileA[16][16];
shared float tileB[16][16];

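void main() {
    // One invocation per output element: x indexes rows of C, y indexes columns.
    uint row = gl_GlobalInvocationID.x;
    uint col = gl_GlobalInvocationID.y;
    float acc = 0.0;

    // Walk the K dimension one 16-wide tile at a time.
    for (uint t = 0; t < (K + 15) / 16; t++) {
        // Stage one tile of A and one tile of B in shared memory,
        // padding with zeros outside the matrix bounds.
        tileA[gl_LocalInvocationID.x][gl_LocalInvocationID.y] =
            (row < M && t*16+gl_LocalInvocationID.y < K)
            ? A[row * K + t*16 + gl_LocalInvocationID.y] : 0.0;
        tileB[gl_LocalInvocationID.x][gl_LocalInvocationID.y] =
            (t*16+gl_LocalInvocationID.x < K && col < N)
            ? B[(t*16+gl_LocalInvocationID.x) * N + col] : 0.0;
        barrier();  // both tiles fully written before anyone reads them
        for (uint k = 0; k < 16; k++)
            acc += tileA[gl_LocalInvocationID.x][k]
                 * tileB[k][gl_LocalInvocationID.y];
        barrier();  // all reads finished before the tiles are overwritten
    }
    // Edge workgroups may extend past the matrix; skip those invocations.
    if (row < M && col < N)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
}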
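To check the tile loop's zero-padding logic without a GPU, the same arithmetic can be run on the CPU and compared against a naive triple loop. This is a sketch for illustration only: `gemm_ref` and `gemm_tiled` are hypothetical helper names, and the tiled version mirrors the shader's K-loop but not its shared-memory behaviour.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 16

/* Naive reference: C = alpha * A*B + beta * C, all matrices row-major. */
void gemm_ref(const float *A, const float *B, float *C,
              size_t M, size_t N, size_t K, float alpha, float beta) {
    assert(A && B && C);
    for (size_t r = 0; r < M; r++)
        for (size_t c = 0; c < N; c++) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; k++)
                acc += A[r * K + k] * B[k * N + c];
            C[r * N + c] = alpha * acc + beta * C[r * N + c];
        }
}

/* Same result computed tile by tile along K, padding with zeros the way
 * the shader does when K is not a multiple of the tile size. */
void gemm_tiled(const float *A, const float *B, float *C,
                size_t M, size_t N, size_t K, float alpha, float beta) {
    assert(A && B && C);
    for (size_t r = 0; r < M; r++)
        for (size_t c = 0; c < N; c++) {
            float acc = 0.0f;
            for (size_t t = 0; t < (K + TILE - 1) / TILE; t++)
                for (size_t k = 0; k < TILE; k++) {
                    size_t kk = t * TILE + k;
                    float a = kk < K ? A[r * K + kk] : 0.0f;
                    float b = kk < K ? B[kk * N + c] : 0.0f;
                    acc += a * b;
                }
            C[r * N + c] = alpha * acc + beta * C[r * N + c];
        }
}
```

Choosing K = 20 (one full tile plus a partial one) exercises the padded path; both functions should then agree element for element.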

Priority kernel list

| Kernel | Maps to C function | Priority |
|---|---|---|
| matmul.spv | tensor_matmul | Critical |
| linear.spv | tensor_linear | Critical |
| qlinear.spv | qweight_linear | Critical |
| attention.spv | mkvca_attend | Critical |
| softmax.spv | tensor_softmax | High |
| layer_norm.spv | tensor_layer_norm | High |
| adam_step.spv | tensor_fused_adam_step | High |
| adamw_step.spv | tensor_fused_adamw_step | High |
| elementwise.spv | add/mul/relu/sigmoid/etc. | Medium |
| reduce.spv | sum/mean/max along axis | Medium |
| conv2d.spv | tensor_conv2d | Medium |
| embedding.spv | gather/scatter | Medium |

Dispatch Strategy

Vulkan compute shaders are dispatched via vkCmdDispatch(). The workgroup size is tuned per operator:

| Operator class | Workgroup layout | Tiling / reduction strategy |
|---|---|---|
| GEMM | local_size = (16, 16, 1) | 16×16 output tiles, shared-memory tiling for L1 reuse |
| Attention (per head) | local_size = (32, 1, 1) | 32 threads per row of Q; online softmax reduction in shared memory |
| Element-wise | local_size = (256, 1, 1) | 1 invocation per element, vectorized 4-float loads |
| Reduction | local_size = (256, 1, 1) | Tree reduction in shared memory, two-pass for large arrays |

For batched GEMM (LLM prefill), workgroups are dispatched in a 3-D grid: dispatchX = ceil(M/16), dispatchY = ceil(N/16), dispatchZ = batch. This maps batch → Z and keeps the tile loop identical to single-batch.
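The grid sizes above are integer ceiling divisions, which can be computed without floating point. A small sketch, with `DispatchDims` and `gemm_dispatch_dims` as hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

#define GEMM_TILE 16u  /* matches local_size_x / local_size_y in the shader */

typedef struct { uint32_t x, y, z; } DispatchDims;

/* Integer ceil(a / b). */
static uint32_t ceil_div(uint32_t a, uint32_t b) {
    assert(b != 0);
    return a / b + (a % b != 0);
}

/* Workgroup counts for a batched M x N output: batch maps to Z. */
DispatchDims gemm_dispatch_dims(uint32_t M, uint32_t N, uint32_t batch) {
    DispatchDims d = { ceil_div(M, GEMM_TILE), ceil_div(N, GEMM_TILE), batch };
    return d;
}
```

For M = 100, N = 33, batch = 4 this gives a 7 × 3 × 4 grid. The three values are what would be passed as the group counts to `vkCmdDispatch()`, and each must stay within the device's `maxComputeWorkGroupCount` limit.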

PHP API — Unchanged

The entire PHP API is 100% backward-compatible with the CPU backend. The GPU backend is selected at TensorEngine initialization time by an environment variable or config flag. All Tensor, Sequential, InferenceSession, etc. calls remain identical.

// No PHP changes required — backend selected by env var
// PML_BACKEND=vulkan php train.php
// PML_BACKEND=cpu php train.php  (default)

$model->train($dataset, epochs: 20);  // same call — GPU if available
$preds = $model->predict($test);      // same call
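On the C side, reading the variable could look roughly like the sketch below. It assumes the values `vulkan` and `cpu` shown above, treats anything else as CPU, and falls back to CPU when no GPU is found; `PmlBackend`, `backend_from_string`, and `backend_select` are hypothetical names, and `gpu_available` stands in for the result of a real device probe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { PML_BACKEND_CPU = 0, PML_BACKEND_VULKAN = 1 } PmlBackend;

/* Map the PML_BACKEND value to an enum; unknown or missing means CPU. */
PmlBackend backend_from_string(const char *value) {
    if (value != NULL && strcmp(value, "vulkan") == 0)
        return PML_BACKEND_VULKAN;
    return PML_BACKEND_CPU;
}

/* Read the environment once at start-up. `gpu_available` is 1 if a
 * usable Vulkan device was found, 0 otherwise (ASSUMPTION: set by a
 * probe such as gpu_init()). */
PmlBackend backend_select(int gpu_available) {
    assert(gpu_available == 0 || gpu_available == 1);
    PmlBackend want = backend_from_string(getenv("PML_BACKEND"));
    if (want == PML_BACKEND_VULKAN && !gpu_available)
        return PML_BACKEND_CPU;  /* fall back rather than fail */
    return want;
}
```

Keeping the string parsing separate from the environment lookup makes the selection logic easy to unit-test.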

Optionally, explicit device placement will be exposed:

// Future API (not yet implemented)
Tensor::setDefaultDevice('gpu');
$t = Tensor::randomNormal([4096, 4096]);  // allocated on GPU VRAM
$t->to('cpu');                                // explicit download

Implementation Phases

Phase 1 — Vulkan foundation + GEMM (3–4 weeks)

  • Add gpu_backend.c: VkInstance, VkPhysicalDevice selection, VkDevice, compute queue
  • GPU tensor allocation via VkBuffer + VkDeviceMemory (host-visible + device-local)
  • Lazy CPU→GPU upload and GPU→CPU download
  • SPIR-V GEMM kernel with shared-memory tiling (16×16)
  • Route tensor_matmul through GPU when both tensors are on-device
  • Benchmark: validate correctness vs OpenBLAS, measure speedup
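For the correctness check, GPU and CPU results will rarely match bit for bit because the summation order differs, so a tolerance-based comparison is the usual approach. A minimal sketch, with `tensors_allclose` as a hypothetical helper and the tolerances left to the caller:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every element satisfies
 *   |got - want| <= atol + rtol * |want|
 * and 0 otherwise. NaN in either input fails the comparison. */
int tensors_allclose(const float *got, const float *want, size_t n,
                     float rtol, float atol) {
    assert(rtol >= 0.0f && atol >= 0.0f);
    for (size_t i = 0; i < n; i++) {
        float diff = got[i] - want[i];
        float mag  = want[i];
        if (diff < 0.0f) diff = -diff;
        if (mag  < 0.0f) mag  = -mag;
        if (!(diff <= atol + rtol * mag))  /* negated so NaN is rejected */
            return 0;
    }
    return 1;
}
```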

Phase 2 — LLM inference kernels (2–3 weeks)

  • GPU attention: mkvca_attend → Vulkan online-softmax attention kernel
  • GPU layer-norm, RMSNorm, SwiGLU (combined activation + gate MLP)
  • GPU RoPE encoding
  • GPU INT8 quantized linear (qweight_linear)
  • End-to-end benchmark: LLaMA-3 8B token decode t/s on RTX 3090 / 4090
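As a reference point for the INT8 linear kernel, a CPU version is useful for validating GPU output. The sketch below assumes symmetric per-output-row quantization (int8 weights with one float scale per row); this is a common scheme but not necessarily the layout `qweight_linear` uses, and `qlinear_ref` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[o] = scale[o] * sum_i( Wq[o][i] * x[i] ) + bias[o]
 * ASSUMPTION: row-major int8 weights, one float scale per output row.
 * `bias` may be NULL. */
void qlinear_ref(const int8_t *Wq, const float *scale, const float *bias,
                 const float *x, float *y, size_t out_dim, size_t in_dim) {
    assert(Wq && scale && x && y);
    for (size_t o = 0; o < out_dim; o++) {
        float acc = 0.0f;
        for (size_t i = 0; i < in_dim; i++)
            acc += (float)Wq[o * in_dim + i] * x[i];
        y[o] = scale[o] * acc + (bias ? bias[o] : 0.0f);
    }
}
```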

Phase 3 — Training kernels (2–3 weeks)

  • GPU Adam / AdamW fused step kernel
  • GPU softmax cross-entropy backward
  • GPU Conv2D forward + backward
  • GPU BatchNorm forward + backward
  • End-to-end benchmark: Sequential::train() on MNIST/CIFAR-10

Phase 4 — Platform breadth (4–6 weeks)

  • MoltenVK integration for Apple M-series (macOS / iOS)
  • fp16 kernel variants (VK_KHR_shader_float16_int8 for arithmetic, VK_KHR_16bit_storage for fp16 buffers) for roughly half the memory and up to 2× throughput
  • Multi-GPU sharding via VkQueue per device
  • Memory pool: VkDeviceMemory slab allocator to reduce allocation overhead
  • Async GPU-CPU overlap: prefetch next batch to GPU while current batch computes
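The memory pool idea can be illustrated with the simplest form of sub-allocation: a bump allocator that hands out aligned offsets inside one large device allocation. This is a sketch only and simpler than a full slab allocator (it frees everything at once); `SlabPool`, `slab_alloc`, and `slab_reset` are hypothetical names, and binding a buffer at the returned offset is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Bump sub-allocator over one large backing allocation. */
typedef struct {
    uint64_t capacity;  /* size of the backing allocation in bytes */
    uint64_t used;      /* next free offset */
} SlabPool;

#define SLAB_FAIL UINT64_MAX

/* Round `value` up to a multiple of `align` (must be a power of two). */
static uint64_t align_up(uint64_t value, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Reserve `size` bytes; returns the offset, or SLAB_FAIL if it does not fit. */
uint64_t slab_alloc(SlabPool *p, uint64_t size, uint64_t align) {
    uint64_t start = align_up(p->used, align);
    if (start > p->capacity || size > p->capacity - start)
        return SLAB_FAIL;
    p->used = start + size;
    return start;
}

/* Release everything at once, e.g. at the end of a forward pass. */
void slab_reset(SlabPool *p) {
    p->used = 0;
}
```

In a Vulkan backend the alignment would come from `VkMemoryRequirements.alignment`, and the offset would be passed as the `memoryOffset` argument of `vkBindBufferMemory`.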

Expected Performance

| Operation | CPU (8-core AVX2) | GPU (RTX 3080, Vulkan fp32) | GPU (RTX 4090, Vulkan fp16) |
|---|---|---|---|
| Dense 4096×4096 GEMM | ~18 ms | ~0.6 ms | ~0.15 ms |
| LLaMA-3 8B decode (1 token) | ~280 ms (3.6 t/s) | ~12 ms (83 t/s) | ~5 ms (200 t/s) |
| Sequential train, 1 epoch MNIST (60k, bs=256) | ~4.2 s | ~0.3 s | ~0.12 s |
| Conv2D 224×224 ResNet-50 fwd | ~120 ms | ~4 ms | ~1.5 ms |

Estimates based on operator FLOP counts and hardware peak throughput (TFLOPS). Actual results depend on memory bandwidth, kernel launch overhead, and occupancy.
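The FLOP-count method can be reproduced with a few lines of arithmetic: a matrix product of an M×K by a K×N matrix costs 2·M·N·K floating-point operations, and dividing by a throughput figure gives a lower bound on runtime. The helper names below are illustrative, and the throughput is a parameter rather than a measured value.

```c
#include <assert.h>
#include <stdint.h>

/* Floating-point operations in an (M x K) by (K x N) matrix product:
 * one multiply and one add per inner-product term. */
double gemm_flops(uint64_t M, uint64_t N, uint64_t K) {
    return 2.0 * (double)M * (double)N * (double)K;
}

/* Lower bound on runtime in milliseconds at a given sustained
 * throughput in TFLOPS (1 TFLOPS = 1e12 operations per second). */
double gemm_min_ms(uint64_t M, uint64_t N, uint64_t K, double tflops) {
    assert(tflops > 0.0);
    return gemm_flops(M, N, K) / (tflops * 1e12) * 1e3;
}
```

As a worked case, assuming a square 4096×4096×4096 product and a hypothetical round figure of 10 TFLOPS sustained, the count is about 1.37 × 10^11 operations and the bound is roughly 13.7 ms. Plugging in a device's real peak figure yields the corresponding bound for that hardware.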

Current Limitations to Resolve First

Before the Vulkan backend can be integrated, these CPU-side items should be addressed:

| Issue | Impact on GPU backend | Priority |
|---|---|---|
| fp16 storage not implemented | GPU backend will want to keep weights in fp16 to halve VRAM. Current TensorC only stores fp32. | High |
| tensor.c is a 1500-line monolith | Must split into modules (tensor_math.c, tensor_nn.c, tensor_io.c) before adding a parallel GPU dispatch path. | Medium |
| No mmap weight loading for LLM weights | GPU backend needs vkMapMemory zero-copy path; currently SafeTensors does a full memcpy. | Medium |
| LSTM/attention not in single C call | GPU dispatch requires all LSTM state in one kernel; currently split over multiple PHP-driven steps. | Low-medium |
| Flash Attention not implemented | Standard attention allocates O(T²) scores. GPU backend should use Flash Attention 2 / 3 to keep memory O(T). | Medium |
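For the mmap item, the host-side half of the change is a read-only file mapping, which lets the OS page weights in on demand instead of copying the whole file up front. The sketch below uses only standard POSIX calls; `map_file_readonly` is a hypothetical helper, and parsing the SafeTensors header or handing the data to Vulkan is out of scope here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure; on success the
 * caller releases the mapping with munmap(ptr, *size_out). POSIX only. */
const void *map_file_readonly(const char *path, size_t *size_out) {
    assert(path != NULL && size_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor closes */
    if (p == MAP_FAILED)
        return NULL;
    *size_out = (size_t)st.st_size;
    return p;
}
```

A loader built this way would read tensors directly from the mapped region and upload them to the device without an intermediate full-file copy.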
Bottom line

The Vulkan backend adds roughly 2,000 lines of C (backend + kernels) and zero changes to PHP. The FFI boundary already exists and is the correct abstraction point. The hard part is kernel tuning, not architecture.