CPU Tuning
The native backend is CPU-first. Effective tuning aligns thread counts, vectorization, and memory placement with the physical hardware.
Thread control
- `TensorEngine` exposes OpenMP-backed kernels; `Tensor::configureThreading()` may set both BLAS and OpenMP thread counts.
- Use a small number of threads in environments with high core contention.
BLAS settings
- `OPENBLAS_NUM_THREADS=1` is often the best default for single-process workloads.
- For multi-process jobs, constrain BLAS threads to avoid oversubscription.
Cache locality
- Prefer batch sizes whose working set fits in the L2 cache when possible.
- Use contiguous row-major tensors for matrix multiplication.
- Avoid random access patterns inside the hot loop.
NUMA and affinity
- Bind processes to local NUMA nodes for large datasets.
- Keep tensor buffers on the same memory node as CPU threads.
When to use
- Apply CPU tuning to production training and inference.
- Adjust thread counts when the same machine runs multiple workers.
When not to use
- Do not rely on default operating system thread scheduling for peak throughput.
- Do not enable aggressive OpenMP scaling without measuring end-to-end latency.