Training Lifecycle
Training in PML is a staged workflow: data ingestion, transformation, batching, optimization, and persistence.
Stage 1: Dataset preparation
- Start with a Dataset in Tensor mode.
- Ensure label columns are set and tensors are contiguous.
- Use Pipeline to fit preprocessing on training data only; a short sketch follows.
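A minimal sketch of this stage, assuming hypothetical Dataset::fromArray(), withLabel(), and toTensorMode() methods; the real Dataset constructors in PML may differ:

// Assumed factory and builder methods; adjust to the actual Dataset API.
$rows = [
    ['sqft' => 1200.0, 'beds' => 3.0, 'price' => 250000.0],
    ['sqft' => 1800.0, 'beds' => 4.0, 'price' => 340000.0],
];
$trainDataset = Pml\Dataset::fromArray($rows)  // hypothetical constructor
    ->withLabel('price')                       // hypothetical: designate the label column
    ->toTensorMode();                          // hypothetical: contiguous tensor layout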
Stage 2: Batch execution
- Training loops consume minibatches.
- Batching is best handled at the dataset or pipeline layer.
- Avoid materializing the same batch multiple times.
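For intuition, here is a plain-PHP generator that yields each minibatch exactly once; in practice, prefer PML's own dataset- or pipeline-level batching, since this userland loop is only an illustration:

// Lazily yields batches so no batch is materialized more than once.
function minibatches(array $samples, int $batchSize): \Generator {
    for ($i = 0; $i < count($samples); $i += $batchSize) {
        yield array_slice($samples, $i, $batchSize);
    }
}

foreach (minibatches($rows, 64) as $batch) {  // $rows from the Stage 1 sketch
    // feed $batch to the training step
}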
Stage 3: Optimization
- Estimators may implement native gradient steps.
- Fused Adam and fused BCE kernels reduce FFI overhead.
- Weight updates should happen in-place when safe.
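As a reference for what a fused Adam kernel computes, the plain-PHP function below applies one bias-corrected Adam step in-place. This is the unfused math for intuition only, not PML's implementation; the fused kernel performs the same update in a single FFI call:

// One in-place Adam step over flat weight/gradient arrays.
function adamStepInPlace(
    array &$w, array $grad, array &$m, array &$v, int $t,
    float $lr = 1e-3, float $b1 = 0.9, float $b2 = 0.999, float $eps = 1e-8
): void {
    foreach ($w as $i => $_) {
        $m[$i] = $b1 * $m[$i] + (1 - $b1) * $grad[$i];        // first-moment estimate
        $v[$i] = $b2 * $v[$i] + (1 - $b2) * $grad[$i] ** 2;   // second-moment estimate
        $mHat  = $m[$i] / (1 - $b1 ** $t);                    // bias correction
        $vHat  = $v[$i] / (1 - $b2 ** $t);
        $w[$i] -= $lr * $mHat / (sqrt($vHat) + $eps);         // in-place weight update
    }
}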
Stage 4: Validation and checkpoints
- Validation datasets should be separate and kept in Tensor mode.
- Checkpointing writes configuration and weight tensors separately.
- Pipeline::save() persists transformer state in SafeTensors; a checkpoint sketch follows.
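A hedged sketch of a checkpoint layout, writing weight tensors via Pipeline::save() and configuration as separate JSON. The save() path argument and the directory layout are assumptions, not confirmed PML conventions:

// $pipeline is a trained Pipeline (see the example below); the
// checkpoints/run42/ directory is assumed to exist already.
$pipeline->save('checkpoints/run42/pipeline.safetensors');  // transformer state + weights
file_put_contents(
    'checkpoints/run42/config.json',                        // configuration kept separate
    json_encode(['epochs' => 10, 'batchSize' => 64, 'seed' => 1337], JSON_PRETTY_PRINT)
);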
Best practices
- Use deterministic seeds for reproducibility.
- Keep the training dataset contiguous across feature dimensions.
- Avoid Python-style eager loops over rows in PHP.
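As an illustration of deterministic seeding, the snippet below fixes PHP's userland RNG; the Pml\Random::seed() call is hypothetical and stands in for whatever seeding hook the library actually exposes for its native kernels:

mt_srand(1337);          // fixes PHP's userland Mersenne Twister RNG
Pml\Random::seed(1337);  // hypothetical library-level seeder, shown for illustration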
Example
// Fit the scaler on the training split, then train the estimator.
$pipeline = new Pml\Pipeline([
    new Pml\Transformers\StandardScaler(),
], new Pml\Estimators\Regression\GBDTRegressor());

// Named arguments require PHP 8.0+; $trainDataset is a Tensor-mode Dataset.
$pipeline->train($trainDataset, epochs: 10, batchSize: 64);
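A possible follow-up that runs an explicit validation pass, of the kind suggested for performance regression checks below. Here predict() and labels() are assumed method names, and the mean squared error is computed by hand:

$preds  = $pipeline->predict($validDataset);  // assumed method; $validDataset kept separate
$labels = $validDataset->labels();            // assumed accessor for target values
$sse = 0.0;
foreach ($preds as $i => $p) {
    $sse += ($p - $labels[$i]) ** 2;
}
$mse = $sse / max(1, count($preds));          // mean squared error on held-out data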
When to use
- Use training lifecycle patterns for production experiments.
- Use explicit validation loops for performance regression checks.
When not to use
- Do not skip tensor-level validation for mixed-type datasets.
- Do not serialize live FFI pointers; persist state with SafeTensors and JSON metadata.