Dataset & DataFrame

Dataset is the universal data container used by every estimator and transformer. DataFrame is a typed tabular store (C-backed) for ETL operations. DataLoader wraps a Dataset for mini-batch streaming.

Dataset Overview

A Dataset operates in two modes:

Tensor mode

Holds a Tensor for samples (shape [N, D]) and optionally a label Tensor ([N] or [N, C]). All estimators and neural networks require Tensor mode.

ETL / DataFrame mode

Holds a C-backed DataFrameC* pointer. Supports heterogeneous column types (float, string, category). Call materialize() to convert to Tensor mode.

Never load large CSVs into PHP memory. Use Dataset::fromCSV() or Dataset::load(): both call tensor_dataset_from_csv(), which memory-maps the file and never allocates a PHP array of rows.

Creating Datasets

fromCSV static (string $path, string|int $labelColumn = -1, bool $hasHeader = true): self

Loads a CSV via mmap(). The label column can be specified by name (if header present) or by integer index. Returns a Dataset in ETL mode; call materialize() to get Tensors.

$ds = Dataset::fromCSV('data/housing.csv', labelColumn: 'price');
$ds = $ds->materialize();  // convert ETL → Tensor mode

fromArray static (array $samples, ?array $labels = null): self

Constructs a Tensor-mode Dataset from PHP arrays. Copies data into C memory once. Use only for small datasets; prefer CSV loading for production.

$ds = Dataset::fromArray(
    [[1.0, 2.0], [3.0, 4.0]],
    ['cat', 'dog']
);

__construct (Tensor $samples, ?Tensor $labels = null)

Builds a Dataset directly from existing Tensors. No data is copied: the Dataset wraps the Tensors' underlying pointers.

seed static (?int $seed): void

Set the global RNG seed used for shuffling and splits. Pass null for non-deterministic behavior.
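
To illustrate why a fixed seed makes shuffles and splits reproducible, here is a minimal plain-PHP sketch of a seeded Fisher-Yates shuffle. seededShuffle() is a hypothetical helper written for this example; it uses PHP's own mt_srand()/mt_rand() and says nothing about which generator PML uses internally.

```php
<?php
// Illustrative only: seededShuffle() is a hypothetical helper, not part of PML.
function seededShuffle(array $rows, ?int $seed): array
{
    if ($seed !== null) {
        mt_srand($seed);   // fixed seed: the same permutation on every run
    } else {
        mt_srand();        // no seed given: PHP picks a random one
    }
    // Fisher-Yates: walk backwards, swapping each slot with a random earlier one.
    for ($i = count($rows) - 1; $i > 0; $i--) {
        $j = mt_rand(0, $i);
        [$rows[$i], $rows[$j]] = [$rows[$j], $rows[$i]];
    }
    return $rows;
}

$a = seededShuffle(range(1, 10), 42);
$b = seededShuffle(range(1, 10), 42);
var_dump($a === $b);   // the two permutations are identical
```

Passing null instead of 42 would give a different order on each run, which matches the non-deterministic behavior described above.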

Accessing Data

| Method | Returns | Description |
|---|---|---|
| samples() | Tensor | Feature matrix [N, D] |
| labels() | ?Tensor | Label vector [N] or label matrix [N, C] |
| numRows() | int | Number of samples N |
| numColumns() | int | Feature dimensionality D |
| isLabeled() | bool | Whether labels are present |
| schema() | array | Column names and types (ETL mode) |
| categories(int $colIdx) | string[] | Unique string values in a categorical column (ETL mode) |
| describe() | array | Per-column statistics: mean, std, min, max, median |
| toArray() | array | Copies to PHP array; avoid on large datasets |

Splitting & Folding

split (float $ratio = 0.8): array [Dataset, Dataset]

Splits into [$train, $test] with no shuffle. Use randomize() first for random splits.

[$train, $test] = $ds->randomize()->split(0.8);

stratifiedSplit (float $ratio = 0.8): array [Dataset, Dataset]

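Splits into [$train, $test] while preserving class proportions in both parts. Recommended for imbalanced classification, where an unstratified split can leave a rare class under-represented in one part.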
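
The idea behind a stratified split can be sketched in plain PHP by cutting each class separately at the same ratio. stratifiedSplitIndices() is a hypothetical helper for illustration; the rounding rule and the lack of per-class shuffling are assumptions, not PML's documented behavior.

```php
<?php
// Illustrative only: returns [trainIndices, testIndices] for a list of labels.
function stratifiedSplitIndices(array $labels, float $ratio = 0.8): array
{
    // Group row indices by class label.
    $byClass = [];
    foreach ($labels as $i => $label) {
        $byClass[$label][] = $i;
    }

    $train = [];
    $test = [];
    foreach ($byClass as $indices) {
        // Cut every class at the same ratio (rounding is an assumption).
        $cut = (int) round(count($indices) * $ratio);
        $train = array_merge($train, array_slice($indices, 0, $cut));
        $test = array_merge($test, array_slice($indices, $cut));
    }
    return [$train, $test];
}

// 90 negatives and 10 positives: both parts keep the 9:1 proportion.
$labels = array_merge(array_fill(0, 90, 'neg'), array_fill(0, 10, 'pos'));
[$train, $test] = stratifiedSplitIndices($labels, 0.8);
printf("train=%d test=%d\n", count($train), count($test));   // train=80 test=20
```

With a plain 80/20 cut of the same ordered data, all ten positives would land in the test part, which is the failure mode stratification avoids.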

fold (int $k = 10): Generator

Yields k [$train, $val] pairs for K-fold cross-validation.
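
How k folds partition the rows can be sketched with index arithmetic in plain PHP. kFoldIndices() is a hypothetical helper; giving the first folds one extra row when N is not divisible by k is an assumption about the remainder handling.

```php
<?php
// Illustrative only: yields [trainIndices, valIndices] for each of $k folds.
function kFoldIndices(int $n, int $k): Generator
{
    $indices = range(0, $n - 1);
    $base = intdiv($n, $k);   // minimum fold size
    $extra = $n % $k;         // the first $extra folds get one more row
    $start = 0;
    for ($f = 0; $f < $k; $f++) {
        $size = $base + ($f < $extra ? 1 : 0);
        $val = array_slice($indices, $start, $size);
        // Training rows are everything outside the validation window.
        $train = array_merge(
            array_slice($indices, 0, $start),
            array_slice($indices, $start + $size)
        );
        yield [$train, $val];
        $start += $size;
    }
}

foreach (kFoldIndices(10, 3) as $f => [$train, $val]) {
    printf("fold %d: train=%d val=%d\n", $f, count($train), count($val));
}
// fold 0: train=6 val=4
// fold 1: train=7 val=3
// fold 2: train=7 val=3
```

Every row appears in exactly one validation set, so averaging the k scores uses each sample once for evaluation.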

foreach ($ds->fold(5) as [$train, $val]) {
    $model->train($train);
    $score = $model->score($val);
}

| Method | Description |
|---|---|
| head(int $n) | First $n rows |
| tail(int $n) | Last $n rows |
| take(int $n) | Take first $n rows (alias of head) |
| leave(int $n) | Drop first $n rows |
| slice(int $offset, int $length) | Slice of $length rows starting at $offset |

Transforms & Utilities

| Method | Description |
|---|---|
| randomize() | Shuffle rows in-place (Fisher-Yates, C-level). Returns self. |
| standardize() | Standardize features to zero mean, unit variance in-place. Returns self. |
| dropNans() | Remove rows containing NaN in either samples or labels. Returns new Dataset. |
| filterByMask(Tensor $mask) | Keep rows where mask == 1.0. Returns new Dataset. |
| select(array $columns) | Keep only the specified column indices (ETL mode: column names). Returns self. |
| drop(array $columns) | Remove columns by index or name. Returns self. |
| stack(Dataset $other) | Concatenate two datasets row-wise (same number of columns required). Returns new Dataset. |
| join(Dataset $other) | Concatenate column-wise (same number of rows required). Returns new Dataset. |
| apply(callable $fn) | Apply a PHP closure to the samples Tensor. Returns new Dataset. Use sparingly (crosses the FFI boundary on every call). |
| sortByColumn(int $col) | Sort rows ascending by feature column index. |
| oneHotEncode(int $colIdx) | One-hot expand an integer column. Returns new Dataset with expanded features. |
| withLabelColumn(int $col) | Move a feature column to labels and remove it from samples. Returns new Dataset. |
| materialize(?int $labelCol = null) | Convert ETL mode → Tensor mode. Pass a label column index to extract labels. |
| toCSV(string $path) | Write to CSV file. Works in both modes. |
| bagOfWords(int\|string $col, ?int $maxFeatures) | Vectorize a text column using bag-of-words. Returns new Dataset with BoW features appended. |
| isEtlMode() | Returns true if the Dataset is in ETL / DataFrame mode. |
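
As a reference for what "zero mean, unit variance" means per column, here is a minimal plain-PHP sketch on a small row-major array. standardizeColumns() is a hypothetical helper; it divides by N (population standard deviation) and maps constant columns to 0.0, both of which are assumptions rather than documented PML behavior.

```php
<?php
// Illustrative only: z-score each column of a small row-major matrix.
function standardizeColumns(array $rows): array
{
    $n = count($rows);
    $d = count($rows[0]);
    for ($c = 0; $c < $d; $c++) {
        $col = array_column($rows, $c);
        $mean = array_sum($col) / $n;
        $sumSq = 0.0;
        foreach ($col as $v) {
            $sumSq += ($v - $mean) ** 2;
        }
        // Population standard deviation (divisor N is an assumption).
        $std = sqrt($sumSq / $n);
        foreach ($rows as $r => $row) {
            // Constant columns have std 0; map them to 0.0 instead of dividing.
            $rows[$r][$c] = $std > 0.0 ? ($row[$c] - $mean) / $std : 0.0;
        }
    }
    return $rows;
}

$z = standardizeColumns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]);
print_r(array_column($z, 0));   // roughly -1.2247, 0, 1.2247
```

Both columns produce the same z-scores here because they differ only by a scale factor, which is exactly what standardization removes.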

Batching

batches (int $batchSize): Generator

Yields Dataset objects of up to $batchSize rows. Each yielded Dataset wraps a view (no copy) into the parent's Tensor. The last batch may be smaller.

foreach ($dataset->batches(64) as $batch) {
    $X = $batch->samples();   // Tensor view [up to 64, D]
    $y = $batch->labels();    // Tensor view [up to 64]
    $model->stepOnBatch($batch);
}
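
The batch boundaries described above, including the shorter final batch, can be sketched as offset/length pairs in plain PHP. batchRanges() is a hypothetical helper used only to show the arithmetic.

```php
<?php
// Illustrative only: yields [offset, length] for each batch over $n rows.
function batchRanges(int $n, int $batchSize): Generator
{
    for ($offset = 0; $offset < $n; $offset += $batchSize) {
        // The last window is clipped to the rows that remain.
        yield [$offset, min($batchSize, $n - $offset)];
    }
}

foreach (batchRanges(150, 64) as [$offset, $length]) {
    printf("rows %d..%d (%d rows)\n", $offset, $offset + $length - 1, $length);
}
// rows 0..63 (64 rows)
// rows 64..127 (64 rows)
// rows 128..149 (22 rows)
```

Each pair has the same shape as the arguments of slice($offset, $length), which is one way to picture a batch as a window into the parent data.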

DataFrame

Pml\Data\DataFrame is a pure C-backed columnar store. It supports mixed column types (float32, int32, string/category) and is intended for ETL pipelines that need named columns plus filtering, group-by, and join operations before converting to a training Tensor.

Construction

| Method | Description |
|---|---|
| DataFrame::fromCSV(string $path, bool $hasHeader = true) | mmap CSV load. Headers become column names. |
| DataFrame::fromTensor(Tensor $t, array $colNames) | Wrap an existing Tensor as a named-column DataFrame. |
| copy() | Deep copy; allocates new C memory for all columns. |

Metadata

| Method | Returns | Description |
|---|---|---|
| shape() | [int, int] | [rows, cols] |
| numRows() | int | Number of rows |
| numCols() | int | Number of columns |
| columns() | string[] | Column names |
| dtypes() | array | Map of column name → dtype string |
| describe() | Tensor | Summary statistics (mean, std, min, max per numeric column) |
| categories(string $col) | string[] | Unique values of a categorical column |

Selection & Filtering

| Method | Description |
|---|---|
| head(int $n = 5) | First N rows → new DataFrame |
| tail(int $n = 5) | Last N rows → new DataFrame |
| iloc(int $offset, int $length) | Row slice |
| select(array $cols) | Keep named columns |
| drop(array $cols) | Remove named columns |
| col(string $name) | Get a column as a Tensor |
| where(string $col, string $op, float\|string $val) | Filter rows. Ops: '=', '!=', '<', '>', '<=', '>=' |
| dropNulls() | Remove rows with any null/NaN |
| sample(int $n, bool $replace, ?int $seed) | Random row sampling |

Mutation

| Method | Description |
|---|---|
| rename(array $mapping) | Rename columns: ['old' => 'new'] |
| withColumn(string $name, Tensor $data) | Add or replace a column with a Tensor |
| castToFloat(string $col) | Cast column to float32 in-place |
| fillNull(string $col, float $val) | Fill nulls with a constant |
| oneHotEncode(string $col) | Expand categorical column into binary indicator columns |
| sortBy(string $col, bool $ascending) | Sort all rows by a column |
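
To make "binary indicator columns" concrete, the following plain-PHP sketch expands one categorical column into 0/1 columns. oneHotColumn() is a hypothetical helper; the sorted column order and the use of the raw category value as the column key are assumptions, and the names PML gives the generated columns are not specified here.

```php
<?php
// Illustrative only: expands one categorical column into 0/1 indicator columns.
function oneHotColumn(array $values): array
{
    $categories = array_values(array_unique($values));
    sort($categories);   // deterministic column order (an assumption)

    $encoded = [];
    foreach ($values as $value) {
        $row = [];
        foreach ($categories as $category) {
            // Exactly one indicator per row is 1.0.
            $row[$category] = $value === $category ? 1.0 : 0.0;
        }
        $encoded[] = $row;
    }
    return $encoded;
}

$rows = oneHotColumn(['red', 'green', 'red', 'blue']);
print_r($rows[0]);   // blue => 0, green => 0, red => 1
```

A column with K distinct values becomes K numeric columns, so high-cardinality columns can widen the frame considerably.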

Aggregation & Join

| Method | Description |
|---|---|
| valueCounts(string $col) | Count occurrences per unique value → new DataFrame |
| groupBy(string $col) | Returns a GroupBy object for aggregation |
| join(DataFrame $right, string $on, string $how = 'inner') | SQL-style join on a common key column. How: 'inner', 'left', 'right' |
| concat(DataFrame[] $frames) | Row-wise concatenation of multiple DataFrames |
| toTensor(?array $cols = null) | Extract numeric columns to a Tensor [N, D] |

Target Encoding (C-accelerated)

// Fit on training data
$enc = $trainDf->targetEncodeFit('category_col', $trainLabels, smoothing: 10.0);

// Apply to test data
$testDf->targetEncodeTransform('category_col', $enc);
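
The role of the smoothing parameter can be illustrated with one common formulation of smoothed target encoding: encoded(c) = (sum_c + m * globalMean) / (n_c + m), where sum_c and n_c are the target sum and row count for category c and m is the smoothing weight. Whether PML uses exactly this formula is an assumption; fitSmoothedTargetEncoding() below is a hypothetical plain-PHP helper, not the library's implementation.

```php
<?php
// Illustrative only: one common smoothed mean target encoding.
function fitSmoothedTargetEncoding(array $categories, array $targets, float $smoothing): array
{
    $globalMean = array_sum($targets) / count($targets);

    // Accumulate per-category target sums and counts.
    $sum = [];
    $count = [];
    foreach ($categories as $i => $cat) {
        $sum[$cat] = ($sum[$cat] ?? 0.0) + $targets[$i];
        $count[$cat] = ($count[$cat] ?? 0) + 1;
    }

    $values = [];
    foreach ($sum as $cat => $s) {
        // Shrink the category mean toward the global mean; rare categories move most.
        $values[$cat] = ($s + $smoothing * $globalMean) / ($count[$cat] + $smoothing);
    }
    // 'default' is the fallback for categories unseen during fitting.
    return ['default' => $globalMean, 'values' => $values];
}

$enc = fitSmoothedTargetEncoding(['a', 'a', 'b'], [1.0, 0.0, 1.0], smoothing: 1.0);
printf("a=%.4f b=%.4f\n", $enc['values']['a'], $enc['values']['b']);   // a=0.5556 b=0.8333
```

Fitting on the training frame only and reusing the result on the test frame, as in the snippet above, keeps test labels from leaking into the encoded feature.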

DataLoader

Pml\Data\DataLoader wraps a Dataset with shuffle-per-epoch, drop-last, and prefetch options. Suitable for large training loops using Sequential::stepOnBatch().

use Pml\Data\DataLoader;

$loader = new DataLoader(
    dataset:   $trainDataset,
    batchSize: 128,
    shuffle:   true,
    dropLast:  true,
);

echo "Steps per epoch: " . $loader->steps() . "\n";

foreach ($loader->batches() as $batch) {
    $loss = $model->stepOnBatch($batch, clipGradNorm: 1.0);
}

| Method | Returns | Description |
|---|---|---|
| batches() | Generator<Dataset> | Yields one batch per step. Shuffles at the start of each call if enabled. |
| steps() | int | Number of batches per epoch |
| batchSize() | int | Configured batch size |
| dataset() | Dataset | The underlying Dataset |
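
How the number of steps per epoch depends on dropLast can be sketched with integer arithmetic. stepsPerEpoch() is a hypothetical helper that shows the usual convention (floor when the short final batch is dropped, ceiling otherwise); it is an assumption that steps() follows the same rule.

```php
<?php
// Illustrative only: the usual relationship between rows, batch size and dropLast.
function stepsPerEpoch(int $numRows, int $batchSize, bool $dropLast): int
{
    return $dropLast
        ? intdiv($numRows, $batchSize)                     // discard the short final batch
        : intdiv($numRows + $batchSize - 1, $batchSize);   // ceiling division keeps it
}

echo stepsPerEpoch(1000, 128, true), "\n";    // 7
echo stepsPerEpoch(1000, 128, false), "\n";   // 8
```

With dropLast enabled, the 104 leftover rows in this example are skipped for that epoch; with shuffling on, a different set of rows is left over each time.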

StreamingDataset

Pml\Data\StreamingDataset reads a large CSV file in fixed-size chunks without loading the entire file into memory. Each chunk is a Dataset in Tensor mode. Use with Sequential::stepOnBatch() for datasets that exceed RAM.

use Pml\Data\StreamingDataset;

$stream = new StreamingDataset(
    filepath:    'data/huge.csv',
    chunkSize:   10_000,     // rows per chunk
    labelColumn: -1,          // last column
    hasHeader:   true,
);

for ($epoch = 1; $epoch <= 10; $epoch++) {
    foreach ($stream->chunks() as $chunk) {
        foreach ($chunk->batches(128) as $batch) {
            $model->stepOnBatch($batch);
        }
    }
}
$model->markTrained();
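
The chunked-reading pattern itself can be sketched in plain PHP with a generator over fgetcsv(). csvChunks() is a hypothetical helper that yields PHP arrays of numeric rows; unlike StreamingDataset it does not produce Tensors, and it is shown only to illustrate how memory use stays bounded by the chunk size.

```php
<?php
// Illustrative only: yields arrays of at most $chunkSize parsed numeric rows.
function csvChunks(string $path, int $chunkSize, bool $hasHeader = true): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        if ($hasHeader) {
            fgetcsv($handle, null, ',', '"', '\\');   // skip the header row
        }
        $chunk = [];
        while (($row = fgetcsv($handle, null, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue;                 // fgetcsv returns [null] for blank lines
            }
            $chunk[] = array_map('floatval', $row);
            if (count($chunk) === $chunkSize) {
                yield $chunk;
                $chunk = [];              // drop the previous chunk before reading on
            }
        }
        if ($chunk !== []) {
            yield $chunk;                 // final, possibly shorter chunk
        }
    } finally {
        fclose($handle);
    }
}

$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "x,y\n1,2\n3,4\n5,6\n");
foreach (csvChunks($path, 2) as $chunk) {
    echo count($chunk), " rows\n";        // 2 rows, then 1 row
}
unlink($path);
```

Only one chunk is held in memory at a time, so the file can be re-read once per epoch, as the outer loop above does with chunks().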