Dataset & DataFrame

Dataset is the universal data container used by every estimator and transformer. DataFrame is a typed tabular store (C-backed) for ETL operations. DataLoader wraps a Dataset for mini-batch streaming.

Dataset Overview

A Dataset operates in two modes:

Tensor mode

Holds a Tensor for samples (shape [N, D]) and optionally a label Tensor ([N] or [N, C]). All estimators and neural networks require Tensor mode.

ETL / DataFrame mode

Holds a C-backed DataFrameC* pointer. Supports heterogeneous column types (float, string, category). Call materialize() to convert to Tensor mode.

Never load large CSVs into PHP memory. Use Dataset::fromCSV() or Dataset::load(): both call tensor_dataset_from_csv(), which memory-maps the file and never allocates a PHP array of rows.

Creating Datasets

fromCSV static (string $path, string|int $labelColumn = -1, bool $hasHeader = true): self

Loads a CSV via mmap(). The label column can be specified by name (if header present) or by integer index. Returns a Dataset in ETL mode; call materialize() to get Tensors.

$ds = Dataset::fromCSV('data/housing.csv', labelColumn: 'price');
$ds = $ds->materialize();  // convert ETL → Tensor mode

fromArray static (array $samples, ?array $labels = null): self

Constructs a Tensor-mode Dataset from PHP arrays. Copies data into C memory once. Use only for small datasets; prefer CSV loading for production.

$ds = Dataset::fromArray(
    [[1.0, 2.0], [3.0, 4.0]],
    ['cat', 'dog']
);

__construct (Tensor $samples, ?Tensor $labels = null)

Builds a Dataset directly from existing Tensors. Neither tensor is copied: the Dataset wraps the existing pointers.

seed static (?int $seed): void

Set the global RNG seed used for shuffling and splits. Pass null for non-deterministic behavior.
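
To illustrate why a fixed seed makes shuffles and splits reproducible, here is a minimal sketch in plain PHP. It uses PHP's own Mersenne Twister rather than the library's C-level RNG, and seededShuffle() is a hypothetical helper, not part of the Dataset API.

```php
<?php
// Hypothetical helper (not a library function): seeded Fisher-Yates shuffle.
function seededShuffle(array $rows, ?int $seed): array
{
    if ($seed !== null) {
        mt_srand($seed);   // fixed seed: the same permutation on every run
    } else {
        mt_srand();        // no seed given: reseed randomly
    }

    $rows = array_values($rows);
    for ($i = count($rows) - 1; $i > 0; $i--) {
        $j = mt_rand(0, $i);                              // index in the not-yet-fixed prefix
        [$rows[$i], $rows[$j]] = [$rows[$j], $rows[$i]];  // swap
    }
    return $rows;
}

$a = seededShuffle(range(1, 10), 42);
$b = seededShuffle(range(1, 10), 42);
var_dump($a === $b); // bool(true): same seed, same order
```

Passing null reseeds randomly, so repeated runs will usually produce different orders.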

Accessing Data

Method	Returns	Description
`samples()`	`Tensor`	Feature matrix [N, D]
`labels()`	`?Tensor`	Label vector [N] or label matrix [N, C]
`numRows()`	`int`	Number of samples N
`numColumns()`	`int`	Feature dimensionality D
`isLabeled()`	`bool`	Whether labels are present
`schema()`	`array`	Column names and types (ETL mode)
`categories(int $colIdx)`	`string[]`	Unique string values in a categorical column (ETL mode)
`describe()`	`array`	Per-column statistics: mean, std, min, max, median
`toArray()`	`array`	Copies to PHP array — avoid on large datasets

Splitting & Folding

split (float $ratio = 0.8): array [Dataset, Dataset]

Splits into [$train, $test] with no shuffle. Use randomize() first for random splits.

[$train, $test] = $ds->randomize()->split(0.8);

stratifiedSplit (float $ratio = 0.8): array [Dataset, Dataset]

Stratified split preserving class proportions. Required for imbalanced classification.
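
The difference from an ordered split can be sketched on plain label arrays. The helper below is hypothetical and works on row indices; the per-class rounding rule is an assumption, since the library's exact behavior is not documented here.

```php
<?php
// Hypothetical helper: returns [trainIndices, testIndices], split per class.
function stratifiedSplitIndices(array $labels, float $ratio = 0.8): array
{
    // Group row indices by class label.
    $byClass = [];
    foreach ($labels as $i => $label) {
        $byClass[$label][] = $i;
    }

    $train = [];
    $test = [];
    foreach ($byClass as $indices) {
        // Assumption: each class is cut at round(count * ratio).
        $cut = (int) round(count($indices) * $ratio);
        $train = array_merge($train, array_slice($indices, 0, $cut));
        $test = array_merge($test, array_slice($indices, $cut));
    }
    return [$train, $test];
}

// 10 rows of class 'a' followed by 5 rows of class 'b' (sorted by label).
$labels = array_merge(array_fill(0, 10, 'a'), array_fill(0, 5, 'b'));

[$train, $test] = stratifiedSplitIndices($labels, 0.8);
$testLabels = array_map(fn($i) => $labels[$i], $test);
// $testLabels is ['a', 'a', 'b']: the 2:1 class ratio is preserved.

// An ordered 80/20 cut of the same rows puts only class 'b' in the test set.
$plainTestLabels = array_slice($labels, 12);
// $plainTestLabels is ['b', 'b', 'b']
```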

fold (int $k = 10): Generator

Yields k [$train, $val] pairs for K-fold cross-validation.

foreach ($ds->fold(5) as [$train, $val]) {
    $model->train($train);
    $score = $model->score($val);
}
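
How the k pairs partition the rows can be sketched with an index generator. This is a minimal illustration under assumptions: kFoldIndices() is a hypothetical helper, folds are contiguous, and any remainder rows go to the first folds. Whether fold() shuffles first is not stated above.

```php
<?php
// Hypothetical helper: yields k [trainIndices, valIndices] pairs over rows 0..n-1.
function kFoldIndices(int $n, int $k): Generator
{
    if ($k < 2 || $k > $n) {
        throw new InvalidArgumentException("k must be between 2 and $n");
    }

    $base = intdiv($n, $k);   // minimum fold size
    $extra = $n % $k;         // the first $extra folds get one more row
    $start = 0;

    for ($f = 0; $f < $k; $f++) {
        $size = $base + ($f < $extra ? 1 : 0);
        $end = $start + $size;              // exclusive end of the validation fold

        $val = range($start, $end - 1);
        $train = array_merge(
            $start > 0 ? range(0, $start - 1) : [],
            $end < $n ? range($end, $n - 1) : []
        );

        yield [$train, $val];
        $start = $end;
    }
}

foreach (kFoldIndices(10, 3) as [$train, $val]) {
    echo count($train) . " train / " . count($val) . " val\n";
}
// 6 train / 4 val
// 7 train / 3 val
// 7 train / 3 val
```

Every row appears in exactly one validation fold, so each sample is scored once across the k iterations.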

Method	Description
`head(int $n)`	First N rows
`tail(int $n)`	Last N rows
`take(int $n)`	Take first N (alias of head)
`leave(int $n)`	Drop first N rows
`slice(int $offset, int $length)`	`$length` rows starting at row `$offset`

Transforms & Utilities

Method	Description
`randomize()`	Shuffle rows in-place (Fisher-Yates, C-level). Returns self.
`standardize()`	Standardize features to zero mean, unit variance in-place. Returns self.
`dropNans()`	Remove rows containing NaN in either samples or labels. Returns new Dataset.
`filterByMask(Tensor $mask)`	Keep rows where mask == 1.0. Returns new Dataset.
`select(array $columns)`	Keep only the specified column indices (ETL mode: column names). Returns self.
`drop(array $columns)`	Remove columns by index or name. Returns self.
`stack(Dataset $other)`	Concatenate two datasets row-wise (same number of columns required). Returns new Dataset.
`join(Dataset $other)`	Concatenate column-wise (same number of rows required). Returns new Dataset.
`apply(callable $fn)`	Apply a PHP closure to samples Tensor, returns new Dataset. Use sparingly (crosses FFI boundary per call).
`sortByColumn(int $col)`	Sort rows ascending by feature column index.
`oneHotEncode(int $colIdx)`	One-hot expand an integer column. Returns new Dataset with expanded features.
`withLabelColumn(int $col)`	Move a feature column to labels, remove it from samples. Returns new Dataset.
`materialize(?int $labelCol = null)`	Convert ETL mode → Tensor mode. Pass label column index to extract labels.
`toCSV(string $path)`	Write to CSV file. Works in both modes.
`bagOfWords(int|string $col, ?int $maxFeatures)`	Vectorize a text column using bag-of-words. Returns new Dataset with BoW features appended.
`isEtlMode()`	Returns true if Dataset is in ETL / DataFrame mode.
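
As an illustration of what standardize() computes, the following plain-PHP sketch rescales each column to zero mean and unit variance. standardizeColumns() is a hypothetical helper; it assumes the population standard deviation and maps constant columns to 0.0, neither of which is confirmed for the library's C implementation.

```php
<?php
// Hypothetical helper: column-wise z-score on a list of numeric rows.
function standardizeColumns(array $samples): array
{
    $n = count($samples);
    $d = count($samples[0]);

    for ($c = 0; $c < $d; $c++) {
        $col = array_column($samples, $c);
        $mean = array_sum($col) / $n;

        $sumSq = 0.0;
        foreach ($col as $v) {
            $sumSq += ($v - $mean) ** 2;
        }
        $std = sqrt($sumSq / $n);   // assumption: population std (divide by N)

        foreach ($samples as $r => $row) {
            // Constant columns have std 0; map them to 0.0 to avoid dividing by zero.
            $samples[$r][$c] = $std > 0.0 ? ($row[$c] - $mean) / $std : 0.0;
        }
    }
    return $samples;
}

$z = standardizeColumns([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]);
// Column 0 becomes roughly [-1.2247, 0.0, 1.2247]; column 1 is constant, so all 0.0.
```

Unlike this sketch, which returns a copy, the table above documents standardize() as modifying the Dataset in place.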

Batching

batches (int $batchSize): Generator

Yields Dataset objects of up to $batchSize rows. Each yielded Dataset wraps a view (no copy) into the parent's Tensor. The last batch may be smaller.

foreach ($dataset->batches(64) as $batch) {
    $X = $batch->samples();   // Tensor view [B, D], B <= 64 (the last batch may be smaller)
    $y = $batch->labels();    // Tensor view [B]
    $model->stepOnBatch($batch);
}
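
A view-based batch is fully described by an offset and a length into the parent. The sketch below shows that arithmetic in plain PHP; batchRanges() is a hypothetical helper, not the library's internal implementation.

```php
<?php
// Hypothetical helper: yields [offset, length] pairs that cover $numRows rows.
function batchRanges(int $numRows, int $batchSize): Generator
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batchSize must be at least 1');
    }
    for ($offset = 0; $offset < $numRows; $offset += $batchSize) {
        // The final range is clipped to the rows that remain.
        yield [$offset, min($batchSize, $numRows - $offset)];
    }
}

foreach (batchRanges(150, 64) as [$offset, $length]) {
    echo "offset=$offset length=$length\n";
}
// offset=0 length=64
// offset=64 length=64
// offset=128 length=22
```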

DataFrame

Pml\Data\DataFrame is a pure C-backed columnar store. It supports mixed column types (float32, int32, string/category) and is intended for ETL pipelines that need named columns along with filtering, group-by, and join operations before the data is converted to a training Tensor.

Construction

Method	Description
`DataFrame::fromCSV(string $path, bool $hasHeader = true)`	mmap CSV load. Headers become column names.
`DataFrame::fromTensor(Tensor $t, array $colNames)`	Wrap an existing Tensor as a named-column DataFrame.
`copy()`	Deep copy — allocates new C memory for all columns.

Metadata

Method	Returns	Description
`shape()`	`[int, int]`	[rows, cols]
`numRows()`	`int`	Number of rows
`numCols()`	`int`	Number of columns
`columns()`	`string[]`	Column names
`dtypes()`	`array`	Map of column name → dtype string
`describe()`	`Tensor`	Summary statistics (mean, std, min, max per numeric column)
`categories(string $col)`	`string[]`	Unique values of a categorical column

Selection & Filtering

Method	Description
`head(int $n = 5)`	First N rows → new DataFrame
`tail(int $n = 5)`	Last N rows → new DataFrame
`iloc(int $offset, int $length)`	Row slice
`select(array $cols)`	Keep named columns
`drop(array $cols)`	Remove named columns
`col(string $name)`	Get a column as a Tensor
`where(string $col, string $op, float|string $val)`	Filter rows. Ops: `'='`, `'!='`, `'<'`, `'>'`, `'<='`, `'>='`
`dropNulls()`	Remove rows with any null/NaN
`sample(int $n, bool $replace, ?int $seed)`	Random row sampling
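
The semantics of where() with its six operators can be sketched on plain PHP rows. whereRows() is a hypothetical helper for illustration only; the library itself filters DataFrame columns in C, and chaining calls here stands in for applying several filters in sequence.

```php
<?php
// Hypothetical helper: filter associative rows by one column and operator.
function whereRows(array $rows, string $col, string $op, float|string $val): array
{
    // Map each supported operator string to a predicate on the cell value.
    $test = match ($op) {
        '='  => fn($x) => $x == $val,
        '!=' => fn($x) => $x != $val,
        '<'  => fn($x) => $x < $val,
        '>'  => fn($x) => $x > $val,
        '<=' => fn($x) => $x <= $val,
        '>=' => fn($x) => $x >= $val,
        default => throw new InvalidArgumentException("Unsupported operator: $op"),
    };
    // array_values() reindexes so the result is a clean list of rows.
    return array_values(array_filter($rows, fn($row) => $test($row[$col])));
}

$rows = [
    ['city' => 'Oslo', 'price' => 120.0],
    ['city' => 'Lyon', 'price' => 80.0],
    ['city' => 'Oslo', 'price' => 95.0],
];

$cheapOslo = whereRows(whereRows($rows, 'city', '=', 'Oslo'), 'price', '<', 100.0);
// $cheapOslo holds the single row ['city' => 'Oslo', 'price' => 95.0]
```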

Mutation

Method	Description
`rename(array $mapping)`	Rename columns: `['old' => 'new']`
`withColumn(string $name, Tensor $data)`	Add or replace a column with a Tensor
`castToFloat(string $col)`	Cast column to float32 in-place
`fillNull(string $col, float $val)`	Fill nulls with a constant
`oneHotEncode(string $col)`	Expand categorical column into binary indicator columns
`sortBy(string $col, bool $ascending)`	Sort all rows by a column

Aggregation & Join

Method	Description
`valueCounts(string $col)`	Count occurrences per unique value → new DataFrame
`groupBy(string $col)`	Returns a `GroupBy` object for aggregation
`join(DataFrame $right, string $on, string $how = 'inner')`	SQL-style join on a common key column. How: `'inner'`, `'left'`, `'right'`
`concat(DataFrame[] $frames)`	Row-wise concatenation of multiple DataFrames
`toTensor(?array $cols = null)`	Extract numeric columns to a Tensor [N, D]

Target Encoding (C-accelerated)

// Fit on training data
$enc = $trainDf->targetEncodeFit('category_col', $trainLabels, smoothing: 10.0);

// Apply to test data
$testDf->targetEncodeTransform('category_col', $enc);
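
The fit/transform pair above can be understood through a plain-PHP sketch of smoothed target encoding. It assumes the common formula (category sum + smoothing × global mean) / (category count + smoothing), with unseen categories falling back to the global mean; the library's exact formula is not documented here, and both function names are hypothetical.

```php
<?php
// Hypothetical helper: learn one smoothed mean per category from training data.
function fitTargetEncoding(array $categories, array $targets, float $smoothing): array
{
    $globalMean = array_sum($targets) / count($targets);

    $sum = [];
    $count = [];
    foreach ($categories as $i => $cat) {
        $sum[$cat] = ($sum[$cat] ?? 0.0) + $targets[$i];
        $count[$cat] = ($count[$cat] ?? 0) + 1;
    }

    $map = [];
    foreach ($sum as $cat => $s) {
        // Assumed formula: rare categories are pulled toward the global mean.
        $map[$cat] = ($s + $smoothing * $globalMean) / ($count[$cat] + $smoothing);
    }
    return ['global' => $globalMean, 'map' => $map];
}

// Hypothetical helper: apply a fitted encoding; unseen categories get the global mean.
function applyTargetEncoding(array $categories, array $enc): array
{
    return array_map(fn($c) => $enc['map'][$c] ?? $enc['global'], $categories);
}

$enc = fitTargetEncoding(['a', 'a', 'b'], [1.0, 0.0, 1.0], 1.0);
$encoded = applyTargetEncoding(['a', 'b', 'c'], $enc);
// Roughly [0.5556, 0.8333, 0.6667]; 'c' was never seen, so it gets the global mean.
```

Fitting on the training split only, as in the snippet above, keeps test labels from leaking into the encoded feature.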

DataLoader

Pml\Data\DataLoader wraps a Dataset with shuffle-per-epoch, drop-last, and prefetch options. Suitable for large training loops using Sequential::stepOnBatch().

use Pml\Data\DataLoader;

$loader = new DataLoader(
    dataset:   $trainDataset,
    batchSize: 128,
    shuffle:   true,
    dropLast:  true,
);

echo "Steps per epoch: " . $loader->steps() . "\n";

foreach ($loader->batches() as $batch) {
    $loss = $model->stepOnBatch($batch, clipGradNorm: 1.0);
}

Method	Returns	Description
`batches()`	`Generator<Dataset>`	Yields one batch per step. Shuffles at start of each call if enabled.
`steps()`	`int`	Number of batches per epoch
`batchSize()`	`int`	Configured batch size
`dataset()`	`Dataset`	The underlying Dataset
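
The effect of dropLast on steps() can be sketched with integer arithmetic. This assumes the usual meaning of drop-last, where a final partial batch is discarded; stepsPerEpoch() is a hypothetical helper, not a library function.

```php
<?php
// Hypothetical helper: number of batches one pass over the data produces.
function stepsPerEpoch(int $numRows, int $batchSize, bool $dropLast): int
{
    return $dropLast
        ? intdiv($numRows, $batchSize)                     // floor: partial batch discarded
        : intdiv($numRows + $batchSize - 1, $batchSize);   // ceil: partial batch kept
}

echo stepsPerEpoch(1000, 128, true) . "\n";   // 7
echo stepsPerEpoch(1000, 128, false) . "\n";  // 8
```

With 1000 rows and a batch size of 128, dropping the last batch skips 104 rows per epoch. With shuffling enabled, those are different rows each time.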

StreamingDataset

Pml\Data\StreamingDataset reads a large CSV file in fixed-size chunks without loading the entire file into memory. Each chunk is a Dataset in Tensor mode. Use it with Sequential::stepOnBatch() for datasets that exceed RAM.

use Pml\Data\StreamingDataset;

$stream = new StreamingDataset(
    filepath:    'data/huge.csv',
    chunkSize:   10_000,     // rows per chunk
    labelColumn: -1,          // last column
    hasHeader:   true,
);

for ($epoch = 1; $epoch <= 10; $epoch++) {
    foreach ($stream->chunks() as $chunk) {
        foreach ($chunk->batches(128) as $batch) {
            $model->stepOnBatch($batch);
        }
    }
}
$model->markTrained();
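
The chunked-reading idea can be sketched in plain PHP with a generator that holds at most one chunk of rows at a time. This illustrates the technique only: csvChunks() is a hypothetical helper built on fgetcsv(), whereas StreamingDataset does its reading in C and yields Datasets rather than arrays.

```php
<?php
// Hypothetical helper: yields arrays of up to $chunkSize numeric rows from a CSV file.
function csvChunks(string $path, int $chunkSize, bool $hasHeader = true): Generator
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        if ($hasHeader) {
            fgetcsv($fh, null, ',', '"', '\\');   // skip the header row
        }
        $chunk = [];
        while (($row = fgetcsv($fh, null, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue;                          // blank line
            }
            $chunk[] = array_map('floatval', $row);
            if (count($chunk) === $chunkSize) {
                yield $chunk;                      // hand over a full chunk, then start a new one
                $chunk = [];
            }
        }
        if ($chunk !== []) {
            yield $chunk;                          // final, possibly smaller chunk
        }
    } finally {
        fclose($fh);
    }
}

// Demo on a small temporary file.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "x,y\n1,2\n3,4\n5,6\n");

$chunks = iterator_to_array(csvChunks($path, 2), false);
unlink($path);
// $chunks is [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
```

Re-creating the generator for each epoch, as the loop above does with chunks(), rereads the file from the start.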
