Dataset & DataFrame
Dataset is the universal data container used by every estimator and transformer.
DataFrame is a typed tabular store (C-backed) for ETL operations.
DataLoader wraps a Dataset for mini-batch streaming.
Dataset Overview
A Dataset operates in two modes:
Tensor mode
Holds a Tensor for samples (shape [N, D]) and optionally a label Tensor ([N] or [N, C]).
All estimators and neural networks require Tensor mode.
ETL / DataFrame mode
Holds a C-backed DataFrameC* pointer. Supports heterogeneous column types (float, string, category).
Call materialize() to convert to Tensor mode.
Dataset::fromCSV() and Dataset::load() both use tensor_dataset_from_csv(), which mmaps the file and never allocates a PHP array of rows.
Creating Datasets
Dataset::fromCSV() loads a CSV via mmap(). The label column can be specified by name (if a header is present) or by integer index. It returns a Dataset in ETL mode; call materialize() to get Tensors.
$ds = Dataset::fromCSV('data/housing.csv', labelColumn: 'price');
$ds = $ds->materialize(); // convert ETL → Tensor mode
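To make the "name or integer index" rule concrete, here is a minimal plain-PHP sketch of how a label column might be resolved against a header row. The helper `resolveLabelColumn()` is hypothetical and not part of the library, and treating negative indices as counting from the end is an assumption.

```php
<?php
// Hypothetical helper (not a library function): map a label column given as
// a header name or an integer index to a zero-based column position.
// Assumption: negative indices count from the end, so -1 is the last column.
function resolveLabelColumn(array $header, int|string $label): int
{
    if (is_string($label)) {
        $idx = array_search($label, $header, true);
        if ($idx === false) {
            throw new InvalidArgumentException("Unknown label column: $label");
        }
        return $idx;
    }
    $n = count($header);
    $idx = $label < 0 ? $n + $label : $label;
    if ($idx < 0 || $idx >= $n) {
        throw new OutOfRangeException("Label index out of range: $label");
    }
    return $idx;
}
```

A name lookup only works when the file has a header row; without one, only the integer form can be resolved.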
Dataset::fromArray() constructs a Tensor-mode Dataset from PHP arrays, copying the data into C memory once. Use it only for small datasets; prefer CSV loading for production.
$ds = Dataset::fromArray(
    [[1.0, 2.0], [3.0, 4.0]],
    ['cat', 'dog']
);
Build a Dataset directly from existing Tensors. Neither tensor is copied: the Dataset wraps the existing pointers.
Set the global RNG seed used for shuffling and splits. Pass null for non-deterministic behavior.
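The effect of a fixed seed can be illustrated with a plain-PHP Fisher-Yates shuffle over row indices. This is only a sketch: `shuffledIndices()` is a hypothetical helper, it uses PHP's Mersenne Twister rather than the library's C-level RNG, and it will not reproduce the permutations the library generates.

```php
<?php
// Hypothetical helper: Fisher-Yates shuffle of the indices 0..$n-1.
// A non-null seed makes the permutation reproducible; null reseeds randomly.
function shuffledIndices(int $n, ?int $seed = null): array
{
    if ($n <= 0) {
        return [];
    }
    if ($seed !== null) {
        mt_srand($seed);   // deterministic sequence
    } else {
        mt_srand();        // random seed, non-deterministic
    }
    $idx = range(0, $n - 1);
    for ($i = $n - 1; $i > 0; $i--) {
        $j = mt_rand(0, $i);                       // pick from the unshuffled prefix
        [$idx[$i], $idx[$j]] = [$idx[$j], $idx[$i]];
    }
    return $idx;
}
```

Calling the function twice with the same seed yields the same order, which is what makes shuffled splits repeatable between runs.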
Accessing Data
| Method | Returns | Description |
|---|---|---|
| samples() | Tensor | Feature matrix [N, D] |
| labels() | ?Tensor | Label vector [N] or label matrix [N, C] |
| numRows() | int | Number of samples N |
| numColumns() | int | Feature dimensionality D |
| isLabeled() | bool | Whether labels are present |
| schema() | array | Column names and types (ETL mode) |
| categories(int $colIdx) | string[] | Unique string values in a categorical column (ETL mode) |
| describe() | array | Per-column statistics: mean, std, min, max, median |
| toArray() | array | Copies to a PHP array; avoid on large datasets |
Splitting & Folding
split() divides the Dataset into [$train, $test] without shuffling. Call randomize() first for a random split.
[$train, $test] = $ds->randomize()->split(0.8);
Stratified split that preserves class proportions in both parts. Recommended for imbalanced classification.
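The idea behind a stratified split can be sketched in plain PHP over row indices: group rows by class, then cut each group at the same ratio. `stratifiedSplit()` is a hypothetical helper, and rounding each class down with floor() is an assumption; the library's C implementation may round or shuffle differently.

```php
<?php
// Hypothetical helper: split row indices so each class keeps roughly the
// same train/test proportion. Labels are assumed to be ints or strings.
function stratifiedSplit(array $labels, float $ratio): array
{
    $byClass = [];
    foreach ($labels as $row => $label) {
        $byClass[$label][] = $row;   // bucket row indices per class
    }
    $train = [];
    $test = [];
    foreach ($byClass as $rows) {
        $cut = (int) floor(count($rows) * $ratio);
        $train = array_merge($train, array_slice($rows, 0, $cut));
        $test = array_merge($test, array_slice($rows, $cut));
    }
    return [$train, $test];
}
```

With 8 rows of one class and 2 of another, a ratio of 0.5 puts one minority row on each side, whereas a plain ordered split could leave the minority class out of the training set entirely.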
fold() yields k [$train, $val] pairs for K-fold cross-validation.
foreach ($ds->fold(5) as [$train, $val]) {
    $model->train($train);
    $score = $model->score($val);
}
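For intuition, K-fold partitioning can be sketched over row indices in plain PHP. `kFoldIndices()` is a hypothetical helper; contiguous folds whose sizes differ by at most one row are an assumption about how folds are laid out, not a statement about the library's behavior.

```php
<?php
// Hypothetical helper: yield [$trainIdx, $valIdx] pairs for K-fold CV.
// Every row index appears in exactly one validation fold.
function kFoldIndices(int $n, int $k): Generator
{
    if ($k < 2 || $k > $n) {
        throw new InvalidArgumentException('k must satisfy 2 <= k <= n');
    }
    $base = intdiv($n, $k);
    $extra = $n % $k;            // the first $extra folds get one more row
    $start = 0;
    for ($fold = 0; $fold < $k; $fold++) {
        $size = $base + ($fold < $extra ? 1 : 0);
        $end = $start + $size;   // exclusive end of the validation range
        $val = range($start, $end - 1);
        $train = array_merge(
            $start > 0 ? range(0, $start - 1) : [],
            $end < $n ? range($end, $n - 1) : []
        );
        yield [$train, $val];
        $start = $end;
    }
}
```

With 10 rows and 3 folds this produces validation folds of 4, 3, and 3 rows, each paired with the remaining rows for training.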
| Method | Description |
|---|---|
| head(int $n) | First N rows |
| tail(int $n) | Last N rows |
| take(int $n) | Take first N rows (alias of head) |
| leave(int $n) | Drop first N rows |
| slice(int $offset, int $length) | Slice $length rows starting at $offset |
Transforms & Utilities
| Method | Description |
|---|---|
| randomize() | Shuffle rows in-place (Fisher-Yates, C-level). Returns self. |
| standardize() | Standardize features to zero mean, unit variance in-place. Returns self. |
| dropNans() | Remove rows containing NaN in either samples or labels. Returns new Dataset. |
| filterByMask(Tensor $mask) | Keep rows where mask == 1.0. Returns new Dataset. |
| select(array $columns) | Keep only the specified column indices (ETL mode: column names). Returns self. |
| drop(array $columns) | Remove columns by index or name. Returns self. |
| stack(Dataset $other) | Concatenate two datasets row-wise (same number of columns required). Returns new Dataset. |
| join(Dataset $other) | Concatenate column-wise (same number of rows required). Returns new Dataset. |
| apply(callable $fn) | Apply a PHP closure to the samples Tensor. Returns new Dataset. Use sparingly (crosses the FFI boundary per call). |
| sortByColumn(int $col) | Sort rows ascending by feature column index. |
| oneHotEncode(int $colIdx) | One-hot expand an integer column. Returns new Dataset with expanded features. |
| withLabelColumn(int $col) | Move a feature column to labels and remove it from samples. Returns new Dataset. |
| materialize(?int $labelCol = null) | Convert ETL mode → Tensor mode. Pass a label column index to extract labels. |
| toCSV(string $path) | Write to a CSV file. Works in both modes. |
| bagOfWords(int\|string $col, ?int $maxFeatures) | Vectorize a text column using bag-of-words. Returns new Dataset with BoW features appended. |
| isEtlMode() | Returns true if the Dataset is in ETL / DataFrame mode. |
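As a reference for what "zero mean, unit variance" means, here is a plain-PHP sketch of column-wise standardization on a small row-major array. `standardizeColumns()` is a hypothetical helper; using the population standard deviation (dividing by N) and mapping constant columns to 0.0 are assumptions, and unlike standardize() it returns a new array rather than working in-place.

```php
<?php
// Hypothetical helper: z-score each column of a row-major matrix.
// Assumptions: population std (divide by N); constant columns become 0.0.
function standardizeColumns(array $rows): array
{
    $n = count($rows);
    if ($n === 0) {
        return [];
    }
    $d = count($rows[0]);
    $out = $rows;
    for ($j = 0; $j < $d; $j++) {
        $col = array_column($rows, $j);
        $mean = array_sum($col) / $n;
        $sumSq = 0.0;
        foreach ($col as $v) {
            $sumSq += ($v - $mean) ** 2;
        }
        $std = sqrt($sumSq / $n);
        for ($i = 0; $i < $n; $i++) {
            $out[$i][$j] = $std > 0.0 ? ($rows[$i][$j] - $mean) / $std : 0.0;
        }
    }
    return $out;
}
```

When evaluating on held-out data, the usual practice is to reuse the training mean and standard deviation rather than recomputing them on the test set.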
Batching
batches() yields Dataset objects of up to $batchSize rows. Each yielded Dataset wraps a view (no copy) into the parent's Tensor. The last batch may be smaller.
foreach ($dataset->batches(64) as $batch) {
    $X = $batch->samples(); // Tensor view [B, D], B <= 64
    $y = $batch->labels();  // Tensor view [B]
    $model->stepOnBatch($batch);
}
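A view-based batcher only needs an offset and a length per batch. The following plain-PHP sketch computes those pairs; `batchRanges()` is a hypothetical helper and does not reflect the library's internal layout.

```php
<?php
// Hypothetical helper: [offset, length] pairs for consecutive batches
// over $n rows. The final pair is shorter when $n is not a multiple
// of $batchSize.
function batchRanges(int $n, int $batchSize): array
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batchSize must be >= 1');
    }
    $ranges = [];
    for ($offset = 0; $offset < $n; $offset += $batchSize) {
        $ranges[] = [$offset, min($batchSize, $n - $offset)];
    }
    return $ranges;
}
```

For 150 rows and a batch size of 64 this gives two full batches and a final batch of 22 rows, so code that assumes a fixed batch dimension should check the actual row count of each batch.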
DataFrame
Pml\Data\DataFrame is a pure C-backed columnar store. It supports mixed column types
(float32, int32, string/category) and is used for ETL pipelines that need named columns
and filtering, group-by, and join operations before conversion to a training Tensor.
Construction
| Method | Description |
|---|---|
| DataFrame::fromCSV(string $path, bool $hasHeader = true) | Loads a CSV via mmap. Headers become column names. |
| DataFrame::fromTensor(Tensor $t, array $colNames) | Wrap an existing Tensor as a named-column DataFrame. |
| copy() | Deep copy; allocates new C memory for all columns. |
Metadata
| Method | Returns | Description |
|---|---|---|
| shape() | [int, int] | [rows, cols] |
| numRows() | int | Number of rows |
| numCols() | int | Number of columns |
| columns() | string[] | Column names |
| dtypes() | array | Map of column name → dtype string |
| describe() | Tensor | Summary statistics (mean, std, min, max per numeric column) |
| categories(string $col) | string[] | Unique values of a categorical column |
Selection & Filtering
| Method | Description |
|---|---|
| head(int $n = 5) | First N rows → new DataFrame |
| tail(int $n = 5) | Last N rows → new DataFrame |
| iloc(int $offset, int $length) | Row slice of $length rows starting at $offset |
| select(array $cols) | Keep named columns |
| drop(array $cols) | Remove named columns |
| col(string $name) | Get a column as a Tensor |
| where(string $col, string $op, float\|string $val) | Filter rows. Ops: '=', '!=', '<', '>', '<=', '>=' |
| dropNulls() | Remove rows with any null/NaN |
| sample(int $n, bool $replace, ?int $seed) | Random row sampling |
Mutation
| Method | Description |
|---|---|
| rename(array $mapping) | Rename columns: ['old' => 'new'] |
| withColumn(string $name, Tensor $data) | Add or replace a column with a Tensor |
| castToFloat(string $col) | Cast column to float32 in-place |
| fillNull(string $col, float $val) | Fill nulls with a constant |
| oneHotEncode(string $col) | Expand categorical column into binary indicator columns |
| sortBy(string $col, bool $ascending) | Sort all rows by a column |
Aggregation & Join
| Method | Description |
|---|---|
| valueCounts(string $col) | Count occurrences per unique value → new DataFrame |
| groupBy(string $col) | Returns a GroupBy object for aggregation |
| join(DataFrame $right, string $on, string $how = 'inner') | SQL-style join on a common key column. $how: 'inner', 'left', 'right' |
| concat(DataFrame[] $frames) | Row-wise concatenation of multiple DataFrames |
| toTensor(?array $cols = null) | Extract numeric columns to a Tensor [N, D] |
Target Encoding (C-accelerated)
// Fit on training data
$enc = $trainDf->targetEncodeFit('category_col', $trainLabels, smoothing: 10.0);
// Apply to test data
$testDf->targetEncodeTransform('category_col', $enc);
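The fit/transform pattern above can be illustrated with a plain-PHP sketch of smoothed target encoding. The formula used here, (sum_c + m · global_mean) / (n_c + m) with m as the smoothing weight, is a common additive-smoothing form and is an assumption; the source does not state which formula the C implementation uses. `fitTargetEncoding()` and `applyTargetEncoding()` are hypothetical helpers.

```php
<?php
// Hypothetical helper: learn one encoded value per category from training
// targets. Assumed formula: (sum_c + m * globalMean) / (n_c + m).
function fitTargetEncoding(array $categories, array $targets, float $smoothing): array
{
    $n = count($targets);
    if ($n === 0 || $n !== count($categories)) {
        throw new InvalidArgumentException('categories and targets must be non-empty and equal length');
    }
    $globalMean = array_sum($targets) / $n;
    $sum = [];
    $cnt = [];
    foreach ($categories as $i => $c) {
        $sum[$c] = ($sum[$c] ?? 0.0) + $targets[$i];
        $cnt[$c] = ($cnt[$c] ?? 0) + 1;
    }
    $map = [];
    foreach ($sum as $c => $s) {
        $map[$c] = ($s + $smoothing * $globalMean) / ($cnt[$c] + $smoothing);
    }
    return ['map' => $map, 'default' => $globalMean];
}

// Hypothetical helper: apply a fitted encoding. Categories not seen during
// fitting fall back to the global training mean (an assumption).
function applyTargetEncoding(array $categories, array $enc): array
{
    return array_map(fn($c) => $enc['map'][$c] ?? $enc['default'], $categories);
}
```

Fitting on the training frame only, then applying the stored mapping to the test frame, keeps test labels from leaking into the features. Larger smoothing values pull rare categories closer to the global mean.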
DataLoader
Pml\Data\DataLoader wraps a Dataset with shuffle-per-epoch,
drop-last, and prefetch options. Suitable for large training loops using Sequential::stepOnBatch().
use Pml\Data\DataLoader;
$loader = new DataLoader(
    dataset: $trainDataset,
    batchSize: 128,
    shuffle: true,
    dropLast: true,
);
echo "Steps per epoch: " . $loader->steps() . "\n";
foreach ($loader->batches() as $batch) {
    $loss = $model->stepOnBatch($batch, clipGradNorm: 1.0);
}
| Method | Returns | Description |
|---|---|---|
| batches() | Generator<Dataset> | Yields one batch per step. Shuffles at the start of each call if enabled. |
| steps() | int | Number of batches per epoch |
| batchSize() | int | Configured batch size |
| dataset() | Dataset | The underlying Dataset |
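How drop-last changes the number of steps can be shown with a short calculation. This sketch assumes that steps() counts only full batches when dropLast is enabled and rounds up otherwise; that behavior is not confirmed by the source, and `stepsPerEpoch()` is a hypothetical helper.

```php
<?php
// Hypothetical helper: batches per epoch for a given dataset size.
// Assumption: dropLast discards a trailing partial batch.
function stepsPerEpoch(int $numRows, int $batchSize, bool $dropLast): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batchSize must be >= 1');
    }
    return $dropLast
        ? intdiv($numRows, $batchSize)                    // full batches only
        : intdiv($numRows + $batchSize - 1, $batchSize);  // ceiling division
}
```

Under these assumptions, 1,000 rows with a batch size of 128 give 7 steps with drop-last and 8 without.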
StreamingDataset
Pml\Data\StreamingDataset reads a large CSV file in fixed-size chunks without
loading the entire file into memory. Each chunk is a Dataset in Tensor mode.
Use it with Sequential::stepOnBatch() for datasets that exceed RAM.
use Pml\Data\StreamingDataset;
$stream = new StreamingDataset(
    filepath: 'data/huge.csv',
    chunkSize: 10_000,  // rows per chunk
    labelColumn: -1,    // last column
    hasHeader: true,
);
for ($epoch = 1; $epoch <= 10; $epoch++) {
    foreach ($stream->chunks() as $chunk) {
        foreach ($chunk->batches(128) as $batch) {
            $model->stepOnBatch($batch);
        }
    }
}
$model->markTrained();
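The chunked-reading idea can be sketched in plain PHP with a generator that never holds more than one chunk of rows. This illustrates the pattern only: `csvChunks()` is a hypothetical helper, it uses fgetcsv() rather than mmap, it yields PHP arrays instead of Tensor-mode Datasets, and it assumes every field is numeric.

```php
<?php
// Hypothetical helper: yield arrays of at most $chunkSize parsed rows.
// Assumption: all fields are numeric and are converted with floatval().
function csvChunks(string $path, int $chunkSize, bool $hasHeader = true): Generator
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        if ($hasHeader) {
            fgetcsv($fh, null, ',', '"', '');   // skip the header row
        }
        $chunk = [];
        while (($row = fgetcsv($fh, null, ',', '"', '')) !== false) {
            if ($row === [null]) {
                continue;                        // blank line
            }
            $chunk[] = array_map('floatval', $row);
            if (count($chunk) === $chunkSize) {
                yield $chunk;
                $chunk = [];
            }
        }
        if ($chunk !== []) {
            yield $chunk;                        // trailing partial chunk
        }
    } finally {
        fclose($fh);                             // runs even if iteration stops early
    }
}
```

Because the generator reopens the file on each call, iterating it once per epoch mirrors the nested epoch/chunk/batch loop shown above.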