Core Dataset Engine

The dataset subsystem separates ETL (extract, transform, load) from tensor execution. CSV ingestion and feature-pipeline logic stay in C, and data is converted to numeric tensors only when they are required.

ETL mode vs Tensor mode

ETL mode

Tensor mode

Lifecycle

CSV file
   ├─ tensor_dataset_from_csv() → direct Tensor mode (numeric-only)
   └─ df_read_csv() → ETL DataFrame
                    ├─ ETL transforms
                    └─ df_to_tensor() → Tensor mode
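
The diagram marks the direct path as numeric-only. As a rough illustration of what that constraint means for a caller, the sketch below checks whether every data cell in a CSV parses as a number. The helper name csvIsNumericOnly() is hypothetical and not part of the library, and whether the engine performs an equivalent check internally is not stated here.

```php
<?php
declare(strict_types=1);

/**
 * Caller-side pre-check (hypothetical helper, not part of the library):
 * returns true when every data cell in the CSV parses as a number.
 */
function csvIsNumericOnly(string $filepath, bool $hasHeader = true): bool
{
    $handle = fopen($filepath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$filepath}");
    }

    try {
        if ($hasHeader) {
            fgetcsv($handle, null, ',', '"', '\\'); // skip the header row
        }
        while (($row = fgetcsv($handle, null, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // blank line
            }
            foreach ($row as $cell) {
                if (!is_numeric(trim((string) $cell))) {
                    return false; // text or empty cell: needs the ETL path
                }
            }
        }
        return true;
    } finally {
        fclose($handle); // always release the file handle
    }
}
```

A file that passes this check is a candidate for the direct Tensor-mode path; anything else would go through the ETL DataFrame and its transforms first.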

Lazy materialization

Dataset::load() creates an ETL dataset without immediate tensor conversion. materialize() triggers the conversion on demand.
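
The pattern can be sketched with a toy class. LazyRows below is a hypothetical stand-in, not the library's Dataset: it keeps raw string rows, lets an ETL-style step run on them, and builds the numeric form only when materialize() is called. Caching the converted result is an assumption of this sketch, not a documented property of the engine.

```php
<?php
declare(strict_types=1);

/**
 * Toy illustration of lazy materialization (hypothetical class, not the
 * library's Dataset): rows stay as raw strings until materialize() runs.
 */
final class LazyRows
{
    /** @var list<list<float>>|null numeric form, built on demand */
    private ?array $tensor = null;

    /** @param list<list<string>> $rows raw ETL-style rows */
    public function __construct(private array $rows)
    {
    }

    /** ETL step: operates on raw rows, no numeric conversion. */
    public function dropEmpty(): self
    {
        $kept = array_filter(
            $this->rows,
            fn (array $row): bool => !in_array('', $row, true)
        );
        return new self(array_values($kept));
    }

    public function isMaterialized(): bool
    {
        return $this->tensor !== null;
    }

    /** Converts once; repeated calls reuse the cached result. */
    public function materialize(): self
    {
        if ($this->tensor === null) {
            $this->tensor = array_map(
                fn (array $row): array => array_map('floatval', $row),
                $this->rows
            );
        }
        return $this;
    }

    /** @return list<list<float>> */
    public function values(): array
    {
        return $this->materialize()->tensor;
    }
}

// The ETL step runs first; nothing numeric exists until values() is read.
$data = (new LazyRows([['1', '2.5'], ['', '3'], ['4', '5']]))->dropEmpty();
```

The point of deferring the conversion is that ETL steps can reshape the data before any numeric buffer is allocated.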

Fast path vs fallback

Data flow and memory

Dataset::fromCSV() returns either:

In both cases, the dataset owns only the C pointer or tensor wrappers.

Key methods

Dataset::load(string $filepath, bool $hasHeader = true): self

Dataset::fromCSV(string $filepath, int $labelColumn = -1, bool $hasHeader = true): self

Dataset::materialize(int $labelCol = -1): self

Code example

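use Pml\Dataset;

// Load the CSV in ETL mode; no tensors are built yet.
$dataset = Dataset::load('datasets/housing.csv')
    ->withLabelColumn(0)         // use column 0 as the label
    ->dropNans()                 // discard NaN values
    ->oneHotEncode(2)            // one-hot encode column 2
    ->materialize(labelCol: 0);  // convert to Tensor mode on demand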

C-level behavior

Performance implications

When to use

When not to use