NLP API
This section documents tokenization and text preprocessing.
Tokenizer interface
Pml\Tokenizers\Tokenizer
public function tokenize(string $text): array;
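As an illustration of the contract above, here is a minimal sketch of a custom tokenizer. CommaTokenizer is a hypothetical class, not part of the library, and the interface is redeclared locally only so the snippet runs on its own; in a real project you would implement Pml\Tokenizers\Tokenizer instead.

```php
<?php
declare(strict_types=1);

// Local stand-in that mirrors the documented signature, so this sketch
// runs without the library installed (assumption for illustration only).
interface Tokenizer
{
    public function tokenize(string $text): array;
}

// Hypothetical tokenizer: splits on commas, trims each piece,
// and drops empty tokens.
final class CommaTokenizer implements Tokenizer
{
    public function tokenize(string $text): array
    {
        $parts = array_map('trim', explode(',', $text));

        // array_filter preserves keys, so reindex to get a clean list.
        return array_values(
            array_filter($parts, fn (string $p): bool => $p !== '')
        );
    }
}

$tokens = (new CommaTokenizer())->tokenize('spam, ham, , eggs');
```

The only requirement the interface imposes is string in, array of tokens out, so any splitting rule can be wrapped this way.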
Built-in tokenizers
- Pml\Tokenizers\Whitespace
- Pml\Tokenizers\Word
- Pml\Tokenizers\WordStemmer
- Pml\Tokenizers\Sentence
- Pml\Tokenizers\NGram
- Pml\Tokenizers\KSkipNGram
- Pml\Tokenizers\CharGram
Each tokenizer returns an array of string tokens.
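To make the idea of whitespace and n-gram tokens concrete, the following plain-PHP sketch shows one common way to produce them. The helper names whitespaceTokens and wordNGrams are hypothetical, and the built-in tokenizers may differ in details such as separators or how they handle punctuation.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers for illustration; not the library's implementation.

/** Split on runs of whitespace, dropping empty tokens. */
function whitespaceTokens(string $text): array
{
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Build contiguous word n-grams of size $n, joined by a single space. */
function wordNGrams(array $words, int $n): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }

    $grams = [];
    // Slide a window of $n words across the token list.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $grams[] = implode(' ', array_slice($words, $i, $n));
    }

    return $grams;
}

$words   = whitespaceTokens("the quick  brown fox");
$bigrams = wordNGrams($words, 2);
```

In both cases the result is a flat array of strings, which matches the return type of tokenize().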
Text transformers
Pml\Transformers\WordCountVectorizer
A bag-of-words (BoW) transformer that builds its vocabulary in C and produces dense count features.
public function __construct(?int $maxFeatures = null, string $textColumn = 'text')
public function fit(Dataset $data): void
public function transform(Dataset $data): Dataset
public function fitTransform(Dataset $data): Dataset
public function fitted(): bool
public function vocabSize(): int
public function vocabPtr(): ?\FFI\CData
WordCountVectorizer requires a dataset in ETL mode (not tensor mode) that contains the configured text column.
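The sketch below shows the bag-of-words idea in plain PHP: build a vocabulary, then count each term per document into a dense row. It is not the library's C implementation. The function names are hypothetical, and it assumes that maxFeatures keeps the most frequent terms, which the documentation above does not confirm.

```php
<?php
declare(strict_types=1);

/**
 * Build a vocabulary (term => column index) from tokenized documents.
 * Assumption: when $maxFeatures is set, the most frequent terms are kept.
 */
function buildVocabulary(array $docs, ?int $maxFeatures = null): array
{
    $totals = [];
    foreach ($docs as $tokens) {
        foreach ($tokens as $t) {
            $totals[$t] = ($totals[$t] ?? 0) + 1;
        }
    }

    // PHP casts numeric-string keys to int, so cast back to string.
    $terms = array_map('strval', array_keys($totals));

    // Most frequent first; ties broken alphabetically for determinism.
    usort($terms, function (string $a, string $b) use ($totals): int {
        return ($totals[$b] <=> $totals[$a]) ?: strcmp($a, $b);
    });

    if ($maxFeatures !== null) {
        $terms = array_slice($terms, 0, $maxFeatures);
    }

    return array_flip($terms);
}

/** Turn one tokenized document into a dense count row. */
function countVector(array $tokens, array $vocab): array
{
    $row = array_fill(0, count($vocab), 0);
    foreach ($tokens as $t) {
        if (isset($vocab[$t])) {
            $row[$vocab[$t]]++;   // terms outside the vocabulary are ignored
        }
    }

    return $row;
}

$docs = [
    ['free', 'money', 'free'],
    ['meeting', 'money'],
    ['free', 'meeting', 'notes'],
];
$vocab  = buildVocabulary($docs, 3);
$matrix = array_map(fn (array $d): array => countVector($d, $vocab), $docs);
```

Every row has the same width as the vocabulary, which is what "dense count features" means here: zeros are stored explicitly rather than omitted.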
Pml\Transformers\TfIdfTransformer
public function fit(Dataset $dataset): void
public function transform(Dataset $dataset): Dataset
public function fitted(): bool
public function getStateDict(string $prefix = ''): array
public function loadStateDict(array $dict, string $prefix = ''): void
This transformer computes TF-IDF in C using tensor operations.
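For readers unfamiliar with TF-IDF, here is a small plain-PHP sketch of the reweighting applied to a count matrix. The exact formula used by TfIdfTransformer is not stated above; this sketch assumes the smoothed variant idf = ln((1 + n) / (1 + df)) + 1, and the function name tfidf is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Reweight a dense count matrix (rows = documents, columns = terms).
 * Assumed formula: tf * (ln((1 + n) / (1 + df)) + 1).
 */
function tfidf(array $counts): array
{
    $n = count($counts);
    if ($n === 0) {
        return [];
    }

    // Document frequency: in how many documents each term appears.
    $df = array_fill(0, count($counts[0]), 0);
    foreach ($counts as $row) {
        foreach ($row as $j => $c) {
            if ($c > 0) {
                $df[$j]++;
            }
        }
    }

    $out = [];
    foreach ($counts as $row) {
        $weighted = [];
        foreach ($row as $j => $c) {
            // Smoothing keeps the ratio finite even when df is zero.
            $idf = log((1 + $n) / (1 + $df[$j])) + 1.0;
            $weighted[] = $c * $idf;
        }
        $out[] = $weighted;
    }

    return $out;
}

// Term 0 appears in both documents, term 1 in only one.
$weights = tfidf([[1, 1], [1, 0]]);
```

A term that appears in every document keeps a weight equal to its raw count under this formula, while rarer terms are scaled up.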
Pml\Transformers\TextNormalizer
public function fit(Dataset $dataset): void
public function transform(Dataset $dataset): Dataset
public static function normalize(string $text): string
public function fitted(): bool
TextNormalizer normalizes text by lowercasing, stripping punctuation, and collapsing whitespace.
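The three steps named above can be sketched in plain PHP as follows. This is an approximation rather than the library's code: normalizeText is a hypothetical name, the sketch handles ASCII punctuation only, and the final trim is an extra choice made here.

```php
<?php
declare(strict_types=1);

// Hypothetical helper; the real TextNormalizer::normalize() may differ,
// for example in how it treats Unicode or apostrophes.
function normalizeText(string $text): string
{
    // 1. Lowercase.
    $text = strtolower($text);

    // 2. Strip ASCII punctuation.
    $text = preg_replace('/[[:punct:]]+/', '', $text) ?? '';

    // 3. Collapse runs of whitespace (spaces, tabs, newlines) to one space.
    $text = preg_replace('/\s+/', ' ', $text) ?? '';

    // Extra step assumed here: drop leading and trailing space.
    return trim($text);
}

$clean = normalizeText("  Hello,   WORLD!! It's\tme. ");
```

Running normalization before tokenization means that variants such as "World!" and "world" map to the same token.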
Dataset convenience
Dataset::bagOfWords($column, ?int $maxFeatures = null) is a helper that fits and transforms a WordCountVectorizer on the given text column.
Example usage
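use Pml\Dataset;
use Pml\Transformers\WordCountVectorizer;

// Load the CSV and mark column 0 as the label column.
$ds = Dataset::load('datasets/emails.csv', true)
    ->withLabelColumn(0);

// Cap the vocabulary at 1000 features and read text from the 'body' column.
$vectorizer = new WordCountVectorizer(maxFeatures: 1000, textColumn: 'body');

// Fit the vocabulary and transform the dataset in one step.
$dataset = $vectorizer->fitTransform($ds);

// Print the shape of the resulting samples.
print_r($dataset->samples()->shape());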
When to use it
- Use WordCountVectorizer for bag-of-words features from text columns.
- Use TfIdfTransformer after WordCountVectorizer to reweight term frequencies.
- Use TextNormalizer before tokenization to standardize raw strings.
Common mistakes
- Trying to fit WordCountVectorizer on a tensor-mode dataset.
- Assuming Dataset::bagOfWords() modifies the original dataset in-place; it returns a new dataset.
- Forgetting to set the label column before extractLabelTensor() or materialize().