Estimators

PML provides 45 estimators covering classification, regression, anomaly detection, clustering, dimensionality reduction, and meta-estimation. All implement the Learner interface: train(Dataset) and predict(Dataset).

Estimator Interfaces

| Interface | Methods | Description |
|---|---|---|
| Estimator | type(): EstimatorType | Root interface for all estimators |
| Learner | train(Dataset): void · predict(Dataset): Tensor | Supervised and unsupervised learners |
| Probabilistic | predictProba(Dataset): Tensor | Returns class probability matrix [N, C] |
| Scoring | score(Dataset): float | Returns accuracy (classifiers) or R² (regressors) |
| Online | partialFit(Dataset): void | Incremental / streaming training |
| RanksFeatures | featureImportances(): Tensor | Returns per-feature importance scores |
| Verbose | setLogger(LoggerInterface): void | PSR-3 training progress logging |
| Persistable | save(string): void · load(string): static | Checkpoint save/load via SafeTensors |
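
Because capabilities are split across interfaces, calling code can check for an optional capability with instanceof before using it. The following is a minimal sketch of that pattern, not PML's actual code: the interface names follow the table above, but the array-based signatures, the MajorityClass class, and the describe() helper are hypothetical stand-ins for Dataset, Tensor, and a real estimator.

```php
<?php
declare(strict_types=1);

// Stand-in declarations: names mirror the interface table, but the array-based
// signatures are simplifications for this sketch, not PML's real types.
interface Learner
{
    public function train(array $samples, array $labels): void;
    public function predict(array $samples): array;
}

interface Probabilistic
{
    public function predictProba(array $samples): array;
}

// Hypothetical baseline that always predicts the most frequent training label.
final class MajorityClass implements Learner, Probabilistic
{
    /** @var array<int|string, float> label => relative frequency, most frequent first */
    private array $priors = [];

    public function train(array $samples, array $labels): void
    {
        $counts = array_count_values($labels);
        arsort($counts); // most frequent label first
        $this->priors = [];
        foreach ($counts as $label => $count) {
            $this->priors[$label] = $count / count($labels);
        }
    }

    public function predict(array $samples): array
    {
        return array_fill(0, count($samples), array_key_first($this->priors));
    }

    public function predictProba(array $samples): array
    {
        // Every sample gets the same class-prior distribution.
        return array_fill(0, count($samples), $this->priors);
    }
}

// Use probabilities when the estimator offers them, otherwise fall back to labels.
function describe(Learner $model, array $samples): string
{
    if ($model instanceof Probabilistic) {
        return 'proba: ' . json_encode($model->predictProba($samples));
    }
    return 'labels: ' . json_encode($model->predict($samples));
}

$model = new MajorityClass();
$model->train([[0.1], [0.4], [0.9]], ['spam', 'ham', 'spam']);
echo describe($model, [[0.5]]), PHP_EOL;
```

The same check would apply to Scoring, RanksFeatures, or Persistable when a pipeline accepts arbitrary estimators.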

Classifiers

Namespace: Pml\Estimators\Classifiers\

| Class | Constructor Parameters | Interfaces | Notes |
|---|---|---|---|
| GBDTClassifier | nEstimators=100, maxDepth=4, numBins=255, lr=0.1, lambda=1.0, minChildWeight=1, subSample=1.0, colSampleByTree=1.0 | Scoring, RanksFeatures | Histogram-based leaf-wise GBDT (LightGBM-style). C-level trees. |
| RandomForestClassifier | nEstimators=100, maxDepth=10, minSamplesSplit=2 | Scoring | Bagged CART trees, random feature subsets. Parallelized over trees. |
| ExtraTreesClassifier | nEstimators=100, maxDepth=10, minSamplesSplit=2 | Scoring | Extremely randomized trees: random split thresholds, lower variance. |
| DecisionTreeClassifier | maxDepth=10, minSamplesSplit=2, maxFeatures=null | Scoring | CART with Gini impurity. C-level split search. |
| LogisticRegression | epochs=100, learningRate=0.01, batchSize=32 | Scoring, Probabilistic | Binary or multi-class via softmax. Mini-batch SGD. |
| SoftmaxClassifier | epochs, learningRate, batchSize, lambda | Probabilistic | Multi-class linear classifier with L2 regularization. |
| SVC | c=1.0, epochs=100, learningRate=0.01, batchSize=32 | Scoring | Linear SVM with hinge loss. SGD optimization. |
| MLPClassifier | hidden=[100], epochs=100, batchSize=32, learningRate=0.001, dropout=0.0 | Scoring, Probabilistic | Multi-layer perceptron using Sequential internally. ReLU activations. |
| AdaBoostClassifier | nEstimators=50, learningRate=1.0 | Scoring | SAMME algorithm with DecisionTree weak learners. |
| LogitBoost | nEstimators, learningRate, maxDepth | Probabilistic | Gradient boosting with logistic loss. Probabilistic outputs. |
| GaussianNB | | Probabilistic | Gaussian Naive Bayes. C-level log-likelihood computation. |
| BernoulliNB | alpha=1.0 | Probabilistic | For binary feature vectors. Laplace smoothing. |
| MultinomialNB | alpha=1.0 | Probabilistic | For count features (e.g., TF-IDF). Laplace smoothing. |
| CategoricalNB | alpha=1.0 | Probabilistic | For categorical integer-encoded features. |
| KNNClassifier | k=5 | Scoring, Probabilistic | Brute-force KNN via C pairwise-L2. O(N·D) per query. |
| KDNeighborsClassifier | k=5, leafSize=30 | Scoring | KD-Tree based KNN. Faster for low-D (<20 features). |
| RadiusNeighborsClassifier | radius=1.0 | Scoring | Classifies using all neighbors within a fixed radius. |
| OneVsRest | prototype: Learner | Scoring | Wraps any binary classifier for multi-class OVR decomposition. |
| VotingClassifier | estimators: Learner[] | Scoring | Majority-vote ensemble. Optionally soft-vote with Probabilistic members. |
| DummyClassifier | | | Returns the most frequent class label. Baseline. |
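
The difference between majority voting and soft voting, as mentioned for VotingClassifier, can be illustrated with a small standalone sketch. It does not use PML's classes: hardVote() and softVote() are hypothetical helpers working on plain arrays, and the tie-breaking rule (lowest class index wins) is an assumption of this example.

```php
<?php
declare(strict_types=1);

/**
 * Hard (majority) voting: each member casts one vote, its predicted class index.
 * Ties resolve to the lowest class index (an assumption of this sketch).
 *
 * @param int[] $votes predicted class index per ensemble member
 */
function hardVote(array $votes): int
{
    $counts = array_count_values($votes);
    ksort($counts); // visit classes in ascending order for a deterministic tie-break
    $best = array_key_first($counts);
    foreach ($counts as $class => $n) {
        if ($n > $counts[$best]) {
            $best = $class;
        }
    }
    return (int) $best;
}

/**
 * Soft voting: average the members' probability vectors, then take the argmax.
 *
 * @param float[][] $probas one probability vector per ensemble member
 */
function softVote(array $probas): int
{
    $mean = array_fill(0, count($probas[0]), 0.0);
    foreach ($probas as $row) {
        foreach ($row as $c => $p) {
            $mean[$c] += $p / count($probas);
        }
    }
    return (int) array_search(max($mean), $mean, true);
}

// Three members, two classes: two lean weakly toward class 0,
// one is very confident in class 1.
$probas = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]];
$votes  = [0, 0, 1];

echo hardVote($votes), PHP_EOL;  // 0: two of three members vote for class 0
echo softVote($probas), PHP_EOL; // 1: mean probabilities are [0.4, 0.6]
```

Soft voting lets a confident member outweigh several weakly decided ones, which is why it requires members that expose probabilities.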

Regressors

Namespace: Pml\Estimators\Regression\

| Class | Constructor Parameters | Notes |
|---|---|---|
| GBDTRegressor | nEstimators=100, maxDepth=4, numBins=255, lr=0.1, lambda=1.0 | Histogram leaf-wise GBDT with MSE loss. Same C kernel as GBDTClassifier. |
| GradientBoostingRegressor | nEstimators, learningRate, maxDepth, loss | Stage-wise gradient boosting with pluggable loss (MSE, Huber, MAE). |
| RandomForestRegressor | (see classifier) | Alias for ExtraTreeRegressor ensemble. |
| ExtraTreeRegressor | nEstimators, maxDepth, minSamplesSplit | Extremely randomized regression trees. |
| DecisionTreeRegressor | maxDepth=3, minSamplesSplit=2, maxFeatures=null | CART with MSE criterion. |
| LinearRegression | | Closed-form OLS via LAPACKE SGELSD. O(D³), for small D only. |
| Ridge | alpha=1.0, epochs=100, learningRate=0.01, batchSize=32 | L2-regularized linear regression. Mini-batch SGD. |
| Lasso | alpha=1.0, epochs=100, learningRate=0.01, batchSize=32 | L1-regularized via subgradient SGD. Produces sparse weights. |
| ElasticNet | alpha=1.0, l1Ratio=0.5, epochs=100, learningRate=0.01, batchSize=32 | Combines L1 + L2 regularization. |
| Adaline | epochs, learningRate, batchSize | Adaptive linear neuron. Mini-batch gradient descent, MSE loss. |
| MLPRegressor | hidden=[100], epochs=100, batchSize=32, learningRate=0.001 | Deep MLP for regression. Outputs continuous values via linear head. |
| SVR | c, epsilon, kernel, epochs, learningRate | ε-insensitive SVR. Hinge loss SGD with RBF or linear kernel. |
| KNNRegressor | k=5 | KNN regression: average of k nearest target values. |
| KDNeighborsRegressor | k=5 | KD-Tree KNN regression. |
| VotingRegressor | estimators: Learner[] | Average predictions from an ensemble of regressors. |
| DummyRegressor | | Returns training label mean. Baseline. |
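
To make the relationship between Ridge, Lasso, and ElasticNet concrete, the sketch below computes an elastic-net penalty for a weight vector. The exact scaling PML uses is not stated here, so the formula follows the common scikit-learn convention as an assumption; elasticNetPenalty() is a hypothetical helper, not a PML function.

```php
<?php
declare(strict_types=1);

/**
 * Elastic-net penalty under the scikit-learn convention (assumed, not confirmed for PML):
 *   alpha * ( l1Ratio * ||w||_1 + 0.5 * (1 - l1Ratio) * ||w||_2^2 )
 *
 * l1Ratio = 1.0 gives a pure L1 (Lasso-style) penalty,
 * l1Ratio = 0.0 gives a pure L2 (Ridge-style) penalty.
 *
 * @param float[] $weights
 */
function elasticNetPenalty(array $weights, float $alpha, float $l1Ratio): float
{
    $l1 = 0.0; // sum of absolute weights
    $l2 = 0.0; // sum of squared weights
    foreach ($weights as $w) {
        $l1 += abs($w);
        $l2 += $w * $w;
    }
    return $alpha * ($l1Ratio * $l1 + 0.5 * (1.0 - $l1Ratio) * $l2);
}

$w = [0.5, -2.0, 0.0];
echo elasticNetPenalty($w, 1.0, 1.0), PHP_EOL; // 2.5    (L1 only: 0.5 + 2.0)
echo elasticNetPenalty($w, 1.0, 0.0), PHP_EOL; // 2.125  (L2 only: 0.5 * 4.25)
echo elasticNetPenalty($w, 1.0, 0.5), PHP_EOL; // 2.3125 (equal mix)
```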

Anomaly Detectors

Namespace: Pml\Estimators\AnomalyDetectors\

Anomaly detectors implement Learner. predict() returns a Tensor of labels: 1.0 = inlier, -1.0 = outlier.

| Class | Constructor Parameters | Notes |
|---|---|---|
| IsolationForest | nEstimators=100, sampleSize=256, contamination=0.1 | Builds random isolation trees. Anomaly score via average path length. C-accelerated. |
| LocalOutlierFactor | k=20 | LOF score based on local reachability density vs k neighbors. |
| GaussianMLE | threshold=-10.0 | Fits a multivariate Gaussian. Points below log-likelihood threshold are anomalies. |
| OneClassSVM | nu=0.1, kernel=RBF(0.1), epochs=100, learningRate=0.01 | One-class SVM with RBF or linear kernel. SGD optimization. |
| Loda | nProjections=100, bins=10 | Lightweight Online Detector of Anomalies. Random projections + histogram density. |
| RobustZScore | threshold=3.5 | Modified Z-Score using median absolute deviation. Robust to outliers in the reference. |
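
The modified z-score behind RobustZScore can be sketched for a single feature as follows. This is an illustration of the standard formula, 0.6745 · (x − median) / MAD, using plain arrays; median() and robustZLabels() are hypothetical helpers, and how PML handles multiple features or a zero MAD is not covered by this page, so those choices are assumptions.

```php
<?php
declare(strict_types=1);

/** Median of a non-empty list of numbers. */
function median(array $values): float
{
    sort($values); // sorts the local copy only
    $n   = count($values);
    $mid = intdiv($n, 2);
    return $n % 2 === 1
        ? (float) $values[$mid]
        : ($values[$mid - 1] + $values[$mid]) / 2.0;
}

/**
 * Modified z-score labels for one feature.
 * Returns 1.0 for inliers and -1.0 for outliers, matching the label convention above.
 *
 * @param float[] $x
 * @return float[]
 */
function robustZLabels(array $x, float $threshold = 3.5): array
{
    $med = median($x);
    // MAD: median of absolute deviations from the median.
    $mad = median(array_map(fn ($v) => abs($v - $med), $x));
    if ($mad == 0.0) {
        // Degenerate case (assumption of this sketch): treat everything as an inlier.
        return array_fill(0, count($x), 1.0);
    }
    return array_map(
        fn ($v) => abs(0.6745 * ($v - $med) / $mad) > $threshold ? -1.0 : 1.0,
        $x
    );
}

// Median is 10.25 and MAD is 0.75, so 50.0 scores about 35.7 and is flagged.
print_r(robustZLabels([10.0, 11.0, 9.0, 10.5, 9.5, 50.0]));
```

Because both the center and the scale are medians, the single extreme value barely shifts them, which is what makes the score robust to outliers in the reference data.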

Clusterers

Namespace: Pml\Estimators\Clusterers\

Clusterers implement Learner. predict() returns cluster index labels.

| Class | Constructor Parameters | Notes |
|---|---|---|
| KMeans | k=3, maxIter=300, tolerance=1e-4 | Lloyd's algorithm. Centroid init via KMeans++ seeder. C-level assignment and update. |
| GaussianMixture | k=3, maxIter=100, tolerance=1e-4 | EM algorithm for GMM. Soft cluster assignments. Uses LAPACKE for covariance inversion. |
| FuzzyCMeans | k=3, fuzziness=2.0, maxIter=300, tolerance=1e-4 | Fuzzy / soft clustering. Fuzziness m > 1; higher m = softer boundaries. |
| DBSCAN | epsilon=0.5, minSamples=5 | Density-based. No pre-specified k. Marks noise points as label -1. |
| MeanShift | bandwidth=1.0, maxIter=100, tolerance=1e-4 | Non-parametric; finds cluster centers as density peaks. Automatically determines k. |

KMeans Seeders

Pass one of the following seeders via the seeder: parameter of KMeans or GaussianMixture:

| Seeder | Description |
|---|---|
| PlusPlus | KMeans++ (default); reduces poor initializations |
| KMC2 | Fast KMeans++ approximation: O(k) vs O(Nk) |
| Random | Uniform random seeding; fast but prone to bad local minima |
| Preset | Fixed user-supplied centroids |
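
For readers unfamiliar with KMeans++ seeding, here is a minimal sketch of the idea on plain arrays: the first centroid is drawn uniformly, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen. sqDist() and plusPlusSeed() are hypothetical names for this example and are not PML's PlusPlus implementation.

```php
<?php
declare(strict_types=1);

/** Squared Euclidean distance between two equal-length vectors. */
function sqDist(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $v) {
        $sum += ($v - $b[$i]) ** 2;
    }
    return $sum;
}

/**
 * KMeans++ seeding sketch.
 *
 * @param float[][] $points
 * @return float[][] $k initial centroids taken from $points
 */
function plusPlusSeed(array $points, int $k): array
{
    $n = count($points);
    $centroids = [$points[mt_rand(0, $n - 1)]]; // first centroid: uniform draw

    while (count($centroids) < $k) {
        // Squared distance from each point to its nearest chosen centroid.
        $d2 = [];
        foreach ($points as $p) {
            $d2[] = min(array_map(fn ($c) => sqDist($p, $c), $centroids));
        }
        $total = array_sum($d2);
        if ($total <= 0.0) {
            // Every point coincides with a centroid; fall back to a uniform draw.
            $centroids[] = $points[mt_rand(0, $n - 1)];
            continue;
        }
        // Roulette-wheel selection proportional to d2.
        $r = (mt_rand() / mt_getrandmax()) * $total;
        $chosen = $n - 1;
        $acc = 0.0;
        foreach ($d2 as $i => $d) {
            $acc += $d;
            if ($d > 0.0 && $acc >= $r) {
                $chosen = $i;
                break;
            }
        }
        $centroids[] = $points[$chosen];
    }
    return $centroids;
}

// Two tight groups: the second centroid always lands in the group not yet covered.
print_r(plusPlusSeed([[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]], 2));
```

Points far from every existing centroid are the most likely to be picked, which spreads the initial centroids out and is why this seeding reduces poor initializations compared with uniform random seeding.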

Decomposition & Manifold

| Class | Namespace | Parameters | Notes |
|---|---|---|---|
| PrincipalComponentAnalysis | Estimators\Decomposition\ | nComponents=2 | Full SVD via LAPACKE. Returns loadings, explained variance ratio. |
| TSNE | Estimators\Manifold\ | nComponents=2, perplexity=30, lr=200, maxIter=1000 | t-SNE via Barnes-Hut approximation. For visualization. O(N log N) per iter. |
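
The explained variance ratio reported by an SVD-based PCA follows directly from the singular values of the centered data matrix: each component's share is its squared singular value divided by the sum of all squared singular values. The sketch below shows that arithmetic only; explainedVarianceRatio() is a hypothetical helper, and how PML exposes these values is not shown on this page.

```php
<?php
declare(strict_types=1);

/**
 * Explained variance ratio from the singular values of centered data.
 * Pass all singular values, not only the retained components, so that the
 * ratios are relative to the total variance.
 *
 * @param float[] $singularValues
 * @return float[]
 */
function explainedVarianceRatio(array $singularValues): array
{
    $variances = array_map(fn ($s) => $s * $s, $singularValues);
    $total = array_sum($variances);
    return array_map(fn ($v) => $v / $total, $variances);
}

// Squared values are 9, 4, 1, so the ratios are 9/14, 4/14, and 1/14.
print_r(explainedVarianceRatio([3.0, 2.0, 1.0]));
```

With nComponents=2 in this example, the two retained components would account for 13/14 (about 93%) of the variance.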

Meta Estimators

| Class | Namespace | Description |
|---|---|---|
| GridSearch | Estimators\Meta\ | Exhaustive hyperparameter search over a parameter grid. Trains one model per combination. |
| RandomSearch | Estimators\Meta\ | Random hyperparameter sampling; faster than grid for large spaces. |
| PlattScaler | Estimators\Meta\ | Wraps any classifier and calibrates probability outputs via Platt scaling (logistic regression on decision scores). |
| BootstrapAggregator | Pml\ | Bagging meta-estimator. Wraps any Learner, trains N copies on bootstrap samples, averages/votes predictions. |
| CommitteeMachine | Pml\ | Weighted ensemble of experts. Weights are updated based on per-expert validation error. |
| StackingRegressor | Pml\Ensemble\ | Stacked generalization: base estimators → meta-regressor trained on OOF predictions. |
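
Platt scaling, as used by PlattScaler, amounts to fitting a one-dimensional logistic regression that maps raw decision scores to probabilities. The sketch below fits the two parameters with plain batch gradient descent on the log loss. It is an illustration under simplifying assumptions: fitPlatt() and plattProba() are hypothetical names, the sign convention is sigmoid(a·score + b), and Platt's smoothed targets and Newton-style solver are omitted.

```php
<?php
declare(strict_types=1);

/** Calibrated probability for a raw decision score. */
function plattProba(float $score, float $a, float $b): float
{
    return 1.0 / (1.0 + exp(-($a * $score + $b)));
}

/**
 * Fit slope $a and intercept $b by batch gradient descent on the log loss.
 *
 * @param float[] $scores raw decision scores from the wrapped classifier
 * @param int[]   $labels 0/1 ground-truth labels, ideally from held-out data
 * @return float[] [$a, $b]
 */
function fitPlatt(array $scores, array $labels, int $epochs = 2000, float $lr = 0.1): array
{
    $a = 0.0;
    $b = 0.0;
    $n = count($scores);
    for ($epoch = 0; $epoch < $epochs; $epoch++) {
        $gradA = 0.0;
        $gradB = 0.0;
        foreach ($scores as $i => $f) {
            $error  = plattProba($f, $a, $b) - $labels[$i];
            $gradA += $error * $f;
            $gradB += $error;
        }
        $a -= $lr * $gradA / $n;
        $b -= $lr * $gradB / $n;
    }
    return [$a, $b];
}

// Overlapping classes, so the fitted slope stays finite.
$scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0];
$labels = [0, 0, 1, 0, 1, 1];
[$a, $b] = fitPlatt($scores, $labels);

printf("P(y=1 | score=2.0)  = %.3f\n", plattProba(2.0, $a, $b));
printf("P(y=1 | score=-2.0) = %.3f\n", plattProba(-2.0, $a, $b));
```

Fitting on held-out scores rather than training scores matters in practice, since scores on the training set tend to be overconfident.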

Usage Example

use Pml\Estimators\Classifiers\GBDTClassifier;
use Pml\Dataset;

// Train ($trainSet and $testSet are Dataset instances prepared beforehand)
$clf = new GBDTClassifier(
    nEstimators: 300,
    maxDepth:    6,
    lr:          0.05,
    lambda:      1.0,
    subSample:   0.8,
);
$clf->train($trainSet);

// Predict
$preds = $clf->predict($testSet);   // Tensor [N] of class indices
$score = $clf->score($testSet);     // float accuracy

// Feature importance
$fi = $clf->featureImportances();   // Tensor [D]

// Save / load
$clf->save('models/gbdt/');
$clf2 = GBDTClassifier::load('models/gbdt/');

// Cross-validation with any estimator
use Pml\CrossValidation\StratifiedKFold;

$cv     = new StratifiedKFold(k: 5);
$scores = $cv->score(new GBDTClassifier(), $dataset);
echo "Mean CV accuracy: " . array_sum($scores) / count($scores);

// Hyperparameter search
use Pml\Estimators\Meta\GridSearch;

$gs = new GridSearch(
    estimator: new GBDTClassifier(),
    grid: [
        'nEstimators' => [100, 200, 300],
        'maxDepth'    => [4, 6, 8],
        'lr'          => [0.05, 0.1],
    ],
    metric: 'accuracy',
    cv:     3,
);
$gs->train($dataset);
echo "Best params: "; print_r($gs->bestParams());
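
To estimate the cost of a grid search before running it, it helps to enumerate the grid. The sketch below expands a parameter grid into every combination using a Cartesian product; expandGrid() is a hypothetical helper for illustration and says nothing about how GridSearch is implemented internally. The fit count assumes one fit per combination per fold and ignores any final refit.

```php
<?php
declare(strict_types=1);

/**
 * Expand a parameter grid into the list of all parameter combinations.
 *
 * @param array<string, array> $grid parameter name => candidate values
 * @return array<int, array<string, mixed>>
 */
function expandGrid(array $grid): array
{
    $combos = [[]]; // start with one empty combination
    foreach ($grid as $name => $values) {
        $next = [];
        foreach ($combos as $combo) {
            foreach ($values as $value) {
                $next[] = $combo + [$name => $value];
            }
        }
        $combos = $next;
    }
    return $combos;
}

$grid = [
    'nEstimators' => [100, 200, 300],
    'maxDepth'    => [4, 6, 8],
    'lr'          => [0.05, 0.1],
];

$combos = expandGrid($grid);
echo count($combos), PHP_EOL;     // 18 combinations (3 * 3 * 2)
echo count($combos) * 3, PHP_EOL; // 54 fits with 3-fold cross-validation
```

The count grows multiplicatively with each added parameter, which is the situation where random sampling of the space becomes the cheaper option.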