Exploring DRESS Kit V2. Exploring new features and notable… | by Waihong Chung | Oct, 2024

0


Exploring new features and notable changes in the latest version of the DRESS Kit

Photo by Google DeepMind on Unsplash

Overview

Since the original DRESS Kit was first released in 2021, it has been successfully implemented in a handful of biomedical research projects. If you have never heard of the DRESS Kit, then you may be interested to know that it is a fully open-sourced, dependency-free, plain ES6 JavaScript library specifically designed for performing advanced statistical analysis and machine learning tasks. The DRESS Kit was aimed to serve biomedical researchers who are not trained biostatisticians and have no access to dedicated statistics software.

Not only was the DRESS Kit proven to be a practical and effective tool for analyzing complex datasets and building machine-learning models, but these real-world experiences have also provided us with valuable opportunities to identify potential areas of improvement to the DRESS Kit. To support certain new features and to achieve a substantial performance improvement, however, much of the original codebase has to be rewritten from scratch. After many sleepless nights and countless cups of coffee, we are finally ready to share with you — DRESS Kit V2.

Although the new version of the DRESS Kit is no longer backward compatible with the previous one, we have tried our best to preserve the method signatures (i.e. the name of the methods and the expected parameters) as much as possible. This means that research projects that were implemented using DRESS Kit V1 can be migrated to V2 with only a few modifications. This also means, however, that many of the feature enhancements may not be immediately obvious just by scanning through the source code. We will, therefore, spend some time in this article exploring the new features and notable changes in the latest version of the DRESS Kit.

New Features

Incremental Training
One of the most exciting new features in DRESS Kit V2 is the ability to perform incremental training on any regression or classification machine-learning algorithms. In the previous version of the DRESS Kit, this capability was only supported by the kNN algorithm and the multilayer perceptron algorithm. This feature allows models to be trained using larger datasets, but in a resource-efficient manner, or to adapt to evolving data sources in real time.

Photo by Alessia Cocconi on Unsplash

Here is the pseudocode to implement incremental training using the random forest algorithm.

// Create an empty model.
let model = DRESS.randomForst([], outcome, numericals, categoricals);
// Train the existing model using new samples. Repeat this step whenever a sufficient number of new training samples is accumulated.
model.train(samples);

Incremental training is implemented differently on different machine-learning algorithms. With the kNN algorithm, new samples are added to existing training samples, as a result, the model will increase in size over time. With the logistic regression or linear regression algorithm, existing regression coefficients are updated using the new training samples. With the random forest or gradient boosting algorithm, existing decision trees or branches of a decision tree can be pruned and new trees or new branches can be added based on the new training samples. With the multilayer perceptron algorithm, the weights and the biases of the neural network are updated as new training samples are added.

Model Tuning
Another exciting new feature in DRESS Kit V2 is the addition of the `dress-modeling.js` module, which contains methods to facilitate the tedious process of fine-tuning machine-learning models. These methods are designed to work with any regression or classification model created using the `dress-regression.js` module, the `dress-tree.js` module, and the `dress-neural.js` module. Because all of these tasks are rather computationally intensive, these methods are designed to work asynchronously by default.

  • Permutation Feature Importance
    The first method in this module is `DRESS.importances`, which computes permutation feature importance. It allows one to estimate the relative contribution of each feature to a trained model by randomly permuting the values of one of the features, thus breaking the correlation between said feature and the outcome.
// Split a sample dataset into training/vadilation dataset
const [trainings, validations] = DRESS.split(samples);
// Create a model using a training dataset.
let model = DRESS.gradientBoosting(trainings, outcome, numericals, categoricals);
// Compute the permutation feature importances using a validation dataset.
DRESS.print(
DRESS.importances(model, validations)
);
  • Cross Validation
    The second method in this module is `DRESS.crossValidate`, which performs k-fold cross-validation. It automatically divides a dataset into k (default is 5) equally sized folds, and applies each fold as a validation set while training a machine-learning model on the remaining k-1 folds. It helps assess model performance more robustly.
// Training parameters
const trainParams = [outcomes, features];
// Validation parameters
const validateParams = [0.5];
// Perform cross validation on sample dataset using the logistic regression algorithm. Note that the training parameters and validations parameters MUST be passed as arrays.
DRESS.print(
DRESS.crossValidate(DRESS.logistic, samples, trainParams, validateParams)
);
  • Hyperparameter Optimization
    The third, and perhaps the most powerful, method in this module is `DRESS.hyperparameters`, which performs automatic hyperparameter optimization, on any numerical hyperparameters, using a grid search approach with early stopping. It uses the `DRESS.crossValidate` method internally to assess model performance. There are several steps to the process. First, one must specify the initial values of the hyperparameters. Any hyperparameter that is not explicitly defined will be set to its default value by the machine-learning algorithm. Second, one must specify the end value of the search space for each hyperparameter that is being optimized. The order in which these hyperparameters are specified also determines the search order, therefore, it is advisable to specify the most pertinent hyperparameter first. Third, one must select a performance metric (e.g. `f1` for classification and `r2` for regression) for assessing model performance. Here is the pseudocode to perform automatic hyperparameter optimization on a multilayer perceptron algorithm.
// Specify the initial hyperparameter values. Hyperparameters that are not defined will be set to the default values by the multilayer perceptron algorithm itself.
const initial = {
alpha: 0.001,
epoch: 100,
dilution: 0.1,
layout: [20, 10]
}
// Specify the end values of the search space. Only hyperparameters that are being optimized are included.
const eventual = {
dilution: 0.6, // the dilution hyperparameter will be searched first.
epoch: 1000 // the epoch hyperparameter will be searched second.
// the alpha hyperparameter will not be optimized.
// the layout hyperparameter cannot be optimized since it is not strictly a numerical value.
}
// Specify the performace metric.
const metric = 'f1',
// Training parameters
const trainParams = [outcome, features];
DRESS.print(
DRESS.hyperparameters(initial, eventual, metric, DRESS.multilayerPerceptron, samples, trainParams)
)

Model Import & Export
One of the primary motivations for creating the DRESS Kit using plain JavaScript, instead of another high performance language, is to ensure cross-platform compatibility and ease of integration with other technologies. DRESS Kit V2 now includes methods to facilitate the distribution of trained models. The internal representations of the models have also been optimized to maximize portability.

// To export a model in JSON format.
DRESS.save(DRESS.deflate(model), 'model.json');
// To import a model from a JSON file.
DRESS.local('model.json').then(json => {
const model = DRESS.inflate(json)
})

Dataset Inspection
One of the most often requested features for DRESS Kit V2 is a method that is comparable to `pandas.DataFrame.info` in Python. We have, therefore, released a new method `DRESS.summary` in the `dress-descriptive.js` module for generating a concise summary from a dataset. Simply pass an array of objects as the parameter and the method will automatically identify the enumerable features, the data type (numeric vs categoric), and the number of `null` values found in these objects.

// Print a concise summary of the specified dataset.
DRESS.print(
DRESS.summary(samples)
);

Toy Dataset

Photo by Rick Mason on Unsplash

Last but not least, DRESS Kit V2 comes with a brand new toy dataset for testing and learning the various statistical methods and machine-learning algorithms. This toy dataset contains 6000 synthetic subjects modeled after a cohort of patients with various chronic liver diseases. Each subject includes 23 features, which consist of a combination of numerical and categorical features with varying cardinalities. Here is the structure of each subject:

{
ID: number, // Unique identifier
Etiology: string, // Etiology of liver disease (ASH, NASH, HCV, AIH, PBC)
Grade: number, // Degree of steatotsis (1, 2, 3, 4)
Stage: number, // Stage of fibrosis (1, 2, 3, 4)
Admissions: number[], // List of numerical IDs representing hospital admissions
Demographics: {
Age: number, // Age of subject
Barriers: string[], // List of psychosocial barriers
Ethnicity: string, // Ethnicity (white, latino, black, asian, other)
Gender: string // M or F
},
Exams: {
BMI: number // Body mass index
Ascites: string // Ascites on exam (none, small, large)
Encephalopathy: string // West Haven encephalopathy grade (0, 1, 2, 3, 4)
Varices: string // Varices on endoscopy (none, small, large)
},
Labs: {
WBC: number, // WBC count (1000/uL)
Hemoglobin: number, // Hemoglobin (g/dL)
MCV: number, // MCV (fL)
Platelet: number, // Platelet count (1000/uL)
AST: number, // AST (U/L)
ALT: number, // ALT (U/L)
ALP: number, // Alkaline Phosphatase (IU/L)
Bilirubin: number, // Total bilirubin (mg/dL)
INR: number // INR
}
}

This intentionally crafted toy dataset supports both classification and regression tasks. Its data structure closely resembles that of real patient data, making it suitable for debugging real-world scenario workflows. Here is a concise summary of the toy dataset generated using the aforementioned `DRESS.summary` method.

6000 row(s) 23 feature(s)
Admissions : categoric null: 4193 unique: 1806 [1274533, 631455, 969679, …]
Demographics.Age : numeric null: 0 unique: 51 [45, 48, 50, …]
Demographics.Barriers : categoric null: 3378 unique: 139 [insurance, substance use, mental health, …]
Demographics.Ethnicity: categoric null: 0 unique: 5 [white, latino, black, …]
Demographics.Gender : categoric null: 0 unique: 2 [M, F]
Etiology : categoric null: 0 unique: 5 [NASH, ASH, HCV, …]
Exams.Ascites : categoric null: 0 unique: 3 [large, small, none]
Exams.BMI : numeric null: 0 unique: 346 [33.8, 23, 31.3, …]
Exams.Encephalopathy : numeric null: 0 unique: 5 [1, 4, 0, …]
Exams.Varices : categoric null: 0 unique: 3 [none, large, small]
Grade : numeric null: 0 unique: 4 [2, 4, 1, …]
ID : numeric null: 0 unique: 6000 [1, 2, 3, …]
Labs.ALP : numeric null: 0 unique: 236 [120, 100, 93, …]
Labs.ALT : numeric null: 0 unique: 373 [31, 87, 86, …]
Labs.AST : numeric null: 0 unique: 370 [31, 166, 80, …]
Labs.Bilirubin : numeric null: 0 unique: 103 [1.5, 3.9, 2.6, …]
Labs.Hemoglobin : numeric null: 0 unique: 88 [14.9, 13.4, 11, …]
Labs.INR : numeric null: 0 unique: 175 [1, 2.72, 1.47, …]
Labs.MCV : numeric null: 0 unique: 395 [97.9, 91, 96.7, …]
Labs.Platelet : numeric null: 0 unique: 205 [268, 170, 183, …]
Labs.WBC : numeric null: 0 unique: 105 [7.3, 10.5, 5.5, …]
MELD : numeric null: 0 unique: 33 [17, 32, 21, …]
Stage : numeric null: 0 unique: 4 [3, 4, 2, …]

Feature Enhancements

Propensity and Proximity Matching
The `DRESS.propensity` method, which performs propensity score matching, now supports both numerical and categorical features as confounders. Internally, the method uses `DRESS.logistic` to estimate the propensity score if only numerical features are specified; otherwise, it uses `DRESS.gradientBoosting`. We have also introduced a new method called `DRESS.proximity` that uses `DRESS.kNN` to perform K-nearest neighbor matching.

// Split samples to controls and subjects.
const [controls, subjects] = DRESS.split(samples);
// If only numerical features are specified, then the method will build a logistic regression model.
let numerical_matches = DRESS.propensity(subjects, controls, numericals);
// If only categorical features (or both categorical and numberical features) are specified, then the method will build a gradient boosting regression model.
let categorical_matches = DRESS.propensity(subjects, controls, numericals, categoricals);

Categorize and Numericize
The `DRESS.categorize` method in the `dress-transform.js` module has been completely rewritten and behaves very differently, but more intuitively, now. The new `DRESS.categorize` method accepts an array of numerical values as boundaries and converts a numerical feature into a categorical feature based on the specified boundaries. The old `DRESS.categorize` method has been renamed as `DRESS.numericize`, which converts a categorical feature into a numerical feature by matching the feature value against an ordered array of categories.

// Define boundaries.
const boundaries = [3, 6, 9];
// Categorize any feature value less than 3 as 0, values between 3 and 6 as 1, values between 6 and 9 as 2, and values greater than 9 as 3.
DRESS.categorize(samples, [feature], boundaries);
// Define categories.
const categories = [A, [B, C], D];
// Numericize any feature value A to 0, B or C to 1, and D to 2.
DRESS.numericize(samples, [feature], categories);

Linear, Logistic, and Polytomous Regression
In DRESS Kit V1, the `DRESS.logistic` regression algorithm was implemented using Newton’s method, while the `DRESS.linear` regression algorithm utilized the matrix approach. In DRESS Kit V2, both regression algorithms were implemented using the same optimized gradient descent regression method, which also supports hyperparameters such as learning rate and ridge (L2) regularization. We have also introduced a new method called `DRESS.polytomous`, which uses `DRESS.logistic` internally to perform multiclass classification using the one-vs-rest approach.

Precision-Recall Curve
The `dress-roc.js` module now contains a method, `DRESS.pr`, to generate precision-recall curves based on one or more numerical classifiers. This method has a method signature identical to that of `DRESS.roc` and can be used as a direct replacement for the latter.

// Generate a receiver-operating characteristic (roc) curve.
let roc = DRESS.roc(samples, outcomes, classifiers);
// Generate a precision-recall (pr) curve.
let pr = DRESS.pr(samples, outcomes, classifiers);

Breaking Changes

JavaScript Promise
DRESS Kit V2 uses Promise exclusively to handle all asynchronous operations. Callback functions are no longer supported. Most notably, the coding pattern of passing a custom callback function named `processJSON` to `DRESS.local` or `DRESS.remote` (as shown in the examples from DRESS Kit V1) is no longer valid. Instead, the following coding pattern is preferred.

DRESS.local('data.json').then(subjects => {
// Do something with the subjects.
})

kNN Model
Several breaking changes have been made to the `DRESS.kNN` method. First, the outcome of the model must be specified during the training phase, instead of during the prediction phase, similar to how other machine learning models in the DRESS Kit, such as `DRESS.gradientBoosting`, `DRESS.multilayerPerceptron` are created.

The kNN imputation functionality has been moved from the model object returned by the `DRESS.kNN` method to a separate method named `DRESS.nearestNeighbor` in the `dress-imputation.js` module in order to better differentiate the machine-learning algorithm from its application.

The `importances` parameter has been removed and relative feature importances should be specified as a hyperparameter instead.

Model Performance
The method for evaluating/validating a machine learning model’s performance has been renamed from `model.performance` to `model.validate` in order to improve linguistic coherence (i.e. all method names are verbs).

Module Organization
The module containing the core statistical methods has been renamed from `dress-core.js` to `dress.js`, which must be included at all times when using DRESS Kit V2 in a modular fashion.

The module containing the decision-tree-based machine learning algorithms, including random forest and gradient boosting, has been renamed from `dress-ensemble.js` to `dress-tree.js` in order to better describe the underlying learning algorithm.

The methods for loading and saving data files as well as printing text output onto an HTML document have been moved from `dress-utility.js` to `dress-io.js`. Meanwhile, the `DRESS.async` method has been moved to its own module `DRESS-async.js`.

Default Boolean Parameters
All optional boolean (true/false) parameters are assigned a default value of `false`, in order to maintain a coherent syntax. The default behavoirs of the methods are carefully designed to be suitable for most common use-cases. For instance, the default behavior of the kNN machine learning model is to use the weighted kNN algorithm; the boolean parameter to select between the weighted vs unweighted kNN algorithm has, therefore, been renamed as `unweighted` and is set to a default value of `false`.

As a result of this change, however, the default behavior of all machine learning algorithms is set to produce a regression model, instead of a classification model.

Removed Methods
The following methods have been removed entirely because they were deemed ill-constructed or redundant:
– `DRESS.effectMeasures` from the `dress-association.js` module.
– `DRESS.polynomial` from the `dress-regression.js` module.
– `DRESS.uuid` from the `dress-transform.js` module.

Final Note

Apart from the major new features mentioned earlier, numerous enhancements have been made to nearly every method included in the DRESS Kit. Most operations are noticeably faster than before yet the minified codebase remains nearly the same size. If you have previously utilized DRESS Kit V1, upgrading to V2 is highly recommended. For those who haven’t yet incorporated the DRESS Kit into their research projects, now is an opportune moment to explore its capabilities. We genuinely value your interest in and your ongoing support for the DRESS Kit. Please do not hesitate to share your feedback and comments so that we can continue to improve this library.

Please do not hesitate to grab the latest version of the DRESS Kit from its GitHub repository and start building.

Leave a Reply

Your email address will not be published. Required fields are marked *