As machine learning (ML) models have improved, data scientists, ML engineers and researchers have shifted more of their attention to defining and bettering data quality. This has led to the emergence of a data-centric approach to ML and various techniques to improve model performance by focusing on data requirements. Applying these techniques allows ML practitioners to reduce the amount of data required to train an ML model.

As part of this approach, advanced data subset selection techniques have surfaced to speed up training by reducing input data quantity. This process is based on automatically selecting a given number of points that approximate the distribution of a larger dataset and using it for training. Applying this type of technique reduces the amount of time required to train an ML model.

In this post, we describe applying data-centric AI principles with Amazon SageMaker Ground Truth, how to implement data subset selection techniques using the CORDS repository on Amazon SageMaker to reduce the amount of data required to train an initial model, and how to run experiments using this approach with Amazon SageMaker Experiments.

## A data-centric approach to machine learning

Before diving into more advanced data-centric techniques like data subset selection, you can improve your datasets in multiple ways by applying a set of underlying principles to your data labeling process. For this, Ground Truth supports various mechanisms to improve label consistency and data quality.

Label consistency is important for improving model performance. Without it, models can’t produce a decision boundary that separates every point belonging to differing classes. One way to ensure consistency is by using annotation consolidation in Ground Truth, which allows you to serve a given example to multiple labelers and use the aggregated label provided as the ground truth for that example. Divergence in the label is measured by the confidence score generated by Ground Truth. When there is divergence in labels, you should look to see if there is ambiguity in the labeling instructions provided to your labelers that can be removed. This approach mitigates the effects of bias of individual labelers, which is central to making labels more consistent.

Another way to improve model performance by focusing on data involves developing methods to analyze errors in labels as they come up to identify the most important subset of data to improve upon. you can do this for your training dataset with a combination of manual efforts involving diving into labeled examples and using the Amazon CloudWatch logs and metrics generated by Ground Truth labeling jobs. It’s also important to look at errors the model makes at inference time to drive the next iteration of labeling for our dataset. In addition to these mechanisms, Amazon SageMaker Clarify allows data scientists and ML engineers to run algorithms like KernelSHAP to allow them to interpret predictions made by their model. As mentioned, a deeper explanation into the model’s predictions can be related back to the initial labeling process to improve it.

Lastly, you can consider tossing out noisy or overly redundant examples. Doing this allows you to reduce training time by removing examples that don’t contribute to improving model performance. However, identifying a useful subset of a given dataset manually is difficult and time consuming. Applying the data subset selection techniques described in this post allows you to automate this process along established frameworks.

## Use case

As mentioned, data-centric AI focuses on improving model input rather than the architecture of the model itself. Once you have applied these principles during data labeling or feature engineering, you can continue to focus on model input by applying data subset selection at training time.

For this post, we apply Generalization based Data Subset Selection for Efficient and Robust Learning (GLISTER), which is one of many data subset selection techniques implemented in the CORDS repository, to the training algorithm of a ResNet-18 model to minimize the time it takes to train a model to classify CIFAR-10 images. The following are some sample images with their respective labels pulled from the CIFAR-10 dataset.

ResNet-18 is often used for classification tasks. It is an 18-layer deep convolutional neural network. The CIFAR-10 dataset is often used to evaluate the validity of various techniques and approaches in ML. It’s composed of 60,000 32×32 color images labeled across 10 classes.

In the following sections, we show how GLISTER can help you answer the following question to some degree:

What percentage of a given dataset can we use and still achieve good model performance during training?

Applying GLISTER to your training algorithm will introduce fraction as a hyperparameter in your training algorithm. This represents the percentage of the given dataset you wish to use. As with any hyperparameter, finding the value producing the best result for your model and data requires tuning. We don’t go in depth into hyperparameter tuning in this post. For more information, refer to Optimize hyperparameters with Amazon SageMaker Automatic Model Tuning.

We run several tests using SageMaker Experiments to measure the impact of the approach. Results will vary depending on the initial dataset, so it’s important to test the approach against our data at different subset sizes.

Although we discuss using GLISTER on images, you can also apply it to training algorithms working with structured or tabular data.

## Data subset selection

The purpose of data subset selection is to accelerate the training process while minimizing the effects on accuracy and increasing model robustness. More specifically, GLISTER-ONLINE selects a subset as the model learns by attempting to maximize the log-likelihood of that training data subset on the validation set you specify. Optimizing data subset selection in this way mitigates against the noise and class imbalance that is often found in real-world datasets and allows the subset selection strategy to adapt as the model learns.

The initial GLISTER paper describes a speedup/accuracy trade-off at various data subset sizes as followed using a LeNet model:

 Subset size Speedup Accuracy 10% 6x -3% 30% 2.5x -1.20% 50% 1.5x -0.20%

To train the model, we run a SageMaker training job using a custom training script. We have also already uploaded our image dataset to Amazon Simple Storage Service (Amazon S3). As with any SageMaker training job, we need to define an Estimator object. The PyTorch estimator from the sagemaker.pytorch package allows us to run our own training script in a managed PyTorch container. The inputs variable passed to the estimator’s .fit function contains a dictionary of the training and validation dataset’s S3 location.

The train.py script is run when a training job is launched. In this script, we import the ResNet-18 model from the CORDS library and pass it the number of classes in our dataset as follows:

from cords.utils.models import ResNet18 numclasses = 10
model = ResNet18(numclasses)

Then, we use the gen_dataset function from CORDS to create training, validation, and test datasets:

from cords.utils.data.datasets.SL import gen_dataset train_set, validation_set, test_set, numclasses = gen_dataset(
dset_name="cifar10",
feature="dss",
type="image")

From each dataset, we create an equivalent PyTorch dataloader:

train_loader = torch.utils.data.DataLoader(train_set,
batch_size=batch_size,
batch_size=batch_size,
shuffle=False)

Lastly, we use these dataloaders to create a GLISTERDataLoader from the CORDS library. It uses an implementation of the GLISTER-ONLINE selection strategy, which applies subset selection as we update the model during training, as discussed earlier in this post.

To create the object, we pass the selection strategy specific arguments as a DotMap object along with the train_loader, validation_loader, and logger:

import logging
from dotmap import DotMap dss_args = # GLISTERDataLoader specific arguments
dss_args = DotMap(dss_args)
dss_args,
logger,
batch_size=batch_size,
shuffle=True,
pin_memory=False)

The GLISTERDataLoader can now be applied as a regular dataloader to a training loop. It will select data subsets for the next training batch as the model learns based on that model’s loss. As demonstrated in the preceding table, adding a data subset selection strategy allows us to significantly reduce training time, even with the additional step of data subset selection, with little trade-off in accuracy.

Data scientists and ML engineers often need to evaluate the validity of an approach by comparing it with some baseline. We demonstrate how to do this in the next section.

## Experiment tracking

You can use SageMaker Experiments to measure the validity of the data subset selection approach. For more information, see Next generation Amazon SageMaker Experiments – Organize, track, and compare your machine learning trainings at scale.

In our case, we perform four experiments: a baseline without applying data subset selection, and three others with differing fraction parameters, which represents the size of the subset relative to the overall dataset. Naturally, using a smaller fraction parameter should result in reduced training times, but lower model accuracy as well.

For this post, each training run is represented as a Run in SageMaker Experiments. The runs related to our experiment are all grouped under one Experiment object. Runs can be attached to a common experiment when creating the Estimator with the SDK. See the following code:

from sagemaker.utils import unique_name_from_base
from sagemaker.experiments.run import Run, load_run experiment_name = unique_name_from_base("data-centric-experiment")
with Run(
experiment_name=experiment_name,
sagemaker_session=sess
) as run:
estimator = PyTorch('train.py',
source_dir="source",
role=role,
instance_type=instance_type,
instance_count=1,
framework_version=framework_version,
py_version='py3',
env={ 'SAGEMAKER_REQUIREMENTS': 'requirements.txt',
})
estimator.fit(inputs)

As part of your custom training script, you can collect run metrics by using load_run:

from sagemaker.experiments.run import load_run
from sagemaker.session import Session if __name__ == "__main__":
args = parse_args()
session = Session(boto3.session.Session(region_name=args.region))
train(args, run)

Then, using the run object returned by the previous operation, u can collect data points per epoch by calling run.log_metric(name, value, step) and supplying the metric name, value, and current epoch number.

To measure the validity of our approach, we collect metrics corresponding to training loss, training accuracy, validation loss, validation accuracy, and time to complete an epoch. Then, after running the training jobs, we can review the results of our experiment in Amazon SageMaker Studio or through the SageMaker Experiments SDK.

To view validation accuracies within Studio, choose Analyze on the experiment Runs page.

Add a chart, set the chart properties, and choose Create. As shown in the following screenshot, you’ll see a plot of validation accuracies at each epoch for all runs.

The SDK also allows you to retrieve experiment-related information as a Pandas dataframe:

from sagemaker.analytics import ExperimentAnalytics trial_component_analytics = ExperimentAnalytics(
sagemaker_session=sess.sagemaker_client,
experiment_name=experiment_name
)
analytic_table = trial_component_analytics.dataframe()


Optionally, the training jobs can be sorted. For instance, we could add "metrics.validation:accuracy.max" as the value of the sort_by parameter passed to ExperimentAnalytics to return the result ordered by validation accuracy.

As expected, our experiments show that applying GLISTER and data subset selection to the training algorithm reduces training time. When running our baseline training algorithm, the median time to complete a single epoch hovers around 27 seconds. By contrast, applying GLISTER to select a subset equivalent to 50%, 30%, and 10% of the overall dataset results in times to complete an epoch of about 13, 8.5, and 2.75 seconds, respectively, on ml.p3.2xlarge instances.

We also observe a comparatively minimal impact on validation accuracy, especially when using data subsets of 50%. After training for 100 epochs, the baseline produces a validation accuracy of 92.72%. In contrast, applying GLISTER to select a subset equivalent to 50%, 30%, and 10% of the overall dataset results in validation accuracies of 91.42%, 89.76%, and 82.82%, respectively.

## Conclusion

SageMaker Ground Truth and SageMaker Experiments enable a data-centric approach to machine learning by allowing data scientists and ML engineers to produce more consistent datasets and track the impact of more advanced techniques as they implement them in the model building phase. Implementing a data-centric approach to ML allows you to reduce the amount of data required by your model and improve its robustness.

Give it a try, and let us know what you think in comments.