Run AutoML experiments with large Parquet datasets using Amazon SageMaker Autopilot

January 28, 2022

Starting today, you can use Amazon SageMaker Autopilot to tackle regression and classification tasks on large datasets up to 100 GB. Additionally, you can now provide your datasets in either CSV or Apache Parquet content types.

Businesses are generating more data than ever, and demand is growing for insights from these large datasets to shape business decisions. However, successfully training state-of-the-art machine learning (ML) algorithms on these large datasets can be challenging. Autopilot automates this process and provides a seamless experience for running automated machine learning (AutoML) on large datasets up to 100 GB.

Autopilot subsamples your large datasets automatically to fit the maximum supported limit while preserving the rare class in case of class imbalance. Class imbalance is an important problem to be aware of in ML, especially when dealing with large datasets. Consider a fraud detection dataset where only a small fraction of transactions is expected to be fraudulent. In this case, Autopilot subsamples only the majority class, non-fraudulent transactions, while preserving the rare class, fraudulent transactions.
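The idea of subsampling only the majority class can be illustrated with a minimal sketch. This is not Autopilot's internal algorithm, which the post does not describe. The function name `subsample_preserving_rare_class`, the `max_rows` budget, and the toy transaction records are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def subsample_preserving_rare_class(rows, label_of, max_rows, seed=0):
    """Down-sample only the majority class so that at most max_rows rows remain.

    Illustrative sketch only; not Autopilot's actual implementation.
    """
    if len(rows) <= max_rows:
        return list(rows)  # already within the budget, nothing to drop

    # Group rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    # The majority class is the one with the most rows.
    majority = max(by_class, key=lambda c: len(by_class[c]))

    # Every row from every other class is kept untouched.
    rare_rows = [r for c, rs in by_class.items() if c != majority for r in rs]

    budget = max_rows - len(rare_rows)
    if budget < 0:
        raise ValueError("rare classes alone exceed max_rows")

    # Draw a random sample of the majority class to fill the remaining budget.
    rng = random.Random(seed)
    kept_majority = rng.sample(by_class[majority], min(budget, len(by_class[majority])))
    return rare_rows + kept_majority


# Toy fraud dataset: 1 in 100 transactions is labeled fraudulent.
transactions = [{"id": i, "fraud": i % 100 == 0} for i in range(10_000)]
sample = subsample_preserving_rare_class(
    transactions, lambda r: r["fraud"], max_rows=1_000
)
```

In this toy run, all fraudulent rows survive and only non-fraudulent rows are dropped, so the rare class makes up a larger share of the sample than of the original data.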

When you run an AutoML job using Autopilot, all relevant information for subsampling is stored in Amazon CloudWatch. Navigate to the log group for /aws/sagemaker/ProcessingJobs, search for the name of your AutoML job, and choose the CloudWatch log stream that includes -db- in its name.
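The same lookup can be scripted. The following is a rough sketch that assumes `boto3` is installed and AWS credentials are configured; it relies on the CloudWatch Logs `describe_log_streams` paginator. The helper names `is_subsampling_stream` and `find_subsampling_streams` are hypothetical, and the matching rule simply mirrors the console steps above (stream name contains the AutoML job name and `-db-`).

```python
LOG_GROUP = "/aws/sagemaker/ProcessingJobs"


def is_subsampling_stream(stream_name, automl_job_name):
    """True if the stream name mentions the job and contains the -db- marker."""
    return automl_job_name in stream_name and "-db-" in stream_name


def find_subsampling_streams(automl_job_name, logs_client=None):
    """List log stream names in LOG_GROUP that match the given AutoML job."""
    if logs_client is None:
        import boto3  # imported lazily so the helpers above work without it

        logs_client = boto3.client("logs")

    paginator = logs_client.get_paginator("describe_log_streams")
    matches = []
    # Walk every page of streams in the log group and keep the matching names.
    for page in paginator.paginate(logGroupName=LOG_GROUP):
        for stream in page["logStreams"]:
            name = stream["logStreamName"]
            if is_subsampling_stream(name, automl_job_name):
                matches.append(name)
    return matches
```

Scanning the whole log group can be slow in accounts with many processing jobs; narrowing the search with a name prefix may help if your stream names start with the job name, which is an assumption worth checking in the console first.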

Many of our customers prefer to store their large datasets in Parquet, generally because of its compression, support for advanced data structures, efficiency, and low-cost operations. These datasets can often reach tens or even hundreds of GB. Now, you can bring these Parquet datasets directly to Autopilot. You can either use our API or navigate to Amazon SageMaker Studio to create an Autopilot job with a few clicks. You can specify the input location of your Parquet dataset as a single file or as multiple files listed in a manifest file. Autopilot automatically detects the content type of your dataset, parses it, extracts meaningful features, and trains multiple ML algorithms.
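For the API route, a request might be assembled along these lines. This is a sketch under stated assumptions: it uses the SageMaker `create_auto_ml_job` operation through `boto3`, and the helper names (`build_manifest`, `build_automl_request`, `submit`) and every bucket, job, column, and role value passed to them are hypothetical placeholders. The content type is left unset because Autopilot detects it automatically.

```python
import json


def build_manifest(prefix, file_names):
    """Return manifest text: a JSON array holding a prefix entry, then file keys
    relative to that prefix."""
    return json.dumps([{"prefix": prefix}, *file_names])


def build_automl_request(job_name, manifest_uri, target_column, output_uri, role_arn):
    """Assemble keyword arguments for create_auto_ml_job with a manifest input."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {
                        # Use "S3Prefix" instead to point at a single file or prefix.
                        "S3DataType": "ManifestFile",
                        "S3Uri": manifest_uri,
                    }
                },
                "TargetAttributeName": target_column,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_uri},
        "RoleArn": role_arn,
    }


def submit(request, sm_client=None):
    """Send the request; requires boto3 and configured AWS credentials."""
    if sm_client is None:
        import boto3  # imported lazily so the builders above work without it

        sm_client = boto3.client("sagemaker")
    return sm_client.create_auto_ml_job(**request)
```

A manifest built with `build_manifest("s3://example-bucket/data/", ["part-0.parquet", "part-1.parquet"])` would be uploaded to Amazon S3 first, and its URI passed as `manifest_uri`. Keeping request construction separate from `submit` lets you inspect the payload before anything is sent.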

You can get started using our sample notebook for running AutoML using Autopilot on Parquet datasets.

About the Authors

H. Furkan Bozkurt, Machine Learning Engineer, Amazon SageMaker Autopilot.

Valerio Perrone, Applied Science Manager, Amazon SageMaker Autopilot.
