When we analyze images, we may want to incorporate other metadata related to the image. Examples include when and where the image was taken, who took the image, as well as what is featured in the image. One way to represent this metadata is to use a JSON format, which is well-suited for a document database such as Amazon DocumentDB (with MongoDB compatibility). Example use cases include:

  • Photo-sharing services that want to enable image search and exploration capabilities for users
  • Online retailers who want to identify similar product images for product recommendation
  • Healthcare providers who want to query medical image scans related to specific patients or medical conditions
  • Environmental organizations who want to monitor wildlife conservation efforts using drone imagery

In this post, we focus on the first use case of enabling image search and exploration of a generic photo collection. We look at the JSON output of image analysis generated from Amazon Rekognition, which we ingest into Amazon DocumentDB, and then explore using Amazon SageMaker.

SageMaker is a fully managed service that provides every developer and data scientist the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Amazon Rekognition makes it easy to add image and video analysis to your applications. You just provide an image or video to the Amazon Rekognition API, and the service can identify objects, people, text, scenes, and activities. Amazon Rekognition has a simple, easy-to-use API that can quickly analyze any image or video file that’s stored in Amazon Simple Storage Service (Amazon S3). It requires no ML expertise to use.

Amazon DocumentDB is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB 3.6 or 4.0 application code, drivers, and tools to run, manage, and scale workloads on Amazon DocumentDB without having to worry about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

Solution overview

In this post, we explore images taken from Unsplash. In the source code, we have kept image file names in their original format, <first name>-<last name>-<image ID>-unsplash.jpg. This retains the photographer’s name as well as the image ID, which you can use to determine the image’s original URL: https://unsplash.com/photos/<image ID>.
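As a minimal sketch of how that naming convention can be used, the following helper splits a file name into the photographer's name, the image ID, and the Unsplash URL. The function name parse_unsplash_filename and the image ID in the example are made up for illustration, and the sketch assumes the photographer's name consists of exactly two hyphen-separated parts, as in the stated format.

```python
import os


def parse_unsplash_filename(path):
    """Split '<first name>-<last name>-<image ID>-unsplash.jpg' into its parts."""
    stem, _ext = os.path.splitext(os.path.basename(path))
    parts = stem.split("-")
    # Expect at least: first name, last name, image ID, 'unsplash'
    if len(parts) < 4 or parts[-1] != "unsplash":
        raise ValueError(f"unexpected file name: {path}")
    first_name, last_name = parts[0], parts[1]
    # Re-join the middle in case the image ID itself contains hyphens
    image_id = "-".join(parts[2:-1])
    return {
        "photographer": f"{first_name} {last_name}",
        "image_id": image_id,
        "url": f"https://unsplash.com/photos/{image_id}",
    }


# 'AbC123xyz' is a made-up image ID used only for illustration
info = parse_unsplash_filename("pics/coleen-rivas-AbC123xyz-unsplash.jpg")
print(info["photographer"], info["url"])
```

File names that do not follow the convention raise a ValueError rather than producing a wrong URL.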

Each image is analyzed using Amazon Rekognition. The output from the Amazon Rekognition API is a nested JSON object, which is a format well-suited for Amazon DocumentDB. For example, we can analyze the following image, Gardens by the Bay, Singapore, by Coleen Rivas.


Amazon Rekognition generates the following JSON output:

{'Labels': [{'Name': 'Outdoors', 'Confidence': 98.58585357666016, 'Instances': [], 'Parents': []}, {'Name': 'Garden', 'Confidence': 96.23029327392578, 'Instances': [], 'Parents': [{'Name': 'Outdoors'}]}, {'Name': 'Arbour', 'Confidence': 93.65332794189453, 'Instances': [], 'Parents': [{'Name': 'Garden'}, {'Name': 'Outdoors'}]}, {'Name': 'Person', 'Confidence': 93.00440979003906, 'Instances': [{'BoundingBox': {'Width': 0.016103893518447876, 'Height': 0.03213529288768768, 'Left': 0.6525371670722961, 'Top': 0.9264869689941406}, 'Confidence': 93.00440979003906}, {'BoundingBox': {'Width': 0.010800352320075035, 'Height': 0.020640190690755844, 'Left': 0.781416118144989, 'Top': 0.8592491149902344}, 'Confidence': 78.98234558105469}, {'BoundingBox': {'Width': 0.017044249922037125, 'Height': 0.02785704843699932, 'Left': 0.7455113530158997, 'Top': 0.8547402620315552}, 'Confidence': 66.65809631347656}], 'Parents': []}, {'Name': 'Human', 'Confidence': 93.00440979003906, 'Instances': [], 'Parents': []}, {'Name': 'Amusement Park', 'Confidence': 82.81632232666016, 'Instances': [], 'Parents': []}, {'Name': 'Theme Park', 'Confidence': 76.72222900390625, 'Instances': [], 'Parents': [{'Name': 'Amusement Park'}]}, {'Name': 'Plant', 'Confidence': 73.67972564697266, 'Instances': [], 'Parents': []}, {'Name': 'Potted Plant', 'Confidence': 68.09540557861328, 'Instances': [], 'Parents': [{'Name': 'Plant'}, {'Name': 'Vase'}, {'Name': 'Jar'}, {'Name': 'Pottery'}]}, {'Name': 'Pottery', 'Confidence': 68.09540557861328, 'Instances': [], 'Parents': []}, {'Name': 'Jar', 'Confidence': 68.09540557861328, 'Instances': [], 'Parents': []}, {'Name': 'Vase', 'Confidence': 68.09540557861328, 'Instances': [], 'Parents': [{'Name': 'Jar'}, {'Name': 'Pottery'}]}, {'Name': 'Ferris Wheel', 'Confidence': 64.03276824951172, 'Instances': [], 'Parents': [{'Name': 'Amusement Park'}]}, {'Name': 'Nature', 'Confidence': 62.96412658691406, 'Instances': [], 'Parents': []}, {'Name': 'Planter', 'Confidence': 
58.99357604980469, 'Instances': [], 'Parents': [{'Name': 'Potted Plant'}, {'Name': 'Plant'}, {'Name': 'Vase'}, {'Name': 'Jar'}, {'Name': 'Pottery'}]}, {'Name': 'Herbs', 'Confidence': 57.66265869140625, 'Instances': [], 'Parents': [{'Name': 'Planter'}, {'Name': 'Potted Plant'}, {'Name': 'Plant'}, {'Name': 'Vase'}, {'Name': 'Jar'}, {'Name': 'Pottery'}]}, {'Name': 'Park', 'Confidence': 51.91413879394531, 'Instances': [], 'Parents': [{'Name': 'Lawn'}, {'Name': 'Outdoors'}, {'Name': 'Grass'}, {'Name': 'Plant'}]}, {'Name': 'Grass', 'Confidence': 51.91413879394531, 'Instances': [], 'Parents': [{'Name': 'Plant'}]}, {'Name': 'Lawn', 'Confidence': 51.91413879394531, 'Instances': [], 'Parents': [{'Name': 'Grass'}, {'Name': 'Plant'}]}], 'LabelModelVersion': '2.0', 'ResponseMetadata': {'RequestId': '8f0146c9-ff5e-4b7b-9469-346aa46b125f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Thu, 04 Mar 2021 05:54:59 GMT', 'x-amzn-requestid': '8f0146c9-ff5e-4b7b-9469-346aa46b125f', 'content-length': '2511', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

This output contains the confidence score of finding a variety of types of objects, called labels, in the image.

Those types of objects include Garden, Person, and even Ferris Wheel, among others. You can download the list of supported labels from our documentation page. The output from Amazon Rekognition includes all detected labels over a specified confidence level. In addition to the confidence of the label, it outputs an array of instances in the case that multiple objects of that label have been identified. For example, in the preceding image, Amazon Rekognition identified three Person objects, along with the location in the picture for each identified object.
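To make the structure of this output concrete, here is a small sketch that walks a trimmed copy of the response shown above (only the Outdoors and Person labels are kept) and reports, for each label, its confidence and how many instances were located. The helper name summarize_labels is our own and is not part of the Amazon Rekognition API.

```python
# Trimmed copy of the Rekognition output shown above (two labels only)
response = {
    "Labels": [
        {"Name": "Outdoors", "Confidence": 98.58585357666016,
         "Instances": [], "Parents": []},
        {"Name": "Person", "Confidence": 93.00440979003906,
         "Instances": [
             {"BoundingBox": {"Width": 0.016103893518447876,
                              "Height": 0.03213529288768768,
                              "Left": 0.6525371670722961,
                              "Top": 0.9264869689941406},
              "Confidence": 93.00440979003906},
             {"BoundingBox": {"Width": 0.010800352320075035,
                              "Height": 0.020640190690755844,
                              "Left": 0.781416118144989,
                              "Top": 0.8592491149902344},
              "Confidence": 78.98234558105469},
             {"BoundingBox": {"Width": 0.017044249922037125,
                              "Height": 0.02785704843699932,
                              "Left": 0.7455113530158997,
                              "Top": 0.8547402620315552},
              "Confidence": 66.65809631347656},
         ],
         "Parents": []},
    ]
}


def summarize_labels(response):
    """Return (name, rounded confidence, number of located instances) per label."""
    return [
        (label["Name"], round(label["Confidence"], 1), len(label["Instances"]))
        for label in response["Labels"]
    ]


summary = summarize_labels(response)
for name, confidence, n_instances in summary:
    print(f"{name}: {confidence}% ({n_instances} located instances)")
```

Note that each instance carries its own confidence score, which can be lower than the confidence of the label as a whole.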

Amazon DocumentDB stores each JSON output as a document. Multiple documents are stored in a collection, and multiple collections are stored in a database. Borrowing terminology from relational databases, documents are analogous to rows, and collections are analogous to tables. The following table summarizes these terms.

Document Database Concepts    SQL Concepts
Document                      Row
Collection                    Table
Database                      Database
Field                         Column

We now implement the following tasks:

  1. Connect to an Amazon DocumentDB cluster.
  2. Upload images to Amazon S3.
  3. Analyze images using Amazon Rekognition.
  4. Ingest Amazon Rekognition output into Amazon DocumentDB.
  5. Explore image labels using Amazon DocumentDB queries.

To conduct these tasks, we use a SageMaker notebook, which is a Jupyter notebook app provided by a SageMaker notebook instance. Although you can use SageMaker notebooks to train and deploy ML models, they’re also useful for code commentary and data exploration, the latter being the focus of our post.

Create resources

We have prepared an AWS CloudFormation template to create the required AWS resources for this post in our GitHub repository. For instructions on creating a CloudFormation stack, see the video Simplify your Infrastructure Management using AWS CloudFormation.

The CloudFormation stack provisions the following:

  • An Amazon Virtual Private Cloud (Amazon VPC) with three private subnets and one public subnet.
  • An Amazon DocumentDB cluster with three nodes, one in each private subnet. When creating an Amazon DocumentDB cluster in a VPC, its subnet group should have subnets in at least three Availability Zones in a given Region.
  • A security group granting access to the Amazon DocumentDB cluster to resources inside the Amazon VPC. This security group is how the SageMaker notebook instance is granted access to the Amazon DocumentDB cluster.
  • An AWS Secrets Manager secret to store login credentials for Amazon DocumentDB. This allows us to avoid storing plaintext credentials in our SageMaker notebook instance.
  • A SageMaker role to retrieve the Amazon DocumentDB login credentials, allowing connections to the Amazon DocumentDB cluster from a SageMaker notebook.
  • A SageMaker notebook instance to run queries and analysis.
  • A SageMaker instance lifecycle configuration that runs a bash script every time the instance boots up. The script downloads a certificate bundle used to create TLS connections to Amazon DocumentDB, as well as a Jupyter notebook containing the code for this tutorial. It also installs the required Python libraries (such as pymongo for database methods and ipyplot for displaying images), so that we don’t need to install these libraries from the notebook. Finally, it downloads 15 sample images onto the SageMaker instance. See the following code:
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade pymongo
pip install --upgrade ipyplot
source /home/ec2-user/anaconda3/bin/deactivate
cd /home/ec2-user/SageMaker
wget https://s3.amazonaws.com/rds-downloads/rds-combined-ca-bundle.pem
wget https://github.com/aws-samples/documentdb-sagemaker-example/raw/main/rekognition/script.ipynb
mkdir pics
cd pics
wget https://github.com/aws-samples/documentdb-sagemaker-example/raw/main/rekognition/pics.zip
unzip pics.zip
rm pics.zip
EOF

Prior to creating the CloudFormation stack, you need to create a bucket in Amazon S3 to store the image files for analysis. For instructions, see Creating a bucket.

When creating the CloudFormation stack, you need to specify the following:

  • Name for your CloudFormation stack
  • Amazon DocumentDB username and password (to be stored in Secrets Manager)
  • Amazon DocumentDB instance type (default db.r5.large)
  • SageMaker instance type (default ml.t3.xlarge)
  • Name of your existing S3 bucket where you store your images for analysis

It should take about 15 minutes to create the CloudFormation stack. The following diagram shows the resource architecture.


This CloudFormation template incurs costs, and you should consult the relevant pricing pages before launching it.

Connect to an Amazon DocumentDB cluster

All the subsequent code in this tutorial is in the Jupyter notebook in the SageMaker instance created in your CloudFormation stack.

  1. To connect to your Amazon DocumentDB cluster from a SageMaker notebook, first specify the name of your CloudFormation stack:
stack_name = "docdb-rekognition" # name of CloudFormation stack

The stack_name refers to the name you specified for your CloudFormation stack upon its creation.

  2. Use this parameter in the following method to get your Amazon DocumentDB credentials stored in Secrets Manager:
import json
import boto3

def get_secret(stack_name):
    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=session.region_name
    )
    # The secret is named after the CloudFormation stack
    secret_name = f'{stack_name}-DocDBSecret'
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    secret = get_secret_value_response['SecretString']
    return json.loads(secret)

  3. Next, we extract the login parameters from the stored secret:
secret = get_secret(stack_name)
db_username = secret['username']
db_password = secret['password']
db_port = secret['port']
db_host = secret['host']

  4. With the extracted parameters, we create a MongoClient from the pymongo library to establish a connection to the Amazon DocumentDB cluster.
from pymongo import MongoClient

# tls and tlsCAFile replace the older ssl and ssl_ca_certs options; pymongo 4 removed ssl_ca_certs
uri_str = f"mongodb://{db_username}:{db_password}@{db_host}:{db_port}/?tls=true&tlsCAFile=rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
client = MongoClient(uri_str)

  5. Before continuing, verify that the connection has been established, for example by running a command that returns details of the Amazon DocumentDB cluster.

  6. After we establish the connection to our Amazon DocumentDB cluster, we create a database and collection to store our image analysis data generated from Amazon Rekognition. For this post, we name our database db and our collection coll:
db_name = "db" # name the database
coll_name = "coll" # name the collection
db = client[db_name] # create a database object
coll = db[coll_name] # create a collection object
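Two details of the connection steps are worth sketching. First, a password containing characters such as @ or / breaks a MongoDB connection string unless it is percent-escaped. Second, a lightweight command such as ping is one way to confirm that the cluster is reachable. The helper names build_docdb_uri and check_connection, and the credentials in the example, are placeholders of our own; the sketch uses the tls and tlsCAFile option names accepted by current pymongo releases.

```python
from urllib.parse import quote_plus


def build_docdb_uri(username, password, host, port,
                    ca_file="rds-combined-ca-bundle.pem"):
    """Assemble a connection string, escaping special characters in the credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    options = (
        f"tls=true&tlsCAFile={ca_file}"
        "&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
    )
    return f"mongodb://{user}:{pwd}@{host}:{port}/?{options}"


def check_connection(client):
    """Send a 'ping' command; pymongo raises an exception if the server is unreachable."""
    return client.admin.command("ping")


# Placeholder values for illustration only
uri = build_docdb_uri("docdbadmin", "p@ss/word", "example-cluster.local", 27017)
print(uri)
```

In the notebook, you would pass the values extracted from the secret to build_docdb_uri, create the MongoClient from the result, and then call check_connection(client).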

Preview images

We use the ipyplot library to preview the images that were downloaded onto our SageMaker instance using the following code:

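import glob
import ipyplot

# Get image paths (local_prefix is the local folder that holds the sample images)
pic_local_paths = glob.glob(f"{local_prefix}/*.jpg")
pic_local_paths = sorted(pic_local_paths)

# Preview images
ipyplot.plot_images(pic_local_paths, max_images=15, img_width=200)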


Upload images to Amazon S3

After you verify the images, upload the images to your S3 bucket for Amazon Rekognition to access and analyze:

# Create the bucket resource once and reuse it for every upload
bucket = boto3.Session().resource('s3').Bucket(s3_bucket)
for pic_local_path in pic_local_paths:
    pic_filename = os.path.basename(pic_local_path)
    bucket.Object(os.path.join(s3_prefix, pic_filename)).upload_file(pic_local_path)

Then we get the Amazon S3 keys for the images, to tell the Amazon Rekognition API where the images are for analysis:

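import s3fs

fs = s3fs.S3FileSystem()
# fs.ls returns paths of the form '<bucket>/<key>'; [1:] drops the first listing entry
pic_keylist = fs.ls(f's3://{s3_bucket}/{s3_prefix}/')[1:]
# Strip the leading bucket name to obtain the S3 keys
pic_keylist = [key.split(f'{s3_bucket}/', 1)[1] for key in pic_keylist]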

Ingest analysis results from Amazon Rekognition into Amazon DocumentDB

Next, we loop over every image, analyzing each one using the Amazon Rekognition API, and ingesting the analysis output into Amazon DocumentDB. The results of each image analysis are stored as a document, and all these documents are stored within a collection. Apart from ingesting the analysis results from Amazon Rekognition, we also store each image’s Amazon S3 key, which is used as a unique identifier. See the following code:

# Rekognition client (skip this line if it was already created earlier in the notebook)
rekognition = boto3.client('rekognition')

for pic_key in pic_keylist:
    # Analyze image with Rekognition
    pic_result = rekognition.detect_labels(
        Image={'S3Object': {'Bucket': s3_bucket, 'Name': pic_key}},
        MinConfidence=50,
        MaxLabels=100)
    # Extract the file name from the S3 key, and the image labels
    doc = {
        "img": pic_key.split('/')[-1],
        "Labels": pic_result['Labels']
    }
    # Ingest data into DocumentDB
    coll.insert_one(doc)
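One thing to keep in mind is that insert_one adds a new document on every run, so running the loop twice would store each image twice. As a hedged sketch of one alternative, the following helpers build the same document and then use pymongo's replace_one with upsert=True, keyed on the img field. The helper names build_doc and upsert_doc and the sample key are our own.

```python
def build_doc(pic_key, rekognition_response):
    """Keep only the file name and the labels; ResponseMetadata is dropped."""
    return {
        "img": pic_key.split("/")[-1],
        "Labels": rekognition_response["Labels"],
    }


def upsert_doc(coll, doc):
    """Replace the document for this image if it exists, otherwise insert it."""
    return coll.replace_one({"img": doc["img"]}, doc, upsert=True)


# Made-up key and a one-label response, for illustration only
sample_response = {
    "Labels": [{"Name": "Garden", "Confidence": 96.2, "Instances": [], "Parents": []}],
    "ResponseMetadata": {"HTTPStatusCode": 200},
}
sample_doc = build_doc("images/coleen-rivas-AbC123xyz-unsplash.jpg", sample_response)
print(sample_doc["img"])
```

Inside the ingestion loop, upsert_doc(coll, build_doc(pic_key, pic_result)) would take the place of coll.insert_one(doc).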

Explore image labels using Amazon DocumentDB queries

We can now explore the image labels using Amazon DocumentDB queries.

Frequency counts

As is a common first step in data science, we want to explore the data to get some general descriptive statistics. We can use database operations to calculate some of these basic descriptive statistics.

To get a count of the number of images we ingested, we use the count_documents() command:

coll.count_documents({})
> 15

The count_documents() command gets the number of documents in a collection. The output from Amazon Rekognition for each image is recorded as a document, and coll is the name of the collection.

Across the 15 images, Amazon Rekognition detected multiple entities. To see the frequency of each entity label, we query the database using the aggregate command. The following query counts the number of times each label appears with a confidence score of 90% or more, and then sorts the results in descending order of counts:

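import pandas as pd

result = pd.DataFrame(coll.aggregate([
    {"$unwind": "$Labels"},                                  # one record per label
    {"$match": {"Labels.Confidence": {"$gte": 90.0}}},       # keep confident labels
    {"$group": {"_id": "$Labels.Name", "count": {"$sum": 1}}},  # count per label name
    {"$sort": {"count": -1}}                                 # most frequent first
]))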

We wrap the output of the preceding query in pd.DataFrame() to convert the results to a DataFrame. This allows us to generate visualizations such as the following.
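To see what each stage of the aggregation contributes, it can help to write the same computation in plain Python over a list of documents. This is only an in-memory illustration of the logic, not how Amazon DocumentDB runs the pipeline, and the two documents below are made up.

```python
from collections import Counter


def count_labels(docs, min_confidence=90.0):
    counts = Counter()
    for doc in docs:                                   # one document per image
        for label in doc["Labels"]:                    # $unwind: one record per label
            if label["Confidence"] >= min_confidence:  # $match
                counts[label["Name"]] += 1             # $group with $sum: 1
    return counts.most_common()                        # $sort by count, descending


# Made-up documents for illustration
docs = [
    {"img": "a.jpg", "Labels": [{"Name": "Person", "Confidence": 99.1},
                                {"Name": "Book", "Confidence": 95.0}]},
    {"img": "b.jpg", "Labels": [{"Name": "Person", "Confidence": 92.3},
                                {"Name": "Book", "Confidence": 71.4}]},
]
label_counts = count_labels(docs)
print(label_counts)
```

Here the Book label in b.jpg falls below the threshold, so Book is counted once and Person twice.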


Based on the plot, Person and Human labels were the most common, with six counts each.

Select images with minimum confidence threshold

Besides labels, Amazon Rekognition also outputs the confidence level with which those labels were applied. The following query identifies the images with a Book label applied with 90% or more confidence:

# Query images with a 'Book' label of 90% or more confidence
coll.find(
    {"Labels": {"$elemMatch": {"Name": "Book", "Confidence": {"$gte": 90.0}}}},
    {"_id": 0, "img": 1}
)


We can also search for images containing multiple labels. The following query identifies images that contain the Book and Person labels, both with the minimum confidence level of 90%:

# Query images with a 'Book' label of 90% or more confidence
# and a 'Person' label of 90% or more confidence
coll.find(
    {"$and": [
        {"Labels": {"$elemMatch": {"Name": "Book", "Confidence": {"$gte": 90.0}}}},
        {"Labels": {"$elemMatch": {"Name": "Person", "Confidence": {"$gte": 90.0}}}}
    ]},
    {"_id": 0, "img": 1}
)


We can use the explain() method in the MongoDB API to determine what query plan the Amazon DocumentDB query planner used to conduct these queries:

coll.find(
    {"Labels": {"$elemMatch": {"Name": "Book", "Confidence": {"$gte": 90.0}}}},
    {"_id": 0, "img": 1}
).explain()
> {'queryPlanner': {'plannerVersion': 1, 'namespace': 'db.coll', 'winningPlan': {'stage': 'COLLSCAN'}}, 'serverInfo': {'host': 'documentdbinstancethree-haw55aziqvyy', 'port': 27017, 'version': '3.6.0'}, 'ok': 1.0}

The winningPlan field shows the plan that the Amazon DocumentDB query planner used to run this query. It chose a COLLSCAN, which is a full collection scan: every document in the collection is read and the predicate is applied to each one.

Similarly, we can see the Amazon DocumentDB query planner also chose a full collection scan for the second query:

coll.find(
    {"$and": [
        {"Labels": {"$elemMatch": {"Name": "Book", "Confidence": {"$gte": 90.0}}}},
        {"Labels": {"$elemMatch": {"Name": "Person", "Confidence": {"$gte": 90.0}}}}
    ]},
    {"_id": 0, "img": 1}
).explain()
> {'queryPlanner': {'plannerVersion': 1, 'namespace': 'db.coll', 'winningPlan': {'stage': 'COLLSCAN'}}, 'serverInfo': {'host': 'documentdbinstancethree-haw55aziqvyy', 'port': 27017, 'version': '3.6.0'}, 'ok': 1.0}

Select images with minimum confidence threshold (with index)

As with many database management systems, we can make queries perform better in Amazon DocumentDB by creating an index on commonly queried fields. In this case, we create an index on the label name and label confidence, because these are two fields we’re using in our predicate. After we create the index, we can modify our queries to use it.

To create the index, run the following:

from pymongo import ASCENDING

coll.create_index([
    ("Labels.Name", ASCENDING),
    ("Labels.Confidence", ASCENDING)],
    name="idx_labels")

With the index created, we can use the following code block to implement the query to identify images containing books. We add two extra predicates that match documents that have a Book label and that have some label with a confidence level of 90.0 or more, though not necessarily the Book label. The query planner uses the index to filter the documents on these first predicates, and then applies the $elemMatch predicate, which requires the Book label itself to have a confidence level of 90.0 or more.

# Query for 'Book' label with 90% or more confidence
query_book = coll.find(
    {"$and": [
        {"Labels.Name": "Book"},
        {"Labels.Confidence": {"$gte": 90.0}},
        {"Labels": {"$elemMatch": {"Name": "Book", "Confidence": {"$gte": 90.0}}}}
    ]},
    {"_id": 0, "img": 1}
)
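The difference between the dotted-field predicates and $elemMatch is easy to miss, so here is a small in-memory sketch of the two matching rules. With dotted fields on an array, each condition may be satisfied by a different array element, whereas $elemMatch requires a single element to satisfy all conditions. The function names and the sample document are made up for illustration.

```python
def loose_match(doc, name, min_conf):
    """Mimic {"Labels.Name": name} combined with {"Labels.Confidence": {"$gte": min_conf}}:
    each condition may be met by a different element of the Labels array."""
    labels = doc["Labels"]
    return (any(label["Name"] == name for label in labels)
            and any(label["Confidence"] >= min_conf for label in labels))


def elem_match(doc, name, min_conf):
    """Mimic $elemMatch: one single element must meet both conditions."""
    return any(label["Name"] == name and label["Confidence"] >= min_conf
               for label in doc["Labels"])


# Made-up document: a confident Person label and a weak Book label
doc = {"img": "x.jpg", "Labels": [{"Name": "Person", "Confidence": 97.0},
                                  {"Name": "Book", "Confidence": 62.0}]}
print(loose_match(doc, "Book", 90.0), elem_match(doc, "Book", 90.0))
```

The sample document passes the loose predicates but fails the $elemMatch predicate, which is why the query keeps the $elemMatch clause alongside the index-friendly predicates.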

Similarly, we can modify the query looking for both Book and Person labels as follows:

# Query for 'Book' label with 90% or more confidence and
# 'Person' label with 90% or more confidence
query_book_person = coll.find(
    {"$and": [
        {"Labels.Name": "Book"},
        {"Labels.Confidence": {"$gte": 90.0}},
        {"Labels.Name": "Person"},
        {"Labels.Confidence": {"$gte": 90.0}},  # unnecessary, but adding for clarity
        {"Labels": {"$elemMatch": {"Name": "Book", "Confidence": {"$gte": 90.0}}}},
        {"Labels": {"$elemMatch": {"Name": "Person", "Confidence": {"$gte": 90.0}}}}
    ]},
    {"_id": 0, "img": 1}
)

To validate that the Amazon DocumentDB query planner is, in fact, using the index we created, we can again use the explain() method. When we add this method to the query, we can observe the plan that Amazon DocumentDB chose, namely the winningPlan field. It used an IXSCAN stage, indicating that it used the index for this query. This is more efficient than scanning all documents in the collection and applying the predicates to each one.

query_book.explain()
> {'queryPlanner': {'plannerVersion': 1, 'namespace': 'db.coll', 'winningPlan': {'stage': 'FETCH', 'inputStage': {'stage': 'IXSCAN', 'indexName': 'idx_labels'}}}, 'serverInfo': {'host': 'documentdbinstanceone-ba0lmvhl0dml', 'port': 27017, 'version': '3.6.0'}, 'ok': 1.0}

query_book_person.explain()
> {'queryPlanner': {'plannerVersion': 1, 'namespace': 'db.coll', 'winningPlan': {'stage': 'FETCH', 'inputStage': {'stage': 'IXSCAN', 'indexName': 'idx_labels'}}}, 'serverInfo': {'host': 'documentdbinstancetwo-iulkk0vmfiln', 'port': 27017, 'version': '3.6.0'}, 'ok': 1.0}
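When checking many queries, reading the nested explain output by eye gets tedious. The following sketch pulls the chain of stages out of an explain result by following the inputStage links. The helper name plan_stages is our own, and the two dictionaries are shortened copies of the outputs shown in this post.

```python
def plan_stages(explain_output):
    """Return the stage names of the winning plan, outermost stage first."""
    stages = []
    stage = explain_output["queryPlanner"]["winningPlan"]
    while stage is not None:
        stages.append(stage["stage"])
        stage = stage.get("inputStage")  # None once the innermost stage is reached
    return stages


# Shortened copies of the explain outputs shown in this post
collscan_plan = {"queryPlanner": {"plannerVersion": 1, "namespace": "db.coll",
                                  "winningPlan": {"stage": "COLLSCAN"}}, "ok": 1.0}
ixscan_plan = {"queryPlanner": {"plannerVersion": 1, "namespace": "db.coll",
                                "winningPlan": {"stage": "FETCH",
                                                "inputStage": {"stage": "IXSCAN",
                                                               "indexName": "idx_labels"}}},
               "ok": 1.0}

print(plan_stages(collscan_plan))
print(plan_stages(ixscan_plan))
```

A plan whose stage list contains IXSCAN used an index; a plan consisting only of COLLSCAN did not.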

Select images with specified number instances of a label (array queries)

Besides identifying images with a particular label, you can also specify the number of detected instances of that label. To find all images in which the Person label was applied with 90% or more confidence and at least four instances of it were located, use the following query:

coll.find(
    {"Labels": {"$elemMatch": {"Name": "Person", "Confidence": {"$gte": 90.0}, "Instances.3": {"$exists": True}}}},
    {"_id": 0, "img": 1}
)

The query checks whether the fourth instance, Instances.3, exists. Array positions are counted from zero, so index 3 refers to the fourth element.


You can also set a maximum limit for the number of instances. The following query selects all images with at least two but fewer than four instances of a Person label with 90% or more confidence:

coll.find(
    {"Labels": {"$elemMatch": {
        "Name": "Person",
        "Confidence": {"$gte": 90.0},
        "Instances.1": {"$exists": True},
        "Instances.3": {"$exists": False}}}},
    {"_id": 0, "img": 1}
)


Looking closer, we can see that the first image actually contains many people. Possibly due to how small they appear, fewer than four were detected.
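The two instance-count queries above follow one pattern: "at least n" means the element at zero-based index n - 1 exists, and "fewer than m" means the element at index m - 1 does not exist. As a rough sketch, the following helper builds such a filter for any label; the function name and its parameters are our own and are not part of pymongo.

```python
def instance_count_filter(name, min_confidence, at_least=None, fewer_than=None):
    """Build a find() filter on the number of located instances of one label."""
    conditions = {"Name": name, "Confidence": {"$gte": min_confidence}}
    if at_least:
        # The element at zero-based index at_least - 1 must exist
        conditions[f"Instances.{at_least - 1}"] = {"$exists": True}
    if fewer_than:
        # The element at zero-based index fewer_than - 1 must be absent
        conditions[f"Instances.{fewer_than - 1}"] = {"$exists": False}
    return {"Labels": {"$elemMatch": conditions}}


# At least two but fewer than four Person instances
query = instance_count_filter("Person", 90.0, at_least=2, fewer_than=4)
print(query)
```

The resulting dictionary can be passed as the first argument to coll.find(), with the projection {"_id": 0, "img": 1} as the second.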

To perform the preceding analysis with your own album, you can replace the sample pictures in Amazon S3 with your own pictures.

Clean up resources

To save cost, delete the CloudFormation stack you created. This removes all the resources you provisioned using the CloudFormation template, including the Amazon VPC, Amazon DocumentDB cluster, and SageMaker notebook instance. For instructions, see Deleting a stack on the AWS CloudFormation console. You should also delete the S3 bucket that you created, along with the images it contains.


In this post, we analyzed images using Amazon Rekognition, ingested the output into Amazon DocumentDB, and explored the results using queries implemented in SageMaker. For another example of how to use SageMaker to analyze and store data in Amazon DocumentDB for an ML use case, see Analyzing data stored in Amazon DocumentDB (with MongoDB compatibility) using Amazon SageMaker.

Amazon DocumentDB provides you with several capabilities that help you back up and restore your data based on your use case. For more information, see Best Practices for Amazon DocumentDB. If you’re new to Amazon DocumentDB, see Get Started with Amazon DocumentDB. If you’re planning to migrate to Amazon DocumentDB, see Migrating to Amazon DocumentDB.

About the Authors

Annalyn Ng is a Senior Solutions Architect based in Singapore, where she designs and builds cloud solutions for public sector agencies. Annalyn graduated from the University of Cambridge, and blogs about machine learning at algobeans.com. Her book, Numsense! Data Science for the Layman, has been translated into multiple languages and is used in top universities as a reference text.

Brian Hess is a Senior Analytics Platform Specialist at AWS. He has been in the data and analytics space for over 20 years and has extensive experience in roles including solutions architect, product management, and director of advanced analytics.

Read more about this on: AWS