In Part 1 of this series, we discussed intelligent document processing (IDP), and how IDP can accelerate claims processing use cases in the insurance industry. We discussed how we can use AWS AI services to accurately categorize claims documents along with supporting documents. We also discussed how to extract information from the various types of documents in an insurance claims package, such as forms, tables, or specialized documents such as invoices, receipts, or ID documents. We looked into the challenges of legacy document processes, which are time-consuming, error-prone, expensive, and difficult to scale, and how you can use AWS AI services to help implement your IDP pipeline.

In this post, we walk you through advanced IDP features for document extraction, querying, and enrichment. We also look into how to further use the extracted structured information from claims data to get insights using AWS Analytics and visualization services. We highlight how extracted structured data from IDP can help detect fraudulent claims using AWS Analytics services.

Solution overview

The following diagram illustrates the phases of IDP using AWS AI services. In Part 1, we discussed the first three phases of the IDP workflow. In this post, we expand on the extraction step and the remaining phases, which include integrating IDP with AWS Analytics services.

The different phases of intelligent document processing in insurance industry

We use these analytics services for further insights and visualizations, and to detect fraudulent claims using structured, normalized data from IDP. The following diagram illustrates the solution architecture.

IDP architecture diagram

The phases we discuss in this post use the following key services:

  • Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) models that have been pre-trained to understand and extract health data from medical text, such as prescriptions, procedures, or diagnoses.
  • AWS Glue is a part of the AWS Analytics services stack, and is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development.
  • Amazon Redshift is another service in the Analytics stack. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.


Before you get started, refer to Part 1 for a high-level overview of the insurance use case with IDP and details about the data capture and classification stages.

For more information regarding the code samples, refer to our GitHub repo.

Extraction phase

In Part 1, we saw how to use Amazon Textract APIs to extract information like forms and tables from documents, and how to analyze invoices and identity documents. In this post, we enhance the extraction phase with Amazon Comprehend to extract default and custom entities specific to custom use cases.

Insurance carriers often come across dense text in insurance claims applications, such as a patient’s discharge summary letter (see the following example image). It can be difficult to automatically extract information from such documents because they have no definite structure. To address this, we can use the following methods to extract key business information from the document:

Discharge summary sample

Extract default entities with the Amazon Comprehend DetectEntities API

We run the following code on the sample medical transcription document:

import boto3

comprehend = boto3.client('comprehend')

# Detect the default (pre-trained) entities in the input text
response = comprehend.detect_entities(Text=text, LanguageCode='en')

# Print entities from the response JSON
for entity in response['Entities']:
    print(f'{entity["Type"]} : {entity["Text"]}')

The following screenshot shows a collection of entities identified in the input text. The output has been shortened for the purposes of this post. Refer to the GitHub repo for a detailed list of entities.

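The Entities list in the DetectEntities response can be post-processed before it is passed downstream. The following is a minimal sketch, not part of the original notebook, that groups entity values by type and drops low-confidence detections. The group_entities() helper, the 0.8 score threshold, and the sample response are assumptions made for illustration; only the Type, Text, and Score keys come from the API response.

```python
from collections import defaultdict


def group_entities(response, min_score=0.8):
    """Group DetectEntities results by type, keeping confident, unique values."""
    grouped = defaultdict(list)
    for entity in response.get("Entities", []):
        if entity["Score"] < min_score:
            continue  # skip low-confidence detections
        if entity["Text"] not in grouped[entity["Type"]]:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# Hypothetical response fragment, shaped like the DetectEntities output
sample_response = {
    "Entities": [
        {"Type": "PERSON", "Text": "John Doe", "Score": 0.99},
        {"Type": "DATE", "Text": "07-Sep-2020", "Score": 0.97},
        {"Type": "PERSON", "Text": "John Doe", "Score": 0.95},
        {"Type": "OTHER", "Text": "NARH-36640", "Score": 0.42},
    ]
}

print(group_entities(sample_response))
```

Lowering min_score keeps more candidates at the cost of more noise, so the threshold is worth tuning against a reviewed sample of documents.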

Extract custom entities with Amazon Comprehend custom entity recognition

The response from the DetectEntities API includes the default entities. However, we’re interested in knowing specific entity values, such as the patient’s name (denoted by the default entity PERSON), or the patient’s ID (denoted by the default entity OTHER). To recognize these custom entities, we train an Amazon Comprehend custom entity recognizer model. We recommend following the comprehensive steps on how to train and deploy a custom entity recognition model in the GitHub repo.

After we deploy the custom model, we can use the helper function get_entities() to retrieve custom entities like PATIENT_NAME and PATIENT_D from the API response:

import pandas as pd

def get_entities(text):
    try:
        # Detect entities with the custom entity recognizer endpoint
        entities_custom = comprehend.detect_entities(LanguageCode="en", Text=text, EndpointArn=ER_ENDPOINT_ARN)
        df_custom = pd.DataFrame(entities_custom["Entities"], columns=['Text', 'Type', 'Score'])
        df_custom = df_custom.drop_duplicates(subset=['Text']).reset_index()
        return df_custom
    except Exception as e:
        print(e)

# Call the get_entities() function
response = get_entities(text)
# Print the response from the get_entities() function
print(response)

The following screenshot shows our results.

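When the same custom entity type is detected more than once, a downstream system usually needs a single value per field. The following is a minimal sketch, assuming that the highest-scoring detection is the one to keep; the best_entity_per_type() helper and the sample values are hypothetical and not part of the original notebook.

```python
def best_entity_per_type(entities):
    """Return {Type: Text}, keeping the highest-scoring entity for each type."""
    best = {}
    for entity in entities:
        current = best.get(entity["Type"])
        if current is None or entity["Score"] > current["Score"]:
            best[entity["Type"]] = entity
    return {etype: entity["Text"] for etype, entity in best.items()}


# Hypothetical entities, shaped like the "Entities" list in the API response
sample_entities = [
    {"Type": "PATIENT_NAME", "Text": "John Doe", "Score": 0.98},
    {"Type": "PATIENT_NAME", "Text": "Doe", "Score": 0.61},
    {"Type": "PATIENT_D", "Text": "NARH-36640", "Score": 0.93},
]

print(best_entity_per_type(sample_entities))
```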

Enrichment phase

In the document enrichment phase, we perform enrichment functions on healthcare-related documents to draw valuable insights. We look at the following types of enrichment:

  • Extract domain-specific language – We use Amazon Comprehend Medical to extract medical-specific ontologies like ICD-10-CM, RxNorm, and SNOMED CT
  • Redact sensitive information – We use Amazon Comprehend to redact personally identifiable information (PII), and Amazon Comprehend Medical for protected health information (PHI) redaction

Extract medical information from unstructured medical text

Documents such as medical providers’ notes and clinical trial reports include dense medical text. Insurance claims carriers need to identify the relationships among the extracted health information from this dense text and link them to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT codes. This is very valuable in automating claim capture, validation, and approval workflows for insurance companies to accelerate and simplify claim processing. Let’s look at how we can use the Amazon Comprehend Medical InferICD10CM API to detect possible medical conditions as entities and link them to their codes:

comprehend_med = boto3.client('comprehendmedical')

# Detect medical conditions and link them to ICD-10-CM codes
cm_json_data = comprehend_med.infer_icd10_cm(Text=text)

print("\nMedical coding\n========")
for entity in cm_json_data["Entities"]:
    for icd in entity["ICD10CMConcepts"]:
        description = icd['Description']
        code = icd["Code"]
        print(f'{description}: {code}')

For the input text, which we can pass in from the Amazon Textract DetectDocumentText API, the InferICD10CM API returns the following output (abbreviated for this post).


Similarly, we can use the Amazon Comprehend Medical InferRxNorm API to identify medications and the InferSNOMEDCT API to detect medical entities within healthcare-related insurance documents.
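Each entity in the InferICD10CM response can carry several candidate concepts, each with its own score. The following is a minimal sketch, assuming that we want only the highest-scoring concept per detected condition and that a cutoff of 0.5 is acceptable; the top_icd10_codes() helper, the cutoff, and the sample response are illustrative assumptions rather than part of the original solution.

```python
def top_icd10_codes(cm_response, min_score=0.5):
    """For each detected condition, keep the highest-scoring ICD-10-CM concept."""
    results = []
    for entity in cm_response.get("Entities", []):
        concepts = entity.get("ICD10CMConcepts", [])
        if not concepts:
            continue
        best = max(concepts, key=lambda concept: concept["Score"])
        if best["Score"] >= min_score:
            results.append((entity["Text"], best["Code"], best["Description"]))
    return results


# Hypothetical response fragment, shaped like the InferICD10CM output
sample_cm_response = {
    "Entities": [
        {"Text": "asthma", "ICD10CMConcepts": [
            {"Description": "Unspecified asthma, uncomplicated", "Code": "J45.909", "Score": 0.82},
            {"Description": "Mild intermittent asthma, uncomplicated", "Code": "J45.20", "Score": 0.41},
        ]},
        {"Text": "hypertension", "ICD10CMConcepts": [
            {"Description": "Essential (primary) hypertension", "Code": "I10", "Score": 0.35},
        ]},
    ]
}

for text_span, code, description in top_icd10_codes(sample_cm_response):
    print(f"{text_span} -> {code} ({description})")
```

Conditions that fall below the cutoff are good candidates for the human review step rather than being silently dropped.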

Perform PII and PHI redaction

Insurance claims packages are subject to extensive privacy compliance requirements and regulations because they contain both PII and PHI data. Insurance carriers can reduce compliance risk by redacting information like policy numbers or the patient’s name.

Let’s look at an example of a patient’s discharge summary. We use the Amazon Comprehend DetectPiiEntities API to detect PII entities within the document and protect the patient’s privacy by redacting these entities:

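from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

# Extract the text lines from the sample document with Amazon Textract
resp = call_textract(input_document=f's3://{data_bucket}/idp/textract/dr-note-sample.png')
text = get_string(textract_json=resp, output_type=[Textract_Pretty_Print.LINES])

# Call the Amazon Comprehend DetectPiiEntities API
entity_resp = comprehend.detect_pii_entities(Text=text, LanguageCode="en")

# Collect the type and matched text of each PII entity
pii = []
for entity in entity_resp['Entities']:
    pii_entity = {}
    pii_entity['Type'] = entity['Type']
    pii_entity['Text'] = text[entity['BeginOffset']:entity['EndOffset']]
    pii.append(pii_entity)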

We get the following PII entities in the response from the detect_pii_entities() API:

response from the detect_pii_entities() API
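The BeginOffset and EndOffset values of each PII entity can also be used to mask the extracted text itself, which is useful when the text is stored or passed to other systems. The following is a minimal sketch, assuming that each span is replaced with its entity type in brackets; the redact_text() helper and the sample text are hypothetical.

```python
def redact_text(text, entities):
    """Replace each detected PII span with its entity type in brackets."""
    redacted = text
    # Work from the end of the text so that earlier offsets remain valid
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:start] + f"[{entity['Type']}]" + redacted[end:]
    return redacted


# Hypothetical text and entities, shaped like the DetectPiiEntities output
sample_text = "Patient John Doe was discharged on 07-Sep-2020."
sample_pii = [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "DATE_TIME", "BeginOffset": 35, "EndOffset": 46},
]

print(redact_text(sample_text, sample_pii))
```

Processing the spans in reverse order matters here because each replacement changes the length of the string.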

We can then redact the PII entities that were detected from the documents by utilizing the bounding box geometry of the entities from the document. For that, we use a helper tool called amazon-textract-overlayer. For more information, refer to Textract-Overlayer. The following screenshots compare a document before and after redaction.


Similar to the Amazon Comprehend DetectPiiEntities API, we can also use the DetectPHI API to detect PHI data in the clinical text being examined. For more information, refer to Detect PHI.

Review and validation phase

In the document review and validation phase, we can verify whether the claim package meets the business’s requirements, because we have collected all the information from the documents in the package in the earlier stages. We can do this by introducing a human in the loop who reviews and validates all the fields, or by using an auto-approval process for low-dollar claims, before sending the package to downstream applications. We can use Amazon Augmented AI (Amazon A2I) to automate the human review process for insurance claims processing.
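The routing decision between auto-approval and human review can be expressed as a simple business rule. The following is a minimal sketch under stated assumptions: the route_claim() helper, the 500.00 auto-approval limit, the 0.9 confidence threshold, and the field names are all hypothetical, and in a real pipeline the HUMAN_REVIEW outcome would start an Amazon A2I human loop instead of returning a label.

```python
def route_claim(total_charges, field_confidences,
                auto_approve_limit=500.0, min_confidence=0.9):
    """Decide whether a claim can be auto-approved or needs human review.

    Returns a (decision, fields_to_review) tuple.
    """
    # Any field extracted with low confidence should be checked by a reviewer
    low_confidence = [name for name, score in field_confidences.items()
                      if score < min_confidence]
    if low_confidence:
        return "HUMAN_REVIEW", low_confidence
    # Low-dollar claims with confident extraction can skip manual review
    if total_charges <= auto_approve_limit:
        return "AUTO_APPROVE", []
    return "HUMAN_REVIEW", []


# Hypothetical extraction confidences for two claims
print(route_claim(120.0, {"policy_number": 0.99, "total_charges": 0.97}))
print(route_claim(120.0, {"policy_number": 0.62, "total_charges": 0.97}))
```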


Now that we have all required data extracted and normalized from claims processing using AI services for IDP, we can extend the solution to integrate with AWS Analytics services such as AWS Glue and Amazon Redshift to solve additional use cases and provide further analytics and visualizations.

Detect fraudulent insurance claims

In this post, we implement a serverless architecture where the extracted and processed data is stored in a data lake and is used to detect fraudulent insurance claims using ML. We use Amazon Simple Storage Service (Amazon S3) to store the processed data. We can then use AWS Glue or Amazon EMR to cleanse the data and add additional fields to make it consumable for reporting and ML. After that, we use Amazon Redshift ML to build a fraud detection ML model. Finally, we build reports using Amazon QuickSight to get insights into the data.

Set up an Amazon Redshift external schema

For the purpose of this example, we have created a sample dataset that emulates the output of an ETL (extract, transform, and load) process, and use the AWS Glue Data Catalog as the metadata catalog. First, we create a database named idp_demo in the Data Catalog and an external schema in Amazon Redshift called idp_insurance_demo (see the following code). We use an AWS Identity and Access Management (IAM) role to grant permissions to the Amazon Redshift cluster to access Amazon S3 and Amazon SageMaker. For more information about how to set up this IAM role with least privilege, refer to Cluster and configure setup for Amazon Redshift ML administration.

CREATE EXTERNAL SCHEMA idp_insurance_demo
FROM DATA CATALOG
DATABASE 'idp_demo'
IAM_ROLE '<<<your IAM Role here>>>';

Create Amazon Redshift external table

The next step is to create an external table in Amazon Redshift referencing the S3 location where the file is located. In this case, our file is a comma-separated text file. We also want to skip the header row from the file, which can be configured in the table properties section. See the following code:

create external table idp_insurance_demo.claims(
id INTEGER,
date_of_service date,
patients_address_city VARCHAR,
patients_address_state VARCHAR,
patients_address_zip VARCHAR,
patient_status VARCHAR,
insured_address_state VARCHAR,
insured_address_zip VARCHAR,
insured_date_of_birth date,
insurance_plan_name VARCHAR,
total_charges DECIMAL(14,4),
fraud VARCHAR,
duplicate VARCHAR,
invalid_claim VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location '<<<S3 path where file is located>>>'
table properties ( 'skip.header.line.count'='1');

Create training and test datasets

After we create the external table, we prepare our dataset for ML by splitting it into a training set and a test set. We create a new external table called claims_train, which consists of all records with ID <= 85000 from the claims table. This is the training set that we train our ML model on.

create external table idp_insurance_demo.claims_train
row format delimited
fields terminated by ','
stored as textfile
location '<<<S3 path where file is located>>>/train'
table properties ( 'skip.header.line.count'='1')
AS select * from idp_insurance_demo.claims where id <= 850000;

We create another external table called claims_test that consists of all records with ID > 85000 to be the test set that we test the ML model on:

create external table idp_insurance_demo.claims_test
row format delimited
fields terminated by ','
stored as textfile
location '<<<S3 path where file is located>>>/test'
table properties ( 'skip.header.line.count'='1')
AS select * from idp_insurance_demo.claims where id > 850000;
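The split is a simple threshold on the record ID, so every record lands in exactly one of the two sets. The following is a minimal Python sketch of the same logic on in-memory records; the split_by_id() helper and the toy data are assumptions for illustration, and the actual split in this post is performed by the SQL statements above.

```python
def split_by_id(records, max_train_id):
    """Split records into a training set (id <= max_train_id) and a test set."""
    train = [record for record in records if record["id"] <= max_train_id]
    test = [record for record in records if record["id"] > max_train_id]
    return train, test


# Hypothetical toy dataset of 10 claims with sequential IDs
sample_records = [{"id": claim_id, "total_charges": 100.0 * claim_id}
                  for claim_id in range(1, 11)]

train_set, test_set = split_by_id(sample_records, max_train_id=8)
print(len(train_set), len(test_set))
```

A threshold split like this assumes that the IDs are not correlated with the target; if they were assigned in a meaningful order, a random split would be a safer choice.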

Create an ML model with Amazon Redshift ML

Now we create the model using the CREATE MODEL command (see the following code). We select the relevant columns from the claims_train table that can determine a fraudulent transaction. The goal of this model is to predict the value of the fraud column; therefore, fraud is added as the prediction target. After the model is trained, it creates a function named insurance_fraud_model. This function is used for inference while running SQL statements to predict the value of the fraud column for new records.

CREATE MODEL idp_insurance_demo.insurance_fraud_model
FROM (SELECT total_charges,
fraud,
duplicate,
invalid_claim
FROM idp_insurance_demo.claims_train
)
TARGET fraud
FUNCTION insurance_fraud_model
IAM_ROLE '<<<your IAM Role here>>>'
SETTINGS (
S3_BUCKET '<<<S3 bucket where model artifacts will be stored>>>'
);

Evaluate ML model metrics

After we create the model, we can run queries to check the accuracy of the model. We use the insurance_fraud_model function to predict the value of the fraud column for new records. Run the following query on the claims_test table to create a confusion matrix:

SELECT fraud,
idp_insurance_demo.insurance_fraud_model (total_charges, duplicate, invalid_claim) AS fraud_calculated,
COUNT(1) AS claim_count
FROM idp_insurance_demo.claims_test
GROUP BY fraud, fraud_calculated;
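The rows of the confusion matrix can be turned into summary metrics such as accuracy, precision, and recall. The following is a minimal sketch, assuming that each row is an (actual, predicted, count) tuple and that the fraud column uses "Y" for fraudulent claims; the classification_metrics() helper, the label values, and the counts are hypothetical and do not come from the sample dataset.

```python
def classification_metrics(rows, positive="Y"):
    """Compute accuracy, precision, and recall from confusion matrix rows.

    Each row is (actual_label, predicted_label, count).
    """
    tp = fp = fn = tn = 0
    for actual, predicted, count in rows:
        if actual == positive and predicted == positive:
            tp += count
        elif actual != positive and predicted == positive:
            fp += count
        elif actual == positive:
            fn += count
        else:
            tn += count
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical query result: (fraud, predicted fraud, number of claims)
sample_rows = [("N", "N", 900), ("N", "Y", 20), ("Y", "N", 30), ("Y", "Y", 50)]

print(classification_metrics(sample_rows))
```

For fraud detection, recall is often the metric to watch, because fraudulent claims are typically a small share of the data and accuracy alone can look high even when many of them are missed.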

Detect fraud using the ML model

After we create the new model, as new claims data is inserted into the data warehouse or data lake, we can use the insurance_fraud_model function to calculate the fraudulent transactions. We do this by first loading the new data into a temporary table. Then we use the insurance_fraud_model function to calculate the fraud flag for each new transaction and insert the data along with the flag into the final table, which in this case is the claims table.

Visualize the claims data

When the data is available in Amazon Redshift, we can create visualizations using QuickSight. We can then share the QuickSight dashboards with business users and analysts. To create the QuickSight dashboard, you first need to create an Amazon Redshift dataset in QuickSight. For instructions, refer to Creating a dataset from a database.

After you create the dataset, you can create a new analysis in QuickSight using the dataset. The following are some sample reports we created:

  • Total number of claims by state, grouped by the fraud field – This chart shows us the proportion of fraudulent transactions compared to the total number of transactions in a particular state.
  • Sum of the total dollar value of the claims, grouped by the fraud field – This chart shows us the proportion of dollar amount of fraudulent transactions compared to the total dollar amount of transactions in a particular state.
  • Total number of transactions per insurance company, grouped by the fraud field – This chart shows us how many claims were filed for each insurance company and how many of them are fraudulent.


  • Total sum of fraudulent transactions by state displayed on a US map – This chart shows only the fraudulent transactions and displays the total charges for those transactions by state on the map. A darker shade of blue indicates higher total charges. We can further analyze this by city within that state and by zip code within the city to better understand the trends.

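The aggregations behind these reports are straightforward group-by operations. The following is a minimal Python sketch of the per-state summary, using the patients_address_state, total_charges, and fraud columns from the claims table; the summarize_by_state() helper, the "Y" fraud value, and the sample rows are assumptions for illustration, and QuickSight performs the equivalent aggregation when you build the visuals.

```python
from collections import defaultdict


def summarize_by_state(claims, fraud_value="Y"):
    """Aggregate claim counts and fraudulent charges per state."""
    summary = defaultdict(lambda: {"claims": 0, "fraud_claims": 0, "fraud_charges": 0.0})
    for claim in claims:
        row = summary[claim["patients_address_state"]]
        row["claims"] += 1
        if claim["fraud"] == fraud_value:
            row["fraud_claims"] += 1
            row["fraud_charges"] += claim["total_charges"]
    return dict(summary)


# Hypothetical claims, using column names from the claims table
sample_claims = [
    {"patients_address_state": "CA", "total_charges": 1200.0, "fraud": "Y"},
    {"patients_address_state": "CA", "total_charges": 300.0, "fraud": "N"},
    {"patients_address_state": "TX", "total_charges": 800.0, "fraud": "Y"},
    {"patients_address_state": "CA", "total_charges": 500.0, "fraud": "Y"},
]

for state, row in sorted(summarize_by_state(sample_claims).items()):
    share = row["fraud_claims"] / row["claims"]
    print(state, row, f"fraud share: {share:.0%}")
```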

Clean up

To prevent incurring future charges to your AWS account, delete the resources that you provisioned in the setup by following the instructions in the Cleanup section in our repo.


In this two-part series, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We explored a claims processing use case in the insurance industry and how IDP can help automate this use case using services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I. In Part 1, we demonstrated how to use AWS AI services for document extraction. In Part 2, we extended the extraction phase and performed data enrichment. Finally, we extended the structured data extracted from IDP for further analytics, and created visualizations to detect fraudulent claims using AWS Analytics services.

We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. To learn more about the pricing of the solution, review the pricing details of Amazon Textract, Amazon Comprehend, and Amazon A2I.

About the Authors

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Uday Narayanan is an Analytics Specialist Solutions Architect at AWS. He enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are data analytics, big data systems, and machine learning. In his spare time, he enjoys playing sports, binge-watching TV shows, and traveling.

Sonali Sahu is leading the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.

Read more about this on: AWS