In Part 1 of this series, we discussed intelligent document processing (IDP) and how IDP can accelerate claims processing use cases in the insurance industry. We discussed how we can use AWS AI services to accurately categorize claims documents along with supporting documents. We also discussed how to extract the various types of documents in an insurance claims package, such as forms and tables, or specialized documents such as invoices, receipts, and ID documents. We looked at the challenges of legacy document processes, which are time-consuming, error-prone, expensive, and difficult to scale, and how you can use AWS AI services to help implement your IDP pipeline.
In this post, we walk you through advanced IDP features for document extraction, querying, and enrichment. We also look at how to further use the extracted structured information from claims data to get insights using AWS Analytics and visualization services. We highlight how extracted structured data from IDP can help detect fraudulent claims using AWS Analytics services.
The following diagram illustrates the phases of IDP using AWS AI services. In Part 1, we discussed the first three phases of the IDP workflow. In this post, we expand on the extraction step and the remaining phases, which include integrating IDP with AWS Analytics services.
We use these analytics services for further insights and visualizations, and to detect fraudulent claims using structured, normalized data from IDP. The following diagram illustrates the solution architecture.
The phases we discuss in this post use the following key services:
- Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that uses machine learning (ML) models that have been pre-trained to understand and extract health data from medical text, such as prescriptions, procedures, or diagnoses.
- AWS Glue is a part of the AWS Analytics services stack, and is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development.
- Amazon Redshift is another service in the Analytics stack. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Before you get started, refer to Part 1 for a high-level overview of the insurance use case with IDP and details about the data capture and classification stages.
For more information regarding the code samples, refer to our GitHub repo.
In Part 1, we saw how to use Amazon Textract APIs to extract information like forms and tables from documents, and how to analyze invoices and identity documents. In this post, we enhance the extraction phase with Amazon Comprehend to extract default and custom entities specific to custom use cases.
Insurance carriers often come across dense text in insurance claims applications, such as a patient’s discharge summary letter (see the following example image). It can be difficult to automatically extract information from such documents where there is no definite structure. To address this, we can use the following methods to extract key business information from the document:
Extract default entities with the Amazon Comprehend DetectEntities API
We run the following code on the sample medical transcription document:
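The exact snippet lives in the GitHub repo; the following is a minimal sketch of the DetectEntities call, assuming the document text has already been extracted with Amazon Textract (document_text is a placeholder):

```python
import boto3

comprehend = boto3.client("comprehend")

# Placeholder: plain text of the medical transcription document,
# for example the output of the Amazon Textract DetectDocumentText API
document_text = "..."

response = comprehend.detect_entities(Text=document_text, LanguageCode="en")

# Print each default entity with its type and confidence score
for entity in response["Entities"]:
    print(f"{entity['Type']:<15} {entity['Score']:.2f}  {entity['Text']}")
```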
The following screenshot shows a collection of entities identified in the input text. The output has been shortened for the purposes of this post. Refer to the GitHub repo for a detailed list of entities.
Extract custom entities with Amazon Comprehend custom entity recognition
The response from the DetectEntities API includes the default entities. However, we’re interested in knowing specific entity values, such as the patient’s name (denoted by the default entity PERSON) or the patient’s ID (denoted by the default entity OTHER). To recognize these custom entities, we train an Amazon Comprehend custom entity recognizer model. We recommend following the comprehensive steps on how to train and deploy a custom entity recognition model in the GitHub repo.
After we deploy the custom model, we can use the helper function get_entities() to retrieve custom entities like PATIENT_ID from the API response:
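The repo contains the actual get_entities() helper; the following is a plausible sketch of such a helper, assuming the custom recognizer has been deployed to a real-time endpoint (the endpoint ARN is hypothetical):

```python
import boto3

comprehend = boto3.client("comprehend")

# Hypothetical ARN of the deployed custom entity recognizer endpoint
CUSTOM_ER_ENDPOINT_ARN = (
    "arn:aws:comprehend:us-east-1:123456789012:entity-recognizer-endpoint/idp-insurance"
)

# Placeholder for the extracted discharge summary text
document_text = "..."

def get_entities(text, entity_type=None):
    """Call the custom recognizer endpoint and optionally filter by entity type."""
    response = comprehend.detect_entities(Text=text, EndpointArn=CUSTOM_ER_ENDPOINT_ARN)
    entities = response["Entities"]
    if entity_type:
        entities = [e for e in entities if e["Type"] == entity_type]
    return entities

# Example: retrieve the custom PATIENT_ID entity from the discharge summary
print(get_entities(document_text, entity_type="PATIENT_ID"))
```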
The following screenshot shows our results.
In the document enrichment phase, we perform enrichment functions on healthcare-related documents to draw valuable insights. We look at the following types of enrichment:
- Extract domain-specific language – We use Amazon Comprehend Medical to extract medical-specific ontologies like ICD-10-CM, RxNorm, and SNOMED CT
- Redact sensitive information – We use Amazon Comprehend to redact personally identifiable information (PII), and Amazon Comprehend Medical for protected health information (PHI) redaction
Extract medical information from unstructured medical text
Documents such as medical providers’ notes and clinical trial reports include dense medical text. Insurance claims carriers need to identify the relationships among the extracted health information from this dense text and link them to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT codes. This is very valuable in automating claim capture, validation, and approval workflows for insurance companies to accelerate and simplify claims processing. Let’s look at how we can use the Amazon Comprehend Medical InferICD10CM API to detect possible medical conditions as entities and link them to their codes:
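A minimal sketch of the InferICD10CM call, again assuming the dense medical text is already available as a string (document_text is a placeholder):

```python
import boto3

cm_client = boto3.client("comprehendmedical")

# Placeholder: dense medical text extracted with Amazon Textract
document_text = "..."

icd10_response = cm_client.infer_icd10_cm(Text=document_text)

# Each entity carries the detected condition plus ranked ICD-10-CM code candidates
for entity in icd10_response["Entities"]:
    top_codes = [concept["Code"] for concept in entity["ICD10CMConcepts"][:3]]
    print(entity["Text"], "->", top_codes)
```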
For the input text, which we can pass in from the Amazon Textract DetectDocumentText API, the InferICD10CM API returns the following output (abbreviated for brevity).
Perform PII and PHI redaction
Insurance claims packages are subject to strict privacy regulations and compliance requirements because they contain both PII and PHI data. Insurance carriers can reduce compliance risk by redacting information like policy numbers or the patient’s name.
Let’s look at an example of a patient’s discharge summary. We use the Amazon Comprehend DetectPiiEntities API to detect PII entities within the document and protect the patient’s privacy by redacting these entities:
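A minimal sketch of the call, assuming document_text holds the extracted discharge summary; the PII response returns character offsets rather than the matched text, so we slice the input to inspect each match:

```python
import boto3

comprehend = boto3.client("comprehend")

# Placeholder: discharge summary text extracted with Amazon Textract
document_text = "..."

pii_response = comprehend.detect_pii_entities(Text=document_text, LanguageCode="en")

# Print each detected PII entity with its type, score, and matched snippet
for entity in pii_response["Entities"]:
    snippet = document_text[entity["BeginOffset"]:entity["EndOffset"]]
    print(f"{entity['Type']:<12} {entity['Score']:.2f}  {snippet}")
```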
We get the following PII entities in the response from the detect_pii_entities() API:
We can then redact the detected PII entities in the document by using the bounding box geometry of each entity. For that, we use a helper tool called amazon-textract-overlayer. For more information, refer to Textract-Overlayer. The following screenshots compare a document before and after redaction.
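The repo relies on amazon-textract-overlayer to map the detected entities to bounding boxes; as a rough, library-agnostic sketch of the redaction step itself, the following assumes the boxes have already been derived in Textract’s normalized (0-1) coordinate format and uses Pillow to black them out (function names and values are illustrative):

```python
from PIL import Image, ImageDraw

def redact_boxes(image_path, boxes, output_path):
    """Draw filled black rectangles over normalized (0-1) bounding boxes.

    Each box is assumed to be a dict with Textract-style geometry keys:
    {"Left": ..., "Top": ..., "Width": ..., "Height": ...}.
    """
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for box in boxes:
        left = box["Left"] * width
        top = box["Top"] * height
        right = left + box["Width"] * width
        bottom = top + box["Height"] * height
        draw.rectangle([left, top, right, bottom], fill="black")
    image.save(output_path)

# Example usage with hypothetical values
redact_boxes(
    "discharge_summary_page1.png",
    [{"Left": 0.12, "Top": 0.08, "Width": 0.25, "Height": 0.03}],
    "discharge_summary_page1_redacted.png",
)
```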
Similar to the Amazon Comprehend DetectPiiEntities API, we can also use the Amazon Comprehend Medical DetectPHI API to detect PHI data in the clinical text being examined. For more information, refer to Detect PHI.
Review and validation phase
In the document review and validation phase, we can verify whether the claims package meets the business’s requirements, because we have all the information collected from the documents in the package from the earlier stages. We can do this by introducing a human in the loop to review and validate all the fields, or by using an auto-approval process for low-dollar claims, before sending the package to downstream applications. We can use Amazon Augmented AI (Amazon A2I) to automate the human review process for insurance claims processing.
Now that we have all required data extracted and normalized from claims processing using AI services for IDP, we can extend the solution to integrate with AWS Analytics services such as AWS Glue and Amazon Redshift to solve additional use cases and provide further analytics and visualizations.
Detect fraudulent insurance claims
In this post, we implement a serverless architecture where the extracted and processed data is stored in a data lake and is used to detect fraudulent insurance claims using ML. We use Amazon Simple Storage Service (Amazon S3) to store the processed data. We can then use AWS Glue or Amazon EMR to cleanse the data and add additional fields to make it consumable for reporting and ML. After that, we use Amazon Redshift ML to build a fraud detection ML model. Finally, we build reports using Amazon QuickSight to get insights into the data.
Set up Amazon Redshift external schema
For the purpose of this example, we created a sample dataset that emulates the output of an ETL (extract, transform, and load) process, and we use the AWS Glue Data Catalog as the metadata catalog. First, we create a database named idp_demo in the Data Catalog and an external schema in Amazon Redshift called idp_insurance_demo (see the following code). We use an AWS Identity and Access Management (IAM) role to grant permissions to the Amazon Redshift cluster to access Amazon S3 and Amazon SageMaker. For more information about how to set up this IAM role with least privilege, refer to Cluster and configure setup for Amazon Redshift ML administration.
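The repo shows the exact SQL; one way to submit the DDL is through the Amazon Redshift Data API, as in the following sketch (the cluster name, database, user, and IAM role ARN are placeholders):

```python
import boto3

rsd = boto3.client("redshift-data")

# Placeholders -- replace with your own cluster, database, user, and IAM role
cluster_id = "idp-redshift-cluster"
database = "dev"
db_user = "awsuser"
iam_role_arn = "arn:aws:iam::123456789012:role/RedshiftSpectrumAndMLRole"

create_schema_sql = f"""
CREATE EXTERNAL SCHEMA IF NOT EXISTS idp_insurance_demo
FROM DATA CATALOG
DATABASE 'idp_demo'
IAM_ROLE '{iam_role_arn}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

rsd.execute_statement(
    ClusterIdentifier=cluster_id,
    Database=database,
    DbUser=db_user,
    Sql=create_schema_sql,
)
```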
Create Amazon Redshift external table
The next step is to create an external table in Amazon Redshift referencing the S3 location where the file is located. In this case, our file is a comma-separated text file. We also want to skip the header row from the file, which can be configured in the table properties section. See the following code:
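A sketch of the DDL, submitted the same way as the schema statement; the column list and S3 location are illustrative and should match your own ETL output:

```python
import boto3

rsd = boto3.client("redshift-data")

# Illustrative columns and S3 path -- align them with the output of your ETL job
create_table_sql = """
CREATE EXTERNAL TABLE idp_insurance_demo.claims (
    id            INTEGER,
    claim_date    VARCHAR(64),
    state         VARCHAR(16),
    insurance_co  VARCHAR(128),
    total_charges DECIMAL(12, 2),
    fraud         INTEGER
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://<your-bucket>/idp-demo/claims/'
TABLE PROPERTIES ('skip.header.line.count' = '1');
"""

rsd.execute_statement(
    ClusterIdentifier="idp-redshift-cluster",  # placeholders, as before
    Database="dev",
    DbUser="awsuser",
    Sql=create_table_sql,
)
```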
Create training and test datasets
After we create the external table, we prepare our dataset for ML by splitting it into a training set and a test set. We create a new external table called claims_train, which consists of all records with ID <= 85000 from the claims table. This is the training set that we train our ML model on.
We create another external table called claims_test, which consists of all records with ID > 85000, to be the test set that we evaluate the ML model on:
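One way to materialize the split is with CREATE EXTERNAL TABLE AS statements that write each subset back to Amazon S3; the S3 locations are placeholders and the column names follow the illustrative claims table above:

```python
import boto3

rsd = boto3.client("redshift-data")

train_sql = """
CREATE EXTERNAL TABLE idp_insurance_demo.claims_train
STORED AS PARQUET
LOCATION 's3://<your-bucket>/idp-demo/claims_train/'
AS SELECT * FROM idp_insurance_demo.claims WHERE id <= 85000;
"""

test_sql = """
CREATE EXTERNAL TABLE idp_insurance_demo.claims_test
STORED AS PARQUET
LOCATION 's3://<your-bucket>/idp-demo/claims_test/'
AS SELECT * FROM idp_insurance_demo.claims WHERE id > 85000;
"""

# Submit each DDL statement separately through the Redshift Data API
for sql in (train_sql, test_sql):
    rsd.execute_statement(
        ClusterIdentifier="idp-redshift-cluster",  # placeholders, as before
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
```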
Create an ML model with Amazon Redshift ML
Now we create the model using the CREATE MODEL command (see the following code). We select the relevant columns from the claims_train table that can determine a fraudulent transaction. The goal of this model is to predict the value of the fraud column; therefore, fraud is added as the prediction target. After the model is trained, it creates a function named insurance_fraud_model. This function is used for inference while running SQL statements to predict the value of the fraud column for new records.
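A sketch of the CREATE MODEL statement; the model name, feature columns, IAM role, and S3 bucket are illustrative, while TARGET fraud and FUNCTION insurance_fraud_model follow the description above:

```python
import boto3

rsd = boto3.client("redshift-data")

create_model_sql = """
CREATE MODEL claims_fraud_detection            -- model name is illustrative
FROM (
    SELECT state, insurance_co, total_charges, fraud
    FROM idp_insurance_demo.claims_train
)
TARGET fraud                                   -- column the model learns to predict
FUNCTION insurance_fraud_model                 -- SQL function created for inference
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumAndMLRole'
SETTINGS (S3_BUCKET '<your-bucket>');
"""

rsd.execute_statement(
    ClusterIdentifier="idp-redshift-cluster",  # placeholders, as before
    Database="dev",
    DbUser="awsuser",
    Sql=create_model_sql,
)
```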
Evaluate ML model metrics
After we create the model, we can run queries to check the accuracy of the model. We use the insurance_fraud_model function to predict the value of the fraud column for new records. Run the following query on the claims_test table to create a confusion matrix:
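A sketch of such a query, grouping actual vs. predicted values of the fraud column; the feature columns passed to insurance_fraud_model must match the ones used in CREATE MODEL (here the illustrative state, insurance_co, and total_charges):

```python
import boto3

rsd = boto3.client("redshift-data")

confusion_matrix_sql = """
SELECT fraud AS actual,
       insurance_fraud_model(state, insurance_co, total_charges) AS predicted,
       COUNT(*) AS num_claims
FROM idp_insurance_demo.claims_test
GROUP BY 1, 2
ORDER BY 1, 2;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="idp-redshift-cluster",  # placeholders, as before
    Database="dev",
    DbUser="awsuser",
    Sql=confusion_matrix_sql,
)

# The query runs asynchronously; fetch the rows once it finishes:
# rsd.get_statement_result(Id=resp["Id"])
```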
Detect fraud using the ML model
After we create the new model, as new claims data is inserted into the data warehouse or data lake, we can use the insurance_fraud_model function to identify the fraudulent transactions. We do this by first loading the new data into a temporary table. Then we use the insurance_fraud_model function to calculate the fraud flag for each new transaction and insert the data along with the flag into the final table, which in this case is the claims table:
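A sketch of the scoring step, assuming the new records have first been loaded into a staging table (new_claims_staging is hypothetical) and reusing the illustrative column names from earlier:

```python
import boto3

rsd = boto3.client("redshift-data")

score_and_insert_sql = """
INSERT INTO idp_insurance_demo.claims
SELECT id,
       claim_date,
       state,
       insurance_co,
       total_charges,
       insurance_fraud_model(state, insurance_co, total_charges) AS fraud
FROM new_claims_staging;
"""

rsd.execute_statement(
    ClusterIdentifier="idp-redshift-cluster",  # placeholders, as before
    Database="dev",
    DbUser="awsuser",
    Sql=score_and_insert_sql,
)
```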
Visualize the claims data
When the data is available in Amazon Redshift, we can create visualizations using QuickSight. We can then share the QuickSight dashboards with business users and analysts. To create the QuickSight dashboard, you first need to create an Amazon Redshift dataset in QuickSight. For instructions, refer to Creating a dataset from a database.
After you create the dataset, you can create a new analysis in QuickSight using the dataset. The following are some sample reports we created:
- Total number of claims by state, grouped by the fraud field – This chart shows us the proportion of fraudulent transactions compared to the total number of transactions in a particular state.
- Sum of the total dollar value of the claims, grouped by the fraud field – This chart shows us the proportion of the dollar amount of fraudulent transactions compared to the total dollar amount of transactions in a particular state.
- Total number of transactions per insurance company, grouped by the fraud field – This chart shows us how many claims were filed for each insurance company and how many of them are fraudulent.
- Total sum of fraudulent transactions by state displayed on a US map – This chart shows only the fraudulent transactions and displays the total charges for those transactions by state on the map. A darker shade of blue indicates higher total charges. We can further analyze this by city within that state and by zip codes within the city to better understand the trends.
To prevent incurring future charges to your AWS account, delete the resources that you provisioned in the setup by following the instructions in the Cleanup section in our repo.
In this two-part series, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We explored a claims processing use case in the insurance industry and how IDP can help automate this use case using services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I. In Part 1, we demonstrated how to use AWS AI services for document extraction. In Part 2, we extended the extraction phase and performed data enrichment. Finally, we extended the structured data extracted from IDP for further analytics, and created visualizations to detect fraudulent claims using AWS Analytics services.
We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. To learn more about the pricing of the solution, review the pricing details of Amazon Textract, Amazon Comprehend, and Amazon A2I.
About the Authors
Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.
Uday Narayanan is an Analytics Specialist Solutions Architect at AWS. He enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are data analytics, big data systems, and machine learning. In his spare time, he enjoys playing sports, binge-watching TV shows, and traveling.
Sonali Sahu is leading the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.