Thursday, August 7, 2025

Streamline Multi-Page Document Processing with AI and Human Review Using Amazon Bedrock and SageMaker

Organizations across industries are grappling with high volumes of complex, multi-page documents that require intelligent processing to extract accurate information. While automation has significantly improved operational efficiency, human expertise remains crucial in certain scenarios to ensure data accuracy and quality.

In March 2025, AWS introduced Amazon Bedrock Data Automation, a cutting-edge tool that enables developers to automate the extraction of valuable insights from unstructured multimodal content such as documents, images, video, and audio. This innovative solution transforms document processing workflows by automating the extraction, transformation, and generation of insights, effectively minimizing time-consuming tasks like data preparation, model management, fine-tuning, and orchestration. With a unified multimodal inference API, Amazon Bedrock Data Automation delivers industry-leading accuracy at a lower cost compared to alternative solutions.
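As a rough sketch of what the unified inference API looks like from code, the following assembles the parameters for an asynchronous Amazon Bedrock Data Automation job with boto3. The bucket names and ARNs are hypothetical placeholders, and the API call itself is shown commented out because it requires AWS credentials and a configured project; treat the parameter shape as an assumption to verify against the service documentation.

```python
# Sketch of invoking Amazon Bedrock Data Automation via boto3.
# All bucket names, account IDs, and ARNs below are hypothetical
# placeholders; substitute values from your own account.

def build_bda_request(input_uri: str, output_uri: str,
                      project_arn: str, profile_arn: str) -> dict:
    """Assemble the parameters for an async Bedrock Data Automation job."""
    return {
        "inputConfiguration": {"s3Uri": input_uri},
        "outputConfiguration": {"s3Uri": output_uri},
        "dataAutomationConfiguration": {"dataAutomationProjectArn": project_arn},
        "dataAutomationProfileArn": profile_arn,
    }

params = build_bda_request(
    "s3://my-input-bucket/loan-application.pdf",   # hypothetical input
    "s3://my-output-bucket/bda-results/",          # hypothetical output
    "arn:aws:bedrock:us-east-1:111122223333:data-automation-project/example",
    "arn:aws:bedrock:us-east-1:111122223333:data-automation-profile/example",
)

# With credentials configured, the job would be started like this:
# import boto3
# client = boto3.client("bedrock-data-automation-runtime")
# response = client.invoke_data_automation_async(**params)
print(sorted(params))
```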

Simplifying Complex Document Processing Tasks

Amazon Bedrock Data Automation excels at streamlining complex document processing tasks, which include:

  • Document Splitting: Breaking down large documents into manageable parts.
  • Classification: Sorting documents based on specific criteria.
  • Extraction: Pulling relevant information from documents.
  • Normalization: Standardizing extracted data for consistency.
  • Validation: Ensuring the accuracy of the extracted information.

Moreover, this solution incorporates visual grounding with confidence scores for explainability, along with built-in strategies to mitigate hallucination, providing trustworthy insights from unstructured data sources. However, as advanced as Amazon Bedrock Data Automation is, there are situations where human judgment cannot be replaced. This is where the integration with Amazon SageMaker AI becomes indispensable, creating a powerful end-to-end solution.

Leveraging Human Review Loops

Integrating human review loops into the document processing workflow enables organizations to maintain the highest levels of accuracy while achieving processing efficiency. By employing human review, organizations can:

  • Validate AI predictions when confidence is low.
  • Effectively handle edge cases and exceptions.
  • Ensure regulatory compliance through appropriate oversight.
  • Maximize automation while maintaining high accuracy.
  • Create feedback loops for continuous model performance improvement.

This strategic approach allows human experts to focus on uncertain parts of documents while letting automated systems handle routine extractions.
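This division of labor can be sketched as a simple partition of extracted fields by confidence. The field names and the 0.70 threshold below are illustrative assumptions, not values from the solution itself:

```python
# Minimal sketch: split extracted fields into those accepted automatically
# and those routed to a human reviewer. The threshold and field names are
# illustrative assumptions.

REVIEW_THRESHOLD = 0.70

def partition_fields(fields: dict[str, float]) -> tuple[list[str], list[str]]:
    """Return (auto_accepted, needs_review) field names by confidence."""
    auto, review = [], []
    for name, confidence in fields.items():
        (auto if confidence >= REVIEW_THRESHOLD else review).append(name)
    return auto, review

extracted = {"applicant_name": 0.98, "loan_amount": 0.95, "signature_date": 0.42}
auto, review = partition_fields(extracted)
print(auto, review)  # routine fields vs. fields for a human reviewer
```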

Understanding Confidence Scores

Confidence scores play a critical role in determining when to invoke human review. These scores represent the percentage of certainty that Amazon Bedrock Data Automation has regarding the accuracy of its extraction. The goal is to simplify intelligent document processing (IDP) by managing accuracy calculation within the tool itself, freeing customers to tackle business challenges rather than navigating complex scoring systems.

Amazon Bedrock Data Automation optimizes its models for Expected Calibration Error (ECE), enhancing the reliability of confidence scores. In practice, confidence scores in document processing workflows are interpreted as follows:

  • High Confidence (90-100%): Very high certainty about the extraction.
  • Medium Confidence (70-89%): Reasonable certainty but potential for error exists.
  • Low Confidence (<70%): High uncertainty, typically requiring human verification.
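The banding above translates directly into code. A minimal mapping from a score in the 0.0-1.0 range to the three bands might look like this:

```python
def confidence_band(score: float) -> str:
    """Map a confidence score (0.0-1.0) to the bands described above."""
    if score >= 0.90:
        return "high"    # very high certainty about the extraction
    if score >= 0.70:
        return "medium"  # reasonable certainty, error still possible
    return "low"         # high uncertainty; route to human review

print(confidence_band(0.95), confidence_band(0.75), confidence_band(0.40))
```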

Testing Amazon Bedrock Data Automation on specific datasets is recommended to establish the confidence threshold that triggers the human review workflow.
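To make the Expected Calibration Error (ECE) metric mentioned above concrete, here is an illustrative computation: predictions are grouped into equal-width confidence bins, and ECE is the weighted average gap between each bin's mean confidence and its actual accuracy. The sample data is made up for demonstration.

```python
# Illustrative computation of Expected Calibration Error (ECE).
# Confidences and correctness labels here are made-up sample data.

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average |accuracy - mean confidence| per bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece

# Perfectly calibrated toy example: 80% confidence, 80% accurate.
conf = [0.8] * 5
hits = [1, 1, 1, 1, 0]
print(round(expected_calibration_error(conf, hits), 3))  # → 0.0
```

A lower ECE means confidence scores track real accuracy more closely, which is what makes threshold-based routing to human review trustworthy.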

Architectural Overview of the Solution

To efficiently process multi-page documents using Amazon Bedrock Data Automation and SageMaker AI, a serverless architecture is employed. The workflow consists of several steps:

  1. Documents are uploaded to an Amazon Simple Storage Service (Amazon S3) input bucket, serving as the entry point for processing.

  2. An Amazon EventBridge rule detects new documents in the S3 bucket and triggers the AWS Step Functions workflow.

  3. The Step Functions workflow initiates the bda-document-processor AWS Lambda function, which invokes Amazon Bedrock Data Automation to execute preconfigured instructions for document extraction and processing.

  4. Amazon Bedrock Data Automation analyzes the document, extracts key fields with associated confidence scores, and stores the processed output in another S3 bucket.

  5. The workflow then invokes the bda-classifier Lambda function to evaluate the confidence scores against established thresholds.

  6. Documents with low confidence scores are routed to SageMaker AI for human review, where specialized reviewers can correct any erroneously extracted fields.

  7. The validated and corrected form data from human review is saved in an S3 bucket.

  8. Finally, once SageMaker AI’s output is written to Amazon S3, the bda-a2i-aggregator AWS Lambda function updates the payload with the new values reviewed by humans, generating final, high-confidence output ready for downstream systems.
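The routing decision in step 5 can be sketched as a small Lambda handler. The event shape, field names, and 0.70 threshold below are assumptions for illustration, not the repository's actual contract:

```python
# Hedged sketch of the bda-classifier Lambda step (step 5 above): decide
# whether a processed document needs human review. The event shape and
# threshold are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.70  # assumed; tune against your own dataset

def lambda_handler(event, context=None):
    """Route the document based on field-level confidence scores."""
    fields = event.get("extracted_fields", [])
    low = [f for f in fields if f.get("confidence", 0.0) < CONFIDENCE_THRESHOLD]
    return {
        "needs_human_review": bool(low),
        "low_confidence_fields": [f["name"] for f in low],
    }

sample_event = {
    "extracted_fields": [
        {"name": "account_number", "confidence": 0.97},
        {"name": "handwritten_note", "confidence": 0.55},
    ]
}
print(lambda_handler(sample_event))
```

In the Step Functions workflow, a result with `needs_human_review` set to true would branch to the human review step, while clean documents proceed straight to aggregation.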

Deployment Prerequisites

To implement this solution, you will need the AWS Cloud Development Kit (AWS CDK), Node.js, and Docker installed on your deployment machine. A build script will facilitate the packaging and deployment of the solution.

Deployment Steps

To deploy the solution, follow these steps:

  1. Clone the solution repository to your deployment machine.

  2. Navigate to the project directory and run the build script using the command:
```bash
./build.sh
```
This will create various resources in your AWS account, including:

  • Two S3 buckets, one for document uploads and one for storing processed outputs.
  • An Amazon Bedrock Data Automation project along with five blueprints for document processing.
  • An Amazon Cognito user pool for managing a private workforce.
  • Two Lambda functions and a Step Functions workflow for document processing.
  • Two Amazon Elastic Container Registry (Amazon ECR) container images for the Lambda functions.

Adding Workers to the Private Workforce

After the initial build, you must add workers to the private workforce in SageMaker Ground Truth. You can do this by:

  1. Navigate to Ground Truth in the SageMaker AI console and select the Labeling workforces tab.

  2. In the Workers section, choose Invite new workers and enter the workers’ email addresses.

  3. Each invited worker receives an email with a temporary password.

  4. Workers open the Labeling portal sign-in URL shown on the Labeling workforces page and log in with the provided credentials.

  5. After setting a new password, workers can start on the document review tasks.

Testing the Solution

To test the completed solution, upload a test document from the assets folder to the S3 bucket for incoming documents. You can monitor the document processing via the Step Functions console or through Amazon CloudWatch. After processing, a new job will appear in SageMaker AI for the designated user.

Clean-Up Procedure

To remove all resources created as part of the solution, execute the following command from the root directory of your project:

```bash
cdk destroy
```

Summary of the Solution’s Benefits

The combination of Amazon Bedrock Data Automation and SageMaker AI represents a significant leap forward in achieving both automation efficiency and human-level accuracy in document processing. This approach allows organizations to tackle a multitude of document types while customizing solutions to meet specific business needs. By exploring the available documentation and GitHub repository, users can begin to implement these solutions tailored to their unique challenges.

This blend of technology not only streamlines workflows but also enhances the accuracy and reliability of data extracted from unstructured content, setting a new standard in document intelligence.
