Building Production RAG Systems

A Complete Guide to AWS Bedrock Knowledge Bases & Agents

Tags: AWS Bedrock · RAG · Vector Search · Terraform · Python · FAISS

Introduction

Core concepts and implementation overview

Retrieval Augmented Generation (RAG) has become the de facto standard for building AI systems that need to answer questions based on your own documents. But moving from a proof-of-concept to a production-ready system involves many decisions: chunking strategies, vector databases, embedding models, and infrastructure.

This tutorial walks you through building a complete RAG system using AWS Bedrock, featuring:

  • Hierarchical Chunking: Parent chunks (1500 tokens) + Child chunks (300 tokens)
  • FAISS Vector Search: High-performance similarity search with OpenSearch Serverless
  • Automatic RAG: Bedrock Agents that know when to retrieve context
  • Infrastructure as Code: Complete Terraform deployment

🎯 What You'll Learn

  • How RAG systems work under the hood
  • Why hierarchical chunking outperforms fixed-size chunking
  • Setting up OpenSearch Serverless with FAISS
  • Deploying Bedrock Knowledge Bases and Agents
  • Testing and optimizing your RAG pipeline

System Architecture

End-to-end pipeline design

Our RAG system follows a multi-stage pipeline that transforms documents into semantic vectors, stores them efficiently, and retrieves relevant context for generation.

┌─────────────┐
│  S3 Bucket  │ ← Document Storage (PDFs, TXT, etc.)
└──────┬──────┘
       │
       ↓
┌──────────────────────────┐
│ Bedrock Knowledge Base   │ ← Hierarchical Chunking
│ (Titan Embeddings)       │   Parent: 1500 tokens
└──────┬───────────────────┘   Child: 300 tokens
       │                        Overlap: 60 tokens
       ↓
┌──────────────────────────┐
│ OpenSearch Serverless    │ ← FAISS Vector Search
│ (1536-dim vectors)       │   L2 distance + HNSW
└──────┬───────────────────┘
       │
       ↓
┌──────────────────────────┐
│   Bedrock Agent          │ ← Automatic RAG
│   (Claude/Nova)          │   Context-aware responses
└──────────────────────────┘

Key Components

📄 Document Storage (S3)

Stores source documents with versioning enabled for audit trails

🧠 Knowledge Base

Manages chunking, embedding, and synchronization with vector DB

πŸ” Vector Database

OpenSearch Serverless with FAISS for sub-millisecond similarity search

🤖 Bedrock Agent

Orchestrates retrieval and generation with context-aware responses

Understanding RAG

How retrieval augmented generation works

The Problem RAG Solves

Large Language Models (LLMs) are trained on vast amounts of data, but they have two fundamental limitations:

  • Knowledge Cutoff: They don't know about events after their training date
  • No Private Data: They can't access your company's documents, databases, or internal knowledge

💡 RAG Solution

Instead of retraining the model (expensive and time-consuming), RAG retrieves relevant information from your documents and includes it in the prompt. The LLM then generates answers based on this retrieved context.

How RAG Works (Step by Step)

  1. Document Ingestion: Documents are split into chunks (we'll use hierarchical chunking for better context)
  2. Embedding Generation: Each chunk is converted to a 1536-dimensional vector using Amazon Titan embeddings
  3. Vector Storage: Vectors are stored in OpenSearch Serverless with FAISS indexing for fast retrieval
  4. Query Processing: The user query is embedded using the same model
  5. Similarity Search: FAISS finds the most similar chunks using L2 distance
  6. Context Augmentation: The retrieved chunks are added to the prompt
  7. Generation: The LLM generates an answer based on the retrieved context
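
To make these steps concrete, here's a minimal sketch of the same loop done by hand with boto3. The IDs and model are placeholders, and the retrieve call takes care of steps 4-5 (it embeds the query with the Knowledge Base's embedding model and runs the vector search); the Bedrock Agent we deploy later automates this entire loop.

import boto3

KB_ID = "YOUR_KB_ID"                                  # placeholder
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"   # any Bedrock text model you have access to

agent_runtime = boto3.client("bedrock-agent-runtime")
runtime = boto3.client("bedrock-runtime")

def answer(question: str) -> str:
    # Steps 4-5: embed the query and find the most similar chunks
    retrieval = agent_runtime.retrieve(
        knowledgeBaseId=KB_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
    )
    # Step 6: add the retrieved chunks to the prompt
    context = "\n\n".join(r["content"]["text"] for r in retrieval["retrievalResults"])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # Step 7: generate the answer from the augmented prompt
    reply = runtime.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return reply["output"]["message"]["content"][0]["text"]

print(answer("What is hierarchical chunking?"))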

Hierarchical Chunking: The Secret Sauce

Advanced chunking strategy for better context

Most RAG systems use fixed-size chunking (e.g., 512 tokens per chunk). This works, but has a critical flaw: you lose context.

The Problem with Fixed-Size Chunking

Example: Imagine a document about "AWS Bedrock Features"

  • Chunk 1: "...supports multiple foundation models including..."
  • Chunk 2: "...Claude, Nova, and Titan. Each model has..."

If you retrieve Chunk 2, you get the model names but miss the broader context about "AWS Bedrock Features".

Hierarchical Chunking Solution

Hierarchical chunking creates two levels of chunks:

👶 Child Chunks (300 tokens)

  • ✅ Precise semantic search
  • ✅ Find exact information
  • ✅ Fast retrieval

👨 Parent Chunks (1500 tokens)

  • ✅ Broad context
  • ✅ Complete information
  • ✅ Better LLM understanding

🎯 How It Works

  1. System searches in child chunks (300 tokens) for precision
  2. When a match is found, it returns the parent chunk (1500 tokens)
  3. LLM gets both precise match AND surrounding context
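
Here's a toy sketch of that lookup pattern (illustrative only; Bedrock handles the child-to-parent mapping internally during retrieval):

# Illustrative data: small child chunks point back to their larger parent chunk
child_to_parent = {
    "child-001": "parent-01",
    "child-002": "parent-01",
    "child-003": "parent-02",
}

parent_chunks = {
    "parent-01": "AWS Bedrock Features: ...supports multiple foundation models "
                 "including Claude, Nova, and Titan. Each model has...",
    "parent-02": "Pricing and quotas: ...",
}

def expand_to_parents(matched_children):
    """Return the parent chunks for the child chunks that matched the query."""
    parents = {child_to_parent[c] for c in matched_children}
    return [parent_chunks[p] for p in parents]

# A 300-token child matched the query, but the LLM receives the 1500-token parent
print(expand_to_parents(["child-002"]))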

Configuration

vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "HIERARCHICAL"
    hierarchical_chunking_configuration {
      level_configuration {
        max_tokens = 1500  # Parent chunk
      }
      level_configuration {
        max_tokens = 300   # Child chunk
      }
      overlap_tokens = 60  # Continuity between chunks
    }
  }
}

The 60-token overlap ensures that information spanning chunk boundaries isn't lost.

Implementation Guide

Step-by-step deployment instructions

Prerequisites

  • AWS Account with Bedrock access
  • Terraform >= 1.0
  • Python 3.8+
  • AWS CLI configured

Step 1: Clone the Repository

git clone https://github.com/raulprocha/bedrock-rag-system.git
cd bedrock-rag-system
pip install -r requirements.txt

Step 2: Configure AWS Resources

Copy the example configuration and update with your AWS details:

cp config.py.example scripts/config.py
# Edit scripts/config.py with your AWS account ID and region

Step 3: Deploy Infrastructure

Use Terraform to deploy all AWS resources:

cd terraform
terraform init
terraform plan
terraform apply -auto-approve

This creates the S3 bucket, OpenSearch Serverless collection, Knowledge Base, and Bedrock Agent.

Step 4: Create FAISS Index

Important: Bedrock requires the OpenSearch index to use the FAISS engine (not the default nmslib).

python -m scripts.opensearch_manager create

⚠️ Common Issue

If you get an "invalid engine type" error, the index was created with nmslib. Fix it:

python -m scripts.opensearch_manager recreate
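
For reference, here's roughly what the index definition needs to look like. This is a sketch: the field names below are the ones commonly used in Bedrock examples and must match whatever your Terraform storage configuration declares for the Knowledge Base.

# Sketch of the k-NN index body the manager script is expected to create
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "bedrock-knowledge-base-default-vector": {
                "type": "knn_vector",
                "dimension": 1536,          # Titan embedding size
                "method": {
                    "name": "hnsw",         # graph-based approximate nearest neighbor
                    "engine": "faiss",      # must be faiss for Bedrock, not nmslib
                    "space_type": "l2",     # L2 (Euclidean) distance
                },
            },
            "AMAZON_BEDROCK_TEXT_CHUNK": {"type": "text"},
            "AMAZON_BEDROCK_METADATA": {"type": "text"},
        }
    },
}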

Step 5: Upload Documents

BUCKET=$(terraform output -raw s3_bucket_name)
aws s3 cp your-document.pdf s3://$BUCKET/

Step 6: Sync Knowledge Base

Start the ingestion job to process and index your documents:

KB_ID=$(terraform output -raw knowledge_base_id)
DS_ID=$(terraform output -raw data_source_id)

aws bedrock-agent start-ingestion-job \
  --knowledge-base-id $KB_ID \
  --data-source-id $DS_ID \
  --region us-east-1

⏱️ Processing time: ~4-5 minutes for a 50MB PDF
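
If you'd rather script this step, the same job can be started and polled from Python. A sketch, assuming you paste in the IDs from the Terraform outputs:

import time
import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

# Same as the CLI call above
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="YOUR_KB_ID",
    dataSourceId="YOUR_DS_ID",
)["ingestionJob"]

# Wait while documents are chunked, embedded, and indexed
while job["status"] not in ("COMPLETE", "FAILED"):
    time.sleep(30)
    job = bedrock_agent.get_ingestion_job(
        knowledgeBaseId="YOUR_KB_ID",
        dataSourceId="YOUR_DS_ID",
        ingestionJobId=job["ingestionJobId"],
    )["ingestionJob"]

print(job["status"], job.get("statistics"))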

Step 7: Test Your RAG System

# Test agent with RAG
python tests/test_agent.py

# Test direct KB retrieval
python tests/test_kb.py

# Interactive CLI
python cli.py "What are the main features of Amazon Bedrock?"

Code Deep Dive

Key implementation details

BedrockClient: High-Level API

The BedrockClient class provides a clean interface for interacting with Bedrock services:

from scripts.bedrock_client import BedrockClient

# Initialize client
client = BedrockClient()

# Query agent (automatic RAG)
response = client.invoke_agent("What is hierarchical chunking?")
print(response)

# Direct KB retrieval
results = client.retrieve_from_kb("FAISS vector search", max_results=5)
for result in results:
    print(f"Score: {result['score']:.4f}")
    print(f"Text: {result['content']['text'][:200]}...")

Understanding the Code

1. invoke_agent()

Sends query to Bedrock Agent, which automatically:

  • Determines if KB retrieval is needed
  • Retrieves relevant chunks
  • Generates context-aware response

2. retrieve_from_kb()

Bypasses the agent for direct vector search:

  • Embeds query using Titan
  • Performs FAISS similarity search
  • Returns top-k results with scores
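
Both methods are essentially thin wrappers around the bedrock-agent-runtime API. Here's a simplified sketch of what they might look like internally (the repo's actual class may differ in details):

import uuid
import boto3

runtime = boto3.client("bedrock-agent-runtime")

def invoke_agent(query, agent_id="AGENT_ID", alias_id="ALIAS_ID"):
    # The agent decides whether to hit the Knowledge Base, then streams the answer
    stream = runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=alias_id,
        sessionId=str(uuid.uuid4()),
        inputText=query,
    )
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in stream["completion"]
        if "chunk" in event
    )

def retrieve_from_kb(query, kb_id="KB_ID", max_results=5):
    # Direct vector search: embed the query, run FAISS k-NN, return scored chunks
    response = runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": max_results}
        },
    )
    return response["retrievalResults"]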

OpenSearch Manager

Manages OpenSearch Serverless index lifecycle:

from scripts.opensearch_manager import OpenSearchManager

manager = OpenSearchManager()

# Create index with FAISS
manager.create_index(
    index_name="bedrock-knowledge-base-index",
    dimension=1536,  # Titan embeddings
    engine="faiss"
)

# Check configuration
info = manager.get_index_info()
print(f"Engine: {info['mappings']['properties']['vector']['method']['engine']}")

# Recreate if needed
manager.recreate_index()

Performance & Optimization

Tuning for production workloads

Benchmarks

Based on testing with a 49.4 MB PDF (the Amazon Bedrock User Guide):

  • ⏱️ Indexing Time: ~4-5 min (one-time processing per document)
  • 🚀 Query Latency: ~2-3 sec (including retrieval + generation)
  • 🎯 Retrieval Accuracy: High (hierarchical chunking advantage)
  • 📊 Context Window: 1500 tokens (per retrieved chunk)

Optimization Tips

1. Adjust Chunk Sizes

Experiment with parent/child ratios. Larger parents give the LLM more context, but also mean bigger prompts and slower responses.

2. Tune Overlap

60 tokens works well, but increase for documents with dense cross-references.

3. Limit Retrieved Chunks

Start with 3-5 chunks. More isn't always better: too many retrieved chunks can confuse the LLM (see the sketch after these tips).

4. Monitor Costs

OpenSearch Serverless is the main cost (~$175/month). Delete when not in use.
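
For tip 3, a quick way to pick a retrieval depth is to sweep max_results and watch where the relevance scores fall off. A sketch using the BedrockClient shown in the Code Deep Dive section:

from scripts.bedrock_client import BedrockClient

client = BedrockClient()
query = "What are the main features of Amazon Bedrock?"

# Compare how relevance scores drop off as you retrieve more chunks
for k in (3, 5, 10):
    results = client.retrieve_from_kb(query, max_results=k)
    print(f"max_results={k}:", [round(r["score"], 3) for r in results])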

Cost Analysis

Understanding AWS infrastructure costs

Service               | Cost               | Notes
OpenSearch Serverless | ~$175/month        | 2 OCUs (search + indexing)
Bedrock Agent         | ~$0.002/1K tokens  | Pay-per-use
S3 Storage            | ~$0.023/GB/month   | Minimal for documents
Titan Embeddings      | ~$0.0001/1K tokens | One-time per document
Total                 | ~$180-200/month    | For continuous operation

💰 Cost Optimization

  • Delete OpenSearch collection when not in use (main cost driver)
  • Use S3 Intelligent-Tiering for documents
  • Batch document processing to minimize embedding costs
  • Monitor with AWS Cost Explorer

Troubleshooting

Common issues and solutions

❌ Error: "Invalid engine type"

Cause: OpenSearch index was created with nmslib instead of FAISS

Solution:

python -m scripts.opensearch_manager recreate

⚠️ Agent doesn't return document information

Checklist:

  • Verify ingestion job completed: aws bedrock-agent list-ingestion-jobs
  • Check documents in S3: aws s3 ls s3://your-bucket/
  • Test direct retrieval: python tests/test_kb.py
  • Verify agent-KB association in AWS Console

ℹ️ Slow query responses

Possible causes:

  • Too many chunks being retrieved (reduce max_results)
  • Large parent chunks (consider reducing from 1500 tokens)
  • Cold start (first query after idle period)

🔒 Permission denied errors

Required IAM permissions:

{
  "Effect": "Allow",
  "Action": [
    "bedrock:InvokeModel",
    "bedrock:Retrieve",
    "aoss:APIAccessAll",
    "s3:GetObject",
    "s3:ListBucket"
  ],
  "Resource": "*"
}

Advanced Topics

Scaling and advanced features

Multimodal RAG

Bedrock Knowledge Bases support images, audio, and video through two approaches:

Nova Multimodal Embeddings

  • ✅ Direct embedding of multimedia
  • ✅ Visual similarity search
  • ✅ Image-based queries
  • ⚙️ Configure chunk duration (1-30s)

Bedrock Data Automation

  • ✅ Converts to text first
  • ✅ Audio transcription
  • ✅ Video scene descriptions
  • ✅ Image OCR

Metadata Filtering

Add metadata to documents for filtered retrieval:

import boto3

# Retrieval with a metadata filter goes through the bedrock-agent-runtime client
client = boto3.client("bedrock-agent-runtime")

response = client.retrieve(
    knowledgeBaseId='KB_ID',
    retrievalQuery={'text': 'query'},
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'filter': {
                'equals': {
                    'key': 'category',
                    'value': 'technical'
                }
            }
        }
    }
)
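
For the filter to match anything, the documents need metadata attributes at ingestion time. With an S3 data source this is normally done with a sidecar JSON file uploaded next to each document; here's a sketch (the attribute names are just examples):

import json
import boto3

s3 = boto3.client("s3")

# Sidecar file: same key as the document, plus ".metadata.json"
metadata = {"metadataAttributes": {"category": "technical", "year": 2024}}

s3.put_object(
    Bucket="your-bucket",
    Key="your-document.pdf.metadata.json",
    Body=json.dumps(metadata).encode("utf-8"),
)
# Re-run the ingestion job afterwards so the attributes are indexed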

Semantic Chunking

Alternative to hierarchical chunking that uses NLP to split at semantic boundaries:

chunking_configuration {
  chunking_strategy = "SEMANTIC"
  semantic_chunking_configuration {
    max_tokens = 300
    buffer_size = 1
    breakpoint_percentile_threshold = 95
  }
}

Best Practices

Production-ready recommendations

📚 Document Preparation

  • Clean PDFs work best (avoid scanned images without OCR)
  • Structure documents with clear headings
  • Remove unnecessary formatting
  • Keep file sizes reasonable (<100MB per file)

πŸ” Retrieval Optimization

  • Start with 3-5 retrieved chunks, adjust based on results
  • Monitor retrieval scores (below 0.3 usually indicates poor match)
  • Use metadata filtering for large document sets
  • Test with diverse queries to validate coverage

🔒 Security

  • Use IAM roles with least privilege
  • Enable S3 bucket versioning for audit trails
  • Encrypt data at rest (OpenSearch Serverless default)
  • Monitor access with CloudWatch
  • Never commit AWS credentials to Git

💰 Cost Management

  • Delete OpenSearch collection when not actively developing
  • Use lifecycle policies for S3 documents
  • Batch document uploads to minimize ingestion jobs
  • Set up AWS Budget alerts

🧪 Testing

  • Create a test suite with known questions and expected answers
  • Test edge cases (ambiguous queries, out-of-scope questions)
  • Monitor response quality over time
  • A/B test different chunking strategies

Conclusion

Key takeaways and next steps

You've now built a production-ready RAG system using AWS Bedrock with hierarchical chunking, FAISS vector search, and automatic retrieval through Bedrock Agents.

This architecture provides:

  • Better context through hierarchical chunking
  • Fast retrieval with FAISS indexing
  • Scalability via serverless infrastructure
  • Maintainability with Infrastructure as Code

🚀 Next Steps

  • ✅ Experiment with different chunking strategies
  • ✅ Add metadata filtering for your use case
  • ✅ Implement monitoring and logging
  • ✅ Test with multimodal content
  • ✅ Optimize costs for production

Questions or feedback?
