Building Production RAG Systems
A Complete Guide to AWS Bedrock Knowledge Bases & Agents
Introduction
Core concepts and implementation overview
Retrieval Augmented Generation (RAG) has become the de facto standard for building AI systems that need to answer questions based on your own documents. But moving from a proof-of-concept to a production-ready system involves many decisions: chunking strategies, vector databases, embedding models, and infrastructure.
This tutorial walks you through building a complete RAG system using AWS Bedrock, featuring:
- Hierarchical Chunking: Parent chunks (1500 tokens) + Child chunks (300 tokens)
- FAISS Vector Search: High-performance similarity search with OpenSearch Serverless
- Automatic RAG: Bedrock Agents that know when to retrieve context
- Infrastructure as Code: Complete Terraform deployment
What You'll Learn
- How RAG systems work under the hood
- Why hierarchical chunking outperforms fixed-size chunking
- Setting up OpenSearch Serverless with FAISS
- Deploying Bedrock Knowledge Bases and Agents
- Testing and optimizing your RAG pipeline
System Architecture
End-to-end pipeline design
Our RAG system follows a multi-stage pipeline that transforms documents into semantic vectors, stores them efficiently, and retrieves relevant context for generation.
┌──────────────────────────┐
│        S3 Bucket         │ ← Document Storage (PDFs, TXT, etc.)
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────────┐
│  Bedrock Knowledge Base  │ ← Hierarchical Chunking
│    (Titan Embeddings)    │     Parent: 1500 tokens
└────────────┬─────────────┘     Child: 300 tokens
             │                   Overlap: 60 tokens
             ▼
┌──────────────────────────┐
│  OpenSearch Serverless   │ ← FAISS Vector Search
│   (1536-dim vectors)     │     L2 distance + HNSW
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────────┐
│      Bedrock Agent       │ ← Automatic RAG
│      (Claude/Nova)       │     Context-aware responses
└──────────────────────────┘
Key Components
- Document Storage (S3): Stores source documents with versioning enabled for audit trails.
- Knowledge Base: Manages chunking, embedding, and synchronization with the vector database.
- Vector Database: OpenSearch Serverless with FAISS for low-latency similarity search.
- Bedrock Agent: Orchestrates retrieval and generation with context-aware responses.
Understanding RAG
How retrieval augmented generation works
The Problem RAG Solves
Large Language Models (LLMs) are trained on vast amounts of data, but they have two fundamental limitations:
- Knowledge Cutoff: They don't know about events after their training date
- No Private Data: They can't access your company's documents, databases, or internal knowledge
The RAG Solution
Instead of retraining the model (expensive and time-consuming), RAG retrieves relevant information from your documents and includes it in the prompt. The LLM then generates answers based on this retrieved context.
How RAG Works (Step by Step)
1. Document Ingestion: Documents are split into chunks (we'll use hierarchical chunking for better context).
2. Embedding Generation: Each chunk is converted to a 1536-dimensional vector using Amazon Titan embeddings.
3. Vector Storage: Vectors are stored in OpenSearch Serverless with FAISS indexing for fast retrieval.
4. Query Processing: The user query is embedded using the same model.
5. Similarity Search: FAISS finds the most similar chunks using L2 distance.
6. Context Augmentation: Retrieved chunks are added to the prompt.
7. Generation: The LLM generates an answer based on the retrieved context (steps 4-7 are sketched in code below).
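In practice, steps 4-7 are what Bedrock's RetrieveAndGenerate API performs in a single call. Here is a minimal sketch using boto3; the knowledge base ID and model ARN are placeholders you would replace with your own values.

import boto3

# bedrock-agent-runtime exposes RetrieveAndGenerate: it embeds the query,
# runs the vector search, and generates the grounded answer in one call.
runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.retrieve_and_generate(
    input={"text": "What is hierarchical chunking?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])  # generated answer
for citation in response.get("citations", []):
    for ref in citation["retrievedReferences"]:  # chunks used as context
        print(ref["content"]["text"][:120], "...")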
Hierarchical Chunking: The Secret Sauce
Advanced chunking strategy for better context
Most RAG systems use fixed-size chunking (e.g., 512 tokens per chunk). This works, but has a critical flaw: you lose context.
The Problem with Fixed-Size Chunking
Example: Imagine a document about "AWS Bedrock Features"
- Chunk 1: "...supports multiple foundation models including..."
- Chunk 2: "...Claude, Nova, and Titan. Each model has..."
If you retrieve Chunk 2, you get the model names but miss the broader context about "AWS Bedrock Features".
Hierarchical Chunking Solution
Hierarchical chunking creates two levels of chunks:
Child Chunks (300 tokens)
- Precise semantic search
- Find exact information
- Fast retrieval
Parent Chunks (1500 tokens)
- Broad context
- Complete information
- Better LLM understanding
How It Works
- The system searches the child chunks (300 tokens) for precision
- When a match is found, it returns the corresponding parent chunk (1500 tokens)
- The LLM gets both the precise match AND the surrounding context
Configuration
vector_ingestion_configuration {
  chunking_configuration {
    chunking_strategy = "HIERARCHICAL"

    hierarchical_chunking_configuration {
      level_configuration {
        max_tokens = 1500   # Parent chunk
      }
      level_configuration {
        max_tokens = 300    # Child chunk
      }
      overlap_tokens = 60   # Continuity between chunks
    }
  }
}
The 60-token overlap ensures that information spanning chunk boundaries isn't lost.
Implementation Guide
Step-by-step deployment instructions
Prerequisites
- AWS Account with Bedrock access
- Terraform >= 1.0
- Python 3.8+
- AWS CLI configured
Step 1: Clone the Repository
git clone https://github.com/raulprocha/bedrock-rag-system.git
cd bedrock-rag-system
pip install -r requirements.txt
Step 2: Configure AWS Resources
Copy the example configuration and update with your AWS details:
cp config.py.example scripts/config.py
# Edit scripts/config.py with your AWS account ID and region
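What goes into scripts/config.py depends on config.py.example in the repository; as a rough, hypothetical sketch of the kind of values it holds (variable names here are illustrative, not the repository's actual ones):

# scripts/config.py -- illustrative sketch only; follow config.py.example from the repo
AWS_ACCOUNT_ID = "123456789012"  # your AWS account ID
AWS_REGION = "us-east-1"         # region where Bedrock and the collection live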
Step 3: Deploy Infrastructure
Use Terraform to deploy all AWS resources:
cd terraform
terraform init
terraform plan
terraform apply -auto-approve
This creates: S3 bucket, OpenSearch Serverless collection, Knowledge Base, and Bedrock Agent
Step 4: Create FAISS Index
Important: Bedrock requires the OpenSearch index to use FAISS engine (not the default nmslib).
python -m scripts.opensearch_manager create
Common Issue
If you get an "invalid engine type" error, the index was created with nmslib. Fix it:
python -m scripts.opensearch_manager recreate
Step 5: Upload Documents
BUCKET=$(terraform output -raw s3_bucket_name)
aws s3 cp your-document.pdf s3://$BUCKET/
Step 6: Sync Knowledge Base
Start the ingestion job to process and index your documents:
KB_ID=$(terraform output -raw knowledge_base_id)
DS_ID=$(terraform output -raw data_source_id)
aws bedrock-agent start-ingestion-job \
--knowledge-base-id $KB_ID \
--data-source-id $DS_ID \
--region us-east-1
Processing time: ~4-5 minutes for a 50 MB PDF
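If you prefer to poll for completion rather than wait a fixed time, the same bedrock-agent API reports the job status. A small sketch, assuming KB_ID and DS_ID hold the Terraform outputs from above:

import time
import boto3

agent = boto3.client("bedrock-agent", region_name="us-east-1")

# start the ingestion job and remember its ID
job = agent.start_ingestion_job(knowledgeBaseId=KB_ID, dataSourceId=DS_ID)
job_id = job["ingestionJob"]["ingestionJobId"]

# poll until the job reaches a terminal state
while True:
    status = agent.get_ingestion_job(
        knowledgeBaseId=KB_ID, dataSourceId=DS_ID, ingestionJobId=job_id
    )["ingestionJob"]["status"]
    print("Ingestion status:", status)
    if status in ("COMPLETE", "FAILED"):
        break
    time.sleep(30)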
Step 7: Test Your RAG System
# Test agent with RAG
python tests/test_agent.py
# Test direct KB retrieval
python tests/test_kb.py
# Interactive CLI
python cli.py "What are the main features of Amazon Bedrock?"
Code Deep Dive
Key implementation details
BedrockClient: High-Level API
The BedrockClient class provides a clean interface for interacting with Bedrock services:
from scripts.bedrock_client import BedrockClient

# Initialize client
client = BedrockClient()

# Query agent (automatic RAG)
response = client.invoke_agent("What is hierarchical chunking?")
print(response)

# Direct KB retrieval
results = client.retrieve_from_kb("FAISS vector search", max_results=5)
for result in results:
    print(f"Score: {result['score']:.4f}")
    print(f"Text: {result['content']['text'][:200]}...")
Understanding the Code
1. invoke_agent()
Sends query to Bedrock Agent, which automatically:
- Determines if KB retrieval is needed
- Retrieves relevant chunks
- Generates context-aware response
2. retrieve_from_kb()
Bypasses the agent for direct vector search:
- Embeds query using Titan
- Performs FAISS similarity search
- Returns top-k results with scores
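Under the hood, these helpers most likely wrap the bedrock-agent-runtime APIs. A minimal sketch of both paths with boto3 (the agent, alias, and knowledge base IDs are placeholders):

import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Direct vector search against the Knowledge Base (the retrieve_from_kb path)
resp = runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",
    retrievalQuery={"text": "FAISS vector search"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)
for item in resp["retrievalResults"]:
    print(item["score"], item["content"]["text"][:100])

# Agent invocation (the invoke_agent path); the response is an event stream
stream = runtime.invoke_agent(
    agentId="YOUR_AGENT_ID",
    agentAliasId="YOUR_ALIAS_ID",
    sessionId="demo-session",
    inputText="What is hierarchical chunking?",
)
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in stream["completion"]
    if "chunk" in event
)
print(answer)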
OpenSearch Manager
Manages OpenSearch Serverless index lifecycle:
from scripts.opensearch_manager import OpenSearchManager

manager = OpenSearchManager()

# Create index with FAISS
manager.create_index(
    index_name="bedrock-knowledge-base-index",
    dimension=1536,  # Titan embeddings
    engine="faiss"
)

# Check configuration
info = manager.get_index_info()
print(f"Engine: {info['mappings']['properties']['vector']['method']['engine']}")

# Recreate if needed
manager.recreate_index()
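For reference, a FAISS/HNSW k-NN index that satisfies Bedrock's requirements looks roughly like the mapping below. The field names (vector, text, metadata) are assumptions for illustration; they must match the field mapping configured on the Knowledge Base.

# Approximate OpenSearch index body for a FAISS/HNSW k-NN index
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": 1536,      # Titan embedding size
                "method": {
                    "name": "hnsw",     # graph-based ANN search
                    "engine": "faiss",  # Bedrock requires faiss, not nmslib
                    "space_type": "l2", # L2 (Euclidean) distance
                },
            },
            "text": {"type": "text"},      # chunk text returned to the LLM
            "metadata": {"type": "text"},  # Bedrock-managed chunk metadata
        }
    },
}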
Performance & Optimization
Tuning for production workloads
Benchmarks
Based on testing with a 49.4 MB PDF (the Amazon Bedrock User Guide):

| Metric | Result | Notes |
|---|---|---|
| Indexing Time | ~4-5 min | One-time processing per document |
| Query Latency | ~2-3 sec | Including retrieval + generation |
| Retrieval Accuracy | High | Hierarchical chunking advantage |
| Context Window | 1500 tokens | Per retrieved chunk |
Optimization Tips
1. Adjust Chunk Sizes: Experiment with parent/child ratios. Larger parents mean more context, but slower retrieval.
2. Tune Overlap: 60 tokens works well, but increase it for documents with dense cross-references.
3. Limit Retrieved Chunks: Start with 3-5 chunks. More isn't always better; too many chunks can confuse the LLM.
4. Monitor Costs: OpenSearch Serverless is the main cost (~$175/month). Delete the collection when not in use.
Cost Analysis
Understanding AWS infrastructure costs
| Service | Cost | Notes |
|---|---|---|
| OpenSearch Serverless | ~$175/month | 2 OCUs (search + indexing) |
| Bedrock Agent | ~$0.002/1K tokens | Pay-per-use |
| S3 Storage | ~$0.023/GB/month | Minimal for documents |
| Titan Embeddings | ~$0.0001/1K tokens | One-time per document |
| Total | ~$180-200/month | For continuous operation |
Cost Optimization
- Delete OpenSearch collection when not in use (main cost driver)
- Use S3 Intelligent-Tiering for documents
- Batch document processing to minimize embedding costs
- Monitor with AWS Cost Explorer
Troubleshooting
Common issues and solutions
Error: "Invalid engine type"
Cause: OpenSearch index was created with nmslib instead of FAISS
Solution:
python -m scripts.opensearch_manager recreate
Agent doesn't return document information
Checklist:
- Verify the ingestion job completed: aws bedrock-agent list-ingestion-jobs
- Check documents in S3: aws s3 ls s3://your-bucket/
- Test direct retrieval: python tests/test_kb.py
- Verify the agent-KB association in the AWS Console
Slow query responses
Possible causes:
- Too many chunks being retrieved (reduce max_results)
- Large parent chunks (consider reducing from 1500 tokens)
- Cold start (first query after idle period)
Permission denied errors
Required IAM permissions:
{
  "Effect": "Allow",
  "Action": [
    "bedrock:InvokeModel",
    "bedrock:Retrieve",
    "aoss:APIAccessAll",
    "s3:GetObject",
    "s3:ListBucket"
  ],
  "Resource": "*"
}
Advanced Topics
Scaling and advanced features
Multimodal RAG
Bedrock Knowledge Bases support images, audio, and video through two approaches:
Nova Multimodal Embeddings
- Direct embedding of multimedia
- Visual similarity search
- Image-based queries
- Configurable chunk duration (1-30s)
Bedrock Data Automation
- Converts content to text first
- Audio transcription
- Video scene descriptions
- Image OCR
Metadata Filtering
Add metadata to documents for filtered retrieval:
response = client.retrieve(
    knowledgeBaseId='KB_ID',
    retrievalQuery={'text': 'query'},
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'filter': {
                'equals': {
                    'key': 'category',
                    'value': 'technical'
                }
            }
        }
    }
)
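For the filter to match anything, the document needs metadata at ingestion time. Bedrock reads it from a sidecar JSON file uploaded next to the source document and named <document-name>.metadata.json; a minimal sketch that uploads one (bucket and key are placeholders):

import json
import boto3

# Sidecar metadata file: lives next to the document in S3 and is named
# "<document-name>.metadata.json" so the Knowledge Base ingests it.
metadata = {"metadataAttributes": {"category": "technical"}}

s3 = boto3.client("s3")
s3.put_object(
    Bucket="your-bucket",
    Key="your-document.pdf.metadata.json",
    Body=json.dumps(metadata),
)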
Semantic Chunking
Alternative to hierarchical chunking that uses NLP to split at semantic boundaries:
chunking_configuration {
  chunking_strategy = "SEMANTIC"

  semantic_chunking_configuration {
    max_tokens                      = 300
    buffer_size                     = 1
    breakpoint_percentile_threshold = 95
  }
}
Best Practices
Production-ready recommendations
Document Preparation
- Clean PDFs work best (avoid scanned images without OCR)
- Structure documents with clear headings
- Remove unnecessary formatting
- Keep file sizes reasonable (<100MB per file)
Retrieval Optimization
- Start with 3-5 retrieved chunks, adjust based on results
- Monitor retrieval scores; below 0.3 usually indicates a poor match (see the snippet after this list)
- Use metadata filtering for large document sets
- Test with diverse queries to validate coverage
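One way to act on those scores is to drop weak matches before they reach the prompt. A minimal sketch against the retrieve() response shape shown earlier; the 0.3 threshold is only a starting point to tune per corpus:

MIN_SCORE = 0.3  # scores below this usually indicate a poor match

def filter_chunks(retrieval_results, min_score=MIN_SCORE):
    """Keep only retrieved chunks whose similarity score clears the threshold."""
    return [r for r in retrieval_results if r.get("score", 0.0) >= min_score]

# usage with a bedrock-agent-runtime retrieve() response:
# strong = filter_chunks(resp["retrievalResults"])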
Security
- Use IAM roles with least privilege
- Enable S3 bucket versioning for audit trails
- Encrypt data at rest (OpenSearch Serverless default)
- Monitor access with CloudWatch
- Never commit AWS credentials to Git
Cost Management
- Delete OpenSearch collection when not actively developing
- Use lifecycle policies for S3 documents
- Batch document uploads to minimize ingestion jobs
- Set up AWS Budget alerts
Testing
- Create a test suite with known questions and expected answers (see the sketch after this list)
- Test edge cases (ambiguous queries, out-of-scope questions)
- Monitor response quality over time
- A/B test different chunking strategies
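A test suite doesn't need to be elaborate to be useful. Here is a minimal sketch that uses the repository's BedrockClient wrapper with a few known questions; the test cases themselves are illustrative:

from scripts.bedrock_client import BedrockClient

# Known questions paired with a substring the answer is expected to contain.
TEST_CASES = [
    ("What chunking strategy does the system use?", "hierarchical"),
    ("Which vector engine does the index use?", "faiss"),
]

def run_eval():
    client = BedrockClient()
    passed = 0
    for question, expected in TEST_CASES:
        answer = client.invoke_agent(question)
        ok = expected.lower() in answer.lower()
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {question}")
    print(f"{passed}/{len(TEST_CASES)} checks passed")

if __name__ == "__main__":
    run_eval()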
Conclusion
Key takeaways and next steps
You've now built a production-ready RAG system using AWS Bedrock with hierarchical chunking, FAISS vector search, and automatic retrieval through Bedrock Agents.
This architecture provides:
- Better context through hierarchical chunking
- Fast retrieval with FAISS indexing
- Scalability via serverless infrastructure
- Maintainability with Infrastructure as Code
Next Steps
- Experiment with different chunking strategies
- Add metadata filtering for your use case
- Implement monitoring and logging
- Test with multimodal content
- Optimize costs for production