RAG with Bedrock
Deep dive into RAG concepts using AWS Bedrock Knowledge Bases
Chunking Strategies Deep Dive
Understanding different approaches and when to use each
Amazon Bedrock Knowledge Bases supports five main chunking strategies, each optimized for different use cases:
1. Default Chunking
How it works: Splits content into chunks of around 300 tokens, preserving complete sentences within each chunk.
Best for: General-purpose RAG applications, balanced performance
Trade-off: Good balance between context and precision, but may lose some semantic coherence
2. Fixed-Size Chunking
How it works: You configure an exact chunk size (in tokens) and an overlap percentage between consecutive chunks.
Best for: When you need predictable chunk sizes, technical documentation
Trade-off: Maximum control but may split sentences mid-way
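In the bedrock-agent CreateDataSource API this strategy is one small block inside vectorIngestionConfiguration; a minimal sketch (values are illustrative):

```python
# Fixed-size chunking block for a Knowledge Base data source
# (vectorIngestionConfiguration.chunkingConfiguration).
fixed_size_chunking = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 300,         # exact chunk size in tokens
        "overlapPercentage": 20,  # overlap between consecutive chunks
    },
}
```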
3. Hierarchical Chunking
How it works: Creates parent chunks (1500 tokens) and child chunks (300 tokens). Searches in child chunks but returns parent chunks for context.
Best for: Complex documents, when context is critical (our implementation uses this!)
Trade-off: Best accuracy and context, slightly more complex setup
💡 Why we chose this:
- Precision: Child chunks find exact matches
- Context: Parent chunks provide full picture
- Best of both worlds: Accurate retrieval + comprehensive answers
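For reference, a hedged boto3 sketch of creating an S3 data source with hierarchical chunking; the knowledge base ID and bucket ARN are placeholders, and the 1500/300 token sizes mirror the parent/child sizes described above:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

bedrock_agent.create_data_source(
    knowledgeBaseId="KB123EXAMPLE",  # placeholder
    name="docs-hierarchical",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-docs-bucket"},  # placeholder
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # Parent chunks carry context; child chunks give precise matches.
                "levelConfigurations": [
                    {"maxTokens": 1500},  # parent
                    {"maxTokens": 300},   # child
                ],
                "overlapTokens": 60,
            },
        }
    },
)
```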
4. Semantic Chunking
How it works: Uses NLP to split at semantic boundaries (topic changes) rather than fixed sizes. Configurable: max tokens, buffer size, breakpoint threshold.
Best for: Long-form content, articles, books where topics flow naturally
Trade-off: Most intelligent splitting but variable chunk sizes
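The semantic strategy exposes the three knobs just mentioned; a sketch of the corresponding configuration block (values are illustrative):

```python
# Semantic chunking block for vectorIngestionConfiguration.
semantic_chunking = {
    "chunkingStrategy": "SEMANTIC",
    "semanticChunkingConfiguration": {
        "maxTokens": 300,                     # upper bound per chunk
        "bufferSize": 1,                      # sentences grouped when comparing embeddings
        "breakpointPercentileThreshold": 95,  # dissimilarity percentile that triggers a split
    },
}
```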
5. No Chunking
How it works: Treats each document as a single chunk.
Best for: Short documents, when full context is always needed
Trade-off: Simple but limited to small documents (token limits apply)
Multimodal RAG
Beyond text: images, audio, and video in RAG systems
Bedrock Knowledge Bases supports multimodal content through two distinct approaches:
Nova Multimodal Embeddings
Direct embedding of multimedia files without text conversion
- ✅ Visual similarity search
- ✅ Image-based queries
- ✅ Preserves visual information
- ⚙️ Configure chunk duration (1-30s for audio/video)
Best for:
Product catalogs, image discovery, visual search
Bedrock Data Automation
Converts multimedia to text first, then embeds
- ✅ Audio transcription (speech-to-text)
- ✅ Video scene descriptions
- ✅ Image OCR + visual content extraction
- ✅ Works with existing text-based RAG
Best for:
Meeting recordings, training videos, customer support calls
💡 Key Decision
Choose Nova when visual features matter (product images, diagrams). Choose BDA when speech/text content matters (videos, audio recordings).
Production Best Practices
Lessons learned from real-world implementations
1. Retrieval Optimization
- Start with 3-5 chunks: More isn't always better; too many can confuse the LLM
- Monitor retrieval scores: A score below 0.3 usually indicates a poor match
- Use metadata filtering: Essential for large document sets
- Test with diverse queries: Validate coverage across topics
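These guidelines map directly onto the Retrieve API; a minimal boto3 sketch (the knowledge base ID, query, and metadata filter are placeholders):

```python
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve(
    knowledgeBaseId="KB123EXAMPLE",  # placeholder
    retrievalQuery={"text": "How do I rotate access keys?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,  # start with 3-5 chunks
            # Metadata filtering: restrict search to a subset of documents.
            "filter": {"equals": {"key": "department", "value": "security"}},
        }
    },
)

for result in response["retrievalResults"]:
    score = result.get("score", 0.0)
    if score < 0.3:  # heuristic threshold from above
        print(f"Weak match ({score:.2f}), review this query/chunk pair")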
2. Document Preparation
- Clean PDFs work best: Avoid scanned images without OCR
- Structure with clear headings: Helps semantic chunking
- Remove unnecessary formatting: Focus on content
- Keep files reasonable: <100MB per file for best performance
3. Cost Management
- OpenSearch Serverless is the main cost: ~$175/month for 2 OCUs
- Delete when not in use: Save costs during development
- Batch document processing: Minimize ingestion jobs
- Use S3 Intelligent-Tiering: Optimize storage costs
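For the Intelligent-Tiering point, a small boto3 sketch of a lifecycle rule that transitions the document bucket's objects (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# Move all objects in the documents bucket to Intelligent-Tiering.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-docs-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "docs-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```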
4. Testing Strategy
- Create test suite: Known questions + expected answers
- Test edge cases: Ambiguous queries, out-of-scope questions
- Monitor quality over time: Track response accuracy
- A/B test chunking strategies: Find what works for your data
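A test suite can be as simple as known questions paired with phrases the answer must contain, run through the RetrieveAndGenerate API; a sketch with illustrative cases and placeholder IDs:

```python
import boto3

client = boto3.client("bedrock-agent-runtime")

# Hypothetical test cases: question plus phrases the answer must contain.
TEST_CASES = [
    ("What is our refund window?", ["30 days"]),
    ("Who approves travel expenses?", ["manager"]),
]

for question, expected_phrases in TEST_CASES:
    response = client.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "KB123EXAMPLE",  # placeholder
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                            "anthropic.claude-3-haiku-20240307-v1:0",
            },
        },
    )
    answer = response["output"]["text"]
    missing = [p for p in expected_phrases if p.lower() not in answer.lower()]
    print("PASS" if not missing else f"FAIL (missing: {missing})", "-", question)
```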
Real-World Performance Tips
✅ Do:
- Use hierarchical chunking for complex docs
- Enable CloudWatch logging
- Version your documents in S3
- Test with real user queries
❌ Avoid:
- Retrieving too many chunks (>10)
- Ignoring retrieval scores
- Skipping document preprocessing
- Forgetting to monitor costs
Technical Deep Dive
Insights from AWS Bedrock documentation and production systems
Vector Embeddings Architecture
Vector embeddings are the foundation of semantic search in RAG systems. They convert text chunks into numerical representations that enable mathematical comparison.
Float32 Vectors (Default)
- 32 bits per dimension
- Higher precision
- Standard for most models
- Larger storage requirements
Binary Vectors
- 1 bit per dimension
- 32x storage reduction
- Faster similarity search
- Requires compatible models
💡 Production Tip
Binary vectors can reduce storage costs by 97% while maintaining acceptable accuracy for most use cases. Consider them for large-scale deployments.
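For example, Titan Text Embeddings V2 can return both float and binary vectors from a single InvokeModel call; a sketch, assuming the Titan V2 request/response fields (inputText, dimensions, embeddingTypes, embeddingsByType):

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime")

body = json.dumps({
    "inputText": "Hierarchical chunking balances precision and context.",
    "dimensions": 1024,                    # 256, 512, or 1024 for Titan V2
    "embeddingTypes": ["float", "binary"]  # request both representations
})

response = runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=body,
)
payload = json.loads(response["body"].read())

float_vector = payload["embeddingsByType"]["float"]    # 32 bits per dimension
binary_vector = payload["embeddingsByType"]["binary"]  # 1 bit per dimension
print(len(float_vector), len(binary_vector))
```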
RAG Evaluation Metrics
AWS Bedrock provides built-in metrics to evaluate RAG system performance across two dimensions: retrieval quality and generation quality.
Retrieve-Only Metrics
Context Relevance
Measures how contextually relevant the retrieved texts are to the questions. Essential for ensuring your vector search is finding the right content.
Context Coverage
Measures how much the retrieved texts cover all information in ground truth. Requires ground truth data in your evaluation dataset.
Retrieve-and-Generate Metrics
Correctness
Measures how accurate the responses are in answering questions. The most critical metric for production systems.
Completeness
Measures how well responses answer and resolve all aspects of questions. Prevents partial or incomplete answers.
Helpfulness
Holistic measure of how useful responses are. Combines accuracy, completeness, and user satisfaction.
📋 Supported Evaluator Models
- Amazon Nova Pro
- Claude 3.5 Sonnet (v1, v2)
- Claude 3.7 Sonnet
- Claude 3 Haiku & 3.5 Haiku
- Meta Llama 3.1 70B
- Meta Llama 3.3 70B
- Mistral Large
- Cross-Region inference profiles
GraphRAG with Amazon Neptune
GraphRAG combines graph modeling with generative AI to enhance RAG by leveraging relationships between entities and structural elements within documents.
How GraphRAG Works
- Performs initial vector search for relevant nodes
- Retrieves related graph nodes linked to document chunks
- Expands by traversing the graph and retrieving details
- Provides enriched context by understanding entities and connections
✅ Benefits
- More comprehensive responses
- Better cross-document reasoning
- Enhanced multi-hop queries
- Reduced hallucinations
⚠️ Limitations
- S3-only data sources
- 1000 files per data source
- No autoscaling support
- Claude 3 Haiku for graph building
🌍 Available Regions
US East (N. Virginia), US West (Oregon), Europe (Frankfurt, London, Ireland), Asia Pacific (Tokyo, Singapore)
Pre-processing Pipeline
Understanding the data transformation pipeline is crucial for optimizing RAG performance.
1. Document Ingestion: Convert unstructured data to text format
2. Chunking: Split into manageable pieces (hierarchical, fixed-size, semantic)
3. Embedding Generation: Convert chunks to vector embeddings using Titan or other models
4. Vector Storage: Write to vector index (OpenSearch, FAISS) with document mapping
5. Semantic Search: Query embeddings find similar vectors for retrieval
🔑 Key Insight
The quality of your RAG system is directly tied to your chunking strategy and embedding model choice. Hierarchical chunking with Titan embeddings provides the best balance for most production use cases.
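To make the five steps concrete, here is a minimal self-managed sketch of steps 4-5 using FAISS with an HNSW index (dimension and data are illustrative; Bedrock Knowledge Bases manages these steps for you):

```python
import faiss
import numpy as np

DIM = 1024  # must match the embedding model's output dimension

# Illustrative pre-computed chunk embeddings (step 3 of the pipeline).
chunk_texts = ["chunk one ...", "chunk two ...", "chunk three ..."]
embeddings = np.random.rand(len(chunk_texts), DIM).astype("float32")

# Vector storage: HNSW index with 32 links per node, inner-product metric.
faiss.normalize_L2(embeddings)  # normalized inner product == cosine similarity
index = faiss.IndexHNSWFlat(DIM, 32, faiss.METRIC_INNER_PRODUCT)
index.add(embeddings)  # row i maps back to chunk_texts[i]

# Semantic search: embed the query the same way, then take top-k neighbors.
query = np.random.rand(1, DIM).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunk_texts[i]}")
```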
Key Takeaways
Essential insights for building production RAG systems
🎯 Chunking
Hierarchical chunking provides the best balance of precision and context for most use cases
🔍 Search
FAISS with HNSW algorithm offers sub-millisecond similarity search at scale
🎬 Multimodal
Choose Nova for visual search, BDA for speech/text extraction from multimedia
💰 Costs
OpenSearch Serverless is the main cost driver; manage OCUs carefully