RAG with Bedrock

Deep dive into RAG concepts using AWS Bedrock Knowledge Bases


Chunking Strategies Deep Dive

Understanding different approaches and when to use each

Amazon Bedrock Knowledge Bases supports five main chunking strategies, each optimized for different use cases:

1. Default Chunking

How it works: Splits content into chunks of around 300 tokens, preserving complete sentences within each chunk.

Best for: General-purpose RAG applications, balanced performance

Trade-off: Good balance between context and precision, but may lose some semantic coherence

2. Fixed-Size Chunking

How it works: Configure exact chunk size (tokens) and overlap percentage between consecutive chunks.

Best for: When you need predictable chunk sizes, technical documentation

Trade-off: Maximum control but may split sentences mid-way
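
As a concrete illustration, here is a minimal sketch of attaching a fixed-size data source with the boto3 bedrock-agent client; the knowledge base ID, bucket ARN, and token values are placeholders, not recommendations:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Attach an S3 data source that uses fixed-size chunking.
# All IDs/ARNs below are placeholders.
bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KB_ID",
    name="docs-fixed-size",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::your-docs-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 512,         # exact chunk size in tokens
                "overlapPercentage": 20,  # overlap between consecutive chunks
            },
        }
    },
)
```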

3. Hierarchical Chunking ⭐

How it works: Creates parent chunks (1500 tokens) and child chunks (300 tokens). Searches in child chunks but returns parent chunks for context.

Best for: Complex documents, when context is critical (our implementation uses this!)

Trade-off: Best accuracy and context, slightly more complex setup

💡 Why we chose this:

  • Precision: Child chunks find exact matches
  • Context: Parent chunks provide full picture
  • Best of both worlds: Accurate retrieval + comprehensive answers
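
For reference, the parent/child setup described above corresponds to a chunking configuration like the following sketch (passed as the vectorIngestionConfiguration of the same create_data_source call shown earlier; the 60-token overlap is an illustrative value):

```python
# Hierarchical chunking: search child chunks, return parent chunks.
hierarchical_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": 1500},  # parent chunks: returned for context
                {"maxTokens": 300},   # child chunks: searched for precision
            ],
            "overlapTokens": 60,  # illustrative overlap between child chunks
        },
    }
}
```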

4. Semantic Chunking

How it works: Uses NLP to split at semantic boundaries (topic changes) rather than fixed sizes. Configurable: max tokens, buffer size, breakpoint threshold.

Best for: Long-form content, articles, books where topics flow naturally

Trade-off: Most intelligent splitting but variable chunk sizes
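
The three knobs mentioned above map onto the configuration like so; the values shown are illustrative, not tuned recommendations:

```python
# Semantic chunking: split where the embedding distance between
# neighboring sentence groups crosses a percentile threshold.
semantic_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,                     # upper bound per chunk
            "bufferSize": 1,                      # neighboring sentences compared together
            "breakpointPercentileThreshold": 95,  # higher = fewer, larger chunks
        },
    }
}
```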

5. No Chunking

How it works: Treats each document as a single chunk.

Best for: Short documents, when full context is always needed

Trade-off: Simple but limited to small documents (token limits apply)

Multimodal RAG

Beyond text: images, audio, and video in RAG systems

Bedrock Knowledge Bases supports multimodal content through two distinct approaches:

Nova Multimodal Embeddings

Direct embedding of multimedia files without text conversion

  • ✅ Visual similarity search
  • ✅ Image-based queries
  • ✅ Preserves visual information
  • ⚙️ Configure chunk duration (1-30s for audio/video)

Best for:

Product catalogs, image discovery, visual search

Bedrock Data Automation

Converts multimedia to text first, then embeds

  • ✅ Audio transcription (speech-to-text)
  • ✅ Video scene descriptions
  • ✅ Image OCR + visual content extraction
  • ✅ Works with existing text-based RAG

Best for:

Meeting recordings, training videos, customer support calls
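
In practice, BDA is selected through the data source's parsing configuration. A minimal sketch; field names follow recent boto3 bedrock-agent versions, so verify against yours:

```python
# Parse multimedia with Bedrock Data Automation before embedding:
# audio is transcribed, video scenes described, images OCR'd.
# Passed as part of vectorIngestionConfiguration in create_data_source.
bda_parsing = {
    "parsingConfiguration": {
        "parsingStrategy": "BEDROCK_DATA_AUTOMATION",
        "bedrockDataAutomationConfiguration": {
            "parsingModality": "MULTIMODAL",
        },
    }
}
```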

💡 Key Decision

Choose Nova when visual features matter (product images, diagrams). Choose BDA when speech/text content matters (videos, audio recordings).

Production Best Practices

Lessons learned from real-world implementations

1. Retrieval Optimization

  • Start with 3-5 chunks: more isn't always better, and too many can confuse the LLM (see the sketch after this list)
  • Monitor retrieval scores: Below 0.3 usually indicates poor match
  • Use metadata filtering: Essential for large document sets
  • Test with diverse queries: Validate coverage across topics
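
The first three points combine naturally in a single retrieve call. A sketch with the boto3 bedrock-agent-runtime client; the KB ID, metadata key/value, and the 0.3 cutoff are illustrative:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime")

# Retrieve a handful of chunks, constrain by metadata, and drop weak matches.
response = runtime.retrieve(
    knowledgeBaseId="YOUR_KB_ID",
    retrievalQuery={"text": "How do I configure hierarchical chunking?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,  # start with 3-5 chunks
            "filter": {"equals": {"key": "category", "value": "documentation"}},
        }
    },
)

# Scores below ~0.3 usually indicate a poor match.
strong_matches = [
    r for r in response["retrievalResults"] if r.get("score", 0.0) >= 0.3
]
```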

2. Document Preparation

  • Clean PDFs work best: Avoid scanned images without OCR
  • Structure with clear headings: Helps semantic chunking
  • Remove unnecessary formatting: Focus on content
  • Keep files reasonable: <100MB per file for best performance

3. Cost Management

  • OpenSearch Serverless is the main cost: ~$175/month for 2 OCUs
  • Delete when not in use: Save costs during development
  • Batch document processing: Minimize ingestion jobs
  • Use S3 Intelligent-Tiering: Optimize storage costs

4. Testing Strategy

  • Create test suite: Known questions + expected answers
  • Test edge cases: Ambiguous queries, out-of-scope questions
  • Monitor quality over time: Track response accuracy
  • A/B test chunking strategies: Find what works for your data
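
A lightweight way to start such a test suite: pair known questions with substrings the answer must contain, and run them through retrieve_and_generate. Everything below (IDs, model ARN, test cases) is a placeholder:

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime")

# Minimal regression check: question + expected substring pairs.
TEST_CASES = [
    ("What chunking strategy does the KB use?", "hierarchical"),
    ("What is the parent chunk size?", "1500"),
]

for question, expected in TEST_CASES:
    answer = runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "YOUR_KB_ID",
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
            },
        },
    )["output"]["text"]
    status = "PASS" if expected.lower() in answer.lower() else "FAIL"
    print(f"[{status}] {question}")
```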

Real-World Performance Tips

✅ Do:

  • Use hierarchical chunking for complex docs
  • Enable CloudWatch logging
  • Version your documents in S3
  • Test with real user queries

❌ Avoid:

  • Retrieving too many chunks (>10)
  • Ignoring retrieval scores
  • Skipping document preprocessing
  • Forgetting to monitor costs

Technical Deep Dive

Insights from AWS Bedrock documentation and production systems

Vector Embeddings Architecture

Vector embeddings are the foundation of semantic search in RAG systems. They convert text chunks into numerical representations that enable mathematical comparison.

Float32 Vectors (Default)

  • 32 bits per dimension
  • Higher precision
  • Standard for most models
  • Larger storage requirements

Binary Vectors

  • 1 bit per dimension
  • 32x storage reduction
  • Faster similarity search
  • Requires compatible models
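
The storage gap is easy to see in a toy example (NumPy only; the 1024-dimension size is chosen arbitrarily):

```python
import numpy as np

# Float32 vs. binary vectors: the 32x storage gap and the matching
# similarity measures, in miniature.
rng = np.random.default_rng(0)
float_vec = rng.standard_normal(1024).astype(np.float32)  # 1024 dims
print(float_vec.nbytes)                                   # 4096 bytes

binary_vec = np.packbits(float_vec > 0)                   # 1 bit per dim
print(binary_vec.nbytes)                                  # 128 bytes (32x smaller)

# Float vectors compare via cosine similarity; binary vectors via
# Hamming distance (count of differing bits).
other = rng.standard_normal(1024).astype(np.float32)
cosine = float_vec @ other / (np.linalg.norm(float_vec) * np.linalg.norm(other))
hamming = np.unpackbits(binary_vec ^ np.packbits(other > 0)).sum()
print(cosine, hamming)
```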

💡 Production Tip

Binary vectors can reduce storage costs by 97% while maintaining acceptable accuracy for most use cases. Consider them for large-scale deployments.

RAG Evaluation Metrics

AWS Bedrock provides built-in metrics to evaluate RAG system performance across two dimensions: retrieval quality and generation quality.

Retrieve-Only Metrics

Context Relevance

Measures how contextually relevant the retrieved texts are to the questions. Essential for ensuring your vector search is finding the right content.

Context Coverage

Measures how much the retrieved texts cover all information in ground truth. Requires ground truth data in your evaluation dataset.

Retrieve-and-Generate Metrics

Correctness

Measures how accurate the responses are in answering questions. The most critical metric for production systems.

Completeness

Measures how well responses answer and resolve all aspects of questions. Prevents partial or incomplete answers.

Helpfulness

Holistic measure of how useful responses are. Combines accuracy, completeness, and user satisfaction.

📊 Supported Evaluator Models

  • Amazon Nova Pro
  • Claude 3.5 Sonnet (v1, v2)
  • Claude 3.7 Sonnet
  • Claude 3 Haiku & 3.5 Haiku
  • Meta Llama 3.1 70B
  • Meta Llama 3.3 70B
  • Mistral Large
  • Cross-Region inference profiles

GraphRAG with Amazon Neptune

GraphRAG combines graph modeling with generative AI to enhance RAG by leveraging relationships between entities and structural elements within documents.

How GraphRAG Works

  1. Performs initial vector search for relevant nodes
  2. Retrieves related graph nodes linked to document chunks
  3. Expands by traversing the graph and retrieving details
  4. Provides enriched context by understanding entities and connections
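
All of this happens at retrieval time once the knowledge base is backed by Neptune Analytics. A sketch of creating such a knowledge base with boto3; the ARNs are placeholders, and the storage field names should be verified against your boto3 version:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# GraphRAG: the same vector KB API, backed by a Neptune Analytics graph.
bedrock_agent.create_knowledge_base(
    name="graphrag-kb",
    roleArn="arn:aws:iam::123456789012:role/BedrockKBRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
        },
    },
    storageConfiguration={
        "type": "NEPTUNE_ANALYTICS",
        "neptuneAnalyticsConfiguration": {
            "graphArn": "arn:aws:neptune-graph:us-east-1:123456789012:graph/g-example",
            "fieldMapping": {"textField": "text", "metadataField": "metadata"},
        },
    },
)
```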

✅ Benefits

  • More comprehensive responses
  • Better cross-document reasoning
  • Enhanced multi-hop queries
  • Reduced hallucinations

⚠️ Limitations

  • S3-only data sources
  • 1000 files per data source
  • No autoscaling support
  • Claude 3 Haiku for graph building

🌍 Available Regions

US East (N. Virginia), US West (Oregon), Europe (Frankfurt, London, Ireland), Asia Pacific (Tokyo, Singapore)

Pre-processing Pipeline

Understanding the data transformation pipeline is crucial for optimizing RAG performance.

  1. Document Ingestion: convert unstructured data to text format
  2. Chunking: split into manageable pieces (hierarchical, fixed-size, semantic)
  3. Embedding Generation: convert chunks to vector embeddings using Titan or other models
  4. Vector Storage: write to a vector index (OpenSearch, FAISS) with document mapping
  5. Semantic Search: query embeddings find similar vectors for retrieval
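
Steps 2-5 can be sketched end to end with Titan Text Embeddings V2 and a toy in-memory index; the chunks and query below are illustrative:

```python
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    """Embed text with Titan Text Embeddings V2."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"], dtype=np.float32)

# Steps 2-5 in miniature: chunk, embed, index, search.
chunks = [
    "Hierarchical chunking uses 1500-token parents and 300-token children.",
    "OpenSearch Serverless stores the vector index for the knowledge base.",
]
index = np.stack([embed(c) for c in chunks])  # toy in-memory "vector store"

query = embed("How are parent chunks used?")
scores = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
print(chunks[int(scores.argmax())])  # best-matching chunk
```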

🔑 Key Insight

The quality of your RAG system is directly tied to your chunking strategy and embedding model choice. Hierarchical chunking with Titan embeddings provides the best balance for most production use cases.

Key Takeaways

Essential insights for building production RAG systems

🎯 Chunking

Hierarchical chunking provides the best balance of precision and context for most use cases

πŸ” Search

FAISS with HNSW algorithm offers sub-millisecond similarity search at scale

📊 Multimodal

Choose Nova for visual search, BDA for speech/text extraction from multimedia

💰 Costs

OpenSearch Serverless is the main cost driver; manage OCUs carefully