Architecting Agentic RAG Systems: Transform NPCs Into AI Agents
Taking Python to production on AWS. Architecting scalable and modular RAG feature pipelines.
This week’s topics:
Architecting scalable and modular RAG feature pipelines
Taking Python to production on AWS
Architecting agentic RAG systems for transforming NPCs into adaptable agents
Architecting scalable and modular RAG feature pipelines
Here's something most people overlook:
The RAG feature pipeline is the most important part of the entire AI stack. Not the LLM. Not the prompt. Not even the fancy agent framework.
Why?
Because if your retrieval sucks, your generation will too - no matter how good your model is.
Let me walk you through how we architected our RAG feature pipeline for the Second Brain AI Assistant:
The pipeline ingests raw documents from MongoDB, where both Notion and crawled content are stored in a single standardized collection. We don’t care about the source - just that the data is clean and usable.
The output?
Chunked + embedded documents
Stored in a dedicated vector store
Indexed and ready for semantic search
Pipeline Structure
Where does this pipeline fit?
It’s an offline batch pipeline, decoupled from the live user experience. At query time, the pipeline does not run. All processing is done beforehand, so retrieval is fast, stable, and cost-efficient. Meanwhile, the online pipeline (our agentic RAG module) fetches the chunks, reasons over them, and generates the answer.
The architecture is made up of 7 key components:
1/ Data Extraction
Pulls all raw documents from MongoDB, regardless of source.
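For illustration, here is a minimal extraction step in Python using pymongo. The URI, database, and collection names are my placeholders, not the course's actual identifiers:

```python
from pymongo import MongoClient

# Placeholder URI and names - swap in your own deployment's values.
client = MongoClient("mongodb://localhost:27017")
collection = client["second_brain"]["raw_documents"]

def extract_documents() -> list[dict]:
    # One standardized collection holds both Notion exports and
    # crawled pages, so a single query covers every source.
    return list(collection.find({}))
```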
2/ Document Filtering
Applies quality scores to drop noisy or low-value docs.
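Conceptually, this step is just a threshold over a precomputed score. A sketch, with an assumed field name and cutoff:

```python
QUALITY_THRESHOLD = 0.5  # illustrative cutoff, not the course's value

def filter_documents(docs: list[dict]) -> list[dict]:
    # Drop noisy or low-value documents based on their quality score.
    return [d for d in docs if d.get("quality_score", 0.0) >= QUALITY_THRESHOLD]
```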
3/ Chunking
Splits documents into manageable segments for vectorization.
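A sketch of what this can look like with LangChain's text splitters (the chunk sizes are illustrative, and the course may use different tooling):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

def chunk_document(doc: dict) -> list[dict]:
    # Keep a pointer to the parent document so the parent retrieval
    # strategy (component 4) can link each chunk back to its full doc.
    return [
        {"parent_id": doc["_id"], "text": text}
        for text in splitter.split_text(doc["content"])
    ]
```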
4/ Post-Processing
Applies one of two retrieval strategies (chosen from our YAML configs):
Parent Retrieval → Links each chunk to its full doc
Contextual Retrieval → Adds summaries to enrich semantic relevance
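To make the config-driven switch concrete, here is a rough sketch. The YAML schema, field names, and the summarize() stub are all my assumptions:

```python
import yaml

# Illustrative config; the actual YAML schema in the course may differ.
config = yaml.safe_load("""
retrieval:
  strategy: contextual   # or: parent
""")

def summarize(doc: dict) -> str:
    # Stand-in: the real contextual strategy would call an LLM here.
    return doc["content"][:200]

def post_process(chunks: list[dict], docs_by_id: dict) -> list[dict]:
    strategy = config["retrieval"]["strategy"]
    for chunk in chunks:
        parent = docs_by_id[chunk["parent_id"]]
        if strategy == "parent":
            # Parent retrieval: link the chunk to its full document.
            chunk["parent_text"] = parent["content"]
        else:
            # Contextual retrieval: prepend a summary to enrich semantics.
            chunk["text"] = summarize(parent) + "\n\n" + chunk["text"]
    return chunks
```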
5/ Embedding
All chunks (for both strategies) are vectorized using a configurable embedding model (OpenAI or Hugging Face).
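A provider-agnostic sketch of that step. The model names are examples; the course's defaults may differ:

```python
def embed(texts: list[str], provider: str = "huggingface") -> list[list[float]]:
    if provider == "openai":
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.embeddings.create(
            model="text-embedding-3-small", input=texts
        )
        return [item.embedding for item in resp.data]
    # Hugging Face route via sentence-transformers
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(texts).tolist()
```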
6/ Indexing
Embeddings are stored and indexed in MongoDB for fast lookups.
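Assuming MongoDB Atlas Vector Search (which is what $vectorSearch-style retrieval requires), the indexing step might look like this. Collection name, field names, and dimensions are assumptions:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

# Vector search indexes only work against an Atlas cluster;
# the URI is a placeholder.
chunks_coll = MongoClient("mongodb+srv://...")["second_brain"]["rag_chunks"]

def index_chunks(chunks: list[dict]) -> None:
    chunks_coll.insert_many(chunks)  # each chunk carries an "embedding" field
    chunks_coll.create_search_index(
        SearchIndexModel(
            name="vector_index",
            type="vectorSearch",
            definition={"fields": [{
                "type": "vector",
                "path": "embedding",
                "numDimensions": 384,  # must match the embedding model
                "similarity": "cosine",
            }]},
        )
    )
```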
7/ Final Output
A structured, searchable knowledge base—ready for RAG-based generation.
Pipeline Management
The entire pipeline is managed by ZenML, which is:
Reproducible
Configurable
Versioned
Traceable
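Reusing the hypothetical helpers from the sketches above, the wiring is just a few decorated functions. ZenML's @step and @pipeline decorators are real; the step names are mine, not the course repo's:

```python
from zenml import pipeline, step

@step
def extract() -> list[dict]:
    return extract_documents()

@step
def process(docs: list[dict]) -> list[dict]:
    docs = filter_documents(docs)
    chunks = [c for d in docs for c in chunk_document(d)]
    vectors = embed([c["text"] for c in chunks])
    for chunk, vec in zip(chunks, vectors):
        chunk["embedding"] = vec
    return chunks

@step
def index(chunks: list[dict]) -> None:
    index_chunks(chunks)

@pipeline
def rag_feature_pipeline():
    index(process(extract()))

if __name__ == "__main__":
    rag_feature_pipeline()  # each run is versioned and traceable in ZenML
```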
If you’re serious about building production-grade GenAI systems, this is where you should focus, because 90% of your generation’s quality is determined before the LLM even gets involved.
Let’s stop treating pipelines like afterthoughts.
They are the product.
Want to learn more?
Check out Lesson 5 from the Second Brain AI Assistant free course ↓
Taking Python to Production on AWS (Affiliate)
If you're an aspiring AI engineer, there are 2 essential skills you must master before touching any ML model or AI product:
Programming (Yes, even with LLMs! Python is a good start)
Cloud Engineering
Fortunately, my friend Eric Riddoch (Director of ML Platforms at Pattern and a brilliant cloud, DevOps, and MLOps engineer) noticed this gap and built a course to fill it.
Enter the Taking Python to Production on AWS live course.
The next cohort begins next week! It runs from June 16 through July 31.

Using the code DECODINGML will get you 10% off your registration.
If that's not enough, Eric offers a scholarship program that can significantly reduce the price, depending on your use case.
Architecting agentic RAG systems for transforming NPCs into adaptable agents
95% of agents never leave the notebook.
And it’s not because the code is bad…
It’s because the system around them doesn’t exist.
Here's my point:
Anyone can build an agent that works in isolation.
The real challenge is shipping one that survives real-world conditions (e.g., live traffic, unpredictable users, scaling demands, and messy data).
That's exactly what we tackled in Lesson 1 of the PhiloAgents course.
We started by asking, "What does an agent need to survive in production?" 🤔
And decided on 4 things:
It needs an LLM that runs in real time.
A memory to understand what just happened.
A brain that can reason and retrieve factual information.
A monitor to ensure it all works under load.
So we designed a system around those needs.
The frontend is where the agent comes to life.
We used Phaser to simulate a browser-based world.
But more important than the tool is the fact that this layer is completely decoupled from the backend (so game logic and agent logic evolve independently).
The backend, built in FastAPI, is where the agent thinks.
We stream responses token-by-token using WebSockets.
All decisions, tool calls, and memory management happen server-side.
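A bare-bones sketch of that streaming loop in FastAPI. The route and the generate_tokens() stub are stand-ins for the real agent, which lives server-side:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def generate_tokens(message: str):
    # Placeholder for the agent's streamed LLM output.
    for token in ["Hello", ", ", "world", "!"]:
        yield token

@app.websocket("/ws/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    while True:
        message = await websocket.receive_text()
        async for token in generate_tokens(message):
            await websocket.send_text(token)  # token-by-token to the frontend
```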
Inside that backend sits the agentic core - a dynamic state graph that lets the agent reason step-by-step.
The agent is orchestrated by LangGraph and powered by Groq for real-time inference speeds.
It can ask follow-up questions, query external knowledge, or summarize what’s already been said (all in a loop).
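The real graph loops between reasoning, retrieval, and summarization nodes via conditional edges; this stripped-down skeleton (node names and model id are my examples) just shows the moving parts:

```python
from typing import TypedDict
from langchain_groq import ChatGroq
from langgraph.graph import StateGraph, START, END

llm = ChatGroq(model="llama-3.3-70b-versatile")  # needs GROQ_API_KEY set

class AgentState(TypedDict):
    messages: list

def reason(state: AgentState) -> AgentState:
    # One reasoning step; the full graph adds retrieval and
    # summarization nodes plus conditional edges to loop.
    reply = llm.invoke(state["messages"])
    return {"messages": state["messages"] + [reply]}

graph = StateGraph(AgentState)
graph.add_node("reason", reason)
graph.add_edge(START, "reason")
graph.add_edge("reason", END)
agent = graph.compile()
```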
When the agent needs facts, it queries long-term memory.
We built a retrieval system that mixes semantic and keyword search, using cleaned, de-duplicated philosophical texts crawled from the open web.
That memory lives in MongoDB and gets queried in real time.
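Here is a sketch of the semantic half of that retriever, using MongoDB Atlas's $vectorSearch aggregation stage (index and field names are assumptions). The keyword half would use an Atlas full-text $search query, with the two result lists fused, e.g., via reciprocal rank fusion:

```python
from pymongo import MongoClient

# Placeholder URI/names; $vectorSearch requires an Atlas vector index.
memory = MongoClient("mongodb+srv://...")["philoagents"]["long_term_memory"]

def semantic_search(query_vector: list[float], k: int = 5) -> list[dict]:
    return list(memory.aggregate([
        {"$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 10 * k,  # oversample, then keep the top k
            "limit": k,
        }},
        {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]))
```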
Meanwhile, short-term memory tracks the conversation thread across turns.
Without it, every new message would be a reset. With it, the agent knows what’s been said, what’s been missed, and how to respond.
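One common way to get that thread-level memory with LangGraph is a checkpointer: the same thread_id restores the conversation state on every turn. A sketch, reusing the graph from above (the thread id is hypothetical):

```python
from langgraph.checkpoint.memory import MemorySaver

agent = graph.compile(checkpointer=MemorySaver())

result = agent.invoke(
    {"messages": [("user", "Who was Socrates?")]},
    config={"configurable": {"thread_id": "player-42"}},  # one id per thread
)
```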
But here’s the part most people skip: observability.
If you want to improve your system, you need to see and measure what it's doing.
Using Opik (by Comet), we track every prompt, log every decision, and evaluate multi-turn outputs using automatically generated test sets.
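At its simplest, instrumenting a function with Opik is one decorator; the function body below is a stand-in for the real agent call:

```python
from opik import track

@track  # logs inputs, outputs, and latency as a trace in Opik
def answer_turn(question: str) -> str:
    # Stand-in for the real agent invocation; in the actual system
    # this wraps the LangGraph call so every step shows up in the trace.
    return f"Thinking about: {question}"

print(answer_turn("What would Aristotle say about AI?"))
```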
Put it all together and you get a complete framework that remembers, retrieves, reasons, and responds in a real-world environment.
Oh... and we made the whole thing open source.
Learn how to implement it yourself by starting the PhiloAgents free course ↓
Whenever you’re ready, there are 3 ways we can help you:
Perks: Exclusive discounts on our recommended learning resources (books, live courses, self-paced courses, and learning platforms).
The LLM Engineer’s Handbook: Our bestseller book on teaching you an end-to-end framework for building production-ready LLM and RAG applications, from data collection to deployment (get up to 20% off using our discount code).
Free open-source courses: Master production AI with our end-to-end open-source courses, which reflect real-world AI projects and cover everything from system architecture to data collection, training and deployment.
Images
If not otherwise stated, all images are created by the author.