The year 2025 has brought with it an explosion of interest in AI agents. In our community and commercial engagements, I’ve seen these agents move beyond simple chatbot interactions and start integrating into real business workflows. From writing software and executing GitHub workflows to creating complex, multi-modal content and even running business processes like sales alongside human staff, these agents take on complex tasks. They combine autonomous decision-making with human oversight, enabling teams to achieve more.
Agentic applications are different from traditional software like robotic process automation because they perceive a defined environment, assess available information, and dynamically adjust their behavior to make decisions and achieve assigned goals. This flexibility makes agentic systems particularly effective in handling unstructured data, ambiguous instructions, dynamic environments with ever-increasing edge cases, or multi-step processes.
In this article, I’ll walk you through a day in the life of Jay, a typical developer I (product manager) and Abhi (engineering manager) work with. We will be highlighting the various stages we go through and the challenges we solve. To make this more tangible, let's pretend we’re building an agentic system to turn technical documents into interactive podcasts for our R&D team.
Decoding ML partnered with Amir Feizpour, a pioneer in AI, known for his expertise in Multi-Agent Systems and AI-driven business optimization. He is the founder of Aggregate Intellect (a.i.), an R&D company, and AISC, a community of over 5,000 AI researchers and engineers. Amir instructs Build Multi-Agent Applications, a hands-on bootcamp that includes 1-on-1 mentorship from a team of industry experts, guiding participants in building real-world multi-agent systems. Amir acknowledges contributions from the teaching staff members, Abhi and Jay, in writing this article.
Phase 1: Design and Data Preparation
Defining the Problem
Before we start developing any product, defining the right problem is crucial. This involves understanding the scope of the project, gathering requirements from stakeholders, talking to the consumers of the application, and ensuring alignment with business objectives. For example, in the podcast generation system, there are two main requirements:
Creating a podcast that is interesting for a user, and
Adding a way for the user to interact with the content of the podcast.
Each of these can be rabbit holes, but what is the core requirement that we need to achieve?
To do this Abhi and I would create a Google form to survey the end users with questions like,
Do you read research papers?
Do you listen to podcasts?
Do you go to and fro between resources like wikipedia, google search while understanding a research paper?
Do you need to refer to cited papers for details of the current paper you are reading?
We then interview some of the respondents to watch their workflow and learn from the steps they usually take and note the friction points in that process. Then I have to also meet with the business stakeholders and understand their objectives and KPIs and overlap with what we have seen in the end user interviews. Abhi and Jay have to work with each other to determine the feasibility of what we want to build with performance and cost in mind. Balancing these sometimes contrasting points of view and priorities is usually a good way to narrow down to the exact workflow that we need to design our agentic system for.
Agent Design & Task Decomposition
The next step is to turn the human workflow we would like to impact into an agentic system architecture. At the heart of this lies the decomposition of the workflow into subtasks and diagraming how they feed into each other. These subtasks, depending on the scope of what they are, will turn into agents or software services. There would be a lot of back and forth between Abhi, Jay and myself as we sort this out.
For our podcast generation system, there are 4 components that Jay builds:
Text Extraction Service: Parses PDFs using Docling, ensuring structured text output that downstream agents and services can use.
Podcast Outline and Questions generation: This is managed by two flow “agents” using LangGraph allowing us to generate outline and question based on an intention provided by the user.
Dialogue Generation: This requires some emulation of a discussion happening on a topic between two people, so Jay would use a graph with 2 agents namely “host” and “expert” talking in turns responding to each other
Text-to-speech Service: Finally, Jay would use Elevenlabs API to convert text to speech for the audio generation of the content.
User Interaction flow has 4 components connected together via a graph using LangGraph:
Question classifier: A simple LLM call to classify questions into possible actions available.
RAG Agent: To answer questions from the technical document and related technical documents available
Web Agent: To answer questions from topics that are not part of the document
Host: Restructure the podcast based on the conversation with the user
This modular design provides a structured yet flexible approach to not only simplify development but also enable scalability and testability. Adding new features (e.g., multilingual support) becomes a matter of integrating additional subtask agents or services without overhauling the entire system
To enable these agents to collaborate, clear communication protocols and a solid way to manage the state of the system are essential. A lot of the time spent on the implementation is for establishing well-defined inputs, outputs, and error-handling mechanisms to allow agents to exchange data seamlessly and to ensure each agent's output meets the quality expectations required for subsequent steps. For example, the Text Extraction Agent may use a classifier to provide confidence scores for extracted text, helping the podcast outline generation Agent to prioritize content for inclusion in the final audio output.
Data Collection and Preprocessing
The foundation of any agentic application lies in its data. For the podcast system, we expect Jay to struggle a bit with extracting text from PDFs. Technical documents often contain complex layouts, such as tables, figures, and formulas, which are difficult to capture accurately. For PDF extraction the place to start is simple text extraction and putting everything in the context of the prompt.
For a project like this we need to build a knowledge base by extracting structured information from the document of interest and any related ones. We use Docling to extract the text, tables, images and bibliography of the paper selected and filter our unwanted text.
At this point we might need to invest in a database and also start looking into chunking and search strategies. Chroma is a simple in memory vector store that is a good place to start and for images we would probably use a multimodal vector store like LanceDB.
Phase 2: Development and Iterative Evaluation
Model Integration
As our cycles of experimentation continue with decomposing and composing various tasks in the process of generating podcasts, we might end up with multiple flows and agents requiring different levels of intelligence to solve the problems.
Choosing the right LLMs for each flow is crucial because it will impact how well you perform and how much it costs to maintain and service such an application. Models must be selected based on the demands of the system: “thinking” models for creating the outline, summarization models to generate concise content.
A great example of such a choice is mapping user asked questions to possible available actions. This is a task that does not require a lot of context and is easily solvable with a small trained model like BERT which can run on a single cpu and much faster. Though, as a fall back if the BERT model is indecisive we could hand off the decision to a larger model.
For our “thinking” tasks we could use Chat GPT 4o-mini but there are also several other ones to try out. To enable all these changes and assessments we would need multiple validation sets for each part of our system.
Evaluation and Debugging
Connecting agents and services is just the beginning—ensuring reliability is the real challenge. Evaluating agentic systems, especially those generating open-ended content, is far more complex than traditional software testing. We have to navigate three key hurdles:
1. Managing Non-Determinism and Complexity
LLM outputs vary, making exact-match testing impossible. Multi-agent interactions in LangGraph introduce emergent behaviors that require evaluation at scale. To track performance trends, Jay runs multiple tests and leverages LangSmith for detailed logging—critical for debugging failures, such as tracing factual errors back to the RAG or Dialogue Generation agent. Additionally, he builds unit tests for individual agents and services, such as verifying the Text Extraction Service’s accuracy in isolation before testing the integrated system.
2. Defining and Measuring Quality
Traditional NLP metrics like ROUGE and METEOR often fall short, especially in open-ended scenarios. Instead, Jay focuses on:
Factual Accuracy – Is the RAG agent grounding responses correctly? Does the "Expert" avoid hallucinations?
Coherence – Does the dialogue flow logically between the "Host" and "Expert"?
Relevance – Does the system classify queries correctly and retrieve the right information?
Task Completion – Does it generate a complete, usable podcast?
Audio Consistency – Are voices clear, natural, and consistent across segments? Are there noticeable variations in tone, speed, or quality that break immersion?
3. Scaling Evaluation
But how can these metrics be measured effectively? One approach is manual review by domain experts, but this is slow and costly. To scale, Jay turns to LLM-powered evaluation using LangSmith or Arize AI, configuring models as judges based on predefined rubrics. However, automated scoring must align with human judgment. To ensure reliability, Jay validates LLM assessments against a human-annotated dataset using metrics like Cohen’s Kappa before scaling. This hybrid approach—automated checks combined with targeted human feedback—enables continuous iteration and real-world impact.
Phase 3: Deployment, Monitoring, and Iteration
Transitioning our interactive podcast generator from testing to live production requires Jay, guided by Abhi, to tackle critical operational challenges. Success hinges on more than just functionality; it demands scalability, resilience, and security.
Handling Scale: Performance, Latency & Cost
Challenge: The multi-agent workflow can become slow and costly under heavy user load or with large documents. Jay needs fast response times and manageable operational expenses.
Solution: Jay employs optimization techniques like asynchronous processing, caching results (e.g., extracted text), potentially using faster/cheaper LLMs for simpler tasks (like classification), and leveraging cloud auto-scaling. Post deployment; continuous monitoring of latency and costs via Real-time monitoring tools like Datadog, or ELK stack helps identify and fix bottlenecks.
Ensuring Robustness and Consistent Quality
Challenge: Real-world data (messy PDFs, ambiguous user queries) can break the agent chain, and model performance or quality (accuracy, coherence) can degrade over time due to LLM updates or shifting data patterns.
Solution: Jay builds resilience with robust error handling and fallbacks within the agent graph. He uses LangSmith for production monitoring to catch failures from unexpected inputs. Crucially, he establishes continuous evaluation, using LLM-as-judge (via LangSmith or Arize AI) validated against human spot-checks (using Cohen's Kappa) to monitor key quality metrics. This feedback loop allows for proactive prompt tuning and logic updates to maintain high standards.
Addressing Safety and Security Risks
Challenge: Interactive agents face risks like users asking for harmful content generation or attackers attempting prompt injection to manipulate behavior (e.g., bypass filters or reveal system instructions).
Solution: Jay implements multiple safety layers. This includes input/output filtering, potentially using specialized safety models like Llama Guard, to screen for harmful content. He designs prompts defensively to resist injection attacks. Specific guardrails are coded into the agent logic – for instance, the Question Classifier can identify and refuse off-topic or unsafe requests. Continuous monitoring for adversarial inputs and potential security breaches remains essential.
Parting Thoughts
As you can see, a lot goes into building an agentic system. If you want to build like Jay and are wondering where to start, we have put together a workbook for you that walks you step by step through the process.
For a deeper dive into building agents, check out our course Building Multi-Agent Applications, and get $100 off with code PAUL.
Images
If not otherwise stated, all images are created by the author.
How much of this article was written by an LLM? I appreciate the article, but it feels like a lot of boilerplate. I'm starting to experience this a lot with online content.