
Voice agents with RAG and function calling

A voice agent that only chats is a toy. Function calling and RAG turn it into a product. Here is how the pieces fit and where the latency hides.

SipPulse AI Engineering Team · April 5, 2026 · 7 min read

A voice agent that can only chat is a demo. A voice agent that can call your CRM, look up an account, charge a card and update an order is a product. The difference is function calling and RAG. The hard part is wiring both into a streaming conversation without breaking the latency budget that makes voice agents feel natural in the first place. This post walks through what function calling and RAG actually do for a voice agent, the hybrid architectures that have become standard in 2026, when the agent should call a tool versus when it should keep talking, and how SipPulse AI handles the orchestration.

Why a voice agent needs tools and RAG

Real customer requests rarely stay within the LLM's training data. The customer asks for their account balance; the agent has to call the billing API. The customer asks about a product feature; the agent has to retrieve the latest documentation. The customer asks to reschedule; the agent has to check the calendar API and write the new booking. None of these is answerable from the model's prompt alone.

Two patterns solve this:

  • Function calling: the LLM decides mid-conversation that it needs to call an API, the runtime executes the call, the result feeds back into the next LLM turn
  • RAG (Retrieval Augmented Generation): the agent retrieves relevant documents from a knowledge base, injects them into the prompt, then generates an answer grounded in the retrieved content

Function calling handles dynamic state (account data, transactions, real-time inventory). RAG handles documented knowledge (product docs, policies, FAQs). A production voice agent typically uses both.

Function calling: the API call mid-conversation

Function calling lets the LLM decide, in the middle of a turn, that it needs to invoke an API. The model emits a structured tool call with parameters, the runtime executes it, and the response is folded back into the prompt for the next generation step. To the user, the agent says "let me check that for you", pauses for half a second, then says "your balance is $432.18". The whole loop completes in under a second on a well-tuned stack.
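The loop can be sketched in a few lines. This is a minimal illustration, not any specific vendor's API: the `get_balance` tool, its stub payload, and the account ID are all hypothetical.

```python
import json

# Hypothetical tool registry: tool name -> callable. The billing lookup
# is a stub standing in for a real API client.
TOOLS = {
    "get_balance": lambda account_id: {"balance": 432.18, "currency": "USD"},
}

def run_tool_call(tool_call: dict) -> str:
    """Execute a structured tool call emitted by the LLM and return the
    JSON result that gets folded back into the next prompt turn."""
    fn = TOOLS[tool_call["name"]]
    args = tool_call["arguments"]
    if isinstance(args, str):  # some runtimes emit arguments as a JSON string
        args = json.loads(args)
    return json.dumps(fn(**args))

# The model decided mid-turn that it needs billing data:
call = {"name": "get_balance", "arguments": {"account_id": "acct_123"}}
result = run_tool_call(call)  # appended as a tool-result message, then the LLM generates again
```

The key design point is that the runtime, not the model, executes the call; the model only ever sees the serialized result in its context.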

The implementation matters. Function calling adds latency: the API call has to round-trip and the LLM has to generate a second response. Production voice agents handle this in three ways:

  • Speak filler while the call is in flight: "let me look that up for you" buys the round-trip time
  • Stream partial results: the agent starts speaking the response as soon as the first tokens arrive, not after the full response generates
  • Pre-fetch likely tool calls: if the customer is asking about an account, fetch account data before they finish the sentence

A voice agent that goes silent for 2 seconds while a tool call resolves feels broken. A voice agent that says "checking now" and continues smoothly feels professional.
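The first mitigation, filler speech, is easy to sketch with concurrency: start the tool call and the filler phrase at the same time so the round-trip hides behind the spoken sentence. The API and timing here are simulated stand-ins.

```python
import asyncio

async def call_billing_api(account_id: str) -> dict:
    await asyncio.sleep(0.4)  # simulated API round-trip
    return {"balance": 432.18}

async def speak(text: str) -> None:
    print(text)  # stand-in for streaming TTS playback

async def answer_with_filler(account_id: str) -> dict:
    # Launch the tool call first, then speak while it is in flight.
    # The filler phrase buys most or all of the round-trip time.
    task = asyncio.create_task(call_billing_api(account_id))
    await speak("Let me look that up for you.")
    return await task

result = asyncio.run(answer_with_filler("acct_123"))
```

If the filler takes longer to speak than the call takes to resolve, the pause disappears entirely from the user's perspective.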

RAG for voice: grounding the agent in your data

RAG turns your knowledge base into something the voice agent can answer from. The pattern: documents are chunked and embedded into a vector database. When the user asks a question, the agent retrieves the most relevant chunks, injects them into the LLM prompt as context, and generates a grounded answer.
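The retrieve-then-generate pattern can be shown end to end with a toy index. A real system uses a dense embedding model and a vector database; the bag-of-words "embedding", the two sample chunks, and the similarity scoring here are deliberately simplified.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system uses a dense model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunked knowledge base, "embedded" once at index time.
chunks = [
    "Refunds are processed within 5 business days of approval.",
    "Premium plans include priority phone support.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(question: str, k: int = 1) -> list:
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

context = retrieve("How long do refunds take?")
prompt = f"Answer only from this context:\n{context[0]}\n\nQ: How long do refunds take?"
```

The grounding comes from the prompt instruction plus the retrieved chunk; the generation step itself is unchanged.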

For voice specifically, RAG has tighter constraints than for chat:

  • Latency: vector search has to complete fast enough to fit in the conversation budget, ideally under 100ms
  • Precision over recall: a wrong retrieval becomes a wrong spoken answer that the user cannot easily inspect
  • Length budget: spoken answers should be short, so RAG context has to be selective rather than dumping the entire chunk
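The second and third constraints can be enforced in one filtering step: keep only chunks above a similarity cutoff, and cap total context length. The threshold and character budget below are illustrative values, not recommendations.

```python
SIM_THRESHOLD = 0.75  # hypothetical cutoff; tune per corpus and embedding model

def grounded_context(scored_chunks: list, max_chars: int = 400) -> list:
    """Precision over recall: keep only high-confidence chunks, and cap
    total context length so the spoken answer stays short."""
    kept, total = [], 0
    for chunk, score in sorted(scored_chunks, key=lambda p: p[1], reverse=True):
        if score < SIM_THRESHOLD:
            continue  # better to drop a marginal chunk than speak a wrong answer
        if total + len(chunk) > max_chars:
            break
        kept.append(chunk)
        total += len(chunk)
    return kept  # empty list -> agent says it does not know, rather than guess

ctx = grounded_context([("Refund policy: 5 business days.", 0.91),
                        ("Unrelated onboarding doc.", 0.42)])
```

An empty result is a feature here: for voice, "I'm not sure, let me transfer you" beats a confidently wrong spoken answer.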

Native RAG integrations sync company knowledge bases directly into the agent's retrieval index. The maintenance overhead matters as much as the initial setup: when a policy changes, the embeddings have to be rebuilt, or the agent confidently quotes the old policy.

Hybrid speech-to-speech and cascade

Many production deployments in 2026 are hybrid: speech-to-speech for small talk and emotion-rich openings, with cascade fallback (STT, LLM, TTS as separate components) when a tool call or RAG lookup is needed. The reasoning is split:

  • Speech-to-speech models (single model takes audio in, produces audio out) win on opening latency. They are great for "Hi, how can I help?" and short emotional acknowledgments.
  • Cascade pipelines give the most flexibility, the cleanest tool use, and the choice of which LLM and TTS to swap in. They are necessary the moment the agent has to call an API or retrieve from a knowledge base.

The right pattern depends on the use case. A booking agent that always needs to call a calendar API stays cascade end-to-end. A friendly receptionist that mostly handles small talk benefits from speech-to-speech for openings and cascade only when intent escalates. NVIDIA's Nemotron, released at CES 2026, and the GPT-4o realtime API are the two canonical speech-to-speech options; both support function calling, with cascade typically still preferred for complex tool orchestration.
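The routing decision itself is simple once intents are labeled. A sketch, with a hypothetical intent set; a real router would use a classifier or the LLM itself to assign the label.

```python
# Hypothetical intents that require an API call or RAG lookup.
TOOL_INTENTS = {"account_lookup", "booking", "payment", "order_status"}

def route(intent: str) -> str:
    """Send emotion-rich small talk to the speech-to-speech model and
    anything needing tools or retrieval to the cascade pipeline."""
    return "cascade" if intent in TOOL_INTENTS else "speech_to_speech"

route("greeting")        # -> "speech_to_speech"
route("account_lookup")  # -> "cascade"
```

The handoff point is the design decision: escalate too eagerly and you lose the latency win, too late and the speech-to-speech model stalls on a question it cannot answer.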

When the agent should NOT call a tool

The other side of function calling is restraint. A voice agent that calls a tool for every question is slow. Production prompting includes a clear policy on when to call:

  • Look up data the agent does not know (account balance, order status, product availability)
  • Take an action (book, cancel, charge, update)
  • Verify a fact when uncertain (is this product still available, has the price changed)

But not:

  • Greet the customer (no tool needed)
  • Confirm understanding (the LLM can paraphrase without help)
  • Handle refusals or out-of-scope requests (the agent should follow safety instructions, not call a tool to think about it)

The discipline matters because every unnecessary tool call adds latency and cost.
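One common way to encode this restraint is directly in the system prompt. The wording below is illustrative, not a tested prompt:

```python
# Sketch of a tool-use policy appended to the agent's system prompt.
TOOL_POLICY = """\
Call a tool ONLY to:
- look up data you do not know (balance, order status, availability)
- take an action (book, cancel, charge, update)
- verify a fact you are uncertain about
Do NOT call a tool to greet, confirm understanding, or handle
out-of-scope requests; follow your safety instructions directly.
"""

def build_system_prompt(base: str) -> str:
    return base + "\n\n" + TOOL_POLICY

prompt = build_system_prompt("You are a billing support agent.")
```

Keeping the policy in one place also makes it auditable: when the agent over-calls a tool in production, the fix is a prompt edit, not a code change.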

Where SipPulse AI fits

SipPulse AI supports function calling and RAG natively in the agent prompt. You declare the tools the agent has access to, the runtime handles the LLM round-trip, the streaming TTS speaks filler while tools resolve. RAG can pull from any vector database via standard APIs, and the prompt scaffolding makes it easy to enforce "answer only from retrieved context" rules to reduce hallucination.

NIVA, our block-based builder, lets non-engineers configure which agents have which tools, route the conversation between specialized agents based on intent, and combine voice agent blocks with classical IVR steps. A retention agent with deep account-lookup tools and a billing agent with payment tools can live in the same flow with a clean handoff.

You can talk to a function-calling demo at sippulse.ai/demos and see the per-call latency on the example telemetry viewer. When function calling is in the loop, the per-stage timing in the webhook payload makes it easy to see whether latency is hiding in the API call or in the LLM second pass.


Conclusion

A voice agent that calls APIs and grounds itself in your knowledge base is the difference between a chat toy and a product. Function calling and RAG carry latency and complexity costs, and hybrid speech-to-speech plus cascade is the dominant production pattern in 2026. Try our demo to feel function calling in action, or contact our team to wire SipPulse AI into your tools and data.

#voice agent · #RAG voice · #function calling · #tool use · #knowledge base · #hybrid architecture
