Most AI agent projects stall. MIT found that 95% of enterprise generative AI pilots deliver little to no measurable business impact. The gap between pilot and production is where most programmes die, and for AI agents in financial services that gap is wider than in any other industry, because a wrong answer can be a compliance breach.
This guide is for ops and CX teams who are ready to change that. These six steps give you a clear path from zero to a running agent, and set you up to grow it well past the point where most deployments stall.
Step 1: Build your knowledge foundation
The agent only knows what you tell it.
Before your agent can handle a single customer query, it needs a knowledge foundation. The challenge is that this knowledge is rarely in one place. It lives in your help centre, in your team's heads, in policy documents that were last updated 18 months ago, and in the informal answers your best agents give without thinking twice.
Getting this right is the most important thing you do before launch. Poor agent responses don't mean the model is broken. They usually mean there's a gap in what the agent was given to work with.
Every AI agent platform structures knowledge differently. In Gradient Labs, we organise it into three layers:
Your knowledge base is the foundation. It syncs from your help centre and covers anything customer-facing. It is version-controlled and reviewed, which makes it the most reliable source the agent draws from.
Facts are system-generated insights extracted from past conversations: the informal knowledge that lives in your team's heads but never made it into the help centre. These require curation, so review, edit, and delete outdated entries regularly.
Notes are for time-sensitive context: outages, campaign-specific changes, temporary policy updates. Use them sparingly. Anything permanent belongs in the knowledge base.
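To make the curation work concrete, here is a minimal Python sketch of the two housekeeping jobs the layers imply: dropping notes once they expire and flagging facts that haven't been reviewed recently. The `KnowledgeItem` class, the layer names, the 90-day review window, and the sample entries are all hypothetical and made up for illustration. This is not how Gradient Labs stores knowledge.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class KnowledgeItem:
    layer: str                      # "knowledge_base", "fact", or "note" (hypothetical labels)
    text: str
    last_reviewed: date
    expires: Optional[date] = None  # only meaningful for notes


def active_items(items: List[KnowledgeItem], today: date) -> List[KnowledgeItem]:
    """Drop notes whose expiry date has passed; keep everything else."""
    return [
        i for i in items
        if not (i.layer == "note" and i.expires is not None and i.expires < today)
    ]


def facts_due_for_review(items: List[KnowledgeItem], today: date,
                         max_age_days: int = 90) -> List[KnowledgeItem]:
    """Return facts that have not been reviewed within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [i for i in items if i.layer == "fact" and i.last_reviewed < cutoff]


# Made-up sample data, for illustration only.
today = date(2025, 6, 1)
items = [
    KnowledgeItem("knowledge_base", "How to order a replacement card.", date(2025, 5, 1)),
    KnowledgeItem("fact", "Customers often ask about weekend delivery.", date(2024, 12, 1)),
    KnowledgeItem("note", "Card payments outage on 20 May.", date(2025, 5, 20),
                  expires=date(2025, 5, 21)),
]

print([i.layer for i in active_items(items, today)])
print([i.text for i in facts_due_for_review(items, today)])
```

Running a check like this on a schedule is one way to keep notes temporary and facts curated, whatever platform you use.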

The principle applies across platforms: prioritise your public knowledge base. It forces you to formalise the knowledge that benefits both the agent and your team. Even then, a knowledge base on its own only gets you so far: the practised judgement your best agents apply rarely makes it into the help centre, which is what the next step is for.
Step 2: Give your agent the ability to handle complexity
Resolution rate climbs as your agent covers more complex work. That's where the ROI and the better experience for your customers are.
Generic AI agents stall around 60% resolution on a financial services operation, and the reason is structural. They are built for discrete interactions: "where's my refund", "I can't log in", a password reset. A single question gets a single answer and the conversation closes. The work that makes up the rest doesn't fit that shape.
Operations like disputed transactions are long-running processes that unfold across turns, channels, and days. A dispute can take weeks from intake through investigation, chargeback, and customer follow-up. Resolving these cases is what's expensive for human teams, and it's where the cost savings and the better customer experience actually live. This is the work that needs vertical AI built for financial services: an agent that holds context across the whole case, applies policy at each step, and closes the loop long after the first message.
Breaking through that ceiling requires structured instructions for complex cases. Not just knowledge, and not merely logic, but nuanced reasoning with access to tools and systems. The depth of automation you can reach is directly proportional to how well you've codified these instructions.
In Gradient Labs, we call these procedures. They're natural language instructions that tell the agent exactly what to do, step by step, when a customer reaches out with a particular problem. Think of them as executable versions of your existing SOPs.
When a customer message comes in, the agent identifies the intent, evaluates every procedure linked to that intent, and works through the right one step by step. If a step requires calling a system (freezing a card, updating account status, creating a claim), it executes that action. If a step requires checking customer data, it pulls that information and decides what to do next.
For cases that fan out to multiple root causes, sub-procedures handle the branching. The parent procedure manages diagnosis and routing; sub-procedures handle execution for each path. This keeps the logic clean without sacrificing coverage.
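The parent and sub-procedure pattern can be sketched in a few lines of Python. Treat this as a simplified model under stated assumptions: real procedures are written in natural language and interpreted by the agent, whereas this sketch is deterministic, leaves out intent detection, and uses hypothetical names throughout (`Step`, `Procedure`, `run`, and a stubbed `freeze_card` tool).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    text: str
    tool: str = ""  # name of a tool to call; empty means no system action


@dataclass
class Procedure:
    name: str
    steps: List[Step] = field(default_factory=list)
    # Optional diagnosis: returns the key of the sub-procedure to run next.
    diagnose: Optional[Callable[[dict], str]] = None
    sub_procedures: Dict[str, "Procedure"] = field(default_factory=dict)


def run(procedure: Procedure, tools: Dict[str, Callable[[dict], str]],
        context: dict) -> List[str]:
    """Work through the steps in order, then hand over to the matching sub-procedure."""
    log = []
    for step in procedure.steps:
        if step.tool:
            log.append(f"{step.text}: {tools[step.tool](context)}")
        else:
            log.append(step.text)
    if procedure.diagnose is not None:
        branch = procedure.diagnose(context)
        log.extend(run(procedure.sub_procedures[branch], tools, context))
    return log


# Stubbed tool: a real one would call your card system.
tools = {"freeze_card": lambda ctx: f"card {ctx['card_id']} frozen"}

lost_card = Procedure("lost card", steps=[
    Step("Freeze the card", tool="freeze_card"),
    Step("Explain next steps to the customer"),
])
declined = Procedure("declined payment", steps=[
    Step("Explain why the payment was declined"),
])
card_problem = Procedure(
    "card problem",
    steps=[Step("Confirm the customer's identity")],
    diagnose=lambda ctx: "lost" if ctx["reported_lost"] else "declined",
    sub_procedures={"lost": lost_card, "declined": declined},
)

log = run(card_problem, tools, {"card_id": "c-42", "reported_lost": True})
print(log)
```

The parent holds only the shared steps and the routing decision, so adding a new root cause means adding one sub-procedure and leaving the existing paths untouched.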
Teams that reach 80 to 90% resolution treat procedures as living documents, refined continuously based on what actually happens in production.
Step 3: Understand your guardrails
In finance, a wrong answer isn't only a bad experience. It can be a compliance breach.
Before launch, understand what your agent is and isn't protected against. Every AI agent platform offers some baseline safety, but the guardrails that matter in financial services go well beyond generic content filtering.
There are two categories to think about:
Customer guardrails detect signals in what the customer is saying. A complaint needs to be logged. A mention of financial difficulty needs to trigger specialist handling under FCA Consumer Duty. A customer who mentions being evicted isn't just asking about a bank balance. These situations need escalation paths that bypass standard procedures entirely.
Agent guardrails inspect what the agent is about to say. Some responses are wrong even if they're technically accurate. Mentioning that an account is under review for suspicious activity could constitute tipping off under the Proceeds of Crime Act. Certain terminology is out of bounds. Giving financial advice, however well-grounded, may be prohibited entirely.
In Gradient Labs, we run 20+ financial-services-specific guardrails out of the box, covering prompt injection, financial advice detection, promises beyond agent capability, vulnerable customer treatment, sensitive information leakage, and more. Each prevents 1 to 2% of potential failures individually. Together, they prevent compounding compliance issues, with global regulatory coverage that runs from UK FCA rules to the EU AI Act. These are purpose-built for finance and not standard across every platform.

Know which guardrails your platform provides, which need configuring, and what happens when they fire.
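To show the shape of the two categories, here is a deliberately naive Python sketch: one check runs on the customer's message, the other on the agent's draft reply before it is sent. The phrase lists and function names are hypothetical, and keyword matching is only a stand-in. Production guardrails would rely on far more robust detection than string matching.

```python
# Hypothetical signal and phrase lists, for illustration only.
CUSTOMER_SIGNALS = {
    "complaint": ["complain", "complaint"],
    "financial_difficulty": ["can't afford", "evicted", "lost my job"],
}
BLOCKED_REPLY_PHRASES = {
    "tipping_off": ["under review for suspicious activity"],
    "financial_advice": ["you should invest"],
}


def _matches(text, rules):
    """Return the sorted names of every rule with at least one phrase in the text."""
    lowered = text.lower()
    return sorted(
        name for name, phrases in rules.items()
        if any(phrase in lowered for phrase in phrases)
    )


def check_customer_message(message):
    """Customer guardrail: which signals need escalation or logging?"""
    return _matches(message, CUSTOMER_SIGNALS)


def check_agent_reply(draft):
    """Agent guardrail: which rules would this draft reply break?"""
    return _matches(draft, BLOCKED_REPLY_PHRASES)


print(check_customer_message("I've been evicted and need to check my balance"))
print(check_agent_reply("Your account is under review for suspicious activity."))
print(check_agent_reply("Your card has been reset."))
```

The design point is the placement: the first check decides whether to leave the standard procedure, the second decides whether a drafted reply is allowed to go out at all.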
Step 4: Connect your agent to your systems
Answering questions is useful. Resolving them is what drives ROI.
There is a meaningful difference between an agent that explains how to reset a card and one that actually resets it. That gap is where resolution rates climb. Closing it requires tools: integrations that let the agent take action in your systems, not just reason about them.
Tools typically come in a few forms:
Built-in tools cover common operations out of the box: escalating to a human agent, sending a message, updating a conversation status.
Support platform integrations connect your agent to Intercom, Zendesk, Freshworks, and similar systems. The agent gains access to ticket data and the ability to take actions within your existing support workflow.
Custom API tools are the most critical for financial services. They connect your agent to your internal systems: CRMs, core banking platforms, case management tools, databases. This is what lets the agent check account status, retrieve transaction history, submit a claim, or flag a case for review. Custom tools require an accessible API endpoint and credentials, but once connected, they unlock a step change in end-to-end resolution.

Start with the integrations that unblock your highest-priority use cases and add tools as you expand the agent's scope.
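As a rough picture of what a custom tool looks like from the agent's side, the Python sketch below registers a named function and routes calls to it. Everything here is an assumption for illustration: the `tool` decorator, `call_tool`, and the in-memory `_FAKE_ACCOUNTS` dictionary standing in for a core banking API. A real tool would make an authenticated request to your own endpoint.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., dict]] = {}


def tool(name: str):
    """Register a function under a name the agent can call."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


# Stand-in for an internal system; a real tool would call your API instead.
_FAKE_ACCOUNTS = {"acc-1": {"status": "active"}}


@tool("get_account_status")
def get_account_status(account_id: str) -> dict:
    account = _FAKE_ACCOUNTS.get(account_id)
    if account is None:
        return {"ok": False, "error": "account not found"}
    return {"ok": True, "status": account["status"]}


def call_tool(name: str, **kwargs) -> dict:
    """Dispatch a tool call, returning a structured error for unknown tools."""
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    return TOOLS[name](**kwargs)


print(call_tool("get_account_status", account_id="acc-1"))
print(call_tool("freeze_card", card_id="c-42"))
```

Returning a structured error, not raising, gives the agent something to act on: an unknown tool or a missing account becomes a reason to hand off, not a crash.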
Step 5: Test before you go live
Don't launch blind.
Before a single real customer sees your agent, run every scenario through your testing environment. There are two modes worth using:
Full knowledge testing: simulate real customer queries to test how the agent reasons across your entire knowledge base. Look for wrong answers and trace them back to the source. Agent thinking and citations show you exactly what it referenced and why.
Procedure-specific testing: test each procedure in isolation. The agent only has access to that procedure, making it easy to validate the logic before it goes anywhere near live traffic.
Most teams start with simulated chat testing to cover the basics, then move to a small set of production conversations, and then run batch testing to find the edge cases that only surface at volume.
To get started, try our voice AI testing guide: it lists 40+ scenarios, from mumbling and interruptions to vulnerable customers, to work through before you launch your agent to production.
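Batch testing, in its simplest form, is a table of scenarios and expected outcomes run against the agent, with the mismatches collected for review. The Python sketch below assumes a hypothetical `stub_agent` function in place of a real agent and made-up scenarios. Your platform's testing environment will have its own interface.

```python
def run_scenarios(agent, scenarios):
    """Return (message, expected, actual) for every scenario that did not match."""
    failures = []
    for message, expected in scenarios:
        actual = agent(message)
        if actual != expected:
            failures.append((message, expected, actual))
    return failures


def stub_agent(message):
    # Placeholder behaviour: hands off anything that mentions a dispute.
    return "handoff" if "dispute" in message.lower() else "resolved"


# Made-up scenarios: (customer message, expected outcome).
scenarios = [
    ("How do I reset my PIN?", "resolved"),
    ("I want to dispute a transaction", "handoff"),
    ("My dispute from last week is still open", "resolved"),
]

failures = run_scenarios(stub_agent, scenarios)
for message, expected, actual in failures:
    print(f"{message!r}: expected {expected}, got {actual}")
```

Each failure is a pointer back to a source: a missing article, a conflicting note, or a procedure step that needs tightening.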
Step 6: Treat launch as the beginning
A resolution rate of 60% on day one is a solid start. It's not the destination.
You don't have to switch every customer over at once. Most teams roll out gradually: run the agent in shadow mode alongside the human team first, or route a capped share of live conversations to it, then watch resolution and handoff rates and ramp as the numbers hold. A gradual rollout turns go-live from one risky switch into a controlled ramp you can pause at any point.
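One common way to route a capped share of conversations is to hash a stable identifier into a bucket between 0 and 1 and compare it with the current share. The Python sketch below illustrates that idea under stated assumptions: `routed_to_agent` and the `conv-…` IDs are hypothetical, and this is not a description of how any particular platform implements rollout.

```python
import hashlib


def routed_to_agent(conversation_id: str, share: float) -> bool:
    """Deterministically route roughly `share` (0.0 to 1.0) of conversations to the agent."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < share


# Ramp example: count how many of 10,000 hypothetical conversations go to the agent.
ids = [f"conv-{i}" for i in range(10_000)]
for share in (0.0, 0.1, 0.5, 1.0):
    routed = sum(routed_to_agent(cid, share) for cid in ids)
    print(f"share={share:.0%}: {routed} conversations routed to the agent")
```

Because the bucket depends only on the ID, a conversation routed to the agent at 10% stays with the agent when you ramp to 50%, and setting the share to zero pauses the rollout immediately.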
Most teams treat launch as the finish line. The organisations that reach 80 to 90% resolution treat it as the starting line. That's where the real ROI lives. One large European digital bank runs Gradient Labs across half a million conversations at 98% QA, beating its human team.
The metric to watch post-launch is handoff rate: every time the agent hands off a conversation, it's raising its hand and saying it doesn't know. A rising handoff rate tells you something has gone stale: an outdated KB article, a broken procedure, a policy change that never made it to the agent.
The improvement loop is simple:
Review conversations where the agent handed off or gave a wrong answer
Diagnose the root cause (missing knowledge, conflicting sources, incomplete procedure)
Update the relevant source
Test the fix
Monitor for regression
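Monitoring for regression can start as a week-on-week comparison of the handoff rate. The Python sketch below is a minimal version: the `handed_off` field, the two-percentage-point tolerance, and the sample counts are assumptions for illustration, not recommended thresholds.

```python
def handoff_rate(conversations):
    """Share of conversations the agent handed off (0.0 when there are none)."""
    if not conversations:
        return 0.0
    return sum(1 for c in conversations if c["handed_off"]) / len(conversations)


def rate_is_rising(previous_rate, current_rate, tolerance=0.02):
    """Flag an increase larger than `tolerance` (as a fraction, so 0.02 is two points)."""
    return current_rate - previous_rate > tolerance


# Made-up data: 8 handoffs in 100 conversations last week, 15 in 100 this week.
last_week = [{"handed_off": i < 8} for i in range(100)]
this_week = [{"handed_off": i < 15} for i in range(100)]

previous, current = handoff_rate(last_week), handoff_rate(this_week)
print(f"last week {previous:.0%}, this week {current:.0%}")
if rate_is_rising(previous, current):
    print("Handoff rate is rising: review recent handoffs for stale sources.")
```

The tolerance keeps normal week-to-week noise from triggering a review. Tune it to your own volumes.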
Growing beyond the ceiling
Fixing what you have is one track. Expanding what you do is the other.
Moving from 60% resolution to the 80 to 90% range doesn't come from adding more KB articles. It comes from covering more ground on two axes: breadth, meaning the same use cases across more channels, and depth, meaning more procedures and tools for the complex, customer-specific cases. Every new procedure unlocks a class of queries the agent couldn't previously resolve. Every new tool integration lets it take an action it previously had to hand off. Procedures and tools compound: a new procedure paired with a new tool integration can lift your resolution rate more than either would alone.
Tip: Channels are one of the fastest ways to expand coverage without starting from scratch. Once your agent is performing well on chat, launching on email or voice uses the same knowledge and procedures. You're tuning channel nuances, not rebuilding from the ground up.
The number to keep in mind
Only about 5% of enterprise AI pilots reach the impact their sponsors hoped for. MIT put the rest down to a learning gap rather than model quality.
The same discipline scales beyond a single agent. Once one process runs in production, the next one runs on the same data, the same guardrails, and the same audit trail. A neobank that starts with a Disputes Agent adds a Lending Agent when its lending operation needs it, then frontline support on top.
Gradient Labs is the AI-native customer operations platform for financial services: a suite of specialist agents that each take a full lifecycle of manual work and run it end to end, across frontline and back-office, with frontline support on text and voice included. If you'd like to build your agent on our platform, get in touch.
