How Xceptor Moved AI Out of the Pilot Phase and Into Every Stage of Delivery

Most engineering organisations have run an AI pilot by now. Individual developers are faster — they'll tell you that. But zoom out to the team level and the delivery metrics are flat. Stories still queue, test cycles run the same length, and release frequency sits where it was a year ago. The speed gets absorbed by the same bottlenecks that existed before AI showed up, because most organisations went tools-first without changing how work flows between people.

At CTO Craft, we shared how Xceptor — a data automation platform serving financial services — broke that pattern in practice, across a 49-person engineering team, over six months of tracked results.

It started with mapping where things break

Before jumping into full PDLC tool adoption, we ran a structured discovery with Xceptor's engineering leadership. We walked through how teams actually operate from ticket to deployment, identified stalled handoffs, traced where rework originates, and found the repetitive tasks eating time. We captured baselines — story cycle time, test authoring hours, defect escape rates, resolution timelines — because without them, no improvement can be proved.

That discovery shaped the entire sequence. It told Xceptor where AI would compound results (process-level automation across phases) versus where it would only add individual speed (IDE-level assist). Operating model first, tooling second.

Three stages, not three months

Xceptor didn't try to do everything at once. The progression was deliberate: start with AI as a tool people use inside their existing workflow, then automate repeatable steps where AI handles execution and humans approve the output, and finally move toward agents that plan and act across the full lifecycle while humans review at every gate.

Stage one — augmentation — meant GitHub Copilot in the IDE, AI-assisted UX prototyping, and research synthesis. Useful and easy to adopt, but limited to individual productivity.

Stage two — automation — targeted process-level work: generating requirements from meeting transcripts, writing test scripts from acceptance criteria, detecting and routing exceptions in production. Results compounded here because the output fed directly into downstream phases instead of staying isolated.

Stage three — agents — is where Xceptor is now. A central orchestrator receives a feature request and coordinates sub-agents for product, development, QA, and operations in parallel. Nothing ships without human approval. The goal: a full connector delivered from requirement to production-ready in under a day.
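As a rough sketch of that shape, the coordination loop might look something like the TypeScript below. The agent interface, the requestHumanApproval helper, and the phase names are illustrative assumptions, not Xceptor's actual framework.

```typescript
// Illustrative orchestrator sketch: types, agent names and helpers are assumptions.
type Artifact = { phase: string; content: string };

interface SubAgent {
  name: string; // e.g. "product", "development", "qa", "operations"
  run(featureRequest: string): Promise<Artifact>;
}

// Placeholder gate: in practice this would raise a review task for the relevant role.
async function requestHumanApproval(artifact: Artifact): Promise<boolean> {
  console.log(`Awaiting human review of the ${artifact.phase} output...`);
  return true;
}

async function orchestrate(featureRequest: string, agents: SubAgent[]): Promise<Artifact[]> {
  // Fan the request out to the sub-agents in parallel.
  const drafts = await Promise.all(agents.map((agent) => agent.run(featureRequest)));

  // Nothing ships without a human approving each output at its gate.
  const approved: Artifact[] = [];
  for (const draft of drafts) {
    if (await requestHumanApproval(draft)) approved.push(draft);
  }
  return approved;
}
```

The point of the sketch is the shape, not the code: parallel fan-out to role-specific agents, with a human approval gate sitting between every draft and anything that ships.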

Where the numbers landed

Five phases, each with measured results against the pre-AI baselines captured during discovery:

  • Design — 83% faster prototyping. That translated to 32 days saved per designer per year. The original target was 50–60% improvement. They overshot it.
  • Requirements — Story creation went from 3 hours to under 1 hour. Rework dropped from 30% to under 10%. The time savings matter, but the rework reduction matters more — less churn downstream.
  • Build — 80% of the engineering team (39 of 49 engineers) adopted Copilot, generating 18,064 AI-assisted events in six months. Adoption was measured by actual usage, not licence installs.
  • Test — 75% reduction in scripting time. One QA engineer now produces what previously took four. Playwright MCP replaced the old Selenium-based approach after Selenium proved too brittle for AI-generated scripts.
  • Production — 170+ SaaS instances monitored by an AI agent that detects exceptions, diagnoses root cause, and routes to the right team. Resolution time dropped from days to hours, with zero missed P1 incidents.

Over half of all AI usage across Xceptor sits in engineering and product. This is production work running inside the core delivery org.

What broke

Three problems worth naming:

Model quality was inconsistent. GPT-4 generated brittle test scripts. Copilot struggled with Xceptor's custom Selenium setup. The fix: switch test generation to Claude Opus and replace Selenium with Playwright MCP, which had better AI-native documentation and more predictable outputs. That got them to the 75% scripting reduction.
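For a sense of what the generated output looks like, here is a hedged example of the kind of Playwright test a model could produce from a single acceptance criterion. The URL, selectors and criterion are invented for illustration; they are not Xceptor's connector UI.

```typescript
import { test, expect } from '@playwright/test';

// Acceptance criterion (invented example): "Given a valid source file, when the user
// uploads it to the connector, the mapped records appear in the results grid."
test('valid source file upload shows mapped records', async ({ page }) => {
  await page.goto('https://example.internal/connector'); // placeholder URL

  await page.setInputFiles('input[type="file"]', 'fixtures/valid-source.csv');
  await page.getByRole('button', { name: 'Upload' }).click();

  // Playwright's auto-waiting assertions avoid the hand-rolled waits that tend to
  // make AI-generated Selenium scripts brittle.
  await expect(page.getByRole('grid', { name: 'Results' })).toBeVisible();
  await expect(page.getByRole('row').first()).toBeVisible();
});
```

The design point is less the individual test than the interface: locator-based selectors and built-in waiting give a model fewer ways to generate flaky code than explicit waits and custom element lookups.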

The first production agent was hard. A monitoring agent making real decisions across 170+ SaaS instances had to get false-positive rates and routing reliability right before anyone would trust it. Xceptor deployed it in read-only mode first, set confidence thresholds before allowing routing actions, required human sign-off on novel exception types, and rolled out incrementally. The agent now handles exception triage that previously took days, and hasn't missed a P1.
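In pseudocode terms, that gating pattern reduces to something like the sketch below; the threshold value, the type registry, and the field names are assumptions made for illustration rather than details of Xceptor's agent.

```typescript
// Illustrative triage gate for a monitoring agent: thresholds and types are assumptions.
interface ExceptionDiagnosis {
  exceptionType: string;   // e.g. "mapping-timeout" (invented example)
  confidence: number;      // agent's confidence in its root-cause diagnosis, 0..1
  suggestedTeam: string;   // team the agent would route to
}

const ROUTING_CONFIDENCE_THRESHOLD = 0.9;      // assumed value
const knownExceptionTypes = new Set<string>(); // grows as humans sign off on new types

type TriageAction =
  | { kind: 'route'; team: string }
  | { kind: 'human-review'; reason: string };

function triage(d: ExceptionDiagnosis, readOnly: boolean): TriageAction {
  // Read-only rollout phase: the agent observes and suggests, humans act.
  if (readOnly) return { kind: 'human-review', reason: 'read-only mode' };

  // Novel exception types always require human sign-off before auto-routing.
  if (!knownExceptionTypes.has(d.exceptionType)) {
    return { kind: 'human-review', reason: 'novel exception type' };
  }

  // Only route automatically when the diagnosis clears the confidence bar.
  if (d.confidence >= ROUTING_CONFIDENCE_THRESHOLD) {
    return { kind: 'route', team: d.suggestedTeam };
  }
  return { kind: 'human-review', reason: 'low confidence' };
}
```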

Adoption didn't happen by mandate. Engineers resisted AI tooling when it didn't match how they actually worked. Mandating Copilot without enablement produced install counts, not usage. What worked: we helped Xceptor build a champions programme, role-specific playbooks, and a KPI framework that measured delivery outcomes instead of tool activity. The 80% adoption rate came from pull, not push.

How roles change when this works

The shift is in what people spend their time on.

Product owners move from spending 3+ hours writing stories manually to reviewing and steering agent-generated output. Developers shift from writing code and docs to reviewing PRs and making architecture decisions. For QA, the change is starker: manual scripting that ran four times slower gives way to approving AI-generated test suites and catching edge cases. And ops teams that spent their days triaging hundreds of exceptions now review AI-generated insights and approve deployments.

Fewer people covering more ground.

What's next: the 6-week MVP

Xceptor is running a 6-week MVP — with our team embedded alongside theirs — to prove the full agent-driven lifecycle in production. The target: a complete connector delivered from requirement to production-ready in under one day, with artefacts and human reviews at every gate.

Success criteria: all artefacts present (stories, architecture docs, code, tests, release notes), every output reviewed and approved by a human, the agent framework reusable enough for any team to pick up, and training playbooks that don't require hand-holding. They're measuring time per PDLC phase against baseline, defect severity found in review rings, and team satisfaction via developer experience pulse surveys.

The larger direction: a single platform where POs, developers, QA, and ops each work with agents built for their role, all governed by policy guardrails with a full audit trail and human override at every decision point.

The takeaway for engineering leaders

Xceptor's path from AI experiments to an AI-embedded delivery lifecycle took roughly a year. The pattern we built together: understand how work flows before picking tools, baseline everything, and prove each stage before moving to the next.

The uncomfortable part: this required changing tools mid-stream (Selenium to Playwright, GPT-4 to Claude Opus for specific tasks), accepting that adoption is a people problem before it's a tech problem, and measuring production impact instead of pilot activity.

None of this required a massive upfront investment. It required a sequence — and the discipline to stick with it.
