Capabilities

How we ship production AI — not how we talk about it.

Six things separate a working AI system from a slide. We build all six ourselves: the agent + MCP layer, the integration surface, the eval and observability harness, the governance layer that answers "what if it goes wrong," the optimization layer that keeps the token bill honest, and the delivery process that ties them together. This is the work, in the order we do it.

Stack: Python · Anthropic SDK · MCP

Surface: Agents, tools, retrieval, custom servers

Discipline: Evals before launch, traces in prod

Governance: IADTHAM · kill switch under 60s

Optimization: Model routing · caching · usage budgets

Engagement: Diagnose · Prototype · Harden · Hand off


Where we start

AI is the top of a four-layer stack — not the whole stack.

An agent does exactly what it's built to do. If the data underneath it is inconsistent, siloed, or unmodeled, it will confidently produce garbage from that data — quickly, and at scale. Most AI vendors start at the top of this stack. We start at the bottom.

Every layer depends on the ones beneath it. We assess all four before we scope the AI layer — see the Phase 0 diagnostic under Delivery.

The layer that fails silently is the Data Layer, the one everyone assumes is already in good shape. It's rarely a single clean warehouse. It's a CRM with duplicate accounts, an ERP mid-migration, and a "source of truth" spreadsheet nobody admits is load-bearing. An agent built on top of that will answer fast and answer wrong, and it'll look like a model problem when it's a data problem. That's why our Phase 0 diagnostic assesses the Enterprise Systems, Integration, and Data layers before we write a line of agent code, not after the pilot underperforms.

What we build

The six layers we own end-to-end.

01 / Architecture

Agent + MCP server layer

We build the agent and the MCP servers that give it usable hands. Tools, retrieval, memory, and authorization live in a layer we own — not stitched together as one-off function calls. That separation is what makes the system testable, upgradeable, and survivable past the first model version.

Custom MCP servers per integration. Each system of record gets a typed server with versioned tool schemas, structured error handling, and explicit authorization scopes — not raw API wrappers.
Agent loop in Python. Anthropic SDK, deterministic state machine around the model, structured tool use, retry and fallback policies that survive model swaps.
Retrieval as a first-class tool. We treat retrieval as an integration, not a search bar: source-of-truth indexing, freshness policy, and provenance returned with every answer.

A typical Fulcrum deployment. Every arrow is a tool call we own.
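As a rough illustration of what "typed server with versioned tool schemas, structured error handling, and explicit authorization scopes" can mean in code, here is a minimal sketch using only the Python standard library. It is not the MCP SDK and not Fulcrum's implementation: `ToolSpec`, `ToolRegistry`, `ToolError`, the `lookup_account` tool, the `crm:read` scope, and the in-memory `ACCOUNTS` data are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolError:
    """Structured error handed back to the agent instead of a raw exception."""
    code: str
    message: str
    retryable: bool = False


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool definition: name, schema version, scope, input types."""
    name: str
    version: str                   # bumped whenever the input schema changes
    required_scope: str            # scope the caller must hold
    input_types: dict[str, type]   # field name -> expected Python type
    handler: Callable[[dict[str, Any]], Any]


def _fail(code: str, message: str, retryable: bool = False) -> dict[str, Any]:
    return {"ok": False, "error": ToolError(code, message, retryable)}


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[tuple[str, str], ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[(spec.name, spec.version)] = spec

    def call(self, name: str, version: str, args: dict[str, Any],
             scopes: set[str]) -> dict[str, Any]:
        spec = self._tools.get((name, version))
        if spec is None:
            return _fail("unknown_tool", f"{name}@{version} is not registered")
        if spec.required_scope not in scopes:
            return _fail("forbidden", f"missing scope {spec.required_scope}")
        for key, expected in spec.input_types.items():
            if not isinstance(args.get(key), expected):
                return _fail("invalid_input", f"{key} must be {expected.__name__}")
        try:
            return {"ok": True, "result": spec.handler(args)}
        except TimeoutError as exc:
            # Upstream timeouts are marked retryable so the agent loop can decide.
            return _fail("upstream_timeout", str(exc), retryable=True)


# Illustrative in-memory stand-in for a system of record (assumed data).
ACCOUNTS = {"acct_1": {"name": "Example Co", "tier": "gold"}}

registry = ToolRegistry()
registry.register(ToolSpec(
    name="lookup_account", version="1.0", required_scope="crm:read",
    input_types={"account_id": str},
    handler=lambda args: ACCOUNTS.get(args["account_id"]),
))
```

Calling `registry.call("lookup_account", "1.0", {"account_id": "acct_1"}, {"crm:read"})` returns an `ok` result, while a missing scope or a wrongly typed argument comes back as a `ToolError` the agent loop can branch on rather than as an exception.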

02 / Integration

Real systems of record, not scraping

Most AI proofs of concept die at the integration boundary. We don't build against synthetic data, sandbox accounts, or browser automation that breaks at the first DOM change. Our MCP servers connect to authoritative systems through the same auth, identity, and permission boundaries the partner already trusts.

Direct API integration first. CRMs, ERPs, ticketing systems, data warehouses, file stores. Where APIs don't exist, we build adapters — not scrapers.
Identity and permissions inherited. The agent operates inside the partner's existing auth/RBAC model. No service-account shortcuts that turn into compliance issues later.
Event-driven where it matters. For workflows that need to react, not just respond, we stand up event consumers so the agent acts when the partner system changes, not only when a user asks.
Human-in-the-loop by design. Approval gates, escalation paths, and audit trails for anything the agent shouldn't decide on its own.
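The combination of inherited permissions and approval gates described above could be sketched roughly as follows. This is an assumption-laden illustration, not the partner's actual RBAC model: the roles, permission strings, the `APPROVAL_THRESHOLDS` table, and the `decide` function are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class Action:
    user_role: str       # role of the human the agent is acting for
    permission: str      # e.g. "ticket:read"
    amount: float = 0.0  # monetary size of the action, if any


# Assumed role table; in practice this would be read from the partner's
# identity provider rather than hard-coded.
ROLE_PERMISSIONS = {
    "support_agent": {"ticket:read", "ticket:update", "refund:create"},
    "viewer": {"ticket:read"},
}

# Assumed policy: actions above these amounts go to a human approver.
APPROVAL_THRESHOLDS = {"refund:create": 100.0}


def decide(action: Action) -> Decision:
    """The agent may do only what the user's role allows, and large actions wait for a human."""
    allowed = ROLE_PERMISSIONS.get(action.user_role, set())
    if action.permission not in allowed:
        return Decision.DENY
    threshold = APPROVAL_THRESHOLDS.get(action.permission)
    if threshold is not None and action.amount > threshold:
        return Decision.NEEDS_APPROVAL
    return Decision.ALLOW
```

The design point is that the check runs before any tool call is dispatched, so a denied or escalated action never reaches the system of record.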

03 / Evaluation

Eval harness and production observability

An AI system that nobody is measuring isn't a system; it's a liability with good marketing. Every Fulcrum engagement includes an eval harness scoped to the partner's operating metric, and an observability layer that outlives the demo and keeps working into year two.

Pre-launch: golden sets and regression tests. Scenario coverage tied to the partner's real intents, run on every prompt and model change, gated in CI before merge.
Structured production traces. Every tool call, every model turn, every retrieval hit — stored with prompt, response, and outcome so the team can debug a specific user moment, not vague aggregates.
Drift detection. We watch the things model providers don't: latency distribution shift, refusal rate, tool-call success rate, retrieval freshness. We page on them, not just on uptime.
One operating metric per engagement. Every engagement opens with the number we expect to move and how the eval will tell us when we've moved it.
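A golden-set regression gate of the kind described above might look like the sketch below. It assumes the agent can be called as a plain function from prompt to text and that a substring check is an adequate grader, which is rarely enough for real scenarios. `GoldenCase`, `run_golden_set`, the 0.9 threshold, and `stub_agent` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    must_contain: str  # simplistic grader: expected substring in the answer


def run_golden_set(agent: Callable[[str], str], cases: list[GoldenCase],
                   min_pass_rate: float = 0.9) -> dict:
    """Run every case and report whether the pass rate clears the gate."""
    if not cases:
        raise ValueError("golden set is empty")
    failures = [
        case.prompt for case in cases
        if case.must_contain.lower() not in agent(case.prompt).lower()
    ]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return {
        "pass_rate": pass_rate,
        "failures": failures,
        "passed": pass_rate >= min_pass_rate,
    }


def stub_agent(prompt: str) -> str:
    """Stand-in for the real agent so the sketch runs offline."""
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Who approves large refunds?": "A team lead approves them.",
    }
    return answers.get(prompt, "I don't know.")


GOLDEN_SET = [
    GoldenCase("What is the refund window?", "30 days"),
    GoldenCase("Who approves large refunds?", "team lead"),
    GoldenCase("What is the SLA for P1 tickets?", "1 hour"),
]

report = run_golden_set(stub_agent, GOLDEN_SET)
```

In a CI job, the build would fail whenever `report["passed"]` is false, and `report["failures"]` lists the prompts to investigate.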

04 / Governance

Governance: the answer to "what if it goes wrong"

Every AI committee eventually asks two questions that matter more than any benchmark: can you show me exactly what the agent did, and can you stop it before it does something worse. We build governance into the agent layer from day one — it's not a compliance doc bolted on after a committee asks for it. We structure this around seven controls we call IADTHAM.

Identity. Every agent runs under its own identity — a service principal, not a shared API key. You always know which agent, which session, which run.
Authorization. Agents inherit the partner's existing RBAC and permission model. An agent can't read or write anything a human in its role couldn't.
Data control. Read-only by default. Write actions are explicitly scoped per tool, per system — no blanket database access.
Tool control. Every tool an agent can call is enumerated and versioned. Adding a new capability is a reviewed change, not an emergent behavior.
Human in the loop. Approval gates and escalation paths on anything the agent shouldn't decide alone, defined before launch — not discovered after an incident.
Auditability. Structured, timestamped traces of every model turn and tool call. Recreating exactly what an agent did last Thursday is a query, not an investigation.
Monitoring. Drift detection, refusal-rate tracking, and a kill switch that reaches every level — one asset, one channel, or the whole system — activated in under 60 seconds, without vendor intervention.
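A kill switch that "reaches every level" can be modeled as a small set of disabled scopes that every agent checks before acting, with each activation written to an audit trail. The sketch below is one possible shape under that assumption; `KillSwitch`, its level names, and the audit record fields are hypothetical. It does not show how the flag is distributed to running processes, which is what actually determines whether a 60-second target is met.

```python
import time
from dataclasses import dataclass, field

LEVELS = ("asset", "channel", "system")


@dataclass
class KillSwitch:
    disabled: set[tuple[str, str]] = field(default_factory=set)
    audit_log: list[dict] = field(default_factory=list)

    def trip(self, level: str, target: str, actor: str) -> None:
        """Disable one asset, one channel, or the whole system, and record who did it."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        key = (level, "*" if level == "system" else target)
        self.disabled.add(key)
        self.audit_log.append({
            "ts": time.time(),
            "event": "kill_switch_tripped",
            "level": level,
            "target": key[1],
            "actor": actor,
        })

    def is_allowed(self, asset: str, channel: str) -> bool:
        """Agents call this before every action; any matching scope blocks it."""
        blocked = {("system", "*"), ("channel", channel), ("asset", asset)}
        return not (blocked & self.disabled)
```

For example, after `switch.trip("channel", "email", actor="oncall")`, the call `switch.is_allowed("bot-1", "email")` returns `False` while `switch.is_allowed("bot-1", "chat")` still returns `True`.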

See how IADTHAM works in practice →

05 / Optimization

Optimization: keeping the token bill honest

The question we hear most from finance and ops leaders isn't "does the model work" — it's "what happens to our bill when usage scales 10x." We build cost and performance discipline into the agent layer itself, not as a dashboard you check after the invoice arrives.

Model routing. Not every call needs the most expensive model. We route easy, high-volume calls to smaller models and reserve frontier models for the calls that actually need the reasoning depth — often the single biggest cost lever in a deployment.
Context and prompt efficiency. Bloated prompts and redundant context are the quiet tax on every call. We trim system prompts, cache stable context, and avoid re-sending what the model already has — before touching the model choice.
Caching and reuse. Repeated or near-duplicate calls get served from cache instead of re-run against the model. For high-volume, template-driven workflows this alone can cut spend substantially.
Usage budgets and alerts. Per-agent, per-tenant usage tracking with budgets and overage alerts, so a runaway loop shows up as a page — not a surprise on next month's invoice.
Latency as a cost lever. Streaming responses, parallel tool calls, and async where the workflow allows it. The fastest system is usually also the cheapest one, since cost scales with tokens-in-flight and retries.
Tuning against the eval harness. Optimization isn't just cost — prompt and retrieval tuning run against the same eval suite from Evaluation, closing the loop between "cheaper" and "still accurate enough to trust."
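Three of the levers above (model routing, caching, and usage budgets) can be combined in one thin wrapper around the model call. The sketch below is illustrative only: `CostGuard`, `BudgetExceeded`, the `small-model` and `large-model` names, and the `call_model` callable are assumptions, and the cache handles exact repeats only, not near-duplicates.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when the token budget is spent; a real system would also page."""


@dataclass
class CostGuard:
    # call_model(model, prompt) -> (text, tokens_used); supplied by the caller.
    call_model: Callable[[str, str], tuple[str, int]]
    token_budget: int
    small_model: str = "small-model"  # placeholder names, not real model IDs
    large_model: str = "large-model"
    tokens_used: int = 0
    _cache: dict[str, str] = field(default_factory=dict)

    def route(self, needs_reasoning: bool) -> str:
        """Send only calls flagged as hard to the expensive model."""
        return self.large_model if needs_reasoning else self.small_model

    def complete(self, prompt: str, needs_reasoning: bool = False) -> str:
        model = self.route(needs_reasoning)
        key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
        if key in self._cache:
            return self._cache[key]  # exact repeat: no model call, no tokens
        if self.tokens_used >= self.token_budget:
            raise BudgetExceeded(f"budget of {self.token_budget} tokens spent")
        text, tokens = self.call_model(model, prompt)
        self.tokens_used += tokens
        self._cache[key] = text
        return text
```

In practice the `needs_reasoning` flag would come from a classifier or from the workflow definition, and the budget would be tracked per agent and per tenant rather than per object.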

06 / Delivery

The engagement model

Partners hesitate on AI partnerships for a real reason: most of them are open-ended. Ours aren't. Every Fulcrum build partnership runs through four phases with explicit deliverables, durations, and exit conditions. The partner knows what they're getting and when.

Phase 0 · 1–2 wk

Diagnostic

Assess all four layers of the stack — Enterprise Systems, Integration, Data, and AI — not just the last one. Scope the operating metric, map the integration surface, identify the failure modes. Output: a build plan with a measurable target.

Phase 1 · 3–4 wk

Working prototype

End-to-end thin slice running against real data. One workflow, one integration, one eval. No production exposure yet.

Phase 2 · 4–8 wk

Production hardening

Eval coverage broadened, observability wired in, integrations widened, auth/RBAC inherited. Goes live behind a flag.

Phase 3 · ongoing

Operate or hand off

We either run the system on a retainer or transfer ownership with documentation, runbooks, and an eval suite the team can extend.

Weekly written check-in. What shipped, what's stuck, what the eval shows. Same format every time.
One technical owner. A Fulcrum engineer owns the build end-to-end. Partners don't get rotated through a bench.
No deliverables that can't be run. If we can't show you the system or the eval result, we haven't delivered it.

Boundaries

What we explicitly don't take on.

Foundation model training or fine-tuning as the product. We use the best available model. The defensibility is in the system around it.
Scraper-based or RPA-based integration. They look like progress for six weeks and become an ops tax for six years.
Open-ended retainers without a target. Every engagement names the operating metric we expect to move and how we'll know it has moved.
Compliance-heavy verticals without the right partner. We bring counsel and security expertise in before we touch regulated data — we don't fake it.

Building an AI capability inside your product?

The fastest first conversation is the one where you bring the operating problem and we bring the architecture sketch.

Book a working session

See representative engagements →