Capabilities

How we ship production AI — not how we talk about it.

Four things separate a working AI system from a slide. We build all four ourselves: the agent + MCP layer, the integration surface, the eval and observability harness, and the delivery process that ties them together. This is the work, in the order we do it.

Stack Python · Anthropic SDK · MCP
Surface Agents, tools, retrieval, custom servers
Discipline Evals before launch, traces in prod
Engagement Diagnose · Prototype · Harden · Hand off
What we build

The four layers we own end-to-end.

01 / Architecture

Agent + MCP server layer

We build the agent and the MCP servers that give it usable hands. Tools, retrieval, memory, and authorization live in a layer we own — not stitched together as one-off function calls. That separation is what makes the system testable, upgradeable, and survivable past the first model version.

  • Custom MCP servers per integration. Each system of record gets a typed server with versioned tool schemas, structured error handling, and explicit authorization scopes — not raw API wrappers.
  • Agent loop in Python. Anthropic SDK, deterministic state machine around the model, structured tool use, retry and fallback policies that survive model swaps.
  • Retrieval as a first-class tool. We treat retrieval as an integration, not a search bar: source-of-truth indexing, freshness policy, and provenance returned with every answer.
Operator user / partner UX Fulcrum Agent state machine · policy structured tool use retries · fallbacks MCP SERVER LAYER CRM server Workflow / ERP server Knowledge / retrieval Custom partner API EVAL + TRACE LAYER golden sets · regression · structured logs · drift

A typical Fulcrum deployment. Every arrow is a tool call we own.

02 / Integration

Real systems of record, not scraping

Most AI proofs of concept die at the integration boundary. We don't build against synthetic data, sandbox accounts, or browser automation that breaks at the first DOM change. Our MCP servers connect to authoritative systems through the same auth, identity, and permission boundaries the partner already trusts.

  • Direct API integration first. CRMs, ERPs, ticketing systems, data warehouses, file stores. Where APIs don't exist, we build adapters — not scrapers.
  • Identity and permissions inherited. The agent operates inside the partner's existing auth/RBAC model. No service-account shortcuts that turn into compliance issues later.
  • Event-driven where it matters. For workflows that need to react, not just respond: we stand up event consumers so the agent acts when the partner system changes, not only when a user asks.
  • Human-in-the-loop by design. Approval gates, escalation paths, and audit trails for anything the agent shouldn't decide on its own.
03 / Evaluation

Eval harness and production observability

An AI system that nobody is measuring isn't a system — it's a liability with good marketing. Every Fulcrum engagement includes an eval harness scoped to the partner's operating metric, and an observability layer that survives the demo and into year two.

  • Pre-launch: golden sets and regression tests. Scenario coverage tied to the partner's real intents, run on every prompt and model change, gated in CI before merge.
  • Structured production traces. Every tool call, every model turn, every retrieval hit — stored with prompt, response, and outcome so the team can debug a specific user moment, not vague aggregates.
  • Drift detection. We watch the things model providers don't: latency distribution shift, refusal rate, tool-call success rate, retrieval freshness. We page on them, not just on uptime.
  • One operating metric per engagement. Every engagement opens with the number we expect to move and how the eval will tell us when we've moved it.
04 / Delivery

The engagement model

Partners hesitate on AI partnerships for a real reason: most of them are open-ended. Ours isn't. Every Fulcrum build partnership runs through four phases with explicit deliverables, durations, and exit conditions. The partner knows what they're getting and when.

Phase 0 · 1–2 wk

Diagnostic

Scope the operating metric, map the integration surface, identify the failure modes. Output: a build plan with a measurable target.

Phase 1 · 3–4 wk

Working prototype

End-to-end thin slice running against real data. One workflow, one integration, one eval. No production exposure yet.

Phase 2 · 4–8 wk

Production hardening

Eval coverage broadened, observability wired in, integrations widened, auth/RBAC inherited. Goes live behind a flag.

Phase 3 · ongoing

Operate or hand off

We either run the system on a retainer or transfer ownership with documentation, runbooks, and an eval suite the team can extend.

  • Weekly written check-in. What shipped, what's stuck, what the eval shows. Same format every time.
  • One technical owner. A Fulcrum engineer owns the build end-to-end. Partners don't get rotated through a bench.
  • No deliverables that can't be run. If we can't show you the system or the eval result, we haven't delivered it.
Boundaries

What we explicitly don't take on.

  • Foundation model training or fine-tuning as the product. We use the best available model. The defensibility is in the system around it.
  • Scraper-based or RPA-based integration. They look like progress for six weeks and become an ops tax for six years.
  • Open-ended retainers without a target. Every engagement has the operating metric we expect to move and how we'll know.
  • Compliance-heavy verticals without the right partner. We bring counsel and security expertise in before we touch regulated data — we don't fake it.

Building an AI capability inside your product?

The fastest first conversation is the one where you bring the operating problem and we bring the architecture sketch.