Cold outbound is an experiment platform

When teams say they “test outbound,” they usually mean they swap two subject lines, send a thousand emails, and pick the one with more opens. That’s useful. It’s not the thing that moves the floor.

Cold outbound is hundreds of decisions stacked together. Which signal counts as buying intent. Which persona this prospect matches. Whether we compress the research digest with one model or another. Which voice file the writer loads. Whether email 1 opens with the public-post anchor or the hiring-history one. Whether the sequence is three emails or four. Whether the rules file has the no-bridge rule turned on. A subject-line A/B test holds 99% of those decisions constant and varies one. The signal is real, but it’s narrow.

Real outbound experimentation tests the pipeline, not the sentence. To do that without burning months of deliverability and reputation, you need platform primitives most outbound stacks don’t have.

Four primitives

After about a year of building these for ourselves, I’d argue you need four:

  1. Workflow-level experiments that vary the architecture of the pipeline (insert a node, swap a model at a specific step, route on a signal), not just the contents of a column.
  2. A dry-run lab paired with a side-by-side compare view, so you can run sample × variant matrices without sending and read every intermediate node’s output across variants in one grid.
  3. A typed context graph as the experiment surface, so experiments are edits to structured entities (personas, proofs, objections, plays) that propagate through generation, not just slot values in a template.
  4. A closed loop that ties cohort assignment, intermediate state, and outcome events together across the seam between generation and sending, so the experiment doesn’t end when the email leaves your system.

Each one earned its place by being missing. The first three buy you a fast inner loop; the fourth is what makes the inner loop matter in production.

Primitive 1: cohort splits in the workflow itself

Our first experiment system lived inside the context-composition node. You’d declare a list of cohorts, weights, and which context files each cohort loaded. The PRNG used a per-contact seed so a lead always ended up in the same cohort on re-runs. It worked for the experiment it was built for: testing whether loading a particular insight file made the brief better.

It broke the moment someone wanted to test something structural. “Does inserting a compress-the-digest node before the writer help?” was an experiment our inline cohort schema couldn’t express. The schema let you swap which files a single node loaded; it couldn’t add or remove a node.

We pulled experiments up to the workflow graph itself, as a new node type. The splitter sits in the graph like any other node, with its outputs as cohort branches. Each branch can do anything the workflow can do: add nodes, swap models, route through different paths, converge back. A merge node at the end re-exposes the outputs of whichever branch the lead actually took, so downstream references resolve.

{
  "id": "cohortSplit",
  "type": "cohort:split",
  "parameters": {
    "experimentId": "compress-digest-2026-05",
    "cohorts": [
      { "id": "control", "weight": 0.5 },
      { "id": "compressed", "weight": 0.5 }
    ]
  }
}

This sounds like a small change but it reorganizes how you think about experiments. Once the splitter is a node, any structural change is a candidate variant. Add a critic node on one branch and not the other. Use Claude Sonnet on one branch and DeepSeek on the other. Route based on a signal, so one cohort goes through the discovery sequence while the other gets the default. Cross-product two splitters and you get a 4-cell matrix without writing any new code.

The other thing the graph-level primitive gets you is determinism. The same lead, given the same cohort list and weights, always lands in the same cohort. Re-running the workflow doesn’t shuffle assignments. This sounds basic until you’ve tried to interpret a noisy experiment where the cohort assignments aren’t stable across re-runs.
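
One way to get that stability is to derive the draw from stable identifiers instead of from run-time randomness. The sketch below is an illustration, not our implementation: it swaps the per-contact PRNG seed for a SHA-256 hash of the experiment ID and a lead ID, and the function name assign_cohort and the lead ID format are assumptions made for the example.

```python
import hashlib


def assign_cohort(experiment_id: str, lead_id: str, cohorts: list[dict]) -> str:
    """Pick a cohort for a lead using only stable identifiers (hypothetical helper)."""
    # Hash the (experiment, lead) pair so the draw never depends on run order or time.
    digest = hashlib.sha256(f"{experiment_id}:{lead_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to a roughly uniform draw between 0 and 1.
    draw = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(c["weight"] for c in cohorts)
    cumulative = 0.0
    for cohort in cohorts:
        cumulative += cohort["weight"] / total
        if draw < cumulative:
            return cohort["id"]
    # Fallback for floating-point rounding at the top of the range.
    return cohorts[-1]["id"]


cohorts = [
    {"id": "control", "weight": 0.5},
    {"id": "compressed", "weight": 0.5},
]
first = assign_cohort("compress-digest-2026-05", "lead-42", cohorts)
again = assign_cohort("compress-digest-2026-05", "lead-42", cohorts)
```

Here first and again are always the same cohort, because nothing in the draw changes between runs. Note that in this sketch, changing the cohort list or weights can move leads between cohorts, which is why the guarantee is stated for a fixed cohort list and weights.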

Primitive 2: a dry-run lab with side-by-side comparison

A dry-run lab paired with a side-by-side compare view is the primitive I’d tell any team to build first if they’re serious about iteration speed. They’re two halves of the same idea: the lab generates variants you can inspect, the compare view is how you actually inspect them, and they only work together.

Most outbound experiments cost you in two currencies: the real emails you send and the real responses (or non-responses) you get back. If you want to test whether a new persona-detection prompt is better, the standard approach is to deploy it to a small slice of traffic, wait two weeks for replies, and read tea leaves. Every iteration is a deliverability bet on top of a learning bet.

Previewing a row in a spreadsheet tool gets you halfway: you can see the final email without sending. The half you’re missing is everything that went into producing it. Was the brief actually targeting the right buyer segment? Did the digest surface a fresh signal or a stale repost? Did the persona match land on the right archetype? If your only artifact is the final email, you can see which variant won but not understand why.

Our lab keeps the whole pipeline visible. You pick a workflow, a set of sample leads, and a set of variants. The lab runs the full workflow graph against every sample × variant cell, persists every node’s input and output as a separate typed row in dedicated lab tables, and exposes all of it side-by-side. The digest is queryable. The brief is queryable. The writer’s output is queryable. Each as its own structured artifact, not a column of text.

3 sample leads × 4 variants = 12 cells
Each cell = a full workflow execution, persisted, no email sent
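
To make "every node's input and output as a separate row" concrete, here is a minimal sketch of one cell's execution. It assumes a linear chain of nodes rather than a full graph, and the names NodeRecord and run_cell, plus the three toy node functions, are hypothetical stand-ins rather than our actual schema.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class NodeRecord:
    """One persisted row: what a single node saw and produced in one cell."""
    sample_id: str
    variant_id: str
    node_id: str
    node_input: Any
    node_output: Any


Node = tuple[str, Callable[[Any], Any]]


def run_cell(sample_id: str, variant_id: str, lead: dict, nodes: list[Node]) -> list[NodeRecord]:
    """Run the nodes in order and keep a record per node. Nothing is sent."""
    records: list[NodeRecord] = []
    value: Any = lead
    for node_id, fn in nodes:
        output = fn(value)
        records.append(NodeRecord(sample_id, variant_id, node_id, value, output))
        value = output  # each node feeds the next one in this simplified chain
    return records


# Toy stand-ins for the digest, brief, and writer steps.
nodes: list[Node] = [
    ("digest", lambda lead: {"signals": lead["posts"][:2]}),
    ("brief", lambda digest: {"anchor": digest["signals"][0]}),
    ("writer", lambda brief: f"Saw your post on {brief['anchor']}."),
]
records = run_cell("lead-1", "control", {"posts": ["pricing", "hiring", "launch"]}, nodes)
```

The point of the shape is that the digest and the brief survive as their own rows, so a later query can ask what the brief received without re-running anything.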

This is the inner loop. You change a rule in the writer’s rules file, or add an objection to the context graph, or swap a brief preset. You run the lab against three real prospects across four variants and read the twelve cells side-by-side. You diagnose what broke and what improved, then go again.

The outer loop is “send and wait two weeks.” The inner loop is “run the lab and read twelve cells.” If your only loop is the outer one, you’re shipping changes too slowly to know which one helped.

Lab and cohort-split integrate in a useful way. When the lab sees a workflow with a cohort:split node, it auto-expands variants as the cross-product of all cohort assignments. Three samples × two 2-way splitters becomes a 12-cell matrix with zero manual configuration. Forced cohort assignments let you pin specific cells to specific branches for end-to-end testing.
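
The arithmetic is a plain Cartesian product. A rough sketch, assuming each splitter is described by a name and a list of cohort IDs (expand_cells, the splitter names, and the forced_cohorts field are invented for illustration):

```python
from itertools import product


def expand_cells(samples: list[str], splitters: dict[str, list[str]]) -> list[dict]:
    """Return one cell per (sample, combination of cohort assignments)."""
    names = list(splitters)
    # Every combination of one cohort per splitter is one variant.
    variants = [dict(zip(names, combo)) for combo in product(*(splitters[n] for n in names))]
    return [{"sample": s, "forced_cohorts": v} for s in samples for v in variants]


samples = ["lead-1", "lead-2", "lead-3"]
splitters = {
    "digestSplit": ["control", "compressed"],
    "criticSplit": ["no-critic", "critic"],
}
cells = expand_cells(samples, splitters)
print(len(cells))  # 3 samples x (2 x 2) variants
```

Each cell carries the cohort it is pinned to on every splitter, which is the same information a forced assignment needs.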

Lab mode is also fail-closed against side effects. Every effectful action (sending an email, writing to the CRM, enrolling in Smartlead) checks the execution mode and refuses if the mode is lab. The check isn’t an opt-in flag the executors choose to respect. It’s a runtime guard at the action layer that prevents the lab from accidentally shipping email to a real prospect.
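
A guard like that can be sketched as a decorator that wraps every effectful action. This is an illustration under assumptions: the ctx dictionary, the mode values, and the names effectful, SideEffectBlocked, and send_email are hypothetical. The sketch also reads "fail-closed" strictly and refuses any mode that is not explicitly live, including a missing one.

```python
import functools


class SideEffectBlocked(RuntimeError):
    """Raised when an effectful action is attempted outside live mode."""


def effectful(action):
    """Wrap an action so it only runs when the execution mode is explicitly 'live'."""
    @functools.wraps(action)
    def guarded(ctx: dict, *args, **kwargs):
        mode = ctx.get("mode")
        if mode != "live":
            # Lab, unknown, and missing modes all refuse: the default is "don't send".
            raise SideEffectBlocked(f"{action.__name__} refused in mode {mode!r}")
        return action(ctx, *args, **kwargs)
    return guarded


@effectful
def send_email(ctx: dict, to: str, body: str) -> str:
    # Placeholder for a real send; returns a marker string for the example.
    return f"sent to {to}"
```

Because the check lives in the wrapper, an individual action cannot forget to make it, which is the difference between a guard and a flag.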

The lab without a compare view is just storage. You can query it, but you can’t learn from it. The compare view we built surfaces the cells in a grid:

  • The same sample lead’s variants stacked horizontally
  • Each variant’s final email body next to its brief, its digest, and the raw LinkedIn profile that fed in
  • The cohort assignment for each variant (which path the lead took through any cohort:split nodes)
  • The token cost, the latency, and the AI-detection score for each cell

A reviewer reads horizontally: same lead, four versions of what we wrote to them, four traces back to the data. The differences jump out. The “this anchor came from the digest but didn’t survive to the brief” gap becomes visible. The “this email used a stale repost from a year ago” tell shows up because the digest pane lists every post that fed in and which ones the brief used.
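
As a rough illustration of what the grid does with stored rows, the sketch below pivots flat rows into sample → variant → node and flags anchors that appear in the digest but not in the brief. The row shape, the anchor ID strings, and the function names build_grid and dropped_anchors are assumptions for the example.

```python
def build_grid(rows: list[dict]) -> dict:
    """Pivot flat rows into grid[sample][variant][node] = output."""
    grid: dict = {}
    for row in rows:
        cell = grid.setdefault(row["sample"], {}).setdefault(row["variant"], {})
        cell[row["node"]] = row["output"]
    return grid


def dropped_anchors(cell: dict) -> list[str]:
    """Anchors the digest surfaced that the brief did not use."""
    return sorted(set(cell["digest"]) - set(cell["brief"]))


rows = [
    {"sample": "lead-1", "variant": "control", "node": "digest", "output": ["post:pricing", "post:hiring"]},
    {"sample": "lead-1", "variant": "control", "node": "brief", "output": ["post:pricing"]},
    {"sample": "lead-1", "variant": "compressed", "node": "digest", "output": ["post:pricing", "post:hiring"]},
    {"sample": "lead-1", "variant": "compressed", "node": "brief", "output": ["post:pricing", "post:hiring"]},
]
grid = build_grid(rows)
for variant, cell in grid["lead-1"].items():
    print(variant, dropped_anchors(cell))
```

Reading one sample's variants out of the same structure is what lets the gap show up as a difference between neighbors rather than something you have to remember.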

Side-by-side reading is the only way to evaluate variants honestly. A linear list lets you forget what the control said by the time you’ve read the experimental version. A grid doesn’t.

Primitive 3: a context graph you can experiment on

Most outbound experimentation historically meant editing templates. You wrote one or two template variants, slotted in dynamic variables ({{first_name}}, {{company}}, maybe an AI-generated first-line fragment), and tested template A against template B. The experiment surface was the template itself: which slot value to inject, which template version to use, which dynamic snippet to swap in.

It worked for what it could express. Where it broke down is that the template author had to anticipate every axis you’d ever want to experiment on. If your template had slots for first-name and a proof quote but no slot for “what objection to preempt,” you couldn’t experiment on objection-preemption without rewriting the template. The experiment space was bounded by the slot vocabulary your past self had defined.

LLM generation over a typed context graph removes that boundary. The input layer becomes a graph of typed entities (ICPs, personas, JTBDs, signals, plays, objections, proofs, insights, alternatives) with explicit references between them. The writer composes from that graph at runtime instead of paste-substituting into fixed slots.

What that buys you as an experiment platform:

  • Adding an entity is its own experiment. Author a new objection file. Reference it from the personas it applies to. Run the lab. The objection shows up in briefs for those personas without anyone editing the writer’s prompt, without anyone updating a template. The experiment space grew by one without needing the slot to exist in advance.
  • The substitution is semantic, not string. A template slot can hold a string. A graph entity holds typed structured content the writer weaves into prose: sometimes as a direct quote, sometimes as a reframing, sometimes as a position in the argument. The same input can express differently across compositions, and that latitude is sometimes exactly what you want to vary.
  • One edit propagates non-locally. Edit a persona file once and every campaign whose pipeline traverses that persona picks up the change. Templating could share snippets, but the propagation graph was implicit and easy to lose track of. With typed references the propagation graph is the data.
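
A toy version of the first and third points, with only two entity types. The class names, fields, and brief_context method are hypothetical, and a real graph would carry many more types and references:

```python
from dataclasses import dataclass, field


@dataclass
class Objection:
    id: str
    text: str


@dataclass
class Persona:
    id: str
    title: str
    objection_ids: list[str] = field(default_factory=list)  # typed references, not pasted text


@dataclass
class ContextGraph:
    personas: dict[str, Persona] = field(default_factory=dict)
    objections: dict[str, Objection] = field(default_factory=dict)

    def brief_context(self, persona_id: str) -> dict:
        """Resolve a persona's references into the context a writer would receive."""
        persona = self.personas[persona_id]
        return {
            "persona": persona.title,
            "objections": [self.objections[oid].text for oid in persona.objection_ids],
        }


graph = ContextGraph()
graph.personas["cfo"] = Persona(id="cfo", title="CFO")
before = graph.brief_context("cfo")

# The "experiment": author one objection and reference it from the persona.
graph.objections["has-sdr-team"] = Objection("has-sdr-team", "We already have an SDR team.")
graph.personas["cfo"].objection_ids.append("has-sdr-team")
after = graph.brief_context("cfo")
```

The objection reaches the brief context without any change to brief_context itself, which is the property the bullets describe: the experiment is an edit to data, not to the code or template that consumes it.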

The email got less deterministic. The experiment process got more deterministic. Both at the same time, for the same reason.

This is the counterintuitive part worth saying directly: LLMs made each individual email less deterministic (same brief, run twice, slightly different output), but they made the experiment process more deterministic. Across enough samples, the population of emails follows the input graph systematically. The question shifts from “did this exact email come out right” (often unanswerable) to “did changing the input layer move the population in the expected direction” (measurable in the lab). You’re not chasing one run; you’re shaping the distribution.
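
One crude way to put a number on "moved the population" is to measure how often a trait shows up across a batch of lab outputs before and after an input change. The sketch below uses a case-insensitive substring match, which is a deliberately simple stand-in for whatever classifier you would actually trust; the function name mention_rate and the sample emails are made up for the example.

```python
def mention_rate(emails: list[str], phrase: str) -> float:
    """Share of emails that contain the phrase, case-insensitively."""
    if not emails:
        return 0.0
    hits = sum(phrase.lower() in email.lower() for email in emails)
    return hits / len(emails)


# Invented lab outputs for the same samples, before and after adding an objection entity.
before_emails = [
    "Saw your post on pricing.",
    "Congrats on the new hire.",
    "You may already have an SDR team, so here is the gap we fill.",
    "Noticed the launch last week.",
]
after_emails = [
    "Even if you already have an SDR team, this covers a different gap.",
    "Most teams that already have an SDR team still ask us about this.",
    "Saw your post on pricing.",
    "You already have an SDR team, which is why I'm writing.",
]
shift = mention_rate(after_emails, "SDR team") - mention_rate(before_emails, "SDR team")
```

No single email in either batch is guaranteed to mention the objection. The claim being tested is about the rate across the batch.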

This is the primitive most outbound stacks miss because their generation tool exposes columns, not entities. A spreadsheet column can hold a value, including a value computed by an LLM call. It does not expose a typed graph of references that the writer traverses. Without that substrate, your experiment surface is the columns you’ve remembered to define, and you’re back to templates with extra steps.

Primitive 4: closing the loop

The first three primitives buy you a fast inner loop: you build variants, run them in the lab without sending, and compare them in detail before shipping. That’s how you decide which variant goes out.

The fourth primitive is what tells you whether shipping it actually worked. And it’s the one most teams can’t have, regardless of how good their generation tooling is, because of how the typical outbound stack is wired.

A standard outbound stack splits across at least two systems. An enrichment and copy tool (Clay-style) generates content row by row. A sending tool (Smartlead, Instantly, Apollo’s outreach) handles deliverability and tracks events. Replies land in the sending tool’s webhook. The CRM may or may not get pinged.

That seam between the two tools is where most outbound experiments quietly end. The enrichment tool knew the lead was in cohort B with brief-curator-v3. The sending tool knows the lead replied. Nothing links those two facts together. The cohort assignment didn’t travel through the boundary, so the reply event arrives unattributed, and the experiment loop dead-ends at the seam.

This is why a real outbound experimentation platform has to wrap generation and enrollment together, not stop at the generation boundary. If your stack ends where Clay’s output begins to flow into Smartlead, the experiment is a development exercise. The platform you need carries the cohort assignment across that boundary.
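
In data terms, closing the loop is a join on a shared key. A minimal sketch, assuming both sides carry a lead ID and an experiment ID; the record shapes and the function name reply_rate_by_cohort are hypothetical, and the denominator here (leads assigned to the cohort) is one choice among several:

```python
from collections import Counter


def reply_rate_by_cohort(assignments: list[dict], events: list[dict]) -> tuple[dict, int]:
    """Attribute reply events to cohorts; count replies that cannot be attributed."""
    cohort_of = {(a["lead_id"], a["experiment_id"]): a["cohort"] for a in assignments}
    assigned = Counter(cohort_of.values())
    replied_keys = set()  # a lead counts once, however many replies arrive
    unattributed = 0
    for event in events:
        if event["type"] != "reply":
            continue
        key = (event["lead_id"], event.get("experiment_id"))
        if key in cohort_of:
            replied_keys.add(key)
        else:
            unattributed += 1  # the reply crossed the seam without its experiment ID
    replied = Counter(cohort_of[key] for key in replied_keys)
    rates = {cohort: replied[cohort] / n for cohort, n in assigned.items()}
    return rates, unattributed


exp = "compress-digest-2026-05"
assignments = [
    {"lead_id": "a", "experiment_id": exp, "cohort": "control"},
    {"lead_id": "b", "experiment_id": exp, "cohort": "control"},
    {"lead_id": "c", "experiment_id": exp, "cohort": "compressed"},
    {"lead_id": "d", "experiment_id": exp, "cohort": "compressed"},
]
events = [
    {"type": "reply", "lead_id": "a", "experiment_id": exp},
    {"type": "open", "lead_id": "b", "experiment_id": exp},
    {"type": "reply", "lead_id": "c", "experiment_id": exp},
    {"type": "reply", "lead_id": "c", "experiment_id": exp},
    {"type": "reply", "lead_id": "z"},  # arrived from the sending tool with no experiment ID
]
rates, unattributed = reply_rate_by_cohort(assignments, events)
```

The unattributed count is the seam made visible: every reply in it is an outcome the experiment paid for and cannot use.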

Why subject-line A/B tests don’t compose

Now back to where this started. Subject-line A/B tests are the form of outbound experimentation most teams reach for first because tools make them easy. They’re not wrong; they’re just narrow.

The narrowness is structural. A subject-line test holds the pipeline constant and varies one decision at the end. If the rest of the pipeline is generating forced bridges, hallucinating product fit, or pulling stale signals, you’ll measure which of two slightly different subject lines does best on top of broken machinery. Picking a winner doesn’t fix the broken machinery; it shifts you from one local optimum to another, both far from the global one.

Workflow-level experiments compose because they let you vary the actual sources of variance. Test whether brief curation with Sonnet beats brief curation with Gemini. Test whether adding a critic-rewriter cycle is worth the cost. Test whether routing tier-3 prospects to a discovery sequence improves response quality. None of these are subject-line tests, and none of them are expressible inside the typical “A/B in Outreach” feature.

This is also why a real outbound experimentation platform has to live deeper in the stack than a sending tool. Outreach, Salesloft, and Smartlead are the last mile. By the time the email arrives at the sending tool, the experiments that mattered already happened upstream. The platform you need is the one that wraps generation, not the one that wraps delivery.

What we got wrong before we built this

A few lessons that didn’t make it into ADRs because they happened before we had ADRs:

  • We ran experiments without a sample × variant matrix. Just deployed the new prompt to all traffic, hoped reply rates didn’t tank, learned in slow motion. The lab cost us about two engineering weeks to build. It paid back inside the first iteration of any prompt change.
  • We compared variants in tabs. Different tabs of the admin UI, one variant per tab, flipping between them. We thought we were reading carefully. We weren’t. A grid view caught more issues in the first week than two months of tab-flipping had.
  • We didn’t pin metric definitions. Different surfaces had different formulas for “reply rate.” Conversations about whether a campaign was working would drift into arguments about which number was right. The denominator-alignment work was tedious but it closed an entire category of arguments.
  • We treated experiments as parameter sweeps. Until we built workflow-level cohort splits, our “experiments” were almost always “try a different value for one config option.” The structural experiments (adding a node, swapping a model, routing on a signal) were ad-hoc workflow forks that didn’t share assignment determinism with each other. The cohort-split node let us treat them as first-class.

The diagnostic

If you want to know whether your outbound has any real experimentation structure under it, four questions:

  1. Can your “split” change the architecture of the pipeline (insert or remove a node, swap a model at a specific step, route on a signal), or only change the contents of a row? Row-level splits give you subject-line-shaped experiments. Architecture-level splits give you the real ones.
  2. For any cell, can you read the digest, the brief, and the final email side-by-side as separate typed artifacts across every variant in a single grid? If you’re flipping between tabs or reading long logs in sequence, you’re not actually comparing; you’re remembering what the control said.
  3. Can you add a new objection, proof, or persona to your system and have it show up in the right emails automatically, without anyone editing a template, a prompt, or a column formula? If new context only takes effect after a template author rewires a slot, your experiment surface is bounded by what they remembered to define.
  4. Can you tie a positive reply event back to the cohort and intermediate state that produced it? If your generation tool and your sending tool don’t share a lead ID and an experiment ID, the loop breaks at the seam. Variants ship; outcomes arrive unattributed; the experiment ends in a development environment.

A team that can answer yes to all four runs outbound differently than a team that can’t. They iterate weekly instead of quarterly. They argue about hypotheses instead of metric definitions. They ship structural improvements instead of cycling subject lines. And for the same reason a typed context graph compounds, their experimental wins compound too, because each one lands as a structural change to the pipeline that future experiments build on.

Subject lines test sentences. Pipelines test programs. If outbound is your channel, the experiment unit has to be the program.

This is the kind of system we build for clients. The first call is 30 minutes. Bring your current outbound and what you've already tried.

Joe Rhew, Founder