2.8 KB, 13 KB, 17 KB. The snippet size is lightweight. You base your A/B testing tool choice on that because it means the tool won't affect page performance. Right?
In live experiments, the snippet you install is often just a loader. The real payload arrives later, sometimes running into hundreds of KB after the page renders.
We unpack this in this article. You'll learn how script size is actually measured in production, where common claims fall apart, and why "smallest snippet size" doesn't usually match real-world performance. Then, we'll tell you how we do it differently.
How We Investigated A/B Testing Script Size in Real Environments
We set out with one key question:
What is the actual payload required to run a real A/B test in production?
The goal of our investigation was to measure the true execution cost of A/B testing scripts across some leading platforms. Specifically:
- How much code is actually delivered to the browser
- How that code is loaded and executed, and
- How much of it is visible in the vendor-reported snippet size claims
Our test subjects were:
- Convert Experiences
- Mida.so
- VWO
- ABlyft
- Webtrends Optimize
- Fibr.ai
- Visually.io
- Amplitude Experiment (included for contrast as a feature flagging system)
Now, the first step was to define what "script size" should mean in its truest sense. Instead of only isolating the snippet installed on the page, we treated script size as the full execution footprint.
That means every asset required to deliver an experiment to a user was included, namely:
- Initial script payload (inline snippet or SDK)
- Additional scripts loaded at runtime
- Total bytes transferred (gzipped and uncompressed)
- Number of network requests triggered
- Timing of execution relative to page render
- Presence of dynamic loading patterns (e.g., script injection, API fetches)
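To make the definition concrete, the full execution footprint can be modeled as a small record that sums every byte a tool delivers. This is an illustrative sketch (the class name and fields are ours, not any vendor's API); the sample figures mirror the VWO row of Table 1, where a 2.8KB stub pulls in roughly 11.9KB more at runtime.

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionFootprint:
    """One tool's full execution footprint, per our definition above."""
    initial_payload_gz: int                 # inline snippet or SDK, gzipped bytes
    runtime_payloads_gz: list[int] = field(default_factory=list)  # scripts loaded later
    request_count: int = 0                  # network requests triggered

    def total_gz(self) -> int:
        # The number that matters: everything delivered, not just the snippet.
        return self.initial_payload_gz + sum(self.runtime_payloads_gz)

# Illustrative figures modeled on the VWO row of Table 1:
# a 2.8KB stub plus a dynamically loaded library.
stub_plus_library = ExecutionFootprint(
    initial_payload_gz=2_800,
    runtime_payloads_gz=[11_900],
    request_count=2,
)
print(stub_plus_library.total_gz())  # 14700 bytes, ~5.2x the advertised stub
```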
With that definition in place, we inspected each implementation with a combination of tools: Browser DevTools, direct payload measurement, and code analysis.
Methodology
We maintained the same set of evaluation steps for each tool's snippet.
Step 1: Direct measurement from production environments
We collected the tracking scripts for all tools from live customer sites, then measured each directly with curl, capturing both gzipped transfer size and uncompressed payload to get the exact figures delivered to users.
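The gap between the two numbers matters because minified JavaScript compresses extremely well. The sketch below reproduces the gzipped-versus-uncompressed comparison in Python on a stand-in payload (the script bytes are a made-up repetitive sample, not a real vendor script); in practice we fetched the live files over the network.

```python
import gzip

# Stand-in for a downloaded vendor script. Minified JS is highly
# repetitive, which is why gzipped and uncompressed figures diverge.
script_bytes = b"window.__exp=function(a,b){return a?b:null};" * 500

raw_size = len(script_bytes)                # uncompressed payload
gz_size = len(gzip.compress(script_bytes))  # approximates on-the-wire transfer size

print(f"uncompressed: {raw_size} bytes, gzipped: {gz_size} bytes")
```

The same comparison with curl comes from requesting the file twice, once with an `Accept-Encoding: gzip` header and once without, and comparing the downloaded byte counts.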
Step 2: Code-level analysis of how scripts execute
Next, we examined what happens after the initial script loads.
We looked for patterns such as progressive injection, where additional scripts are introduced at runtime, and external API calls fetch experiment configurations or variation logic. This way, we could trace the full execution path of each tool.
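A simple way to surface these patterns is to scan a script's source for the telltale calls. This is a rough heuristic sketch, not our full analysis tooling, and the loader string below is a fabricated example:

```python
import re

# Heuristic signatures of runtime loading. Illustrative, not exhaustive.
PATTERNS = {
    "script injection": re.compile(r"createElement\(['\"]script['\"]\)"),
    "config fetch": re.compile(r"\bfetch\(|XMLHttpRequest"),
}

def dynamic_loading_signals(source: str) -> list[str]:
    """Return the names of every runtime-loading pattern found in the source."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(source)]

# A fabricated loader that injects a script tag and fetches a config:
loader = "var s=document.createElement('script');s.src=cdn+id;fetch(api+'/config')"
print(dynamic_loading_signals(loader))  # ['script injection', 'config fetch']
```

A snippet that trips both signals is almost certainly a loader, which is exactly the case where the advertised snippet size understates the real footprint.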
Step 3: Measuring runtime dependencies and total payload
Using browser DevTools, we captured the full network waterfall triggered by each testing tool: secondary scripts, configuration files, and any dynamically injected resources needed to run experiments.
This measurement gave us the total payload required to execute an experiment.
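DevTools can export a captured waterfall as a HAR file, and totaling a vendor's payload is then a matter of summing the transfer sizes of its requests. The sketch below shows the idea on a trimmed, HAR-like structure; the URLs and byte counts are illustrative, not real measurements.

```python
# A trimmed, HAR-like network log of the kind DevTools exports.
# Hosts and sizes here are illustrative placeholders.
har = {"log": {"entries": [
    {"request": {"url": "https://cdn.vendor-x.example/loader.js"},
     "response": {"_transferSize": 2_800}},
    {"request": {"url": "https://cdn.vendor-x.example/library.js"},
     "response": {"_transferSize": 11_900}},
    {"request": {"url": "https://www.customer-site.example/index.html"},
     "response": {"_transferSize": 54_000}},
]}}

def vendor_payload(har: dict, host_fragment: str) -> int:
    """Sum transferred bytes across every request served from the vendor's host."""
    return sum(
        entry["response"]["_transferSize"]
        for entry in har["log"]["entries"]
        if host_fragment in entry["request"]["url"]
    )

print(vendor_payload(har, "vendor-x"))  # 14700
```

Filtering by host keeps first-party page weight out of the total, so the figure reflects only what the testing tool itself delivers.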
Step 4: Architectural and trade-off analysis
Each platform delivered experiments quite differently, so we categorized them by delivery architecture. Each architecture came with its own trade-offs; more on this in the findings section.
Step 5: Validation against vendor claims and benchmarks
Finally, we lined everything up against what vendors report.
We reviewed official documentation and third-party benchmarks (including the Mida.so benchmark) and compared them against direct measurements from production environments.
Step 6: Individual competitor assessment
As a last step, we assessed each platform independently across the same dimensions:
- Reported script size versus measured payload
- Delivery architecture
- Impact on page load and Core Web Vitals
- Approach to flicker prevention and experiment timing
Findings: What Actually Loads When Experiments Run
1. Reported snippet size rarely reflects the total payload
As you probably guessed, there's a visible gap between what many vendors claim and what actually runs:
Table 1: Advertised vs measured base SDK
| Tool | Advertised Claim | Measured Base SDK | Key Observation |
|---|---|---|---|
| Convert | ~93KB baseline | ~93KB gzipped baseline | Full payload delivered upfront. No hidden runtime fetches |
| VWO | 2.8KB stub | 14.7KB gzipped minimum (5.2× larger) | Stub excludes dynamically loaded library and campaign code |
| ABlyft | 13KB | ~32KB gzipped SDK, ~168.5KB uncompressed | Claim reflects only initial loader. Full footprint significantly larger |
| Mida.so | 17.2KB | ~19.5KB loader, 30-40KB base SDK (1.7-2.3× larger) | Progressive injection model. Runtime configs not included in claim |
| Webtrends Optimize | No clear size disclosed | ~170KB uncompressed (third-party benchmark) | Limited transparency on actual payload |
| Visually.io | 15.13KB SDK | ~15KB SDK only | Missing experiment configuration footprint |
| Fibr.ai | "Zero performance drop" | Not publicly disclosed | No measurable payload data available |
| Amplitude Experiment | "<1ms evaluation" | ~63KB uncompressed SDK | Refers to cached evaluation. Not comparable to DOM-based testing |