2.8 KB, 13 KB, 17 KB. The snippet size is lightweight. You base your A/B testing tool choice on that because it means the tool won't affect page performance. Right?
In live experiments, the snippet you install is often just a loader. The real payload arrives later, sometimes running into hundreds of KB after the page renders.
We unpack this in this article. You'll learn how script size is actually measured in production, where common claims fall apart, and why "smallest snippet size" doesn't usually match real-world performance. Then, we'll tell you how we do it differently.
How We Investigated A/B Testing Script Size in Real Environments
We set out with one key question:
What is the actual payload required to run a real A/B test in production?
The goal of our investigation was to measure the true execution cost of A/B testing scripts across some leading platforms. Specifically:
- How much code is actually delivered to the browser
- How that code is loaded and executed, and
- How much of it is visible in the vendor-reported snippet size claims
Our test subjects were:
- Convert Experiences
- Mida.so
- VWO
- ABlyft
- Webtrends Optimize
- Fibr.ai
- Visually.io
- Amplitude Experiment (included for contrast as a feature flagging system)
Now, the first step was to define what "script size" should mean in its truest sense. Instead of only isolating the snippet installed on the page, we treated script size as the full execution footprint.
That means every asset required to deliver an experiment to a user was included, namely:
- Initial script payload (inline snippet or SDK)
- Additional scripts loaded at runtime
- Total bytes transferred (gzipped and uncompressed)
- Number of network requests triggered
- Timing of execution relative to page render
- Presence of dynamic loading patterns (e.g., script injection, API fetches)
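To make the definition concrete, the full execution footprint can be modeled as a small record that sums every byte a tool delivers. This is an illustrative sketch (the class name and fields are ours, not any vendor's API); the sample figures mirror the VWO row of Table 1, where a 2.8KB stub pulls in roughly 11.9KB more at runtime.

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionFootprint:
    """One tool's full execution footprint, per our definition above."""
    initial_payload_gz: int                 # inline snippet or SDK, gzipped bytes
    runtime_payloads_gz: list[int] = field(default_factory=list)  # scripts loaded later
    request_count: int = 0                  # network requests triggered

    def total_gz(self) -> int:
        # The number that matters: everything delivered, not just the snippet.
        return self.initial_payload_gz + sum(self.runtime_payloads_gz)

# Illustrative figures modeled on the VWO row of Table 1:
# a 2.8KB stub plus a dynamically loaded library.
stub_plus_library = ExecutionFootprint(
    initial_payload_gz=2_800,
    runtime_payloads_gz=[11_900],
    request_count=2,
)
print(stub_plus_library.total_gz())  # 14700 bytes, ~5.2x the advertised stub
```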
With that definition in place, we inspected each implementation with a combination of tools: Browser DevTools, direct payload measurement, and code analysis.
Methodology
We maintained the same set of evaluation steps for each tool's snippet.
Step 1: Direct measurement from production environments
We collected the tracking scripts for all tools from live customer sites, then measured each directly with curl, capturing both gzipped transfer size and uncompressed payload to get the exact figures delivered to users.
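The gap between the two numbers matters because minified JavaScript compresses extremely well. The sketch below reproduces the gzipped-versus-uncompressed comparison in Python on a stand-in payload (the script bytes are a made-up repetitive sample, not a real vendor script); in practice we fetched the live files over the network.

```python
import gzip

# Stand-in for a downloaded vendor script. Minified JS is highly
# repetitive, which is why gzipped and uncompressed figures diverge.
script_bytes = b"window.__exp=function(a,b){return a?b:null};" * 500

raw_size = len(script_bytes)                # uncompressed payload
gz_size = len(gzip.compress(script_bytes))  # approximates on-the-wire transfer size

print(f"uncompressed: {raw_size} bytes, gzipped: {gz_size} bytes")
```

The same comparison with curl comes from requesting the file twice, once with an `Accept-Encoding: gzip` header and once without, and comparing the downloaded byte counts.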
Step 2: Code-level analysis of how scripts execute
Next, we examined what happens after the initial script loads.
We looked for patterns such as progressive injection, where additional scripts are introduced at runtime, and external API calls fetch experiment configurations or variation logic. This way, we could trace the full execution path of each tool.
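A simple way to surface these patterns is to scan a script's source for the telltale calls. This is a rough heuristic sketch, not our full analysis tooling, and the loader string below is a fabricated example:

```python
import re

# Heuristic signatures of runtime loading. Illustrative, not exhaustive.
PATTERNS = {
    "script injection": re.compile(r"createElement\(['\"]script['\"]\)"),
    "config fetch": re.compile(r"\bfetch\(|XMLHttpRequest"),
}

def dynamic_loading_signals(source: str) -> list[str]:
    """Return the names of every runtime-loading pattern found in the source."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(source)]

# A fabricated loader that injects a script tag and fetches a config:
loader = "var s=document.createElement('script');s.src=cdn+id;fetch(api+'/config')"
print(dynamic_loading_signals(loader))  # ['script injection', 'config fetch']
```

A snippet that trips both signals is almost certainly a loader, which is exactly the case where the advertised snippet size understates the real footprint.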
Step 3: Measuring runtime dependencies and total payload
Using browser DevTools, we captured the full network waterfall triggered by each testing tool: secondary scripts, configuration files, and any dynamically injected resources needed to run experiments.
This measurement gave us the total payload required to execute an experiment.
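DevTools can export a captured waterfall as a HAR file, and totaling a vendor's payload is then a matter of summing the transfer sizes of its requests. The sketch below shows the idea on a trimmed, HAR-like structure; the URLs and byte counts are illustrative, not real measurements.

```python
# A trimmed, HAR-like network log of the kind DevTools exports.
# Hosts and sizes here are illustrative placeholders.
har = {"log": {"entries": [
    {"request": {"url": "https://cdn.vendor-x.example/loader.js"},
     "response": {"_transferSize": 2_800}},
    {"request": {"url": "https://cdn.vendor-x.example/library.js"},
     "response": {"_transferSize": 11_900}},
    {"request": {"url": "https://www.customer-site.example/index.html"},
     "response": {"_transferSize": 54_000}},
]}}

def vendor_payload(har: dict, host_fragment: str) -> int:
    """Sum transferred bytes across every request served from the vendor's host."""
    return sum(
        entry["response"]["_transferSize"]
        for entry in har["log"]["entries"]
        if host_fragment in entry["request"]["url"]
    )

print(vendor_payload(har, "vendor-x"))  # 14700
```

Filtering by host keeps first-party page weight out of the total, so the figure reflects only what the testing tool itself delivers.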
Step 4: Architectural and trade-off analysis
Each platform delivered experiments quite differently, so we categorized them by delivery architecture. Each architecture came with its own trade-offs; more on this in the findings section.
Step 5: Validation against vendor claims and benchmarks
Finally, we lined everything up against what vendors report.
We reviewed official documentation and third-party benchmarks (including the Mida.so benchmark) and compared them against direct measurements from production environments.
Step 6: Individual competitor assessment
As a last step, we assessed each platform independently across the same dimensions:
- Reported script size versus measured payload
- Delivery architecture
- Impact on page load and Core Web Vitals
- Approach to flicker prevention and experiment timing
Findings: What Actually Loads When Experiments Run
1. Reported snippet size rarely reflects the total payload
As you probably guessed, there's a visible gap between what many vendors claim and what actually runs:
Table 1: Advertised vs measured base SDK
| Tool | Advertised Claim | Measured Base SDK | Key Observation |
|---|---|---|---|
| Convert | ~93KB baseline | ~93KB gzipped baseline | Full payload delivered upfront. No hidden runtime fetches |
| VWO | 2.8KB stub | 14.7KB gzipped minimum (5.2× larger) | Stub excludes dynamically loaded library and campaign code |
| ABlyft | 13KB | ~32KB gzipped SDK, ~168.5KB uncompressed | Claim reflects only initial loader. Full footprint significantly larger |
| Mida.so | 17.2KB | ~19.5KB loader, 30-40KB base SDK (1.7-2.3× larger) | Progressive injection model. Runtime configs not included in claim |
| Webtrends Optimize | No clear size disclosed | ~170KB uncompressed (third-party benchmark) | Limited transparency on actual payload |
| Visually.io | 15.13KB SDK | ~15KB SDK only | Missing experiment configuration footprint |
| Fibr.ai | "Zero performance drop" | Not publicly disclosed | No measurable payload data available |
| Amplitude Experiment | "<1ms evaluation" | ~63KB uncompressed SDK | Refers to cached evaluation. Not comparable to DOM-based testing |