Background of A/B Testing and Uncertainty
A/B testing helps determine which version of an element performs better, but interpreting results isn’t always straightforward because randomness is inherently part of the process. Think of flipping a coin: even with two identical coins, outcomes can vary in the short term. If you flip each coin just 10 times, you might get 7 heads and 3 tails with one coin, and a 5-5 split with the other. At this scale, it might seem like the two coins are inherently different, but in reality, this is just random variation.
However, if you flip each coin 50 times, the outcomes will likely stabilize, revealing a distribution closer to the true 50-50 probability. This is due to the law of large numbers, which states that as the sample size grows, the observed frequencies get closer to the expected outcome. A/B testing follows the same principle: with a limited number of observations (or interactions), results are prone to fluctuation, and this uncertainty can lead to misleading conclusions. Increasing the number of observations reduces this uncertainty, making it easier to identify genuine performance differences rather than random noise. But, as with coins, even extensive testing can’t completely eliminate random variation; it simply gives us a stronger foundation for informed decisions.
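To make the effect concrete, here is a minimal coin-flip simulation (our own illustration, not part of Kilkaya) showing how the observed share of heads settles toward 50% as the number of flips grows:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def heads_share(flips: int) -> float:
    """Flip a fair coin `flips` times and return the observed share of heads."""
    heads = sum(random.random() < 0.5 for _ in range(flips))
    return heads / flips

for flips in (10, 50, 1_000, 100_000):
    print(f"{flips:>7} flips -> observed share of heads: {heads_share(flips):.3f}")
```

On any given run, the small samples can land well away from 0.5, while the large ones rarely do.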
Method Selection and the Inherent Uncertainty
Different methods can be used to conduct A/B testing, each with its own advantages. However, all methods share a fundamental trait: they require a sufficient number of observations to draw reliable conclusions. The primary difference among the approaches lies in how they handle and optimize around this uncertainty. Here are a few popular A/B testing methods and their characteristics:
- Classical (Frequentist) methods use a fixed sample size and pre-defined significance levels, aiming to minimize uncertainty by reaching a clear statistical threshold. This approach waits for a sufficient sample size to reduce the margin of error and provide a reliable result, offering a snapshot of which variant is likely better.
- Bayesian methods continually update the probability of one variant being better than another as new data comes in, rather than relying on a fixed sample size. While this approach can adapt more dynamically, it still requires enough data to reliably reduce uncertainty, often taking longer to reach a highly confident decision.
- Multi-armed bandit approaches actively allocate more traffic to better-performing variants as the test progresses, attempting to optimize traffic distribution in real-time. This method reduces potential “losses” on underperforming variants but still requires enough interactions to ensure that the early allocation doesn’t prematurely settle on a variant due to random fluctuations.
Regardless of which method you choose, you need enough “coin tosses” (interactions) to get trustworthy results. If you stop too early, you risk interpreting random differences as meaningful. Consequently, ensuring enough observations is critical across all methods to achieve a dependable measure of performance.
So, while all methods must contend with the same baseline uncertainty, they differ in their strategies for managing it, each with unique advantages based on goals, timeline, and sample size constraints.
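Of the three, the Bayesian approach is probably the hardest to picture, so here is a minimal sketch of the underlying idea (our own illustration, not Kilkaya’s implementation; the click and impression counts are made up). It maintains a Beta posterior over each variant’s CTR and estimates the probability that variant B beats variant A:

```python
import random

random.seed(2)  # fixed seed for reproducibility

def prob_b_beats_a(clicks_a, views_a, clicks_b, views_b, draws=100_000):
    """Monte Carlo estimate of P(CTR_B > CTR_A) under uniform Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        # Each variant's CTR posterior is Beta(clicks + 1, views - clicks + 1).
        ctr_a = random.betavariate(clicks_a + 1, views_a - clicks_a + 1)
        ctr_b = random.betavariate(clicks_b + 1, views_b - clicks_b + 1)
        wins += ctr_b > ctr_a
    return wins / draws

# Hypothetical counts: variant B looks better, but how sure can we be yet?
print(prob_b_beats_a(clicks_a=40, views_a=1_000, clicks_b=52, views_b=1_000))
```

A multi-armed bandit built on the same posteriors (Thompson sampling, for instance) would draw one CTR sample per variant for each new visitor and show the variant with the highest draw, which naturally routes more traffic to whichever variant currently looks stronger.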
Our Choice
At Kilkaya, we chose the classical (Frequentist) method because it’s designed to deliver statistically reliable results quickly, which is crucial for content that needs timely optimization.
Finding the best-performing version as soon as possible has several practical advantages. For example, if a particular title performs well, you may want to share it on social media early to maximize engagement. Additionally, website traffic often decreases over time, so identifying and implementing the optimal version early captures more of the available audience.
While Bayesian and multi-armed bandit methods have other strengths, they tend to require longer testing periods to reach conclusive results. They also involve more complex calculations and probabilities, making them harder to explain and interpret clearly. By using the classical approach, Kilkaya ensures a balance between statistical rigor and speed, providing straightforward, actionable insights that allow clients to make quick, informed decisions to enhance content engagement.
We Measure Only Visible Impressions
In Kilkaya, we measure only the visible impressions of each slot or variant, meaning an impression counts only if the slot actually appears on the reader’s screen. However, even when counting only visible impressions, the click-through rate (CTR) decreases as you move further down the page. A visible slot at the top simply carries more weight than one further down, where users have already been scrolling and are less engaged with the content they encounter later.
For instance, a slot at the very top of the page might achieve a visible CTR of around 50% because it is the first thing users see. But as users scroll, the CTR drops significantly—sometimes to as low as 2% for slots further down the page. This 25x difference shows that even visible impressions lose effectiveness lower on the page, as users are less likely to interact with content they encounter after scrolling through other items.
This steep decline in visible CTR is why we don’t rely on views to gauge an experiment’s duration or a variant’s success. Views alone can’t account for the engagement disparity across page positions, which makes them misleading. Instead, Kilkaya uses clicks as the primary metric for a variant’s performance: clicks are a direct measure of user engagement and offer a consistent, reliable basis for comparison regardless of the slot’s position on the page.
Testing Different Elements on a Slot
Any element within a slot can be tested. While titles and images are the most commonly tested elements, some publishers experiment with lead text, background colors, section markers, font sizes, text colors, and additional elements. These components can be tested individually or in combination. However, we avoid automatic permutations because elements like images and titles often correlate, and testing them separately may produce results that are misleading.
Number of Clicks Determines Measurement Accuracy
In Kilkaya, the number of clicks per variant is essential to accurately detecting performance differences. After running thousands of “A/A tests”—tests comparing identical versions to gauge natural variation—we found that, with 100 clicks per variant, we can detect a 23% increase in click-through rate (CTR) with 95% confidence. This means that if a new variant performs 23% better than the original, we can confidently say it’s due to the difference in content and not just random chance.
But what if we want to detect smaller or larger differences? Here’s where a statistical rule of thumb, the “inverse-square law,” comes into play. It tells us that the number of clicks needed grows with the inverse square of the difference we want to detect: halving the detectable difference requires roughly four times as many clicks. For example:
- 50% increase: Only 21 clicks per variant are needed to reliably detect such a large boost.
- 23% increase: 100 clicks per variant are needed to reliably detect this increase in CTR (our default value).
- 10% increase: Around 529 clicks per variant are needed to detect this smaller, more subtle improvement.
This rule of thumb helps us scale testing efficiently. For big changes, fewer clicks suffice; for finer differences, more clicks are needed to ensure accuracy. By focusing on clicks (actual user engagement), Kilkaya ensures that each test result is grounded in solid data, giving publishers confidence in decisions to improve content, increase engagement, and boost CTR.
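A minimal sketch of this rule of thumb, using the figures above (the function name and defaults are ours, for illustration only):

```python
def clicks_needed(detectable_lift: float,
                  baseline_lift: float = 0.23,
                  baseline_clicks: int = 100) -> int:
    """Scale the baseline requirement by the inverse square of the target lift."""
    return round(baseline_clicks * (baseline_lift / detectable_lift) ** 2)

for lift in (0.50, 0.23, 0.10):
    print(f"{lift:.0%} lift -> ~{clicks_needed(lift)} clicks per variant")
# Prints roughly 21, 100, and 529 clicks, matching the figures above.
```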
Determining a Winner with a 95% Confidence Threshold
To declare a variant the winner, it must outperform the others with at least 95% confidence. If a winner is identified, Kilkaya will automatically select it. If no variant reaches this level of confidence, users can either retain the original or opt for the best-performing variant, even though its lead falls short of the 95% confidence threshold.
In Kilkaya, the original variant is always the baseline for comparison. If the original variant performs best, we do not declare it a “winner” or celebrate it; we simply acknowledge that none of the tested alternatives outperformed it and retain the original. This avoids rewarding A/B tests that do not lead to improvements, since such tests typically cost traffic while visitors are being shown the underperforming variants.
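Kilkaya doesn’t spell out the exact formula here, but a standard way to express this kind of confidence in a classical test is a one-sided two-proportion z-test. The sketch below is our own illustration with made-up click and impression counts, not the product’s actual code:

```python
from math import erf, sqrt

def confidence_variant_beats_original(clicks_orig, views_orig, clicks_var, views_var):
    """One-sided two-proportion z-test: confidence that the variant's CTR is higher."""
    p_orig = clicks_orig / views_orig
    p_var = clicks_var / views_var
    pooled = (clicks_orig + clicks_var) / (views_orig + views_var)
    se = sqrt(pooled * (1 - pooled) * (1 / views_orig + 1 / views_var))
    z = (p_var - p_orig) / se
    return 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z

# Hypothetical counts: declare the variant the winner only if this exceeds 0.95.
print(confidence_variant_beats_original(100, 2_000, 135, 2_000))  # about 0.99
```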
Avoiding Click-Bait
It’s easy to write a headline that gets clicks by using click-bait tactics, but at Kilkaya, we don’t want to reward that approach. Our default setting removes every click where the resulting page view is a quick exit—meaning the reader leaves the page within 5 seconds. Quick exits are a strong indicator that the content didn’t meet user expectations, and tracking these allows us to filter out misleading results that prioritize immediate clicks over meaningful engagement.
Although it’s possible to include all clicks in the analysis, we highly recommend using our default to exclude quick exits. This ensures that performance metrics reflect genuine user interest, supporting headlines and content that truly resonate with audiences and foster long-term loyalty.
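As a simple illustration of the quick-exit filter (the field names and records below are hypothetical, not Kilkaya’s data model), removing quick exits before scoring a variant might look like this:

```python
QUICK_EXIT_SECONDS = 5  # Kilkaya's default threshold for a "quick exit"

# Hypothetical click records: how long the reader stayed on the resulting page view.
clicks = [
    {"variant": "A", "seconds_on_page": 2},
    {"variant": "A", "seconds_on_page": 48},
    {"variant": "B", "seconds_on_page": 3},
    {"variant": "B", "seconds_on_page": 4},
    {"variant": "B", "seconds_on_page": 120},
]

# Keep only clicks whose resulting page view lasted at least the threshold.
engaged_clicks = [c for c in clicks if c["seconds_on_page"] >= QUICK_EXIT_SECONDS]
print(len(engaged_clicks), "of", len(clicks), "clicks kept")  # 2 of 5 clicks kept
```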
Our Recommendation on Titles
When examining A/B tests across multiple publishers, we’ve found that shorter titles tend to perform best—one of the strongest predictors of a winning title is its brevity. However, titles shouldn’t be so short that they verge on click-bait. A good title should give readers a clear sense of the story upfront, not leave them guessing. Don’t be afraid to reveal the essence of the story; transparency builds trust and interest.
Emotions are powerful motivators, so aim to evoke feelings like anger, compassion, admiration, or urgency—emotions that make readers want to dive in, not merely out of curiosity but out of a genuine desire to engage with the full story. When used well, these emotions create a connection with the reader, encouraging them to explore the content with trust and anticipation rather than fleeting intrigue.
Influencing How Long a Test Runs
How long an A/B test needs to run depends on the amount of traffic a site receives. Sites with lots of visitors gather clicks quickly, allowing them to detect even small improvements between versions. For sites with less traffic, it can take much longer to collect enough clicks, so it often helps to test headlines that create a bigger difference in CTR in order to reach results faster. No testing method can fully remove uncertainty, so if traffic is low and the differences between versions are small, you’ll either need to settle for less certainty or aim for bigger changes to clearly see which version performs best. Aiming for bigger changes lets tests finish within a reasonable time frame without sacrificing reliability.
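As a back-of-the-envelope sketch (our own illustration, not a Kilkaya feature), combining the inverse-square rule from earlier with a slot’s traffic and baseline CTR gives a rough sense of test duration; all of the numbers below are made up:

```python
def days_to_finish(daily_visible_impressions: int,
                   baseline_ctr: float,
                   detectable_lift: float,
                   variants: int = 2) -> float:
    """Rough estimate: days until each variant has collected enough clicks."""
    clicks_needed = 100 * (0.23 / detectable_lift) ** 2  # inverse-square rule from above
    clicks_per_variant_per_day = daily_visible_impressions * baseline_ctr / variants
    return clicks_needed / clicks_per_variant_per_day

# A low-traffic slot (2,000 visible impressions/day, 5% CTR) chasing a subtle 10% lift...
print(f"{days_to_finish(2_000, 0.05, 0.10):.1f} days")  # roughly 10.6 days
# ...versus aiming for a big 50% lift on the same slot.
print(f"{days_to_finish(2_000, 0.05, 0.50):.1f} days")  # well under a day
```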
Summary
- Kilkaya’s Choice of Classical Testing: Kilkaya uses the classical method for its quick, straightforward results, allowing clients to implement the best-performing version faster—a practical advantage when social media and time-sensitive traffic are in play.
- Focusing on Visible Impressions: Kilkaya measures only visible impressions of page elements. CTR drops further down the page, so we rely on clicks over views for consistent, meaningful engagement metrics.