Guide

Flaky Tests: Why They Happen and How to Actually Fix Them

46% of flaky tests fail due to resource issues, not code bugs. Learn the root causes and why common fixes like retries make things worse.

Dhaval Shreyas
Co-founder & CEO at Pie
14 min read

What you’ll learn

  • The four root causes of test flakiness and why each one matters
  • Why retries, quarantines, and timeouts make the problem worse over time
  • How vision-based testing decouples tests from implementation details
  • A practical framework to measure your team’s flake rate

A test fails. You check the logs. Nothing in the codebase changed. You hit re-run, and it passes.

The World Quality Report found that organizations dedicate 30-50% of their testing resources to maintaining and updating test scripts. For a team of five QA engineers, that’s the equivalent of losing two full-time engineers to test babysitting.

The instinct is to add retries, increase timeouts, or quarantine the worst offenders. None of it works long-term because the problem runs deeper than individual tests. The architecture of selector-based testing is brittle.

This guide breaks down why tests become flaky, why common fixes backfire, and how vision-based testing solves the problem at its source.

What Is a Flaky Test

A flaky test produces inconsistent results without any changes to the code being tested. Run it ten times, it passes eight. The two failures have no explanation that makes sense. This behavior stems from non-determinism in tests, where the same inputs produce different outputs.

Here’s what that looks like in practice:

import { test, expect } from '@playwright/test';

// This test passes ~80% of the time: the assertion races the cart update
test('checkout completes successfully', async ({ page }) => {
  await page.click('#add-to-cart');
  await page.click('#checkout-button');
  // Sometimes the cart hasn't updated yet when this line runs
  expect(await page.textContent('.total')).toBe('$99.00');
});

The cart update is asynchronous. Sometimes it completes before the assertion, sometimes after. The test is correct in principle but broken in practice.
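Modern frameworks address this race with retrying assertions that poll until the UI settles; in Playwright, await expect(page.locator('.total')).toHaveText('$99.00') retries automatically instead of reading the value once. The core idea can be sketched with a hypothetical pollUntil helper (the names and the simulated cart are illustrative only):

```javascript
// Minimal sketch of a retrying assertion, the mechanism behind
// Playwright's auto-waiting expect(...).toHaveText(...).
// pollUntil and cartTotal are hypothetical names for illustration.
function pollUntil(check, maxAttempts = 50) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (check()) return attempt; // condition met on this attempt
  }
  throw new Error(`condition not met after ${maxAttempts} attempts`);
}

// Simulate a cart total that only settles after a few polls.
let frame = 0;
const cartTotal = () => (++frame >= 3 ? '$99.00' : '$0.00');

const attempts = pollUntil(() => cartTotal() === '$99.00');
console.log(attempts); // 3: the assertion tolerated two "too early" reads
```

A real implementation would sleep between attempts and enforce a wall-clock timeout. The point is that the assertion absorbs the nondeterministic delay instead of racing it.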

Four Root Causes of Flakiness

Every flaky test depends on something unstable: timing, selectors, environment, or shared state. Each root cause produces a different failure pattern, but all of them point to the same fix. Remove the unstable dependency entirely.

1. Timing Dependencies

Tests assume operations complete within fixed windows. Network latency spikes. Database queries slow down. CI servers get busy.

A 2024 IEEE study found that 46.5% of flaky tests are “Resource-Affected,” meaning they pass or fail depending on computational resources available during execution. Same code, different machine specs, different results.

2. Selector Fragility

UI tests locate elements using CSS selectors or XPath expressions. When a developer renames a class or restructures a component, selectors break. The application works fine. The test doesn’t.

This is the most common source of flakiness in E2E testing, and it’s built into the architecture of every traditional automation framework.
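The failure mode is easy to see in a toy model. Below, elements are plain objects rather than a real DOM, and the two lookup helpers stand in for querySelector-style versus role-based location (the data and helper names are illustrative, not a real framework API):

```javascript
// Toy model: locating an element by CSS class vs. by its visible role/name.
const before = [{ tag: 'button', className: 'primary-cta', role: 'button', name: 'Submit' }];
const after  = [{ tag: 'button', className: 'btn-v2',      role: 'button', name: 'Submit' }]; // class renamed

const byClass = (els, cls) => els.find(e => e.className === cls) ?? null;
const byRole  = (els, role, name) => els.find(e => e.role === role && e.name === name) ?? null;

console.log(byClass(before, 'primary-cta') !== null); // true: works before the refactor
console.log(byClass(after, 'primary-cta') !== null);  // false: test breaks, app is fine
console.log(byRole(after, 'button', 'Submit') !== null); // true: survives the rename
```

Locating by what the user sees (a button labeled Submit) survives the refactor; locating by implementation detail (a class name) does not.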

3. Environment Inconsistency

Tests that pass locally often become flaky in CI pipelines due to resource constraints and timing differences. Browser versions differ. Screen resolutions vary. System resources fluctuate. Failures appear only in specific contexts, making them nearly impossible to reproduce locally.

4. Shared State Pollution

Tests that don’t properly clean up after themselves contaminate subsequent tests. Without proper test isolation, one test’s side effects become another test’s failure. Test order shouldn’t matter, but it does.
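A toy example of order dependence, with an in-memory db object standing in for any shared resource (a database row, localStorage, a singleton):

```javascript
// Shared state: both tests touch the same cart.
const db = { cart: [] };

function testAddsItem() {
  db.cart.push('widget');
  return db.cart.length === 1; // only passes if the cart started empty
}

function testCartStartsEmpty() {
  return db.cart.length === 0;
}

// Run A then B with no cleanup: B fails through no fault of its own.
console.log(testAddsItem());        // true
console.log(testCartStartsEmpty()); // false: polluted by the previous test

// With per-test isolation, order stops mattering.
function withFreshState(fn) {
  db.cart = []; // reset shared state before each test
  return fn();
}
console.log(withFreshState(testCartStartsEmpty)); // true
console.log(withFreshState(testAddsItem));        // true
```

Real suites get the same effect from beforeEach hooks, per-test database transactions, or fresh browser contexts.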

Each source creates false failures that erode release velocity.

The Hidden Costs of Ignoring Flaky Tests

Every engineering team knows flaky tests are a problem. Few have quantified what they actually cost. When you break it down, the damage compounds across four dimensions.

1. Compute Waste

If your pipeline takes 20 minutes and engineers re-run twice daily due to flaky failures, that’s 40 minutes of wasted compute per developer per day. For a team of 10, that’s 7 hours of CI time. Every day. Our test maintenance calculator shows what flakiness costs your organization annually in compute, context switches, and engineer hours.
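The arithmetic above is worth making explicit (the figures are the paragraph's assumptions, not measurements):

```javascript
// CI waste from flaky re-runs, using the assumed figures above.
const pipelineMinutes = 20;    // one full pipeline run
const rerunsPerDevPerDay = 2;  // re-runs triggered by flaky failures
const teamSize = 10;

const wastedMinutesPerDev = pipelineMinutes * rerunsPerDevPerDay;     // 40 min/day
const teamWastedHoursPerDay = (wastedMinutesPerDev * teamSize) / 60;  // ~6.7, roughly 7 hours
console.log(wastedMinutesPerDev, teamWastedHoursPerDay.toFixed(1));
```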

2. Context Switching

Every flaky failure pulls a developer out of their work. They stop coding, investigate the failure, determine it’s noise, re-run the pipeline, and try to get back to where they were. The mental context they’d built up is gone.

3. Trust Erosion

When tests fail randomly, engineers stop trusting the suite. Real bugs slip through because “it’s probably just flaky.” Dismissing failures becomes the default response, and actual regressions get lost in the noise.

4. Morale Drain

Nothing kills energy faster than debugging tests that aren’t actually broken. Engineers want to build features, not babysit infrastructure.

📊 The Hidden Cost

A 2025 empirical study found that 56% of software practitioners encounter flaky tests daily, weekly, or monthly. The same research cites industrial data showing developers spend 1.28% of their time repairing flaky tests, costing roughly $2,250 per developer per month.

Why Common Fixes Backfire

Faced with these costs, teams fight back. They add retries, quarantine bad actors, dedicate sprints to cleanup. Most of these fixes make the problem worse.

1. Quarantine Queues

Remove flaky tests from the critical path. Fix them later.

In practice, “later” never comes. The quarantine grows. Coverage shrinks. Eventually you’re running half your test suite and calling it good enough.

2. Increased Timeouts

Give everything more time. Maybe the flakiness goes away.

The fragility remains. Now your test suite takes 3x longer to run. You’ve traded one problem for another.

3. Retry Logic

Re-run failed tests automatically until they pass.

You’re paying for extra compute to run broken tests until they accidentally succeed. The underlying issues compound. With 1,000 tests at even 0.1% flake rate each, most of your PRs will still hit a flaky failure.
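The "most of your PRs" claim follows from basic probability. If each test independently flakes with probability p, a suite of n tests runs clean only with probability (1 - p)^n:

```javascript
// Chance a pipeline run hits at least one flaky failure.
const tests = 1000;
const perTestFlakeRate = 0.001; // 0.1% per test, as in the example above

const pCleanRun = Math.pow(1 - perTestFlakeRate, tests); // ~0.37
const pFlakyFailure = 1 - pCleanRun;                     // ~0.63
console.log(pFlakyFailure.toFixed(2)); // 0.63: nearly two-thirds of runs hit a flake
```

Retries lower the odds per run but multiply compute, and the per-test fragility that produced p in the first place is untouched.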

4. Maintenance Sprints

Dedicate a sprint to “test hygiene.” Put engineers on fixing duty.

Tests keep breaking faster than humans can fix them. Engineers rotate onto duty, fix a dozen flaky tests, and watch a dozen more appear next sprint. You’re bailing water from a sinking ship.

Curious how vision-based testing works?

Drop your staging URL. We’ll show you tests that don’t break on every deploy.

Book a Demo

Why Traditional Frameworks Create Flakiness

Retries, quarantines, and timeouts all treat flakiness as a test-level problem. The real issue runs deeper: how traditional frameworks identify elements on the page.

Selenium, Cypress, and Playwright face similar flakiness patterns. They locate elements using selectors like CSS paths, XPaths, test IDs, and data attributes. These selectors couple your tests directly to implementation details. The test doesn’t ask “is this button visible?” It asks “does div.container > button.primary-cta exist?”

When a developer renames a CSS class, restructures a component, updates a UI library, or refactors page layout, the selector breaks. The application works fine and users see the same button, but the test fails because the underlying HTML changed.

Fast-shipping teams pay the highest price. Every deployment is a chance for UI code to diverge from test selectors. Teams deploying daily accumulate selector drift faster than teams deploying monthly, which means the most productive engineering orgs face the steepest flakiness tax.

How Vision-Based Testing Eliminates Flakiness

If selectors are the problem, the solution is obvious: stop using them. Vision-based testing identifies elements the way humans do, by looking at the screen rather than parsing HTML. Pie works this way.

Instead of searching for button#submit-form.primary-cta, a vision-based system finds “the blue Submit button in the bottom-right corner.” The button can be renamed, restyled, or moved. The test keeps working.

1. No Selector Dependencies

Tests don’t break when developers refactor components, rename classes, or update the DOM structure. The button still looks like a submit button, so the test passes. Your frontend team can ship design system updates, component library migrations, or full framework changes without touching a single test file.

2. Adaptive Waiting

Instead of fixed timeouts that guess how long operations take, vision-based systems wait until elements actually appear on screen. The test watches for the loading spinner to disappear and the content to render, exactly like a human would. No more arbitrary sleep(3000) calls that slow down fast environments and fail in slow ones.

3. Environment Resilience

Tests evaluate what’s rendered, not how it’s implemented. Browser version differences, viewport variations, and backend latency fluctuations matter less when the test asks “can I see the checkout button?” rather than “does this XPath resolve?” The same test runs reliably on a developer laptop, in CI, and across staging environments.

4. Self-Healing by Default

When UI changes, the system adapts automatically. Button moved from the sidebar to the header? The test finds it in the new location. Icon replaced with text? Still recognized as a submit action. Self-healing test automation isn’t a feature bolted on after the fact. It’s the natural consequence of testing what users see instead of testing implementation details.

These four capabilities compound. When you remove selector dependencies, adaptive waiting becomes possible. When tests evaluate rendered output, environment resilience follows naturally. The result is a fundamentally different maintenance profile:

Aspect | Selector-Based | Vision-Based
Element identification | CSS/XPath selectors | Visual recognition
UI refactor impact | Tests break | Tests adapt
Framework migration | Rewrite entire suite | No changes needed
Weekly maintenance | 20+ hours | Near-zero

What This Looks Like in Practice

Fi builds AI-powered GPS collars for dogs. For a pet safety company, reliability isn’t optional. When a dog escapes, every second counts. As the product scaled, release validation became a bottleneck.

The pre-Pie reality: 12+ engineers locked in a room for 2-3 days before each release, manually testing core smoke flows. Edge cases got ignored. Bugs slipped through. Release cycles stretched from one day to three.

After switching to vision-based testing with Pie:

  • Release validation dropped from 2-3 days to a few hours
  • Manual testing effort fell by 75%
  • Coverage expanded to edge cases they’d never had bandwidth to test

📊 Customer Result

“Release validation went from two to three days to a few hours. We didn’t have to change how we did things.”

— Philip Hubert, Director of Mobile Engineering, Fi

Read the full case study →

Measure Your Flake Problem

Before fixing flakiness, you need to measure it. Most teams don’t track this systematically, which means they’re flying blind when prioritizing fixes. Here’s a framework to quantify your flake problem.

1. Flake Rate

What percentage of test failures are actual bugs vs. flaky failures? Track failed runs that pass on re-run without code changes.

How to measure: Flag every test failure. If the same test passes on re-run with zero code changes, mark it as flaky. Calculate: (flaky failures / total failures) × 100.

Thresholds: Below 5% is manageable. 5-10% needs dedicated attention. Above 10% is actively blocking your release velocity.
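The calculation above is simple enough to automate against your CI history. A sketch, assuming you can already label each failure as real or flaky:

```javascript
// Flake rate: (flaky failures / total failures) × 100.
function flakeRatePercent(flakyFailures, totalFailures) {
  if (totalFailures === 0) return 0; // no failures, no flake rate
  return (flakyFailures * 100) / totalFailures;
}

console.log(flakeRatePercent(3, 80));  // 3.75: manageable
console.log(flakeRatePercent(12, 80)); // 15: actively blocking release velocity
```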

2. Re-run Frequency

How often do engineers retry failed pipelines? This is the clearest signal of how much flakiness disrupts daily work.

How to measure: Pull CI logs for the past 30 days. Count pipeline runs where the same commit was run multiple times. Divide by total commits merged.

Thresholds: One re-run per week is normal. Daily re-runs per engineer means significant time and compute burning on noise.
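The counting step can be scripted against exported run records, assuming each record carries a commit SHA (the field names here are hypothetical; adapt them to your CI provider's export format):

```javascript
// Count commits whose pipeline ran more than once, i.e. was re-run.
const runs = [
  { sha: 'a1', status: 'failed' },
  { sha: 'a1', status: 'passed' }, // re-run of the same commit
  { sha: 'b2', status: 'passed' },
  { sha: 'c3', status: 'failed' },
  { sha: 'c3', status: 'passed' }, // another re-run
];

const runsPerCommit = new Map();
for (const r of runs) {
  runsPerCommit.set(r.sha, (runsPerCommit.get(r.sha) ?? 0) + 1);
}

const rerunCommits = [...runsPerCommit.values()].filter(n => n > 1).length;
console.log(rerunCommits, runsPerCommit.size); // 2 of 3 commits needed a re-run
```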

3. Investigation Time

How many hours weekly do engineers spend on failures that turn out to be flaky? This is often the largest hidden cost because it’s invisible in sprint tracking.

How to measure: Survey your team for one week. Ask them to log every test failure investigation and whether it turned out to be real or flaky. Most teams are shocked by the number.

4. Quarantine Size

How many tests are currently disabled, skipped, or marked as “flaky-allowed”? A growing quarantine signals architectural problems.

How to measure: Grep your test suite for skip annotations, disabled flags, or retry-until-pass wrappers. Track this number monthly.

What to watch: If quarantine grows faster than your test suite, you’re hiding flakiness rather than fixing it. Coverage is shrinking in disguise.
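The grep step can be a small script. The file contents below are in-memory strings to keep the sketch self-contained, and the skip markers are examples; substitute your framework's annotations and read real files with fs in practice:

```javascript
// Count quarantined tests by scanning source text for skip markers.
// Marker patterns are examples; adjust to your suite's conventions.
const skipMarkers = [/test\.skip\(/, /it\.skip\(/, /test\.fixme\(/, /@flaky/];

function countQuarantined(fileContents) {
  let count = 0;
  for (const src of fileContents) {
    for (const line of src.split('\n')) {
      if (skipMarkers.some(re => re.test(line))) count++;
    }
  }
  return count;
}

const files = [
  "test('checkout', ...)\ntest.skip('flaky cart total', ...)",
  "it.skip('legacy login', ...)\n// @flaky: payment retry",
];
console.log(countQuarantined(files)); // 3
```

Run it monthly and chart the number; the trend matters more than the absolute count.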

5. Coverage Trend

Is effective coverage growing, stable, or shrinking? If teams delete more tests than they add because maintenance hurts too much, coverage regresses even if test count stays flat.

How to measure: Track tests added vs. tests deleted per sprint. A healthy suite should grow with your codebase. If the ratio flips negative, maintenance burden is winning.

Our autonomous discovery breaks this tradeoff by expanding coverage without adding to maintenance load.

Fix the Architecture, Not the Symptoms

Flaky tests aren’t inevitable. They’re the predictable result of coupling tests to implementation details.

Retries won’t fix it. Timeouts won’t fix it. More engineers on maintenance duty won’t fix it. The only fix is removing the coupling entirely.

Vision-based testing does exactly that. Tests evaluate what users see, not how developers built it. When UI changes, tests adapt instead of breaking.

We built Pie because we got tired of watching talented engineers waste their best hours babysitting test infrastructure. If that sounds familiar, we should talk.

See it in action

Watch AI agents test your app the way users actually use it. No scripts, no selectors.

Book a Demo

Frequently Asked Questions

What is a flaky test?

A flaky test produces different results without code changes. Same test, same code, different outcomes. Root causes include timing dependencies, selector fragility, environment inconsistencies, and shared state pollution.

How do I measure my flake rate?

Track test failures that pass on re-run without code changes. Divide by total failures. Anything above 5% needs attention. Above 10% is actively hurting your release velocity.

Why doesn’t retry logic fix flaky tests?

Retries mask the problem without solving it. You’re paying for compute to run broken tests until they accidentally pass. The underlying fragility remains and gets worse as your suite grows.

Why do selectors cause flakiness?

Selectors couple tests to implementation details. When a developer renames a CSS class or restructures a component, selectors break even though the app works fine. The test is testing the code structure, not the behavior.

How is vision-based testing different from Selenium?

Selenium finds elements by code attributes like CSS selectors and XPaths. Vision-based testing finds elements by how they look and behave, the same way a user would. When UI changes, Selenium breaks. Vision-based tests adapt.

Does vision-based testing work with any framework?

Yes. It operates at the rendered UI layer, not the code layer. React, Vue, Angular, Rails, Django, legacy jQuery apps. If users can see it and interact with it, vision-based testing can test it.

Does Pie need access to my codebase?

No. Pie tests at the UI layer. Your codebase stays on your systems. We only interact with what’s rendered on screen, the same way your users do.

How quickly does coverage build?

Most teams see 80% coverage within the first hour. AI agents explore your app autonomously and generate tests for the flows they discover. No scripting required.


Dhaval Shreyas
Co-founder & CEO at Pie

13 years building mobile infrastructure at Square, Facebook, and Instacart. Payment systems, video platforms, the works. Now building the QA platform he wished existed the whole time. LinkedIn →