Why Your Tests Fail Randomly: 5 Root Causes of Flaky Tests
Flaky tests aren't random. Most trace back to five root causes. Here's how to detect each one and decide what to fix or delete.
What you’ll learn
- The architectural patterns that make tests unreliable
- Which causes are most common (with research data)
- Detection strategies that surface problems early
- A framework for deciding what to fix vs. delete
Your test passed yesterday. It failed this morning. Nothing changed.
You re-run it. Green. You deploy. Another failure notification. Same test, same code, different outcome.
If this sounds familiar, you’re dealing with flakiness. And it’s quietly destroying your team’s velocity.
Google’s research found that 16% of their tests exhibit flaky behavior, with flakiness causing 84% of their pass-to-fail transitions. The problem isn’t rare. It’s endemic.
Most teams treat symptoms. They add retries, extend timeouts, or selectively ignore failures. These workarounds buy time, but they don’t solve the problem.
If you’ve tried those fixes and your tests still fail randomly, the root cause runs deeper. This guide breaks down the architectural patterns that make tests unreliable in the first place.
What Makes a Test Flaky
A test is flaky when it produces different results (pass or fail) without any change to the code under test.
The defining characteristic is non-determinism in tests. Given the same inputs, the test should produce the same output every time. When it doesn’t, something external to the code is influencing the outcome.
Common misconceptions you might have heard:
- “It’s a bad test”: Sometimes. But more often, the test reveals a legitimate timing or state issue in your application itself.
- “Just add more retries”: This masks the symptom without addressing the cause. You’re spending CI minutes on a lie.
- “It only happens in CI”: Usually indicates environmental differences worth understanding, not dismissing.
The goal isn’t to eliminate all complexity. It’s to ensure that test outcomes reflect code quality, not infrastructure lottery.
Five Root Causes of Flaky Tests
Flakiness is getting worse, not better. The Bitrise 2025 Mobile Insights Report found 26% of teams now experience test flakiness, up from 10% in 2022. That’s a 160% increase in three years. AI-assisted “vibe coding” likely accelerates this as developers ship generated code they don’t fully understand.
The TestDino 2026 Flaky Test Benchmark synthesizes a decade of root cause research. These five causes, ranked by frequency, account for the vast majority of flaky test failures:
| Cause | Frequency | Primary Fix |
|---|---|---|
| Async Wait Issues | Most common | Explicit waits, event-based assertions |
| Concurrency Problems | Very common | Test isolation, proper locking |
| Shared State Pollution | Common | Database/cache isolation |
| Test Order Dependency | Common | Stateless setup/teardown |
| Environmental Drift | Less common | Containerized test environments |
1. Async Wait Issues
Nearly half of all flaky tests fail because of timing assumptions. This is the single most common cause of flakiness, and it’s also one of the easiest to introduce accidentally. Any test that doesn’t explicitly wait for a condition before asserting is a candidate for async-related failures.
The pattern looks like this:
- Test clicks a button
- Test immediately checks for a result
- Result hasn’t loaded yet
- Test fails
- On retry, timing happens to align
- Test passes
Why it happens: Modern applications are asynchronous. API calls, animations, lazy loading, and WebSocket events all happen on unpredictable timelines. Tests written with implicit assumptions about “fast enough” break when those assumptions don’t hold.
The sleep() trap: Teams often add sleep(2000) to “fix” the flakiness. This creates two problems:
- Tests become artificially slow (multiplied across hundreds of tests)
- The arbitrary duration may still be too short under load
The fix: Replace static waits with explicit conditions. Instead of waiting for time, wait for the specific state you need:
```javascript
// Bad: timing assumption
await page.click('#submit');
await sleep(2000); // arbitrary wait, still fails under load
expect(await page.textContent('#result')).toBe('Success');

// Good: explicit condition
await page.click('#submit');
await page.waitForSelector('#result:has-text("Success")');
```
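The same explicit-wait idea works outside any particular framework. Below is a minimal polling helper, a sketch only; `waitFor` and its options are illustrative names, not a real library API:

```javascript
// Poll a condition until it returns a truthy value or the timeout
// elapses. A framework-agnostic stand-in for explicit waits.
async function waitFor(condition, { timeout = 5000, interval = 50 } = {}) {
  const deadline = Date.now() + timeout;
  for (;;) {
    const result = await condition();
    if (result) return result; // wait for state, not for time
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeout}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
}
```

The key design choice is that the helper resolves as soon as the condition holds, so the happy path stays fast while slow environments simply wait longer, up to the cap.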
If you’re working with Playwright specifically, we’ve compiled proven strategies for fixing flaky Playwright tests that go beyond basic waits.
Pie eliminates this problem by design. Instead of waiting for DOM elements that may or may not be ready, Pie uses computer vision to interact with applications the way users do. No brittle selectors. No timing assumptions. The test observes what’s actually rendered on screen and responds accordingly.
2. Concurrency Problems
Tests running in parallel can step on each other. This is the second most common cause of flakiness, and it’s particularly frustrating because each test in isolation works perfectly. The problem only emerges when tests run simultaneously, making it difficult to reproduce and debug.
As one QA lead at a fintech company put it: “Tests pass fine individually, but it gets way too flaky when we try to combine them into a single run.”
Common patterns:
- Race conditions in the application: Two tests hit the same endpoint simultaneously. One creates a record, the other expects an empty state. Results become unpredictable.
- Shared resources without isolation: Tests modify the same database records. Test A writes data. Test B reads it mid-transaction. Test B gets an inconsistent view.
- Thread-safety assumptions: A test assumes a singleton is in a known state. Another test modifies it. The assumption breaks.
The fix: True test isolation. Each test should operate as if it’s the only test running:
- Unique identifiers for all test data (timestamps, UUIDs)
- Separate database schemas or transactions per test
- Containerized dependencies with no shared state
3. Test Order Dependency
Tests should be independent. When they’re not, running them in a different order produces different results.
How it happens:
- Test A creates a user, doesn’t clean up
- Test B assumes a clean user table
- Run A then B: passes
- Run B then A: fails
The telltale sign: Tests pass locally but fail in CI. Or they pass in one runner but fail in another. Randomized test order exposes the dependency.
The fix: Proper setup and teardown. Every test is responsible for its own preconditions:
```javascript
// Each test creates what it needs
beforeEach(async () => {
  await db.clean('users');
  await db.create('users', testUser);
});

afterEach(async () => {
  await db.clean('users');
});
```
Test frameworks support randomized ordering specifically to catch these issues early: Jest's --randomize flag and pytest's pytest-randomly plugin, for example. Run your suite with randomization enabled once a week, even if your regular CI runs don't use it.
4. Shared State Pollution
Beyond test order, shared state creates subtler problems that are harder to diagnose. While test order dependency is usually obvious once you look for it, shared state pollution can hide in layers of infrastructure you rarely examine. Caches, configuration stores, and browser storage all hold state that tests may unknowingly depend on.
Common patterns:
- Caches that persist: A test populates a cache. The next test expects fresh data from the database. It gets stale cache values instead.
- Global configuration changes: A test modifies a feature flag for its scenario. It forgets to reset it. Subsequent tests run under unexpected conditions.
- Browser state leakage: Cookies, local storage, and service workers all persist between tests unless explicitly cleared.
The fix: Reset everything. Not just the database:
- Clear caches between tests
- Reset configuration to defaults
- Fresh browser context per test or per suite
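One way to make "reset everything" systematic is a small registry that each stateful layer contributes a cleanup callback to; `onReset` and `resetAll` are illustrative names, not a framework API:

```javascript
// Collect reset callbacks from every stateful layer and run them all
// between tests, so no layer is forgotten.
const resetHooks = [];

function onReset(hook) {
  resetHooks.push(hook);
}

async function resetAll() {
  for (const hook of resetHooks) {
    await hook();
  }
}

// Example: an in-memory cache registers its own cleanup once;
// calling resetAll() from afterEach would then wipe it between tests.
const cache = new Map();
onReset(() => cache.clear());
```

The advantage over ad-hoc cleanup is that a new cache or config store only has to register once, instead of every test file remembering to clear it.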
Tired of Debugging Flaky Tests?
See how teams eliminate test maintenance with autonomous QA.
Book a Demo

5. Environmental Drift
The test environment differs from where tests were written. This category accounts for the smallest percentage of flakiness, but it’s often the most confusing to debug because the same test produces different results on different machines with no obvious explanation.
Local vs. CI differences:
- Different OS (macOS locally, Linux in CI)
- Different browser versions
- Different timezone settings
- Different resource constraints (memory, CPU)
Infrastructure variability:
- Network latency fluctuations
- Third-party service availability
- DNS resolution timing
Time-sensitive tests:
- Tests that check “created today” fail at midnight
- Tests that verify “expires in 24 hours” fail depending on when they run
The fix: Containerization and mocking. Pin versions, mock external services, control time:
```javascript
// Mock current time for predictable behavior
jest.useFakeTimers();
jest.setSystemTime(new Date('2026-03-15T10:00:00Z'));
```
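An alternative to mocking the global clock is designing time-dependent code to accept the current time as a parameter, which makes midnight-boundary tests deterministic without any framework support; `isExpired` below is an illustrative sketch:

```javascript
// Accepting "now" as a parameter makes expiry logic deterministic:
// tests pass a fixed timestamp, production uses the default clock.
function isExpired(createdAtMs, ttlMs, now = Date.now()) {
  return now - createdAtMs >= ttlMs;
}
```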
Detection Strategies That Actually Work
Now that you understand the root causes, the next step is finding which tests in your suite are actually flaky. You can’t fix what you don’t measure.
Most teams discover flakiness reactively, when a test blocks a deploy or wastes an engineer’s morning. Proactive detection catches problems before they disrupt workflows. Three approaches work in practice:
1. Repeated Execution
Run each test multiple times (3-10x) before merging. If it fails any run, flag it. Most frameworks and CI systems support this natively: Playwright's --repeat-each flag, pytest's pytest-repeat plugin, or wrapping tests in a loop in your CI config.
- Pros: Simple to implement, catches obvious flakiness
- Cons: Multiplies CI time, may miss rare failures
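If your runner has no repeat flag, the strategy is easy to sketch by hand; `flakyCheck` is a hypothetical helper, not a framework API:

```javascript
// Run a test body N times and report whether its result was consistent.
// A mix of passes and failures with no code change signals flakiness.
async function flakyCheck(testFn, runs = 10) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    try {
      await testFn();
      passes++;
    } catch {
      // a failure in any run counts against consistency
    }
  }
  return { runs, passes, flaky: passes > 0 && passes < runs };
}
```

Note that a test failing all N runs is reported as not flaky: it is consistently broken, which is a different (and easier) problem.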
2. Historical Analysis
Track pass/fail rates over time. A test that fails 5% of runs across commits is flaky. Tools like Datadog CI Visibility, BuildPulse, or custom dashboards aggregating test results over time can surface these patterns automatically.
- Pros: Catches patterns, provides data for prioritization
- Cons: Requires infrastructure investment, delayed feedback
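The 5% rule can be computed directly from stored run history. This sketch assumes `history` is a plain map from test name to an array of recent pass/fail booleans; the function name and threshold are illustrative:

```javascript
// Flag tests whose failure rate over recent runs crosses a threshold,
// excluding tests that fail every run (those are broken, not flaky).
function findFlakyTests(history, threshold = 0.05) {
  return Object.entries(history)
    .map(([name, runs]) => {
      const failures = runs.filter((passed) => !passed).length;
      return { name, failureRate: failures / runs.length };
    })
    .filter(({ failureRate }) => failureRate > threshold && failureRate < 1)
    .map(({ name }) => name);
}
```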
3. Quarantine with Monitoring
Move known-flaky tests to a separate suite. Run them, log results, but don’t block merges. The quarantine suite should still run on every commit so you can track when tests stabilize.
- Pros: Unblocks development, preserves test coverage intent
- Cons: Tests in quarantine often stay there forever
The best approach combines all three: repeated execution for new tests, historical tracking for the full suite, and quarantine as a last resort with weekly review.
When to Fix vs. Quarantine
Once you’ve identified your flaky tests, you need a triage strategy. Not all flaky tests deserve the same investment.
Fix immediately:
- Tests covering critical user flows (checkout, login, payment)
- Tests with high flakiness rate (>10% failure)
- Tests blocking multiple developers daily
Quarantine with timeline:
- Tests covering edge cases with low business impact
- Tests requiring infrastructure changes to fix properly
- Tests with complex root causes requiring investigation
Delete permanently:
- Tests that have been flaky for months without resolution
- Tests covering deprecated features
- Tests where the cost of fixing exceeds the value of the coverage
The trap is infinite quarantine. Set a policy: tests in quarantine for more than 30 days must be fixed or deleted. No exceptions.
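These triage rules can even be encoded as a small decision function, which makes the policy explicit and reviewable instead of tribal knowledge; all thresholds and field names here are illustrative, not prescriptive:

```javascript
// Encode the fix/quarantine/delete policy as a pure function.
function triage({ critical = false, failureRate = 0, daysInQuarantine = 0 }) {
  if (daysInQuarantine > 30) return 'fix-or-delete'; // no infinite quarantine
  if (critical || failureRate > 0.10) return 'fix';
  return 'quarantine';
}
```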
A test suite is only valuable if teams trust it. Every flaky test that “probably isn’t a real failure” erodes that trust. Eventually, engineers stop paying attention to failures entirely. Then real bugs ship.
How Vision-Based Testing Addresses Root Causes
The five causes above share a common thread: tests making assumptions about things they can’t directly observe. DOM state, timing, selectors, environment variables.
Vision-based testing takes a different approach. Instead of querying the DOM for element state, platforms using autonomous discovery interact with applications the way users do: by looking at the screen.
Why this matters for each root cause:
- Async waits: Visual verification confirms elements are actually rendered and ready, not just present in the DOM tree.
- Concurrency: Each test runs in visual isolation, unaffected by state changes from parallel runs.
- Test order dependency: Tests identify elements by appearance, not by assumptions about prior state.
- Shared state pollution: No reliance on cached selectors or DOM references that can go stale.
- Environmental drift: Visual testing works the same regardless of browser version or OS differences.
If you’ve fixed the architectural issues above and flakiness persists, the problem may be the testing paradigm itself. Selector-based automation embeds assumptions into every test. Vision-based testing removes them.
Stop Treating Symptoms. Fix the Architecture.
Flaky tests aren’t a testing problem. They’re an architecture problem manifesting in your test suite.
The five causes (async waits, concurrency, test order dependency, shared state, and environmental drift) all share a common thread: tests that assume more than they should about timing, state, or environment.
The fix isn’t more retries or longer sleeps. It’s building tests that make no assumptions beyond the code they verify. Explicit waits. True isolation. Controlled environments.
For teams ready to skip the maintenance treadmill entirely, our autonomous testing platform handles these complexities by design. Computer vision means no selectors to break. Self-healing tests mean no manual fixes when your UI changes. Tests stay green because they see what users see, not what the DOM exposes.
Your test suite should reflect your code quality. Nothing more, nothing less.
Less Maintenance. More Shipping.
See how teams are making the shift to zero-maintenance testing.
Book a Demo

SOC 2 Type II certified · No source code access
Frequently Asked Questions
How many flaky tests are acceptable?
Any test that fails without a code change is too many. Google's research found 16% of their tests exhibit flaky behavior. The goal is zero tolerance with systematic elimination, not an acceptable threshold.
Should you fix every flaky test?
Fix the high-value ones, delete the rest. A test covering critical checkout flow deserves debugging. A test validating a tooltip animation that breaks weekly deserves deletion. Trust your eyes for edge cases.
Can autonomous testing tools reduce flakiness?
Yes, if the platform handles waits and retries intelligently. Self-healing tests adapt to UI changes automatically, removing the manual maintenance that causes most flakiness.
How does Pie eliminate flaky tests?
Pie uses computer vision instead of DOM selectors, so it sees what users see rather than making assumptions about element states. This eliminates the timing and selector brittleness that cause most flakiness. Tests observe actual rendered content, not hidden DOM properties.
How do I justify fixing flaky tests to leadership?
Show them the math. If 10 engineers spend 30 minutes daily investigating flaky failures, that's 100+ hours per month. Multiply by loaded cost. The number usually ends the debate.
Why do tests pass locally but fail in CI?
CI environments have different resources, timing, and network conditions than your local machine. Common culprits: hardcoded timeouts that work locally but not on slower CI runners, tests that depend on local services or network access, and race conditions that only surface under CI's resource constraints.
What's the difference between a slow test and a flaky test?
Slow tests take too long but produce consistent results. Flaky tests produce inconsistent results regardless of speed. A test can be both, but they require different fixes. Slow tests need optimization. Flaky tests need architectural changes to eliminate non-determinism.
References
- TestDino. “2026 Flaky Test Benchmark” — Synthesis of flaky test root cause research
- Bitrise. “2025 Mobile Insights Report” — 26% of teams experience flakiness, up from 10% in 2022
- Google Testing Blog. “Flaky Tests at Google and How We Mitigate Them” — 16% flaky tests, 84% pass-to-fail transitions
- Luo, Q., Hariri, F., Eloussi, L., & Marinov, D. “An Empirical Analysis of Flaky Tests” (FSE 2014) — Foundational root cause analysis
13 years building mobile infrastructure at Square, Facebook, and Instacart. Payment systems, video platforms, the works. Now building the QA platform he wished existed the whole time. LinkedIn →