How to Reduce Flaky UI Tests Without Rewriting the Whole Framework

Flaky UI tests are usually treated like a framework problem, but in many teams they are really a maintenance and observability problem. The temptation is to assume the whole test stack needs replacing because the same failures keep coming back in CI. In practice, you can often reduce flaky UI tests significantly by tightening a few weak points: locator stability, waits, test isolation, retry policy, and environment issues.

The key is to stop thinking in terms of a full rewrite first. A rewrite is expensive, risky, and slow to prove. Most teams need quick wins that reduce noise, improve trust in the pipeline, and buy time for a more deliberate strategy.

If a test fails sometimes, passes on rerun, and nobody can explain why, the issue is rarely “just instability.” It is usually a combination of timing, selector fragility, state leakage, and environment drift.

What flakiness actually looks like

Flaky UI tests do not all fail for the same reason. Before changing code, classify the failure. That helps you avoid applying the wrong fix.

Common symptoms include:

A test passes locally but fails in CI.
A test fails only on a specific browser or viewport.
A rerun passes without any code changes.
The failure appears at a different step each time.
The failure message is generic, such as “element not found,” “timeout exceeded,” or “detached from DOM.”

These symptoms can come from multiple causes at once. For example, a locator might be fragile, but the underlying issue only becomes visible because the environment is slower in CI. Fixing only the timeout may hide the symptom, not the cause.

A useful mental model is to group failures into four buckets:

Selector problems: the test is pointing at the wrong element or an unstable element.
Timing problems: the UI is not ready when the test interacts with it.
State problems: previous tests, shared data, or background jobs change the test outcome.
Environment problems: network, browser, build, data, or service differences make the same test behave differently.

The rest of this guide follows that structure.
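As a rough illustration of the four buckets, the sketch below maps a failure message to candidate buckets. The bucket names, the `suggestBuckets` function, and the keyword patterns are all assumptions made for this example; real messages vary by tool and version, and a message can legitimately match more than one bucket.

```typescript
// Hypothetical bucket names mirroring the four groups described above.
type Bucket = 'selector' | 'timing' | 'state' | 'environment';

// Illustrative keyword hints only; adjust them to the messages your tool emits.
const HINTS: Array<{ bucket: Bucket; pattern: RegExp }> = [
  { bucket: 'selector', pattern: /element not found|no such element|strict mode violation/i },
  { bucket: 'timing', pattern: /timeout|detached from dom|not (visible|enabled)/i },
  { bucket: 'state', pattern: /already exists|duplicate|session expired|unauthorized/i },
  { bucket: 'environment', pattern: /econnrefused|\bdns\b|net::err|browser (closed|crashed)/i },
];

// Returns every bucket whose pattern matches, as a starting point for triage.
function suggestBuckets(message: string): Bucket[] {
  return HINTS
    .filter(({ pattern }) => pattern.test(message))
    .map(({ bucket }) => bucket);
}
```

An empty result means the message gave no hint, which is itself a reason to look at screenshots or traces before changing the test.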

Start with the failure history, not the fix

Before you change tests, collect a short history of the failure. You do not need a full analytics platform to do this well.

Look for:

The exact step that failed most often.
Whether reruns succeed on the same machine.
Whether the failure is tied to one browser or browser version.
Whether the test depends on seed data, accounts, feature flags, or third-party services.
Whether the DOM changed recently on the failing page.
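A short history like this can be summarized with a few lines of code. The sketch below assumes a hypothetical `FailureRecord` shape that you would fill from your own CI reports; none of the field names come from a specific tool.

```typescript
// Hypothetical record shape; adapt the fields to whatever your CI reports expose.
interface FailureRecord {
  test: string;
  step: string;            // the step that failed
  browser: string;
  passedOnRerun: boolean;
}

interface FailureSummary {
  total: number;
  topStep: string | null;              // step that failed most often
  rerunPassRate: number;               // share of failures that passed on rerun
  byBrowser: Record<string, number>;   // failure count per browser
}

function summarizeFailures(records: FailureRecord[]): FailureSummary {
  const byStep: Record<string, number> = {};
  const byBrowser: Record<string, number> = {};
  let rerunPasses = 0;
  for (const r of records) {
    byStep[r.step] = (byStep[r.step] ?? 0) + 1;
    byBrowser[r.browser] = (byBrowser[r.browser] ?? 0) + 1;
    if (r.passedOnRerun) rerunPasses += 1;
  }
  // Pick the step with the highest failure count (first one wins on ties).
  let topStep: string | null = null;
  for (const step of Object.keys(byStep)) {
    if (topStep === null || byStep[step] > byStep[topStep]) topStep = step;
  }
  return {
    total: records.length,
    topStep,
    rerunPassRate: records.length === 0 ? 0 : rerunPasses / records.length,
    byBrowser,
  };
}
```

A high rerun pass rate concentrated on one step or one browser is usually a stronger lead than the raw failure count.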

If your CI can expose screenshots, videos, traces, or DOM snapshots, use them. Playwright can record traces, videos, and screenshots, and Selenium can capture screenshots and page source, which is usually enough evidence to spot whether the issue is a missing element, a late render, or an unexpected modal.

A simple debugging rule helps:

Do not change the wait strategy until you can answer whether the element was absent, hidden, disabled, detached, or simply not yet ready.

That distinction matters because each failure mode suggests a different repair.

Fix locator stability first

If you want the fastest path to reduce flaky UI tests, start with locators. Fragile selectors are still one of the most common causes of UI tests turning red when product behavior has not changed.

Prefer stable attributes over structure

Selectors built on DOM position, nested containers, or CSS classes are often brittle. A refactor, styling change, or component library upgrade can break them without affecting the user journey.

Better options include:

data-testid, data-qa, or similar test-only attributes.
Semantic roles and accessible names.
Stable labels, button text, or unique text content.
URLs or route segments for page-level checks.

For example, in Playwright, this is usually more stable than a CSS chain based on layout:

typescript

await page.getByRole('button', { name: 'Save changes' }).click();

And this is often more brittle:

typescript

await page.locator('.settings-panel .actions button:nth-child(2)').click();

The first selector describes how a user finds the element. The second describes how the page happens to be arranged today.

Make locator contracts explicit

Teams often say they use data-testid, but the values are inconsistent, duplicated, or applied only in some components. Treat locator strategy as a contract.

Practical rules:

Choose one or two approved locator conventions.
Document when to use role, text, or test-id selectors.
Reserve structural selectors for cases where no stable semantic alternative exists.
Review locators during code review, not only during test failure triage.
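One way to make such a contract checkable is a small validation step over the test IDs collected from your components. The sketch below assumes a kebab-case naming convention and a hypothetical `checkTestIds` helper; how you collect the IDs is left to your own tooling.

```typescript
// Assumed convention for this sketch: lowercase kebab-case, e.g. "settings-save-button".
const TEST_ID_PATTERN = /^[a-z][a-z0-9]*(-[a-z0-9]+)*$/;

interface TestIdReport {
  invalid: string[];     // values that break the naming convention
  duplicates: string[];  // values used more than once
}

function checkTestIds(ids: string[]): TestIdReport {
  // Count how often each ID occurs.
  const counts: Record<string, number> = {};
  for (const id of ids) counts[id] = (counts[id] ?? 0) + 1;
  const unique = Object.keys(counts);
  return {
    invalid: unique.filter((id) => !TEST_ID_PATTERN.test(id)),
    duplicates: unique.filter((id) => counts[id] > 1),
  };
}
```

Running a check like this in CI turns "we use data-testid" from a habit into something a reviewer can verify.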

Watch out for dynamic IDs and auto-generated class names

UI frameworks often generate IDs, hash class names, or re-render components in ways that make selectors unstable. If a test is bound to a generated ID, the failure may happen after a library update, a build optimization, or even a data change.

If you cannot change the app code immediately, add a thin testability layer where you can. That can be as simple as exposing consistent data-testid hooks on high-value components.
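To find the selectors that most need such hooks, a heuristic scan of the existing suite can help. The patterns and the `brittleReasons` function below are assumptions for illustration; they will produce false positives and should be tuned to the class-name and ID schemes your build actually emits.

```typescript
// Heuristic patterns only; none of them is authoritative.
const BRITTLE_PATTERNS: Array<{ reason: string; pattern: RegExp }> = [
  // Position-based selection such as :nth-child(2).
  { reason: 'positional pseudo-class', pattern: /:nth-(child|of-type)\(/ },
  // Class ending in a mixed letter/digit suffix, e.g. ".css-1a2b3c".
  {
    reason: 'hashed class name',
    pattern: /\.[\w-]*[_-](?=[A-Za-z0-9]*\d)(?=[A-Za-z0-9]*[A-Za-z])[A-Za-z0-9]{5,}\b/,
  },
  // ID containing a run of digits, e.g. "#input-48213".
  { reason: 'generated id', pattern: /#[\w-]*\d{3,}/ },
];

// Returns the reasons a selector looks brittle; an empty array means no pattern matched.
function brittleReasons(selector: string): string[] {
  return BRITTLE_PATTERNS
    .filter(({ pattern }) => pattern.test(selector))
    .map(({ reason }) => reason);
}
```

Selectors flagged here are good first candidates for a role-based locator or a dedicated test ID.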

Replace blind sleeps with condition-based waits

Hard sleeps are one of the quickest ways to create flakiness. They make tests slow when the UI is fast and still unreliable when the UI is slow.

Wait for the right condition

A test should wait for something meaningful, not just for time to pass.

Examples of meaningful conditions:

Element is visible.
Element is enabled.
Network request completes.
Loading indicator disappears.
URL changes to the expected route.
A specific API response is returned.

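In Playwright, actions such as click() already wait for the element to be visible and enabled, so an explicit wait on the button mainly serves to state the precondition: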

typescript

const save = page.getByRole('button', { name: 'Save changes' });
await save.waitFor({ state: 'visible' });
await save.click();

In Selenium with Python, prefer explicit waits over time.sleep():

python

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is an existing WebDriver instance.
# Wait up to 10 seconds for the button to become clickable, then click it.
wait = WebDriverWait(driver, 10)
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '[data-testid="save-changes"]')))
button.click()
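For conditions your test tool does not wait on by itself, such as backend state checked through an API, the same idea can be written as a small polling helper. This is a minimal sketch, not a replacement for the built-in waits; `waitUntil`, `WaitOptions`, and the default values are names and numbers chosen for this example.

```typescript
interface WaitOptions {
  timeoutMs?: number;   // give up after this long
  intervalMs?: number;  // pause between checks
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `condition` until it returns true, or throws once the timeout has elapsed.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 5000, intervalMs = 100 }: WaitOptions = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await sleep(intervalMs);
  }
}
```

Unlike a fixed sleep, this returns as soon as the condition holds and fails with a clear message when it never does.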

Do not over-wait

Teams sometimes react to flakes by inflating every timeout. That can make failures rarer, but it also hides regressions and slows the suite. A 30-second blanket timeout on every step is not a fix.

Instead:

Use short default timeouts.
Increase wait only where the app has known asynchronous behavior.
Wait for a specific event or element, not for a whole page to become “settled.”
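One way to keep this discipline visible is to store the exceptions next to a short default, each with a stated reason. The step names and millisecond values below are hypothetical; the point is that a longer timeout has to be justified per step.

```typescript
// Hypothetical values; pick a default that matches how fast your app normally responds.
const DEFAULT_TIMEOUT_MS = 5000;

// Only steps with known asynchronous behavior get a longer budget, each with a reason.
const TIMEOUT_OVERRIDES: Record<string, { ms: number; reason: string }> = {
  'report-export': { ms: 20000, reason: 'server-side file generation' },
  'search-results': { ms: 8000, reason: 'debounced query plus API call' },
};

// Returns the override for a step if one is registered, otherwise the short default.
function timeoutFor(step: string): number {
  const hasOverride = Object.prototype.hasOwnProperty.call(TIMEOUT_OVERRIDES, step);
  return hasOverride ? TIMEOUT_OVERRIDES[step].ms : DEFAULT_TIMEOUT_MS;
}
```

A growing override table is also a useful signal: it shows where the app, not the test, is slow.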

Align waits with app behavior

If a page renders data after an API call, the test can wait for the response and then assert on the DOM. If a component uses debounced input or animation, wait for the final state, not the first intermediate render.

This is where frontend and QA collaboration matters. The best wait strategy depends on how the app is built, not just on the test tool.

Improve test isolation before chasing retries

A test that depends on hidden state will keep failing until that state is controlled. Retrying the same bad setup is just a way to get a random pass.

Reset data and session state

Check for leakage in:

Cookies, local storage, and session storage.
Shared test accounts.
Database records created by previous tests.
Cached backend state.
Feature flags that differ by environment.

If you run tests in parallel, isolation becomes even more important. Two tests using the same user, cart, or document can interfere with each other in ways that look like timing failures.

Practical isolation improvements include:

Create unique test data per run.
Log in with per-test or per-worker accounts.
Reset state through APIs before the UI test starts.
Clean up records after each test, or use disposable environments when possible.
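The first two improvements can be sketched with a pair of small helpers. Here `runId` and `workerIndex` are assumed to come from your CI and test runner, and the function names and the example.test address format are made up for this illustration.

```typescript
let counter = 0;

// Builds a name that is unique per run, per worker, and per call.
function uniqueName(prefix: string, runId: string, workerIndex: number): string {
  counter += 1;
  return `${prefix}-${runId}-w${workerIndex}-${counter}`;
}

// Derives a per-worker account so parallel tests never share a user.
function workerAccount(runId: string, workerIndex: number): { email: string } {
  return { email: `e2e+${runId}-w${workerIndex}@example.test` };
}
```

Because the run and worker are encoded in the value, leftover records are also easy to trace back to the run that created them.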

Keep end-to-end tests narrow

Not every scenario belongs in the browser. If a setup step can be done through API calls, database fixtures, or a backend seed routine, use that. The more work the UI test does just to set up its world, the more opportunity there is for flakiness.

A good UI test usually validates a user path, not the entire system from scratch.

Treat retry policy as a diagnostic tool, not a cure

Retries can reduce noise, but they are easy to abuse. If a flaky test fails once and then passes on the second try, your pipeline looks healthier while the underlying problem remains.

When retries help

Retries are useful when:

You have intermittent infrastructure issues.
There is known external service instability.
A rare browser glitch is being investigated.
You need to keep the pipeline moving while a root cause is being fixed.

When retries hurt

Retries are a bad idea when:

They cover up a broken locator.
They mask race conditions in the app.
They allow state leakage to go unnoticed.
They make failures harder to reproduce.

A sensible retry policy is limited, visible, and temporary. It should log the first failure, keep the original evidence, and report how often retries are being used.

If a test needs repeated retries to pass, treat that as a signal to investigate, not as proof that the test is healthy.

A common pattern is to allow one retry for selected high-value integration flows while you stabilize the suite. That can be reasonable, but only if the team also tracks the underlying cause and works it down.
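A limited, visible retry can be sketched as a wrapper that keeps the first error and records how many attempts each test needed. This is an illustration under assumed names (`runWithRetry`, `retryLog`, `flakyTests`); most test runners have their own retry settings, and this sketch does not replace them.

```typescript
interface RetryRecord {
  name: string;
  attempts: number;
  passed: boolean;
  firstError: string | null;  // evidence from the first failure, kept even if a retry passes
}

// Shared log so the team can report how often retries are actually used.
const retryLog: RetryRecord[] = [];

// Runs `testFn`, retrying at most `maxRetries` times (one retry by default).
async function runWithRetry<T>(
  name: string,
  testFn: () => Promise<T>,
  maxRetries = 1,
): Promise<T> {
  let firstError: string | null = null;
  for (let attempt = 1; ; attempt++) {
    try {
      const result = await testFn();
      retryLog.push({ name, attempts: attempt, passed: true, firstError });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (firstError === null) firstError = message;
      if (attempt > maxRetries) {
        retryLog.push({ name, attempts: attempt, passed: false, firstError });
        throw err;
      }
    }
  }
}

// Tests that passed only after a retry are the ones to investigate.
function flakyTests(): string[] {
  return retryLog.filter((r) => r.passed && r.attempts > 1).map((r) => r.name);
}
```

Reporting the output of `flakyTests()` after each run keeps retried passes from silently counting as healthy.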

Separate environment issues from test issues

A surprising amount of flakiness comes from the environment rather than the test code itself. That is especially true when failures happen more often in CI than locally.

Check the usual suspects

Look at:

Browser version mismatches.
Different viewport sizes or device profiles.
Slower CPU or memory pressure in CI runners.
Network instability or throttling.
Third-party dependencies, such as auth providers, analytics, or payment sandboxes.
Feature flags or configuration differences between environments.

A test can be perfectly written and still fail if the CI environment is starved or inconsistent.
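Comparing environments is easier when both sides are written down in the same shape. The `EnvDescriptor` fields and the `diffEnvironments` function below are assumptions for this sketch; include whatever actually differs between your local and CI runs.

```typescript
// Hypothetical descriptor of one environment.
interface EnvDescriptor {
  browser: string;
  browserVersion: string;
  viewport: string;                       // e.g. "1280x720"
  featureFlags: Record<string, boolean>;
}

// Lists every field and feature flag whose value differs between the two environments.
function diffEnvironments(local: EnvDescriptor, ci: EnvDescriptor): string[] {
  const differences: string[] = [];
  const fields: Array<'browser' | 'browserVersion' | 'viewport'> = [
    'browser',
    'browserVersion',
    'viewport',
  ];
  for (const field of fields) {
    if (local[field] !== ci[field]) {
      differences.push(`${field}: local=${local[field]} ci=${ci[field]}`);
    }
  }
  // Compare the union of flag names so a flag missing on one side is reported too.
  const flagNames = Object.keys({ ...local.featureFlags, ...ci.featureFlags }).sort();
  for (const flag of flagNames) {
    if (local.featureFlags[flag] !== ci.featureFlags[flag]) {
      differences.push(`flag ${flag}: local=${local.featureFlags[flag]} ci=${ci.featureFlags[flag]}`);
    }
  }
  return differences;
}
```

An empty result does not prove the environments are identical, but a non-empty one gives you a concrete list to rule out before touching the test.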

Make environments as boring as possible

The more variable the environment, the more noise you get. Standardize runner images, browser versions, and test data setup. If you can, pin dependencies and make your CI test containers reproducible.

A minimal GitHub Actions job might look like this:

name: ui-tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run test:e2e

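That does not remove flakiness by itself, but it reduces one source of uncertainty, which makes the remaining failures easier to diagnose. For stricter reproducibility, pin the runner image to a specific version (for example ubuntu-22.04 instead of ubuntu-latest) and keep the Playwright version fixed in your lockfile so that npm ci installs the same browsers on every run.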

Look for race conditions in the product, not only in the test

Some flaky UI tests expose real application bugs. The test is only the messenger.

Common product-side race conditions include:

A button becomes clickable before the backend is ready.
A spinner disappears before data is fully populated.
A modal opens while another event is still in progress.
Client-side rendering occurs in stages, and the test reads the DOM in the middle.

If your test is failing because the app exposes an unstable intermediate state, fixing the test alone is not enough. The UI should not present an actionable control before the underlying state is ready.

When you investigate, ask whether the app is making an invalid state visible to the test, or whether the test is simply observing state too early.

Reduce maintenance overhead with a thinner test layer

If your test suite has become difficult to edit, flakiness often grows because the cost of change is too high. Teams then avoid touching weak tests until they fail hard in CI.

That is one reason some teams look at tools such as Endtest, which provide a more editable test layer and agentic AI workflows for creating and maintaining tests without having to rewrite everything from scratch. In particular, Endtest can help reduce locator and maintenance work by letting teams import existing suites and edit the resulting steps in the platform rather than hand-translating every selector and assertion.

This is not the only path, and it is not always the right path, but the underlying idea is useful: if maintenance is cheap, you can fix unstable tests faster.

For teams already carrying a lot of legacy Selenium, Playwright, or Cypress coverage, a migration aid like AI Test Import can be a practical way to bring tests into a more editable layer incrementally instead of rewriting everything at once.

A practical triage order that usually works

When several flaky tests are failing at once, do not try to refactor everything. Use a simple order of operations.

1. Confirm the failure is real
   - Re-run once, but keep the original evidence.
   - Check whether the same step and same selector fail.
2. Inspect the locator
   - Replace brittle selectors first.
   - Prefer roles, labels, or stable test IDs.
3. Fix the wait condition
   - Wait for the state the user actually needs.
   - Remove arbitrary sleeps.
4. Check isolation
   - Verify data cleanup, login state, and parallel test interference.
5. Review the environment
   - Compare local and CI browser versions, viewport, data, and services.
6. Apply a temporary retry only if needed
   - Use it as a buffer, not a permanent shield.

This order works because it addresses the most common causes with the cheapest fixes first. In many suites, just improving locators and waits will eliminate a large share of noise.

How to decide whether a rewrite is actually needed

Sometimes a framework rewrite or tool migration is justified, but that should be a decision, not a reflex.

A rewrite becomes more likely when:

The test layer is so tightly coupled to implementation details that every UI change breaks dozens of tests.
No one on the team can safely maintain the suite.
The current framework cannot support the required browser, device, or integration coverage.
The cost of fixing flakiness exceeds the cost of moving to a better-maintained approach.

But if the main issues are unstable selectors, inconsistent waits, and poor test isolation, a rewrite is often the wrong first move. You will carry those same habits into the new framework unless you fix the underlying discipline.

A short checklist for the next failing test

Use this as a lightweight incident checklist the next time CI goes red:

Is the locator stable, or is it based on layout, index, or generated markup?
Is the test waiting for a meaningful condition, or just sleeping?
Does the test share data, session state, or accounts with another test?
Is the failure reproducible locally with the same browser and data?
Are CI runners, browser versions, or feature flags different from local runs?
Is the app exposing an intermediate state that should not be interactable?
Is retry hiding a real issue that should be fixed at the source?

If you answer these questions systematically, you can reduce flaky UI tests without spending months on a framework migration.

The bottom line

Flaky UI tests are frustrating because they undermine trust, and once trust is gone, the suite stops being useful as a release signal. But the solution is usually not to throw away the framework and start again. Start by fixing the parts most likely to break: locator stability, waits, test isolation, retry policy, and environment issues.

That approach gives you faster signal, lower maintenance, and a clearer view of what is actually broken. If the suite still needs a broader platform change later, you will make that decision with better evidence and less urgency.
