E2E Tests with AI: Technical Hurdles and Why It's More Complicated Than You Think

Georg Dörgeloh May 20, 2025

“Can’t we just use ChatGPT to write our E2E tests?” You probably hear this question frequently as a developer. The answer is more complicated than a simple yes or no. In this article, I’ll show you the technical reality behind AI-powered test automation – and why the problem is more interesting (and difficult) than it initially appears.

The Problem with Simple Approaches

Approach 1: Code Generation with LLMs

The most obvious approach is to have an LLM generate test code:

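// AI-generated E2E test
import { test } from '@playwright/test';

test('User can login and update profile', async ({ page }) => {
  await page.goto('https://example.com');
  await page.fill('input[name="username"]', 'testuser');
  await page.fill('input[name="password"]', 'password123');
  await page.click('button[type="submit"]');
  // waitForNavigation() is deprecated and racy when called after click(),
  // so wait for the target URL instead. '**/profile' is an assumed URL pattern.
  await page.waitForURL('**/profile');
  // and so on...
});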

Technical Problems:

  • Context Window Limitation: Large codebases quickly exceed the LLM’s context window
  • Missing Visual Context: The LLM sees code but not the running application
  • Unreliable Selectors: Generated CSS selectors break with UI changes

Approach 2: Browser Automation with LLMs

The next step: Let the LLM control the browser directly. The LLM receives screenshots and DOM, analyzes the current state, and decides what to do next.

Why this works technically:

  • Modern vision models (GPT-4V, Claude 3.5 Sonnet) can interpret screenshots surprisingly well
  • LLMs understand natural language test descriptions
  • Browser automation APIs (Playwright, Selenium) are well documented

Why it fails in practice:

  • Performance: Each test step requires an API call to the LLM (100-500ms+)
  • Cost: A single E2E test can cost $0.10-1.00+
  • Scaling: CI pipelines become unaffordable with dozens of tests
  • Non-determinism: The same test can take a different path on every run
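
To make the cost and scaling points concrete, here is a back-of-the-envelope calculation. The per-test cost and per-step latency ranges are the figures from the list above; the suite size, steps per test, and pipeline runs per day are assumed values chosen purely for illustration.

```javascript
// Rough estimate of what LLM-driven test execution costs in CI.
// Cost and latency ranges come from the article; everything in
// `assumptions` is a made-up example, not a measurement.
const assumptions = {
  testsInSuite: 50,       // assumed suite size
  stepsPerTest: 20,       // assumed number of LLM-decided steps per test
  pipelineRunsPerDay: 30, // assumed CI runs per day
};

const costPerTestUsd = { low: 0.1, high: 1.0 };
const latencyPerStepMs = { low: 100, high: 500 };

function estimate({ testsInSuite, stepsPerTest, pipelineRunsPerDay }) {
  const testRunsPerDay = testsInSuite * pipelineRunsPerDay;
  const stepsPerRun = testsInSuite * stepsPerTest;
  return {
    dailyCostUsd: {
      low: testRunsPerDay * costPerTestUsd.low,
      high: testRunsPerDay * costPerTestUsd.high,
    },
    // LLM wait time added to one pipeline run if tests execute sequentially
    llmSecondsPerRun: {
      low: (stepsPerRun * latencyPerStepMs.low) / 1000,
      high: (stepsPerRun * latencyPerStepMs.high) / 1000,
    },
  };
}

const result = estimate(assumptions);
console.log(result);
```

With these assumed numbers, the suite costs roughly $150 to $1,500 per day and adds 100 to 500 seconds of pure LLM wait time to every pipeline run.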

Hybrid Approach: Record & Replay

The most promising approach combines AI generation with traditional test execution:

  1. Recording Phase: AI executes the test once and records actions
  2. Code Generation: Actions are translated into stable test code
  3. Replay Phase: Tests run without AI involvement
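  4. Regeneration: AI can re-record the test when the replayed test fails

class HybridTestRecorder {
  // aiAgent, takeScreenshot, executeAction, isTestComplete, generateStableSelector,
  // and generatePlaywrightCode are assumed to be provided elsewhere.
  async recordTest(testDescription, maxSteps = 50) {
    const actions = [];

    // AI-controlled execution with recording.
    // maxSteps guards against an agent that never reaches completion.
    while (!this.isTestComplete() && actions.length < maxSteps) {
      const screenshot = await this.takeScreenshot();
      const action = await this.aiAgent.nextAction(screenshot, testDescription);

      // Record before executing: the action may navigate away or remove the element
      actions.push({
        type: action.type,
        selector: this.generateStableSelector(action.element),
        data: action.data,
        screenshot: screenshot,
      });

      await this.executeAction(action);
    }

    return this.generatePlaywrightCode(actions);
  }
}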
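
The code generation step (step 2) could look roughly like the following sketch, which turns a list of recorded actions into Playwright test source. The action shape ({ type, selector, data }) follows the recorder above; the supported action types ('goto', 'fill', 'click') and the two-argument signature with a test name are assumptions made for this example.

```javascript
// Sketch: translate recorded actions into Playwright test source code.
// The handled action types are assumed; a real recorder would need more.
function generatePlaywrightCode(testName, actions) {
  // JSON.stringify yields a safely escaped JavaScript string literal
  const q = (value) => JSON.stringify(String(value));

  const lines = actions.map((action) => {
    switch (action.type) {
      case 'goto':
        return `  await page.goto(${q(action.data)});`;
      case 'fill':
        return `  await page.locator(${q(action.selector)}).fill(${q(action.data)});`;
      case 'click':
        return `  await page.locator(${q(action.selector)}).click();`;
      default:
        throw new Error(`Unsupported action type: ${action.type}`);
    }
  });

  return [
    "import { test } from '@playwright/test';",
    '',
    `test(${q(testName)}, async ({ page }) => {`,
    ...lines,
    '});',
  ].join('\n');
}

const recorded = [
  { type: 'goto', data: 'https://example.com' },
  { type: 'fill', selector: 'input[name="username"]', data: 'testuser' },
  { type: 'click', selector: 'button[type="submit"]' },
];

console.log(generatePlaywrightCode('User can login', recorded));
```

The generated file contains no AI calls, which is the point of the replay phase: it runs as fast and as deterministically as any hand-written Playwright test.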

Why Existing Tools Aren’t Sufficient

Problem with Generic AI Assistants

ChatGPT, Claude, and similar assistants aren’t optimized for browser automation:

  • No specialized DOM processing
  • No robust selector strategies
  • No integration with test frameworks
  • No understanding of test maintenance

Problem with Simple Browser Automation Tools

Tools like Claude Desktop with Playwright MCP solve only the surface-level problems:

  • Work for demos, not for production
  • No selector optimization
  • No test management features
  • No CI/CD integration

The Three Core Technical Problems

1. The Context Problem

The DOM of a modern web app can easily reach 2-10 MB in size. LLMs have limited context windows. Even GPT-4 Turbo with its 128k-token window cannot hold the complete DOM of a modern single-page application.
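
A quick estimate shows the size of the gap. The sketch below assumes roughly four characters per token and one byte per character; both are rules of thumb rather than tokenizer measurements, and markup-heavy HTML often tokenizes less efficiently than prose.

```javascript
// Rough token estimate for a serialized DOM.
// CHARS_PER_TOKEN is an assumed rule of thumb, not a tokenizer measurement.
const CHARS_PER_TOKEN = 4;
const CONTEXT_WINDOW_TOKENS = 128_000;

function estimateTokens(domSizeBytes) {
  // Assumes one byte per character (mostly ASCII markup)
  return Math.ceil(domSizeBytes / CHARS_PER_TOKEN);
}

for (const megabytes of [2, 10]) {
  const tokens = estimateTokens(megabytes * 1_000_000);
  const factor = (tokens / CONTEXT_WINDOW_TOKENS).toFixed(1);
  console.log(`${megabytes} MB ≈ ${tokens} tokens (${factor}x the context window)`);
}
```

Under these assumptions, even a 2 MB DOM is about four times larger than a 128k-token window, before any prompt, screenshot, or test description is added.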

Solution Approaches:

  • DOM Pruning: Remove irrelevant elements
  • Semantic DOM: Extract only interactive elements
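
As a minimal sketch of the "semantic DOM" idea, the function below walks a tree and keeps only elements a user can interact with. Nodes are plain objects ({ tag, attrs, text, children }) so the example runs without a browser; this node shape and the lists of interactive tags and roles are assumptions and deliberately incomplete.

```javascript
// Sketch of "semantic DOM" extraction: keep only interactive elements.
// The tag and role lists are assumed and deliberately incomplete.
const INTERACTIVE_TAGS = new Set(['a', 'button', 'input', 'select', 'textarea']);
const INTERACTIVE_ROLES = new Set(['button', 'link', 'checkbox', 'tab', 'menuitem']);

function isInteractive(node) {
  const role = node.attrs?.role;
  return INTERACTIVE_TAGS.has(node.tag) || (role !== undefined && INTERACTIVE_ROLES.has(role));
}

// Concatenate the text of a node and all its descendants
function collectText(node) {
  const own = node.text ?? '';
  const nested = (node.children ?? []).map(collectText).join(' ');
  return `${own} ${nested}`.replace(/\s+/g, ' ').trim();
}

function extractInteractive(node, result = []) {
  if (isInteractive(node)) {
    // Flatten each interactive element into a compact record for the prompt
    result.push({ tag: node.tag, attrs: node.attrs ?? {}, text: collectText(node) });
    return result; // descendants are already summarized by the collected text
  }
  for (const child of node.children ?? []) {
    extractInteractive(child, result);
  }
  return result;
}

const page = {
  tag: 'body',
  children: [
    { tag: 'div', attrs: { class: 'hero' }, children: [{ tag: 'p', text: 'Welcome back' }] },
    {
      tag: 'form',
      children: [
        { tag: 'input', attrs: { name: 'username' } },
        { tag: 'button', attrs: { type: 'submit' }, children: [{ tag: 'span', text: 'Log in' }] },
      ],
    },
  ],
};

console.log(extractInteractive(page));
```

For the sample tree, only the input and the button survive; the decorative hero section is dropped entirely, which is where most of the size reduction would come from on a real page.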

2. The Visual-DOM Mapping Problem

The LLM sees a “Search” button in the screenshot, but the DOM only contains:

<button class="btn-primary">
  <svg viewBox="0 0 24 24">
    <path
      d="M15.5 14h-.79l-.28-.27C15.41 12.59 16 11.11 16 9.5 16 5.91 13.09 3 9.5 3S3 5.91 3 9.5 5.91 16 9.5 16c1.61 0 3.09-.59 4.23-1.57l.27.28v.79l5 4.99L20.49 19l-4.99-5zm-6 0C7.01 14 5 11.99 5 9.5S7.01 5 9.5 5 14 7.01 14 9.5 11.99 14 9.5 14z"
    />
  </svg>
</button>

How should the LLM know that this SVG icon is the “Search” button it is looking for?

Technical Solution Approaches:

  • Computer vision for icon recognition
  • Multi-modal embedding for visual-DOM alignment
  • Accessible name computation according to ARIA standards
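
The third approach can be illustrated with a heavily simplified accessible-name lookup. The sketch below loosely follows the priority order of the W3C accessible name computation (aria-label, then content, then the title attribute), but it is not the full algorithm: aria-labelledby, CSS visibility, and form label association are all left out. Nodes are assumed to be plain objects ({ tag, attrs, text, children }) so that it runs without a browser.

```javascript
// Simplified accessible-name lookup. NOT the full W3C algorithm:
// aria-labelledby, hidden content, and <label> association are ignored.
function textContent(node) {
  const own = node.text ?? '';
  const nested = (node.children ?? []).map(textContent).join(' ');
  return `${own} ${nested}`.replace(/\s+/g, ' ').trim();
}

function accessibleName(node) {
  const attrs = node.attrs ?? {};
  // Candidates in priority order; text content also covers an SVG <title> child
  const candidates = [attrs['aria-label'], textContent(node), attrs.title];
  for (const candidate of candidates) {
    const name = (candidate ?? '').trim();
    if (name) return name;
  }
  return ''; // no name available in the DOM
}

// The icon-only button from the article (path data omitted)
const iconOnly = {
  tag: 'button',
  attrs: { class: 'btn-primary' },
  children: [{ tag: 'svg', attrs: { viewBox: '0 0 24 24' }, children: [{ tag: 'path' }] }],
};

// The same button with an aria-label added by the developers
const labelled = { ...iconOnly, attrs: { ...iconOnly.attrs, 'aria-label': 'Search' } };

console.log(JSON.stringify(accessibleName(iconOnly)));
console.log(JSON.stringify(accessibleName(labelled)));
```

The icon-only button yields an empty name, which is exactly the case where the DOM alone is not enough and a visual approach has to fill the gap. With an aria-label in place, the name resolves to "Search" without any vision model.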

3. The Selector Stability Problem

A reliable E2E test needs stable selectors. The LLM must not only find an element but also generate a robust selector for it.

Selector Hierarchy (from stable to fragile):

  1. data-testid attributes (ideal, but rarely available)
  2. Semantic selectors (role, aria-label)
  3. Relative positioning to known elements
  4. CSS classes and IDs
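  5. XPath with absolute positions (fragile)

// Robust selector algorithm
// findNearestLandmark, getRelativeSelector, and generateCSSPath are helpers assumed to exist elsewhere.
function generateSelector(element, dom) {
  // JSON.stringify adds the quotes and escapes embedded quotes and backslashes,
  // which is sufficient for typical attribute values
  const quote = (value) => JSON.stringify(value);

  // 1. Dedicated test attribute
  if (element.dataset.testid) {
    return `[data-testid=${quote(element.dataset.testid)}]`;
  }

  // 2. Semantic attribute
  const ariaLabel = element.getAttribute('aria-label');
  if (ariaLabel) {
    return `[aria-label=${quote(ariaLabel)}]`;
  }

  // Fallback: Relative positioning, chained with Playwright's `>>` selector syntax
  const nearbyLandmark = findNearestLandmark(element, dom);
  if (nearbyLandmark) {
    return `${nearbyLandmark.selector} >> ${getRelativeSelector(element)}`;
  }

  // Last resort: CSS path
  return generateCSSPath(element);
}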
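
The function above emits raw selector strings. When the target framework is Playwright, the same hierarchy could instead be mapped to its built-in locator methods, which also covers the role-based case from the list. The sketch below generates locator expressions as source text; getByTestId, getByRole, and locator are existing Playwright Page methods, while the descriptor shape ({ testId, role, name, cssPath }) and the function name are assumptions for this example.

```javascript
// Sketch: pick the most stable Playwright locator for an element descriptor.
// The descriptor shape ({ testId, role, name, cssPath }) is hypothetical.
function toPlaywrightLocator({ testId, role, name, cssPath }) {
  // JSON.stringify yields a safely escaped JavaScript string literal
  const q = (value) => JSON.stringify(String(value));

  // 1. data-testid (getByTestId matches the data-testid attribute by default)
  if (testId) return `page.getByTestId(${q(testId)})`;

  // 2. Semantic: ARIA role plus accessible name
  if (role && name) return `page.getByRole(${q(role)}, { name: ${q(name)} })`;

  // Last resort: CSS path
  if (cssPath) return `page.locator(${q(cssPath)})`;

  throw new Error('No usable selector information for element');
}

console.log(toPlaywrightLocator({ testId: 'login-submit', cssPath: 'form > button' }));
console.log(toPlaywrightLocator({ role: 'button', name: 'Search' }));
console.log(toPlaywrightLocator({ cssPath: 'button.btn-primary' }));
```

Role-based locators depend on the element having an accessible name, so the icon-only button from the previous section would fall through to the CSS path unless a name can be recovered some other way.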

The Value of Specialized Solutions

A well-thought-out AI test automation solution should provide:

Technical Excellence:

  • Multi-agent architecture for optimal specialization
  • Intelligent context optimization
  • Robust selector generation
  • Efficient caching and retry mechanisms

Developer Experience:

  • Seamless integration into existing test suites
  • Export to common frameworks (Playwright, Cypress, Selenium)
  • Test versioning and diffing
  • CI/CD pipeline integration

Business Value:

  • Self-service for product owners without developer bottleneck
  • Dramatic reduction in test maintenance time
  • Better test coverage through lower creation costs

Conclusion: Why You Should Suggest a Specialized Solution to Your PO

AI test automation is a fascinating problem that goes far beyond “LLM + Browser.” The technical challenges are real and complex:

  • Context management for large DOMs
  • Visual-DOM mapping for robust element identification
  • Selector strategies for maintainable tests
  • Multi-agent orchestration for optimal performance

As a developer, you understand this complexity. When your product owner asks: “Can’t we just use ChatGPT?”, you can explain why a specialized solution is necessary.

The value for you:

  • Less time spent on flaky test maintenance
  • More test coverage without proportionally more effort
  • Better collaboration with business stakeholders
  • Focus on more interesting development tasks

The future of E2E test automation doesn’t lie in generic AI tools, but in specialized systems that thoughtfully solve these technical challenges.
