Achieving Test Reliability for Native E2E Testing: Beyond Fixing Broken Tests

End-to-end (E2E) tests are particularly important for native applications that run on various platforms (Android/iOS), screen sizes, and OS versions. E2E testing picks up differences in behavior across this fragmented ecosystem.

But keeping E2E tests reliable is often more challenging than writing them in the first place. 

The fragmented device ecosystem, gaps in test frameworks, network inconsistencies, unstable test environments, and constantly changing UI all contribute to test flakiness. Teams easily get trapped in a cycle of constantly fixing failing tests due to UI changes or environment instability rather than improving the overall reliability of their test infrastructure. They end up frustrated and hesitant to adopt E2E tests in their workflows.

Having led the native E2E testing infrastructure setup at a mid-sized company, I learned the hard way how critical it is to define and implement strategies for test ownership, observability, and notifications in ensuring long-term test stability. In this piece, I discuss the challenges I’ve seen teams face and share lessons on how to build reliable E2E systems that you actually trust.

Challenges with Reactive Test Maintenance

After setting up periodic E2E runs on the CI, our team initially focused on triaging, investigating, and fixing every failing test to improve test stability. However, even after nearly a year of patching flaky tests, the reliability of our E2E suite didn’t improve, and engineers slowly lost confidence in the usefulness and reliability of the test suite.

I learned that teams that focus primarily on fixing broken tests often end up in a cycle of chasing failures without fixing the root causes of instability. This reactive approach creates several problems:

Test suite fragility: If teams continue patching broken tests without addressing real issues with either the underlying app changes or unstable environments, the test suite becomes increasingly brittle. Over time, tests fail for reasons unrelated to real product defects, making it harder to distinguish genuine regressions from noise.

High maintenance overhead: Debugging and fixing flaky tests often requires a significant amount of developer time and resources. Unlike unit tests, which run quickly and fail in isolation, E2E tests execute against the development, staging, or pre-production environment, making failures harder to reproduce and diagnose. Adjusting E2E tests to work across devices with different screen sizes or OS versions requires additional work, making fixes a non-trivial task.

Reduced trust in the test suite: When failures are common and noisy, teams lose confidence in the E2E suite, and they often start ignoring test failures. This undermines the purpose of having automated tests in the first place. Instead, teams rely on local dev testing or manual QA cycles to validate changes. Over time, the suite becomes more of a liability than a safeguard, slowing down delivery instead of enabling it.

A reactive approach to fixing E2E tests slows down release cycles. Developers must spend significant amounts of time repeatedly fixing and rerunning failing tests, while teams rely on manual QA to catch actual regressions.

Building a Reliable E2E Infrastructure

When our test suite stability didn’t improve after more than a year of chasing failures, we took a step back to analyze historical results and look for patterns. 

We discovered that a significant number of failures could be attributed to an unstable environment or an unexpected state of the test account. For example, spikes in API latencies in the test environment frequently caused false negatives, adding to the noise. Similarly, tests run against existing user accounts could become inconsistent due to a past failure or if multiple tests attempted to use the same account.

I learned that investing in improving your test infrastructure is the only way to get to a stable and reliable native E2E testing workflow. This involves stabilizing the test environment, defining clear test ownership, reducing noisy alerts, and improving observability. Let’s look at each of these in more detail.

Stabilize the Test Environment

Many flaky E2E tests can be traced back to inconsistencies in the underlying environment, such as sporadic device issues, network instability, or API downtime in a staging environment. 

To avoid noisy and unreliable tests, ensure you have a stable and standardized test environment with the following test practices:

Standardize device and environment setup: Device and test environment stability issues heavily impact test stability. To reduce API downtimes, isolate the E2E testing environment from the developer or staging environment to prevent interference from unstable builds and experimental features. Teams could either build a stable pre-prod environment that uses a production-ready artifact or spin up ephemeral environments for each E2E run to ensure consistency. Running tests on standardized device images or containerized emulators with consistent OS versions, configurations, and resources further improves stability. For critical flows, you can schedule periodic runs on physical device farms to validate against real hardware while keeping day-to-day tests stable and cost-effective.

Isolate test data per session: A test that modifies any data should start from a clean slate. For instance, when testing a todo application, every test session should use a new test account to avoid failures caused by unpredictable account state. To keep tests fast, execute setup scripts in `before` hooks to handle account creation and seed any required data automatically.
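
As a sketch of this pattern, the account-provisioning step might look like the following. `createTestAccount` stands in for a hypothetical test-only provisioning endpoint, stubbed here with an in-memory counter so the example is self-contained:

```typescript
// Sketch only: assumes your backend exposes a test-only account
// provisioning API; here it is stubbed in memory.

interface TestAccount {
  email: string;
  token: string;
}

let accountCounter = 0;

// Stand-in for a real provisioning call (normally async and networked).
function createTestAccount(sessionId: string): TestAccount {
  accountCounter += 1;
  return {
    email: `e2e+${sessionId}-${accountCounter}@example.test`,
    token: `token-${accountCounter}`,
  };
}

// Typical usage in a WebdriverIO/Appium-style hook (names illustrative):
// before(async () => {
//   account = createTestAccount(process.env.CI_JOB_ID ?? "local");
//   await seedTodos(account, ["buy milk"]); // seed required data up front
// });
```

Because every session provisions its own account, leftover state from a previous failed run can never bleed into the next one.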

Mock certain network responses: While an E2E test is meant to test the entire user journey with real data, in some cases, it’s necessary to mock specific API responses to maintain a predictable test environment. For instance, if your application relies on A/B tests or uses feature flags, different sessions might receive different experiences based on the user allocation. This can cause unexpected failures unrelated to actual regressions. Mocking these responses in test builds ensures consistency across sessions, and it avoids building complex test cases that handle different user experiences.
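
A minimal sketch of this idea, assuming a hypothetical `resolveFlags` helper in the app's flag client (in practice you would more likely intercept the flag service's response in the test build, e.g. via a mock server):

```typescript
// Sketch: pinning feature-flag/experiment variants in E2E builds so every
// session sees the same experience. All names here are illustrative.

type FlagPayload = Record<string, string | boolean>;

// Simulated allocation service: different users land in different buckets.
function fetchFlagsFromService(userId: string): FlagPayload {
  const bucket = userId.length % 2; // stand-in for hash-based bucketing
  return { checkoutRedesign: bucket === 0, theme: bucket === 0 ? "light" : "dark" };
}

// In E2E builds, force every session onto one known variant.
const E2E_FLAG_OVERRIDES: FlagPayload = {
  checkoutRedesign: false,
  theme: "light",
};

function resolveFlags(userId: string, isE2E: boolean): FlagPayload {
  const real = fetchFlagsFromService(userId);
  return isE2E ? { ...real, ...E2E_FLAG_OVERRIDES } : real;
}
```

Pinning the variant keeps every test session on one experience, so a flag rollout can no longer masquerade as a regression.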

Establish Clear Test Ownership

When a test fails, it’s often unclear who’s responsible for investigating and fixing it. Over time, such an absence of clear test ownership and accountability results in unreliable, unmaintained, and flaky tests. 

Assigning ownership of tests based on the ownership of product features can alleviate this problem to some extent. Ideally, the owning team should be responsible for writing, maintaining, and fixing tests for their critical flows. This ownership model ensures that failures are triaged quickly and that tests are updated as the product evolves instead of becoming stale and unstable. 

Test ownership becomes challenging in codebases where multiple product teams own parts of a single user flow. For example, in a shopping application, different teams might own the login, product catalog, and checkout experiences. If a checkout flow test fails at the login step, it can be unclear which team should triage the issue. Without a clear policy, the failure might be ignored, or multiple teams might end up duplicating the effort. 

To handle these scenarios, set a policy that defines the first point of contact (POC) for each test based on the end-user experience it covers. A single team then takes responsibility for triaging the failure, while fixes can be handed off to upstream teams as needed.
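
As an illustration, the routing policy can be as simple as a lookup from test name to owning team; the team and flow names below are hypothetical:

```typescript
// Sketch: route a failing test to a first point of contact based on the
// end-user flow it covers, regardless of which step failed. All names
// here are illustrative.

const TEST_POC: Record<string, string> = {
  "checkout-happy-path": "team-checkout",
  "login-with-otp": "team-identity",
  "browse-catalog": "team-catalog",
};

function routeFailure(testName: string): string {
  // The POC triages first; if the root cause is in an upstream step
  // (e.g. login inside the checkout flow), they hand the fix off after
  // triage rather than leaving the failure unowned.
  return TEST_POC[testName] ?? "team-platform"; // fallback owner
}
```

Even a table this small removes the ambiguity: the checkout test always pages the checkout team first, even when it dies at the login step.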

Reduce Noise and Improve Alerting

A common challenge with native E2E testing is noisy alerts due to flaky or failing tests. Teams are often flooded with non-actionable alerts when flaky tests fail because of transient network or device issues. Repeated failure notifications about known bugs can also lead to alert fatigue.

The following techniques reduce this noise so that teams are only notified for actionable failures:

Mute flaky tests and known bugs: Instead of reporting and notifying teams about all test failures, allow alerts from tests that are identified as flaky or linked to known issues to be muted without a code change. You can manage muted tests through a remote configuration, environment variables, or a tool like BrowserStack. Flag them for follow-up work, but let alerts only go out for new or unexpected regressions. Muting is particularly important for E2E tests since fixing failing tests often requires significant developer time and resources. Repeated alerts can be especially distracting for developers.
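
A sketch of the muting filter, assuming a mute list fetched from remote config (hard-coded here so the example stands alone; the test names and issue key are made up):

```typescript
// Sketch: filter alerts through a mute list so known-flaky tests and
// known bugs stop paging the team. The config shape is an assumption.

interface TestResult {
  name: string;
  passed: boolean;
}

// In practice, fetch this from remote config or an environment variable
// so tests can be muted without a code change or redeploy.
const MUTED_TESTS = new Set(["profile-avatar-upload"]); // tracked as KNOWN-1234

function alertsFor(results: TestResult[]): string[] {
  return results
    .filter((r) => !r.passed && !MUTED_TESTS.has(r.name))
    .map((r) => `E2E failure: ${r.name}`);
}
```

The muted test still appears in reports for follow-up work; it just stops generating pages until someone fixes it.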

Enrich notifications with failure details: Instead of generic failure messages, include details such as the failing user flow, commit details, the error message, and links to logs or dashboards in your alerts. These details help developers identify and triage issues quicker, resulting in faster fixes and higher confidence in the suite.

Track test metrics and trends: In addition to test suite level reports, track and analyze the historical results of your tests to understand failure rates, flakiness trends, and failure hotspots. For example, if you observe repeated failures in the login flow, it might indicate unstable tests or sporadic bugs in that flow. Tracking these metrics over time provides visibility into whether the E2E suite is improving or degrading, and it helps you prioritize stabilization efforts based on impact.
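
For example, a first cut at failure-rate tracking can be computed from stored run records. The record shape below is an assumption; most CI systems can export something equivalent:

```typescript
// Sketch: compute per-flow failure rates from historical run records to
// surface hotspots worth stabilizing first.

interface RunRecord {
  flow: string;
  passed: boolean;
}

function failureRates(history: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { runs: number; failures: number }>();
  for (const r of history) {
    const t = totals.get(r.flow) ?? { runs: 0, failures: 0 };
    t.runs += 1;
    if (!r.passed) t.failures += 1;
    totals.set(r.flow, t);
  }
  // Convert raw counts into a failure rate per flow.
  const rates = new Map<string, number>();
  for (const [flow, t] of totals) rates.set(flow, t.failures / t.runs);
  return rates;
}
```

Charting these rates over weeks shows whether your stabilization work is actually paying off, and which flows deserve the next round of attention.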

Hybrid Strategies for Scaling E2E with Dockerized Emulators

Running native E2E tests at scale is challenging due to cost and resource constraints. Device farms that provide access to real cloud-based devices are expensive for running a large suite of tests at high frequency. This becomes a constraint for integrating E2E tests with the CI pipeline that executes with every pull request before the changes are merged. 

As mentioned earlier, a hybrid testing approach that uses Dockerized emulators for PR builds alongside real devices for periodic runs can help you overcome this challenge. When our team moved PR checks to Dockerized emulators, we got faster feedback and significantly reduced cloud device costs.

Containerized device runners can be spun up quickly in CI. For example, the docker-android image lets you run an Android emulator in a Docker container. It supports multiple device profiles, OS versions, and UI-testing frameworks such as Appium and Espresso. Teams can easily integrate these emulators into CI pipelines to run E2E tests at scale without a huge testing budget.

If you are building E2E tests for mobile web, you can also use containerized browser images to run tests consistently across different environments to further reduce cost and setup complexity.

There’s Hope!

If your team has been chasing native E2E test failures like we were, you’re probably also burning engineering time and resources without improving test stability. I hope this article has shown you that there’s a better way: improving your test environment, device setup, alerting, and observability. 

Your best first step is to analyze your historical test failures and categorize them into buckets. Use these insights to build a roadmap of actionable items for reducing flakiness, and to identify the test infrastructure investments or process changes that will deliver the most impact. 
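
One lightweight way to start that categorization is to bucket failures by error signature. The patterns below are illustrative assumptions you would tune against your own logs:

```typescript
// Sketch: bucket historical failures by error-message signature as a
// first pass at categorization. Patterns are illustrative only.

const BUCKETS: Array<[string, RegExp]> = [
  ["environment", /timeout|ECONNREFUSED|503|latency/i],
  ["test-data", /account locked|already exists|state mismatch/i],
  ["ui-change", /element not found|selector/i],
];

function categorize(errorMessage: string): string {
  for (const [bucket, pattern] of BUCKETS) {
    if (pattern.test(errorMessage)) return bucket;
  }
  // Anything unmatched gets a human look before earning its own bucket.
  return "needs-investigation";
}
```

Running a script like this over a few months of failures tells you quickly whether your biggest problem is the environment, the test data, or the tests themselves.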

After our team invested in test infrastructure improvements, we saw a clear improvement in stability. Developers had a better understanding of real failures, and the number of noisy alerts was reduced. Flakiness didn’t disappear entirely, but the improved reliability of the test suite helped us catch multiple native app regressions before the changes were released to production.

I hope this article will help you achieve similar wins.

Source: https://blog.docker.com/feed/

AWS Lambda Managed Instances now supports Rust

AWS Lambda Managed Instances now supports Rust, enabling developers to run high-performance Rust-based functions on Lambda-managed Amazon EC2 instances while maintaining Lambda’s operational simplicity. This combination makes it easier than ever to run performance-critical applications without the complexity of managing servers.

Lambda Managed Instances gives Lambda developers access to specialized compute configurations, including the latest-generation processors and high-bandwidth networking. Lambda Managed Instances are fully managed EC2 instances with built-in routing, load balancing, and auto scaling, and no operational overhead. They combine Lambda’s serverless experience with EC2 pricing advantages, including Compute Savings Plans and Reserved Instances. Rust support for Lambda Managed Instances combines these benefits with the performance and efficiency of Rust, including parallel request processing within each execution environment. Together, using Lambda Managed Instances with Rust maximizes utilization and price-performance.

Rust support for Lambda Managed Instances is available today in all AWS Regions where Lambda Managed Instances is available. To get started with Rust on Lambda Managed Instances, see the Lambda documentation. To learn more about this release, see the release notes.
Source: aws.amazon.com

AWS Network Firewall Launch in the AWS European Sovereign Cloud

Starting today, AWS Network Firewall is available in the AWS European Sovereign Cloud. With this launch, European customers, particularly those in highly regulated industries, government agencies, and organizations with strict data sovereignty requirements, can deploy AWS Network Firewall to protect their most sensitive workloads while maintaining full compliance with European Union (EU) data protection regulations. Through this expansion, customers using the AWS European Sovereign Cloud can leverage the same AWS Network Firewall capabilities available in other AWS Regions, while ensuring that all data and operations remain entirely within EU borders and under EU-based control.

AWS Network Firewall is a managed firewall service that provides essential network protections for your Amazon Virtual Private Clouds (VPCs). The service automatically scales with network traffic volume to provide high-availability protections without the need to set up or maintain the underlying infrastructure.

To learn more about AWS Network Firewall availability, visit the AWS Region Table. For more information, please see the AWS Network Firewall product page and the service documentation.
Source: aws.amazon.com

Amazon MSK announces support for Standard brokers on Graviton3 instances in the Africa (Cape Town) region

You can now create provisioned Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters with Standard brokers running on AWS Graviton3-based M7g instances in the Africa (Cape Town) region.
Graviton M7g instances for Standard brokers deliver up to 24% compute cost savings and up to 29% higher write and read throughput over comparable MSK clusters running on M5 instances. To get started, create a new cluster with M7g brokers or upgrade your M5 cluster to M7g through the Amazon MSK console or the AWS CLI, and read our Amazon MSK Developer Guide for more information.
Source: aws.amazon.com

Accelerate serverless application development with new SAM Kiro power

AWS announces the AWS Serverless Application Model (SAM) Kiro power, bringing serverless application development expertise to agentic AI development in Kiro. With this power, you can build, deploy, and manage serverless applications with AI agent-assisted development directly in your local environment.
SAM is an open-source framework that simplifies building serverless applications on AWS. SAM Kiro power dynamically loads relevant guidance and development expertise the AI agent needs to build serverless applications. This includes initializing SAM projects, building and deploying applications to AWS, and locally testing Lambda functions. The power supports event-driven patterns with Amazon EventBridge, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis, Amazon DynamoDB Streams, and Amazon Simple Queue Service (SQS), while covering security best practices for IAM policies. Built-in guidance enforces use of SAM resources and Powertools for AWS Lambda for observability and structured logging by default, ensuring best practices from the start. This guidance accelerates your journey from concept to production, whether building static websites with API backends, event-driven microservices, or full-stack applications.
The SAM Kiro power is available today with one-click installation from the Kiro IDE and the Kiro Powers page. Explore the power on GitHub or visit the developer guide to learn more about SAM.
Source: aws.amazon.com