Getting to the Root of Quality Issues: Why QA Needs Observability 2.0

Quality assurance (QA) teams are no longer the gatekeepers at the end of the software delivery process. Instead, they are embedded in the software development life cycle (SDLC), acting as stewards of quality, reliability, performance, and user satisfaction. As technology advances and software systems grow in complexity, our approach to quality must evolve. Observability 2.0 offers the tools and insights QA teams need to keep up with this change and to lead it.

Observability, a term once reserved for system operators and site reliability engineers, is now something that every QA professional should understand. Observability 2.0, as defined by Charity Majors, Co-founder and CTO of Honeycomb, takes us beyond traditional monitoring, which I’d call "Observability 1.0."

This advancement offers QA teams the ability to dive deeper, finding the root causes of intermittent or complex issues that traditional monitoring systems struggle to expose. Observability 2.0 is about understanding systems with a depth and breadth that allows teams to ask unanticipated questions of their data and get meaningful answers.

The Difference Between Observability 1.0 and 2.0

Observability 1.0 was based on monitoring tools that measure predefined metrics and generate logs and traces for specific events. While valuable, this approach results in data silos, with logs, traces, and metrics stored separately. The result is a fragmented view of the system's state that requires a significant amount of manual correlation from engineers and QA professionals.

Because Observability 1.0 treats metrics, logs, and traces as separate data types, the same underlying data is often stored several times in different forms, which increases costs and limits insight.

Observability 2.0, however, is built on a unified, cohesive approach to telemetry. Instead of relying on scattered tools and datasets, it gathers all telemetry into a centralized, queryable platform that allows QA professionals to derive metrics, traces, and logs dynamically. This means that rather than building systems to store everything in the hope of sifting through it later, QA teams can now interact with telemetry in real time, asking unexpected questions without being confined to predefined fields or data aggregations.
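To make the idea of deriving metrics and logs from one dataset concrete, here is a minimal sketch in Python. It assumes telemetry is stored as one wide record per request; the field names (trace_id, endpoint, duration_ms, status, device) and the helper functions are hypothetical and are not taken from any particular observability product.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical wide events: one record per request, carrying every field
# we might later want to ask about. Field names are illustrative only.
EVENTS = [
    {"trace_id": "t1", "endpoint": "/checkout", "duration_ms": 120, "status": 200, "device": "ios"},
    {"trace_id": "t2", "endpoint": "/checkout", "duration_ms": 980, "status": 500, "device": "android"},
    {"trace_id": "t3", "endpoint": "/search", "duration_ms": 45, "status": 200, "device": "ios"},
    {"trace_id": "t4", "endpoint": "/search", "duration_ms": 55, "status": 200, "device": "android"},
]


def derive_metric(events, value_field, group_by, aggregate):
    """Compute a metric at query time by grouping raw events on any field."""
    groups = defaultdict(list)
    for event in events:
        groups[event[group_by]].append(event[value_field])
    return {key: aggregate(values) for key, values in groups.items()}


def log_lines(events, predicate):
    """Render matching events as log-style lines from the same records."""
    return [
        f'{e["trace_id"]} {e["endpoint"]} status={e["status"]} duration_ms={e["duration_ms"]}'
        for e in events
        if predicate(e)
    ]


# Two different "metrics" and one "log view", all derived from one dataset.
avg_by_endpoint = derive_metric(EVENTS, "duration_ms", "endpoint", mean)
max_by_device = derive_metric(EVENTS, "duration_ms", "device", max)
error_logs = log_lines(EVENTS, lambda e: e["status"] >= 500)
```

Nothing here was decided in advance: the grouping field and the aggregation are chosen when the question is asked, which is the property the paragraph above describes. A real platform would run such queries against a datastore rather than an in-memory list.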

In short, Observability 2.0 turns telemetry into a powerful source of truth, allowing teams to make data-driven decisions that are both proactive and precise.

Observability 2.0 in Quality Assurance: Tackling Intermittent Issues and Root Cause Analysis

One of the most challenging tasks for any QA team is diagnosing intermittent or hard-to-reproduce issues. Traditionally, we’ve relied on a combination of automated testing, manual replication efforts, and aggregated data. This approach fails when it comes to discovering the underlying cause of elusive issues—especially in complex, distributed systems where issues arise unpredictably or only under specific conditions.

Observability 2.0 provides a more sophisticated approach to these challenges.

By leveraging true observability, QA teams can track down the root cause of issues that might never be flagged in aggregated data sets. With full, unsampled telemetry data, QA professionals are no longer constrained by predefined metrics or limited storage of historical data.

This helps when investigating system behaviors that appear only under certain conditions or at certain times. In her video interview on "Last Week in AWS," Charity Majors notes that "Observability 2.0 is based on a single source of truth," which allows teams to slice and dice data without limits, helping them uncover patterns that traditional monitoring can’t reveal.

Consider a scenario where periodic latency issues are detected during performance testing. Traditional monitoring might reveal a general pattern of latency spikes, but without specific data on what is happening in real time during these events, it’s nearly impossible to identify the root cause.

Observability 2.0, on the other hand, would capture the telemetry for every request along with its dimensions, enabling QA to correlate latency with specific functions, user behaviors, or device types. Instead of reactive investigation, QA teams can explore this data proactively, asking the system directly, “What was happening during this spike?” and receiving answers that are both timely and specific.
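One way to ask "What was happening during this spike?" is to compare the slow requests against all requests and see which attribute values are over-represented among the slow ones. The sketch below illustrates that idea in Python; the function name, the 500 ms threshold, and the sample data are assumptions made for illustration, not a description of how any specific tool works.

```python
from collections import Counter

# Hypothetical per-request events captured during a performance test run.
EVENTS = (
    [{"function": "render_cart", "device": "android", "duration_ms": 900}] * 3
    + [{"function": "render_cart", "device": "ios", "duration_ms": 950}]
    + [{"function": "search", "device": "android", "duration_ms": 60}] * 8
    + [{"function": "search", "device": "ios", "duration_ms": 70}] * 8
)


def overrepresented(events, is_outlier, dimensions):
    """Rank (dimension, value) pairs by how much more common they are
    among outlier events than among all events."""
    outliers = [e for e in events if is_outlier(e)]
    if not outliers:
        return []
    ranked = []
    for dim in dimensions:
        all_counts = Counter(e.get(dim) for e in events)
        outlier_counts = Counter(e.get(dim) for e in outliers)
        for value, count in outlier_counts.items():
            outlier_share = count / len(outliers)
            baseline_share = all_counts[value] / len(events)
            ranked.append((round(outlier_share - baseline_share, 3), dim, value))
    # Largest positive difference first: the strongest suspects.
    return sorted(ranked, key=lambda row: row[0], reverse=True)


# Assumed threshold: treat anything slower than 500 ms as part of the spike.
ranked = overrepresented(EVENTS, lambda e: e["duration_ms"] > 500, ["function", "device"])
top_suspect = ranked[0]
```

In this sample, every slow request went through the render_cart function while the device split stays close to the baseline, so the function rises to the top of the ranking. This only works when the raw, per-request data is still available; a pre-aggregated latency chart would not carry the dimensions needed for the comparison.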

Unanticipated Questions and Dynamic Data Interaction

One of the key advantages of Observability 2.0 is the ability to ask questions that we never anticipated at the time of coding or deployment. In quality assurance, we encounter scenarios daily that demand a flexible approach to data. Aggregated data often lacks the depth required for these types of investigations because it only provides a summarized view of past conditions, not the specifics that make each instance unique.

Having access to the complete telemetry dataset enables QA teams to interact with data dynamically. Imagine encountering an error that affects only a subset of users on specific device configurations. Without the detailed telemetry Observability 2.0 provides, this issue might never reveal itself, as aggregated data would simply blend it into a larger dataset.
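A small worked example shows how an aggregate can hide such an error. The Python sketch below uses invented sample data (100 requests, with failures confined to one hypothetical device and OS combination); the field names and numbers are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sample: 100 requests, only android / OS 12 users fail.
EVENTS = (
    [{"device": "ios", "os": "17", "error": False}] * 60
    + [{"device": "android", "os": "14", "error": False}] * 38
    + [{"device": "android", "os": "12", "error": True}] * 2
)


def error_rate(events):
    """Fraction of events flagged as errors."""
    return sum(1 for e in events if e["error"]) / len(events)


def error_rate_by(events, *fields):
    """Error rate per combination of the given fields."""
    groups = defaultdict(list)
    for e in events:
        groups[tuple(e[f] for f in fields)].append(e)
    return {key: error_rate(group) for key, group in groups.items()}


overall = error_rate(EVENTS)
by_config = error_rate_by(EVENTS, "device", "os")
```

Here the overall error rate is 2%, which looks unremarkable on a dashboard, while the breakdown by device and OS shows that one configuration fails on every request. The breakdown is only possible because each event still carries its own device and OS fields.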

Observability 2.0 offers QA the granularity to identify precisely which requests, environments, or users experienced the issue. As Brian Chang notes in a Honeycomb blog post, "A service-driven perspective allows engineers to understand the entire service’s health and performance."

This empowers QA to assess quality as an integrated part of the service’s performance, rather than an isolated aspect.

The Role of Observability 2.0 in the Future of QA

As QA leaders, we must evolve our practices to meet the demands of complex, service-based architectures. Observability 2.0 offers a way to go beyond the reactive mindset of traditional monitoring by empowering QA to engage deeply with real-time telemetry, discover root causes faster, and ask unanticipated questions of our data.

By embracing this level of observability, QA evolves to become an integrated, proactive element of software delivery.

With Observability 2.0, we can predict potential issues before they escalate, address root causes with confidence, and deliver software that meets both functional requirements and the expectations of our performance-driven world.

Quality assurance is moving from a reactive function to a proactive partner in service delivery, and Observability 2.0 is helping lead the way.
