Understanding Observability

In today’s digital world, many services we rely on, like online shopping, social media, and streaming, run on complex software systems. When something goes wrong with these systems, it can affect thousands or even millions of users. Observability helps engineers understand how these systems are performing and quickly find and fix issues. This blog post will explain what observability is, why it matters, and how it can be applied using an e-commerce platform as an example. We’ll also explore how visualization, monitoring, and observability differ, and discuss the four key data sources and four golden signals for observability.

What is Observability?

Observability is the ability to understand what’s happening inside a system by looking at the data it produces. Imagine you are driving a car. To know how well your car is performing, you look at the speedometer, fuel gauge, and other dashboard indicators. Observability in software systems is similar: engineers read the signals a system emits to understand how it is performing, and those signals matter most when something goes wrong. Observability is all about making a system’s internal states (like performance and health) visible and understandable from the outside.

Why Does Observability Matter?

Observability matters because it helps ensure that services run smoothly and reliably. In an e-commerce platform, for example, observability can help engineers quickly identify and fix issues like slow page loads, failed transactions, or server crashes. This leads to a better user experience, higher customer satisfaction, and fewer lost sales. Observability also helps engineers improve the performance and reliability of their systems over time.

Visualization, Monitoring, and Observability: What’s the Difference?

These three terms are related but have different meanings:

  • Visualization: This involves creating visual representations of data to help understand it better. For example, a graph could show how your car’s fuel level changes over time. In an e-commerce platform, visualization might include dashboards that display metrics like the number of active users or the average page load time.
  • Monitoring: This is like having sensors that continuously check specific parts of your system and alert you when something goes wrong. For example, a warning light comes on when your car’s fuel level is low. In an e-commerce platform, monitoring might involve setting up alerts for high error rates or low inventory levels.
  • Observability: This goes beyond visualization and monitoring. It not only alerts you when something is wrong but also provides detailed information to help you understand why it’s happening and how to fix it. It’s like having a detailed diagnostic tool that shows you everything happening inside your car’s engine. In an e-commerce platform, observability might involve analyzing logs, traces, and metrics to diagnose and resolve complex issues.

Four Key Data Sources for Observability

To make a system observable, engineers rely on four main types of data: Metrics, Events, Logs, and Traces.

1. Metrics

Metrics are numerical data points collected over time that show how well your system is performing. They are usually aggregated and visualized to provide a high-level view of the system’s health. In an e-commerce platform, important metrics might include:

  • CPU Usage: How much processing power is being used.
  • Memory Usage: How much memory is being used.
  • Request Rate: The number of user requests per second.
  • Error Rate: The number of errors per second, or the percentage of requests that fail.

Metrics help engineers quickly identify trends and anomalies in the system’s performance.
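
To make the idea of aggregation concrete, here is a minimal Python sketch that averages raw samples into one data point per minute. The function name aggregate_per_minute and the sample values are made up for illustration; a real platform would typically rely on a metrics library or backend rather than hand-rolled code like this.

```python
from collections import defaultdict


def aggregate_per_minute(samples):
    """Average (timestamp_seconds, value) samples into one point per minute."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Round the timestamp down to the start of its minute.
        minute_start = int(timestamp // 60) * 60
        buckets[minute_start].append(value)
    return {
        minute: sum(values) / len(values)
        for minute, values in sorted(buckets.items())
    }


# Hypothetical CPU usage samples: (seconds since start, percent used).
cpu_samples = [(0, 40.0), (30, 50.0), (60, 90.0), (90, 70.0)]
cpu_by_minute = aggregate_per_minute(cpu_samples)
print(cpu_by_minute)
```

Plotting the per-minute averages instead of every raw sample is what gives a dashboard its high-level view: the jump between the first and second minute stands out immediately.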

2. Events

Events are records of significant actions or changes in the system. They provide context for understanding what happened in the system and when it happened. In an e-commerce platform, important events might include:

  • User Login: A user logs into their account.
  • Order Placed: A user completes a purchase.
  • Item Added to Cart: A user adds an item to their shopping cart.
  • System Updates: Changes to the system, such as software deployments or configuration changes.

Events help engineers understand the sequence of actions that led to an issue.
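
As a rough sketch of how events can be used this way, the Python below stores each event with a timestamp and lists everything that happened up to the moment an issue started. The Event class, the event names, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    name: str
    occurred_at: datetime
    attributes: dict = field(default_factory=dict)


def events_before(events, moment):
    """Return events that happened at or before `moment`, oldest first."""
    ordered = sorted(events, key=lambda event: event.occurred_at)
    return [event for event in ordered if event.occurred_at <= moment]


def at(hour, minute):
    # Helper that builds a timestamp on an arbitrary example day.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    Event("order_placed", at(12, 5), {"order_id": "A-1001"}),
    Event("deployment", at(12, 0), {"version": "v2"}),
    Event("user_login", at(11, 58), {"user_id": "u-42"}),
]

# Suppose errors started at 12:03: what happened just before?
for event in events_before(events, at(12, 3)):
    print(event.occurred_at.isoformat(), event.name, event.attributes)
```

In this made-up timeline, the deployment three minutes before the incident is the obvious first thing to investigate.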

3. Logs

Logs are detailed records of what happens in the system, like a diary. They capture a wide range of information, including error messages, warnings, and informational messages. In an e-commerce platform, important logs might include:

  • Server Logs: Detailed information about server operations and errors.
  • Application Logs: Detailed information about the application’s behavior and errors.
  • Access Logs: Records of user interactions with the platform, such as page views and API requests.

Logs provide detailed information that helps engineers diagnose and troubleshoot issues.
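
Logs are easier to search when each entry is structured. The sketch below uses Python’s standard logging module to write an application log entry as JSON. The logger name "checkout", the order_id field, and the message are assumptions chosen for the e-commerce example, and the in-memory stream stands in for a log file.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        if hasattr(record, "order_id"):
            payload["order_id"] = record.order_id
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in one place

logger.error("payment declined", extra={"order_id": "A-1001"})
print(stream.getvalue())
```

Because every entry carries the order ID as its own field, an engineer can filter for a single order instead of scanning free text.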

4. Traces

Traces track the journey of a request as it travels through different parts of the system. They show how a request moves through various services and components, providing a detailed view of its path. In an e-commerce platform, important traces might include:

  • User Request Trace: The path of a user’s request to view a product page, including all the services it interacts with.
  • Order Processing Trace: The path of an order request, from placing the order to payment processing and order confirmation.

Traces help engineers understand the flow of requests and identify bottlenecks or failures in the system.
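
Here is a simplified Python sketch of an order processing trace. It is not a real tracing library: the Span class, service names, and timings are hypothetical, and it assumes a single root span with one level of child spans beneath it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_child_span(spans, trace_id):
    """Return the longest-running non-root span of one trace."""
    children = [
        span for span in spans
        if span.trace_id == trace_id and span.parent_id is not None
    ]
    # Raises ValueError if the trace has no child spans.
    return max(children, key=lambda span: span.duration_ms)


spans = [
    Span("t1", "s1", None, "web", "POST /orders", 0, 480),
    Span("t1", "s2", "s1", "inventory", "reserve_items", 10, 60),
    Span("t1", "s3", "s1", "payments", "charge_card", 70, 430),
    Span("t1", "s4", "s1", "email", "send_confirmation", 440, 470),
]

bottleneck = slowest_child_span(spans, "t1")
print(bottleneck.service, bottleneck.operation, bottleneck.duration_ms)
```

In this made-up trace the payment step accounts for most of the request’s time, which is the kind of bottleneck a trace makes visible.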

Four Golden Signals for Observability

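Engineers use four key metrics, known as the golden signals (thanks Google!), to monitor the health of their systems. Google’s Site Reliability Engineering book names them latency, traffic, errors, and saturation; this post uses “throughput” for traffic and “capacity” for saturation: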

1. Throughput

Throughput measures how much work a system is doing over time. It is often measured in requests per second. In an e-commerce platform, important throughput metrics might include:

  • Requests per Second (RPS): The number of user requests handled per second.
  • Transactions per Second (TPS): The number of completed transactions per second.

High throughput indicates that the system is handling a large number of requests. Low throughput can simply mean low demand, but an unexpected drop might indicate performance issues or bottlenecks.
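
As a small illustration, requests per second can be computed by counting the requests that arrive inside a time window and dividing by the window’s length. The function name and timestamps below are invented for the example.

```python
def requests_per_second(timestamps, window_start, window_seconds):
    """Count requests inside [window_start, window_start + window_seconds)."""
    window_end = window_start + window_seconds
    count = sum(1 for t in timestamps if window_start <= t < window_end)
    return count / window_seconds


# Hypothetical request arrival times, in seconds.
arrivals = [0.1, 0.4, 0.9, 1.2, 1.8, 2.5, 9.9, 10.0]
rps = requests_per_second(arrivals, window_start=0, window_seconds=5)
print(rps)
```

The window length is a trade-off: a short window reacts quickly to spikes but is noisy, while a long window is smoother but slower to show a change.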

2. Latency

Latency measures how long it takes for a system to respond to a request. It is often measured in milliseconds. In an e-commerce platform, important latency metrics might include:

  • Page Load Time: The time it takes for a webpage to load after a user clicks a link.
  • API Response Time: The time it takes for an API to respond to a request.

Low latency indicates that the system is responsive, while high latency might indicate performance issues that need to be addressed.
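
Latency is usually summarized with percentiles rather than a plain average, because a handful of very slow requests can hide behind a healthy-looking mean. The sketch below uses Python’s statistics.quantiles on made-up data in which 95 requests take 100 ms and 5 take 2,000 ms.

```python
import statistics


def latency_summary(latencies_ms):
    """Return the mean and the 50th, 95th, and 99th percentile latencies."""
    # With n=100, quantiles() returns 99 cut points; index 49 is the median.
    cut_points = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": cut_points[49],
        "p95": cut_points[94],
        "p99": cut_points[98],
    }


# Hypothetical API response times in milliseconds.
latencies = [100] * 95 + [2000] * 5
summary = latency_summary(latencies)
print(summary)
```

Here the typical request (p50) is fast, but the p99 shows that some users wait far longer, which the mean alone would not reveal.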

3. Errors

The errors signal tracks failed requests in the system. It is often expressed as an error rate. In an e-commerce platform, important error metrics might include:

  • Error Rate: The percentage of requests that result in errors.
  • HTTP Status Codes: The number of requests that result in different HTTP status codes (e.g., 404 Not Found, 500 Internal Server Error).

A high error rate indicates that there are issues in the system that need to be fixed. A low error rate is a good sign, though it is worth confirming that requests are still arriving, since a system receiving no traffic also reports few errors.
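
The two error metrics above can be derived from a list of HTTP status codes. This sketch counts only 5xx responses as errors, which is an assumption; some teams also count certain 4xx responses. The status codes are sample data.

```python
from collections import Counter


def error_summary(status_codes):
    """Return (error rate in percent, count of requests per status code)."""
    counts = Counter(status_codes)
    total = len(status_codes)
    # Assumption: only server-side failures (5xx) count as errors.
    server_errors = sum(
        n for code, n in counts.items() if 500 <= code <= 599
    )
    error_rate_percent = 100.0 * server_errors / total if total else 0.0
    return error_rate_percent, counts


# Hypothetical responses: 96 successes, 2 not found, 2 server errors.
codes = [200] * 96 + [404] * 2 + [500] * 2
rate, by_code = error_summary(codes)
print(rate, dict(by_code))
```

Keeping the per-code counts alongside the overall rate helps tell a burst of 404s (often broken links) apart from a burst of 500s (often a failing service).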

4. Capacity

Capacity measures the system’s ability to handle additional work. It is often measured in terms of resource usage and limits. In an e-commerce platform, important capacity metrics might include:

  • CPU Utilization: The percentage of CPU resources being used.
  • Memory Utilization: The percentage of memory resources being used.
  • Disk Space Utilization: The percentage of disk space being used.

High capacity utilization indicates that the system is close to its limits, while low capacity utilization indicates that there is room for additional work.
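
Utilization is simply usage divided by the limit. The sketch below shows the calculation and a basic threshold check; the 80 percent threshold and the memory figures are arbitrary examples, not recommendations. Disk usage is read with Python’s standard shutil.disk_usage.

```python
import shutil


def utilization_percent(used, limit):
    """Return how much of a resource is in use, as a percentage."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit


def disk_utilization_percent(path="."):
    """Return the disk utilization of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return utilization_percent(usage.used, usage.total)


def is_near_limit(percent, threshold_percent=80.0):
    """Flag a resource whose utilization has reached the threshold."""
    return percent >= threshold_percent


# Hypothetical memory figures: 13 GiB used out of 16 GiB.
memory_percent = utilization_percent(13, 16)
print(memory_percent, is_near_limit(memory_percent))
```

Calling disk_utilization_percent() on a server returns the live figure for that machine, which can be passed through the same is_near_limit check.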

In Summary

Observability is essential for maintaining the health and performance of complex software systems, such as an e-commerce platform. By understanding and using the four key data sources (Metrics, Events, Logs, and Traces) and the four golden signals (Throughput, Latency, Errors, and Capacity), engineers can gain deep insights into their systems. Adopting observability practices helps keep the infrastructure and software we rely on every day reliable, fast, and effective.