Metric and Log Monitoring: Do You Really Need Both?

Occasionally, I’ll talk to a developer or operations engineer who says they only use APM or logs as a monitoring solution. They’ll say something like, “We don’t look at logs much since we have New Relic.” Even worse, some rely on their customers to know when something is wrong: “We just check Google Analytics to see if popular pages or actions have gone down.” The surprising thing isn’t necessarily that they have a favorite go-to tool, but that they think it’s sufficient for all their needs. Have we finally achieved the holy grail: a single pane of glass that tells us everything we need to know about operations and performance?

While no monitoring solution is omniscient, vendors from different categories have been expanding their solutions. For example, both infrastructure monitoring and APM solutions can give you insight into server performance. Both APM and log management solutions can give you a summary of application errors. Some vendors claim they offer a unified view of both metrics and logs.

Do we really need solutions for both metric and log monitoring? We will explore which is best based on the sources of data, performance, cost efficiency, and key use cases. We’ll also consider hidden trade-offs in so-called “unified solutions.”

Sources of Metrics and Log Data

Monitoring solutions are typically built around either metrics or log-based sources. Metrics are often used to monitor things like system resources and performance because they are easily quantified. On the other hand, logs offer a text record of what happened on a system or application. You might be familiar with several of these sources:

Popular Metrics Sources

  • System metrics (CPU, memory, disk)
  • Infrastructure metrics (AWS CloudWatch)
  • Web tracking scripts (Google Analytics)
  • Application agents (APM, error tracking)

Popular Log Sources

  • System logs (syslog, journald)
  • Application logs (log4j, log4net)
  • Server logs (Apache, MySQL)
  • Platform logs (AWS CloudTrail)

There are dedicated monitoring solutions for each of these sources. A server or infrastructure monitoring solution typically won’t ingest your application log files. Likewise, a log management solution won’t automatically track application performance without you explicitly coding logs to track it. However, the lines seem to continue to blur as monitoring companies add integrations for more sources, leaving users with tough decisions about which to use. Let’s take a deeper look at the differences and where their strengths lie.

Data Structure Differences

At a fundamental level, what’s the difference between a metric and a log? We often think of a metric as being a measure or number of some quantity. It has a descriptive label and a timestamp at which it was measured. It may also include dimensions that provide extra categorical information. It’s usually stored in a structured data format. Here is an example from Amazon CloudWatch in a JSON format:

     {
          "timestamp": "2016-10-05T18:31:00.000Z",
          "sampleCount": 8,
          "average": 1,
          "sum": 8,
          "minimum": 1,
          "maximum": 1,
          "unit": "Count",
          "metricName": "HealthyHostCount",
          "namespace": "AWS\/ELB",
          "availabilityZone": "us-east-1c"
     }
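To make the structure concrete, here is a minimal Python sketch that loads the data point above into a typed record with a label, a timestamp, a value, and dimensions. The `MetricPoint` class and `parse_point` function are hypothetical names chosen for illustration, and treating `namespace` and `availabilityZone` as the dimensions is our assumption, not something CloudWatch prescribes.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The CloudWatch data point from the example above, as a raw JSON string.
RAW = r"""
{
    "timestamp": "2016-10-05T18:31:00.000Z",
    "sampleCount": 8,
    "average": 1,
    "sum": 8,
    "minimum": 1,
    "maximum": 1,
    "unit": "Count",
    "metricName": "HealthyHostCount",
    "namespace": "AWS\/ELB",
    "availabilityZone": "us-east-1c"
}
"""


@dataclass
class MetricPoint:
    """Hypothetical container: a label, a timestamp, a number, and dimensions."""
    name: str
    timestamp: datetime
    value: float
    unit: str
    dimensions: dict = field(default_factory=dict)


def parse_point(raw):
    doc = json.loads(raw)
    # The trailing "Z" means UTC, so attach the timezone explicitly.
    ts = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return MetricPoint(
        name=doc["metricName"],
        timestamp=ts.replace(tzinfo=timezone.utc),
        value=doc["average"],
        unit=doc["unit"],
        # Assumption: these two fields act as the categorical dimensions.
        dimensions={
            "namespace": doc["namespace"],
            "availabilityZone": doc["availabilityZone"],
        },
    )


point = parse_point(RAW)
print(point.name, point.value, point.unit, point.dimensions)
```

Every field has a known name and type, which is what lets a metrics backend aggregate and index these points cheaply.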

We often think of logs as text files created while running an application. If you are a developer, you might also think of the text printed out on the console when you test your code. This is often unstructured or semi-structured text, and ideally it includes a header with extra metadata like the timestamp and host. Below is an example of an access log from Amazon Elastic Load Balancing (ELB).

     171.31.0.17/ip-171-31-0-17.ec2.internal" -/172.31.30.234"
     [05/Oct/2016:23:37:22 +0000] "GET /index.html HTTP/1.1" 200 118
     124 + 0 "-" "ELB-HealthChecker/1.0"

Flexibility of Modern Monitoring Solutions

Modern monitoring solutions blur the lines between logs and metrics because they tend to offer functionality that helps translate between them. For example, log management solutions can automatically parse each log field and convert it into metrics. In the example above, “118” is recognized as the size of the response, and “/index.html” is a dimension indicating the request URL. Log management tools are also able to provide summary metrics such as sums and averages, similar to what we see in the CloudWatch example.
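As a rough illustration of that translation, the Python sketch below parses the timestamp, request, status, and size fields of the sample line with a regular expression and then rolls the sizes up into summary metrics per URL. The pattern and the `parse_line` and `summarize` helpers are our own simplification for this one line format. A real log management product uses its own parsers.

```python
import re
from collections import defaultdict
from datetime import datetime

# Only the well-formed part of the sample access log line is used here.
LINE = '[05/Oct/2016:23:37:22 +0000] "GET /index.html HTTP/1.1" 200 118'

# Hypothetical pattern covering timestamp, request line, status, and size.
PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Turn one log line into a dict of typed fields, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": match.group("url"),                    # dimension
        "status": int(match.group("status")),         # dimension
        "response_bytes": int(match.group("size")),   # measured value
    }


def summarize(lines):
    """Aggregate response sizes per URL, similar to the CloudWatch summary fields."""
    sizes = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            sizes[record["url"]].append(record["response_bytes"])
    return {
        url: {"sampleCount": len(v), "sum": sum(v), "average": sum(v) / len(v)}
        for url, v in sizes.items()
    }


record = parse_line(LINE)
summary = summarize([LINE, LINE])
print(record)
print(summary)
```

Note that the summary is computed by reading every line, which is the cost a log-based tool pays for keeping the raw events.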

Many metrics-based monitoring solutions are also able to track unstructured text logs (e.g., errors) as events. They might store an example of the error as a dimension, along with the number of occurrences over time. This is how APM tools or error trackers work. As a more advanced option, SignalFx even allows you to track custom events with any data that you want.
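A simple way to picture this is a counter keyed by time bucket and error type, with one sample message kept per type. The Python sketch below assumes that shape. The `count_errors` function and the sample events are hypothetical and do not reflect how any particular APM tool or error tracker is implemented.

```python
from collections import Counter
from datetime import datetime, timezone


def count_errors(events, bucket_seconds=60):
    """Count errors per (time bucket, error type) and keep one example message per type.

    events: iterable of (timezone-aware datetime, error_type, message) tuples.
    """
    counts = Counter()
    examples = {}
    for ts, error_type, message in events:
        # Round the timestamp down to the start of its bucket (epoch seconds).
        bucket = int(ts.timestamp()) // bucket_seconds * bucket_seconds
        counts[(bucket, error_type)] += 1
        # Store only the first message seen for each error type.
        examples.setdefault(error_type, message)
    return counts, examples


def at(hour, minute, second):
    return datetime(2016, 10, 5, hour, minute, second, tzinfo=timezone.utc)


# Made-up events for illustration only.
events = [
    (at(18, 31, 5), "TimeoutError", "upstream timed out after 30s"),
    (at(18, 31, 40), "TimeoutError", "upstream timed out after 31s"),
    (at(18, 32, 10), "TimeoutError", "upstream timed out after 30s"),
    (at(18, 32, 15), "KeyError", "missing key 'user_id'"),
]

counts, examples = count_errors(events)
for (bucket, error_type), n in sorted(counts.items()):
    print(datetime.fromtimestamp(bucket, tz=timezone.utc), error_type, n)
```

The individual messages are discarded after counting, which keeps storage small but means you cannot later see what the second or third timeout looked like.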

Additionally, both log-based and metrics-based monitoring tools offer at least basic features on the frontend UI, including some elements of analytics and visualizations. You can plot a time-series graph of average CPU usage over time with almost any modern solution.

Optimized for Efficiency or Level of Detail

At a high level, metrics-based solutions tend to be more efficient when speed and data storage costs are considered, while log-based solutions are better for drilling down into the details after the fact. These differences are often driven by data-processing needs on the backend.

 

Metrics-based solutions are typically faster and more efficient because they can summarize data using statistics and sampling. They don’t need to store or process every single data point, only enough to give a sufficient confidence bound on each time period. They can easily generate summary statistics ahead of time, making reports much faster to load. They can also progressively reduce the data resolution as data gets older so that, for example, new data is stored at one-second resolution while old data is stored at one-hour resolution or coarser. Due to the lower data volumes, it’s more economical to use in-memory databases, giving you faster query results. Metrics are great for showing trends over time, and can quickly trigger alerts as new data comes in.
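The resolution reduction described above can be sketched in a few lines of Python. The `rollup` function below is a hypothetical helper, not any vendor's storage engine. It groups one-second points into fixed-size buckets and keeps only summary statistics for each bucket, and the input series is synthetic.

```python
from collections import defaultdict


def rollup(points, resolution_seconds):
    """Downsample (epoch_seconds, value) pairs into fixed-size time buckets.

    Each bucket keeps summary statistics instead of the raw points.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % resolution_seconds].append(value)
    return [
        {
            "timestamp": start,
            "sampleCount": len(values),
            "sum": sum(values),
            "minimum": min(values),
            "maximum": max(values),
            "average": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    ]


# Two hours of synthetic one-second data (values cycle through 0..9).
raw = [(t, t % 10) for t in range(7200)]
hourly = rollup(raw, 3600)

print(len(raw), "raw points ->", len(hourly), "hourly summaries")
print(hourly[0])
```

Storing the sum and sample count alongside the average is a deliberate choice: it lets you merge hourly buckets into daily ones later without going back to the raw points.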

Log-based solutions tend to be better when you need the flexibility to store raw text or when you need to explore the full record of events for troubleshooting, auditing, or support. This requires higher data storage cost because each event is stored and queried individually. Log-based solutions are often slower than metrics-based solutions because each record is processed when making queries and calculating trends. Log management tools can be built on top of search engines like Elasticsearch so they can offer full text searches. However, they often store data on slower disk-based storage, which is more economical for high data volumes. This offers the advantage of being able to drill down to each individual event, so you can see the exact response to any request at any time.

Some log-based solutions are attempting to unify logs and metrics. If they convert metrics into logs and store them in their search engine, the disadvantage is that they do not get the efficiency benefits of metrics-based solutions. The search results will return slower, and the infrastructure costs will be higher.

When To Use Metrics or Logs

If you’re building a new application and can choose whether to implement metrics or logging, which should you pick? They are best suited for different use cases.

Typically, metrics are best used for monitoring, profiling, and alerting. The efficiency of summarizing data makes them great for monitoring and performance profiling because you can economically store long retention periods of data, giving you dashboards that look back over time. They are also great for alerting because they are fast and can trigger notifications almost instantaneously without the need for expensive queries. Metrics represent roughly 90% of the monitoring workload.
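Part of the reason alerts on metrics are cheap is that each new data point can be checked against a rule in constant time, with no query over historical data. The Python sketch below shows one possible rule under our own assumptions: it fires when the last few values all exceed a threshold and clears when a value drops back. The `ThresholdAlert` class is hypothetical and is not a description of how any product evaluates alerts.

```python
from collections import deque


class ThresholdAlert:
    """Fire when the last `window` values all exceed `threshold` (hypothetical rule)."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # only the newest values are kept
        self.firing = False

    def observe(self, value):
        """Process one new data point and return "fire", "clear", or None."""
        self.recent.append(value)
        window_full = len(self.recent) == self.recent.maxlen
        breached = window_full and all(v > self.threshold for v in self.recent)
        if breached and not self.firing:
            self.firing = True
            return "fire"
        if self.firing and value <= self.threshold:
            self.firing = False
            return "clear"
        return None


# Made-up CPU utilization readings, in percent.
cpu_percent = [50, 95, 96, 97, 98, 60, 40]
alert = ThresholdAlert(threshold=90, window=3)
transitions = [alert.observe(v) for v in cpu_percent]
print(transitions)
```

Requiring several consecutive breaches before firing is one way to avoid paging someone for a single noisy reading.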

Logs give you the extra level of detail necessary for troubleshooting, debugging, support, and auditing. As a starting point, you might find out from your metrics dashboard or alerts that a problem exists and when it started. If that’s not enough information to find the root cause, you may need to drill down deeper into the log data to see exactly what happened as the program was executing. Logs can help you debug problems even on your production systems by showing the execution path. Additionally, your support team can search for exactly what happened for each individual customer and transaction because the events are not rolled up into an aggregate. They are also great for auditing because you have a complete record for security and legal compliance. Logs are typically used in 10% of monitoring cases.
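The drill-down step can be pictured as filtering raw log records to a window around the time an alert fired. The Python sketch below assumes the logs have already been parsed into dictionaries with a `timestamp` and a `message`. The `drill_down` helper, the window sizes, and the sample records are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def drill_down(records, alert_time, before=timedelta(minutes=5),
               after=timedelta(minutes=1), contains=None):
    """Return the raw log records around an alert, optionally filtered by text."""
    start, end = alert_time - before, alert_time + after
    return [
        r for r in records
        if start <= r["timestamp"] <= end
        and (contains is None or contains in r["message"])
    ]


def at(hour, minute, second):
    return datetime(2016, 10, 5, hour, minute, second, tzinfo=timezone.utc)


# Made-up log records for illustration only.
records = [
    {"timestamp": at(18, 25, 10), "message": "GET /index.html 200"},
    {"timestamp": at(18, 29, 58), "message": "ERROR upstream timed out for request 7f3a"},
    {"timestamp": at(18, 30, 30), "message": "ERROR upstream timed out for request 7f3b"},
    {"timestamp": at(18, 40, 0), "message": "ERROR disk full"},
]

matches = drill_down(records, alert_time=at(18, 31, 0), contains="ERROR")
for r in matches:
    print(r["timestamp"].isoformat(), r["message"])
```

Because every event is kept individually, the result includes the specific requests that failed, which is the detail an aggregate metric cannot give you.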

 

To run a highly reliable service for your customers, you can’t leave out any of these use cases. If you only have metrics monitoring and alerting, you may not have enough detail to debug the tougher problems that will inevitably pop up. This can lengthen your downtime and cause additional customer frustration. Likewise, if you only have logging, you’re probably not getting the best performance for monitoring and alerting.

SignalFx for Metrics Monitoring

SignalFx is specifically built to handle all cloud monitoring objectives, inclusive of infrastructure metrics, custom application metrics, and even user and business metrics. It’s designed with all the advantages of metrics in mind. Dashboards update in real time, analytics happen as the data streams into the system, dimensions are virtually unlimited for specificity of visualization, and alerts fire instantly. None of the log-based vendors can match it on performance. The SignalFlow analytics engine powers a broad range of calculations, which allow more intelligent, meaningful alerts and reports.

Integrations with other monitoring tools allow you to create a complete and consistent view of the entire application lifecycle, from development to post-mortem, placing data from your production environment alongside pre-flight and event data from your favorite APM and logging tools like New Relic, AppDynamics, and Splunk, all in SignalFx’s real-time dashboards.

 

About the authors

Jason Skowronski

Jason is a technology veteran who has worked at Loggly and Amazon and was co-founder of Keona Health. He started his career as a software developer and now works in product management and content marketing, focusing on DevOps solutions.
