What is Kafka?

Apache Kafka is an open-source, distributed publish-subscribe message bus designed to be fast, scalable, and durable. Kafka can support a large number of consumers and retain large volumes of data with very little overhead. It’s used for log aggregation, message brokerage, activity tracking, operational metrics, and stream processing.

Sending Kafka Metrics

Use collectd and the collectd-kafka plugin to capture Kafka metrics, particularly for brokers and topics. To know what other services are producing or consuming messages, wrap the client in an instrumented layer. SignalFx provides built-in Kafka monitoring dashboards with useful metrics and a template for topic names.
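
As a rough illustration of what an instrumented layer can look like, the sketch below wraps a producer and counts messages and bytes per service and topic. It assumes the wrapped client exposes a send(topic, value) method; the class names, the service_name label, and the FakeProducer stand-in are hypothetical and exist only so the example runs without a Kafka cluster.

```python
from collections import Counter


class InstrumentedProducer:
    """Wraps any producer object exposing send(topic, value) and counts traffic.

    The wrapped interface and the service_name label are assumptions made
    for illustration; adapt them to the client library you actually use.
    """

    def __init__(self, producer, service_name):
        self._producer = producer
        self.service_name = service_name
        self.messages = Counter()  # (service, topic) -> message count
        self.bytes = Counter()     # (service, topic) -> payload bytes

    def send(self, topic, value):
        key = (self.service_name, topic)
        self.messages[key] += 1
        self.bytes[key] += len(value)
        # Delegate to the real client after recording the counters.
        return self._producer.send(topic, value)


class FakeProducer:
    """Hypothetical stand-in producer so the sketch runs without Kafka."""

    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))
        return len(self.sent)


def demo():
    producer = InstrumentedProducer(FakeProducer(), service_name="checkout")
    producer.send("orders", b"order-1")
    producer.send("orders", b"order-22")
    return producer
```

The counters can then be reported alongside the broker metrics, which makes it possible to see which service is producing to which topic.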

Monitoring Kafka

There are two primary leading indicators of Kafka issues: log flush latency and under-replicated partitions. In most cases, the changes show up at the broker level rather than the cluster level. Although there’s value in investigating bytes per message and approximate end-to-end latency, alerting on just the two leading indicators will result in meaningful notifications.

Log Flush Latency: The longer it takes to flush the log to disk, the more the pipeline backs up, causing low throughput and high latency. For log flush, monitor changes in the 95th percentile of flush time. If this number increases, end-to-end latency will often balloon and affect not just Kafka, but the performance of the entire system. Because some topics are more latency-sensitive than others, set alert conditions on a per-topic basis.
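
A minimal sketch of the 95th-percentile check described above, assuming flush times are available as lists of millisecond samples. The nearest-rank percentile method, the function names, and the max_ratio threshold of 1.5 are illustrative assumptions, not values taken from Kafka or SignalFx.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 gives the 1-based nearest rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def flush_latency_alert(baseline_ms, current_ms, max_ratio=1.5):
    """True when the current p95 exceeds the baseline p95 by max_ratio.

    max_ratio is a hypothetical tuning knob; latency-sensitive topics
    would typically get a lower value.
    """
    return percentile(current_ms, 95) > max_ratio * percentile(baseline_ms, 95)
```

Comparing against a baseline rather than a fixed number keeps the condition meaningful across brokers with different disks.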

Under-Replicated Partitions: This indicates that replication is not keeping up as configured, which adds latency because consumers don’t get the data they need until messages are replicated. It also suggests vulnerability to data loss if a partition leader fails. Any under-replicated partition that persists for more than a minute should be treated as a problem. For alerting, use a simple greater-than-zero threshold against the metric exposed from Kafka, combined with a duration condition.
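
The threshold-plus-duration rule can be sketched as follows, assuming the metric arrives as time-ordered (timestamp in seconds, count) pairs. The function name and sample format are hypothetical; the 60-second default mirrors the one-minute guidance above.

```python
def under_replicated_alert(samples, min_duration_s=60):
    """Return True once the count has stayed above zero for longer than min_duration_s.

    samples: iterable of (timestamp_seconds, under_replicated_count),
    ordered by timestamp.
    """
    breach_start = None
    for ts, count in samples:
        if count > 0:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # metric recovered, reset the timer
    return False
```

The duration condition filters out brief blips, such as those seen during a rolling restart, while still catching sustained replication lag.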

The SignalFx Difference

Creating Derived Metrics: Individual metrics sometimes don’t provide enough insight on their own to serve as the basis of meaningful alerts. For example, bytes per second and messages per second or bytes in and bytes out alone are often not enough information to identify a problem. SignalFx makes it easy to derive and visualize numbers like bytes per message and read amplification as real-time custom metrics based on other streaming data without writing code or having to wait to perform the calculations after the fact.
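
The arithmetic behind these derived metrics is straightforward. The sketch below is only an illustration of the ratios involved, not how SignalFx computes them; in particular, defining read amplification as bytes out divided by bytes in is an assumption, and the function names are hypothetical.

```python
def bytes_per_message(bytes_per_sec, messages_per_sec):
    """Average message size; None when no messages flowed in the interval."""
    if messages_per_sec == 0:
        return None
    return bytes_per_sec / messages_per_sec


def read_amplification(bytes_out_per_sec, bytes_in_per_sec):
    """Assumed definition: bytes read from the broker per byte written to it."""
    if bytes_in_per_sec == 0:
        return None
    return bytes_out_per_sec / bytes_in_per_sec
```

A sudden jump in bytes per message can reveal a misbehaving producer even when bytes per second and messages per second each look normal on their own.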

Scaling Without Message Loss: Scaling Kafka can get complicated when brokers are heterogeneous. SignalFx helps reduce migration time and risk by monitoring network in and out from collectd on the source and target brokers. Use those metrics to control the pace of rebalancing and prevent resource starvation, so that app-level performance is not affected by service-level changes.

Curating Metrics and Getting Visibility: There are many metrics specific to Kafka, and knowing where to start and what to monitor can be difficult. SignalFx curates the Kafka metrics that matter right out of the box alongside data from the other applications and cloud services in your infrastructure. SignalFx also provides pre-built dashboards and meaningful alerting that give you a running start on monitoring Kafka in your environment.

Kafka Metrics

Bytes In

Bytes Out

Messages In

Active Controllers

Request Queue

Under Replicated Partitions

Log Flushes

Log Flush Time in Ms

Log Flush Time in Ms - 95th Percentile

Produce Total Time

Produce Total Time - 99th Percentile

Produce Total Time - Median

Fetch Consumer Total Time

Fetch Consumer Total Time - 99th Percentile

Fetch Consumer Total Time - Median

Fetch Follower Total Time

Fetch Follower Total Time - 99th Percentile

Fetch Follower Total Time - Median

Start Your Kafka Monitoring Trial

Try SignalFx for 14 days. No credit card required.