Elasticsearch is an open-source search server based on Apache Lucene. It is distributed by design and built for easy deployment, fast queries against large data volumes, multitenancy, and horizontal scale. Elasticsearch is used primarily as a NoSQL storage, indexing, and search utility for unstructured documents, and it can also serve as a log analysis tool as part of the Elastic Stack.
Sending Elasticsearch Metrics
Use collectd and the collectd-elasticsearch plugin to capture Elasticsearch metrics and track data by index and cluster. SignalFx provides built-in Elasticsearch monitoring dashboards displaying useful Elasticsearch production metrics at the node, cluster, and cross-cluster levels.
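Under the hood, a collector like the collectd-elasticsearch plugin polls Elasticsearch's stats endpoints and turns JSON fields into metric values. A minimal Python sketch of that extraction step, using an illustrative payload in the shape returned by the real `GET /_cluster/health` API (no live cluster required; the sample values are made up):

```python
import json

# Sample payload shaped like Elasticsearch's GET /_cluster/health
# response (field names are real; the values are illustrative).
SAMPLE_HEALTH = json.loads("""
{
  "cluster_name": "demo",
  "status": "yellow",
  "number_of_nodes": 3,
  "active_primary_shards": 10,
  "active_shards": 15,
  "unassigned_shards": 5
}
""")

def extract_health_metrics(health: dict) -> dict:
    """Pull the numeric fields a collectd-style poller would emit."""
    return {
        "cluster.number_of_nodes": health["number_of_nodes"],
        "cluster.active_shards": health["active_shards"],
        "cluster.unassigned_shards": health["unassigned_shards"],
    }

print(extract_health_metrics(SAMPLE_HEALTH))
```

A real poller would fetch this JSON over HTTP on an interval and dispatch each key/value pair as a timestamped metric; the extraction logic is the same.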
With a large number of nodes, it’s important to know whether problems are cluster-wide or machine-specific. There are three sources of issues to contend with: cluster, shard, and node. Keeping an eye on all three is the best way to manage availability and performance.
Infrastructure (Not Just App): When monitoring Elasticsearch, many performance issues come down to a noisy neighbor, flaky network I/O, or another problem with the Amazon Machine Image or virtual machine underlying a given node on AWS. Although modern architectures call for modern monitoring approaches, it’s still useful to know when these less-modern problems arise.
Field Data vs. Doc Values: Large spikes in memory consumption can cause problems during garbage collection. A look at individual caches often reveals a problem with the field data cache, which consumes JVM heap on every node across the cluster. In that case, move to doc values, which are stored on disk.
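For illustration, here is the mapping-level difference expressed as Python dicts. The `fielddata` and `doc_values` semantics shown are standard Elasticsearch mapping behavior; the index field name is hypothetical:

```python
# Aggregating or sorting on a "text" field requires fielddata, which
# is loaded into JVM heap on each node -- the memory-spike culprit.
heap_hungry_mapping = {
    "properties": {
        "city": {"type": "text", "fielddata": True}  # on-heap field data
    }
}

# A "keyword" field uses doc values, which are built at index time and
# stored on disk (doc_values defaults to true for keyword fields).
doc_values_mapping = {
    "properties": {
        "city": {"type": "keyword"}  # on-disk doc values, no heap spike
    }
}

print(heap_hungry_mapping["properties"]["city"]["type"],
      doc_values_mapping["properties"]["city"]["type"])
```

Switching aggregation targets from `text`-with-fielddata to `keyword` fields moves that memory pressure off the heap and out of the garbage collector's way.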
Spikes in Thread Pool Rejections: At some point, you’ll need to re-shard your index, and it’s not uncommon to hit thread pool rejections on the indexing queue as you batch re-index your documents. To avoid piling load onto one shard at a time, randomly distribute the query order as you build each batch of documents to index, rather than requesting query results in shard order.
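The shuffling step above can be sketched in a few lines of Python. This is an illustrative helper (the function name and batch size are assumptions), showing only the idea of randomizing document order before building bulk batches:

```python
import random

def build_reindex_batches(docs, batch_size=500, seed=None):
    """Shuffle documents before batching so each bulk request spreads
    writes across shards, instead of hammering one shard at a time as
    indexing in shard/query order tends to do."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

# 1200 documents become three randomized bulk batches.
batches = build_reindex_batches(range(1200), batch_size=500, seed=42)
print([len(b) for b in batches])  # → [500, 500, 200]
```

Each batch now contains documents destined for many shards, so no single shard's indexing queue absorbs an entire bulk request at once.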
The SignalFx Difference
Monitoring Cluster Status: Noisy alert storms pose a huge problem for monitoring Elasticsearch. On its own, an Elasticsearch cluster experiencing issues would notify on every node, one after the next. SignalFx makes it easy to isolate the signal by assigning a score to each host’s cluster status and alerting on the max value. If all the nodes in the cluster change to yellow or red, you get only one meaningful notification.
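A toy sketch of that scoring idea in Python. The green=0 / yellow=1 / red=2 mapping is an assumption for illustration, not SignalFx's internal representation; the point is that taking the max collapses many per-node signals into one:

```python
# Hypothetical score mapping: higher means worse.
STATUS_SCORE = {"green": 0, "yellow": 1, "red": 2}

def cluster_alert_level(node_statuses):
    """Collapse per-node statuses into a single cluster-level score,
    so one alert fires instead of one alert per node."""
    return max(STATUS_SCORE[s] for s in node_statuses)

# Every node degrades at once, but only one value crosses the threshold.
print(cluster_alert_level(["yellow", "yellow", "red"]))  # → 2
print(cluster_alert_level(["green", "green", "green"]))  # → 0
```

Alerting on the single max value instead of on each node's status is what turns a storm of per-node notifications into one actionable signal.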
Duration Conditions: Applying time requirements to alert rules helps determine whether an issue actually requires attention. Elasticsearch can recover a failed machine by restarting replicas on another node. SignalFx lets you set duration thresholds so you aren’t pulled into troubleshooting an issue that Elasticsearch’s auto-recovery will fix on its own.
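The duration-condition logic can be sketched as follows. This is a simplified, hypothetical check (not SignalFx's evaluation engine): it alerts only when every sample in the trailing window breaches the threshold, so a transient blip during replica recovery stays silent:

```python
def breaches_for_duration(samples, threshold, duration):
    """samples: list of (timestamp_seconds, value) in time order.
    Return True only if every sample within the trailing `duration`
    seconds exceeds `threshold` -- i.e. the condition held for the
    whole window, not just for one momentary spike."""
    if not samples:
        return False
    end = samples[-1][0]
    window = [v for t, v in samples if t >= end - duration]
    return bool(window) and all(v > threshold for v in window)

# A brief spike that auto-recovery clears does not alert; a five-minute
# sustained breach does (threshold 90, duration window 300 seconds).
blip = [(0, 99), (30, 10), (60, 10)]
sustained = [(t, 99) for t in range(0, 301, 30)]
print(breaches_for_duration(blip, 90, 300),
      breaches_for_duration(sustained, 90, 300))  # → False True
```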
Scaling and Capacity Planning: Knowing when to scale Elasticsearch requires plenty of runway to manage re-sharding, which can be challenging if you don’t want to lose availability. SignalFx’s pre-built dashboards make it easy to compare document growth to storage growth and absolute storage consumption in real time, essentially modeling remaining capacity so you can plan for the future before you suffer performance problems.
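The remaining-capacity model above can be approximated with a simple extrapolation. This sketch assumes linear growth, which real clusters rarely follow exactly; treat it as a planning floor, and note that the function name and figures are illustrative:

```python
def days_until_full(used_bytes, capacity_bytes, daily_growth_bytes):
    """Linear extrapolation of remaining storage runway.
    Returns the number of days before the cluster hits capacity
    at the current growth rate."""
    if daily_growth_bytes <= 0:
        return float("inf")  # not growing: no runway pressure
    return (capacity_bytes - used_bytes) / daily_growth_bytes

# 600 GB used of a 1 TB cluster, growing 20 GB/day:
print(days_until_full(600e9, 1000e9, 20e9))  # → 20.0
```

Twenty days of runway may sound comfortable, but if re-sharding without losing availability takes a week of careful work, the alert threshold on this derived signal needs to fire well before the runway runs out.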
Curating Metrics and Getting Visibility: There are many metrics specific to Elasticsearch, and knowing where to start and what to monitor can be difficult. SignalFx curates the Elasticsearch metrics that matter right out of the box alongside data from the other applications and cloud services in your infrastructure. SignalFx provides dashboards and alerting that give you a fast start on monitoring Elasticsearch in your environment.