Suppose we wish to compare values recorded on Monday at 10 a.m. with the values recorded at 10 a.m. on preceding Mondays. In the alert condition, this corresponds to a value of 1w (one week) for the parameter Cycle length. The weekly cyclicity shown above is so common that one week is the default value for cycle length.
The parameters discussed so far appear immediately below the condition summary, which explains when the alert will trigger, and changes as parameters are changed.
The Details of Parameters
In addition to the cycle length, we need to specify how many previous cycles are used to generate a baseline for comparison. To compare values recorded on Monday at 10 a.m. with the values recorded at 10 a.m. on the preceding four Mondays, for example, we use the value 4 for Number of previous cycles in the alert condition.
To avoid defining a threshold based on just four values, we use windows for both current and historical data. So, for example, to compare the 9:45-10:00 a.m. windows from the preceding four Mondays to the same window on the current Monday, we use the value of 15m (15 minutes) for the parameter Current window.
To produce an alert condition, we will construct ranges of “normal” values and alert when the current signal values are outside that range. The option Mean plus percentage change for the parameter Normal based on is one of the methods of defining a range of normal values for the Historical Anomaly alert condition.
- The first step is to take the mean of each window, historical and current. In our example, this gives us four historical numbers, each summarizing a 15-minute window and spaced one week apart, and one current number, the 15-minute rolling mean.
- The next step is to take either the mean or median of the historical numbers, depending on the value of Ignore historical extremes (Yes uses the median, No uses the mean). We choose to ignore historical extremes and will explain why in the next section.
- The final step is to construct a range of normal values; this is expressed in terms of percentage change (of the median of the four rolling means).
Choosing to Alert when the value is Too low with a value of 25% for Trigger threshold, for example, will alert when the 15-minute rolling mean is at least 25% smaller than the median of the historical means. Choosing Too high alerts when the rolling mean is at least 25% larger, and Too high or Too low (our choice) alerts when it is at least 25% larger or smaller.
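The three steps and the trigger rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not SignalFx's implementation; the function names, the `direction` argument, and the sample data are hypothetical, and the sketch assumes a positive baseline.

```python
from statistics import mean, median

def historical_baseline(historical_windows, ignore_extremes=True):
    """Step 1: mean of each historical window. Step 2: median or mean of those means."""
    window_means = [mean(w) for w in historical_windows]
    return median(window_means) if ignore_extremes else mean(window_means)

def is_anomalous(current_window, historical_windows, threshold_pct=25.0,
                 direction="both", ignore_extremes=True):
    """Step 3: compare the current rolling mean to the baseline by percentage change."""
    baseline = historical_baseline(historical_windows, ignore_extremes)
    current = mean(current_window)
    # Assumption: baseline is positive, so percentage change is well defined.
    change_pct = 100.0 * (current - baseline) / baseline
    too_high = change_pct >= threshold_pct
    too_low = change_pct <= -threshold_pct
    if direction == "high":
        return too_high
    if direction == "low":
        return too_low
    return too_high or too_low

# Four previous Mondays, each a (shortened) 9:45-10:00 a.m. window of samples.
previous_mondays = [[100, 100, 100], [110, 110, 110], [90, 90, 90], [100, 100, 100]]
print(is_anomalous([70, 70, 70], previous_mondays))  # 30% below the baseline of 100
print(is_anomalous([80, 80, 80], previous_mondays))  # 20% below, inside the normal range
```

With a cycle length of one week, four previous cycles, and a 15-minute current window, `previous_mondays` would hold the samples from each of the four historical windows.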
While power users will feel comfortable experimenting with all of these parameters, we expect that many high-value use cases can be tackled by tuning the percentage change threshold alone.
Excluding Previous Incidents
We generally recommend ignoring historical extremes (i.e., using the median) since using the mean may render the threshold useless, for example, if there was an incident last week. In our running example, if we use the median, the threshold would only be contaminated if there were two incidents spaced exactly one, two, or three weeks apart.
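A small numeric sketch makes the difference concrete. The numbers below are invented for illustration and the helper `percent_change` is hypothetical; they are not taken from the customer example that follows.

```python
from statistics import mean, median

def percent_change(current, baseline):
    """Percentage change of the current value relative to the baseline."""
    return 100.0 * (current - baseline) / baseline

# Means of the four historical windows; last week's window contains an incident.
one_incident = [1000, 1020, 980, 400]
current = 700  # this week's rolling mean, a genuine drop

mean_baseline = mean(one_incident)      # pulled down by the incident
median_baseline = median(one_incident)  # midpoint of the two middle values

# Against the mean, the drop looks smaller than 25% and no alert fires.
print(percent_change(current, mean_baseline))
# Against the median, the drop exceeds 25% and the alert fires.
print(percent_change(current, median_baseline))

# Two incidents among the four windows are enough to contaminate the median too.
two_incidents = [1000, 400, 980, 420]
print(median(two_incidents))
```

With four historical values the median is the average of the two middle values, so a single extreme week cannot move it far, while two extreme weeks on the same side can.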
The benefit of using the median is demonstrated in the following example from one of our customers. The metric being monitored is the total number of messages sent by users of a social networking platform. Drops in this metric mean lower user activity, which may indicate trouble accessing the application.
In this chart, we can see the drop in the metric (the blue plot on the bottom) and the red triangle indicating an alert was triggered. Coincidentally, around the same time two weeks ago, the metric experienced a substantial, but less drastic, drop (the pink plot).
In the alert detail, we can see that the range of normal values (defined by upper and lower thresholds) does not react to the pink plot’s sudden descent. Had this prior incident influenced the threshold, the range of normal values would have been dragged downward, and the alert would have been delayed (or failed to trigger altogether). Note that the current plot is the rolling mean of the original signal.
One of the more powerful features of SignalFx is the ability to set distinct trigger and clear thresholds for alerts. When a signal hovers around the trigger threshold, other monitoring systems typically produce a sequence of alerts that clear and re-trigger in rapid succession. SignalFx alerts, on the other hand, are not “flappy” due to the ability to clear an alert on an explicitly set condition rather than simply the negation of the trigger condition.
Built-in Alert Conditions exploit this feature. In this example, if we set Clear threshold to 15%, an alert will not clear until the 15-minute rolling mean is within 15% of the historical norm. Using distinct trigger and clear thresholds gives cloud operations greater confidence that when an alert clears, a new alert will not trigger moments later.
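The effect of distinct trigger and clear thresholds can be sketched as a small state machine. This is an assumed model of the behavior described above, not SignalFx code; `alert_states` and the sample deviations are hypothetical.

```python
def alert_states(deviations_pct, trigger_pct=25.0, clear_pct=15.0):
    """Return the alert state (True = active) after each observed deviation.

    deviations_pct: percentage deviation of the rolling mean from the
    historical norm at each evaluation, positive or negative.
    """
    active = False
    states = []
    for deviation in deviations_pct:
        magnitude = abs(deviation)
        if not active and magnitude >= trigger_pct:
            active = True   # trigger condition met
        elif active and magnitude <= clear_pct:
            active = False  # clears only once back within clear_pct of the norm
        states.append(active)
    return states

# A signal hovering around the 25% trigger threshold before recovering.
hovering = [10, 26, 24, 26, 20, 14, 10]

# Distinct thresholds: one alert that stays active until the signal recovers.
print(alert_states(hovering, trigger_pct=25.0, clear_pct=15.0))

# Clearing on the negation of the trigger condition: the alert flaps.
print(alert_states(hovering, trigger_pct=25.0, clear_pct=25.0))
```

In the first call the alert triggers once and clears once; in the second, the same signal produces two separate alerts in quick succession.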
Powerful Alerts for Everyone
Flexible analytics must be a component of any cloud monitoring solution: making sense of the stream of metrics produced by modern computing environments is at its heart an analytics challenge. With Built-in Alert Conditions and Alert Preview, users of SignalFx can rapidly experiment with different parameters, learn how the alert would have behaved, and easily customize complex analytical solutions in order to deploy alerts into production with great confidence.