As of v1.32.0, Netdata comes with ML-powered anomaly detection built in and available out of the box, with zero configuration required (ML was enabled by default in v1.35.0-29-nightly in this PR; previously it required a one-line config change).
This means that in addition to collecting raw value metrics, the Netdata agent will also produce an anomaly-bit every second, which will be 100 when recent raw metric values are considered anomalous by Netdata and 0 when they look normal. Once we aggregate beyond one-second intervals, this aggregated anomaly-bit becomes an "anomaly rate".
To be as concrete as possible, the API call below shows how to access the raw anomaly bit of the system.cpu chart from the london.my-netdata.io Netdata demo server. Passing options=anomaly-bit returns the anomaly bit instead of the raw metric value.
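As a rough illustration of such a request, the sketch below builds a query URL in Python. Only the host, the chart name, options=anomaly-bit and points=1 come from this guide; the /api/v1/data path, the chart and after parameter names, and the anomaly_bit_url helper are assumptions made for the example and should be checked against the agent's API documentation.

```python
from urllib.parse import urlencode


def anomaly_bit_url(host, chart, after=-10, points=None):
    """Build a data-query URL that requests anomaly bits instead of raw values.

    Hypothetical helper: the endpoint path and the `chart`/`after` parameter
    names are assumptions, not taken from this guide.
    """
    params = {"chart": chart, "after": after, "options": "anomaly-bit"}
    if points is not None:
        # points=1 collapses the window into a single aggregated value
        params["points"] = points
    return f"https://{host}/api/v1/data?{urlencode(params)}"


print(anomaly_bit_url("london.my-netdata.io", "system.cpu"))
```

Calling the helper with points=1 appends that parameter to the same URL, so the response would hold one aggregated value for the window rather than one value per second.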
If we aggregate the above to just 1 point by adding points=1 we get an "Anomaly Rate":
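The arithmetic behind that aggregation can be sketched in a few lines of Python. This assumes the aggregation is a plain average of the per-second anomaly bits; the sample values are made up for illustration and are not real output from the demo server.

```python
def anomaly_rate(anomaly_bits):
    """Average per-second anomaly bits (each 0 or 100) into a percentage."""
    if not anomaly_bits:
        raise ValueError("need at least one anomaly bit")
    return sum(anomaly_bits) / len(anomaly_bits)


# Ten seconds of made-up anomaly bits: two anomalous seconds out of ten.
bits = [0, 0, 100, 100, 0, 0, 0, 0, 0, 0]
print(anomaly_rate(bits))  # 20.0
```

Because each bit is either 0 or 100, the average reads directly as the percentage of seconds in the window that were flagged as anomalous.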
The fundamentals of Netdata's anomaly detection approach and implementation are covered in much more detail in the agent ML documentation.
This guide will explain how to get started using these ML based anomaly detection capabilities within Netdata.
The Anomaly Advisor is the flagship anomaly detection feature within Netdata. In the "Anomalies" tab of Netdata you will see an overall "Anomaly Rate" chart that aggregates the node-level anomaly rate for all nodes in a space. The aim of this chart is to make it easy to quickly spot periods of time where the overall "node anomaly rate" is elevated in some unusual way, and which node or nodes this relates to.
Once an area on the Anomaly Rate chart is highlighted, Netdata will append a "heatmap" to the bottom of the screen that shows which metrics were more anomalous in the highlighted timeframe. Each row in the heatmap consists of an anomaly rate sparkline graph that can be expanded to reveal the raw underlying metric chart for that dimension.
Embedded Anomaly Rate Charts
Charts in both the Overview and single node dashboard tabs also expose the underlying anomaly rates for each dimension so users can easily see if the raw metrics are considered anomalous or not by Netdata.
Pressing the anomalies icon (next to the information icon in the chart header) will expand the anomaly rate chart to make it easy to see how the anomaly rate for any individual dimension corresponds to the raw underlying data. In the example below we can see that the spike in system.pgpgio|in corresponded to the anomaly rate for that dimension jumping to 100% for a short period of time until the spike passed.
Anomaly Rate Based Alerts
It is possible to use the anomaly-bit when defining traditional Alerts within Netdata. The anomaly-bit is just another options parameter that can be passed as part of an alarm line lookup.
You can see some example ML based alert configurations below:
- Anomaly rate based CPU dimensions alarm
- Anomaly rate based CPU chart alarm
- Anomaly rate based node level alarm
- More examples in the /health/health.d/ml.conf file that ships with the agent.
Check out the resources below to learn more about how Netdata is approaching ML: