
Machine Learning Models and Anomaly Detection in Netdata

Overview

Machine learning helps detect patterns and anomalies in large datasets, so issues can be identified early, before they escalate.

At Netdata, we developed Anomaly Advisor, a tool designed to improve troubleshooting, reduce mean time to resolution, and prevent issues from escalating. You can access it through the Netdata dashboard.

Note

To configure ML on your nodes, check the ML configuration documentation.


Design Principles

Netdata’s machine learning models follow these key principles:

| Principle | Description |
|---|---|
| Unsupervised Learning | Models operate independently without requiring user input. |
| Real-time Performance | While ML impacts CPU usage, it won't compromise Netdata's high-fidelity, real-time monitoring. |
| Seamless Integration | ML-based insights are fully embedded into Netdata's existing infrastructure monitoring and troubleshooting. |
| Assistance Over Alerts | ML helps users investigate potential issues rather than triggering unnecessary alerts. It won't wake you up at 3 AM for minor anomalies. |

Types of Anomalies Detected

Netdata identifies several anomaly types:

  • Point Anomalies: Unusually high or low values compared to historical data.
  • Contextual Anomalies: Sequences of values that deviate from expected patterns.
  • Collective Anomalies: Multivariate anomalies where a combination of metrics appears off.
  • Concept Drifts: Gradual shifts leading to a new baseline.
  • Change Points: Sudden shifts resulting in a new normal state.

How Netdata’s ML Models Work

Training & Detection

Once ML is enabled, Netdata trains an unsupervised model for each metric. By default, this model is a k-means clustering algorithm trained on the last 4 hours of data. Instead of just analyzing raw values, the model works with preprocessed feature vectors to improve detection accuracy.
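
As a rough illustration of what working with "preprocessed feature vectors" can look like, the sketch below differences a raw series, stacks lagged values into vectors, fits a small two-cluster k-means, and scores a point by its distance to the nearest cluster centre. The differencing step, the lag count, the choice of k = 2, and every function name here are assumptions made for illustration; this is not Netdata's actual implementation.

```python
import math
import random


def make_feature_vectors(values, lags=3):
    """Difference the series, then stack each diff with its `lags` predecessors."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [diffs[i - lags:i + 1] for i in range(lags, len(diffs))]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k=2, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: distance(v, centres[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres


def anomaly_score(vector, centres):
    """Distance to the closest learned centre; larger means more unusual."""
    return min(distance(vector, c) for c in centres)


# Hypothetical training data: a steady sawtooth metric.
training = [10 + (i % 5) for i in range(200)]
centres = kmeans(make_feature_vectors(training))

normal_vector = make_feature_vectors([10, 11, 12, 13, 14])[0]
spike_vector = make_feature_vectors([10, 11, 12, 13, 90])[0]
normal_score = anomaly_score(normal_vector, centres)
spike_score = anomaly_score(spike_vector, centres)
```

Here the spike produces a feature vector far from anything seen in training, so its score is much larger than the score of the familiar pattern.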

To reduce false positives, Netdata trains multiple models per time-series, covering over two days of data. An anomaly is flagged only if all models agree on it, eliminating 99% of false positives.
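
The agreement rule can be sketched as a simple "all models must vote anomalous" check. The scores, thresholds, and the function name below are hypothetical; the snippet only illustrates the consensus idea, not how Netdata stores or evaluates its models.

```python
def consensus_anomaly(scores, thresholds):
    """Flag an anomaly only when every model's score exceeds its own threshold."""
    if len(scores) != len(thresholds):
        raise ValueError("need one threshold per model")
    if not scores:
        return False  # no models, no verdict
    return all(s > t for s, t in zip(scores, thresholds))


# Hypothetical thresholds for three models trained on different time windows.
thresholds = [5.0, 5.5, 4.8]

# Unusual only for the first model: the other models have seen this before.
partial_agreement = consensus_anomaly([9.1, 3.2, 2.7], thresholds)

# Unusual for every model: flagged as an anomaly.
full_agreement = consensus_anomaly([9.1, 8.4, 7.9], thresholds)
```

A value that looks odd to one model but familiar to another, older one is treated as normal, which is how the extra models suppress false positives.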

Anomaly Bit

Each trained model assigns an anomaly score at every time step based on how far the data deviates from the learned clusters. If the score exceeds the 99th percentile of the scores seen on the training data, the anomaly bit is set to true (100); otherwise, it remains false (0).

Key benefits:

  • No additional storage overhead since the anomaly bit is embedded in Netdata’s floating point number format.
  • The query engine automatically computes anomaly rates without requiring extra queries.
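
A minimal sketch of the thresholding step, assuming a nearest-rank percentile over the training scores. The percentile method, the sample scores, and the function names are illustrative assumptions; only the 99th-percentile cutoff and the 100/0 encoding come from the description above.

```python
import math


def percentile_threshold(training_scores, percentile=99.0):
    """Nearest-rank percentile of the training scores."""
    ordered = sorted(training_scores)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def anomaly_bit(score, threshold):
    """Return 100 when the score exceeds the threshold, otherwise 0."""
    return 100 if score > threshold else 0


# Hypothetical training scores: 0.1, 0.2, ..., 100.0
training_scores = [i / 10 for i in range(1, 1001)]
threshold = percentile_threshold(training_scores)

high_bit = anomaly_bit(99.5, threshold)  # above the cutoff
low_bit = anomaly_bit(50.0, threshold)   # well within the training range
```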

Anomaly Rate

Netdata calculates Node Anomaly Rate (NAR) and Dimension Anomaly Rate (DAR) based on anomaly bits. Here’s an example matrix:

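| Time | d1 | d2 | d3 | d4 | d5 | NAR |
|---|---|---|---|---|---|---|
| t1 | 0 | 0 | 0 | 0 | 0 | 0% |
| t2 | 0 | 0 | 0 | 0 | 100 | 20% |
| t3 | 0 | 0 | 0 | 0 | 0 | 0% |
| t4 | 0 | 100 | 0 | 0 | 0 | 20% |
| t5 | 100 | 0 | 0 | 0 | 0 | 20% |
| t6 | 0 | 100 | 100 | 0 | 100 | 60% |
| t7 | 0 | 100 | 0 | 100 | 0 | 40% |
| t8 | 0 | 0 | 0 | 0 | 100 | 20% |
| t9 | 0 | 0 | 100 | 100 | 0 | 40% |
| t10 | 0 | 0 | 0 | 0 | 0 | 0% |
| DAR | 10% | 30% | 20% | 20% | 30% | NAR_t1-10 = 22% |
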
  • DAR (Dimension Anomaly Rate): The average of one dimension's anomaly bits over a time window (a column of the matrix).
  • NAR (Node Anomaly Rate): The average of the anomaly bits across all dimensions at a given time step (a row of the matrix).
  • Overall anomaly rate: The average over every dimension and time step in the window; in the example, NAR_t1-10 = 22%.
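
The figures in the example matrix can be reproduced with plain averages. The sketch below hard-codes the anomaly bits from the matrix; the variable names are chosen for illustration only.

```python
# Anomaly bits from the example matrix: rows are t1..t10, columns are d1..d5.
bits = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 100],
    [0, 0, 0, 0, 0],
    [0, 100, 0, 0, 0],
    [100, 0, 0, 0, 0],
    [0, 100, 100, 0, 100],
    [0, 100, 0, 100, 0],
    [0, 0, 0, 0, 100],
    [0, 0, 100, 100, 0],
    [0, 0, 0, 0, 0],
]


def mean(values):
    return sum(values) / len(values)


# NAR: average across dimensions, one value per time step (row).
nar_per_step = [mean(row) for row in bits]

# DAR: average across time, one value per dimension (column).
dar_per_dimension = [mean(col) for col in zip(*bits)]

# Overall rate: average over every cell in the matrix.
overall_rate = mean([bit for row in bits for bit in row])

print(nar_per_step)       # [0.0, 20.0, 0.0, 20.0, 20.0, 60.0, 40.0, 20.0, 40.0, 0.0]
print(dar_per_dimension)  # [10.0, 30.0, 20.0, 20.0, 30.0]
print(overall_rate)       # 22.0
```

Because the bits are stored as 0 or 100, a plain mean already reads as a percentage.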

Node-Level Anomaly Detection

Netdata tracks the percentage of anomaly bits over time. When the Node Anomaly Rate (NAR) exceeds a set threshold and remains high for a period, a node anomaly event is triggered. These events are recorded in the new_anomaly_event dimension on the anomaly_detection.anomaly_detection chart.
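
One way to picture the "exceeds a threshold and remains high" rule is a run-length check over the NAR series. The 30% threshold, the two-step minimum duration, and the function name are hypothetical values picked for this sketch; Netdata's actual defaults and windowing logic may differ.

```python
def detect_anomaly_events(nar_series, threshold=30.0, min_duration=2):
    """Return (start, end) index pairs where NAR stays above `threshold`
    for at least `min_duration` consecutive steps."""
    events = []
    run_start = None
    for i, nar in enumerate(nar_series):
        if nar > threshold:
            if run_start is None:
                run_start = i  # a new run of high values begins
        else:
            if run_start is not None and i - run_start >= min_duration:
                events.append((run_start, i - 1))
            run_start = None
    # Close a run that is still open at the end of the series.
    if run_start is not None and len(nar_series) - run_start >= min_duration:
        events.append((run_start, len(nar_series) - 1))
    return events


# NAR values (in percent) for ten consecutive time steps.
nar_series = [0, 20, 0, 20, 20, 60, 40, 20, 40, 0]
events = detect_anomaly_events(nar_series)
print(events)  # [(5, 6)]
```

The single high value at index 8 does not count as an event because it does not stay above the threshold long enough.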


Viewing Anomaly Data in Netdata

Once ML is enabled, Netdata provides an Anomaly Detection menu with key charts:

  • anomaly_detection.dimensions: Number of dimensions flagged as anomalous.
  • anomaly_detection.anomaly_rate: Percentage of anomalous dimensions.
  • anomaly_detection.anomaly_detection: Flags (0 or 1) indicating when an anomaly event occurs.

These insights help you quickly assess potential issues and take action before they escalate.


Summary

Netdata’s machine learning models provide reliable, real-time anomaly detection with minimal false positives. By embedding ML within existing observability workflows, Netdata enhances troubleshooting and ensures proactive monitoring without unnecessary alerts.

