Machine Learning and Anomaly Detection

Overview

You can use machine learning to detect patterns and anomalies in large datasets, so you can identify issues before they escalate.

Netdata offers Anomaly Advisor, a tool designed to improve your troubleshooting experience, reduce mean time to resolution, and prevent issues from escalating. You can access it through the Netdata dashboard.

Tip: To configure ML on your nodes, check the ML configuration documentation.

Design Principles

When you use Netdata's machine learning models, you benefit from these key principles:

| Principle | Description |
| --- | --- |
| Unsupervised Learning | Models operate independently without requiring your input |
| Real-time Performance | While ML impacts CPU usage, you won't experience any compromise to Netdata's high-fidelity, real-time monitoring |
| Seamless Integration | ML-based insights are fully embedded into your existing Netdata infrastructure monitoring and troubleshooting workflow |
| Assistance Over Alerts | ML helps you investigate potential issues rather than triggering unnecessary alerts - no 3 AM wake-ups for minor anomalies |
| Many Light Models | Netdata uses many lightweight models instead of a few heavy ones, optimizing for resource usage while maintaining accuracy |
| Scalable Architecture | The system is designed to handle thousands of metrics simultaneously, scoring each one every second with minimal latency |

Note: Netdata deliberately avoids using deep learning models, as they would introduce heavy dependencies and resource requirements that wouldn't align with Netdata's goal of running efficiently on any Linux system. Instead, the implementation uses the lightweight dlib library and spreads training costs over a wide window to minimize performance impact.

Types of Anomalies You Can Detect

| Anomaly Type | Description | Business Impact |
| --- | --- | --- |
| Point Anomalies | Unusually high or low values compared to historical data | Early warning of service degradation |
| Contextual Anomalies | Sequences of values that deviate from expected patterns | Identification of unusual usage patterns |
| Collective Anomalies | Multivariate anomalies where a combination of metrics appears off | Detection of complex system issues |
| Concept Drifts | Gradual shifts leading to a new baseline | Recognition of evolving system behavior |
| Change Points | Sudden shifts resulting in a new normal state | Identification of system changes |

How Netdata ML Works

Training & Detection

When you enable ML, Netdata trains an unsupervised model for each of your metrics. By default, this model is a k-means clustering algorithm (with k=2) trained on the last 4 hours of your data. Instead of just analyzing raw values, the model works with preprocessed feature vectors to improve your detection accuracy.

Important: To reduce false positives in your environment, Netdata trains multiple models per time-series, covering over two days of data. An anomaly is flagged only if all models agree on it, eliminating 99% of false positives. This approach of requiring consensus across models trained on different time scales makes the system highly resistant to spurious anomalies while still being sensitive to real issues.

The anomaly detection algorithm uses the Euclidean distance between recent metric patterns and the learned cluster centers. If this distance exceeds a threshold based on the 99th percentile of training data, that model considers the metric anomalous.
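The training and scoring steps described above can be sketched in plain Python. This is a minimal illustration, not Netdata's implementation: the lag count of 5, the nearest-rank percentile, the synthetic training data, and all function names are assumptions, and the sketch builds feature vectors by simple lagging only, since the exact preprocessing is not specified here.

```python
import math
import random


def lagged_vectors(values, lags=5):
    """Turn a series into overlapping windows of `lags + 1` consecutive values."""
    width = lags + 1
    return [tuple(values[i:i + width]) for i in range(len(values) - width + 1)]


def kmeans_2(vectors, iterations=20, seed=0):
    """Plain Lloyd's algorithm with k=2; returns the two cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, 2)
    for _ in range(iterations):
        groups = ([], [])
        for v in vectors:
            nearest = min((0, 1), key=lambda c: math.dist(v, centers[c]))
            groups[nearest].append(v)
        # Keep the previous center if a cluster ends up empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def score(vector, centers):
    """Anomaly score: Euclidean distance to the nearest cluster center."""
    return min(math.dist(vector, c) for c in centers)


def percentile_99(scores):
    """Nearest-rank 99th percentile of the training scores."""
    ordered = sorted(scores)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


# Synthetic "normal" training data: a metric hovering around 50 (hypothetical).
data_rng = random.Random(42)
training = [50 + data_rng.uniform(-2, 2) for _ in range(600)]
train_vectors = lagged_vectors(training)
centers = kmeans_2(train_vectors)
threshold = percentile_99([score(v, centers) for v in train_vectors])

normal = (50.0, 50.0, 50.0, 50.0, 50.0, 50.0)
spike = (50.0, 50.0, 50.0, 95.0, 96.0, 97.0)
print(score(normal, centers) > threshold)  # False: close to a learned center
print(score(spike, centers) > threshold)   # True: far from both centers
```

Because the threshold comes from the training scores themselves, roughly 1% of training vectors sit above it by construction, which is one reason a single model on its own would be noisy.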

Anomaly Bit

Each trained model assigns an anomaly score at every time step based on how far your data deviates from learned clusters. If the score exceeds the 99th percentile of training data, the anomaly bit is set to true (100); otherwise, it remains false (0).
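Combining the per-model threshold check with the consensus rule, the anomaly bit for one time step could be derived roughly as follows. The function name and the example scores and thresholds are hypothetical; the sketch only illustrates the "all models must agree" rule and the 0/100 encoding.

```python
def anomaly_bit(scores, thresholds):
    """Return 100 if every model's score exceeds its own threshold, else 0.

    `scores[i]` is the anomaly score from model i and `thresholds[i]` is that
    model's 99th-percentile threshold (both hypothetical inputs).
    """
    if len(scores) != len(thresholds):
        raise ValueError("one threshold per model is required")
    if not scores:
        return 0  # no trained models yet, so nothing can be flagged
    all_agree = all(s > t for s, t in zip(scores, thresholds))
    return 100 if all_agree else 0


# Three models trained on different windows; all see the value as unusual.
print(anomaly_bit([5.0, 6.0, 7.0], [4.0, 4.5, 5.0]))  # 100
# One model considers the value normal, so the bit stays unset.
print(anomaly_bit([5.0, 3.0, 7.0], [4.0, 4.5, 5.0]))  # 0
```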

Key benefits you'll experience:

  • No additional storage overhead since the anomaly bit is embedded in Netdata's floating point number format
  • The query engine automatically computes anomaly rates without requiring extra queries

Note: The anomaly bit is literally a single bit in Netdata's internal storage representation. For every metric collected, Netdata can therefore also track whether the value is anomalous without increasing storage requirements.

You can access the anomaly bits through Netdata's API by adding the options=anomaly-bit parameter to your query. For example:

https://your-node/api/v1/data?chart=system.cpu&dimensions=user&after=-10&options=anomaly-bit

This would return anomaly bits for the last 10 seconds of CPU user data, with values of either 0 (normal) or 100 (anomalous).
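A short Python sketch of the same query, using only the standard library. The host `https://your-node` is the placeholder from the example URL, the helper names are hypothetical, and the response layout (a `labels` list plus a `data` list of `[time, value]` rows) is an assumption about the default JSON format that you should verify against your own node.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def anomaly_bit_url(base_url, chart, dimension, seconds):
    """Build a /api/v1/data URL that requests anomaly bits for one dimension."""
    query = urlencode({
        "chart": chart,
        "dimensions": dimension,
        "after": -seconds,          # negative value: relative to now
        "options": "anomaly-bit",
    })
    return f"{base_url}/api/v1/data?{query}"


def fetch(url, timeout=5):
    """Download and decode the JSON response (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def anomalous_timestamps(payload):
    """Return timestamps whose anomaly bit is 100.

    Assumes rows shaped like [time, value] under the "data" key.
    """
    return [row[0] for row in payload["data"] if row[1] == 100]


url = anomaly_bit_url("https://your-node", "system.cpu", "user", 10)
print(url)

# Hypothetical payload, used here instead of a live request.
sample_payload = {
    "labels": ["time", "user"],
    "data": [[1700000003, 0], [1700000002, 100], [1700000001, 0]],
}
print(anomalous_timestamps(sample_payload))  # [1700000002]
```

Against a real node, `anomalous_timestamps(fetch(url))` would replace the sample payload.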

Anomaly Rate

You can see Node Anomaly Rate (NAR) and Dimension Anomaly Rate (DAR) calculated based on anomaly bits. Here's an example matrix:

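| Time | d1 | d2 | d3 | d4 | d5 | NAR |
| --- | --- | --- | --- | --- | --- | --- |
| t1 | 0 | 0 | 0 | 0 | 0 | 0% |
| t2 | 0 | 0 | 0 | 0 | 100 | 20% |
| t3 | 0 | 0 | 0 | 0 | 0 | 0% |
| t4 | 0 | 100 | 0 | 0 | 0 | 20% |
| t5 | 100 | 0 | 0 | 0 | 0 | 20% |
| t6 | 0 | 100 | 100 | 0 | 100 | 60% |
| t7 | 0 | 100 | 0 | 100 | 0 | 40% |
| t8 | 0 | 0 | 0 | 0 | 100 | 20% |
| t9 | 0 | 0 | 100 | 100 | 0 | 40% |
| t10 | 0 | 0 | 0 | 0 | 0 | 0% |
| DAR | 10% | 30% | 20% | 20% | 30% | NAR_t1-10 = 22% |

  • DAR (Dimension Anomaly Rate): Average anomalies for a specific metric over time
  • NAR (Node Anomaly Rate): Average anomalies across all metrics at a given time
  • Overall anomaly rate: Computed across your entire dataset for deeper insights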
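The rates in the example matrix can be reproduced with a few lines of Python. Because each anomaly bit is stored as 0 or 100, a plain average already yields a percentage. The function names are illustrative only; in practice the query engine computes these rates for you.

```python
# Rows are time steps t1..t10, columns are dimensions d1..d5.
bits = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 100],
    [0, 0, 0, 0, 0],
    [0, 100, 0, 0, 0],
    [100, 0, 0, 0, 0],
    [0, 100, 100, 0, 100],
    [0, 100, 0, 100, 0],
    [0, 0, 0, 0, 100],
    [0, 0, 100, 100, 0],
    [0, 0, 0, 0, 0],
]


def node_anomaly_rates(matrix):
    """NAR per time step: average of the bits across all dimensions."""
    return [sum(row) / len(row) for row in matrix]


def dimension_anomaly_rates(matrix):
    """DAR per dimension: average of that dimension's bits over time."""
    return [sum(col) / len(col) for col in zip(*matrix)]


def overall_anomaly_rate(matrix):
    """Average of every bit in the window."""
    cells = [bit for row in matrix for bit in row]
    return sum(cells) / len(cells)


print(node_anomaly_rates(bits))
# [0.0, 20.0, 0.0, 20.0, 20.0, 60.0, 40.0, 20.0, 40.0, 0.0]
print(dimension_anomaly_rates(bits))  # [10.0, 30.0, 20.0, 20.0, 30.0]
print(overall_anomaly_rate(bits))     # 22.0
```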

Node-Level Anomaly Detection

Netdata tracks the percentage of anomaly bits over time for you. When the Node Anomaly Rate (NAR) exceeds a set threshold and remains high for a period, a node anomaly event is triggered. These events are recorded in the new_anomaly_event dimension on the anomaly_detection.anomaly_detection chart.
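One way to picture the "exceeds a threshold and remains high" rule is a run-length check over the NAR series. The sketch below is an assumption-laden illustration: the threshold of 10%, the minimum duration of 3 samples, and the function name are hypothetical, and Netdata's actual event logic may differ.

```python
def anomaly_events(nar_series, threshold, min_duration):
    """Return (start, end) index pairs where NAR stays above `threshold`
    for at least `min_duration` consecutive samples."""
    events = []
    start = None
    for i, rate in enumerate(nar_series):
        if rate > threshold:
            if start is None:
                start = i  # a new run above the threshold begins
        else:
            if start is not None and i - start >= min_duration:
                events.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(nar_series) - start >= min_duration:
        events.append((start, len(nar_series) - 1))
    return events


# NAR values (in percent) for ten consecutive time steps.
nar = [0, 20, 0, 20, 20, 60, 40, 20, 40, 0]
print(anomaly_events(nar, threshold=10, min_duration=3))  # [(3, 8)]
```

The single spike at index 1 is ignored because it does not persist, while the sustained run from index 3 to 8 counts as one event.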

Viewing Anomaly Data in Your Netdata Dashboard

Once you enable ML, you'll have access to an Anomaly Detection menu with key charts:

  • anomaly_detection.dimensions: Number of dimensions flagged as anomalous
  • anomaly_detection.anomaly_rate: Percentage of anomalous dimensions
  • anomaly_detection.anomaly_detection: Flags (0 or 1) indicating when an anomaly event occurs

These insights help you quickly assess potential issues and take action before they escalate.

Summary

With Netdata ML, you get reliable, real-time anomaly detection with minimal false positives. By incorporating ML within your existing observability workflows, you can enhance troubleshooting and ensure proactive monitoring without unnecessary alerts.
