Machine Learning Anomaly Detection

Netdata uses k-means clustering to detect anomalies for each collected metric automatically.

The system maintains 18 models per metric, each trained on a 6-hour window at 3-hour intervals, providing approximately 54 hours of rolling behavioral patterns. Anomaly detection occurs in real time during data collection: a data point is flagged as anomalous only when all 18 models agree, which filters out most noise while maintaining sensitivity to genuine issues.

Anomaly bits are stored alongside metric data in the time-series database, with the same retention period. The query engine calculates anomaly rates dynamically during data aggregation, exposing anomaly information on every chart without additional overhead.

A dedicated process correlates anomalies across all metrics within each node, generating real-time node-level anomaly charts. This correlation data feeds into Netdata's scoring engine - a specialized query system that can evaluate thousands of metrics simultaneously and return an ordered list ranked by anomaly severity, powering the Anomaly Advisor for rapid root cause analysis.

System Characteristics

Aspect	Implementation	Benefit
Algorithm	Unsupervised k-means clustering (k=2) via dlib	No manual training or labeled data required
Model Architecture	Rolling 18 models per metric, 3-hour staggered training	Eliminates 99% of false positives through consensus
Processing Location	Edge computation on each Netdata Agent	No cloud dependency, no data egress
Resource Usage	~18KB RAM per metric, 2-4% of a single CPU for 10k metrics	Predictable linear scaling
Configuration	Zero-configuration with automatic adaptation	Works instantly on any metric type
Detection Latency	Real-time during data collection	Anomalies flagged within 1 second
Historical Storage	Anomaly bit embedded in metric storage	No additional storage overhead
Query Performance	On-the-fly anomaly rate calculation	No pre-aggregation needed
Time-series Integrity	Immutable anomaly history	No hindsight bias — shows what was detectable then
Coverage	Every metric, every dimension	No sampling, no blind spots
Correlation Engine	Real-time anomaly correlation across metrics	Powers Anomaly Advisor for root cause analysis
Alert Philosophy	Primarily an investigation aid; anomaly bits and anomaly rate can also drive health alerts	Reduces alert fatigue while enabling anomaly-based alerting

Note: Netdata avoids deep learning models to maintain lightweight operation on any Linux system. The entire ML system is designed to run efficiently without specialized hardware or dependencies.

Types of Anomalies Detected

Anomaly Type	Description	Business Impact
Point Anomalies	Unusually high or low values compared to historical data	Early warning of service degradation
Contextual Anomalies	Sequences of values that deviate from expected patterns	Identification of unusual usage patterns
Collective Anomalies	Multivariate anomalies where a combination of metrics appears off	Detection of complex system issues
Concept Drifts	Gradual shifts leading to a new baseline	Recognition of evolving system behavior
Change Points	Sudden shifts resulting in a new normal state	Identification of system changes

Technical Deep Dive: How Netdata ML Works

Training & Detection Process

When you enable ML, Netdata trains an unsupervised model for each of your metrics. By default, this model is a k-means clustering algorithm (with k=2) trained on the last 6 hours of your data. Instead of just analyzing raw values, the model works with preprocessed feature vectors to improve your detection accuracy.

Important: To reduce false positives in your environment, Netdata trains multiple models per time-series, covering over two days of data. An anomaly is flagged only if all models agree on it, eliminating 99% of false positives. This approach of requiring consensus across models trained on different time scales makes the system highly resistant to spurious anomalies while still being sensitive to real issues.
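The consensus rule can be sketched in a few lines of Python. This is only an illustration of the "all models must agree" idea described above; the function name and the handling of the no-models-yet case are assumptions, not Netdata's implementation.

```python
from typing import Sequence


def consensus_anomaly(model_votes: Sequence[bool]) -> bool:
    """Return True only when every trained model flags the sample."""
    # Assumption: with no trained models there is nothing to agree on,
    # so the sample is treated as normal.
    return len(model_votes) > 0 and all(model_votes)


# 18 models, one of which still considers the sample normal.
print(consensus_anomaly([True] * 17 + [False]))  # False
# All 18 models agree, so the sample is flagged.
print(consensus_anomaly([True] * 18))            # True
```

A single dissenting model is enough to keep the sample marked as normal, which is why short-lived noise rarely survives the vote.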

The anomaly detection algorithm uses the Euclidean distance between recent metric patterns and the learned cluster centers. If this distance exceeds a threshold based on the 99th percentile of training data, that model considers the metric anomalous.
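As a rough illustration of this scoring step, the sketch below fits a tiny k-means model (k=2) in pure Python, takes the 99th percentile of training distances as the threshold, and flags any point whose distance to the nearest cluster center exceeds it. It is a simplified stand-in: Netdata uses dlib and preprocessed feature vectors, while the helper names, the 2-D sample data, and the nearest-rank percentile here are assumptions made for the example.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_center_distance(point: Vector, centers: Sequence[Vector]) -> float:
    """Euclidean distance from a point to its closest cluster center."""
    return min(math.dist(point, center) for center in centers)


def fit_kmeans(points: Sequence[Vector], k: int = 2,
               iterations: int = 20, seed: int = 0) -> List[List[float]]:
    """Plain Lloyd's algorithm; initial centers are sampled from the data."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), k)]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster empties
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def percentile_threshold(distances: Sequence[float], pct: float = 99.0) -> float:
    """Nearest-rank percentile of the training distances."""
    ordered = sorted(distances)
    rank = max(0, math.ceil(pct * len(ordered) / 100) - 1)
    return ordered[rank]


def train(points: Sequence[Vector]) -> Tuple[List[List[float]], float]:
    centers = fit_kmeans(points)
    distances = [nearest_center_distance(p, centers) for p in points]
    return centers, percentile_threshold(distances)


def is_anomalous(point: Vector, centers: Sequence[Vector], threshold: float) -> bool:
    return nearest_center_distance(point, centers) > threshold


# Synthetic training data: two behavioral modes around (0, 0) and (10, 10).
rng = random.Random(42)
training = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
training += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)]

centers, threshold = train(training)
print(is_anomalous((1000.0, 1000.0), centers, threshold))  # True
```

By construction, roughly 1% of the training points themselves sit above the threshold, which is one reason a single model is noisy and the multi-model consensus matters.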

The Anomaly Bit

Each trained model assigns an anomaly score at every time step based on how far your data deviates from its learned clusters. A model votes "anomalous" when the score exceeds the 99th percentile of its training data. When all models agree, the anomaly bit is set to true (100); otherwise, it remains false (0).

Key benefits you'll experience:

No additional storage overhead since the anomaly bit is embedded in Netdata's floating point number format
The query engine automatically computes anomaly rates without requiring extra queries

Note: The anomaly bit is quite literally a bit in Netdata's internal storage representation. This ingenious design means that for every metric collected, Netdata can also track whether it's anomalous without increasing storage requirements.
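To make the idea of a flag sharing a storage word with the value concrete, here is a minimal sketch that reserves one bit of a 32-bit word for the anomaly flag. The layout is hypothetical (the flag position and the 31-bit payload are assumptions chosen for illustration) and does not describe Netdata's actual number format.

```python
ANOMALY_FLAG = 1 << 31        # hypothetical flag position in a 32-bit word
PAYLOAD_MASK = ANOMALY_FLAG - 1


def pack(value_bits: int, anomalous: bool) -> int:
    """Store a 31-bit payload and the anomaly flag in one 32-bit word."""
    if not 0 <= value_bits <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 31 bits")
    return value_bits | (ANOMALY_FLAG if anomalous else 0)


def unpack(word: int) -> tuple:
    """Recover (payload, anomalous) from a packed word."""
    return word & PAYLOAD_MASK, bool(word & ANOMALY_FLAG)


word = pack(12345, anomalous=True)
print(unpack(word))  # (12345, True)
```

Because the flag travels inside the same word as the sample, reading the sample also reads the flag, which matches the "no extra storage, no extra query" benefits listed above.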

You can access the anomaly bits through Netdata's API by adding the options=anomaly-bit parameter to your query. For example:

https://your-node/api/v3/data?chart=system.cpu&dimensions=user&after=-10&options=anomaly-bit

This would return anomaly bits for the last 10 seconds of CPU user data, with values of either 0 (normal) or 100 (anomalous).
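If you prefer to script this, the query above can be assembled with Python's standard library. The helper names below are made up for the example, the host name is the same placeholder used above, and the sketch assumes the endpoint returns JSON; check the response format on your own node before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def anomaly_bit_url(base: str, chart: str, dimension: str, last_seconds: int) -> str:
    """Build a data query that asks for anomaly bits instead of raw values."""
    params = {
        "chart": chart,
        "dimensions": dimension,
        "after": -last_seconds,      # negative value means "relative to now"
        "options": "anomaly-bit",
    }
    return f"{base}/api/v3/data?{urlencode(params)}"


def fetch_anomaly_bits(url: str, timeout: float = 5.0):
    """Fetch and decode the response (assumes a JSON body)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = anomaly_bit_url("https://your-node", "system.cpu", "user", 10)
print(url)
# https://your-node/api/v3/data?chart=system.cpu&dimensions=user&after=-10&options=anomaly-bit
```

Calling fetch_anomaly_bits(url) against a reachable agent would then return the decoded response for further processing.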

Anomaly Rate Calculations

You can see Node Anomaly Rate (NAR) and Dimension Anomaly Rate (DAR) calculated based on anomaly bits. Here's an example matrix:

Time	d1	d2	d3	d4	d5	NAR
t1	0	0	0	0	0	0%
t2	0	0	0	0	100	20%
t3	0	0	0	0	0	0%
t4	0	100	0	0	0	20%
t5	100	0	0	0	0	20%
t6	0	100	100	0	100	60%
t7	0	100	0	100	0	40%
t8	0	0	0	0	100	20%
t9	0	0	100	100	0	40%
t10	0	0	0	0	0	0%
DAR	10%	30%	20%	20%	30%	NAR (t1-t10) = 22%

DAR (Dimension Anomaly Rate): Average anomalies for a specific metric over time
NAR (Node Anomaly Rate): Average anomalies across all metrics at a given time
Overall anomaly rate: Computed across your entire dataset for deeper insights
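The figures in the matrix can be reproduced with a short calculation. The sketch below uses the same ten time steps and five dimensions as the table; the variable names are chosen for this example only.

```python
# Rows are time steps t1..t10, columns are dimensions d1..d5.
bits = [
    [0,   0,   0,   0,   0],
    [0,   0,   0,   0,   100],
    [0,   0,   0,   0,   0],
    [0,   100, 0,   0,   0],
    [100, 0,   0,   0,   0],
    [0,   100, 100, 0,   100],
    [0,   100, 0,   100, 0],
    [0,   0,   0,   0,   100],
    [0,   0,   100, 100, 0],
    [0,   0,   0,   0,   0],
]

# NAR: average across dimensions at each time step (one value per row).
nar = [sum(row) / len(row) for row in bits]

# DAR: average over time for each dimension (one value per column).
dar = [sum(col) / len(col) for col in zip(*bits)]

# Overall rate for the whole window.
overall = sum(nar) / len(nar)

print(nar)      # [0.0, 20.0, 0.0, 20.0, 20.0, 60.0, 40.0, 20.0, 40.0, 0.0]
print(dar)      # [10.0, 30.0, 20.0, 20.0, 30.0]
print(overall)  # 22.0
```

Averaging the DAR row gives the same 22%, since both are means over the same 50 anomaly bits.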

Node-Level Anomaly Detection

Netdata tracks the percentage of anomaly bits over time for you. When the Node Anomaly Rate (NAR) exceeds a set threshold and remains high for a period, a node anomaly event is triggered. These events are recorded in the new_anomaly_event dimension on the anomaly_detection.anomaly_detection chart.
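One way to picture "exceeds a threshold and remains high for a period" is a run-length check over the NAR series. The sketch below is a simplified model of that rule; the threshold, the number of consecutive samples, and the function name are hypothetical values for illustration, and Netdata's actual event logic may differ.

```python
from typing import List, Sequence


def node_anomaly_events(nar_series: Sequence[float],
                        threshold: float,
                        min_consecutive: int) -> List[int]:
    """Return 1 for each sample where NAR has stayed above the threshold
    for at least `min_consecutive` samples in a row, else 0."""
    flags = []
    run = 0
    for rate in nar_series:
        run = run + 1 if rate > threshold else 0
        flags.append(1 if run >= min_consecutive else 0)
    return flags


# Hypothetical settings: 1% threshold, sustained for 3 samples.
series = [0.0, 2.0, 3.0, 4.0, 0.0, 5.0]
print(node_anomaly_events(series, threshold=1.0, min_consecutive=3))
# [0, 0, 0, 1, 0, 0]
```

The brief spike at the end does not raise an event because it has not yet persisted long enough.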

Available Documentation

ML Configuration - Configuration and tuning guide
Metric Correlations - Finding related metrics during incidents

Viewing Anomaly Data in Your Netdata Dashboard

Once you enable ML, you'll have access to an Anomaly Detection menu with key charts:

anomaly_detection.dimensions: Number of dimensions flagged as anomalous
anomaly_detection.anomaly_rate: Percentage of anomalous dimensions
anomaly_detection.anomaly_detection: Flags (0 or 1) indicating when an anomaly event occurs

These insights help you quickly assess potential issues and take action before they escalate.

Operational Details

Why 18 Models?

The number 18 balances three competing requirements:

Incremental learning efficiency - Training 54 hours of data every 3 hours would waste computational resources. Instead, each model trains on just 6 hours of data, with only one new model created every 3 hours.
Adaptive memory duration - When an anomaly occurs, the newest model will learn it as "normal" within 3 hours. The system gradually "forgets" this pattern as older models are replaced. With 18 models at 3-hour intervals, complete forgetting takes 54 hours (2.25 days).
Consensus noise reduction - Multiple models voting together eliminate random fluctuations. 18 models provide strong consensus without excessive memory use.

This creates a sliding window memory: recent anomalies become "normal" quickly (within 3 hours for the newest model), while the full consensus takes 54 hours to completely forget an anomalous pattern. This balance prevents both alert fatigue from repeated anomalies and blindness to recurring issues.
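The 54-hour figure follows from the rotation itself: a model leaves the set once 18 newer models have been trained, one every 3 hours. The sketch below simulates that with a fixed-size queue; it models only the rotation schedule, not the training, and the function name is an assumption for this example.

```python
from collections import deque


def hours_in_rotation(models: int = 18, interval_hours: int = 3) -> int:
    """Simulate how long a model trained at hour 0 stays in the rolling set."""
    window = deque(maxlen=models)   # oldest model is dropped automatically
    hour = 0
    window.append(hour)             # the model of interest, trained at hour 0
    while 0 in window:
        hour += interval_hours
        window.append(hour)         # a new model is trained every interval
    return hour


print(hours_in_rotation())  # 54
```

Changing either the model count or the training interval changes the memory span proportionally, which is the trade-off the three requirements above are balancing.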

How Netdata Minimizes Training CPU Impact

ML typically doubles the agent's CPU usage - from ~2% to ~4% of a single core. This efficiency comes from several optimizations:

Smart metric filtering - Metrics with constant or fixed values are automatically excluded from training, eliminating wasted computation on unchanging data.
Incremental training windows - Each model trains on only 6 hours of data instead of the full 54-hour history, reducing computational requirements by ~90%.
Even training distribution - The agent dynamically throttles model training to spread the work evenly across each 3-hour window, preventing CPU spikes. With 10,000 metrics, this means training ~1 model per second instead of training 10,000 models in a burst.
Distributed intelligence - Child agents stream both trained models and anomaly bits to parent agents along with metric data. Parents receive pre-computed ML results, requiring zero additional ML computation for aggregated views.

This design ensures ML remains lightweight enough to run on production systems without impacting primary workloads.
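The two numbers quoted in the list above can be checked with simple arithmetic. The sketch assumes one new model per metric in every 3-hour window and compares a 6-hour training window against the full 54-hour history; the function names are illustrative.

```python
def models_per_second(metrics: int, interval_hours: float = 3.0) -> float:
    """Average training rate when work is spread evenly over the interval."""
    return metrics / (interval_hours * 3600)


def training_reduction(window_hours: float = 6.0, history_hours: float = 54.0) -> float:
    """Fraction of training data avoided by using the short window."""
    return 1.0 - window_hours / history_hours


print(round(models_per_second(10_000), 2))  # 0.93
print(round(training_reduction(), 2))       # 0.89
```

Both results line up with the "~1 model per second" and "~90%" figures in the text.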

Dynamic prioritization: ML automatically throttles or even pauses training during:

Heavy query load - ensuring dashboards remain responsive
Parent-child reconnections - prioritizing metric replication
Any resource contention - backing off to protect core monitoring

Under these conditions, ML can pause training of new models entirely to ensure:

User queries remain fast and responsive
Metric streaming completes quickly after network interruptions
Overall CPU and I/O consumption stays within bounds

This means ML is truly a background process - it uses spare cycles but immediately yields resources when needed for operational tasks.

Storage Impact

ML has zero storage overhead in the time-series database. The anomaly bit uses a previously unused bit in the existing sample storage format - no schema changes or storage expansion required.

The only storage impact comes from persisting trained models to disk for survival across restarts:

Model files are small compared to the time-series data
Negligible impact on overall storage requirements
Models are retained only for active metrics

This means you can enable ML without provisioning additional storage capacity. Anomaly history is retained for the same period as your metrics, with no extra space required.

Query performance impact: None. The anomaly bit is loaded together with metric data in a single disk read - no additional I/O operations required. Querying metrics with anomaly data has the same disk I/O pattern as querying metrics without ML.

Cold Start Behavior

On a freshly installed agent, ML begins detecting anomalies within 10 minutes. However, early detection has important characteristics:

Timeline:

0-10 minutes: Collecting initial data, no anomaly detection
10+ minutes: First models trained, anomaly detection begins with high sensitivity
3 hours: First model rotation, improved accuracy
54 hours: Full model set established, optimal detection accuracy

What to expect:

Initial hours show more anomalies due to limited training data
False positive rate decreases as models accumulate more behavioral patterns
Each 3-hour cycle improves detection quality
After 2-3 days, the system reaches steady-state accuracy

Operational tip: During the first 48 hours after deployment, expect elevated anomaly rates. This is normal as the system learns your infrastructure's patterns. Use this period to observe ML behavior but avoid making critical decisions based solely on early anomaly detection.

Creating Anomaly-Based Health Alerts

You can create health alerts that trigger based on anomaly rates instead of raw metric values by using the anomaly-bit option in your alert's lookup line. Internally, anomaly bits mark samples as anomalous or normal, and the query/health pipeline exposes this as an anomaly-rate percentage in the 0-100 range. For aggregated or tiered data, returned values can be intermediate percentages rather than only 0 or 100, so averaging over a time window gives you the anomaly rate as a percentage.

Anomaly-rate alert

The following template triggers when the anomaly rate on system.cpu exceeds the defined thresholds:

 template: ml_5min_cpu_chart
       on: system.cpu
   lookup: average -5m anomaly-bit of *
     calc: $this
    units: %
    every: 30s
     warn: $this > (($status >= $WARNING)  ? (5) : (20))
     crit: $this >= (($status == $CRITICAL) ? (20) : (100))
     info: rolling 5min anomaly rate for system.cpu chart
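The warn and crit expressions in this template use different thresholds for raising and for clearing an alert, which keeps it from flapping when the anomaly rate hovers near a limit. The Python sketch below mirrors the warn line only, as one reading of that expression; the function and parameter names are made up for the example.

```python
from typing import List, Sequence


def next_warning_state(rate: float, currently_raised: bool,
                       raise_above: float = 20.0,
                       clear_at_or_below: float = 5.0) -> bool:
    """Mirror of: $this > (($status >= $WARNING) ? (5) : (20))."""
    # Once raised (warning or critical), the lower threshold applies,
    # so the alert clears only after the rate drops to 5% or less.
    threshold = clear_at_or_below if currently_raised else raise_above
    return rate > threshold


def simulate(rates: Sequence[float]) -> List[bool]:
    states = []
    raised = False
    for rate in rates:
        raised = next_warning_state(rate, raised)
        states.append(raised)
    return states


print(simulate([10, 25, 15, 6, 4, 10]))
# [False, True, True, True, False, False]
```

The rate of 10% appears twice with different outcomes: it does not raise the alert on its own, but it would keep an already raised alert active.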

Pairing with actual values

An anomaly-rate alert tells you something is unusual, but not what the actual numbers are. To get the real values alongside the anomaly alert, create a companion alert on the raw metric:

 alarm: cpu_usage_5min
    on: system.cpu
lookup: average -5m of user,system
 units: %
 every: 30s
  warn: $this > 80
  crit: $this > 95
  info: average user+system CPU utilization over the last 5 minutes

When the anomaly alert fires, the companion alert can provide the concrete values in a separate alert — for example, "CPU anomaly rate 35%" alongside "CPU utilization 92%".

Tip: Use foreach in a template to generate one alert instance per dimension (e.g., one per CPU state). Note that foreach and of serve different purposes: of selects which dimensions the lookup aggregates, while foreach creates separate alert instances for each matching dimension in a template. For the full alert syntax, see the health configuration reference.

Adding context to alert notifications

The info and summary fields support template variables that add contextual detail to notifications:

Variable	Replaced With
`${family}`	Family instance (for example, `eth0`)
`${label:LABEL_NAME}`	Chart label value

Important: $this is available only in calc, warn, and crit expressions, not in the info or summary fields.

Investigating anomaly alerts

When an anomaly alert fires, use Netdata's built-in tools to investigate the root cause:

Alert Troubleshooting — generate a one-click report from any fired alert, assessing its validity, uncovering correlated signals, and proposing a root-cause hypothesis
Investigations — ask open-ended questions about your infrastructure for deeper analysis beyond a single alert
