Anomaly Advisor
The Anomaly Advisor (the "Anomalies" tab on Netdata dashboards) is a troubleshooting assistant that correlates anomalies across your entire infrastructure and presents them as a ranked list of metrics, sorted by anomaly severity.
It is built on three components: per-metric anomaly detection using k-means clustering (18 models per metric), pre-computed Node Anomaly Rate (NAR) correlation charts, and a specialized scoring engine that evaluates thousands of metrics simultaneously. When you highlight an incident timeframe, the scoring engine analyzes all metrics and returns an ordered list - typically placing root causes within the top 30-50 results.
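To illustrate the first component, here is a minimal sketch of per-metric anomaly detection with a multi-model k-means (k=2) consensus, written in Python with scikit-learn. The feature construction, thresholds, and window sizes below are illustrative assumptions, not Netdata's actual implementation, which is built into the agent.

```python
# Minimal sketch of per-metric anomaly detection with a multi-model
# k-means (k=2) consensus. Illustrative only: preprocessing, thresholds,
# and window sizes are assumptions, not Netdata's implementation.
import numpy as np
from sklearn.cluster import KMeans

def make_features(values, lags=5):
    """Turn a 1-D series into lagged feature vectors (one per sample)."""
    v = np.asarray(values, dtype=float)
    return np.stack([v[i : len(v) - lags + i] for i in range(lags)], axis=1)

class MetricModel:
    """One k-means (k=2) model trained on one window of a single metric."""

    def __init__(self, training_values, percentile=99):
        X = make_features(training_values)
        self.km = KMeans(n_clusters=2, n_init=10).fit(X)
        # Anomaly score = distance to the nearest cluster center.
        train_scores = self.km.transform(X).min(axis=1)
        # Threshold: e.g. the 99th percentile of training scores.
        self.threshold = np.percentile(train_scores, percentile)

    def is_anomalous(self, recent_values):
        x = make_features(recent_values)[-1:]          # latest feature vector
        score = self.km.transform(x).min(axis=1)[0]
        return score > self.threshold

def anomaly_bit(models, recent_values):
    """Consensus: flag the sample only if *all* trained models agree."""
    return int(all(m.is_anomalous(recent_values) for m in models))

# Usage: train 18 models on successive windows of history, then score new data.
rng = np.random.default_rng(0)
history = np.sin(np.linspace(0, 60, 3000)) + rng.normal(0, 0.1, 3000)
windows = np.array_split(history, 18)                  # one window per model
models = [MetricModel(w) for w in windows]
print(anomaly_bit(models, history[-50:]))              # 0 = normal, 1 = anomalous
```

The point of requiring agreement across models is noise reduction: a single model's false positive does not set the anomaly bit, but a pattern that none of the trained models recognizes does.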
The system works by detecting anomaly clusters. A single event (like an SSH login, a backup job, or a service restart) typically triggers anomalies across dozens of related metrics - CPU, memory, network, disk I/O, and application-specific counters. The Anomaly Advisor captures these correlations and ranks them by severity, effectively showing you what changed and how those changes cascaded through your infrastructure.
This approach inverts traditional troubleshooting. Instead of forming hypotheses and validating them one by one, you start with data - a ranked list of what actually deviated from normal patterns. The tool works without requiring system-specific knowledge, though interpreting results still requires engineering expertise.
Limitations: The approach works best for sudden changes and for patterns not seen in the last ~54 hours of training data. It cannot detect anomalies in stopped services (no data = no anomalies), and it may miss gradually evolving issues where each increment appears normal.
System Characteristics
| Aspect | Implementation | Operational Benefit |
|---|---|---|
| Data Source | Per-metric anomaly detection using k-means (k=2) with 18-model consensus | Comprehensive coverage, no blind spots |
| Correlation Engine | Pre-computed Node Anomaly Rate (NAR) charts updated in real time | Instant blast-radius visualization |
| Query Engine | Specialized scoring engine evaluating thousands of metrics simultaneously | Returns a ranked list, not time-series data |
| Ranking Algorithm | Anomaly severity scoring across the selected time window | Root cause typically in the top 30-50 results |
| Infrastructure View | Dual charts: % anomalous and absolute count per node | Distinguishes small-node spikes from large-node issues |
| Time to Insight | Highlight timeframe → ranked results in seconds | Minutes to root cause vs. hours of hypothesis testing |
| Expertise Required | No system-specific knowledge needed to surface anomalies | Lowers the expertise barrier; interpreting results still takes engineering judgment |
| Dependency Discovery | Correlated anomalies reveal component relationships | Exposes hidden infrastructure dependencies |
| Best Use Cases | Sudden changes, cascading failures, multi-node incidents | Excellent for "what just happened?" scenarios |
| Limitations | Cannot detect stopped services or gradual degradation | Not a replacement for all monitoring |
The limitations in more detail:
- Stopped services: No data = no anomalies detected
- Gradual degradation: Changes that evolve slowly within the 54-hour training window may appear normal
- Pattern fragments: If the pieces of an anomalous pattern each appeared separately in the training data, the model consensus may not trigger
Visualizing Cascading Infrastructure-Level Effects
The Anomaly Advisor provides two views of node-level anomalies, revealing how issues propagate across infrastructure. Each node-level chart aggregates the underlying anomaly bits calculated per metric (as described in Machine Learning Anomaly Detection):
This visualization shows two distinct anomaly clusters:
First cluster (left):
- Shows clear propagation: one node spikes first, followed by three more nodes in sequence
- Each subsequent node shows anomalies shortly after the previous one
- Classic cascading pattern where an issue on one node impacts dependent nodes
Second cluster (right):
- Multiple nodes become anomalous simultaneously
- The final node shows the largest spike (200+ anomalous metrics in absolute count)
- Indicates either a shared resource issue or the final node being the aggregation point
The two charts provide different perspectives:
- Top chart - Percentage of anomalous metrics per node (spikes up to 10%)
- Bottom chart - Absolute count of anomalous metrics per node (spikes up to 200+)
This dual view helps distinguish between:
- Small nodes with high anomaly rates (high percentage, low count)
- Large nodes with many anomalies (lower percentage, high count)
This visualization provides the infrastructure-level blast radius of any incident. At a glance, you can see:
- Which nodes were affected
- When each node was impacted
- The severity of impact on each node
- Whether the issue propagated sequentially or hit multiple nodes simultaneously
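To make the node-level aggregation concrete, here is a minimal sketch of how per-metric anomaly bits could roll up into the two charts above (percentage anomalous and absolute count). The data shapes and numbers are hypothetical; Netdata computes these aggregates inside the agent.

```python
# Sketch of how per-metric anomaly bits roll up into the two node-level
# views. Data shapes are hypothetical; Netdata computes these aggregates
# internally as part of its query engine.
import numpy as np

def node_level_views(anomaly_bits):
    """anomaly_bits: 2-D array of shape (n_metrics, n_samples), values 0 or 1.

    Returns (percentage_anomalous, absolute_count) per time step:
      - percentage: share of this node's metrics that are anomalous (top chart)
      - count: absolute number of anomalous metrics (bottom chart)
    """
    bits = np.asarray(anomaly_bits)
    count = bits.sum(axis=0)                       # anomalous metrics per step
    percentage = 100.0 * count / bits.shape[0]     # normalized by metric count
    return percentage, count

# A small node (200 metrics) and a large node (5000 metrics) can show similar
# absolute counts but very different percentages - which is why the dual view
# matters.
rng = np.random.default_rng(1)
small = rng.integers(0, 2, size=(200, 60)) * (rng.random((200, 60)) < 0.10)
large = rng.integers(0, 2, size=(5000, 60)) * (rng.random((5000, 60)) < 0.004)
for name, node in (("small", small), ("large", large)):
    pct, cnt = node_level_views(node)
    print(f"{name}: peak {pct.max():.1f}% anomalous, peak count {cnt.max()}")
```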
When you highlight any spike or cluster (click and drag on the timeline), the scoring engine analyzes all metrics from all affected nodes during that period, ranks them by anomaly severity within the selected timeframe, and returns an ordered list that typically reveals the root cause within the top 30-50 results.
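Conceptually, the ranking step is simple: score each metric by how anomalous it was inside the highlighted window, then sort. The sketch below illustrates that idea only; it is not the agent's actual scoring engine.

```python
# Sketch of the ranking step: score every metric by its anomaly rate inside
# the highlighted window, then sort descending. Illustrative only.
import numpy as np

def rank_metrics(anomaly_bits_by_metric, window, top_n=50):
    """anomaly_bits_by_metric: dict of metric name -> 1-D array of 0/1 bits.
    window: (start_index, end_index) of the highlighted timeframe.
    Returns the top_n metrics ordered by anomaly rate within the window."""
    start, end = window
    scores = {
        name: float(np.mean(bits[start:end]))   # anomaly rate in the window
        for name, bits in anomaly_bits_by_metric.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# Hypothetical example: three metrics, one clearly anomalous in the window.
bits = {
    "disk.io": np.array([0, 0, 1, 1, 1, 1, 0, 0]),
    "system.cpu": np.array([0, 0, 0, 1, 0, 0, 0, 0]),
    "net.packets": np.array([0, 0, 0, 0, 0, 0, 0, 0]),
}
print(rank_metrics(bits, window=(2, 6), top_n=3))
# [('disk.io', 1.0), ('system.cpu', 0.25), ('net.packets', 0.0)]
```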
How to Use It
- Click the Anomalies tab in any dashboard
- Highlight the incident time window (click and drag on any chart)
- Review the ranked list of anomalous metrics
- Root cause usually surfaces in top 30-50 metrics
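If you prefer to script the steps above, recent Netdata agents expose a weights endpoint that scores metrics over a time window, with an anomaly-rate method used for this kind of ranking. Treat the endpoint, parameters, and output format in this sketch as assumptions and check the API documentation for your agent version.

```python
# Scripted equivalent of "highlight a window, get a ranked list", assuming a
# local agent exposing the v1 weights endpoint with the anomaly-rate method.
# Endpoint name and parameters may differ by agent version; this is a sketch,
# not authoritative API documentation.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "method": "anomaly-rate",   # ask for ranking by anomaly rate in the window
    "after": -600,              # window start: 600 seconds ago (relative)
    "before": 0,                # window end: now
})
url = f"http://localhost:19999/api/v1/weights?{params}"
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)

print(json.dumps(result, indent=2)[:2000])  # inspect the returned scores
```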
Learn More
For detailed information about using the Anomalies tab, see:
- Anomalies Tab Documentation
- Machine Learning Anomaly Detection - The foundation powering the Anomaly Advisor