# Alerts and Notifications in Netdata

## Introduction to Alerting
Netdata provides a distributed, real-time health monitoring framework that evaluates conditions against your metrics and executes actions on state transitions. You can configure notifications as one of these actions.
Unlike traditional monitoring systems, Netdata evaluates alerts at multiple levels simultaneously: on the edge (Agents) and at aggregation points (Parents), with deduplication in Netdata Cloud. This allows your teams to implement different alerting strategies at different infrastructure levels.
## Understanding Alerts in Netdata
Netdata alerts function as component-level watchdogs. You attach them to specific components/instances (network interfaces, database instances, web servers, containers, processes) where they evaluate metrics at configurable intervals.
To simplify your configuration, you can define alert templates once and apply them to all matching components. The system matches instances by host labels, instance labels, and names, allowing you to define the same alert multiple times with different matching criteria.
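For example, you can restrict a template to hosts that carry a given label. This is a minimal sketch; the `environment` label and its values are hypothetical, and the dimensions follow the `disk.space` example used later on this page:

```
# Apply this template only to hosts labeled environment = production
   template: disk_space_usage_prod
         on: disk.space
host labels: environment = production
       calc: $used * 100 / ($used + $available)
      units: %
       warn: $this > 80
       crit: $this > 90
```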
Each alert provides a name, value, unit, and status, making them easy to display in dashboards and send as meaningful notifications regardless of your infrastructure's complexity.
## Where Your Alerts Run
Your alerts evaluate at the edge. Every Netdata Agent and Parent runs alerts on the metrics it processes and stores (enabled by default, but you can disable alerting at any level). When you stream metrics to a Parent, the Parent evaluates its own alerts on those metrics independently of the child's alerts. Each Agent maintains its own alert configuration and evaluates alerts autonomously. Metric streaming doesn't propagate alert configurations or transitions to Parents.
```
┌─────────┐     Metrics      ┌──────────┐     Metrics      ┌──────────┐
│  Child  │ ───────────────> │ Parent 1 │ ───────────────> │ Parent 2 │
│  Agent  │    of child      │  Agent   │   of child +     │  Agent   │
└────┬────┘                  └────┬─────┘    Parent 1      └────┬─────┘
     │                            │                             │
     │ Evaluates alerts on        │ Evaluates alerts on         │ Evaluates alerts on
     │ local metrics              │ child + local metrics       │ all streamed + local
     │                            │                             │
     ▼                            ▼                             ▼
  Alerts                       Alerts                        Alerts
```
## Alert Actions and Notifications
Your Netdata Agents treat notifications as actions triggered by alert status transitions. Agents can dispatch notifications or perform automation tasks like scaling services, restarting processes, or rotating logs. Actions are shell scripts or executable programs that receive all alert transition metadata from Netdata.
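As a minimal sketch of such an action (the script path and log file are hypothetical), an alert definition can point at a custom script with its `exec` line, and Netdata invokes it with the same transition metadata it passes to its default notification handler:

```sh
#!/bin/sh
# Hypothetical action script, referenced from an alert definition with:
#   exec: /usr/local/bin/my-alert-action.sh
# Netdata passes the alert transition metadata as positional arguments;
# this sketch only logs what it receives instead of acting on it.
LOG=/var/log/netdata-alert-actions.log
echo "$(date) alert action invoked with: $*" >> "$LOG"
exit 0
```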
When you claim Agents to Netdata Cloud, they send their alert configurations and transitions to Cloud, which deduplicates them (merging multiple transitions from different Agents for the same host). Netdata Cloud triggers notifications centrally through its integrations (Slack, Microsoft Teams, Amazon SNS, PagerDuty, OpsGenie).
Netdata Cloud's intelligent deduplication works by:
- Consolidating multiple Agents reporting the same alert
- Prioritizing highest severity: CRITICAL > WARNING > CLEAR
- Creating unique keys: Alert name + Instance + Node
Your Agents and Netdata Cloud trigger actions independently using their own configurations and integrations.
This design enables you to:
- **Maintain team independence**: Different teams run their own Parents with custom alerts
- **Implement edge intelligence**: Critical alerts trigger automations directly on nodes
- **Scale naturally**: Alert evaluation distributes with your infrastructure
- **Mix strategies**: Combine edge, regional, and central alerting
## Quick Example
**Web Server (Child):**

- Alert: system CPU > 80% triggers scale out (sketched in the config after this example)
- Alert: process X memory > 90% restarts process X

**DevOps Parent:**

- Alert: Response time > 500ms across all web servers
- Alert: Error rate > 1% for any service

**SRE Parent:**

- Alert: Anomaly detection on traffic patterns
- Alert: Capacity planning thresholds

**Netdata Cloud:**

- Receives all alert transitions
- Deduplicates overlapping alerts
- Shows CRITICAL if any instance reports CRITICAL
- Provides a unified view for incident response
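As a sketch of the child-side automation above (the `scale-out.sh` script is hypothetical; the dimension list mirrors Netdata's stock CPU alert):

```
# Scale out when average system CPU exceeds 80% over the last minute
 template: cpu_usage_scale_out
       on: system.cpu
   lookup: average -1m unaligned of user,system,softirq,irq,guest
    units: %
    every: 10s
     warn: $this > 80
     exec: /usr/local/bin/scale-out.sh
```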
Each level operates independently while Netdata Cloud provides a coherent, deduplicated view of your entire infrastructure's health (when all agents connect directly to Cloud).
## Managing Alert Configuration
You configure Netdata alerts in three layers:

- **Stock Alerts**: Netdata provides hundreds of alert definitions in `/usr/lib/netdata/conf.d/health.d` to detect common issues. Don't edit these directly; updates will overwrite your changes (see the snippet below for the safe way to customize them).
- **Your Custom Alerts**: Create your own definitions in `/etc/netdata/health.d`.
- **Dynamic UI Configuration**: Use Netdata dashboards to edit, add, enable, or disable alerts on any node through the streaming transport.
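To customize a stock alert safely, copy it into `/etc/netdata/health.d` with `edit-config` and edit the copy (using `health.d/cpu.conf` here just as an example):

```bash
cd /etc/netdata
sudo ./edit-config health.d/cpu.conf

# Apply health configuration changes without restarting the agent
sudo netdatacli reload-health
```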
## Managing Notification Configuration
You can configure notifications for any infrastructure node at three levels:

| Level | What It Evaluates | Where Notifications Come From | Use Case | Documentation |
|---|---|---|---|---|
| Netdata Agent | Local metrics | Netdata Agent | Edge automation | Agent integrations |
| Netdata Parent | Local and children metrics | Netdata Parent | Edge automation | Agent integrations |
| Netdata Cloud | Received transitions | Netdata Cloud | Webhooks, role/room based | Cloud integrations |
When using Parents and Cloud with default settings, you may receive duplicate email notifications, because Agents send emails by default when an MTA exists on their system. When using Cloud, disable email notifications on Agents and Parents by setting `SEND_EMAIL="NO"` in `/etc/netdata/health_alarm_notify.conf` using `edit-config`.
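The resulting setting looks like this (see the Agent setup steps below for how to open the file with `edit-config`):

```
# /etc/netdata/health_alarm_notify.conf
SEND_EMAIL="NO"
```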
## Best Practices for Large Deployments

### Central Alerting Strategy
When you:
- Don't need edge automation (no scripts reacting to alerts)
- Use highly available Parents for all nodes
- Use Netdata Cloud (at least for Parents)
Follow these steps:

- Disable health monitoring on child nodes (see the snippet below)
- Share the same alert configuration across Parents (use a git repo or CI/CD)
- Disable Parent notifications (`SEND_EMAIL="NO"` in `/etc/netdata/health_alarm_notify.conf`)
- Keep only Cloud notifications
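To disable health monitoring on a child node (the first step above), set the following and restart the agent:

```
# /etc/netdata/netdata.conf on each child node
[health]
    enabled = no
```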
This emulates traditional monitoring tools where you configure alerts centrally and dispatch notifications centrally.
### Edge Flexible Alerting Strategy
When you:
- Need edge automation (scale out, restart processes)
- Use Parents
- Use Cloud for all nodes
Follow these steps:

- Disable stock alerts on children by setting `enable stock health configuration` to `no` in the `[health]` section of `/etc/netdata/netdata.conf` (see the snippet below)
- Configure only automation-required alerts on children
- Keep stock alerts on Parents but disable notifications (`SEND_EMAIL="NO"`)
- Keep only Cloud notifications
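To disable only the stock alert definitions on a child, while keeping its health engine running for your own alerts (the first step above):

```
# /etc/netdata/netdata.conf on each child node
[health]
    enabled = yes
    enable stock health configuration = no
```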
This enables edge automation on children while maintaining central alerting control and deduplicated Cloud notifications.
## Set Up Alerts via Netdata Cloud
- Connect your nodes to Netdata Cloud
- Navigate to **Space → Notifications**
- Choose your integration (Slack, Amazon SNS, Splunk)
- Configure alert severity filters
## Set Up Alerts via Netdata Agent
1. Open the notification configuration:

   ```bash
   sudo ./edit-config health_alarm_notify.conf
   ```

2. Enable your method (example: email):

   ```
   SEND_EMAIL="YES"
   DEFAULT_RECIPIENT_EMAIL="[email protected]"
   ```

3. Verify your system can send mail (sendmail, SMTP relay)

4. Restart the agent:

   ```bash
   sudo systemctl restart netdata
   ```
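To verify delivery, you can dispatch a test notification as the `netdata` user (the plugins directory shown is the common Linux location and may differ on your installation):

```bash
sudo su -s /bin/bash netdata
/usr/libexec/netdata/plugins.d/alarm-notify.sh test
```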
## Core Alerting Concepts
Netdata supports two alert types:
- **Alarms**: Attach to specific instances (a specific network interface, a specific database instance)
- **Templates**: Apply to all matching instances (all network interfaces, all databases)
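A minimal sketch of the difference (interface names follow Netdata's conventions; thresholds are arbitrary):

```
# Alarm: watches one specific instance (the eth0 interface chart)
 alarm: eth0_inbound_traffic
    on: net.eth0
lookup: average -1m unaligned of received
 units: kilobits/s
  warn: $this > 100000

# Template: applies to every chart with the net.net context (all interfaces)
template: interface_inbound_traffic
      on: net.net
  lookup: average -1m unaligned of received
   units: kilobits/s
    warn: $this > 100000
```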
### Alert Lifecycle and States
Your alerts produce more than threshold checks. Each generates:
- A value: Combines metrics or other alerts using time-series lookups and expressions
- A unit: Makes alerts meaningful ("seconds", "%", "requests/s")
- A name: Identifies the alert
This enables sophisticated alerts like:

- `out of disk space time: 450 seconds` - predicts when the disk fills at the current rate
- `3xx redirects: 12.5 percent` - calculates redirects as a percentage of total requests
- `response time vs yesterday: 150%` - compares the current value to a historical baseline
### Alert States
Your alerts exist in one of these states:
| State | Description | Trigger |
|---|---|---|
| CLEAR | Normal; conditions exist but are not triggered | Warning and critical conditions evaluate to zero |
| WARNING | Warning threshold exceeded | Warning condition evaluates to non-zero |
| CRITICAL | Critical threshold exceeded | Critical condition evaluates to non-zero |
| UNDEFINED | Cannot evaluate | No conditions defined, or value is NaN/Inf |
| UNINITIALIZED | Never evaluated | Alert just created |
| REMOVED | Alert deleted | Child disconnected, agent exit, or health reload |
Alerts transition freely between states based on:
- Calculated value (including NaN, Inf, or valid numbers)
- Warning/critical conditions (evaluation results)
- External events (disconnections, reloads, exits)
Key behaviors:
- Alerts jump directly from CLEAR to CRITICAL (no WARNING required)
- WARNING and CRITICAL evaluate independently
- Alerts return to appropriate state when data becomes available
- CRITICAL takes precedence when both conditions are true
## Alert Evaluation Process

### 1. Calculate Value
Your alerts perform complex calculations:
```
  lookup               calc               warn,crit           status
┌──────────┐        ┌──────────┐        ┌──────────┐       ┌───────────┐
│ Database │        │Expression│        │ Warning  │       │  Execute  │
│  Query   │──────> │Processor │──────> │ Critical │ ────> │ Action on │
│(optional)│ $this  │(optional)│ $this  │  Checks  │       │Transition │
└──────────┘        └──────────┘        └──────────┘       └───────────┘
```
Examples:

```
# Simple threshold
calc: $used
# Result: $this = latest value of dimension 'used'

# Time-series lookup
lookup: average -1h of used
# Result: $this = average of 'used' over the last hour

# Combined calculation
lookup: average -1h of used
  calc: $this * 100 / $total
# Result: $this = hourly average as a percentage of $total

# Baseline comparison
lookup: average -1h of used
  calc: $this * 100 / $average_yesterday
# Result: $this = percentage vs yesterday's average
```
### 2. Evaluate Conditions
After calculating, check conditions:
```
# Simple conditions
warn: $this > 80
crit: $this > 90

# Flapping prevention
warn: ($status >= $WARNING)  ? ($this > 50) : ($this > 80)
crit: ($status == $CRITICAL) ? ($this > 70) : ($this > 90)

# Complex conditions
warn: $this > 80 AND $rate > 10
crit: $this > 90 OR $failures > 5
```
### 3. Determine State
Each condition evaluates to:
- NaN or Inf → UNDEFINED
- Non-zero → RAISED
- Zero → CLEAR
Final status:
- Critical RAISED → CRITICAL (priority)
- Warning RAISED → WARNING
- Either CLEAR → CLEAR
- Both missing/UNDEFINED → UNDEFINED
## Evaluation Timing
Alert evaluation runs independently from data collection:
```
Data Collection              Alert Evaluation
      │                            │
      ▼ every 1s                   ▼ configurable interval
  [Metrics] ─────────────────> [Alert Engine]
                                   │
                                   ▼
                            Query metrics,
                            calculate values,
                            check conditions
```
- **Default interval**: the query window duration (with `lookup`), otherwise a manual setting is required
- **Configurable**: use `every` for custom intervals
- **Constrained**: cannot evaluate faster than the data collection frequency
## Anti-Flapping Mechanisms
Netdata prevents alert flapping through:
### 1. Hysteresis

```
warn: ($status < $WARNING) ? ($this > 80) : ($this > 50)
```

Triggers at 80, clears at 50, preventing flapping between 50 and 80.
### 2. Dynamic Delays

Alerts transition immediately in dashboards, but notifications use exponential backoff.
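Per alert, this backoff is tuned with the `delay` line; a sketch with arbitrary values:

```
# Wait 1m before notifying on escalation, 5m on de-escalation;
# repeated transitions multiply the delay, capped at 1 hour.
delay: up 1m down 5m multiplier 2 max 1h
```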
### 3. Duration Requirements

```
lookup: average -10m of used
  warn: $this > 80
```

Because the value is a 10-minute average, short spikes don't trigger the alert.
## Multi-Stage Alerts
Create dependent alerts:
```
# Stage 1: Baseline
template: requests_average_yesterday
      on: web_log.requests
  lookup: average -1h at -1d
   every: 10s

# Stage 2: Current
template: requests_average_now
      on: web_log.requests
  lookup: average -1h
   every: 10s

# Stage 3: Compare
template: web_requests_vs_yesterday
      on: web_log.requests
    calc: $requests_average_now * 100 / $requests_average_yesterday
   units: %
    warn: $this > 150 || $this < 75
    crit: $this > 200 || $this < 50
```
## Available Variables
Variables resolve in order (first match wins):
### 1. Built-in Variables
| Variable | Description | Value |
|---|---|---|
| `$this` | Current calculated value | Result from `lookup`/`calc` |
| `$after` | Query start timestamp | Unix timestamp |
| `$before` | Query end timestamp | Unix timestamp |
| `$now` | Current time | Unix timestamp |
| `$last_collected_t` | Last collection time | Unix timestamp |
| `$update_every` | Collection frequency | Seconds |
| `$status` | Current status code | -2 to 3 |
| `$REMOVED` | Status constant | -2 |
| `$UNINITIALIZED` | Status constant | -1 |
| `$UNDEFINED` | Status constant | 0 |
| `$CLEAR` | Status constant | 1 |
| `$WARNING` | Status constant | 2 |
| `$CRITICAL` | Status constant | 3 |
### 2. Dimension Values
| Syntax | Description | Example |
|---|---|---|
| `$dimension_name` | Last normalized value | `$used` |
| `$dimension_name_raw` | Last raw collected value | `$used_raw` |
| `$dimension_name_last_collected_t` | Collection timestamp | `$used_last_collected_t` |

```
template: disk_usage_percent
      on: disk.space
    calc: $used * 100 / ($used + $available)
   units: %
```
### 3. Chart Variables

```
calc: $used > $threshold   # if the chart defines a 'threshold' variable
```
### 4. Host Variables

```
warn: $connections > $max_connections * 0.8   # if the host defines 'max_connections'
```
### 5. Other Alerts

```
# Alert 1
template: cpu_baseline
    calc: $system + $user

# Alert 2
template: cpu_check
    calc: $system
    warn: $this > $cpu_baseline * 1.5
```
### 6. Cross-Context References

```
template: disk_io_vs_iops
      on: disk.io
    calc: $reads / ${disk.iops.reads}
   units: bytes per operation
```
## Variable Resolution and Label Scoring
When alerts reference variables matching multiple instances, Netdata uses label similarity scoring:
1. **Collect candidates** with matching names
2. **Score by labels**: count common labels
3. **Select the best match**: highest label overlap
Example: An alert on `disk.io` (labels: `device=sda`, `mount=/data`) references `${disk.iops.reads}`:

- `disk.iops` for sda (labels match) → score: 2
- `disk.iops` for sdb (no match) → score: 0

Result: the alert uses sda's value.
## Missing Data Handling
During lookups with missing data:

- **All values NULL**: `$this` becomes `NaN`
- **Some values exist**: NULLs are ignored and the calculation continues
- **The dimension doesn't exist**: `$this` becomes `NaN`

This handles intermittent collection, dynamic dimensions, and partial outages.
## Evaluation Frequency
Evaluation frequency is determined by:

- **With `lookup`**: defaults to the window duration

  ```
  lookup: average -5m   # evaluates every 5 minutes
  ```

- **Without `lookup`**: set it explicitly

  ```
  every: 10s
   calc: $system + $user
  ```

- **Custom interval**: overrides the default

  ```
  lookup: average -1m
   every: 10s   # check every 10s despite the 1m window
  ```
Constraints:

- Cannot evaluate faster than the data collection frequency
- High frequency impacts performance
- Use larger intervals with `unaligned` for efficiency
## Troubleshooting Your Alerts

### Netdata Assistant
The Netdata Assistant provides AI-powered troubleshooting when alerts trigger:
- Click the alert in your dashboard
- Press the Assistant button
- Receive customized troubleshooting tips
The Assistant window follows you through dashboards for easy reference while investigating.
### Community Resources
Visit our Alerts Troubleshooting space for complex issues. Get help through GitHub or Discord. Share your solutions to help others.
## Customizing Alerts
Tune alerts for your environment by adjusting thresholds, writing custom conditions, silencing alerts, and using statistical functions.