Netdata's health watchdog is highly configurable, with support for dynamic thresholds, hysteresis, alarm templates, and more. You can tweak any of the existing alarms based on your infrastructure's topology or specific monitoring needs, or create new entities.
While you can see active alarms both on the local dashboard and Netdata Cloud, all health alarms are configured per node via individual Netdata Agents. If you want to deploy a new alarm across your infrastructure, you must configure each node with the same health configuration files.
All of Netdata's health configuration files are in Netdata's config directory, inside the health.d/ directory. Navigate to your Netdata config directory and use edit-config to make changes to any of these files.
For example, to edit the cpu.conf health configuration file, run:
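A minimal sketch of that invocation follows. It assumes the config directory is /etc/netdata, which is common but depends on how Netdata was installed, and it only calls edit-config if the script is actually present there. NETDATA_CONFIG_DIR is a variable introduced here for illustration.

```shell
# Assumption: /etc/netdata is the config directory; override if yours differs.
NETDATA_CONFIG_DIR="${NETDATA_CONFIG_DIR:-/etc/netdata}"

if [ -x "$NETDATA_CONFIG_DIR/edit-config" ]; then
    cd "$NETDATA_CONFIG_DIR" || exit 1
    # Opens health.d/cpu.conf in your editor.
    sudo ./edit-config health.d/cpu.conf
else
    echo "edit-config not found in $NETDATA_CONFIG_DIR" >&2
fi
```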
Each health configuration file contains one or more health entities, which always begin with
For example, here is the first health entity in
To tune this alarm to trigger warning and critical alarms at a lower CPU utilization, change the warn and crit lines to the values of your choosing. For example:
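As one illustration, assuming the entity uses Netdata's hysteresis form for its warn and crit lines, lowered thresholds could look like the fragment below. The numbers are illustrative only, and the snippet writes to a scratch file (a path chosen here, not part of Netdata) instead of touching the real cpu.conf.

```shell
# Scratch file so the real cpu.conf stays untouched (path is an assumption).
tuned_fragment="${TMPDIR:-/tmp}/cpu-tuned-fragment.conf"

# The quoted heredoc delimiter keeps $this and $status literal.
cat > "$tuned_fragment" <<'EOF'
  warn: $this > (($status >= $WARNING)  ? (60) : (75))
  crit: $this > (($status == $CRITICAL) ? (75) : (85))
EOF

cat "$tuned_fragment"
```

With these lines, a warning is raised once the value exceeds 75 and is kept until it falls to 60 or below. A critical alarm is raised above 85 and kept until the value falls to 75 or below.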
Save the file and reload Netdata's health configuration to make your changes live.
Instead of disabling an alarm altogether, or even disabling all alarms, you can silence individual alarms by changing one line in a given health entity. To silence any single alarm, change the to: line in its entity to
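As a hedged illustration: in Netdata's health reference, the recipient value that suppresses an alarm's notifications is silent. The sketch below applies that change to a scratch copy of a deliberately incomplete, hypothetical entity (example_alarm, with sysadmin as the recipient) and does not modify a real file in health.d/.

```shell
# Hypothetical, partial entity written to a scratch file for demonstration.
entity_file="$(mktemp)"
cat > "$entity_file" <<'EOF'
 alarm: example_alarm
    on: system.cpu
    to: sysadmin
EOF

# Rewrite only the "to:" line, preserving its indentation.
sed 's/^\( *to:\).*/\1 silent/' "$entity_file" > "$entity_file.silenced"

cat "$entity_file.silenced"
```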
While tuning existing alarms may work in some cases, you may need to write entirely new health entities based on how your systems, containers, and applications work.
Read Netdata's health reference for a full listing of the format, syntax, and functionality of health entities.
To write a new health entity into a new file, navigate to your Netdata config directory, then use touch to create a new file in the health.d/ directory. Use edit-config to start editing the file.
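These steps can be sketched as follows. The file name ram-usage.conf is a hypothetical choice, and /etc/netdata is an assumed config directory. The commands run only if edit-config exists in that directory.

```shell
# Assumptions: config directory location and the new file's name.
NETDATA_CONFIG_DIR="${NETDATA_CONFIG_DIR:-/etc/netdata}"
NEW_HEALTH_FILE="health.d/ram-usage.conf"

if [ -x "$NETDATA_CONFIG_DIR/edit-config" ]; then
    cd "$NETDATA_CONFIG_DIR" || exit 1
    sudo touch "$NEW_HEALTH_FILE"          # create the empty file
    sudo ./edit-config "$NEW_HEALTH_FILE"  # open it for editing
else
    echo "edit-config not found in $NETDATA_CONFIG_DIR" >&2
fi
```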
As an example, let's create a
For example, here is a health entity that triggers a warning alarm when a node's RAM usage rises above 80%, and a critical alarm above 90%:
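The entity can be reconstructed from the line-by-line walkthrough that follows. The sketch below writes it to a scratch file so it can be inspected before being copied into health.d/. The output path and the info text are assumptions. The remaining lines follow the description in this guide.

```shell
# Scratch location; copy the result into health.d/ once you are happy with it.
entity_file="${TMPDIR:-/tmp}/ram-usage.conf"

# The quoted heredoc delimiter keeps $this literal for Netdata to evaluate.
cat > "$entity_file" <<'EOF'
 alarm: ram_usage
    on: system.ram
lookup: average -1m percentage of used
 units: %
 every: 1m
  warn: $this > 80
  crit: $this > 90
  info: The percentage of RAM being used by the system.
EOF

cat "$entity_file"
```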
Let's look into each of the lines to see how they create a working health entity.
alarm: The name for your new entity. The name needs to follow these requirements:
- Any letter of the alphabet or any number.
- The symbols
- Cannot be family name, or chart variable names.
on: Which chart the entity listens to.
lookup: Which metrics the alarm monitors, the duration of time to monitor, and how to process the metrics into a usable format.
average: Calculate the average of all the metrics collected.
-1m: Use metrics from 1 minute ago until now to calculate that average.
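percentage: Return the selected dimension as a percentage of the chart's total across all dimensions, rather than as an absolute value. Here, that is the percentage of RAM in use.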
of used: Specify which dimension (used) on the system.ram chart you want to monitor with this entity.
units: Use percentages rather than absolute units.
every: How often to perform the lookup calculation to decide whether or not to trigger this alarm.
warn and crit: The values at which Netdata should trigger a warning or critical alarm, respectively. This example uses simple syntax, but most pre-configured health entities use hysteresis to avoid superfluous notifications.
info: A description of the alarm, which will appear in the dashboard and notifications.
In human-readable format:
This health entity, named ram_usage, watches the system.ram chart. It looks up the last 1 minute of metrics from the used dimension and calculates the average of all those metrics in a percentage format, using a % unit. The entity performs this lookup every minute.
If the average RAM usage percentage over the last 1 minute is more than 80%, the entity triggers a warning alarm. If the usage is more than 90%, the entity triggers a critical alarm.
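To make the decision rule concrete, here is a small sketch that mimics it outside Netdata. It is not how Netdata evaluates alarms internally. classify_ram_usage is a hypothetical helper that averages percentage samples (one per line) and applies the same 80 and 90 thresholds.

```shell
# Hypothetical helper: reads RAM usage percentages on stdin, one per line,
# averages them, and prints the status the thresholds above would produce.
classify_ram_usage() {
    awk '{ sum += $1; n++ }
         END {
             if (n == 0) { print "UNDEFINED"; exit }
             avg = sum / n
             if (avg > 90)      print "CRITICAL"
             else if (avg > 80) print "WARNING"
             else               print "CLEAR"
         }'
}

# Three samples from the last minute average to about 83.3%.
printf '%s\n' 78 84 88 | classify_ram_usage   # prints WARNING
```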
When you finish writing this new health entity, reload Netdata's health configuration to see it live on the local dashboard or Netdata Cloud.
To make any changes to your health configuration live, you must reload Netdata's health monitoring system. To do that without restarting all of Netdata, run netdatacli reload-health or killall -USR2 netdata.
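The two options can be combined into a small wrapper, sketched below. reload_netdata_health is a hypothetical function name. It prefers netdatacli when available and falls back to the signal only if a netdata process is found.

```shell
# Hypothetical wrapper around the two reload methods described above.
reload_netdata_health() {
    if command -v netdatacli >/dev/null 2>&1; then
        netdatacli reload-health
    elif pgrep -x netdata >/dev/null 2>&1; then
        killall -USR2 netdata
    else
        echo "Netdata does not appear to be running" >&2
        return 1
    fi
}
```

Both commands typically need elevated privileges, so call the function from a root shell or prefix the inner commands with sudo.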
With your health entities configured properly, it's time to enable notifications to get notified whenever a node reaches a warning or critical state.
To build complex, dynamic alarms, read our guide on dimension templates.