NVMe devices monitoring with Netdata

Monitors health metrics (estimated endurance, space capacity, critical warnings, temperature, etc.) using the nvme CLI tool.

The module runs the nvme binary, which requires root privileges. It invokes nvme through sudo and assumes sudo is configured so that the netdata user can run nvme as root without a password.

Requirements

  • Install nvme-cli.

  • Add the netdata user to the /etc/sudoers file (use which nvme to find the full path to the binary):

    netdata ALL=(root) NOPASSWD: /usr/sbin/nvme
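As an alternative to editing /etc/sudoers by hand, the rule can be generated and checked before use. This is a minimal sketch, not part of the collector: the helper name nvme_sudoers_rule and the drop-in file name netdata-nvme are hypothetical, and it assumes your distribution includes files from /etc/sudoers.d.

```shell
# Hypothetical helper: print the sudoers rule for a given nvme binary path.
nvme_sudoers_rule() {
    printf 'netdata ALL=(root) NOPASSWD: %s\n' "$1"
}

# Usage on a live system, as root (file name is an assumption):
#   nvme_sudoers_rule "$(which nvme)" > /etc/sudoers.d/netdata-nvme
#   chmod 0440 /etc/sudoers.d/netdata-nvme
#   visudo -cf /etc/sudoers.d/netdata-nvme          # syntax check only
#   sudo -u netdata sudo -n "$(which nvme)" list    # -n fails instead of prompting
```

If the last command prints the device list without asking for a password, the sudo requirement is met.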

Additionally, you may need to adjust Netdata's systemd unit on Linux distributions that use systemd.

Note: This is an optional step. Only do this if adding netdata to /etc/sudoers didn't help.

The default CapabilityBoundingSet in Netdata's unit is strict and does not allow sudo. Resetting it is not ideal, but it is the next-best option when nvme cannot be executed through sudo.

As the root user, do the following:

mkdir -p /etc/systemd/system/netdata.service.d
echo -e '[Service]\nCapabilityBoundingSet=~' | tee /etc/systemd/system/netdata.service.d/unset-capability-bounding-set.conf
systemctl daemon-reload
systemctl restart netdata.service
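The same override can be written so that it is safe to run more than once, and then checked. This is a sketch under assumptions: the function name write_bounding_set_dropin is hypothetical, and it uses printf rather than echo -e only because printf behaves the same across shells.

```shell
# Hypothetical helper: create the drop-in directory (if missing) and write
# the override file. Running it repeatedly leaves the same single file.
write_bounding_set_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nCapabilityBoundingSet=~\n' \
        > "$dropin_dir/unset-capability-bounding-set.conf"
}

# Usage on a live system, as root:
#   write_bounding_set_dropin /etc/systemd/system/netdata.service.d
#   systemctl daemon-reload
#   systemctl restart netdata.service
#   systemctl show netdata.service -p CapabilityBoundingSet   # inspect the effective value
```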

Metrics

All metrics have the "nvme." prefix.

Labels per scope:

  • device: device.

| Metric | Scope | Dimensions | Units |
|---|---|---|---|
| device_estimated_endurance_perc | device | used | % |
| device_available_spare_perc | device | spare | % |
| device_composite_temperature | device | temperature | celsius |
| device_io_transferred_count | device | read, written | bytes |
| device_power_cycles_count | device | power | cycles |
| device_power_on_time | device | power-on | seconds |
| device_critical_warnings_state | device | available_spare, temp_threshold, nvm_subsystem_reliability, read_only, volatile_mem_backup_failed, persistent_memory_read_only | state |
| device_unsafe_shutdowns_count | device | unsafe | shutdowns |
| device_media_errors_rate | device | media | errors/s |
| device_error_log_entries_rate | device | error_log | entries/s |
| device_warning_composite_temperature_time | device | wctemp | seconds |
| device_critical_composite_temperature_time | device | cctemp | seconds |
| device_thermal_mgmt_temp1_transitions_rate | device | temp1 | transitions/s |
| device_thermal_mgmt_temp2_transitions_rate | device | temp2 | transitions/s |
| device_thermal_mgmt_temp1_time | device | temp1 | seconds |
| device_thermal_mgmt_temp2_time | device | temp2 | seconds |
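
To cross-check a chart against the drive itself, the raw SMART log can be read with nvme-cli and converted to the units listed above. This is a rough sketch, not the collector's code: the device path /dev/nvme0 and the helper names are assumptions. It relies on the NVMe SMART log conventions (temperature in kelvin, data units counted in thousands of 512-byte blocks, power-on time in hours); whether the collector applies exactly these conversions is also an assumption.

```shell
# Hypothetical helpers for converting raw SMART log values.
kelvin_to_celsius()   { echo $(( $1 - 273 )); }      # integer approximation of K - 273.15
data_units_to_bytes() { echo $(( $1 * 512000 )); }   # 1 data unit = 1000 * 512 bytes
hours_to_seconds()    { echo $(( $1 * 3600 )); }

# Reading the raw values on a live system (device path is an assumption):
#   sudo nvme smart-log /dev/nvme0 --output-format=json
# then, for example:
#   kelvin_to_celsius 310
```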

Configuration

No configuration required.

Troubleshooting

To troubleshoot issues with the nvme collector, run the go.d.plugin with the debug option enabled. The output should give you clues as to why the collector isn't working.

  • Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that's not the case on your system, open netdata.conf and look for the plugins setting under [directories].

    cd /usr/libexec/netdata/plugins.d/
  • Switch to the netdata user.

    sudo -u netdata -s
  • Run the go.d.plugin to debug the collector:

    ./go.d.plugin -d -m nvme
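
If the plugins directory is not at the default path, the lookup in netdata.conf described in the first step can be scripted. This is a minimal sketch: the function name plugins_dir_from_conf is hypothetical, the location /etc/netdata/netdata.conf is an assumption, and the parser assumes a simple INI-like layout where the setting may be commented out with a leading #.

```shell
# Hypothetical helper: print the raw value of the "plugins" setting found
# under the [directories] section of the given netdata.conf.
plugins_dir_from_conf() {
    awk '
        # Track whether the current section is [directories].
        /^[ \t]*\[/ { in_dirs = ($0 ~ /^[ \t]*\[directories\]/); next }
        # Accept both active and commented-out (default) entries.
        in_dirs && /^[ \t]*#?[ \t]*plugins[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")
            print
            exit
        }
    ' "$1"
}

# Usage (config path is an assumption):
#   plugins_dir_from_conf /etc/netdata/netdata.conf
```

The value is printed as written in the file, so it may be quoted or list more than one directory.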
