Nvidia GPU monitoring with Netdata

This collector monitors GPU performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, and more) using the nvidia-smi CLI tool.

Warning: this collector is under development and currently collects fewer metrics than the Python version.
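
The collector reads its data from nvidia-smi's machine-readable output. As a quick sanity check, you can inspect that output yourself; the XML query mode shown below is a standard nvidia-smi option, and it is assumed here to be the format the collector parses:

    nvidia-smi -q -x | head -n 40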

Metrics

All metrics have the "nvidia_smi." prefix.

Labels per scope:

  • gpu: product_name, product_brand.
| Metric | Scope | Dimensions | Units |
|--------|-------|------------|-------|
| gpu_pcie_bandwidth_usage | gpu | rx, tx | B/s |
| gpu_fan_speed_perc | gpu | fan_speed | % |
| gpu_utilization | gpu | gpu | % |
| gpu_memory_utilization | gpu | memory | % |
| gpu_decoder_utilization | gpu | decoder | % |
| gpu_encoder_utilization | gpu | encoder | % |
| gpu_frame_buffer_memory_usage | gpu | free, used, reserved | B |
| gpu_bar1_memory_usage | gpu | free, used | B |
| gpu_temperature | gpu | temperature | Celsius |
| gpu_clock_freq | gpu | graphics, video, sm, mem | MHz |
| gpu_power_draw | gpu | power_draw | Watts |
| gpu_performance_state | gpu | P0-P15 | state |

Configuration

No configuration required.
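
If you do want to adjust defaults such as the data collection interval, the module can be configured through Netdata's edit-config helper. The config directory and file name below are the usual defaults and may differ on your install:

    cd /etc/netdata   # or wherever your Netdata config directory is
    sudo ./edit-config go.d/nvidia_smi.conf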

Troubleshooting

To troubleshoot issues with the nvidia_smi collector, run the go.d.plugin with the debug option enabled. The output should give you clues as to why the collector isn't working.

  • Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that's not the case on your system, open netdata.conf and look for the plugins setting under [directories].

    cd /usr/libexec/netdata/plugins.d/
  • Switch to the netdata user.

    sudo -u netdata -s
  • Run the go.d.plugin to debug the collector:

    ./go.d.plugin -d -m nvidia_smi
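
If the debug output suggests that nvidia-smi itself cannot be found or executed, run it directly while still switched to the netdata user; the query flags below are standard nvidia-smi options:

    nvidia-smi --query-gpu=name,utilization.gpu --format=csv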
