
Validation and Data Quality

The plugin handles per-flow sampling-rate multiplication, template persistence, and database refresh internally — these are not concerns you have to monitor. What's left to validate is whether the data you see corresponds to the traffic that actually crossed your network. This page is the routine to confirm that.

The goal: distinguish "the data is correct" from "the data looks plausible but isn't".

What you actually need to watch

A small number of failure modes need active monitoring. Most are signalled by Netdata's existing alerts; the rest you check periodically.

| Failure mode | Detection | Notes |
| --- | --- | --- |
| Kernel-level UDP receive-buffer drops | The system alert 1m_ipv4_udp_receive_buffer_errors fires when the RcvbufErrors rate averages more than 10 errors/second over a 1-minute window. The dimension is incremental in Netdata, so the numeric threshold is per-second, not per-minute. | OS-wide signal. Ships with to: silent by default; change the to: field in your alert config to receive notifications. Tune net.core.rmem_max (see Troubleshooting). |
| An exporter stopped sending | Filter the dashboard to that exporter; the rate drops to zero. Per-exporter ingest counters are not published. | Manual periodic spot-check or external monitoring. |
| Wrong set of interfaces being exported | Cross-check show flow exporter (or vendor equivalent) on each router against the interfaces you intended. | Configuration drifts over time; audit quarterly. |
| An exporter is sampling but not communicating the rate | The plugin treats those records as unsampled, so volumes are undercounted by the sampling factor. Cross-check against SNMP. | NetFlow v7 has no rate field; v5 sometimes sends rate=0; v9 / IPFIX exporters without a Sampling Options Template lose the per-flow rate. See "Sampling rate verification" below. |
| Stale GeoIP / ASN database | No in-process signal. Check file mtimes; refresh on the schedule the provider recommends. | DB-IP and GeoLite2 ship monthly updates. |
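
To receive notifications from the buffer-drop alert, override its to: line with edit-config. The stock file shown below is where the alert usually lives; the name may differ across Netdata versions:

sudo /etc/netdata/edit-config health.d/udp_errors.conf
# in the 1m_ipv4_udp_receive_buffer_errors stanza, change
#     to: silent
# to a notification role, for example
#     to: sysadmin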

The plugin handles sampling-rate changes per flow record, and it persists decoder templates under journal_dir/decoder-state.d/ so they reload automatically across restarts. You still need external validation for packet drops, exporter drift, stale enrichment data, and exporters that do not report their sampling rate.

The minimum viable validation routine

Run this once after deployment, then quarterly, plus whenever something looks off.

1. SNMP cross-check

Compare flow-derived bandwidth on a specific interface to the SNMP ifInOctets / ifOutOctets counter for that same interface. They should be close.

The flow-derived bandwidth: filter the dashboard to one exporter and one interface (Ingress Interface Name OR Egress Interface Name — pick one), and read the bytes/s rate.

The SNMP-derived bandwidth: from your SNMP monitoring (Netdata's snmp.d, your separate SNMP system, or your network team).
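
If you need a quick manual sample instead, two snmpget readings a minute apart give a rate. A minimal sketch, assuming SNMPv2c, a hypothetical community string public, a router reachable as router1, and ifIndex 3:

OID='IF-MIB::ifHCInOctets.3'                     # 64-bit input-octet counter for ifIndex 3
A=$(snmpget -v2c -c public -Ovq router1 "$OID")
sleep 60
B=$(snmpget -v2c -c public -Ovq router1 "$OID")
echo "$(( (B - A) / 60 )) bytes/s"               # compare to the dashboard's bytes/s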

Acceptable difference: roughly 5-15%. SNMP includes layer-2 traffic (ARP, STP, LLDP, routing protocols, interface-level multicast) that flow data filters out. Expect SNMP slightly higher.

Not acceptable: more than 30% gap. That indicates one of:

  • UDP drops at the kernel. Check the alert mentioned above, or run sudo ss -uamn sport = :2055 and inspect the d<N> value inside the skmem:(...) line (the sock_drop counter increments on every dropped datagram; see the sketch after this list). The -n flag keeps the port numeric in the output.
  • Sampling rate not honoured (the exporter is sampling but not communicating the rate). See "Sampling rate verification" below.
  • Wrong interfaces being exported.
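
A minimal way to isolate that counter, assuming the default listening port 2055:

sudo ss -uamn 'sport = :2055' | grep -oE 'd[0-9]+\)' | tr -d ')'
# prints the sock_drop count; sample it twice a minute apart, since a rising
# value means the kernel is discarding datagrams before the plugin reads them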

The plugin reporting wildly more than SNMP indicates the doubling effect — see the next check.

2. Doubling sanity check

If your dashboard's total bandwidth exceeds the physical link capacity, you're double-counting. When both ingress and egress flow exporters are enabled on the same router — a common but not universal configuration; vendor best practice is ingress-only — each packet is recorded twice. With multiple monitored routers on the same path, each packet is counted even more times.

Verify by: filter to one exporter and one interface (Ingress Interface Name OR Egress Interface Name, pick one). Each packet then appears in exactly one record on that interface. Compare to SNMP for that same interface; they should agree within 5-15%.

3. Sampling rate verification

The plugin multiplies bytes and packets by the sampling rate each flow carries — per flow, at ingestion. You don't have to keep rates uniform across exporters; mixed rates are scaled correctly.

What you DO need to verify, once per exporter, is that the plugin is actually seeing the rate. The dashboard does not surface the per-flow SAMPLING_RATE field as a filter, group-by, or facet, so the verification is by magnitude, not by reading the rate directly:

  • After the plugin has data, compare the dashboard's bytes/s on a known interface to that interface's SNMP ifInOctets / ifOutOctets rate. If the dashboard reading is roughly the SNMP figure divided by the configured sampling rate (e.g. ~1/1000 of SNMP at 1-in-1000), the plugin saw the rate as 1 and is not multiplying. If the dashboard agrees with SNMP within 5-15%, the plugin is honouring the rate.
  • The "sampling but not telling" case happens with NetFlow v7 (no rate field), v5 with rate=0, and v9 / IPFIX exporters that don't send a Sampling Options Template. Fix on the exporter side, or apply enrichment.override_sampling_rate per source prefix to substitute a known rate (see Configuration → enrichment and the Static Metadata integration card).

4. Per-exporter health check

Per-exporter ingest counters are not published. To verify each exporter is sending:

  • Filter the dashboard to one exporter at a time. A healthy edge router during business hours should show non-zero traffic.
  • An exporter that abruptly drops to zero is offline. The plugin won't tell you — your monitoring practice has to.
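
To check a silent exporter independently of the plugin, confirm its datagrams are reaching the host at all. A minimal sketch, assuming the default port 2055 and a hypothetical exporter address of 192.0.2.1:

sudo tcpdump -c 10 -ni any 'udp port 2055 and src host 192.0.2.1'
# ten packets captured means the exporter is sending and the problem sits on
# the collector side; no output means the exporter or the path is at fault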

5. GeoIP / ASN database freshness

The plugin doesn't publish a "MMDB last loaded" signal. To verify your databases aren't stale:

ls -la /var/cache/netdata/topology-ip-intel/ /usr/share/netdata/topology-ip-intel/

Files older than ~60 days are likely stale. Refresh:

sudo /usr/sbin/topology-ip-intel-downloader

Packaged 32-bit installs do not include topology-ip-intel-downloader; use the packaged stock MMDB payload there, or refresh the cache from a system that includes the downloader.

The plugin polls the files every 30 seconds — a successful refresh picks up automatically without restart.
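
To make the mtime check cron-friendly, assuming the databases carry the usual .mmdb extension:

find /var/cache/netdata/topology-ip-intel/ /usr/share/netdata/topology-ip-intel/ \
     -name '*.mmdb' -mtime +60 -print 2>/dev/null
# any output is a database file older than 60 days, due for a refresh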

Separately, cross-check the on-disk size of each journal tier:

sudo du -sh /var/cache/netdata/flows/{raw,1m,5m,1h}

The raw tier dominates. If raw-tier disk usage is approaching the size_of_journal_files you set, plan retention vs. capacity (see Sizing and Capacity Planning).
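
To track that against the cap, a minimal sketch assuming a hypothetical 100 GiB size_of_journal_files:

USED=$(sudo du -sb /var/cache/netdata/flows/raw | cut -f1)
BUDGET=$((100 * 1024 ** 3))   # replace with your configured size_of_journal_files, in bytes
echo "raw tier at $(( USED * 100 / BUDGET ))% of budget"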

Plugin-side signals worth alerting on

These are charts the plugin already exposes. Tune the alert thresholds to your environment.

| Signal | Where | Suggested alert |
| --- | --- | --- |
| udp_received rate dropped | netflow.input_packets | Sustained 0 during business hours indicates the listener is up but no exporter is sending. |
| parse_errors rising | netflow.input_packets | Sustained > 5% of udp_received indicates the wire is producing malformed datagrams. |
| template_errors rising | netflow.input_packets | Sustained > 1% of udp_received after the warm-up period indicates an exporter sends data records before templates. |
| Memory growing (unaccounted) | netflow.memory_accounted_bytes | RSS climbs linearly without ingest growth; possible leak. |
| decoder_scopes unbounded growth | netflow.decoder_scopes | An exporter is rotating template IDs; investigate per-router behaviour. |
| Disk write errors | netflow.raw_journal_ops write_errors | Any non-zero value indicates filesystem trouble. |
| SNMP-flow gap | external | More than 30% on a steady-state link triggers the validation routine above. |
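
These counters can be wired into Netdata's own health engine. A sketch for the parse-error signal, assuming the dimension names shown above and using the suggested threshold (adjust both to your environment):

 template: netflow_parse_errors_pct
       on: netflow.input_packets
   lookup: average -5m unaligned percentage of parse_errors
    units: %
    every: 1m
     warn: $this > 5
     info: share of received NetFlow datagrams that failed to parse
       to: sysadmin

The percentage option computes parse_errors as a share of all dimensions on the chart, which approximates the "% of udp_received" rule as long as received packets dominate the sum.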

When to file a "data is wrong" investigation

Start an investigation when two independent signals disagree:

  • SNMP says 500 Mbps; flow data says 50 Mbps. Investigate sampling, kernel drops, exporter coverage.
  • Flow data shows a destination ASN you don't expect; threat-intelligence or DNS resolution disagrees. Investigate ASN MMDB staleness or anycast.
  • Last week's top talker disappeared this week. Investigate exporter health, routing changes, business-side changes.

For each, read the Anti-patterns page first — most "data is wrong" reports are expected behaviour misread.
