
Scalability: Monitoring at Any Scale

TL;DR

Netdata scales from a single node to 100,000+ nodes without architectural changes, maintaining 1-second granularity and sub-2-second latency at any scale. The distributed edge-native architecture ensures that adding more nodes doesn't degrade performance - each node operates independently while collaborating seamlessly.

The Problem with Centralization

Traditional observability assumes one thing: push all data to a central database, then query it for dashboards and alerts.

This works. Until it doesn't.

When scale breaks the model, teams face two options - and both are wrong:

Option 1: Reduce the Workload

Lower granularity. Drop cardinality. Filter. Sample.

This is a trap and a paradox. If you knew which data you'd need during a crisis, you could predict the crisis and prevent it. By definition, an unpredictable event is the one that will be invisible in your downsampled dataset. You're betting your incident response on being able to predict the unpredictable.

Option 2: Scale the Database

Build giant, expensive clusters. Add more cores. More RAM. More everything.

This is a money pit. In many organizations today, observability costs more than the services being monitored. We routinely encounter companies where 40-50% of infrastructure budget goes to monitoring pipelines, plus teams of specialists to keep them alive.

The Netdata Way: Process and Store at the Edge

Instead of centralizing data, distribute the code. This is the heart of Netdata's philosophy and design.

Every Netdata Agent is a full observability engine:

  • Collects metrics at the edge
  • Stores data locally in multi-tier storage
  • Runs ML-based anomaly detection in real-time
  • Runs health checks and triggers alerts
  • Serves dashboards and APIs independently

When you need high availability, persistent storage for ephemeral nodes, reduced load on production systems, or on-premises dashboards, Parents aggregate streams without becoming bottlenecks - because the heavy lifting already happened at the edge.

This distributed architecture delivers results that speak for themselves:

  • No loss of fidelity - every metric, every second, always visible
  • No blind spots - no sampling, no cherry-picking
  • No scaling tax - adding nodes adds observability, not exponential cost curves

Once you pass ~500 nodes, you’re naturally in the multi-million metrics/s range. What looks “heroic” elsewhere is simply normal operating conditions with Netdata.

Proof: The Numbers Don't Lie

Independent Validation: University of Amsterdam Study (2023)

Study: "An Empirical Evaluation of the Energy and Performance Overhead of Monitoring Tools on Docker-Based Systems"
Conference: ICSOC 2023 (International Conference on Service-Oriented Computing)
DOI: 10.1007/978-3-031-48421-6_13

Finding: Netdata was the most energy-efficient of the monitoring tools tested, with the lowest CPU and memory overhead - even while collecting data every second and running anomaly detection at the edge.

Head-to-Head: Netdata vs Prometheus (2025)

We tested a single Netdata Parent against a single Prometheus installation at 4.6 million metrics per second - the scale you hit with just 1,000 nodes. This is how the two systems compare:

| Metric | Netdata | Prometheus | Impact |
|---|---|---|---|
| CPU Usage | ~9.4 cores | ~14.8 cores | 36% less CPU |
| Memory Usage | ~47 GiB | ~383 GiB | 88% less RAM |
| Disk I/O | ~4.7 MiB/s writes | ~147 MiB/s total | 97% less I/O |
| Per-second retention (1 TiB) | ~1.25 days | ~2 hours | 15x longer - 40x retention in lower tiers |
| Sample completeness | ~100% | ~93.7% | Zero data loss |
| Query latency (2hr window) | ~0.11s | ~1.8s | 16x faster - 22x faster in long term queries |
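
The Impact column follows from simple ratios of the two measured columns. A minimal sketch that recomputes those ratios from the rounded values in the comparison above (the helper names are illustrative, not part of any Netdata tooling):

```python
def percent_less(netdata: float, prometheus: float) -> float:
    """Percentage by which the Netdata figure is lower than the Prometheus one."""
    return (1 - netdata / prometheus) * 100


def ratio(larger: float, smaller: float) -> float:
    """How many times `larger` exceeds `smaller`."""
    return larger / smaller


# Rounded values copied from the comparison above.
cpu = percent_less(9.4, 14.8)        # cores
ram = percent_less(47, 383)          # GiB
disk = percent_less(4.7, 147)        # MiB/s
retention = ratio(1.25 * 24, 2)      # hours of per-second data on 1 TiB
latency = ratio(1.8, 0.11)           # seconds for a 2-hour query window

print(f"CPU: {cpu:.1f}% less, RAM: {ram:.1f}% less, disk I/O: {disk:.1f}% less")
print(f"Retention: {retention:.1f}x longer, queries: {latency:.1f}x faster")
```

Because the inputs are rounded, the results match the published figures only to the nearest percent or whole multiple.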

Critical insight: This isn't exotic scale. Every Netdata deployment with >500 nodes runs at millions of metrics per second. Our users don't even notice - because the architecture absorbs it.

Architecture: Built for Planet Scale

Core Components

| Component | Role | Resources (Standalone) | Resources (Offloaded) | Scale Factor |
|---|---|---|---|---|
| Agent | Edge collector | <5% CPU, <200 MiB RAM, disk I/O | <2% CPU, <150 MiB RAM, zero disk | 3,000-20,000 metrics/sec |
| Parent | Workload distributor & aggregator | 10 cores, 40 GiB RAM per 1M metrics/sec | Same + ML training if enabled | Linear scaling |
| Cloud | Control plane & federation | Minimal | Minimal | Unlimited Parents |

Offloading: Agents can offload ML, alerting, dashboards, and retention to Parents - typically cutting agent CPU ≈50%, RAM ≈25%, and eliminating disk I/O entirely.

The Edge Advantage

Each Agent is autonomous:

  • Collects 3,000-20,000 metrics per second per node
  • Stores data in tiered storage (raw + aggregated)
  • Detects anomalies using unsupervised ML
  • Triggers alerts in real-time
  • Serves local dashboards and APIs
  • Streams to Parents for aggregation

This means:

  • No data loss if Parent is down (Agents buffer locally)
  • No performance degradation as you scale (work stays distributed)
  • No architectural changes from 1 to 100,000 nodes

The Parent Advantage: Intelligent Workload Distribution

Why Parents Should Be Your Default

Parents aren't just centralization points - they're intelligent workload distributors that can reduce the resource footprint on production systems.

With Parents, Agents can offload:

  • ML Training - Parents train models, Agents just collect (50% CPU reduction)
  • Health Checks - Parents run all alerts, Agents focus on collection
  • Persistent Storage - Agents run in RAM-only mode with zero disk I/O
  • Dashboard Serving - Parents handle all queries and visualizations

A fully offloaded Agent uses <2% CPU, <150 MiB RAM, and zero disk I/O - a fraction of a standalone Agent.

ML Intelligence: Train Where It Makes Sense

Netdata's ML models flow with the metrics stream, giving you complete flexibility:

Option 1: ML at the Edge (default)

  • Agents train their own models locally
  • Models stream to Parents along with metrics
  • Parents receive pre-computed ML results
  • Best for: Systems with available CPU, need for immediate local anomaly detection

Option 2: ML at Parents

  • Agents disable ML training (50% CPU savings)
  • First Parent trains models for all Agents
  • Models shared with other Parents in cluster
  • Best for: Resource-constrained production systems, centralized ML management

The architecture adapts to your needs - train where you have resources, use everywhere.

When You Need Parents

We recommend Parents by default:

  • Future-proof your architecture (same setup works at 10 or 100,000 nodes)
  • Reduce production system load even at small scale
  • Provide unified dashboards and centralized alerting
  • Enable high availability and disaster recovery
  • Cost less than the resources they save on production systems

Parents are essential when you have:

  • Ephemeral systems - Kubernetes pods, auto-scaling VMs that disappear
  • Resource constraints - Systems where every CPU cycle matters
  • On-premises requirements - Multi-node view without Cloud connectivity
  • Network restrictions - Agents can't reach Cloud due to firewalls/policies

Parent Sizing Guidelines

| Nodes per Parent | Metrics/sec | Resources |
|---|---|---|
| ~100 nodes | ~0.5M/sec | 5 cores, 20 GiB RAM |
| ~250 nodes | ~1M/sec | 10 cores, 40 GiB RAM |
| ~500 nodes | ~2M/sec | 20 cores, 80 GiB RAM |

Key principle: Scale horizontally with more Parents, not vertically with bigger Parents. Beyond 500 nodes per Parent, resource usage grows non-linearly.
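
The guideline above works out to roughly 10 cores and 40 GiB RAM per 1M metrics/sec, with about 500 nodes per Parent as the practical ceiling. A rough planning sketch under those assumptions follows. The function name, the default of 4,000 metrics/sec per node, and the even split of nodes across Parents are illustrative choices, not a Netdata tool:

```python
import math
from dataclasses import dataclass

CORES_PER_MILLION = 10       # sizing guideline: 10 cores per 1M metrics/sec
RAM_GIB_PER_MILLION = 40     # sizing guideline: 40 GiB RAM per 1M metrics/sec
MAX_NODES_PER_PARENT = 500   # beyond this, resource usage grows non-linearly


@dataclass
class ParentPlan:
    parents: int
    nodes_per_parent: int
    metrics_per_sec_per_parent: float
    cores_per_parent: float
    ram_gib_per_parent: float


def plan_parents(nodes: int, metrics_per_node: float = 4_000) -> ParentPlan:
    """Estimate how many Parents to deploy and how large each should be."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    # Scale out: add Parents instead of exceeding the per-Parent ceiling.
    parents = math.ceil(nodes / MAX_NODES_PER_PARENT)
    nodes_per_parent = math.ceil(nodes / parents)
    rate = nodes_per_parent * metrics_per_node
    millions = rate / 1_000_000
    return ParentPlan(
        parents=parents,
        nodes_per_parent=nodes_per_parent,
        metrics_per_sec_per_parent=rate,
        cores_per_parent=millions * CORES_PER_MILLION,
        ram_gib_per_parent=millions * RAM_GIB_PER_MILLION,
    )


print(plan_parents(1_200))
```

For 1,200 nodes this suggests three Parents of about 400 nodes each. The estimate ignores high availability, so clustered Parents would add to the count.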

Parent Placement Strategy

  • Keep Parents close to their Agents (same datacenter, region, or cloud zone)
  • Minimize network hops to reduce latency and bandwidth costs
  • Deploy per region in multi-region architectures
  • Use multiple Parents rather than one giant Parent

High Availability & Intelligent Clustering

Parents work together intelligently to eliminate duplicate work:

  • Active-active Parents with automatic work distribution
  • ML model sharing - First Parent trains, others receive models
  • Automatic failover - Agents reconnect to available Parents
  • Local buffering - Agents retain 1+ hour of data during Parent downtime
  • Streaming replication between Parents for complete redundancy
  • Federated queries across all Parents via Netdata Cloud

Key insight: clustering without double-spend. In an active-active cluster, the first Parent that sees a child trains the model and its peers reuse it, so you get HA without multiplying heavy work.
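
The "train once, reuse everywhere" idea can be illustrated with a toy registry. This is only a sketch of the principle. In Netdata the models travel with the metrics stream, whereas here a shared in-memory dictionary stands in for that, and every name is hypothetical:

```python
from typing import Callable, Dict, List


class ModelRegistry:
    """Toy stand-in for model sharing between clustered Parents."""

    def __init__(self) -> None:
        self._models: Dict[str, str] = {}      # child node -> trained model
        self.trained_by: Dict[str, str] = {}   # child node -> Parent that trained it

    def get_or_train(self, parent: str, child: str,
                     train: Callable[[str], str]) -> str:
        # Only the first Parent to see a child pays the training cost.
        if child not in self._models:
            self._models[child] = train(child)
            self.trained_by[child] = parent
        return self._models[child]


training_runs: List[str] = []


def train(child: str) -> str:
    """Placeholder for the expensive training step."""
    training_runs.append(child)
    return f"model-for-{child}"


registry = ModelRegistry()
registry.get_or_train("parent-a", "web-01", train)
registry.get_or_train("parent-b", "web-01", train)  # reuses parent-a's model

print(training_runs)
print(registry.trained_by)
```

Training runs once for `web-01` even though two Parents asked for its model.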

Alerts: Automation vs Monitoring

Netdata separates automation from monitoring, letting you optimize both:

Agents: Local Automation

  • Keep only alerts that trigger local scripts
  • Example: "If CPU > 90%, scale this service"
  • Immediate response, no network dependency
  • Minimal overhead when selective

Parents: Human Monitoring

  • Run comprehensive health checks for all Agents
  • Send notifications to teams via Cloud or integrations
  • Correlate issues across multiple systems
  • Rich context for troubleshooting

This dual approach means production systems only run automation-critical alerts while Parents handle the hundreds of monitoring alerts that humans need to see.
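
The split described above can be sketched as a simple partition: alerts that drive a local script stay on the Agent, and everything else runs on the Parent. The `Alert` class, its fields, and the script name are hypothetical, used only to show the idea. They are not Netdata's alert configuration format:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Alert:
    name: str
    local_script: Optional[str] = None  # hypothetical: script run when the alert fires


def split_alerts(alerts: List[Alert]) -> Tuple[List[Alert], List[Alert]]:
    """Return (agent_alerts, parent_alerts).

    Automation alerts (those with a local script) stay on the Agent.
    Monitoring alerts (notifications for humans) move to the Parent.
    """
    agent = [a for a in alerts if a.local_script]
    parent = [a for a in alerts if not a.local_script]
    return agent, parent


alerts = [
    Alert("cpu_above_90", local_script="scale-service.sh"),
    Alert("disk_fill_rate"),
    Alert("api_latency"),
]
agent_alerts, parent_alerts = split_alerts(alerts)

print("Agent: ", [a.name for a in agent_alerts])
print("Parent:", [a.name for a in parent_alerts])
```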

Storage: Efficient Multi-Tier Architecture

Three-Tier Storage System

| Tier | Resolution | Compression | Retention | Use Case |
|---|---|---|---|---|
| Tier 0 | Per-second | Minimal (0.6 bytes/sample) | Days to Weeks | Troubleshooting |
| Tier 1 | Per-minute | High | Weeks to Months | Trending |
| Tier 2 | Per-hour | Maximum | Months to Years | Capacity planning |

All tiers update in parallel - no post-processing or compaction jobs needed.
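
To make "all tiers update in parallel" concrete, here is a toy sketch in which each incoming sample feeds the per-second, per-minute, and per-hour tiers in a single pass, so no later job has to re-read old data. It assumes timestamps arrive in increasing order and keeps only an average per bucket for brevity. It is not Netdata's storage engine:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Bucket width in seconds for each tier: per-second, per-minute, per-hour.
TIER_SECONDS = {0: 1, 1: 60, 2: 3600}


class TieredStore:
    """Toy append-only store that fills all tiers from the same sample stream."""

    def __init__(self) -> None:
        # tier -> completed points as (bucket_start, average)
        self.points: Dict[int, List[Tuple[int, float]]] = defaultdict(list)
        # tier -> [bucket_start, running_sum, sample_count] of the open bucket
        self._open: Dict[int, list] = {}

    def append(self, ts: int, value: float) -> None:
        """Feed one per-second sample into every tier in a single pass."""
        for tier, width in TIER_SECONDS.items():
            bucket = ts - ts % width
            acc = self._open.get(tier)
            if acc is not None and acc[0] != bucket:
                # A new bucket started: write out the finished one, never rewrite it.
                self.points[tier].append((acc[0], acc[1] / acc[2]))
                acc = None
            if acc is None:
                acc = self._open[tier] = [bucket, 0.0, 0]
            acc[1] += value
            acc[2] += 1


store = TieredStore()
for ts in range(7201):  # two hours of samples, plus one to close the last buckets
    store.append(ts, float(ts % 60))

print(len(store.points[0]), len(store.points[1]), len(store.points[2]))
```

Two hours of input yield 7,200 per-second points, 120 per-minute points, and 2 per-hour points, all produced as the data arrives.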

Storage Efficiency

  • 0.6 bytes per sample - industry's most efficient
  • Gorilla + ZSTD compression for optimal size/speed
  • WORM design - append-only, no expensive compaction
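
The 0.6 bytes/sample figure allows a back-of-the-envelope estimate of tier-0 disk usage. The sketch below is an idealized calculation with illustrative function names. Real retention also depends on how disk space is divided between tiers and on how well your data compresses:

```python
BYTES_PER_SAMPLE_TIER0 = 0.6   # figure quoted above for per-second data
SECONDS_PER_DAY = 86_400
MIB = 1024 ** 2
GIB = 1024 ** 3


def tier0_bytes_per_day(metrics_per_sec: float) -> float:
    """Approximate tier-0 disk consumed per day at a given collection rate."""
    return metrics_per_sec * SECONDS_PER_DAY * BYTES_PER_SAMPLE_TIER0


def tier0_retention_days(disk_bytes: float, metrics_per_sec: float) -> float:
    """Approximate days of per-second data that fit in `disk_bytes`."""
    return disk_bytes / tier0_bytes_per_day(metrics_per_sec)


# Example: a single Agent collecting 5,000 metrics per second.
per_day = tier0_bytes_per_day(5_000)
print(f"{per_day / MIB:.0f} MiB per day")
print(f"{tier0_retention_days(10 * GIB, 5_000):.0f} days in 10 GiB")
```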

Why This Architecture Wins

For Operations Teams

  • No blind spots during incidents - all data available
  • No architectural rewrites as you scale
  • No sampling lottery - the metric you need is always there
  • No specialized skills required - it just works

For Finance

  • Predictable costs - linear scaling, no surprises
  • Lower TCO - fewer resources for same visibility
  • Energy efficient - independently validated lowest overhead
  • Reduced team size - less complexity to manage

For Developers

  • Per-second granularity - see what actually happened
  • Real-time anomaly detection - catch issues immediately
  • Local dashboards - debug without central bottlenecks
  • Full cardinality - every dimension tracked

The Bottom Line

Through intelligent workload distribution between Parents and Agents:

  • ML trains where you have resources (edge or Parents, your choice)
  • Alerts run where they matter (automation locally, monitoring centrally)
  • Storage happens where it's cheap (Parents, not production)
  • Millions of metrics per second is normal (not heroic)
  • HA doesn't multiply overhead (intelligent clustering)

This isn't just optimization. It's a fundamentally different architecture that recognizes observability shouldn't compete with your applications for resources.

Welcome to observability that makes your infrastructure better, not heavier.

FAQ

Q: How many nodes can a single Netdata Parent handle?
A: We recommend running Parents with up to 500 Agents (1.5M metrics/s). We have customers running larger Parents, but resources increase and performance decreases non-linearly.

Q: What happens if a Parent goes down?
A: If the Parent was clustered, Agents connect to another Parent in the cluster and replicate to it any metrics collected during the transition. If there is no other Parent to connect to, Agents keep collecting and storing data locally, and replicate it to the Parent once it becomes available again. Replication of past metrics uses only tier 0 (high-resolution data), so Agents must have enough tier-0 retention to cover the outage; otherwise the charts will show gaps.
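
The retention caveat in this answer can be made concrete with a small calculation. Given an outage and an Agent's tier-0 retention, the sketch below estimates which part of the outage can still be replicated and how long a gap would remain. The function and its parameters are illustrative, with timestamps in Unix seconds:

```python
from typing import Tuple


def replay_window(outage_start: int, outage_end: int,
                  tier0_retention_s: int) -> Tuple[int, int, int]:
    """Return (replay_from, replay_to, gap_seconds) for a Parent outage.

    Data older than the Agent's tier-0 retention at reconnect time can no
    longer be replicated and shows up as a gap on the Parent.
    """
    if outage_end < outage_start:
        raise ValueError("outage_end must not precede outage_start")
    oldest_available = outage_end - tier0_retention_s
    replay_from = max(outage_start, oldest_available)
    gap_seconds = replay_from - outage_start
    return replay_from, outage_end, gap_seconds


HOUR = 3600

# 30-minute outage with 1 hour of tier-0 retention: everything is replayed.
print(replay_window(0, HOUR // 2, HOUR))

# 3-hour outage with 1 hour of tier-0 retention: the first 2 hours are lost.
print(replay_window(0, 3 * HOUR, HOUR))
```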

Q: Do I always need Parents?
A: No. Agents alone may be enough. Parents are usually required when you have ephemeral nodes.

Q: How much overhead does Netdata introduce on my systems?
A: Less than 5% CPU and ~200 MiB RAM per Agent in standalone mode. Offloaded Agents (streaming to a Parent) drop to <2% CPU and ~150 MiB RAM with zero disk I/O. Netdata is designed to be a "polite citizen" alongside production workloads: it spreads its work over time and avoids sudden, intense spikes.

Q: How efficient is Netdata’s storage?
A: Tier 0 (per-second) is ~0.6 bytes/sample - the industry’s most efficient. Tiers 1 & 2 keep per-minute and per-hour aggregates, letting you retain months or years of history cheaply.

Q: How do I deploy Parents in multi-region or multi-cloud setups?
A: Place Parents close to the Agents they serve (same DC/region/AZ). Deploy multiple Parents per region for HA. Use Netdata Cloud to unify dashboards and queries across Parents.

Q: What’s the difference between monitoring and automation alerts?
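A: Monitoring alerts notify people; automation alerts trigger scripts. Because Netdata evaluates alerts at the edge, you can attach a script to an alert and have it run when the alert triggers, e.g. "restart service if API endpoint is not responding". Keep these automation alerts on the Agents, and let Parents run the broader set of monitoring alerts that send notifications to your team.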

Q: Is Netdata really energy-efficient?
A: Yes. A peer-reviewed 2023 study (ICSOC, University of Amsterdam) found Netdata to be the most energy-efficient tool among the ones tested, with the lowest CPU and RAM overhead even at 1-second collection.

Q: Is a single installation with 100,000+ nodes real?
A: Yes. Netdata Cloud SaaS (our commercial service) is itself a single installation that serves well over 100k reachable nodes.

Q: Do you promote per-second collection and unlimited metrics because your revenue depends on volume?
A: No. Our commercial offerings are priced per node, with volume discounts (smaller price as the number of nodes increases). Our revenue is not related to the number of metrics or the volume of observability data collected or viewed. We designed Netdata for maximum performance at scale and volume for your benefit. Not ours.

Q: If I have multiple Parents, how does Netdata Cloud provide unified dashboards?
A: Think of Netdata Cloud as the headend of a distributed database. Each Netdata Parent and Agent dynamically becomes part of that database. So, Netdata Cloud queries them all in parallel, to provide the unified view required.

Q: Is querying 100 remote systems in parallel slower than querying a bigger one locally?
A: There is some extra network latency involved, but it is usually small (a few ms) because very little data is transferred: your web browser receives 500-1000 points at most, even if the query covers 10 days of per-second data. Meanwhile, the aggregate horsepower and parallelism of 100 fully independent systems is orders of magnitude greater than that of any single local system, so in practice the queries are considerably faster.
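
The fan-out pattern described in the last two answers can be sketched as follows: every node is queried in parallel, and each result is reduced to a few hundred points before it is returned. Threads and plain callables stand in for network calls to Agents and Parents. The names are hypothetical and this is not how Netdata Cloud is implemented:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Series = List[float]


def downsample(series: Series, max_points: int) -> Series:
    """Average consecutive samples so that at most `max_points` remain."""
    if len(series) <= max_points:
        return list(series)
    step = -(-len(series) // max_points)  # ceiling division
    chunks = (series[i:i + step] for i in range(0, len(series), step))
    return [sum(chunk) / len(chunk) for chunk in chunks]


def fan_out(nodes: Dict[str, Callable[[], Series]],
            max_points: int = 500) -> Dict[str, Series]:
    """Query all nodes in parallel and return a small series for each."""

    def query(fetch: Callable[[], Series]) -> Series:
        # Stands in for the node reducing its own data before replying.
        return downsample(fetch(), max_points)

    workers = min(32, len(nodes)) or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(query, fetch) for name, fetch in nodes.items()}
        return {name: future.result() for name, future in futures.items()}


# Three simulated nodes, each holding one day of per-second samples.
nodes = {f"node-{i}": (lambda i=i: [float(i)] * 86_400) for i in range(3)}
result = fan_out(nodes)

print({name: len(series) for name, series in result.items()})
```

Each node's day of per-second data comes back as at most 500 points, which is why the network cost of a distributed query stays small.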


Based on real production deployments, independent research (University of Amsterdam, ICSOC 2023), and comparative testing (2025). All metrics and resource usage figures represent typical production scenarios.

