
Netdata Cloud On-Prem Troubleshooting

Netdata Cloud On-Prem is an enterprise-grade monitoring solution that relies on several infrastructure components:

  • Databases: PostgreSQL, Redis, Elasticsearch
  • Message Brokers: Pulsar, EMQX
  • Traffic Controllers: Ingress, Traefik
  • Kubernetes Cluster

These components should be monitored and managed according to your organization's established practices and requirements.

Common Issues

Slow Chart Loading or Chart Errors

When charts take a long time to load or fail with errors, the cause is usually slow or failed data collection. The charts service gathers data from multiple Agents within a Room, and a query succeeds only when every queried Agent responds successfully.
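The effect of a single slow Agent can be illustrated with a minimal sketch. This is not the charts service's actual implementation: `query_agent`, `query_room`, `load_chart`, the Agent names, and the latencies are all hypothetical, and the Agent query is simulated with a sleep. It only shows that when every Agent must answer, the slowest one determines whether the whole query finishes in time.

```python
import asyncio


async def query_agent(name: str, latency: float) -> str:
    # Hypothetical stand-in for a real Agent query; the sleep simulates response time.
    await asyncio.sleep(latency)
    return f"{name}: ok"


async def query_room(agents: dict[str, float], timeout: float) -> list[str]:
    # Fan out to every Agent and require all of them to answer within the timeout.
    tasks = [query_agent(name, latency) for name, latency in agents.items()]
    return await asyncio.wait_for(asyncio.gather(*tasks), timeout=timeout)


def load_chart(agents: dict[str, float], timeout: float):
    # Returns the list of responses, or None if any Agent missed the deadline.
    try:
        return asyncio.run(query_room(agents, timeout))
    except asyncio.TimeoutError:
        return None


if __name__ == "__main__":
    # All Agents are fast: the query completes.
    print(load_chart({"agent-a": 0.01, "agent-b": 0.02}, timeout=0.5))
    # One slow Agent causes the whole query to time out.
    print(load_chart({"agent-a": 0.01, "agent-slow": 1.0}, timeout=0.1))
```

In this model, adding more fast Agents does not help; only replacing or bypassing the slow backend does, which is why reliable Parent nodes improve query behavior.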

| Issue | Symptoms | Cause | Solution |
|---|---|---|---|
| Agent Connectivity | Queries stall or time out; inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional Parent nodes to provide reliable backends. The system automatically prefers these for queries when available |
| Kubernetes Resources | Service throttling; slow data processing; delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | Slow query responses; increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization: CPU usage, memory allocation, disk I/O performance |
| Message Broker | Delayed node status updates (online/offline/stale); slow alert transitions; dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | Review Pulsar configuration; adjust microservice resource allocation; monitor message processing rates |
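
For the Message Broker case, one way to spot message accumulation is to look at per-subscription backlog. Pulsar's topic statistics (for example, the output of `pulsar-admin topics stats`) report a `msgBacklog` count for each subscription. The sketch below assumes you have already fetched that JSON for a topic; the function name, the threshold, and the sample data (including the subscription names) are hypothetical and are not taken from a Netdata Cloud On-Prem installation.

```python
def backlogged_subscriptions(topic_stats: dict, threshold: int = 10_000) -> dict[str, int]:
    # Return the subscriptions whose backlog exceeds the (hypothetical) threshold.
    subscriptions = topic_stats.get("subscriptions", {})
    return {
        name: sub.get("msgBacklog", 0)
        for name, sub in subscriptions.items()
        if sub.get("msgBacklog", 0) > threshold
    }


# Made-up sample shaped like Pulsar topic stats; real output has many more fields.
sample_stats = {
    "subscriptions": {
        "example-consumer-a": {"msgBacklog": 25_000},
        "example-consumer-b": {"msgBacklog": 12},
    }
}

if __name__ == "__main__":
    print(backlogged_subscriptions(sample_stats))
```

A backlog that keeps growing for one subscription suggests that the consuming microservice is the bottleneck, which points toward adjusting its resource allocation rather than the broker itself.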
