# Netdata Cloud On-Prem Troubleshooting
Netdata Cloud On-Prem is an enterprise-grade monitoring solution that relies on several infrastructure components:
- Databases: PostgreSQL, Redis, Elasticsearch
- Message Brokers: Pulsar, EMQX
- Traffic Controllers: Ingress, Traefik
- Kubernetes Cluster
These components should be monitored and managed according to your organization's established practices and requirements.
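If your installation uses the default `netdata-cloud` namespace (assumed throughout this guide), a quick way to see the state of all of these components at once is to list their pods:

```bash
# List every On-Prem component pod, its status, and the node it runs on
kubectl get pods -n netdata-cloud -o wide
```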
## Common Issues

### Timeout During Installation
If your installation fails with this error:
```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```
This error typically indicates insufficient cluster resources. Here's how to diagnose and resolve the issue.
#### Diagnosis Steps
> **Important**
>
> - For full installation: ensure you're in the correct cluster context.
> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured.
> - For Light PoC: always perform a complete uninstallation before attempting a new installation.
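For a full installation, one way to confirm that `kubectl` is pointing at the intended cluster before running any of the commands below (the context name is specific to your environment):

```bash
# Show the context kubectl is currently using
kubectl config current-context

# Switch context if needed (replace with your cluster's context name)
kubectl config use-context <YOUR_CLUSTER_CONTEXT>
```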
Check for pods stuck in Pending state:
```bash
kubectl get pods -n netdata-cloud | grep -v Running
```
If you find Pending pods, examine the resource constraints:
```bash
kubectl describe pod <POD_NAME> -n netdata-cloud
```
Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues
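If several pods are affected, you can also pull the scheduling events for the whole namespace rather than describing pods one by one:

```bash
# List scheduling failures across the installation namespace, newest last
kubectl get events -n netdata-cloud --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
```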
View overall cluster resources:
```bash
# Check resource allocation across nodes
kubectl top nodes

# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```
#### Solution

- Compare your available resources against the minimum requirements.
- Take one of these actions:
  - Add more resources to your cluster.
  - Free up existing resources (see the example below).
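As an illustration of the second option, you can look for workloads outside the `netdata-cloud` namespace that hold large CPU or memory requests and scale them down temporarily. The deployment and namespace names below are placeholders:

```bash
# Requires metrics-server: show the heaviest pods across the cluster
kubectl top pods --all-namespaces --sort-by=memory

# Scale down a non-essential workload to release its resource requests
kubectl scale deployment <NON_ESSENTIAL_DEPLOYMENT> -n <ITS_NAMESPACE> --replicas=0
```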
### Login Issues After Installation

Installation may complete successfully, but login issues can still occur due to configuration mismatches. The table below provides a quick reference for troubleshooting the most common ones.
| Issue | Symptoms | Cause | Solution |
|---|---|---|---|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br>- Expired/invalid SSO tokens<br>- Untrusted certificates<br>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br>- Verify certificates are valid and trusted<br>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br>- "Invalid token" errors | - Incorrect hostname during installation<br>- Modified default MailCatcher values | - Reinstall with correct FQDN<br>- Restore default MailCatcher settings<br>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br>- Verify network allows SMTP traffic<br>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br>- Database hash mismatch<br>- Namespace change without secret migration | - Migrate secret before namespace change<br>- Perform fresh installation<br>- Contact support for data recovery |
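For the Custom Mail Server case, one way to verify that SMTP traffic is allowed from inside the cluster is a short-lived debug pod. The image shown here is one common network troubleshooting image, and the host and port are placeholders for your mail server:

```bash
# Test TCP connectivity to the SMTP server from inside the cluster
kubectl run smtp-test --rm -it --restart=Never --image=nicolaka/netshoot -- nc -vz <SMTP_HOST> 587
```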
> **Warning**
>
> If you change the installation namespace, the `netdata-cloud-common` secret will be recreated. Before proceeding, back up the existing `netdata-cloud-common` secret, or alternatively wipe the PostgreSQL database to prevent data conflicts.
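A minimal way to back up the secret before changing the namespace, assuming the current installation lives in `netdata-cloud`:

```bash
# Save the existing secret to a local file so it can be restored or migrated later
kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.backup.yaml
```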
### Slow Chart Loading or Chart Errors
When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The charts
service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.
| Issue | Symptoms | Cause | Solution |
|---|---|---|---|
| Agent Connectivity | - Queries stall or time out<br>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional Parent nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br>- Slow data processing<br>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br>- CPU usage<br>- Memory allocation<br>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br>- Slow alert transitions<br>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br>- Adjust microservice resource allocation<br>- Monitor message processing rates |
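For the Kubernetes Resources, Database Performance, and Message Broker rows, a quick first check is per-pod consumption in the installation namespace (requires metrics-server), followed by the recent state of any suspicious container:

```bash
# Show current CPU and memory usage per pod, heaviest CPU consumers first
kubectl top pods -n netdata-cloud --sort-by=cpu

# Inspect a suspect pod for restarts or OOM kills caused by tight limits
kubectl describe pod <POD_NAME> -n netdata-cloud | grep -A 5 "Last State"
```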
## Need Help?

If issues persist, gather the following information:

- Installation logs
- Your cluster specifications
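One way to collect both items in one pass, assuming the default `netdata-cloud` namespace (adjust paths and tail length as needed):

```bash
# Collect recent logs from every pod in the installation namespace
for p in $(kubectl get pods -n netdata-cloud -o name); do
  echo "===== $p" >> onprem-install-logs.txt
  kubectl logs -n netdata-cloud "$p" --all-containers --tail=500 >> onprem-install-logs.txt
done

# Record cluster version and node specifications
kubectl version > cluster-specs.txt
kubectl get nodes -o wide >> cluster-specs.txt
kubectl describe nodes >> cluster-specs.txt
```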
Then contact support at [email protected].