Netdata Cloud On-Prem Troubleshooting

Netdata Cloud On-Prem is an enterprise-grade monitoring solution that relies on several infrastructure components:

  • Databases: PostgreSQL, Redis, Elasticsearch
  • Message Brokers: Pulsar, EMQX
  • Traffic Controllers: Ingress, Traefik
  • Kubernetes Cluster

These components should be monitored and managed according to your organization's established practices and requirements.
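
As a quick health check, you can confirm that the pods backing these components are running. This is a minimal sketch that assumes everything is deployed in the netdata-cloud namespace used elsewhere in this guide; adjust the namespace to match your installation:

    # List all pods and their status in the installation namespace
    kubectl get pods -n netdata-cloud -o wide

    # Surface recent warning events, which often point at failing components
    kubectl get events -n netdata-cloud --field-selector type=Warning --sort-by=.lastTimestamp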

Common Issues

Timeout During Installation

If your installation fails with this error:

Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.

This error typically indicates insufficient cluster resources. Here's how to diagnose and resolve the issue.

Diagnosis Steps

Important

  • For full installation: Ensure you're in the correct cluster context.
  • For Light PoC: SSH into the Ubuntu VM with kubectl pre-configured.
  • For Light PoC, always perform a complete uninstallation before attempting a new installation.
  1. Check for pods stuck in Pending state:

    kubectl get pods -n netdata-cloud | grep -v Running
  2. If you find Pending pods, examine the resource constraints:

    kubectl describe pod <POD_NAME> -n netdata-cloud

    Review the Events section at the bottom of the output. Look for messages about:

    • Insufficient CPU
    • Insufficient Memory
    • Node capacity issues
  3. View overall cluster resources:

    # Check resource allocation across nodes
    kubectl top nodes

    # View detailed node capacity
    kubectl describe nodes | grep -A 5 "Allocated resources"

Solution

  1. Compare your available resources against the minimum requirements.
  2. Take one of these actions:
    • Add more resources to your cluster.
    • Free up existing resources (see the sketch below for finding the heaviest consumers).
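
If you are not sure what to free, sort current usage to find the heaviest consumers. This is a sketch that assumes the Kubernetes metrics server is installed, which kubectl top requires:

    # Largest memory and CPU consumers across all namespaces
    kubectl top pods --all-namespaces --sort-by=memory | head -n 20
    kubectl top pods --all-namespaces --sort-by=cpu | head -n 20

    # Requested resources versus capacity per node
    kubectl describe nodes | grep -A 5 "Allocated resources"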

Login Issues After Installation

Installation may complete successfully, but login issues can still occur due to configuration mismatches. The entries below provide a quick reference for troubleshooting the most common cases.

SSO Login Failure

  • Symptoms: Unable to authenticate via SSO providers.
  • Causes: Invalid callback URLs, expired or invalid SSO tokens, untrusted certificates, or an incorrect FQDN in global.public.
  • Solutions: Update the SSO configuration in values.yaml, verify that certificates are valid and trusted, and ensure the FQDN matches the certificate.

MailCatcher Login (Light PoC)

  • Symptoms: Magic links not arriving, "Invalid token" errors.
  • Causes: Incorrect hostname provided during installation, or modified default MailCatcher values.
  • Solutions: Reinstall with the correct FQDN, restore the default MailCatcher settings, and ensure the hostname matches the certificate.

Custom Mail Server Login

  • Symptoms: Magic links not arriving.
  • Causes: Incorrect SMTP configuration, or network connectivity issues.
  • Solutions: Update the SMTP settings in values.yaml, verify that the network allows SMTP traffic, and check the mail server logs.

Invalid Token Error

  • Symptoms: "Something went wrong - invalid token" message.
  • Causes: Mismatched netdata-cloud-common secret, database hash mismatch, or a namespace change without secret migration.
  • Solutions: Migrate the secret before changing the namespace, perform a fresh installation, or contact support for data recovery.
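
For the mail-related issues above, it often helps to confirm that the cluster can actually reach your SMTP server before digging into application configuration. The following is a minimal sketch using a temporary pod; the image choice, host mail.example.com, and port 587 are placeholders for your own environment:

    # Test TCP connectivity to the SMTP server from inside the cluster
    kubectl run smtp-test --rm -it --restart=Never --image=alpine -- nc -zv -w 5 mail.example.com 587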

Warning

If you're modifying the installation namespace, the netdata-cloud-common secret will be recreated.

Before proceeding: Back up the existing netdata-cloud-common secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts.
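
One way to back up the secret before changing the namespace, assuming the current installation namespace is netdata-cloud (adjust if yours differs):

    # Save the existing secret to a local file
    kubectl get secret netdata-cloud-common -n netdata-cloud -o yaml > netdata-cloud-common.backup.yaml

    # After updating the namespace field in the backup file, recreate the secret in the new namespace
    # kubectl apply -f netdata-cloud-common.backup.yaml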

Slow Chart Loading or Chart Errors

When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The charts service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.

Agent Connectivity

  • Symptoms: Queries stall or time out, inconsistent chart loading.
  • Cause: Slow Agents or unreliable network connections prevent timely data collection.
  • Solution: Deploy additional Parent nodes to provide reliable backends. The system will automatically prefer these for queries when available.

Kubernetes Resources

  • Symptoms: Service throttling, slow data processing, delayed dashboard updates.
  • Cause: Resource saturation at the node level or restrictive container limits.
  • Solution: Review and adjust container resource limits and node capacity as needed.

Database Performance

  • Symptoms: Slow query responses, increased latency across services.
  • Cause: PostgreSQL performance bottlenecks.
  • Solution: Monitor and optimize database resource utilization: CPU usage, memory allocation, and disk I/O performance.

Message Broker

  • Symptoms: Delayed node status updates (online/offline/stale), slow alert transitions, dashboard update delays.
  • Cause: Message accumulation in Pulsar due to processing bottlenecks.
  • Solution: Review the Pulsar configuration, adjust microservice resource allocation, and monitor message processing rates.
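
To narrow down which of the areas above is the bottleneck, start by looking at resource usage and the state of the backing services. The commands below are a sketch; the grep pattern and the commented Pulsar command use placeholder pod and topic names that will differ per installation:

    # Resource pressure on the application pods
    kubectl top pods -n netdata-cloud --sort-by=cpu

    # Restarts or pending status on the database and message broker pods
    kubectl get pods -n netdata-cloud -o wide | grep -Ei 'postgres|pulsar'

    # Example: check topic backlog from inside a Pulsar broker pod (names are placeholders)
    # kubectl exec -it <PULSAR_BROKER_POD> -n netdata-cloud -- bin/pulsar-admin topics stats <TOPIC>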

Need Help?

If issues persist:

  1. Gather the following information (the sketch after this list shows commands that can help):

    • Installation logs
    • Your cluster specifications
  2. Contact support at [email protected].
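
A sketch of commands that can help collect this information; the release name netdata-cloud-onprem and the netdata-cloud namespace are assumptions based on the examples in this guide:

    # Helm release status
    helm list -n netdata-cloud
    helm status netdata-cloud-onprem -n netdata-cloud

    # Pod status, recent events, and logs from any failing pod
    kubectl get pods -n netdata-cloud -o wide > pods.txt
    kubectl get events -n netdata-cloud --sort-by=.lastTimestamp > events.txt
    kubectl logs <FAILING_POD_NAME> -n netdata-cloud > pod-logs.txt

    # Cluster specifications
    kubectl get nodes -o wide > nodes.txt
    kubectl describe nodes > node-details.txt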

