Troubleshooting Guide

This guide helps you diagnose and resolve common issues with the PipeOps Kubernetes Agent.

Quick Diagnostic Steps

When experiencing issues, follow these steps first:

# 1. Check agent status
kubectl get pods -n pipeops-system

# 2. View recent logs
kubectl logs deployment/pipeops-agent -n pipeops-system --tail=50

# 3. Check events
kubectl get events -n pipeops-system --sort-by='.lastTimestamp'

# 4. Describe the pod
kubectl describe pod -n pipeops-system -l app=pipeops-agent

# 5. Check resource usage
kubectl top pod -n pipeops-system
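
To capture all five checks at once, they can be wrapped in a small script; a minimal sketch (the output file name is arbitrary):

#!/bin/bash
# quick-check.sh - run the quick diagnostics and save everything to one file
NS=pipeops-system
OUT=quick-check-$(date +%Y%m%d-%H%M%S).txt
{
  echo "== pods ==";     kubectl get pods -n "$NS"
  echo "== logs ==";     kubectl logs deployment/pipeops-agent -n "$NS" --tail=50
  echo "== events ==";   kubectl get events -n "$NS" --sort-by='.lastTimestamp'
  echo "== describe =="; kubectl describe pod -n "$NS" -l app=pipeops-agent
  echo "== top ==";      kubectl top pod -n "$NS"
} > "$OUT" 2>&1
echo "Diagnostics written to $OUT"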

Installation Issues

Issue: Agent Pod Not Starting

Symptoms:

  • Pod stuck in Pending, CrashLoopBackOff, or ImagePullBackOff
  • Installation script fails

Diagnosis:

# Check pod status
kubectl get pods -n pipeops-system

# View detailed pod information
kubectl describe pod -n pipeops-system -l app=pipeops-agent

# Check logs if pod started
kubectl logs -n pipeops-system -l app=pipeops-agent

Solutions:

For ImagePullBackOff:

# Check if image exists and is accessible
docker pull ghcr.io/pipeopshq/pipeops-k8-agent:latest

# Verify image pull secrets if using private registry
kubectl get secrets -n pipeops-system

# Check for image pull errors
kubectl describe pod -n pipeops-system -l app=pipeops-agent | grep -A 5 "Events"
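
If the image is pulled from a private registry, the missing piece is often a pull secret; a sketch of creating one and attaching it to the agent's service account (the secret name and credentials below are placeholders):

# Create a registry pull secret (replace the placeholder values)
kubectl create secret docker-registry pipeops-regcred \
  --docker-server=ghcr.io \
  --docker-username=<your-username> \
  --docker-password=<your-token> \
  --namespace pipeops-system

# Attach it to the agent's service account so new pods can use it
kubectl patch serviceaccount pipeops-agent -n pipeops-system \
  -p '{"imagePullSecrets": [{"name": "pipeops-regcred"}]}'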

For CrashLoopBackOff:

# View crash logs
kubectl logs -n pipeops-system -l app=pipeops-agent --previous

# Common causes:
# 1. Invalid configuration
# 2. Missing required environment variables
# 3. Insufficient permissions

# Check configuration
kubectl get configmap pipeops-agent-config -n pipeops-system -o yaml
kubectl get secret pipeops-agent-config -n pipeops-system -o yaml

For Pending:

# Check if there are sufficient node resources
kubectl describe pod -n pipeops-system -l app=pipeops-agent | grep -A 10 "Events"

# Common causes:
# 1. No nodes available
# 2. Insufficient CPU/memory
# 3. Node selector mismatch
# 4. Taints/tolerations

# Check node resources
kubectl top nodes
kubectl describe nodes
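
Taints are a frequent cause of Pending pods; they can be listed per node and, where appropriate, removed (the node and taint names below are placeholders):

# List taints on every node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Either remove a blocking taint...
kubectl taint nodes <node-name> <taint-key>-
# ...or add a matching toleration to the agent's pod spec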

Issue: Invalid or Missing API Token

Symptoms:

  • Agent logs show authentication errors
  • "Unauthorized" or "Forbidden" errors

Diagnosis:

# Check if secret exists
kubectl get secret pipeops-agent-config -n pipeops-system

# View secret (base64 encoded)
kubectl get secret pipeops-agent-config -n pipeops-system -o jsonpath='{.data.token}' | base64 -d

Solutions:

# Create or update secret with correct token
kubectl create secret generic pipeops-agent-config \
  --from-literal=token=your-correct-api-token \
  --namespace pipeops-system \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart agent to apply changes
kubectl rollout restart deployment/pipeops-agent -n pipeops-system

# Verify connection
kubectl logs -n pipeops-system deployment/pipeops-agent | grep -i "connected\|authenticated"

Issue: Kubernetes Cluster Not Created

Symptoms:

  • Installer fails to create cluster
  • k3s, minikube, or other distribution installation fails

Diagnosis:

# Check installer logs
sudo cat /var/log/pipeops/install.log

# Check if Kubernetes is running
kubectl get nodes
systemctl status k3s # or minikube status

Solutions:

For k3s:

# Check k3s logs
sudo journalctl -u k3s -n 100

# Common issues:
# - Port 6443 already in use
# - Insufficient resources
# - SELinux/AppArmor conflicts

# Manual k3s installation
curl -sfL https://get.k3s.io | sh -

# Verify installation
sudo k3s kubectl get nodes
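
If port 6443 may already be in use, that can be confirmed before reinstalling:

# Check whether another process is bound to the API server port
sudo ss -tlnp | grep 6443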

For minikube:

# Check minikube status
minikube status

# View logs
minikube logs

# Delete and recreate
minikube delete
minikube start --cpus 2 --memory 4096

Connection Issues

Issue: Cannot Connect to PipeOps API

Symptoms:

  • Logs show "connection refused" or "timeout" errors
  • Agent status shows disconnected

Diagnosis:

# Test connectivity from pod
kubectl exec -n pipeops-system deployment/pipeops-agent -- \
  curl -v https://api.pipeops.sh/health

# Check DNS resolution
kubectl exec -n pipeops-system deployment/pipeops-agent -- \
  nslookup api.pipeops.sh

# View agent logs
kubectl logs -n pipeops-system deployment/pipeops-agent | grep -i "error\|fail\|connection"
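
If the agent image is too minimal to include curl or nslookup, the same checks can run from a disposable pod; a sketch using the public curlimages/curl image:

# Run the connectivity test from a temporary pod in the same namespace
kubectl run netcheck --rm -it --restart=Never \
  --image=curlimages/curl -n pipeops-system \
  --command -- curl -v https://api.pipeops.sh/health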

Solutions:

Check Network Policies:

# List network policies
kubectl get networkpolicies -n pipeops-system

# If blocking outbound, create egress rule
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pipeops-api
  namespace: pipeops-system
spec:
  podSelector:
    matchLabels:
      app: pipeops-agent
  policyTypes:
    - Egress
  egress:
    # Allow outbound HTTPS to any destination (the PipeOps API is external)
    - ports:
        - protocol: TCP
          port: 443
    # Allow DNS lookups inside the cluster
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
EOF

Check Proxy Configuration:

# If behind corporate proxy, configure proxy
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set agent.proxy.http="http://proxy.company.com:8080" \
  --set agent.proxy.https="http://proxy.company.com:8080" \
  --set agent.proxy.no_proxy="localhost,127.0.0.1,.cluster.local" \
  --namespace pipeops-system \
  --reuse-values

Firewall Rules:

# Ensure outbound HTTPS (443) is allowed
# Check with cloud provider firewall/security groups

# Test from node
curl -v https://api.pipeops.sh/health

Issue: Tunnel Connection Failures

Symptoms:

  • Cannot access cluster through PipeOps dashboard
  • Tunnel status shows "disconnected"

Diagnosis:

# Check tunnel logs
kubectl logs -n pipeops-system deployment/pipeops-agent | grep -i tunnel

# Check tunnel configuration
kubectl get configmap pipeops-agent-config -n pipeops-system -o yaml | grep -A 10 tunnel

Solutions:

# Verify tunnel is enabled
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set tunnel.enabled=true \
  --namespace pipeops-system \
  --reuse-values

# Check for port conflicts
netstat -tlnp | grep -E '6443|10250|8080'

# Restart agent
kubectl rollout restart deployment/pipeops-agent -n pipeops-system

Resource Issues

Issue: High Memory Usage

Symptoms:

  • Pod evicted due to memory
  • OOMKilled status
  • Slow performance

Diagnosis:

# Check current memory usage
kubectl top pod -n pipeops-system

# View memory limits
kubectl get pod -n pipeops-system -l app=pipeops-agent -o jsonpath='{.items[0].spec.containers[0].resources}'

# Check for memory leaks
kubectl logs -n pipeops-system deployment/pipeops-agent | grep -i "memory\|oom"
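
The container's last termination state confirms whether the pod was actually OOMKilled; one way to read it directly:

# Show why the agent container last terminated ("OOMKilled" confirms a memory kill)
kubectl get pod -n pipeops-system -l app=pipeops-agent \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'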

Solutions:

Increase Memory Limits:

helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set agent.resources.limits.memory="1Gi" \
  --set agent.resources.requests.memory="512Mi" \
  --namespace pipeops-system \
  --reuse-values

Reduce Monitoring Overhead:

# Disable monitoring if not needed
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set monitoring.enabled=false \
  --namespace pipeops-system \
  --reuse-values

# Or adjust scrape intervals
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set monitoring.prometheus.scrape_interval="60s" \
  --namespace pipeops-system \
  --reuse-values

Issue: High CPU Usage

Symptoms:

  • CPU throttling
  • Slow API responses
  • Container CPU continuously throttled

Diagnosis:

# Check CPU usage
kubectl top pod -n pipeops-system

# View CPU limits
kubectl describe pod -n pipeops-system -l app=pipeops-agent | grep -A 5 "Limits\|Requests"

# Check for CPU-intensive operations
kubectl logs -n pipeops-system deployment/pipeops-agent | tail -100

Solutions:

# Increase CPU limits
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set agent.resources.limits.cpu="1000m" \
  --set agent.resources.requests.cpu="500m" \
  --namespace pipeops-system \
  --reuse-values

Issue: Disk Space Full

Symptoms:

  • Pod evicted
  • Cannot write logs
  • Monitoring data not persisted

Diagnosis:

# Check node disk usage
kubectl get nodes -o custom-columns=NAME:.metadata.name,DISK:.status.allocatable.ephemeral-storage

# Check PV usage
kubectl get pv
kubectl describe pvc -n pipeops-monitoring

# Check pod disk usage
kubectl exec -n pipeops-system deployment/pipeops-agent -- df -h

Solutions:

Increase PV Size:

# For monitoring PVCs
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set monitoring.prometheus.persistence.size="50Gi" \
  --set monitoring.grafana.persistence.size="10Gi" \
  --set monitoring.loki.persistence.size="50Gi" \
  --namespace pipeops-system \
  --reuse-values

Reduce Retention:

helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set monitoring.prometheus.retention="7d" \
  --set monitoring.loki.retention="168h" \
  --namespace pipeops-system \
  --reuse-values

Clean Up Old Data:

# Delete old logs (WARNING: permanently removes stored log data)
kubectl exec -n pipeops-monitoring deployment/loki -- rm -rf /data/loki/chunks/*

# Restart to rebuild index
kubectl rollout restart deployment/loki -n pipeops-monitoring

Monitoring Issues

Issue: Prometheus Not Scraping Metrics

Symptoms:

  • No data in Grafana dashboards
  • Missing metrics in Prometheus

Diagnosis:

# Check Prometheus targets
kubectl port-forward -n pipeops-monitoring svc/prometheus-server 9090:9090

# Open http://localhost:9090/targets in browser
# Look for failed targets

# Check service monitors
kubectl get servicemonitor -n pipeops-monitoring

Solutions:

Verify Service Monitor:

kubectl get servicemonitor -n pipeops-monitoring -o yaml

# Ensure labels match Prometheus selector
kubectl get prometheus -n pipeops-monitoring -o yaml | grep serviceMonitorSelector -A 5
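
If no ServiceMonitor exists for the agent, a minimal one can be created; this sketch assumes the agent's Service exposes a port named metrics and carries the app: pipeops-agent label, and that the Prometheus selector matches a release: prometheus label (adjust all three to your setup):

cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pipeops-agent
  namespace: pipeops-monitoring
  labels:
    release: prometheus   # must match the serviceMonitorSelector above
spec:
  selector:
    matchLabels:
      app: pipeops-agent
  namespaceSelector:
    matchNames:
      - pipeops-system
  endpoints:
    - port: metrics
      interval: 30s
EOF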

Check Network Policies:

# Ensure Prometheus can reach targets
kubectl describe networkpolicy -n pipeops-monitoring

Restart Prometheus:

kubectl rollout restart deployment/prometheus-server -n pipeops-monitoring

Issue: Grafana Dashboards Not Loading

Symptoms:

  • Dashboards show "No data"
  • Data source connection failed

Diagnosis:

# Check Grafana logs
kubectl logs -n pipeops-monitoring deployment/grafana

# Test Prometheus data source
kubectl port-forward -n pipeops-monitoring svc/grafana 3000:3000

# Login and check Configuration > Data Sources

Solutions:

Verify Data Source Configuration:

# Check Prometheus URL in Grafana
# Should be: http://prometheus-server.pipeops-monitoring.svc.cluster.local

# Update if needed through Grafana UI or ConfigMap
kubectl edit configmap grafana-datasources -n pipeops-monitoring
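
For reference, a provisioned Prometheus data source entry typically looks like the sketch below; the ConfigMap layout and key names depend on how Grafana was installed, so treat this as a template rather than the exact format in use:

# datasources.yaml (inside the grafana-datasources ConfigMap)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.pipeops-monitoring.svc.cluster.local
    isDefault: true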

Restart Grafana:

kubectl rollout restart deployment/grafana -n pipeops-monitoring

Issue: Cannot Access Grafana Dashboard

Symptoms:

  • Cannot access Grafana UI
  • Login fails
  • Forgot admin password

Diagnosis:

# Check Grafana pod status
kubectl get pods -n pipeops-monitoring -l app=grafana

# Check Grafana logs
kubectl logs -n pipeops-monitoring deployment/grafana

Solutions:

Reset Admin Password:

kubectl exec -it -n pipeops-monitoring deployment/grafana -- \
  grafana-cli admin reset-admin-password newpassword

Port Forward to Access:

kubectl port-forward -n pipeops-monitoring svc/grafana 3000:3000
# Open http://localhost:3000
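
If the admin password was generated at install time, it usually lives in a Secret; the Secret and key names below are common defaults but may differ in your setup:

# Read the generated admin password (Secret/key names vary by chart)
kubectl get secret grafana -n pipeops-monitoring \
  -o jsonpath='{.data.admin-password}' | base64 -d && echo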

Check Ingress (if configured):

kubectl get ingress -n pipeops-monitoring
kubectl describe ingress grafana -n pipeops-monitoring

Configuration Issues

Issue: Configuration Changes Not Applied

Symptoms:

  • Changes to ConfigMap don't take effect
  • Updated Helm values not applied

Diagnosis:

# Check current configuration
kubectl get configmap pipeops-agent-config -n pipeops-system -o yaml

# Check Helm values
helm get values pipeops-agent -n pipeops-system

Solutions:

Restart Agent:

# ConfigMaps are not auto-reloaded
kubectl rollout restart deployment/pipeops-agent -n pipeops-system
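
To confirm the rollout completed before re-testing:

# Wait until the restarted pod is ready
kubectl rollout status deployment/pipeops-agent -n pipeops-system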

Verify Helm Upgrade:

# Use --reuse-values to keep existing values
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set agent.new.setting="value" \
  --namespace pipeops-system \
  --reuse-values

# Check what changed
helm diff upgrade pipeops-agent pipeops/pipeops-agent \
  -f values.yaml \
  --namespace pipeops-system
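
Note that helm diff is not built in; it comes from the helm-diff plugin, installed once with:

# One-time setup for the helm diff subcommand
helm plugin install https://github.com/databus23/helm-diff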

Issue: Invalid YAML Configuration

Symptoms:

  • Parse errors in logs
  • Agent fails to start
  • ConfigMap apply fails

Diagnosis:

# Check the file can be loaded (note: --from-file embeds the file as an
# opaque string, so this verifies readability, not YAML syntax)
kubectl create configmap test --from-file=config.yaml --dry-run=client -o yaml

# Check for common issues:
# - Incorrect indentation
# - Missing quotes
# - Invalid characters

Solutions:

Use YAML Validator:

# Install yamllint
pip install yamllint

# Validate file
yamllint config.yaml
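
If Python with PyYAML happens to be available, a one-liner gives a similar check without extra tooling:

# Parse the file directly; syntax errors print with line numbers
python3 -c "import yaml, sys; yaml.safe_load(open(sys.argv[1]))" config.yaml \
  && echo "config.yaml is valid YAML"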

Common Fixes:

# Incorrect (bad indentation)
agent:
cluster_name: "test"

# Correct
agent:
  cluster_name: "test"

# Risky (unquoted values such as "yes", "on", or "1.0" can be
# misparsed as booleans or numbers)
cluster_name: my-cluster-name

# Safer (quote string values explicitly)
cluster_name: "my-cluster-name"

Performance Issues

Issue: Slow API Responses

Symptoms:

  • PipeOps dashboard slow
  • Deployment delays
  • Timeouts

Diagnosis:

# Check agent logs for slow requests
kubectl logs -n pipeops-system deployment/pipeops-agent | grep -i "slow\|timeout"

# Check resource usage
kubectl top pod -n pipeops-system

# Check API latency metrics (if exposed)
kubectl exec -n pipeops-system deployment/pipeops-agent -- \
  curl http://localhost:9091/metrics | grep api_request_duration

Solutions:

Increase Timeouts:

helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set agent.pipeops.timeout="60s" \
  --namespace pipeops-system \
  --reuse-values

Scale Resources:

helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set agent.resources.limits.cpu="1000m" \
  --set agent.resources.limits.memory="1Gi" \
  --namespace pipeops-system \
  --reuse-values

Check Network Latency:

# Test latency to PipeOps API
# (curl's built-in timer avoids relying on a "time" binary in the image)
kubectl exec -n pipeops-system deployment/pipeops-agent -- \
  curl -s -o /dev/null -w "total: %{time_total}s\n" https://api.pipeops.sh/health

Security Issues

Issue: RBAC Permission Denied

Symptoms:

  • "Forbidden" errors in logs
  • Cannot list/create Kubernetes resources

Diagnosis:

# Check service account
kubectl get sa pipeops-agent -n pipeops-system

# Check role bindings
kubectl get rolebinding,clusterrolebinding -n pipeops-system | grep pipeops

# Test specific permission
kubectl auth can-i list pods --as=system:serviceaccount:pipeops-system:pipeops-agent
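
To probe several permissions at once, the same check can be looped; the verb and resource lists here are illustrative:

# Probe a set of verbs for the agent's service account
SA=system:serviceaccount:pipeops-system:pipeops-agent
for verb in get list watch create update delete; do
  echo -n "$verb pods: "
  kubectl auth can-i "$verb" pods --as="$SA"
done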

Solutions:

Verify RBAC Creation:

# Ensure RBAC is enabled
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set rbac.create=true \
  --set serviceAccount.create=true \
  --namespace pipeops-system \
  --reuse-values

Grant Additional Permissions (if needed):

# custom-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pipeops-agent-custom
rules:
  - apiGroups: [""]
    resources: ["pods", "services"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pipeops-agent-custom
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pipeops-agent-custom
subjects:
  - kind: ServiceAccount
    name: pipeops-agent
    namespace: pipeops-system

Apply:

kubectl apply -f custom-rbac.yaml

Issue: TLS Certificate Errors

Symptoms:

  • "x509: certificate" errors
  • TLS handshake failures

Diagnosis:

# Check TLS configuration
kubectl logs -n pipeops-system deployment/pipeops-agent | grep -i "tls\|certificate"

# Test TLS connection
kubectl exec -n pipeops-system deployment/pipeops-agent -- \
  openssl s_client -connect api.pipeops.sh:443
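
The validity window and issuer of the certificate the endpoint presents can be extracted in one pipeline (run anywhere openssl is available):

# Show the validity dates, subject, and issuer of the presented certificate
echo | openssl s_client -connect api.pipeops.sh:443 -servername api.pipeops.sh 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer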

Solutions:

Update CA Certificates:

# In pod
kubectl exec -n pipeops-system deployment/pipeops-agent -- \
  update-ca-certificates

Disable TLS Verification (NOT for production):

helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set agent.pipeops.tls.insecure_skip_verify=true \
  --namespace pipeops-system \
  --reuse-values

Debugging Tools

Enable Debug Logging

# Helm
helm upgrade pipeops-agent pipeops/pipeops-agent \
  --set logging.level="debug" \
  --namespace pipeops-system \
  --reuse-values

# ConfigMap
kubectl patch configmap pipeops-agent-config -n pipeops-system \
  --type merge \
  -p '{"data":{"log_level":"debug"}}'

# Restart agent
kubectl rollout restart deployment/pipeops-agent -n pipeops-system

Interactive Debugging

# Shell into agent pod
kubectl exec -it -n pipeops-system deployment/pipeops-agent -- /bin/sh

# Common debugging commands:
# - ps aux (check processes)
# - netstat -tlnp (check listening ports)
# - curl localhost:8081/healthz (health check)
# - env (check environment variables)

Collect Diagnostic Information

#!/bin/bash
# collect-diagnostics.sh

mkdir -p diagnostics

# Agent info
kubectl get all -n pipeops-system -o yaml > diagnostics/agent-resources.yaml
kubectl logs deployment/pipeops-agent -n pipeops-system > diagnostics/agent-logs.txt
kubectl describe pod -n pipeops-system -l app=pipeops-agent > diagnostics/agent-pod-describe.txt

# Monitoring info
kubectl get all -n pipeops-monitoring -o yaml > diagnostics/monitoring-resources.yaml
kubectl logs deployment/prometheus-server -n pipeops-monitoring > diagnostics/prometheus-logs.txt

# Cluster info
kubectl get nodes -o wide > diagnostics/nodes.txt
kubectl top nodes > diagnostics/node-resources.txt
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > diagnostics/events.txt

# Create archive
tar czf diagnostics-$(date +%Y%m%d-%H%M%S).tar.gz diagnostics/

echo "Diagnostics collected in diagnostics-*.tar.gz"

Getting Help

If you cannot resolve the issue using this guide:

  1. Check GitHub Issues: https://github.com/PipeOpsHQ/pipeops-k8-agent/issues

  2. Community Forum: https://community.pipeops.io

  3. Email Support: support@pipeops.io

    • Include agent version
    • Attach diagnostic logs
    • Describe steps to reproduce
  4. Create Support Ticket: Include:

    # Agent version
    kubectl get deployment pipeops-agent -n pipeops-system -o jsonpath='{.spec.template.spec.containers[0].image}'

    # Cluster info
    kubectl version
    kubectl get nodes

    # Recent logs
    kubectl logs deployment/pipeops-agent -n pipeops-system --tail=100

Next Steps