
Monitoring & Observability

The PipeOps Kubernetes Agent includes a comprehensive monitoring and observability stack built on industry-standard open-source tools. This guide covers setup, configuration, and usage of the monitoring components.

Overview

The agent's monitoring stack provides complete visibility into your Kubernetes cluster and workloads through:

  • Prometheus — Metrics collection, storage, and alerting
  • Grafana — Visualization, dashboards, and analysis
  • Loki — Log aggregation and querying
  • OpenCost — Kubernetes cost monitoring and optimization
  • Node Exporter — Node-level system metrics
  • kube-state-metrics — Kubernetes object metrics

Architecture

  ┌────────────────────────────────────────────────────┐
  │                 Grafana Dashboard                  │
  │             (Visualization & Analysis)             │
  └──────────────────────────┬─────────────────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
  ┌───────▼──────┐   ┌───────▼──────┐   ┌───────▼──────┐
  │  Prometheus  │   │     Loki     │   │   OpenCost   │
  │ (Metrics DB) │   │  (Logs DB)   │   │   (Costs)    │
  └───────┬──────┘   └───────┬──────┘   └───────┬──────┘
          │                  │                  │
      ┌───┴──────────┬───────┴──────┬───────────┴──┐
      │              │              │              │
┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
│  K8s API  │  │   Nodes   │  │   Pods    │  │  Kubelet  │
└───────────┘  └───────────┘  └───────────┘  └───────────┘

Installation

Enable Monitoring During Initial Setup

Intelligent Installer:

export PIPEOPS_TOKEN="your-api-token"

curl -fsSL https://get.pipeops.dev/k8-install.sh | bash

The monitoring stack (Prometheus, Grafana, Loki, OpenCost) is installed by default.
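
To confirm the stack came up, list the components in the monitoring namespace (this guide uses pipeops-monitoring; adjust if your installation places them elsewhere):

kubectl get pods -n pipeops-monitoring
kubectl get svc -n pipeops-monitoring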

Helm Installation:

helm install pipeops-agent oci://ghcr.io/pipeopshq/pipeops-agent \
  --set agent.pipeops.token="your-api-token" \
  --set monitoring.enabled=true \
  --set monitoring.prometheus.enabled=true \
  --set monitoring.grafana.enabled=true \
  --set monitoring.loki.enabled=true \
  --namespace pipeops-system \
  --create-namespace

Add Monitoring to Existing Installation

If you initially installed without monitoring:

helm upgrade pipeops-agent oci://ghcr.io/pipeopshq/pipeops-agent \
  --set monitoring.enabled=true \
  --namespace pipeops-system \
  --reuse-values
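
To check that the upgrade applied, use Helm's standard commands to inspect the release and watch the monitoring components roll out:

# Show the values currently applied to the release
helm get values pipeops-agent -n pipeops-system

# Watch the monitoring pods come up
kubectl get pods -n pipeops-monitoring -w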

Prometheus Configuration

Prometheus collects and stores time-series metrics from your cluster.

Basic Setup

monitoring:
  prometheus:
    enabled: true
    port: 9090

    # Metric retention period
    retention: "15d"

    # Storage configuration
    persistence:
      enabled: true
      storageClass: "standard"
      size: "10Gi"

    # Scrape interval
    scrape_interval: "30s"

    # Resource limits
    resources:
      requests:
        cpu: "250m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "1Gi"

Custom Scrape Configurations

Add custom scrape targets:

monitoring:
  prometheus:
    additionalScrapeConfigs:
      - job_name: 'custom-app'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - my-app-namespace
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

Accessing Prometheus

Port Forward (Local Access):

kubectl port-forward svc/prometheus-server 9090:9090 -n pipeops-monitoring

Then open http://localhost:9090
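
With the port-forward active, you can also hit Prometheus's standard HTTP API directly, which is handy for scripting (pipe to jq if you want pretty output):

# Check that scrape targets report as up
curl -s 'http://localhost:9090/api/v1/query?query=up'

# Run an arbitrary PromQL expression
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'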

Ingress (Production):

monitoring:
  prometheus:
    ingress:
      enabled: true
      className: "nginx"
      annotations:
        cert-manager.io/cluster-issuer: "letsencrypt-prod"
      hosts:
        - host: prometheus.example.com
          paths:
            - path: /
              pathType: Prefix
      tls:
        - secretName: prometheus-tls
          hosts:
            - prometheus.example.com

Common Prometheus Queries

CPU Usage by Pod:

sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)

Memory Usage by Namespace:

sum(container_memory_working_set_bytes{namespace!=""}) by (namespace)

Pod Restart Count:

kube_pod_container_status_restarts_total

Node CPU Usage:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Grafana Configuration

Grafana provides powerful visualization and dashboarding capabilities.

Basic Setup

monitoring:
  grafana:
    enabled: true
    port: 3000

    # Admin credentials
    adminUser: "admin"
    adminPassword: "changeme"  # Change in production!

    # Persistence for dashboards
    persistence:
      enabled: true
      storageClass: "standard"
      size: "5Gi"

    # Resource limits
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "200m"
        memory: "256Mi"

Accessing Grafana

Port Forward (Local Access):

kubectl port-forward svc/grafana 3000:3000 -n pipeops-monitoring

Then open http://localhost:3000

Default credentials:

  • Username: admin
  • Password: pipeops (or as configured)

Ingress (Production):

monitoring:
  grafana:
    ingress:
      enabled: true
      className: "nginx"
      annotations:
        cert-manager.io/cluster-issuer: "letsencrypt-prod"
        nginx.ingress.kubernetes.io/rewrite-target: /
      hosts:
        - host: grafana.example.com
          paths:
            - path: /
              pathType: Prefix
      tls:
        - secretName: grafana-tls
          hosts:
            - grafana.example.com

Pre-configured Dashboards

The agent includes several pre-configured Grafana dashboards:

Kubernetes Cluster Overview

  • ID: kubernetes-cluster-overview
  • Metrics: Node status, pod counts, resource usage
  • Use Case: High-level cluster health monitoring

Node Metrics

  • ID: node-exporter-full
  • Metrics: CPU, memory, disk, network per node
  • Use Case: Node-level performance analysis

Pod Resources

  • ID: kubernetes-pod-resources
  • Metrics: CPU, memory, network per pod
  • Use Case: Application resource monitoring

Persistent Volumes

  • ID: kubernetes-persistent-volumes
  • Metrics: Volume usage, capacity, status
  • Use Case: Storage monitoring

Cost Analysis (OpenCost)

  • ID: opencost-overview
  • Metrics: Cost per namespace, pod, label
  • Use Case: Cost optimization and budgeting

Importing Custom Dashboards

Via UI:

  1. Navigate to Dashboards → Import
  2. Enter dashboard ID from Grafana.com
  3. Select Prometheus data source
  4. Click Import

Via Configuration:

monitoring:
  grafana:
    dashboardProviders:
      dashboardproviders.yaml:
        apiVersion: 1
        providers:
          - name: 'default'
            orgId: 1
            folder: ''
            type: file
            disableDeletion: false
            editable: true
            options:
              path: /var/lib/grafana/dashboards/default

    dashboards:
      default:
        kubernetes-cluster:
          gnetId: 7249
          revision: 1
          datasource: Prometheus
        node-exporter:
          gnetId: 1860
          revision: 27
          datasource: Prometheus

Creating Alerts in Grafana

Navigate to Alerting → Alert rules → New alert rule:

Example: High CPU Alert

Alert name:  High CPU Usage
Query (A):   avg by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
Condition:   WHEN avg() OF query(A, 5m, now) IS ABOVE 0.8
For:         5m
Annotations:
  summary: High CPU usage detected in {{ $labels.namespace }}

Loki Configuration

Loki provides log aggregation and querying capabilities.

Basic Setup

monitoring:
  loki:
    enabled: true
    port: 3100

    # Log retention
    retention: "168h"  # 7 days

    # Storage
    persistence:
      enabled: true
      storageClass: "standard"
      size: "10Gi"

    # Resource limits
    resources:
      requests:
        cpu: "100m"
        memory: "256Mi"
      limits:
        cpu: "200m"
        memory: "512Mi"

Accessing Loki

Loki is primarily accessed through Grafana:

  1. In Grafana, go to Configuration → Data Sources
  2. Loki should be pre-configured as a data source
  3. Use the Explore view to query logs
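
If you need to query Loki outside Grafana, you can port-forward the Loki service and use its standard HTTP API. A sketch below; the service name loki is an assumption, so check kubectl get svc -n pipeops-monitoring for the exact name:

kubectl port-forward svc/loki 3100:3100 -n pipeops-monitoring

# Most recent matching lines over the default range (last hour)
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={namespace="default"}' \
  --data-urlencode 'limit=100'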

LogQL Query Examples

All logs from a namespace:

{namespace="default"}

Logs from a specific pod:

{namespace="default", pod="my-app-xyz"}

Error logs:

{namespace="default"} |= "error" | json

Rate of errors:

rate({namespace="default"} |= "error" [5m])

Logs filtered by label:

{app="nginx"} | json | line_format "{{.message}}"

Log Shipping to Loki

The agent automatically configures Promtail to ship logs to Loki. For custom applications, add the Loki label:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app
    loki: "true"  # Enable log collection
spec:
  containers:
    - name: app
      image: my-app:latest

OpenCost Configuration

OpenCost provides Kubernetes cost monitoring and optimization insights.

Basic Setup

monitoring:
  opencost:
    enabled: true
    port: 9003

    # Cloud provider pricing (optional)
    cloudProvider: "aws"  # aws, gcp, azure

    # Custom pricing (optional)
    customPricing:
      cpu: "0.031611"     # USD per CPU-hour
      memory: "0.004237"  # USD per GB-hour
      storage: "0.00005"  # USD per GB-hour

Accessing OpenCost

Port Forward:

kubectl port-forward svc/opencost 9003:9003 -n pipeops-monitoring

Then open http://localhost:9003
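
OpenCost also exposes an allocation API you can query through the same port-forward; a sketch (the endpoint and parameters follow the upstream OpenCost API, so verify against the version bundled with the agent):

# Cost per namespace over the last 24 hours
curl -s 'http://localhost:9003/allocation?window=1d&aggregate=namespace'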

Via Grafana Dashboard: The pre-configured OpenCost dashboard in Grafana shows:

  • Cost by namespace
  • Cost by deployment
  • Cost by label
  • Cost trends over time

Cost Optimization Tips

  1. Identify expensive namespaces:

    kubectl top nodes
    kubectl top pods --all-namespaces
  2. Right-size resources:

    • Review actual CPU/memory usage vs. requests/limits (a comparison query sketch follows this list)
    • Adjust resource specifications based on actual usage
  3. Use cost labels:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        team: "engineering"
        cost-center: "product"
        environment: "production"

Alerting

Prometheus Alerting Rules

Configure alerts in Prometheus:

monitoring:
  prometheus:
    alerting:
      enabled: true

    serverFiles:
      alerting_rules.yml:
        groups:
          - name: kubernetes-alerts
            interval: 30s
            rules:
              - alert: PodCrashLooping
                expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"

              - alert: HighCPUUsage
                expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace) > 0.8
                for: 10m
                labels:
                  severity: warning
                annotations:
                  summary: "High CPU usage in namespace {{ $labels.namespace }}"

              - alert: HighMemoryUsage
                expr: sum(container_memory_working_set_bytes) by (namespace) / scalar(sum(kube_node_status_allocatable{resource="memory"})) > 0.8
                for: 10m
                labels:
                  severity: warning
                annotations:
                  summary: "High memory usage in namespace {{ $labels.namespace }}"

              - alert: NodeDiskPressure
                expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
                for: 5m
                labels:
                  severity: critical
                annotations:
                  summary: "Node {{ $labels.node }} has disk pressure"

Alertmanager Configuration

Configure alert routing and notifications:

monitoring:
  prometheus:
    alertmanagerFiles:
      alertmanager.yml:
        global:
          resolve_timeout: 5m
          slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

        route:
          group_by: ['alertname', 'namespace']
          group_wait: 10s
          group_interval: 10s
          repeat_interval: 12h
          receiver: 'slack-notifications'

          routes:
            - match:
                severity: critical
              receiver: 'pagerduty'
              continue: true

            - match:
                severity: warning
              receiver: 'slack-notifications'

        receivers:
          - name: 'slack-notifications'
            slack_configs:
              - channel: '#kubernetes-alerts'
                title: 'Kubernetes Alert'
                text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

          - name: 'pagerduty'
            pagerduty_configs:
              - service_key: 'YOUR_PAGERDUTY_KEY'
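
If you have amtool (shipped with Alertmanager) available locally, you can lint the routing configuration before applying it; a sketch assuming the alertmanager.yml content above is saved to a local file of the same name:

amtool check-config alertmanager.yml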

Best Practices

Resource Management

  1. Set appropriate retention periods:

    • Development: 7 days
    • Staging: 15 days
    • Production: 30+ days
  2. Configure storage:

    • Use persistent volumes for production
    • Set appropriate storage class and size
    • Monitor disk usage regularly
  3. Resource limits:

    • Set requests and limits for all components
    • Monitor actual usage and adjust accordingly

Security

  1. Change default passwords:

    monitoring:
      grafana:
        adminPassword: "strong-password-here"
  2. Use HTTPS/TLS:

    • Configure ingress with TLS certificates
    • Use cert-manager for automated certificate management
  3. Implement RBAC:

    • Restrict access to monitoring namespace
    • Use Grafana organizations and teams

Performance

  1. Optimize scrape intervals:

    • Balance between data granularity and storage
    • Use longer intervals for less critical metrics
  2. Use recording rules:

    groups:
      - name: aggregation_rules
        interval: 30s
        rules:
          - record: namespace:container_cpu_usage:sum
            expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)
  3. Limit cardinality:

    • Avoid high-cardinality labels (see the query sketch after this list)
    • Use label dropping/keeping judiciously
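
A quick way to spot the metrics contributing the most series is this standard PromQL pattern, run against the stack's Prometheus:

topk(10, count by (__name__)({__name__=~".+"}))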

Troubleshooting

Prometheus Issues

Prometheus not scraping targets:

# Check Prometheus targets
kubectl port-forward svc/prometheus-server 9090:9090 -n pipeops-monitoring
# Open http://localhost:9090/targets

# Check service monitor
kubectl get servicemonitor -n pipeops-monitoring

High memory usage:

  • Reduce retention period
  • Increase memory limits
  • Optimize queries and recording rules

Grafana Issues

Dashboards not loading:

  • Verify Prometheus data source connection
  • Check Grafana logs: kubectl logs deployment/grafana -n pipeops-monitoring

Login issues:

  • Reset admin password:
    kubectl exec -it deployment/grafana -n pipeops-monitoring -- grafana-cli admin reset-admin-password newpassword

Loki Issues

Logs not appearing:

  • Verify Promtail is running:
    kubectl get pods -n pipeops-monitoring -l app=promtail
  • Check Promtail logs:
    kubectl logs -n pipeops-monitoring -l app=promtail

Next Steps