diff --git a/apps/monitoring/prometheus-grafana.yaml b/apps/monitoring/prometheus-grafana.yaml
new file mode 100644
index 0000000..f28ef73
--- /dev/null
+++ b/apps/monitoring/prometheus-grafana.yaml
@@ -0,0 +1,29 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: prometheus-grafana
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "10"
+spec:
+  project: workshop
+  sources:
+    - repoURL: https://prometheus-community.github.io/helm-charts
+      chart: kube-prometheus-stack
+      targetRevision: "68.4.4"
+      helm:
+        valueFiles:
+          - $values/manifests/monitoring/values.yaml
+    - repoURL: https://github.com/innspire/ops-demo.git
+      targetRevision: HEAD
+      ref: values
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: monitoring
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=true
+      - ServerSideApply=true
diff --git a/docs/06-monitoring.md b/docs/06-monitoring.md
new file mode 100644
index 0000000..82a18f5
--- /dev/null
+++ b/docs/06-monitoring.md
@@ -0,0 +1,138 @@
+# Exercise 06 (Bonus) — Monitoring: Prometheus + Grafana
+
+**Time**: ~60 min
+**Goal**: Deploy a full observability stack via ArgoCD and explore cluster + application metrics in Grafana.
+
+---
+
+## What you'll learn
+- How to deploy a complex multi-component stack (kube-prometheus-stack) purely via GitOps
+- How Prometheus scrapes metrics from Kubernetes and applications
+- How to navigate Grafana dashboards for cluster and pod-level metrics
+
+---
+
+## Prerequisites
+
+Exercises 01–03 complete. Ingress-Nginx is running and nip.io URLs are reachable from your laptop.
+
+**Note**: This exercise adds ~700 MB of additional memory usage. It works on an 8 GB VM but may be slow. If the VM feels sluggish, reduce `replicas` or skip Prometheus `storageSpec`.
+
+---
+
+## Steps
+
+### 1. Enable the monitoring Application
+
+The ArgoCD Application manifest for the monitoring stack is already in `apps/monitoring/`.
+The root App-of-Apps watches this directory, so the application should already appear
+in ArgoCD as **prometheus-grafana**.
+
+Check its sync status:
+
+```bash
+kubectl get application prometheus-grafana -n argocd
+```
+
+The initial sync takes 5–8 minutes — the kube-prometheus-stack chart is large and
+installs many CRDs.
+
+---
+
+### 2. Watch the stack come up
+
+```bash
+kubectl get pods -n monitoring -w
+# You'll see prometheus, grafana, kube-state-metrics, node-exporter pods appear
+```
+
+Once all pods are Running:
+
+```bash
+kubectl get ingress -n monitoring
+# NAME      CLASS   HOSTS                           ADDRESS
+# grafana   nginx   grafana.192.168.56.200.nip.io   192.168.56.200
+```
+
+---
+
+### 3. Open Grafana
+
+From your laptop: **http://grafana.192.168.56.200.nip.io**
+
+Login: `admin` / `workshop123`
+
+---
+
+### 4. Explore dashboards
+
+kube-prometheus-stack ships with pre-built dashboards. In the Grafana sidebar:
+**Dashboards → Browse**
+
+Useful dashboards for this workshop:
+
+| Dashboard | What to look at |
+|-----------|----------------|
+| **Kubernetes / Compute Resources / Namespace (Pods)** | CPU + memory per pod in `podinfo` namespace |
+| **Kubernetes / Compute Resources / Node (Pods)** | Node-level resource view |
+| **Node Exporter / Full** | VM-level CPU, memory, disk, network |
+
+---
+
+### 5. Generate some load on podinfo
+
+In a new terminal, run a simple load loop:
+
+```bash
+# Inside the VM
+while true; do curl -s http://podinfo.192.168.56.200.nip.io > /dev/null; sleep 0.2; done
+```
+
+Switch back to Grafana → **Kubernetes / Compute Resources / Namespace (Pods)** →
+set namespace to `podinfo`. You should see CPU usage climb for the podinfo pod.
+
+---
+
+### 6. Explore the GitOps aspect
+
+Every configuration change to the monitoring stack goes through Git.
+
+Try changing the Grafana admin password:
+
+```bash
+vim manifests/monitoring/values.yaml
+# Change: adminPassword: workshop123
+# To:     adminPassword: supersecret
+git add manifests/monitoring/values.yaml
+git commit -m "chore(monitoring): update grafana admin password"
+git push
+```
+
+Watch ArgoCD sync the Helm release, then try logging into Grafana with the new password.
+
+---
+
+## Expected outcome
+
+- Grafana accessible at **http://grafana.192.168.56.200.nip.io**
+- Prometheus scraping cluster metrics
+- Pre-built Kubernetes dashboards visible and populated
+
+---
+
+## Troubleshooting
+
+| Symptom | Fix |
+|---------|-----|
+| Pods in Pending state | VM may be low on memory; `kubectl describe pod` to confirm |
+| Grafana 502 from Nginx | Grafana pod not ready yet; wait and retry |
+| No data in dashboards | Prometheus needs ~2 min to scrape first metrics; wait and refresh |
+| CRD conflict on sync | First sync installs CRDs; second sync applies resources — retry |
+
+---
+
+## Going further (at home)
+
+- Add a podinfo `ServiceMonitor` so Prometheus scrapes podinfo's `/metrics` endpoint
+- Create a custom Grafana dashboard for podinfo request rate and error rate
+- Alert on high memory usage with Alertmanager (enable it in `values.yaml`)
diff --git a/manifests/monitoring/values.yaml b/manifests/monitoring/values.yaml
new file mode 100644
index 0000000..42e363c
--- /dev/null
+++ b/manifests/monitoring/values.yaml
@@ -0,0 +1,56 @@
+# kube-prometheus-stack Helm values (workshop — lightweight config)
+# Chart: prometheus-community/kube-prometheus-stack 68.x
+
+grafana:
+  adminPassword: workshop123
+
+  ingress:
+    enabled: true
+    ingressClassName: nginx
+    hosts:
+      - grafana.192.168.56.200.nip.io
+
+  # Lightweight for a workshop VM
+  resources:
+    requests:
+      cpu: 100m
+      memory: 256Mi
+
+prometheus:
+  prometheusSpec:
+    resources:
+      requests:
+        cpu: 200m
+        memory: 512Mi
+
+    # Scrape everything in the cluster
+    podMonitorSelectorNilUsesHelmValues: false
+    serviceMonitorSelectorNilUsesHelmValues: false
+
+    # Short retention for a workshop
+    retention: 6h
+    retentionSize: "1GB"
+
+    storageSpec:
+      volumeClaimTemplate:
+        spec:
+          accessModes: [ReadWriteOnce]
+          resources:
+            requests:
+              storage: 2Gi
+
+alertmanager:
+  enabled: false  # not needed for the workshop
+
+# Reduce resource footprint
+kubeStateMetrics:
+  resources:
+    requests:
+      cpu: 50m
+      memory: 64Mi
+
+nodeExporter:
+  resources:
+    requests:
+      cpu: 50m
+      memory: 64Mi
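
Follow-up note (not part of this change): the "Going further" list in `docs/06-monitoring.md` suggests adding a podinfo `ServiceMonitor`. Because `serviceMonitorSelectorNilUsesHelmValues` is set to `false` in the values above, Prometheus should pick up ServiceMonitors that do not carry the Helm release label. A minimal sketch of such a manifest could look like the following. The Service label `app.kubernetes.io/name: podinfo` and the port name `http` are assumptions about how podinfo is deployed in this workshop; check them with `kubectl get svc -n podinfo -o yaml` and adjust before use.

```yaml
# Hypothetical follow-up manifest; label and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: podinfo
  namespace: podinfo            # same namespace as the podinfo Service
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: podinfo   # assumption: label on the podinfo Service
  endpoints:
    - port: http                # assumption: name of the Service port serving metrics
      path: /metrics            # endpoint named in the exercise doc
      interval: 30s             # illustrative scrape interval
```

If the target does not show up under Status → Targets in the Prometheus UI, the most likely cause is a mismatch between the selector above and the actual Service labels or port name.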