feat(ex06): bonus monitoring — Prometheus + Grafana via kube-prometheus-stack

- apps/monitoring/prometheus-grafana.yaml: ArgoCD Application (chart 68.4.4)
- manifests/monitoring/values.yaml: lightweight values, Grafana ingress, 6h retention
- docs/06-monitoring.md: Exercise 06 bonus participant guide
2026-02-28 15:34:47 +01:00
commit ed5d39efa2
parent dce81a4993
3 changed files with 223 additions and 0 deletions
--- a/apps/monitoring/prometheus-grafana.yaml
+++ b/apps/monitoring/prometheus-grafana.yaml
@@ -0,0 +1,29 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: prometheus-grafana
+  namespace: argocd
+  annotations:
+    argocd.argoproj.io/sync-wave: "10"
+spec:
+  project: workshop
+  sources:
+    - repoURL: https://prometheus-community.github.io/helm-charts
+      chart: kube-prometheus-stack
+      targetRevision: "68.4.4"
+      helm:
+        valueFiles:
+          - $values/manifests/monitoring/values.yaml
+    - repoURL: https://github.com/innspire/ops-demo.git
+      targetRevision: HEAD
+      ref: values
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: monitoring
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+      - CreateNamespace=true
+      - ServerSideApply=true
--- a/docs/06-monitoring.md
+++ b/docs/06-monitoring.md
@@ -0,0 +1,138 @@
+# Exercise 06 (Bonus) — Monitoring: Prometheus + Grafana
+
+**Time**: ~60 min
+**Goal**: Deploy a full observability stack via ArgoCD and explore cluster + application metrics in Grafana.
+
+---
+
+## What you'll learn
+- How to deploy a complex multi-component stack (kube-prometheus-stack) purely via GitOps
+- How Prometheus scrapes metrics from Kubernetes and applications
+- How to navigate Grafana dashboards for cluster and pod-level metrics
+
+---
+
+## Prerequisites
+
+Exercises 01–03 complete. Ingress-Nginx is running and nip.io URLs are reachable from your laptop.
+
+**Note**: This exercise adds ~700 MB of additional memory usage. It works on an 8 GB VM but may be slow. If the VM feels sluggish, reduce `replicas` or remove the Prometheus `storageSpec` in `manifests/monitoring/values.yaml`.
+
+---
+
+## Steps
+
+### 1. Enable the monitoring Application
+
+The ArgoCD Application manifest for the monitoring stack is already in `apps/monitoring/`.
+The root App-of-Apps watches this directory, so the application should already appear
+in ArgoCD as **prometheus-grafana**.
+
+Check its sync status:
+
+```bash
+kubectl get application prometheus-grafana -n argocd
+```
+
+The initial sync takes 5–8 minutes — the kube-prometheus-stack chart is large and
+installs many CRDs.
+
+---
+
+### 2. Watch the stack come up
+
+```bash
+kubectl get pods -n monitoring -w
+# You'll see prometheus, grafana, kube-state-metrics, node-exporter pods appear
+```
+
+Once all pods are Running:
+
+```bash
+kubectl get ingress -n monitoring
+# NAME      CLASS   HOSTS                               ADDRESS
+# grafana   nginx   grafana.192.168.56.200.nip.io       192.168.56.200
+```
+
+---
+
+### 3. Open Grafana
+
+From your laptop: **http://grafana.192.168.56.200.nip.io**
+
+Login: `admin` / `workshop123`
+
+---
+
+### 4. Explore dashboards
+
+kube-prometheus-stack ships with pre-built dashboards. In the Grafana sidebar:
+**Dashboards → Browse**
+
+Useful dashboards for this workshop:
+
+| Dashboard | What to look at |
+|-----------|----------------|
+| **Kubernetes / Compute Resources / Namespace (Pods)** | CPU + memory per pod in `podinfo` namespace |
+| **Kubernetes / Compute Resources / Node (Pods)** | Node-level resource view |
+| **Node Exporter / Full** | VM-level CPU, memory, disk, network |
+
+---
+
+### 5. Generate some load on podinfo
+
+In a new terminal, run a simple load loop:
+
+```bash
+# Inside the VM
+while true; do curl -s http://podinfo.192.168.56.200.nip.io > /dev/null; sleep 0.2; done
+```
+
+Switch back to Grafana → **Kubernetes / Compute Resources / Namespace (Pods)** →
+set namespace to `podinfo`. You should see CPU usage climb for the podinfo pod.
+
+---
+
+### 6. Explore the GitOps aspect
+
+Every configuration change to the monitoring stack goes through Git.
+
+Try changing the Grafana admin password:
+
+```bash
+vim manifests/monitoring/values.yaml
+# Change: adminPassword: workshop123
+# To:     adminPassword: supersecret
+git add manifests/monitoring/values.yaml
+git commit -m "chore(monitoring): update grafana admin password"
+git push
+```
+
+Watch ArgoCD sync the Helm release, then try logging into Grafana with the new password.
+
+---
+
+## Expected outcome
+
+- Grafana accessible at **http://grafana.192.168.56.200.nip.io**
+- Prometheus scraping cluster metrics
+- Pre-built Kubernetes dashboards visible and populated
+
+---
+
+## Troubleshooting
+
+| Symptom | Fix |
+|---------|-----|
+| Pods in Pending state | VM may be low on memory; run `kubectl describe pod <pod-name> -n monitoring` and check the Events section to confirm |
+| Grafana 502 from Nginx | Grafana pod not ready yet; wait and retry |
+| No data in dashboards | Prometheus needs ~2 min to scrape first metrics; wait and refresh |
+| CRD conflict on sync | First sync installs CRDs; second sync applies resources — retry |
+
+---
+
+## Going further (at home)
+
+- Add a podinfo `ServiceMonitor` so Prometheus scrapes podinfo's `/metrics` endpoint
+- Create a custom Grafana dashboard for podinfo request rate and error rate
+- Alert on high memory usage with Alertmanager (enable it in `values.yaml`)
--- a/manifests/monitoring/values.yaml
+++ b/manifests/monitoring/values.yaml
@@ -0,0 +1,56 @@
+# kube-prometheus-stack Helm values (workshop — lightweight config)
+# Chart: prometheus-community/kube-prometheus-stack 68.x
+
+grafana:
+  adminPassword: workshop123
+
+  ingress:
+    enabled: true
+    ingressClassName: nginx
+    hosts:
+      - grafana.192.168.56.200.nip.io
+
+  # Lightweight for a workshop VM
+  resources:
+    requests:
+      cpu: 100m
+      memory: 256Mi
+
+prometheus:
+  prometheusSpec:
+    resources:
+      requests:
+        cpu: 200m
+        memory: 512Mi
+
+    # Scrape everything in the cluster
+    podMonitorSelectorNilUsesHelmValues: false
+    serviceMonitorSelectorNilUsesHelmValues: false
+
+    # Short retention for a workshop
+    retention: 6h
+    retentionSize: "1GB"
+
+    storageSpec:
+      volumeClaimTemplate:
+        spec:
+          accessModes: [ReadWriteOnce]
+          resources:
+            requests:
+              storage: 2Gi
+
+alertmanager:
+  enabled: false   # not needed for the workshop
+
+# Reduce resource footprint
+kube-state-metrics:
+  resources:
+    requests:
+      cpu: 50m
+      memory: 64Mi
+
+prometheus-node-exporter:
+  resources:
+    requests:
+      cpu: 50m
+      memory: 64Mi