Donavan Fritz 01e4b58c91
Build dns-webhook Image / build (push) Has been cancelled
Initial commit: dns-webhook MutatingAdmissionWebhook
Rewrites dnsPolicy+dnsConfig on ClusterFirst pods to distribute
queries across 3 randomly-selected auth-dns nameservers with
edns0/rotate/ndots:5. Includes Gitea CI workflow and README.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 17:14:56 -05:00

# dns-webhook
A Kubernetes [MutatingAdmissionWebhook](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#mutatingadmissionwebhook) that rewrites the DNS configuration of each new pod using the default `dnsPolicy: ClusterFirst` before the pod starts.
## What it does
When a pod is created with the default `dnsPolicy: ClusterFirst`, this webhook intercepts the request and:
1. **Picks 3 random nameservers** from the pool of 4 production auth-dns pods (ns1–ns4). This distributes DNS query load instead of every pod always hitting the same two servers.
2. **Sets search domains** appropriate for the pod's namespace so short service names resolve correctly.
3. **Enables `edns0`** — allows DNS responses larger than 512 bytes (needed for DNSSEC and large TXT records).
4. **Enables `rotate`** — cycles through nameservers on each query for even load distribution.
Pods that opt out (`dnsPolicy: None`, `Default`, or `ClusterFirstWithHostNet`) are passed through unchanged.
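
The steps above can be sketched in Go as follows. This is a minimal illustration, not the code in this repo: the pool addresses are placeholders from the IPv6 documentation range, the search list assumes the conventional `cluster.local` cluster domain, the `ndots:5` option is taken from the commit message, and the names `pickNameservers`, `buildDNSConfig`, and `shouldMutate` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math/rand"
)

// Hypothetical pool: placeholder addresses standing in for the four
// pinned auth-dns pod IPs (ns1 to ns4). The real values are not shown here.
var pool = []string{"2001:db8::1", "2001:db8::2", "2001:db8::3", "2001:db8::4"}

// DNSOption and DNSConfig mirror the shape of a pod's spec.dnsConfig.
type DNSOption struct {
	Name  string  `json:"name"`
	Value *string `json:"value,omitempty"`
}

type DNSConfig struct {
	Nameservers []string    `json:"nameservers"`
	Searches    []string    `json:"searches"`
	Options     []DNSOption `json:"options"`
}

// shouldMutate reports whether a pod with this dnsPolicy is rewritten.
// Only ClusterFirst pods are touched; every other policy is passed through.
func shouldMutate(policy string) bool {
	return policy == "ClusterFirst"
}

// pickNameservers returns n distinct entries of pool in random order.
func pickNameservers(pool []string, n int) []string {
	if n > len(pool) {
		n = len(pool)
	}
	picked := make([]string, 0, n)
	for _, i := range rand.Perm(len(pool))[:n] {
		picked = append(picked, pool[i])
	}
	return picked
}

// buildDNSConfig assembles the dnsConfig for a pod in the given namespace.
// The search list assumes the cluster domain is cluster.local.
func buildDNSConfig(namespace string) DNSConfig {
	ndots := "5"
	return DNSConfig{
		Nameservers: pickNameservers(pool, 3),
		Searches: []string{
			namespace + ".svc.cluster.local",
			"svc.cluster.local",
			"cluster.local",
		},
		Options: []DNSOption{
			{Name: "edns0"},
			{Name: "rotate"},
			{Name: "ndots", Value: &ndots},
		},
	}
}

func main() {
	if !shouldMutate("ClusterFirst") {
		return
	}
	out, err := json.MarshalIndent(buildDNSConfig("default"), "", "  ")
	if err != nil {
		fmt.Println("ERROR", err)
		return
	}
	fmt.Println(string(out))
}
```

Each run prints a different set of three nameservers, which is the point: across many pods, the load spreads over all four servers.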
## Architecture
```
kubelet / kubectl apply
        │
        ▼
Kubernetes API server
        │  (Pod CREATE request)
        ▼
MutatingAdmissionWebhook ──► dns-webhook pod (this service)
        │                          │
        │  ◄── JSON Patch ─────────┘
        │      replace dnsPolicy → None
        │      add dnsConfig { nameservers, searches, options }
        ▼
Pod stored with rewritten DNS config
```
The webhook runs as a Deployment in `kube-system` and is registered via a `MutatingWebhookConfiguration`. cert-manager issues the TLS certificate; its cainjector populates the `caBundle` field automatically.
## Logs
Key log lines to watch for during debugging:
| Prefix | Meaning |
|--------|---------|
| `dns-webhook starting: cert=... key=...` | Server startup — confirms TLS paths |
| `MUTATE pod=<ns>/<name> uid=... nameservers=[...] op=add\|replace` | Pod was mutated — shows which nameservers were assigned |
| `SKIP pod=<ns>/<name> uid=... policy=<policy>` | Pod was not mutated — shows why (non-ClusterFirst policy) |
| `ERROR ...` | Decode/encode failures — should never appear in normal operation |
```bash
# Stream logs from all webhook replicas
kubectl --context sjc001 logs -n kube-system -l app=dns-webhook -f
# Verify a running pod received the correct DNS config
kubectl --context sjc001 exec -n <namespace> <pod> -- cat /etc/resolv.conf
```
## Deployment
Managed by ArgoCD. Manifests live in the `fritzlab/apps` repo under
`sjc001/kube-system/dns-webhook/manifests/`.
```
apps/sjc001/kube-system/dns-webhook/
├── app.yaml                  # ArgoCD Application
└── manifests/
    ├── deployment.yaml       # Webhook pods (2 replicas, dnsPolicy: Default)
    ├── issuer.yaml           # cert-manager: selfSigned → CA → leaf cert
    ├── service.yaml          # ClusterIP Service on :443 → pod :8443
    ├── serviceaccount.yaml
    └── webhook.yaml          # MutatingWebhookConfiguration
```
The `deployment.yaml` image tag (`code.fritzlab.net/fritzlab/dns-webhook:<run_number>`) must be updated whenever a new image is built. CI in this repo produces the image; update the tag in `apps` to deploy.
## Development
```bash
# Build locally
go build ./...
# Run tests (none yet — the mutation logic is straightforward enough that
# end-to-end verification via a test pod is more useful)
go test ./...
# Build container image
docker build -t dns-webhook:local .
```
## Design notes
- **`dnsPolicy: Default` on the webhook pods themselves**: avoids a circular dependency — if cluster DNS is disrupted, the webhook pods can still start because they use the node's `/etc/resolv.conf` directly.
- **`failurePolicy: Ignore`**: if the webhook is unavailable, pods are admitted without mutation rather than being blocked. Availability of workloads takes priority over DNS load balancing.
- **`imagePullPolicy: IfNotPresent`**: if cluster DNS is down at pod start time, an image pull (which needs DNS to reach the registry) would fail. With this policy the kubelet skips the pull whenever the image is already cached on the node.
- **ClusterIP service (not headless)**: webhook calls are short-lived HTTP requests — the keepalive starvation problem that affects long-lived connections doesn't apply here. A stable VIP is the conventional pattern for webhook services.
- **Static nameserver IPs**: the auth-dns pods use `cni.projectcalico.org/ipAddrs` to pin their Calico-allocated IPv6 addresses across restarts, making them safe to hardcode here.