dns-webhook
A Kubernetes MutatingAdmissionWebhook that rewrites the DNS configuration of new pods using the default ClusterFirst policy before they start.
What it does
When a pod is created with the default dnsPolicy: ClusterFirst, this webhook intercepts the request and:
- Picks 3 random nameservers from the pool of 4 production auth-dns pods (ns1–ns4). This distributes DNS query load instead of every pod always hitting the same two servers.
- Sets search domains appropriate for the pod's namespace so short service names resolve correctly.
- Enables `edns0`: allows DNS responses larger than 512 bytes (needed for DNSSEC and large TXT records).
- Enables `rotate`: cycles through nameservers on each query for even load distribution.
- Sets `ndots:5`, the same value Kubernetes applies to ClusterFirst pods by default.
Pods that opt out (dnsPolicy: None, Default, or ClusterFirstWithHostNet) are passed through unchanged.
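The selection steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the webhook's actual source: the function names, the `cluster.local` cluster domain, and the documentation-range IPv6 addresses standing in for ns1–ns4 are all assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
)

// nameserverPool stands in for ns1–ns4. These are documentation-range
// placeholders (2001:db8::/32), NOT the real auth-dns addresses.
var nameserverPool = []string{
	"2001:db8::1", "2001:db8::2", "2001:db8::3", "2001:db8::4",
}

// dnsOption and dnsConfig mirror the shape of a pod's spec.dnsConfig.
type dnsOption struct {
	Name  string
	Value string // empty for flag-style options such as edns0
}

type dnsConfig struct {
	Nameservers []string
	Searches    []string
	Options     []dnsOption
}

// shouldMutate reports whether a pod's dnsPolicy is eligible for rewriting.
// Treating an empty policy as ClusterFirst is an assumption of this sketch.
func shouldMutate(dnsPolicy string) bool {
	return dnsPolicy == "" || dnsPolicy == "ClusterFirst"
}

// pickNameservers returns n distinct entries from pool, ordered by the
// permutation that perm produces (rand.Perm for random selection, a fixed
// permutation for deterministic tests).
func pickNameservers(pool []string, n int, perm func(int) []int) []string {
	if n > len(pool) {
		n = len(pool)
	}
	picked := make([]string, 0, n)
	for _, idx := range perm(len(pool))[:n] {
		picked = append(picked, pool[idx])
	}
	return picked
}

// searchDomains builds the conventional search list for a namespace.
// The cluster domain is passed in because it varies between clusters.
func searchDomains(namespace, clusterDomain string) []string {
	return []string{
		namespace + ".svc." + clusterDomain,
		"svc." + clusterDomain,
		clusterDomain,
	}
}

// buildDNSConfig assembles the replacement DNS config for one pod.
func buildDNSConfig(namespace string, perm func(int) []int) dnsConfig {
	return dnsConfig{
		Nameservers: pickNameservers(nameserverPool, 3, perm),
		Searches:    searchDomains(namespace, "cluster.local"),
		Options: []dnsOption{
			{Name: "edns0"},
			{Name: "rotate"},
			{Name: "ndots", Value: "5"},
		},
	}
}

func main() {
	if shouldMutate("ClusterFirst") {
		fmt.Printf("%+v\n", buildDNSConfig("default", rand.Perm))
	}
}
```

Passing the permutation function in, rather than calling `rand.Perm` directly, keeps the selection testable with a fixed ordering.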
Architecture
kubelet / kubectl apply
        │
        ▼
Kubernetes API server
        │  (Pod CREATE request)
        ▼
MutatingAdmissionWebhook ──► dns-webhook pod (this service)
        │                         │
        │ ◄── JSON Patch ─────────┘
        │       replace dnsPolicy → None
        │       add dnsConfig { nameservers, searches, options }
        ▼
Pod stored with rewritten DNS config
The webhook runs as a Deployment in kube-system and is registered via a MutatingWebhookConfiguration. cert-manager issues the TLS certificate; its cainjector populates the caBundle field automatically.
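The JSON Patch returned to the API server might be assembled along these lines. This is a hedged sketch using only the standard library: the type and function names are hypothetical, the addresses are placeholders, and the choice between `add` and `replace` for `/spec/dnsConfig` is an interpretation of the `op=add|replace` field in the logs rather than confirmed behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// dnsOption and dnsConfig mirror the JSON shape of a pod's spec.dnsConfig.
type dnsOption struct {
	Name  string `json:"name"`
	Value string `json:"value,omitempty"`
}

type dnsConfig struct {
	Nameservers []string    `json:"nameservers"`
	Searches    []string    `json:"searches"`
	Options     []dnsOption `json:"options"`
}

// buildPatch returns the two operations shown in the diagram: switch the
// policy to None, then add dnsConfig (or replace it if the pod already
// carried one).
func buildPatch(hasDNSConfig bool, cfg dnsConfig) []patchOp {
	configOp := "add"
	if hasDNSConfig {
		configOp = "replace"
	}
	return []patchOp{
		{Op: "replace", Path: "/spec/dnsPolicy", Value: "None"},
		{Op: configOp, Path: "/spec/dnsConfig", Value: cfg},
	}
}

func main() {
	cfg := dnsConfig{
		// Placeholder addresses from the documentation range.
		Nameservers: []string{"2001:db8::1", "2001:db8::2", "2001:db8::4"},
		Searches:    []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"},
		Options:     []dnsOption{{Name: "edns0"}, {Name: "rotate"}},
	}
	out, err := json.MarshalIndent(buildPatch(false, cfg), "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

In a real admission response the marshalled patch is carried in the `patch` field with `patchType` set to `JSONPatch`; the API server applies it to the pod before storing it.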
Logs
Key log lines to watch for during debugging:
| Prefix | Meaning |
|---|---|
| `dns-webhook starting: cert=... key=...` | Server startup: confirms TLS paths |
| `MUTATE pod=<ns>/<name> uid=... nameservers=[...] op=add\|replace` | Pod was mutated: shows which nameservers were assigned |
| `SKIP pod=<ns>/<name> uid=... policy=<policy>` | Pod was not mutated: shows why (non-ClusterFirst policy) |
| `ERROR ...` | Decode/encode failures: should never appear in normal operation |
# Stream logs from all webhook replicas
kubectl --context sjc001 logs -n kube-system -l app=dns-webhook -f
# Verify a running pod received the correct DNS config
kubectl --context sjc001 exec -n <namespace> <pod> -- cat /etc/resolv.conf
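
If the `resolv.conf` check needs to be scripted rather than read by eye, a small parser along these lines could do it. It is a sketch only: the helper names are hypothetical, and the "looks mutated" criteria (three nameservers plus the `edns0` and `rotate` options) are inferred from the behaviour described in this README.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// resolvConf holds the directives of interest from /etc/resolv.conf.
type resolvConf struct {
	Nameservers []string
	Searches    []string
	Options     []string
}

// parseResolvConf extracts nameserver, search, and options directives.
// Comment lines and unknown directives are ignored.
func parseResolvConf(text string) resolvConf {
	var rc resolvConf
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		switch fields[0] {
		case "nameserver":
			rc.Nameservers = append(rc.Nameservers, fields[1])
		case "search":
			rc.Searches = fields[1:]
		case "options":
			rc.Options = append(rc.Options, fields[1:]...)
		}
	}
	return rc
}

// hasOption reports whether the named option appears in the options list.
func hasOption(rc resolvConf, name string) bool {
	for _, opt := range rc.Options {
		if opt == name {
			return true
		}
	}
	return false
}

// looksMutated applies the assumed criteria for a rewritten pod.
func looksMutated(rc resolvConf) bool {
	return len(rc.Nameservers) == 3 && hasOption(rc, "edns0") && hasOption(rc, "rotate")
}

func main() {
	// Sample content with placeholder addresses; in practice this would be
	// the output of the kubectl exec command above.
	sample := "search default.svc.cluster.local svc.cluster.local cluster.local\n" +
		"nameserver 2001:db8::1\n" +
		"nameserver 2001:db8::2\n" +
		"nameserver 2001:db8::4\n" +
		"options edns0 rotate ndots:5\n"
	fmt.Println("mutated:", looksMutated(parseResolvConf(sample)))
}
```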
Deployment
Managed by ArgoCD. Manifests live in the fritzlab/apps repo under `sjc001/kube-system/dns-webhook/manifests/`.
apps/sjc001/kube-system/dns-webhook/
├── app.yaml                  # ArgoCD Application
└── manifests/
    ├── deployment.yaml       # Webhook pods (2 replicas, dnsPolicy: Default)
    ├── issuer.yaml           # cert-manager: selfSigned → CA → leaf cert
    ├── service.yaml          # ClusterIP Service on :443 → pod :8443
    ├── serviceaccount.yaml
    └── webhook.yaml          # MutatingWebhookConfiguration
The deployment.yaml image tag (code.fritzlab.net/fritzlab/dns-webhook:<run_number>) must be updated whenever a new image is built. CI in this repo produces the image; update the tag in apps to deploy.
Development
# Build locally
go build ./...
# Run tests (none yet — the mutation logic is straightforward enough that
# end-to-end verification via a test pod is more useful)
go test ./...
# Build container image
docker build -t dns-webhook:local .
Design notes
- `dnsPolicy: Default` on the webhook pods themselves: avoids a circular dependency. If cluster DNS is disrupted, the webhook pods can still start because they use the node's `/etc/resolv.conf` directly.
- `failurePolicy: Ignore`: if the webhook is unavailable, pods are admitted without mutation rather than being blocked. Availability of workloads takes priority over DNS load balancing.
- `imagePullPolicy: IfNotPresent`: if cluster DNS is down at pod start time, the image pull (which needs DNS to reach the registry) would fail. This policy uses the locally cached image instead.
- ClusterIP service (not headless): webhook calls are short-lived HTTP requests, so the keepalive starvation problem that affects long-lived connections doesn't apply here. A stable VIP is the conventional pattern for webhook services.
- Static nameserver IPs: the auth-dns pods use `cni.projectcalico.org/ipAddrs` to pin their Calico-allocated IPv6 addresses across restarts, making them safe to hardcode here.