webhook/README.md
Donavan Fritz 01e4b58c91
Initial commit: dns-webhook MutatingAdmissionWebhook
Rewrites dnsPolicy+dnsConfig on ClusterFirst pods to distribute
queries across 3 randomly-selected auth-dns nameservers with
edns0/rotate/ndots:5. Includes Gitea CI workflow and README.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 17:14:56 -05:00


dns-webhook

A Kubernetes MutatingAdmissionWebhook that rewrites the DNS configuration of newly created pods before they start. Only pods using the default dnsPolicy: ClusterFirst are changed.

What it does

When a pod is created with the default dnsPolicy: ClusterFirst, this webhook intercepts the request and:

  1. Picks 3 random nameservers from the pool of 4 production auth-dns pods (ns1–ns4). This distributes DNS query load instead of every pod always hitting the same two servers.
  2. Sets search domains appropriate for the pod's namespace so short service names resolve correctly.
  3. Enables edns0 — allows DNS responses larger than 512 bytes (needed for DNSSEC and large TXT records).
  4. Enables rotate — cycles through nameservers on each query for even load distribution.
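
The four steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the webhook's actual source: the names `authDNSPool`, `pickNameservers`, and `searchDomains` are hypothetical, the addresses are documentation-range placeholders rather than the real ns1–ns4 IPs, and the search list assumes the cluster domain is `cluster.local`.

```go
package main

import (
	"fmt"
	"math/rand"
)

// authDNSPool is a placeholder pool standing in for ns1-ns4.
// These are documentation-range addresses, not the real server IPs.
var authDNSPool = []string{
	"2001:db8::1", // ns1 (hypothetical)
	"2001:db8::2", // ns2 (hypothetical)
	"2001:db8::3", // ns3 (hypothetical)
	"2001:db8::4", // ns4 (hypothetical)
}

// pickNameservers returns n distinct entries from pool in random order.
// It copies the pool first so the shared slice is never reordered.
func pickNameservers(pool []string, n int) []string {
	if n > len(pool) {
		n = len(pool)
	}
	shuffled := append([]string(nil), pool...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled[:n]
}

// searchDomains builds the usual ClusterFirst-style search list for a
// namespace, assuming the cluster domain is "cluster.local".
func searchDomains(namespace string) []string {
	return []string{
		namespace + ".svc.cluster.local",
		"svc.cluster.local",
		"cluster.local",
	}
}

func main() {
	fmt.Println("nameservers:", pickNameservers(authDNSPool, 3))
	fmt.Println("searches:   ", searchDomains("default"))
	fmt.Println("options:    ", []string{"edns0", "rotate"})
}
```

Kubernetes accepts at most three nameservers in a pod's dnsConfig, so picking 3 of 4 uses every available slot while still varying the set from pod to pod.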

Pods that opt out (dnsPolicy: None, Default, or ClusterFirstWithHostNet) are passed through unchanged.
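
The opt-out rule could be expressed as a small predicate. The function name `shouldMutate` is hypothetical, and treating an empty policy as ClusterFirst is an assumption made here for safety, not something the README states.

```go
package main

import "fmt"

// shouldMutate reports whether a pod with the given dnsPolicy should be
// rewritten. Only ClusterFirst qualifies. An empty value is treated as
// ClusterFirst because that is the Kubernetes default (an assumption
// about how an unset policy should be handled).
func shouldMutate(dnsPolicy string) bool {
	switch dnsPolicy {
	case "", "ClusterFirst":
		return true
	default: // None, Default, ClusterFirstWithHostNet
		return false
	}
}

func main() {
	for _, p := range []string{"ClusterFirst", "None", "Default", "ClusterFirstWithHostNet"} {
		fmt.Printf("%-24s mutate=%v\n", p, shouldMutate(p))
	}
}
```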

Architecture

kubelet / kubectl apply
       │
       ▼
Kubernetes API server
       │  (Pod CREATE request)
       ▼
MutatingAdmissionWebhook ──► dns-webhook pod (this service)
       │                          │
       │  ◄── JSON Patch ─────────┘
       │       replace dnsPolicy → None
       │       add dnsConfig { nameservers, searches, options }
       ▼
Pod stored with rewritten DNS config

The webhook runs as a Deployment in kube-system and is registered via a MutatingWebhookConfiguration. cert-manager issues the TLS certificate; its cainjector populates the caBundle field automatically.

Logs

Key log lines to watch for during debugging:

Prefix                                                           Meaning
dns-webhook starting: cert=... key=...                           Server startup; confirms TLS paths
MUTATE pod=<ns>/<name> uid=... nameservers=[...] op=add|replace  Pod was mutated; shows which nameservers were assigned
SKIP pod=<ns>/<name> uid=... policy=<policy>                     Pod was not mutated; shows why (non-ClusterFirst policy)
ERROR ...                                                        Decode/encode failures; should never appear in normal operation

# Stream logs from all webhook replicas
kubectl --context sjc001 logs -n kube-system -l app=dns-webhook -f

# Verify a running pod received the correct DNS config
kubectl --context sjc001 exec -n <namespace> <pod> -- cat /etc/resolv.conf
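
When checking many pods, the resolv.conf output can be verified mechanically instead of by eye. The sketch below is a hypothetical helper, not part of this repo: it parses resolv.conf text and reports whether it has three nameservers plus the edns0 and rotate options. The sample file contents and addresses are invented for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// resolvConf holds the parts of /etc/resolv.conf this check cares about.
type resolvConf struct {
	Nameservers []string
	Options     []string
}

// parseResolvConf extracts nameserver and options entries from resolv.conf text.
// Other directives (search, comments, blank lines) are ignored.
func parseResolvConf(text string) resolvConf {
	var rc resolvConf
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		switch fields[0] {
		case "nameserver":
			rc.Nameservers = append(rc.Nameservers, fields[1])
		case "options":
			rc.Options = append(rc.Options, fields[1:]...)
		}
	}
	return rc
}

// looksMutated reports whether the file has exactly three nameservers and
// both the edns0 and rotate options.
func looksMutated(rc resolvConf) bool {
	has := map[string]bool{}
	for _, o := range rc.Options {
		has[o] = true
	}
	return len(rc.Nameservers) == 3 && has["edns0"] && has["rotate"]
}

func main() {
	// Invented sample; real output comes from the kubectl exec command above.
	sample := `search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 2001:db8::1
nameserver 2001:db8::3
nameserver 2001:db8::4
options edns0 rotate
`
	fmt.Println("mutated:", looksMutated(parseResolvConf(sample)))
}
```

To use it against a live pod, the sample string could be replaced with the contents of standard input and the kubectl exec output piped in.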

Deployment

Managed by ArgoCD. Manifests live in the fritzlab/apps repo under sjc001/kube-system/dns-webhook/manifests/.

apps/sjc001/kube-system/dns-webhook/
├── app.yaml                  # ArgoCD Application
└── manifests/
    ├── deployment.yaml       # Webhook pods (2 replicas, dnsPolicy: Default)
    ├── issuer.yaml           # cert-manager: selfSigned → CA → leaf cert
    ├── service.yaml          # ClusterIP Service on :443 → pod :8443
    ├── serviceaccount.yaml
    └── webhook.yaml          # MutatingWebhookConfiguration

The deployment.yaml image tag (code.fritzlab.net/fritzlab/dns-webhook:<run_number>) must be updated whenever a new image is built. CI in this repo produces the image; update the tag in apps to deploy.

Development

# Build locally
go build ./...

# Run tests (none yet — the mutation logic is straightforward enough that
# end-to-end verification via a test pod is more useful)
go test ./...

# Build container image
docker build -t dns-webhook:local .

Design notes

  • dnsPolicy: Default on the webhook pods themselves: avoids a circular dependency — if cluster DNS is disrupted, the webhook pods can still start because they use the node's /etc/resolv.conf directly.
  • failurePolicy: Ignore: if the webhook is unavailable, pods are admitted without mutation rather than being blocked. Availability of workloads takes priority over DNS load balancing.
  • imagePullPolicy: IfNotPresent: an image pull needs DNS to reach the registry, so it would fail if cluster DNS were down at pod start time. With this policy the kubelet skips the pull whenever the image is already cached on the node.
  • ClusterIP service (not headless): webhook calls are short-lived HTTP requests — the keepalive starvation problem that affects long-lived connections doesn't apply here. A stable VIP is the conventional pattern for webhook services.
  • Static nameserver IPs: the auth-dns pods use cni.projectcalico.org/ipAddrs to pin their Calico-allocated IPv6 addresses across restarts, making them safe to hardcode here.