Compare commits

23 Commits

Author SHA1 Message Date
Donavan Fritz 580b9afa33 ci: push image to fritzlab-public org
flock / release (push) Successful in 47m37s
This repo was transferred from fritzlab to fritzlab-public so the container
package's anonymous-pull access (governed by org visibility in Gitea 1.26.1)
remains open after the rest of fritzlab/* flips to limited.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 13:58:56 -05:00
Donavan Fritz 8d6e50c980 deploy: catch-all toleration so DS schedules on not-ready nodes
flock / release (push) Successful in 45m40s
Replaces the explicit toleration list with `operator: Exists`. The previous
list lacked node.kubernetes.io/not-ready:NoSchedule, so during a fresh
control-plane join the CNI agent couldn't schedule until the node became
Ready — but the node can't become Ready without the CNI. Surfaced during
host001/host002 PERC migration rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 09:35:27 -05:00
Donavan Fritz 3d0081780c ci: migrate to action/ org composite actions
flock / release (push) Successful in 3m4s
2026-05-06 08:14:35 -05:00
Donavan Fritz 9b777ca7d1 bird: per-peer import filter rejects connected subnet
Build flock Image / build (push) Successful in 2m17s
Without a filter, crt001's `network 2602:817:3000:A25::/64` gets
re-advertised to every peer on that subnet. bird installs the BGP /64
with metric 32, beating the kernel-connected route at 256, and all
inter-host VLAN-25 traffic hairpins through the gateway — losing PMTU
9000 and ~30x throughput. Broke Plex 2026-05-04: NFS to nas002 capped
at 7 MB/s, jumbo blackholed.

Add LocalSubnetV6/V4 (CIDR) to NodeBGP. Agent populates by masking the
peer's address to /64 (v6) or /24 (v4) — same fritzlab convention
already in localAddrSameSubnet. Render emits `import where net !=
<subnet>;` per BGP channel when set, falls back to `import all;`
otherwise so existing tests stay green.
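
The masking and filter selection described above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: `localSubnet` and `importFilter` are hypothetical names, and the peer addresses in `main` are made up.

```go
package main

import (
	"fmt"
	"net/netip"
)

// localSubnet masks a peer address to the conventional subnet size
// (/64 for IPv6, /24 for IPv4). Hypothetical helper for illustration.
func localSubnet(peer string) (netip.Prefix, error) {
	addr, err := netip.ParseAddr(peer)
	if err != nil {
		return netip.Prefix{}, err
	}
	addr = addr.Unmap() // treat IPv4-mapped IPv6 as plain IPv4
	bits := 64
	if addr.Is4() {
		bits = 24
	}
	// Addr.Prefix zeroes the host bits and returns the masked prefix.
	return addr.Prefix(bits)
}

// importFilter renders the per-channel import line: reject the connected
// subnet when one is known, otherwise keep the permissive default.
func importFilter(subnet netip.Prefix) string {
	if !subnet.IsValid() {
		return "import all;"
	}
	return fmt.Sprintf("import where net != %s;", subnet)
}

func main() {
	for _, peer := range []string{"2602:817:3000:a25::1", "192.0.2.17"} {
		subnet, err := localSubnet(peer)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(importFilter(subnet))
	}
}
```

Falling back to `import all;` when no subnet is set mirrors the commit's note that existing tests keep passing.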

Defence in depth: with the matching outbound route-map on crt001
(ROUTE_MAP_CLUSTER_OUT_V{4,6}) the agent now refuses the leak on its
own if the router filter ever drifts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:03:59 -05:00
Donavan Fritz a17d33e182 agent: addresses annotation replaces IPAM allocation
Build flock Image / build (push) Successful in 5m27s
When flock.fritzlab.net/addresses provides a v6 or v4, the IP becomes
the pod's primary IP for that family — bound to eth0, default route off
it, on-link host route via setHostRoute, and a per-pod /128 or /32 in
BGP. IPAM no longer allocates a private IP alongside it. The pod ends up
with exactly the operator-supplied addresses on eth0 (plus any extras
beyond the first-of-family, which keep the pre-existing layered
behavior).

This is the fix the original addresses-annotation work missed: bug #1
allocated a private IP next to the public one (so VPN-routed clients
could land on the private path on Plex). Promoting addresses-supplied
IPs into the IPAM-style routing slot keeps the public IP as the only
primary IP visible from outside.

Three pieces:
- annotations.go: reject pods whose addresses/anycast IP family is
  disabled (ipv6/ipv4 annotation or NodeConfig default). Both annotation
  types rely on the family being enabled for return-path routing.
- handlers.go: peel first v6 + first v4 from Addresses into res.IP6/IP4;
  suppress IPAM for those families; skip IPAM call entirely if both
  families are addresses-supplied.
- anycast_linux.go: extend renderBird to advertise any IPAM IP that's
  outside the node's BGP aggregate as a per-pod /32 or /128. This is
  what makes 142.202.202.166 reachable when host004's pod CIDR is
  172.25.214.0/24 — the addresses-promoted IP isn't covered by the
  aggregate.

Tests: 7 new annotation tests covering the conflict cases (ipv4=false +
addresses-v4, NodeConfig default + addresses-v4, etc.) plus 5 unit tests
for the splitAddressesPrimary helper.

README updated with the addresses-replaces-IPAM behavior, the
addresses-vs-anycast comparison, the conflict rule, and a Plex-style
example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 09:46:48 -05:00
Donavan Fritz 40e13037b5 agent: revert CNI result addresses inclusion; document k8s limit
Build flock Image / build (push) Successful in 1m36s
Kubernetes limits pod.status.podIPs to one IPv4 + one IPv6 per pod.
Additional IPs in the CNI result are silently dropped by kubelet, making
the resultFromAllocation change in 4a60c00 a no-op. Revert it and add
a comment documenting the constraint so the intent is clear.

Addresses IPs remain fully functional: bound to eth0, advertised via
BGP, visible inside the pod — just not reflected in pod status.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 18:37:05 -05:00
Donavan Fritz 4a60c004c3 agent: include addresses IPs in CNI result
Build flock Image / build (push) Successful in 1m37s
resultFromAllocation now appends Addresses entries to the CNI result so
they appear in pod.status.podIPs. Kubernetes and workloads that inspect
pod metadata (e.g. Plex remote-access detection) see the public IPs
alongside the IPAM-allocated ones.

Anycast IPs are intentionally excluded — they're shared across replicas
and must not appear as per-pod IPs in Kubernetes.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 18:11:17 -05:00
Donavan Fritz 2daa2a21f3 agent: add flock.fritzlab.net/addresses annotation (eth0 static IPs)
Build flock Image / build (push) Successful in 3m23s
Like anycast, addresses IPs are advertised via BGP (/128+/32) and get
host routes via the AnycastReconciler. The sole difference: they are
assigned to pod eth0 instead of lo, so workloads that inspect their
primary interface (e.g. Plex remote-access detection) see the public IP
directly.

- annotations.go: annAddresses const, Addresses []net.IP in ParsedAnnotations
- state.go: Addresses []string persisted in allocations.json
- anycast.go: resolveAnycastTargets processes Anycast+Addresses together
- netns_linux.go: configurePodSide assigns Addresses to eth0
- netns_stub.go: mirror Addresses field for non-Linux builds
- handlers.go: thread Addresses through ADD path

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 17:50:49 -05:00
Donavan Fritz 362a1e01ce ci: trigger dispatch after scheduler reset
Build flock Image / build (push) Successful in 1m56s
2026-04-26 17:53:55 -05:00
Donavan Fritz 222006240c ci: use fritzlab/build-image@v1
Build flock Image / build (push) Has been cancelled
Replaces inline docker login + metadata + build-push + tag-cleanup
with the shared build-image composite action. Standardizes on
CI_BOT_TOKEN (drops REGISTRY_PASSWORD).
2026-04-26 09:32:46 -05:00
Donavan Fritz e00579f7ca nodecondition: SSA the NetworkUnavailable condition (don't merge-patch)
Build flock Image / build (push) Has been cancelled
The previous implementation used JSON merge-patch (types.MergePatchType)
with a one-element conditions array. JSON merge-patch on arrays is
whole-array replacement, so every 60s flock-agent stomped over the
kubelet-managed conditions (Ready, MemoryPressure, DiskPressure,
PIDPressure), leaving only NetworkUnavailable on the node — until
kubelet's next status post (~5s later) re-set them.

Symptom: `kubectl get nodes` flickered, with one node briefly showing
Unknown each polling tick. k9s lit up red on rotating nodes. (kube-
controller-manager is also a write contender and was correctly noted
in the field-managers list.)

Switch to Server-Side Apply against the status subresource with
fieldManager=flock-agent and Force=true. NodeStatus.Conditions is a
listType=map keyed by `type`, so SSA merges by type — we declare
ownership of only the NetworkUnavailable entry and leave kubelet's
entries untouched. Force lets us reclaim the condition if a previous
CNI manager (e.g. calico-node finalizer leftovers) still owns it.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-26 08:55:03 -05:00
Donavan Fritz a6a50fd73f ci: retrigger build (run #685 + #686 hit transient github.com timeout / cancellation)
Build flock Image / build (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 22:56:33 -05:00
Donavan Fritz c61b12204c anycast: drop pods from nexthop set on DeletionTimestamp
Build flock Image / build (push) Has been cancelled
Previously the AnycastReconciler kept a pod in the nexthop set as long as
its PodReady condition was True. During a rolling restart that produces a
window after kubelet has accepted SIGTERM (DeletionTimestamp set, pod
still Ready until probes observe shutdown) where BGP still advertises a
path through the dying pod's veth — in-flight requests get RST'd when
the container actually exits.

Fix: introduce podAnycastEligible(pod) = !DeletionTimestamp && Ready,
swap it in at the AnycastReconciler's isReady callback, and fire the
ready-change callback when DeletionTimestamp transitions (the informer
UpdateFunc previously only fired on Ready transitions).
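
A minimal Go sketch of the eligibility rule, using a stand-in struct instead of the real Kubernetes Pod type. The `pod` struct and `eligibilityChanged` are assumptions made for illustration; only the predicate `!DeletionTimestamp && Ready` comes from the commit message.

```go
package main

import "fmt"

// pod is a stand-in for the two Pod fields that matter here.
type pod struct {
	Deleting bool // true once metadata.deletionTimestamp is set
	Ready    bool // true while the PodReady condition is True
}

// podAnycastEligible mirrors the rule in the commit message:
// a pod is a valid nexthop only if it is Ready and not being deleted.
func podAnycastEligible(p pod) bool {
	return !p.Deleting && p.Ready
}

// eligibilityChanged reports whether an informer update should fire the
// ready-change callback. Hypothetical helper: it covers both Ready and
// DeletionTimestamp transitions by comparing the combined predicate.
func eligibilityChanged(old, cur pod) bool {
	return podAnycastEligible(old) != podAnycastEligible(cur)
}

func main() {
	before := pod{Ready: true}
	after := pod{Ready: true, Deleting: true} // SIGTERM accepted, probes still passing
	fmt.Println(podAnycastEligible(before), podAnycastEligible(after))
	fmt.Println(eligibilityChanged(before, after))
}
```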

Result: as soon as the apiserver marks a pod for deletion, the
reconciler withdraws the local nexthop and BIRD reannounces the route
without it. Sibling replicas absorb traffic before the pod's
terminationGracePeriod elapses.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 22:24:50 -05:00
Donavan Fritz e9d3eef2cc netpol: accept established+related at top of every pod chain
Build flock Image / build (push) Has been cancelled
K8s NetworkPolicy applies to the start of new connections; reply
packets for established flows (and ICMP related) must not be matched
against the explicit allow set. The pod ingress chain previously had
only explicit dport allows + a final drop, so any reply to a
pod-initiated outbound where the reply's dport (the ephemeral source
port) wasn't in the allow set got dropped.

Hit in production 2026-04-26: garage's `garage-admin-restrict` NP
allowed dports 3900/80/3901/3903 only. Garage uses kubernetes_discovery
to find peers — outbound to kube-apiserver succeeded, replies returned
to ephemeral source ports, dropped → "Layout not ready" cluster-wide.

Fix: emit `ct state established,related accept` as the first rule in
every pod_<hash>_(ingress|egress) chain. Regression test added.
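
The resulting chain shape can be sketched as below. This is an illustrative renderer, not the repository's `Render`: `renderPodChain`, the TCP-only port list, and the chain name in `main` are assumptions; only the leading `ct state established,related accept` rule and the trailing drop come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// renderPodChain renders one nftables chain for a pod. The conntrack
// accept comes first so replies to pod-initiated flows never reach the
// explicit allow list. Hypothetical helper for illustration.
func renderPodChain(name string, allowTCPPorts []int) string {
	var b strings.Builder
	fmt.Fprintf(&b, "chain %s {\n", name)
	b.WriteString("\tct state established,related accept\n")
	for _, p := range allowTCPPorts {
		fmt.Fprintf(&b, "\ttcp dport %d accept\n", p)
	}
	b.WriteString("\tdrop\n}\n")
	return b.String()
}

func main() {
	// Ports taken from the garage-admin-restrict example above.
	fmt.Print(renderPodChain("pod_0a1b2c3d_ingress", []int{3900, 80, 3901, 3903}))
}
```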

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 22:22:39 -05:00
Donavan Fritz 8dd109866e ci: re-trigger build (runs #682-#683 failed transient github.com timeout)
Build flock Image / build (push) Has been cancelled
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:53:35 -05:00
Donavan Fritz d5161e09d3 deploy: drop fritzlab.net/cni-test toleration
Build flock Image / build (push) Has been cancelled
Migration off Calico is complete; host001/host004 no longer carry
the cni-test taint. The toleration is now dead config.
2026-04-25 11:42:48 -05:00
Donavan Fritz 65b2fb5b17 ip-algo: rename pod field to app; image from pod spec
Build flock Image / build (push) Has been cancelled
The `pod` field hashed pod.Name, which differs per replica because of
the ReplicaSet pod-template-hash + 5-char random suffix. With
namespace,pod,image, all replicas of the same Deployment got distinct
hextets even though they were the same workload.

Replace `pod` with `app` — a stable workload identifier derived from
the controller chain:

  - Deployment → ReplicaSet → Pod: strip the pod-template-hash suffix
    from the RS name (`traefik-789df685f` → `traefik`).
  - StatefulSet/DaemonSet/Job → Pod: use controller name as-is.
  - Bare pod: pod name.
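
The three cases above could be sketched like this. It is a simplified illustration, not the repository's `deriveAppName`: the `owner` struct, the `deriveApp` name, and passing the hash in as a string (assumed to come from the pod's `pod-template-hash` label) are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// owner is a stand-in for a pod's controller owner reference.
// An empty Kind means the pod has no controller (bare pod).
type owner struct {
	Kind string
	Name string
}

// deriveApp returns a workload identifier that is stable across replicas.
// templateHash is assumed to be the pod-template-hash value; when it is
// empty the ReplicaSet name is used unchanged.
func deriveApp(podName string, o owner, templateHash string) string {
	switch o.Kind {
	case "":
		return podName // bare pod
	case "ReplicaSet":
		if templateHash != "" {
			return strings.TrimSuffix(o.Name, "-"+templateHash)
		}
		return o.Name
	default:
		return o.Name // StatefulSet, DaemonSet, Job: controller name as-is
	}
}

func main() {
	rs := owner{Kind: "ReplicaSet", Name: "traefik-789df685f"}
	fmt.Println(deriveApp("traefik-789df685f-x7k2p", rs, "789df685f"))
	fmt.Println(deriveApp("db-0", owner{Kind: "StatefulSet", Name: "db"}, ""))
	fmt.Println(deriveApp("debug-shell", owner{}, ""))
}
```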

Image now comes from pod.Spec.Containers[0].Image (the spec'd
reference). 64-hex-char values are treated as sha256 digests and
parsed as before; everything else (image:tag, short SHA) is FNV-1a-64'd
as a string. This makes `traefik:v3.5` deterministic across replicas
without needing the runtime-resolved digest.
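
A rough Go sketch of the string path, assuming the standard library's `hash/fnv` is an acceptable FNV-1a-64 implementation. `looksLikeSHA256` and `fnv1a64` are hypothetical names, and how a digest is parsed once detected is not shown because the commit only says "as before".

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// looksLikeSHA256 reports whether s is exactly 64 hexadecimal characters.
// Accepting upper-case hex is an assumption of this sketch.
func looksLikeSHA256(s string) bool {
	if len(s) != 64 {
		return false
	}
	for _, c := range s {
		switch {
		case c >= '0' && c <= '9', c >= 'a' && c <= 'f', c >= 'A' && c <= 'F':
		default:
			return false
		}
	}
	return true
}

// fnv1a64 hashes an arbitrary image reference with 64-bit FNV-1a.
func fnv1a64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

func main() {
	for _, ref := range []string{"traefik:v3.5", "traefik:v3.6"} {
		fmt.Printf("%-14s digest=%v fnv=%016x\n", ref, looksLikeSHA256(ref), fnv1a64(ref))
	}
}
```

Because the input is the spec'd reference rather than the runtime-resolved digest, every replica that shares a pod template hashes the same string.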

Net effect: namespace,app,image yields identical hextets across all
replicas of the same Deployment except the trailing random N nibble.

embed.Values.Pod → App; AllocRequest.Pod kept for log context only,
new App and Image fields drive the embed call. handlers.go computes
both via deriveAppName + podImageRef helpers.

Tests: 7 new TestDeriveAppName_* cases (Deploy/STS/DS/bare/RS-without-
hash/non-controller-owner) + TestPodImageRef. Existing fuzz seeds
updated for the new keyword.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:42:06 -05:00
Donavan Fritz c860e9351b ip-algo: pod annotation > NodeConfig annotation > random
Build flock Image / build (push) Has been cancelled
Add flock.fritzlab.net/ip-algo as a node-wide default via NodeConfig
metadata.annotations. Pod-level annotation still wins. Empty, missing,
or invalid input at either level falls through to the next; invalid
values warn-log via the agent's slog. Both unset → fully random IID
(unchanged baseline).
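
The precedence chain might look like the following sketch. It is not the repository's `ResolveIPAlgo`: `parseAlgo` and `resolveAlgo` are hypothetical names, the keyword set reflects this commit (before the later `pod` to `app` rename), field order is not checked, and the warn-logging of invalid values is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAlgo validates a comma-separated ip-algo value. It returns false
// for empty input, unknown keywords, and duplicates, so the caller can
// fall through to the next level.
func parseAlgo(raw string) ([]string, bool) {
	raw = strings.TrimSpace(raw)
	if raw == "" {
		return nil, false
	}
	valid := map[string]bool{"namespace": true, "pod": true, "image": true}
	seen := map[string]bool{}
	var out []string
	for _, f := range strings.Split(raw, ",") {
		f = strings.TrimSpace(f)
		if !valid[f] || seen[f] {
			return nil, false
		}
		seen[f] = true
		out = append(out, f)
	}
	return out, true
}

// resolveAlgo applies pod annotation > NodeConfig annotation > random.
// A nil result means "fully random IID". Reading a nil map is safe in Go.
func resolveAlgo(podAnn, nodeAnn map[string]string) []string {
	const key = "flock.fritzlab.net/ip-algo"
	for _, anns := range []map[string]string{podAnn, nodeAnn} {
		if fields, ok := parseAlgo(anns[key]); ok {
			return fields
		}
	}
	return nil
}

func main() {
	node := map[string]string{"flock.fritzlab.net/ip-algo": "namespace,image"}
	fmt.Println(resolveAlgo(nil, node))
	fmt.Println(resolveAlgo(map[string]string{"flock.fritzlab.net/ip-algo": "pod"}, node))
}
```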

ParseAnnotations no longer touches ip-algo; ResolveIPAlgo handles the
full precedence chain, called from PodHandler.Add with the cached
NodeConfig's annotations and the agent logger.

Tests: 9 new TestResolveIPAlgo_* cases covering pod-wins, all
fall-through paths, both-absent, nil node map, whitespace, and
duplicate-as-invalid. Fuzz target rebuilt without ip-algo input space
(now exercised by ResolveIPAlgo unit tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:09:09 -05:00
Donavan Fritz a6202a36bd defaults: built-in baseline is dual-stack (IPv6 + IPv4), not IPv6-only
Build flock Image / build (push) Has been cancelled
BuiltinFamilyDefaults() now returns {WantV6: true, WantV4: true}. Pods
that want a single family explicitly opt out via the
flock.fritzlab.net/ipv4 (or ipv6) annotation, or the operator narrows
the default at the node level via NodeConfig.Spec.Defaults.

Annotation precedence is unchanged: pod annotation > NodeConfig defaults
> built-in baseline. Tests updated to reflect the new baseline; the
"opt out of v4" path now has explicit coverage.

Docs updated:
  - NodeConfig.Spec.Defaults Go doc + CRD descriptions reflect the new
    baseline and its overrides
  - README opening framing softened from "IPv6-first" to "dual-stack,
    IPv6-friendly"; example pods + spec.defaults table flipped to
    treat dual-stack as the default and v6/v4-only as overrides
  - README NetworkPolicy line in the comparison table flipped to
    "yes (nftables)" since v1 enforcement shipped
  - Limitations note about IPv4-only destinations rewritten — every
    pod has v4 by default now, so the question is whether your IPv4
    pool is routable beyond your network

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 10:07:48 -05:00
Donavan Fritz a7dc7bf1f4 anycast: kernel multipath route + L4 hash for multi-pod-per-node
Build flock Image / build (push) Has been cancelled
Move pure resolver logic out of anycast_linux.go into anycast.go so it's
unit-testable on any host. Reshape anycastTarget from a single
{hostIface, via} into a sorted list of nexthops; multiple Ready pods on
the same node binding the same anycast IP now contribute one nexthop
each.

installAnycastRoute uses RTA_MULTIPATH (via netlink.Route.MultiPath)
when the target has more than one nexthop. Single-nexthop targets keep
the simple via-route shape so 1-pod-per-node keeps rendering identically
to today's production form in `ip route show`.

flock-agent writes net.ipv{4,6}.fib_multipath_hash_policy = 1 at
startup so the kernel hashes flows on (saddr, daddr, sport, dport, proto)
rather than just IPs. Best-effort — runs privileged in production, so
it works; falls back to L3 hash on environments where the write fails
(only matters for the multi-pod-per-node case anyway).

resolveAnycastTargets sorts nexthops by canonical(via) for stable
comparison so a quiet reconcile pass doesn't churn the kernel route.

8 new unit tests cover: 1-pod, 2-pods-same-anycast (multi-nexthop),
NotReady drop, no-Ready omits the IP, pending skipped, mixed v6+v4,
family mismatch warns, determinism.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 09:57:32 -05:00
Donavan Fritz 5d9b6bfeec netpol: anchor base-chain jump on veth only, not pod IP
Build flock Image / build (push) Has been cancelled
The previous base-chain jump matched iifname/oifname AND saddr/daddr ==
pod eth0 IP. Anycast traffic has the anycast IP as daddr, not the pod's
eth0 unicast — so anycast packets skipped the policy chain entirely and
fell through to the forward chain's policy=accept.

The veth uniquely belongs to one pod. Anything traversing it is to or
from that pod by definition (anycast, unicast, future overlay routes).
Match on iifname/oifname alone; let the pod-side chain's accept lines +
trailing drop be the policy.

Validated end-to-end on host001: anycast nginx pod with default-deny
ingress NetPol now correctly drops traffic from any peer; adding an
allow-from-podSelector rule unblocks only the matched peer.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 09:32:08 -05:00
Donavan Fritz 39ede9130b netpol: NetworkPolicy v1 enforcement via nftables
Build flock Image / build (push) Has been cancelled
New pkg/agent/netpol implementing standard networking.k8s.io/v1
NetworkPolicy. Pipeline:

  pods + policies + namespaces  →  Translate  →  Render  →  Apply

Supports ingress + egress, all three peer types (podSelector,
namespaceSelector, ipBlock with except), numeric ports + port ranges,
default-deny semantics derived from PolicyTypes (or inferred from
non-empty Spec.Egress when unset).

Apply path is `nft -f -` shell-out — single transaction, atomic, kernel
guarantees partial-failure rollback. Idempotent dedup via last-applied
script. Reconcile triggers: informer events, 30s self-heal tick, every
CNI ADD/DEL.

Verified against the three live cluster NetPols (calico-apiserver,
remote-proxies/lodge-home-assistant, storage/garage-admin-restrict).
Fuzz target stitches Translate + Render with random selector and peer
inputs; 21 unit tests cover the policy semantics.

Named ports skip with a warn — deferred until kubelet exposes them in a
form that doesn't require shadowing pod state.

Dockerfile: + nftables.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 09:25:58 -05:00
Donavan Fritz 71e584cf96 NodeConfig defaults + code-quality pass + fuzz tests + README
NodeConfig.Spec.Defaults adds per-node IPv6/IPv4 family defaults that pod
annotations can override; built-in baseline (v6=true, v4=false) still
applies when the field is omitted.

bird.Render now validates every operator-supplied value (peer addresses,
CIDRs, anycast IPs, source addresses) before templating — fuzz found a
peer address containing `}` produced unbalanced braces in bird.conf.
Failing input preserved as a regression seed.

Fuzz targets added for ParseAnnotations, ParseCNIArgs, HostIfaceName,
canonical, IPAM allocate sequences, embed.Embed, and bird.Render.
Hardened canonical/ipToU32 against nil and non-IPv4 inputs.

README rewritten for outside readers — quickstart, NodeConfig + annotation
reference with worked examples, anycast use cases, comparison vs Calico
and Cilium, requirements, limitations.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 09:25:45 -05:00
51 changed files with 5915 additions and 348 deletions
+14 -45
@@ -1,55 +1,24 @@
-name: Build flock Image
+name: flock
 on:
   push:
     branches: [main]
 jobs:
-  build:
+  release:
     runs-on: fritzlab
     steps:
-      - name: Check out repo
-        uses: actions/checkout@v4
-      - name: Log in to Gitea registry
-        uses: docker/login-action@v3
+      - uses: actions/checkout@v4
+      - uses: https://code.fritzlab.net/action/image-build@v1
         with:
-          registry: code.fritzlab.net
-          username: ci-bot
-          password: ${{ secrets.REGISTRY_PASSWORD }}
+          image: code.fritzlab.net/fritzlab-public/flock
+          build-args: GIT_SHA=${{ github.sha }}
+          smoke-test: |
+            docker run --rm $IMAGE --help || true
+            docker run --rm --entrypoint /usr/local/bin/flock $IMAGE || true
-      - name: Extract Docker metadata
-        id: meta
-        uses: docker/metadata-action@v5
+      - uses: https://code.fritzlab.net/action/image-push@v1
         with:
-          images: code.fritzlab.net/fritzlab/flock
-          tags: |
-            type=raw,value=latest
-            type=raw,value=${{ github.run_number }}
+          image: code.fritzlab.net/fritzlab-public/flock
+          token: ${{ secrets.CI_BOT_TOKEN }}
+          org: fritzlab-public
+          name: flock
-      - name: Build and push
-        uses: docker/build-push-action@v6
-        with:
-          context: .
-          push: true
-          provenance: false
-          build-args: |
-            GIT_SHA=${{ github.sha }}
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
-          network: host
-      - name: Smoke-test image
-        run: |
-          docker run --rm code.fritzlab.net/fritzlab/flock:${{ github.run_number }} --help || true
-          docker run --rm --entrypoint /usr/local/bin/flock \
-            code.fritzlab.net/fritzlab/flock:${{ github.run_number }} || true
-      - name: Clean up old image tags
-        run: |
-          tea login add --name ci --url https://code.fritzlab.net --token '${{ secrets.CI_BOT_TOKEN }}' --no-version-check
-          tea api '/packages/fritzlab?type=container' \
-            | jq -r '.[] | select(.name=="flock") | select(.version | test("^[0-9]+$")) | .version' \
-            | sort -n | head -n -3 \
-            | while read tag; do
-                echo "deleting flock:$tag"
-                tea api -X DELETE "/packages/fritzlab/container/flock/$tag"
-              done
+16
@@ -0,0 +1,16 @@
name: flock PR validation
on:
pull_request:
branches: [main]
jobs:
validate:
runs-on: fritzlab
steps:
- uses: actions/checkout@v4
- uses: https://code.fritzlab.net/action/image-build@v1
with:
image: code.fritzlab.net/fritzlab/flock
build-args: GIT_SHA=${{ github.sha }}
smoke-test: |
docker run --rm $IMAGE --help || true
docker run --rm --entrypoint /usr/local/bin/flock $IMAGE || true
+1 -1
@@ -21,7 +21,7 @@ RUN CGO_ENABLED=0 go build -trimpath \
     -o /out/flock-installer ./cmd/flock-installer
 FROM alpine:3.21
-RUN apk add --no-cache iproute2 bird ca-certificates
+RUN apk add --no-cache iproute2 bird nftables ca-certificates
 COPY --from=build /out/flock /usr/local/bin/flock
 COPY --from=build /out/flock-agent /usr/local/bin/flock-agent
 COPY --from=build /out/flock-installer /usr/local/bin/flock-installer
+389 -13
@@ -1,22 +1,398 @@
 # flock
-Kubernetes CNI for sjc001. Per-pod IPv4 opt-in, IID embedding, Ready-gated anycast via BGP.
-Design doc: `k8s-manager/dfritz-cni.md` (in the operator's k8s-manager repo).
+A small, opinionated Kubernetes CNI built around three ideas:
+1. **Dual-stack, IPv6-friendly.** Every pod gets a globally routable IPv6
+address by default. IPv4 is also enabled by default; either family can
+be turned off per-node or per-pod when you really mean to.
+2. **No tunnels, no NAT.** Pod addresses are the real packets on the wire.
+Each node speaks BGP to its upstream router and advertises its own
+per-node prefix. The pod network is just the LAN, plus host routes.
+3. **Anycast as a primitive.** A pod can request an anycast address via
+an annotation; flock binds it on the pod's loopback and advertises a
+`/128` (or `/32`) over BGP, but only while the pod is `Ready`. Multiple
+replicas advertise the same address from different nodes for ECMP load
+balancing without a separate Service or external LB.
-Status: M1 scaffold. Not functional. See milestones table in the design doc.
+flock is built for clusters where every node already speaks BGP to one
+or more upstream routers. It deliberately leaves out features you'd
+expect from a general-purpose CNI — overlays, IPsec/Wireguard, IPAM
+coordination across nodes, kube-proxy integration — so the moving parts
+that remain are easy to reason about.
-## Layout
+> **Status:** alpha. CRD shape and annotation keys may still change.
-- `cmd/flock` — CNI plugin binary (kubelet-invoked)
-- `cmd/flock-agent` — DaemonSet binary
-- `pkg/api/v1alpha1` — `NodeConfig` CRD types
-- `pkg/cni` — CNI plugin internals + RPC client
-- `pkg/agent` — agent server, IPAM, state file, anycast, NetworkPolicy
-- `pkg/embed` — `ip-algo` IID embedding (pure)
-- `pkg/routing/{bird,ospf}` — routing backends
-- `deploy/` — CRDs, RBAC, DaemonSet manifests
+## Table of contents
+- [How it works](#how-it-works)
+- [Requirements](#requirements)
+- [Quickstart](#quickstart)
+- [NodeConfig CRD](#nodeconfig-crd)
+- [Pod annotations](#pod-annotations)
+- [Use cases](#use-cases)
+- [Comparison vs Calico / Cilium](#comparison-vs-calico--cilium)
+- [Limitations and non-goals](#limitations-and-non-goals)
+- [Building and testing](#building-and-testing)
+- [License](#license)
## How it works
Each node runs a single `flock-agent` DaemonSet pod with three containers:
- a privileged init container (`flock-installer`) that drops the CNI
plugin binary into `/opt/cni/bin/flock` and writes
`/etc/cni/net.d/01-flock.conflist`,
- the agent itself, which owns IPAM, programs veth pairs, and tracks
pod readiness, and
- a [BIRD2](https://bird.network.cz/) sidecar that the agent re-renders
and reloads when the per-node config or the active anycast set changes.
Each node has a `NodeConfig` CR (cluster-scoped, name = node name) that
declares its IPv6 and IPv4 prefixes, its local BGP ASN, and its upstream
peers. The agent reads the CR via a dynamic informer.
When kubelet runs the CNI plugin on `ADD`, the plugin opens a unix-socket
RPC to the agent. The agent allocates an address from the per-node
CIDRs, creates a veth pair, configures the pod side, persists the
allocation to `/var/lib/flock/allocations.json`, and returns the result.
There is no controller loop and no IPAM coordination across nodes — each
node owns a non-overlapping CIDR and allocates locally.
For anycast, the agent installs `<anycast-ip> via <pod-eth0-ip> dev <veth>`
host routes on the node and adds the anycast IP to BIRD's BGP export
filter. When a pod loses readiness, the agent withdraws the route from
both the kernel and BGP within one reconcile cycle (sub-second).
### Packet path
`pod.eth0` (a veth) ↔ host-side veth (with `addrgenmode none`,
`fe80::1/64`, proxy-ARP for the v4 default-via) ↔ host kernel ↔ uplink
NIC ↔ upstream router. No conntrack, no SNAT, no encapsulation.
For IPv6 the host side of every veth carries the deterministic link-local
gateway `fe80::1`, so every pod can use a fixed default route. For IPv4
the host side answers ARP for `169.254.1.1`, providing the same fixed
default route in v4.
## Requirements
- Linux nodes. flock has not been tested on, and does not target,
Windows nodes.
- Kubernetes ≥ 1.27.
- An upstream router (or pair) that accepts a BGP session from each
node. flock has been tested with Cisco IOS-XE, Arista EOS, and FRR
acting as the upstream; anything that speaks standard eBGP should work.
- Globally routable (or at least datacentre-routable) IPv6 prefix
delegated to the cluster, sliced into a per-node /64. IPv4 is
optional but supported.
- Each node must have a unique local ASN. Private ASNs (`64512-65534`,
`4200000000-4294967294`) are typical.
## Quickstart
```sh
# 1. Install CRD + RBAC + DaemonSet (single bundled manifest):
kubectl apply -f deploy/install.yaml
# 2. Label the node(s) you want flock to manage:
kubectl label node <node-name> flock.fritzlab.net/agent=
# 3. Apply a NodeConfig CR for that node (see "NodeConfig CRD" below):
kubectl apply -f my-nodeconfig.yaml
# 4. Verify the agent is up:
kubectl -n kube-system get pod -l app=flock-agent -o wide
kubectl -n kube-system exec -it ds/flock-agent -c bird -- \
birdc -s /run/flock/bird.ctl show protocols
```
The DaemonSet is gated by the `flock.fritzlab.net/agent` node label, so
unlabelled nodes continue to use whatever CNI was installed before. This
lets you migrate node-by-node — start with one node, prove it works, then
proceed.
## NodeConfig CRD
A `NodeConfig` is the only operator-supplied input. One per node, name
matches the node name. Example:
```yaml
apiVersion: flock.fritzlab.net/v1alpha1
kind: NodeConfig
metadata:
  name: node-a
spec:
  cidr6:
    - 2001:db8:f001::/64 # Pods on this node get addresses from here.
  cidr4:
    - 192.0.2.0/24 # IPv4 pool, used only when a pod opts in.
  defaults:
    ipv6: true # Optional. Built-in baseline if omitted.
    ipv4: true # Optional. Built-in baseline if omitted.
  bgp:
    asn: 65101 # This node's local ASN.
    peers:
      - address: 2001:db8::1 # Upstream router (IPv6 session).
        asn: 65000
      - address: 192.0.2.1 # Same router, IPv4 session.
        asn: 65000
```
### `spec.defaults`
`spec.defaults` controls which address families a pod *gets by default*
on this node — i.e. when the pod has no explicit `flock.fritzlab.net/ipv6`
or `flock.fritzlab.net/ipv4` annotation. Pod annotations always override.
If you omit `spec.defaults` (or any individual field inside it) flock
falls back to its built-in baseline of **dual-stack (IPv6 on, IPv4 on)**.
| Goal | `spec.defaults` |
|-----------------------------------|----------------------------------------|
| Dual-stack (the default) | omit, or `{ ipv6: true, ipv4: true }` |
| IPv6-only node | `{ ipv6: true, ipv4: false }` |
| IPv4-only (legacy node) | `{ ipv6: false, ipv4: true }` |
A NodeConfig that resolves to "neither family" is rejected at allocation
time, so misconfiguring both to false will surface as an error on the
first `CNI ADD`.
### `spec.bgp`
Each `peer` becomes one BGP session. The agent picks a node-local source
address on the same subnet as the peer; if there isn't one, BIRD uses
its default. Multi-homing (multiple peers per family — or per upstream
router pair) is allowed.
## Pod annotations
All annotations live under `flock.fritzlab.net/`. Every annotation is
optional; leave them off to inherit the per-node defaults.
| Annotation | Type | Purpose |
|-------------------------------------|--------|-----------------------------------------------------------------------------------------------|
| `flock.fritzlab.net/ipv6` | bool | Override `spec.defaults.ipv6` for this pod (`true`/`false`). |
| `flock.fritzlab.net/ipv4` | bool | Override `spec.defaults.ipv4` for this pod (`true`/`false`). |
| `flock.fritzlab.net/cidr6` | CIDRs | Restrict IPv6 allocation to a sub-range of the node's `cidr6`. Comma-separated. |
| `flock.fritzlab.net/cidr4` | CIDRs | Restrict IPv4 allocation to a sub-range of the node's `cidr4`. Comma-separated. |
| `flock.fritzlab.net/ip-algo` | list | Embed identity into the IPv6 IID. Subset of `namespace,pod,image`, in order, comma-separated. |
| `flock.fritzlab.net/anycast` | IPs | Bind these IPs on the pod's `lo`; advertise via BGP while pod is `Ready`. Mixed v6+v4 ok. |
| `flock.fritzlab.net/addresses` | IPs | Bind these IPs on the pod's `eth0`. The first v6 and first v4 **replace** IPAM allocation for that family — the addresses IP becomes the pod's primary IP. Mixed v6+v4 ok. Single-replica only in practice. |
Bool values must be the literal strings `"true"` or `"false"`
(case-insensitive, surrounding whitespace tolerated). Other values —
`1`, `0`, `yes`, `no` — are rejected so a typo can't silently flip
behaviour.
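
A minimal Go sketch of that strict parsing rule, for illustration only. `parseBoolAnnotation` is a hypothetical name; the point of not using `strconv.ParseBool` is that it also accepts values such as `1`, `0`, `t`, and `f`.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBoolAnnotation accepts only "true" or "false", ignoring case and
// surrounding whitespace. Everything else is an error so a typo cannot
// silently select a default.
func parseBoolAnnotation(v string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(v)) {
	case "true":
		return true, nil
	case "false":
		return false, nil
	}
	return false, fmt.Errorf("invalid boolean %q: want \"true\" or \"false\"", v)
}

func main() {
	for _, v := range []string{"true", " False ", "1", "yes"} {
		b, err := parseBoolAnnotation(v)
		fmt.Printf("%q -> %v, err=%v\n", v, b, err)
	}
}
```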
### `addresses` vs `anycast`
Both annotations bind operator-supplied IPs onto a pod and have flock
advertise `/128` (or `/32`) per-pod over BGP. The differences are
where the IP lands and what it's for:
| | `anycast` | `addresses` |
|----------------------------|----------------------------------------------------|-------------------------------------------------------------------|
| Bound on | pod `lo` | pod `eth0` |
| Multi-replica? | yes — every Ready replica advertises the same IP and the upstream router ECMPs across them | no — the same IP on multiple replicas is operator error |
| Replaces IPAM? | no — pod still has an IPAM-allocated unicast IP | **yes** — the first v6 + first v4 in the list become the pod's primary IPs in place of an IPAM allocation |
| Workload visibility | only the IPAM IP is on the primary interface | the public IP is `eth0`'s primary address — workloads that read their own NIC see it (e.g. Plex's remote-access detection) |
Use `anycast` for shared services with many replicas (DNS, ingress).
Use `addresses` when one specific pod needs a known public IP that the
workload itself must see on its primary interface.
### Conflict detection
`addresses` and `anycast` reject pods that supply an IP whose family is
disabled. If the resolved `WantV4` is false (via the pod's `ipv4`
annotation or the NodeConfig default) and any addresses- or
anycast-supplied IP is IPv4, the CNI ADD fails with an explicit error.
Same for v6. Both annotation types put IPs on a pod interface and rely
on the family being enabled for return-path routing — silently accepting
the IP would leave a non-functional pod.
### Outside-aggregate advertisement
When an `addresses` IP replaces IPAM (becomes the pod's primary IP) the
IP is typically **outside** the node's BGP aggregate (e.g. a public
`/32` on a node whose pod CIDR is private). flock notices this during
BGP rendering and advertises the IP individually as a per-pod `/32` or
`/128` so the upstream router has a route to it.
### Example pods
Default dual-stack — no annotations needed:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: minimal
```
IPv6 only — opt out of the default v4 allocation:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: v6-only
annotations:
flock.fritzlab.net/ipv4: "false"
```
Operator-friendly addressing — `fnv(namespace) | fnv(pod) | random`
packed into the host bits, so a pod's identity is recognisable from
its IP in `kubectl get pods -o wide`:
```yaml
metadata:
annotations:
flock.fritzlab.net/ip-algo: "namespace,pod"
```
Anycast service — three replicas, each advertising the same v6+v4
anycast pair from the node it lands on. The upstream router does ECMP
across the active set:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: dns
spec:
replicas: 3
template:
metadata:
annotations:
flock.fritzlab.net/anycast: "2001:db8:a::53, 192.0.2.53"
spec:
containers:
- name: coredns
image: coredns/coredns
readinessProbe:
httpGet: { path: /ready, port: 8181 }
periodSeconds: 1
failureThreshold: 1
```
Workload with a known public IP — single-replica pod whose application
inspects its own primary interface (Plex's remote-access flow). The
addresses become the pod's primary IPs in place of any IPAM allocation;
the pod's `eth0` ends up with exactly the supplied addresses, and BGP
advertises them as a `/128` and `/32`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: plex
spec:
replicas: 1
template:
metadata:
annotations:
flock.fritzlab.net/addresses: "2001:db8:c606::166, 192.0.2.166"
spec:
containers:
- name: plex
image: plexinc/pms-docker
```
## Use cases
**Highly-available DNS.** Run N CoreDNS replicas, each annotated with
the same `anycast` IP. Point client `/etc/resolv.conf` at the anycast
address. Each replica advertises a `/128` from its own node; the
upstream router does ECMP. If a pod is lost, traffic fails over within one
probe cycle.
**Replacing a kube-proxy `ClusterIP`.** Headless Service plus an anycast
IP gives you a single stable address with load-balancing across pods,
without the DNAT-pinning that makes long-lived TCP keepalive connections
stick to one backend forever. ECMP routes each new flow independently.
**Per-pod public IPv6.** Because every pod has a globally routable IPv6
address and the cluster does no NAT, a pod's `eth0` IP is reachable from
the rest of the internet (subject to your firewall). Useful for things
like outgoing SMTP, where you want a stable from-address per pod, or for
peer-to-peer protocols that don't tolerate NAT.
**Fast pod identification in `kubectl`.** With
`flock.fritzlab.net/ip-algo: namespace,pod` the IPv6 host bits encode
the pod's namespace+name, so you can recognise a pod from its IP without
a lookup. Reverse-DNS via a wildcard zone makes those IPs human-readable
too.
**Static-IP migration.** Annotation-driven address allocation means you
can ask for a specific sub-CIDR (`cidr6: 2001:db8:f001::ab00/120`) for
services that previously needed pinned IPs (mail server, ingress
controller). When the static-IP requirement goes away, drop the
annotation and the pod gets a normal allocation.
## Comparison vs Calico / Cilium
| | flock | Calico | Cilium |
|--------------------------|-----------------------------|------------------------------|------------------------------|
| Default address family | dual (IPv6+IPv4) | IPv4 | dual |
| BGP | yes (BIRD) | yes | optional |
| Overlay (VXLAN/IPIP) | never | optional | optional (VXLAN or Geneve), or native routing |
| NAT in datapath | never | masquerade by default | masquerade by default |
| Anycast pod addressing | first-class | manual | optional, via service mesh |
| eBPF datapath | no | optional | yes |
| NetworkPolicy | yes (nftables) | yes (Felix) | yes (eBPF) |
| Cluster size target | small (< 100 nodes) | thousands | thousands |
| Operational surface area | low (1 DaemonSet, 1 CRD) | medium | high |
| Production-ready | alpha | yes | yes |
flock is not trying to compete with Calico or Cilium. The right answer
for most clusters is one of those two — flock exists for clusters where
every node already speaks BGP, the operator wants real (no NAT) IPv6
addressing on every pod, and per-pod anycast is something they actually
want to use rather than work around.
## Limitations and non-goals
- NetworkPolicy supports `networking.k8s.io/v1` (ingress + egress, all
three peer types, numeric ports + port ranges). Named ports and
AdminNetworkPolicy are not yet implemented.
- No NAT, no masquerade, no SNAT-egress. Pods reach the wider internet
using their real cluster-routable addresses; if your IPv4 pool isn't
routable beyond your network, those pods can't reach v4-only hosts on
the public internet without help from your border router.
- No multi-cluster, no peering across clusters.
- Linux-only datapath.
- IPAM is per-node — there's no global allocator and no IP mobility.
When a pod moves to a different node it gets a new address.
- The agent is privileged. It mounts `/var/run/netns`, configures veth
pairs, manages kernel routes, and holds `CAP_NET_ADMIN`. This is
inherent to being a CNI; reducing privilege further is not a goal.
- If BIRD dies but the agent stays up, pods on that node stop being
reachable from off-node. The DaemonSet liveness probes catch this.
## Building and testing
```sh
# Unit tests + fuzz seed corpora (fast, ~1s):
go test ./...
# Targeted fuzz pass:
go test -run NEVERMATCH -fuzz=FuzzParseAnnotations -fuzztime=30s ./pkg/agent
go test -run NEVERMATCH -fuzz=FuzzRender -fuzztime=30s ./pkg/routing/bird
go test -run NEVERMATCH -fuzz=FuzzEmbed -fuzztime=30s ./pkg/embed
go test -run NEVERMATCH -fuzz=FuzzIPAM_Allocate -fuzztime=30s ./pkg/agent
# Build the container image (used by the DaemonSet):
docker build -t flock:dev .
```
The fuzz tests are also run as plain unit tests via their seed corpora,
so every `go test ./...` exercises the discovered edge cases as
regressions.
`pkg/agent` has Linux-only files (`*_linux.go`) for netlink and netns
work; the macOS/Windows build pulls in stubs from `*_stub.go` so tests
run cleanly on developer laptops.
## License
Apache 2.0; see [LICENSE](LICENSE).
@@ -20,6 +20,9 @@ spec:
openAPIV3Schema: openAPIV3Schema:
type: object type: object
required: [spec] required: [spec]
description: |
NodeConfig is the per-node operator-supplied configuration for the
flock CNI agent. Its name MUST equal the Kubernetes node name.
properties: properties:
spec: spec:
type: object type: object
@@ -35,6 +38,25 @@ spec:
items: items:
type: string type: string
description: IPv4 CIDR owned and aggregate-advertised by this node. description: IPv4 CIDR owned and aggregate-advertised by this node.
defaults:
type: object
description: |
Per-node baseline for which address families a pod receives
when its own annotations don't specify. Pod annotations
flock.fritzlab.net/ipv6 and flock.fritzlab.net/ipv4 always
override these defaults. Built-in fallback (when this block
or any field is omitted) is IPv6=true, IPv4=true (dual-stack).
properties:
ipv6:
type: boolean
description: |
Default IPv6 inclusion for pods on this node. Omit to
inherit the built-in baseline (true).
ipv4:
type: boolean
description: |
Default IPv4 inclusion for pods on this node. Omit to
inherit the built-in baseline (true).
bgp: bgp:
type: object type: object
required: [asn, peers] required: [asn, peers]
@@ -70,3 +92,9 @@ spec:
- name: CIDR4 - name: CIDR4
type: string type: string
jsonPath: .spec.cidr4 jsonPath: .spec.cidr4
- name: DefV6
type: boolean
jsonPath: .spec.defaults.ipv6
- name: DefV4
type: boolean
jsonPath: .spec.defaults.ipv4
+4 -13
@@ -41,19 +41,10 @@ spec:
nodeSelector: nodeSelector:
flock.fritzlab.net/agent: "" flock.fritzlab.net/agent: ""
tolerations: tolerations:
- key: fritzlab.net/cni-test # CNI must schedule on a fresh node before it becomes Ready —
operator: Equal # the node has not-ready:NoSchedule until flock installs the CNI conflist.
value: "true" # Catch-all tolerates all taints so the agent always runs.
effect: NoSchedule - operator: Exists
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
initContainers: initContainers:
- name: install-cni - name: install-cni
image: code.fritzlab.net/fritzlab/flock:latest image: code.fritzlab.net/fritzlab/flock:latest
+35 -13
@@ -20,6 +20,9 @@ spec:
openAPIV3Schema: openAPIV3Schema:
type: object type: object
required: [spec] required: [spec]
description: |
NodeConfig is the per-node operator-supplied configuration for the
flock CNI agent. Its name MUST equal the Kubernetes node name.
properties: properties:
spec: spec:
type: object type: object
@@ -35,6 +38,25 @@ spec:
items: items:
type: string type: string
description: IPv4 CIDR owned and aggregate-advertised by this node. description: IPv4 CIDR owned and aggregate-advertised by this node.
defaults:
type: object
description: |
Per-node baseline for which address families a pod receives
when its own annotations don't specify. Pod annotations
flock.fritzlab.net/ipv6 and flock.fritzlab.net/ipv4 always
override these defaults. Built-in fallback (when this block
or any field is omitted) is IPv6=true, IPv4=true (dual-stack).
properties:
ipv6:
type: boolean
description: |
Default IPv6 inclusion for pods on this node. Omit to
inherit the built-in baseline (true).
ipv4:
type: boolean
description: |
Default IPv4 inclusion for pods on this node. Omit to
inherit the built-in baseline (true).
bgp: bgp:
type: object type: object
required: [asn, peers] required: [asn, peers]
@@ -70,6 +92,12 @@ spec:
- name: CIDR4 - name: CIDR4
type: string type: string
jsonPath: .spec.cidr4 jsonPath: .spec.cidr4
- name: DefV6
type: boolean
jsonPath: .spec.defaults.ipv6
- name: DefV4
type: boolean
jsonPath: .spec.defaults.ipv4
--- ---
apiVersion: v1 apiVersion: v1
kind: ServiceAccount kind: ServiceAccount
@@ -91,6 +119,9 @@ rules:
- apiGroups: ["networking.k8s.io"] - apiGroups: ["networking.k8s.io"]
resources: ["networkpolicies"] resources: ["networkpolicies"]
verbs: ["get", "list", "watch"] verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["namespaces"]
verbs: ["get", "list", "watch"]
- apiGroups: [""] - apiGroups: [""]
resources: ["nodes/status"] resources: ["nodes/status"]
verbs: ["patch"] verbs: ["patch"]
@@ -151,19 +182,10 @@ spec:
nodeSelector: nodeSelector:
flock.fritzlab.net/agent: "" flock.fritzlab.net/agent: ""
tolerations: tolerations:
- key: fritzlab.net/cni-test # CNI must schedule on a fresh node before it becomes Ready —
operator: Equal # the node has not-ready:NoSchedule until flock installs the CNI conflist.
value: "true" # Catch-all tolerates all taints so the agent always runs.
effect: NoSchedule - operator: Exists
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
initContainers: initContainers:
- name: install-cni - name: install-cni
image: code.fritzlab.net/fritzlab/flock:latest image: code.fritzlab.net/fritzlab/flock:latest
+290 -61
@@ -2,88 +2,241 @@ package agent
import ( import (
"fmt" "fmt"
"log/slog"
"net" "net"
"strings" "strings"
flockv1alpha1 "code.fritzlab.net/fritzlab/flock/pkg/api/v1alpha1"
"code.fritzlab.net/fritzlab/flock/pkg/embed" "code.fritzlab.net/fritzlab/flock/pkg/embed"
) )
// annotationPrefix is the namespace under which all flock pod annotations
// live. Anything not starting with this prefix is ignored by the parser.
const annotationPrefix = "flock.fritzlab.net/" const annotationPrefix = "flock.fritzlab.net/"
// ParsedAnnotations is the typed view of a Pod's flock annotations. // Recognised annotation keys (without the prefix).
type ParsedAnnotations struct { const (
annIPv6 = "ipv6"
annIPv4 = "ipv4"
annCIDR6 = "cidr6"
annCIDR4 = "cidr4"
annIPAlgo = "ip-algo"
annAnycast = "anycast"
annAddresses = "addresses"
)
// FamilyDefaults is the per-call baseline for whether a pod receives an IPv6
// and/or IPv4 address. It is the merge of:
//
// 1. flock's built-in baseline (IPv6=true, IPv4=true — dual-stack), then
// 2. any NodeConfig.Spec.Defaults override the operator has applied to
// the local node.
//
// Pod-level `flock.fritzlab.net/ipv{6,4}` annotations override this baseline.
//
// Use FamilyDefaultsFromNodeConfig to compute a value from a NodeConfig,
// or BuiltinFamilyDefaults() if no NodeConfig is in scope.
type FamilyDefaults struct {
// WantV6 is the default-on value for IPv6 inclusion when the pod has no
// explicit ipv6 annotation.
WantV6 bool WantV6 bool
// WantV4 is the default-on value for IPv4 inclusion when the pod has no
// explicit ipv4 annotation.
WantV4 bool WantV4 bool
}
// BuiltinFamilyDefaults returns flock's hard-coded fallback: dual-stack
// (IPv6 + IPv4). This is the policy applied when no NodeConfig override is
// in effect. Pods that want a single family explicitly opt out via the
// `flock.fritzlab.net/ipv6` or `flock.fritzlab.net/ipv4` annotation, or
// the operator narrows the fallback at the node level via
// NodeConfig.Spec.Defaults.
//
// We define it as a function rather than a var so callers can't mutate the
// shared baseline at runtime.
func BuiltinFamilyDefaults() FamilyDefaults {
return FamilyDefaults{WantV6: true, WantV4: true}
}
// FamilyDefaultsFromNodeConfig resolves the effective per-node defaults,
// falling back to BuiltinFamilyDefaults for any field the NodeConfig leaves
// unset. A nil NodeConfig (or nil Spec.Defaults) returns the built-in
// baseline unchanged.
func FamilyDefaultsFromNodeConfig(nc *flockv1alpha1.NodeConfig) FamilyDefaults {
out := BuiltinFamilyDefaults()
if nc == nil || nc.Spec.Defaults == nil {
return out
}
if nc.Spec.Defaults.IPv6 != nil {
out.WantV6 = *nc.Spec.Defaults.IPv6
}
if nc.Spec.Defaults.IPv4 != nil {
out.WantV4 = *nc.Spec.Defaults.IPv4
}
return out
}
// ParsedAnnotations is the typed view of a pod's flock annotations after the
// node-level defaults have been merged in. All slices are non-nil only when
// the corresponding annotation was present and parsed cleanly.
type ParsedAnnotations struct {
// WantV6 is true when the pod should receive an IPv6 address.
WantV6 bool
// WantV4 is true when the pod should receive an IPv4 address.
WantV4 bool
// CIDR6 narrows IPv6 allocation to specific operator-approved sub-ranges
// of the node's CIDR6 set. nil/empty means "use any node CIDR6".
CIDR6 []*net.IPNet CIDR6 []*net.IPNet
// CIDR4 narrows IPv4 allocation. nil/empty means "use any node CIDR4".
CIDR4 []*net.IPNet CIDR4 []*net.IPNet
IPAlgo []embed.Field // Anycast is the set of anycast IPs to bind on the pod's loopback.
// nil/empty means "no anycast".
Anycast []net.IP Anycast []net.IP
// Addresses is the set of additional IPs to bind directly on the pod's
// eth0. BGP advertisement (/128+/32) is identical to Anycast; the only
// difference is that these IPs land on the primary interface instead of
// lo. Use this when the workload needs the IP directly visible on eth0
// (e.g. Plex, which inspects its own interfaces for remote-access setup).
// nil/empty means "no extra addresses".
Addresses []net.IP
} }
// ParseAnnotations applies the design-doc defaults (ipv6=true, ipv4=false) // ParseAnnotations applies the supplied per-node defaults and validates the
// and validates the post-merge combination. // post-merge combination. It is pure — it does not consult NodeConfig or any
func ParseAnnotations(in map[string]string) (*ParsedAnnotations, error) { // global state — so it is safe to call from tests and fuzz targets.
out := &ParsedAnnotations{WantV6: true, WantV4: false} //
// Annotation precedence: pod annotation > FamilyDefaults > built-in baseline.
// Callers compute FamilyDefaults via FamilyDefaultsFromNodeConfig and pass it
// in.
//
// Errors:
// - any unknown ipv6/ipv4 value (must be "true" or "false", case-insensitive)
// - any malformed cidr6/cidr4/anycast/ip-algo value
// - the post-merge combination resolves to neither IPv6 nor IPv4 (a pod
// must have at least one address)
func ParseAnnotations(in map[string]string, defaults FamilyDefaults) (*ParsedAnnotations, error) {
out := &ParsedAnnotations{WantV6: defaults.WantV6, WantV4: defaults.WantV4}
if v, ok := in[annotationPrefix+"ipv6"]; ok { if v, ok := in[annotationPrefix+annIPv6]; ok {
switch strings.ToLower(strings.TrimSpace(v)) { b, err := parseBoolAnnotation(annIPv6, v)
case "true": if err != nil {
out.WantV6 = true return nil, err
case "false":
out.WantV6 = false
default:
return nil, fmt.Errorf("annotation ipv6=%q: must be true or false", v)
} }
out.WantV6 = b
} }
if v, ok := in[annotationPrefix+"ipv4"]; ok { if v, ok := in[annotationPrefix+annIPv4]; ok {
switch strings.ToLower(strings.TrimSpace(v)) { b, err := parseBoolAnnotation(annIPv4, v)
case "true": if err != nil {
out.WantV4 = true return nil, err
case "false":
out.WantV4 = false
default:
return nil, fmt.Errorf("annotation ipv4=%q: must be true or false", v)
} }
out.WantV4 = b
} }
if !out.WantV6 && !out.WantV4 { if !out.WantV6 && !out.WantV4 {
return nil, fmt.Errorf("ipv6=false requires ipv4=true (pod must have at least one address)") return nil, fmt.Errorf("annotations + defaults resolve to no address family (need at least one of ipv6/ipv4)")
} }
if v, ok := in[annotationPrefix+"cidr6"]; ok { if v, ok := in[annotationPrefix+annCIDR6]; ok {
nets, err := parseCIDRList(v) nets, err := parseCIDRList(v, familyV6)
if err != nil { if err != nil {
return nil, fmt.Errorf("annotation cidr6: %w", err) return nil, fmt.Errorf("annotation %s: %w", annCIDR6, err)
} }
out.CIDR6 = nets out.CIDR6 = nets
} }
if v, ok := in[annotationPrefix+"cidr4"]; ok { if v, ok := in[annotationPrefix+annCIDR4]; ok {
nets, err := parseCIDRList(v) nets, err := parseCIDRList(v, familyV4)
if err != nil { if err != nil {
return nil, fmt.Errorf("annotation cidr4: %w", err) return nil, fmt.Errorf("annotation %s: %w", annCIDR4, err)
} }
out.CIDR4 = nets out.CIDR4 = nets
} }
if v, ok := in[annotationPrefix+"ip-algo"]; ok { if v, ok := in[annotationPrefix+annAnycast]; ok {
fields, err := parseIPAlgo(v)
if err != nil {
return nil, fmt.Errorf("annotation ip-algo: %w", err)
}
out.IPAlgo = fields
}
if v, ok := in[annotationPrefix+"anycast"]; ok {
ips, err := parseIPList(v) ips, err := parseIPList(v)
if err != nil { if err != nil {
return nil, fmt.Errorf("annotation anycast: %w", err) return nil, fmt.Errorf("annotation %s: %w", annAnycast, err)
} }
out.Anycast = ips out.Anycast = ips
} }
if v, ok := in[annotationPrefix+annAddresses]; ok {
ips, err := parseIPList(v)
if err != nil {
return nil, fmt.Errorf("annotation %s: %w", annAddresses, err)
}
out.Addresses = ips
}
// Reject pods that ask for an addresses- or anycast-supplied IP whose
// family was disabled (via the pod's ipv6/ipv4 annotation or NodeConfig
// default). Both annotation types put the IP on a pod interface and rely
// on the family being enabled for return-path routing — addresses needs
// the in-pod default v6/v4 route to send replies; anycast on lo needs
// the same default route on eth0 for the same reason. Silently accepting
// the IP would leave a non-functional pod, so we fail closed at ADD.
for _, ip := range out.Addresses {
if err := requireFamilyEnabled(ip, out.WantV6, out.WantV4, annAddresses); err != nil {
return nil, err
}
}
for _, ip := range out.Anycast {
if err := requireFamilyEnabled(ip, out.WantV6, out.WantV4, annAnycast); err != nil {
return nil, err
}
}
return out, nil return out, nil
} }
func parseCIDRList(s string) ([]*net.IPNet, error) { // requireFamilyEnabled returns an error when ip's family was opted out via
// the resolved WantV6/WantV4 booleans (pod annotation > NodeConfig default >
// built-in dual-stack). The source string identifies which annotation
// supplied the conflicting IP so the operator's error message is specific.
func requireFamilyEnabled(ip net.IP, wantV6, wantV4 bool, source string) error {
if ip.To4() != nil {
if !wantV4 {
return fmt.Errorf("annotation %s: contains IPv4 %s but ipv4 is disabled (annotation or NodeConfig default)", source, ip)
}
return nil
}
if !wantV6 {
return fmt.Errorf("annotation %s: contains IPv6 %s but ipv6 is disabled (annotation or NodeConfig default)", source, ip)
}
return nil
}
// parseBoolAnnotation accepts only "true" or "false" (case-insensitive,
// surrounding whitespace tolerated). All other values — including "1", "0",
// "yes", "no" — are rejected so operator typos are caught loudly rather
// than silently producing the "false" default.
func parseBoolAnnotation(key, v string) (bool, error) {
switch strings.ToLower(strings.TrimSpace(v)) {
case "true":
return true, nil
case "false":
return false, nil
default:
return false, fmt.Errorf("annotation %s=%q: must be \"true\" or \"false\"", key, v)
}
}
// addressFamily distinguishes IPv6 vs IPv4 in places where the parser must
// validate the family of supplied CIDRs.
type addressFamily int
const (
familyAny addressFamily = iota
familyV6
familyV4
)
// parseCIDRList parses a comma-separated CIDR list. Whitespace around items
// is trimmed; empty items are silently dropped. The list must contain at
// least one entry post-trim.
//
// If `want` is familyV6 or familyV4 each entry's family is checked and a
// mismatch is reported, so an `flock.fritzlab.net/cidr6` annotation cannot
// silently slip a v4 prefix into the v6 allocator.
func parseCIDRList(s string, want addressFamily) ([]*net.IPNet, error) {
var out []*net.IPNet var out []*net.IPNet
for _, part := range strings.Split(s, ",") { for _, part := range strings.Split(s, ",") {
part = strings.TrimSpace(part) part = strings.TrimSpace(part)
@@ -94,6 +247,17 @@ func parseCIDRList(s string) ([]*net.IPNet, error) {
if err != nil { if err != nil {
return nil, fmt.Errorf("invalid CIDR %q: %w", part, err) return nil, fmt.Errorf("invalid CIDR %q: %w", part, err)
} }
isV4 := n.IP.To4() != nil
switch want {
case familyV6:
if isV4 {
return nil, fmt.Errorf("CIDR %q is IPv4, expected IPv6", part)
}
case familyV4:
if !isV4 {
return nil, fmt.Errorf("CIDR %q is IPv6, expected IPv4", part)
}
}
out = append(out, n) out = append(out, n)
} }
if len(out) == 0 { if len(out) == 0 {
@@ -102,6 +266,9 @@ func parseCIDRList(s string) ([]*net.IPNet, error) {
return out, nil return out, nil
} }
// parseIPList parses a comma-separated literal-IP list. Same trim/empty
// semantics as parseCIDRList. Mixed v4 and v6 entries are allowed (anycast
// pods can advertise both families together).
func parseIPList(s string) ([]net.IP, error) { func parseIPList(s string) ([]net.IP, error) {
var out []net.IP var out []net.IP
for _, part := range strings.Split(s, ",") { for _, part := range strings.Split(s, ",") {
@@ -121,31 +288,89 @@ func parseIPList(s string) ([]net.IP, error) {
return out, nil return out, nil
} }
func parseIPAlgo(s string) ([]embed.Field, error) { // ResolveIPAlgo resolves the effective ip-algo for a pod. Precedence:
var out []embed.Field //
for _, part := range strings.Split(s, ",") { // pod annotation → NodeConfig annotation → nil (random IID).
part = strings.TrimSpace(part) //
switch part { // Empty, missing, or invalid annotations at any level fall through to the
case "": // next. Invalid input emits a warning via log; a nil log is silent. A nil
continue // return value means "no algo, generate a fully random IID".
case "namespace": //
out = append(out, embed.FieldNamespace) // "Invalid" is everything tryParseIPAlgo cannot turn into a non-empty,
case "pod": // duplicate-free subset of {namespace, pod, image} — unrecognised tokens,
out = append(out, embed.FieldPod) // duplicates, lists that resolve to zero fields after trimming.
case "image": func ResolveIPAlgo(podAnn, nodeAnn map[string]string, log *slog.Logger) []embed.Field {
out = append(out, embed.FieldImage) if v, ok := podAnn[annotationPrefix+annIPAlgo]; ok {
default: if fields := tryParseIPAlgo(v); fields != nil {
return nil, fmt.Errorf("unknown ip-algo field %q (allowed: namespace, pod, image)", part) return fields
} }
warnIPAlgo(log, "pod", v)
} }
if len(out) == 0 { if v, ok := nodeAnn[annotationPrefix+annIPAlgo]; ok {
return nil, fmt.Errorf("empty ip-algo") if fields := tryParseIPAlgo(v); fields != nil {
return fields
} }
return out, nil warnIPAlgo(log, "NodeConfig", v)
}
return nil
} }
// CNIArgs parses the K=V;K=V CNI_ARGS string for the kubelet keys we care // warnIPAlgo logs a single warning when an ip-algo annotation is present
// about. Other keys are ignored. // but cannot be parsed. Empty values are not worth a warn — they are
// indistinguishable from "key absent" by the user's design rule, so we
// only warn when a non-empty value failed parsing.
func warnIPAlgo(log *slog.Logger, source, value string) {
if log == nil {
return
}
if strings.TrimSpace(value) == "" {
return
}
log.Warn("ignoring invalid ip-algo annotation; falling through",
"source", source, "value", value)
}
// tryParseIPAlgo parses an ip-algo annotation value under the relaxed
// "invalid → unset" rules. Returns nil for: empty input, unrecognised
// tokens, duplicate fields, or anything that resolves to zero fields after
// trimming. Returns the ordered field list otherwise.
//
// Duplicates collapse to nil rather than dedup-and-keep so the operator
// notices their malformed annotation via the warn log instead of silently
// losing a field they thought they had specified.
func tryParseIPAlgo(s string) []embed.Field {
var out []embed.Field
seen := map[embed.Field]struct{}{}
for _, part := range strings.Split(s, ",") {
part = strings.TrimSpace(part)
if part == "" {
continue
}
var f embed.Field
switch part {
case string(embed.FieldNamespace):
f = embed.FieldNamespace
case string(embed.FieldApp):
f = embed.FieldApp
case string(embed.FieldImage):
f = embed.FieldImage
default:
return nil
}
if _, dup := seen[f]; dup {
return nil
}
seen[f] = struct{}{}
out = append(out, f)
}
if len(out) == 0 {
return nil
}
return out
}
// CNIArgs is the typed view of the K=V;K=V CNI_ARGS string passed by kubelet.
// We only keep the fields the agent uses; unknown keys are ignored.
type CNIArgs struct { type CNIArgs struct {
PodNamespace string PodNamespace string
PodName string PodName string
@@ -153,6 +378,10 @@ type CNIArgs struct {
InfraID string InfraID string
} }
// ParseCNIArgs is permissive by design — kubelet versions and runtime
// shims pass varying sets of keys. Malformed entries are skipped silently
// rather than failing the whole ADD; required-key validation is the
// caller's responsibility.
func ParseCNIArgs(s string) CNIArgs { func ParseCNIArgs(s string) CNIArgs {
var a CNIArgs var a CNIArgs
for _, kv := range strings.Split(s, ";") { for _, kv := range strings.Split(s, ";") {
+145
@@ -0,0 +1,145 @@
package agent
import (
"testing"
)
// FuzzParseAnnotations explores the joint space of {ipv6, ipv4, cidr6, cidr4,
// anycast} annotations with random byte strings. ip-algo is handled by
// ResolveIPAlgo (separate fuzz target below) and is no longer touched by
// ParseAnnotations. Every recognised key is exercised by deriving a
// deterministic input map from the fuzzed bytes.
//
// Properties checked:
//
// 1. The parser never panics on any input.
// 2. On nil-error return, the result satisfies the design-doc invariant
// that at least one of WantV6 / WantV4 is true (a pod always has at
// least one address).
// 3. Anycast IPs and CIDR slices are non-nil/empty only when the
// annotation was supplied; never spontaneously populated.
//
// Seed corpus covers known edge cases the spec must handle.
func FuzzParseAnnotations(f *testing.F) {
// Seeds: each entry is five strings — the literal raw values for the
// five parsed keys. Empty string for "key absent".
type seed struct {
ipv6, ipv4, cidr6, cidr4, anycast string
}
seeds := []seed{
{},
{ipv4: "true"},
{ipv6: "false", ipv4: "true"},
{ipv6: "TRUE"},
{ipv6: " true "},
{ipv6: "yes"}, // invalid → expect error
{ipv4: "1"}, // invalid
{cidr6: ""}, // empty: the harness below treats this as "key absent"
{cidr6: ","}, // invalid (empty after trim)
{cidr6: "2602:817:3000:f001::/64"}, // valid single
{cidr6: "2602:817:3000:f001::/64,"}, // trailing comma
{cidr6: " 2602:817:3000:f001::/64 "}, // surrounding whitespace
{cidr6: "2602:817:3000:f001::/64, 2602:817:3000:f002::/64"},
{cidr6: "10.0.0.0/8"}, // family mismatch
{cidr4: "172.25.210.0/24"}, // valid
{cidr4: "172.25.210.0/24,172.25.211.0/24"}, // multiple
{cidr4: "2602:817::/32"}, // family mismatch
{anycast: "2602:817:3000:ac::1"},
{anycast: "2602:817:3000:ac::1, 172.25.255.1"},
{anycast: "::1"}, // loopback (allowed at parse time)
{anycast: "fe80::1"}, // link-local (allowed at parse time)
{anycast: "::ffff:10.0.0.1"}, // v4-mapped v6
{anycast: "0.0.0.0"}, // unspecified
{anycast: "definitely-not-an-ip"}, // invalid
{anycast: ""}, // empty: the harness below treats this as "key absent"
// Embedded NUL bytes
{ipv4: "true\x00"},
{cidr6: "2602:817:3000:f001::/64\x00"},
{anycast: "\x00\x00"},
// Unicode
{ipv4: "trüe"},
// Very long
{cidr6: longString("2602:817:3000:f001::/64,", 4096)},
}
for _, s := range seeds {
f.Add(s.ipv6, s.ipv4, s.cidr6, s.cidr4, s.anycast)
}
f.Fuzz(func(t *testing.T, ipv6, ipv4, cidr6, cidr4, anycast string) {
in := map[string]string{}
// Treat empty as "key absent" so the seed table matches the run-time
// shape; Kubernetes annotations cannot have a nil value but they CAN
// be missing entirely. Empty-string-with-key is also a real case
// (operator typo); add a separate seed below to cover it.
if ipv6 != "" {
in[annotationPrefix+annIPv6] = ipv6
}
if ipv4 != "" {
in[annotationPrefix+annIPv4] = ipv4
}
if cidr6 != "" {
in[annotationPrefix+annCIDR6] = cidr6
}
if cidr4 != "" {
in[annotationPrefix+annCIDR4] = cidr4
}
if anycast != "" {
in[annotationPrefix+annAnycast] = anycast
}
got, err := ParseAnnotations(in, BuiltinFamilyDefaults())
if err != nil {
return // any error is acceptable; we only require no panic
}
// Property: at least one family must be selected.
if !got.WantV6 && !got.WantV4 {
t.Fatalf("parser accepted but produced no family: in=%#v", in)
}
// Property: optional fields populated only when their key was set.
if _, hasAny := in[annotationPrefix+annAnycast]; !hasAny && len(got.Anycast) != 0 {
t.Fatalf("Anycast populated without annotation")
}
if _, hasC6 := in[annotationPrefix+annCIDR6]; !hasC6 && len(got.CIDR6) != 0 {
t.Fatalf("CIDR6 populated without annotation")
}
if _, hasC4 := in[annotationPrefix+annCIDR4]; !hasC4 && len(got.CIDR4) != 0 {
t.Fatalf("CIDR4 populated without annotation")
}
})
}
// FuzzParseCNIArgs requires the parser to never panic on adversarial inputs.
// The parser is permissive by spec — it returns a CNIArgs with whatever it
// could extract — so the only invariant is "doesn't crash".
func FuzzParseCNIArgs(f *testing.F) {
f.Add("")
f.Add("=")
f.Add(";")
f.Add(";=;=;")
f.Add("K8S_POD_NAMESPACE=ns;K8S_POD_NAME=p")
f.Add("K8S_POD_NAMESPACE=ns;K8S_POD_NAME=p;K8S_POD_UID=abc;K8S_POD_INFRA_CONTAINER_ID=def")
f.Add("=value-only")
f.Add("key-only=")
f.Add("\x00\x00\x00")
f.Add("K8S_POD_NAMESPACE=\xff\xfe\xfd")
f.Add("K8S_POD_NAME=value;K8S_POD_NAME=other") // duplicate keys: last wins
// Long input
f.Add(longString("K8S_POD_NAME=x;", 4096))
f.Fuzz(func(t *testing.T, in string) {
_ = ParseCNIArgs(in)
})
}
// longString returns s repeated to total >= n bytes, useful for piling up
// realistic-looking but oversized inputs.
func longString(s string, n int) string {
if len(s) == 0 {
return ""
}
var b []byte
for len(b) < n {
b = append(b, s...)
}
return string(b)
}
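For orientation, the "permissive by spec" contract that FuzzParseCNIArgs exercises can be sketched roughly as follows. This is a minimal illustration, not the real ParseCNIArgs: the type `cniArgsSketch` and function `parseCNIArgsSketch` are hypothetical names, and the only behaviour assumed is what the seeds and comments above state (malformed entries are skipped, duplicate keys resolve to the last value, nothing panics).

```go
package main

import (
	"fmt"
	"strings"
)

// cniArgsSketch is a hypothetical stand-in for the real CNIArgs type.
type cniArgsSketch struct {
	PodNamespace string
	PodName      string
	InfraID      string
}

// parseCNIArgsSketch splits "K=V;K=V" input, skips malformed entries,
// and lets a later duplicate key overwrite an earlier one.
func parseCNIArgsSketch(in string) cniArgsSketch {
	var out cniArgsSketch
	for _, entry := range strings.Split(in, ";") {
		key, value, ok := strings.Cut(entry, "=")
		if !ok || key == "" {
			continue // no '=' or empty key: skip, never fail
		}
		switch key {
		case "K8S_POD_NAMESPACE":
			out.PodNamespace = value
		case "K8S_POD_NAME":
			out.PodName = value
		case "K8S_POD_INFRA_CONTAINER_ID":
			out.InfraID = value
		}
		// Unknown keys are ignored.
	}
	return out
}

func main() {
	a := parseCNIArgsSketch(";;K8S_POD_NAMESPACE=ns;noequalshere;=novalue;K8S_POD_NAME=p")
	fmt.Printf("%+v\n", a)
}
```

Because every branch either assigns a string or continues, there is no input that can make this shape of parser fail, which is why "doesn't crash" is the only property worth fuzzing for.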
+371 -21
@@ -3,23 +3,125 @@ package agent
import ( import (
"testing" "testing"
flockv1alpha1 "code.fritzlab.net/fritzlab/flock/pkg/api/v1alpha1"
"code.fritzlab.net/fritzlab/flock/pkg/embed" "code.fritzlab.net/fritzlab/flock/pkg/embed"
) )
func TestParseAnnotations_Defaults(t *testing.T) { // boolPtr returns a pointer to b — convenient for the *bool pointer fields
a, err := ParseAnnotations(nil) // in FamilyDefaults where nil means "unset".
func boolPtr(b bool) *bool { return &b }
func TestBuiltinFamilyDefaults(t *testing.T) {
d := BuiltinFamilyDefaults()
if !d.WantV6 || !d.WantV4 {
t.Fatalf("built-in defaults wrong: v6=%v v4=%v (want dual-stack true/true)", d.WantV6, d.WantV4)
}
}
func TestFamilyDefaultsFromNodeConfig_NilNodeConfig(t *testing.T) {
d := FamilyDefaultsFromNodeConfig(nil)
if d != BuiltinFamilyDefaults() {
t.Fatalf("nil NodeConfig should yield built-in defaults; got %+v", d)
}
}
func TestFamilyDefaultsFromNodeConfig_NilDefaults(t *testing.T) {
nc := &flockv1alpha1.NodeConfig{}
d := FamilyDefaultsFromNodeConfig(nc)
if d != BuiltinFamilyDefaults() {
t.Fatalf("missing Defaults should yield built-in; got %+v", d)
}
}
func TestFamilyDefaultsFromNodeConfig_PartialOverride(t *testing.T) {
nc := &flockv1alpha1.NodeConfig{
Spec: flockv1alpha1.NodeConfigSpec{
Defaults: &flockv1alpha1.FamilyDefaults{
IPv4: boolPtr(false),
},
},
}
d := FamilyDefaultsFromNodeConfig(nc)
// IPv6 unset → keeps built-in true; IPv4 explicitly set to false →
// node opts the family off. Validates that an explicit false beats
// the dual-stack baseline rather than being silently overridden.
if !d.WantV6 || d.WantV4 {
t.Fatalf("partial override wrong: %+v (want v6=true, v4=false)", d)
}
}
func TestFamilyDefaultsFromNodeConfig_FullOverride(t *testing.T) {
nc := &flockv1alpha1.NodeConfig{
Spec: flockv1alpha1.NodeConfigSpec{
Defaults: &flockv1alpha1.FamilyDefaults{
IPv6: boolPtr(false),
IPv4: boolPtr(true),
},
},
}
d := FamilyDefaultsFromNodeConfig(nc)
if d.WantV6 || !d.WantV4 {
t.Fatalf("full override wrong: %+v (want v6=false, v4=true)", d)
}
}
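The four FamilyDefaultsFromNodeConfig tests above describe a pointer-overlay pattern: a nil `*bool` means "unset, keep the built-in", while a non-nil pointer wins even when it points at false. A minimal sketch of that pattern, assuming nothing about the real implementation (`nodeDefaultsSketch`, `familyDefaultsSketch`, and `overlayNodeDefaults` are hypothetical names):

```go
package main

import "fmt"

// nodeDefaultsSketch mirrors the *bool shape the tests rely on:
// nil means "unset", non-nil carries an explicit choice.
type nodeDefaultsSketch struct {
	IPv6 *bool
	IPv4 *bool
}

// familyDefaultsSketch is the resolved, non-pointer form.
type familyDefaultsSketch struct {
	WantV6 bool
	WantV4 bool
}

// builtinSketch returns the dual-stack baseline described in the tests.
func builtinSketch() familyDefaultsSketch {
	return familyDefaultsSketch{WantV6: true, WantV4: true}
}

// overlayNodeDefaults starts from the built-in baseline and applies only
// the fields the node explicitly set.
func overlayNodeDefaults(n *nodeDefaultsSketch) familyDefaultsSketch {
	d := builtinSketch()
	if n == nil {
		return d
	}
	if n.IPv6 != nil {
		d.WantV6 = *n.IPv6
	}
	if n.IPv4 != nil {
		d.WantV4 = *n.IPv4
	}
	return d
}

func main() {
	off := false
	fmt.Printf("%+v\n", overlayNodeDefaults(nil))
	fmt.Printf("%+v\n", overlayNodeDefaults(&nodeDefaultsSketch{IPv4: &off}))
}
```

A plain `bool` field could not distinguish "not set" from "set to false", which is the case the partial-override test is guarding.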
func TestParseAnnotations_BuiltinDefaults(t *testing.T) {
// Built-in baseline is dual-stack — no annotation needed.
a, err := ParseAnnotations(nil, BuiltinFamilyDefaults())
if err != nil {
t.Fatal(err)
}
if !a.WantV6 || !a.WantV4 {
t.Fatalf("expected dual-stack default, got v6=%v v4=%v", a.WantV6, a.WantV4)
}
}
// TestParseAnnotations_OptOutV4 — pods that want IPv6 only must opt out
// explicitly via the ipv4 annotation now that the built-in is dual-stack.
func TestParseAnnotations_OptOutV4(t *testing.T) {
a, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": "false",
}, BuiltinFamilyDefaults())
if err != nil { if err != nil {
t.Fatal(err) t.Fatal(err)
} }
if !a.WantV6 || a.WantV4 { if !a.WantV6 || a.WantV4 {
t.Fatalf("defaults wrong: v6=%v v4=%v", a.WantV6, a.WantV4) t.Fatalf("ipv4=false override failed: v6=%v v4=%v", a.WantV6, a.WantV4)
} }
} }
func TestParseAnnotations_DualStack(t *testing.T) { func TestParseAnnotations_NodeDefaultsApplied(t *testing.T) {
// Node config says "IPv4 is on by default for this node".
d := FamilyDefaults{WantV6: true, WantV4: true}
a, err := ParseAnnotations(nil, d)
if err != nil {
t.Fatal(err)
}
if !a.WantV6 || !a.WantV4 {
t.Fatalf("node defaults not applied: %+v", a)
}
}
func TestParseAnnotations_AnnotationOverridesNodeDefault(t *testing.T) {
// Node says dual-stack by default; pod opts out of v4 explicitly.
d := FamilyDefaults{WantV6: true, WantV4: true}
a, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": "false",
}, d)
if err != nil {
t.Fatal(err)
}
if !a.WantV6 || a.WantV4 {
t.Fatalf("annotation override failed: %+v", a)
}
}
func TestParseAnnotations_DualStackViaAnnotation(t *testing.T) {
// Same as built-in default; explicit ipv4=true is a no-op now but must
// still parse cleanly.
a, err := ParseAnnotations(map[string]string{ a, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": "true", annotationPrefix + "ipv4": "true",
}) }, BuiltinFamilyDefaults())
if err != nil { if err != nil {
t.Fatal(err) t.Fatal(err)
} }
@@ -29,35 +131,152 @@ func TestParseAnnotations_DualStack(t *testing.T) {
} }
func TestParseAnnotations_NoFamily(t *testing.T) { func TestParseAnnotations_NoFamily(t *testing.T) {
// Pod opts out of both families → must be rejected.
if _, err := ParseAnnotations(map[string]string{ if _, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv6": "false", annotationPrefix + "ipv6": "false",
}); err == nil { annotationPrefix + "ipv4": "false",
t.Fatalf("expected error: ipv6=false ipv4=false") }, BuiltinFamilyDefaults()); err == nil {
t.Fatalf("expected error when pod opts out of both families")
} }
} }
func TestParseAnnotations_IPAlgo(t *testing.T) { func TestParseAnnotations_NoFamily_NodeDefaultsAlsoOff(t *testing.T) {
a, err := ParseAnnotations(map[string]string{ // Pathological NodeConfig that disables both families. Even with no pod
annotationPrefix + "ip-algo": "namespace,pod,image", // annotation we must reject — otherwise a pod gets an empty allocation.
}) d := FamilyDefaults{WantV6: false, WantV4: false}
if _, err := ParseAnnotations(nil, d); err == nil {
t.Fatalf("expected error when both defaults are false")
}
}
func TestParseAnnotations_BoolStrictness(t *testing.T) {
// Common misuses that should be rejected so typos don't silently flip
// behaviour to the implicit-false default.
bad := []string{"1", "0", "yes", "no", "TrueFalse", " "}
for _, v := range bad {
_, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": v,
}, BuiltinFamilyDefaults())
if err == nil {
t.Errorf("expected error for ipv4=%q", v)
}
}
}
func TestParseAnnotations_BoolCaseInsensitive(t *testing.T) {
for _, v := range []string{"TRUE", "True", " true ", "FALSE", "False"} {
_, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": v,
}, BuiltinFamilyDefaults())
if err != nil { if err != nil {
t.Fatal(err) t.Errorf("expected ipv4=%q to parse cleanly: %v", v, err)
}
want := []embed.Field{embed.FieldNamespace, embed.FieldPod, embed.FieldImage}
if len(a.IPAlgo) != len(want) {
t.Fatalf("ip-algo len=%d, want %d", len(a.IPAlgo), len(want))
}
for i := range want {
if a.IPAlgo[i] != want[i] {
t.Fatalf("ip-algo[%d]=%s, want %s", i, a.IPAlgo[i], want[i])
} }
} }
} }
// ResolveIPAlgo precedence is pod → node → nil. A value that is empty,
// missing, or invalid at one level falls through to the next level: under
// the relaxed user-defined rule, all three cases mean "unset".
func TestResolveIPAlgo_PodWins(t *testing.T) {
pod := map[string]string{annotationPrefix + annIPAlgo: "namespace,app"}
node := map[string]string{annotationPrefix + annIPAlgo: "image"}
got := ResolveIPAlgo(pod, node, nil)
want := []embed.Field{embed.FieldNamespace, embed.FieldApp}
if !equalFields(got, want) {
t.Fatalf("got %v, want %v", got, want)
}
}
func TestResolveIPAlgo_PodAbsentFallsToNode(t *testing.T) {
node := map[string]string{annotationPrefix + annIPAlgo: "image"}
got := ResolveIPAlgo(nil, node, nil)
want := []embed.Field{embed.FieldImage}
if !equalFields(got, want) {
t.Fatalf("got %v, want %v", got, want)
}
}
func TestResolveIPAlgo_PodEmptyFallsToNode(t *testing.T) {
pod := map[string]string{annotationPrefix + annIPAlgo: ""}
node := map[string]string{annotationPrefix + annIPAlgo: "image"}
got := ResolveIPAlgo(pod, node, nil)
want := []embed.Field{embed.FieldImage}
if !equalFields(got, want) {
t.Fatalf("got %v, want %v", got, want)
}
}
func TestResolveIPAlgo_PodInvalidFallsToNode(t *testing.T) {
for _, podVal := range []string{"namespace,bogus", "ns", ",", "namespace,namespace"} {
pod := map[string]string{annotationPrefix + annIPAlgo: podVal}
node := map[string]string{annotationPrefix + annIPAlgo: "app"}
got := ResolveIPAlgo(pod, node, nil)
want := []embed.Field{embed.FieldApp}
if !equalFields(got, want) {
t.Fatalf("podVal=%q: got %v, want %v", podVal, got, want)
}
}
}
func TestResolveIPAlgo_BothInvalidReturnsNil(t *testing.T) {
pod := map[string]string{annotationPrefix + annIPAlgo: "bogus"}
node := map[string]string{annotationPrefix + annIPAlgo: "also-bogus"}
if got := ResolveIPAlgo(pod, node, nil); got != nil {
t.Fatalf("got %v, want nil", got)
}
}
func TestResolveIPAlgo_BothAbsentReturnsNil(t *testing.T) {
if got := ResolveIPAlgo(nil, nil, nil); got != nil {
t.Fatalf("got %v, want nil", got)
}
}
func TestResolveIPAlgo_NilNodeMap(t *testing.T) {
pod := map[string]string{annotationPrefix + annIPAlgo: "image"}
got := ResolveIPAlgo(pod, nil, nil)
want := []embed.Field{embed.FieldImage}
if !equalFields(got, want) {
t.Fatalf("got %v, want %v", got, want)
}
}
func TestResolveIPAlgo_Whitespace(t *testing.T) {
pod := map[string]string{annotationPrefix + annIPAlgo: " namespace , app "}
got := ResolveIPAlgo(pod, nil, nil)
want := []embed.Field{embed.FieldNamespace, embed.FieldApp}
if !equalFields(got, want) {
t.Fatalf("got %v, want %v", got, want)
}
}
func TestResolveIPAlgo_DuplicateInvalidates(t *testing.T) {
pod := map[string]string{annotationPrefix + annIPAlgo: "app,app"}
node := map[string]string{annotationPrefix + annIPAlgo: "namespace"}
got := ResolveIPAlgo(pod, node, nil)
want := []embed.Field{embed.FieldNamespace}
if !equalFields(got, want) {
t.Fatalf("got %v, want %v (duplicate must collapse to invalid)", got, want)
}
}
func equalFields(a, b []embed.Field) bool {
if len(a) != len(b) {
return false
}
for i := range a {
if a[i] != b[i] {
return false
}
}
return true
}
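The precedence the ResolveIPAlgo tests pin down (pod, then node, then nil, with empty, invalid, and duplicated values all treated as unset) could be sketched like this. The names `parseAlgoSketch` and `resolveAlgoSketch` are hypothetical, the bare `"ip-algo"` key stands in for the prefixed annotation key, and the set of valid field names is an assumption limited to the four that appear in the tests; the third argument of the real ResolveIPAlgo is not modelled.

```go
package main

import (
	"fmt"
	"strings"
)

// validFieldsSketch lists only the field names seen in the tests above;
// the real set may be larger.
var validFieldsSketch = map[string]bool{
	"namespace": true, "pod": true, "app": true, "image": true,
}

// parseAlgoSketch returns nil for anything that should count as "unset":
// empty input, an unknown field, an empty element, or a duplicate.
func parseAlgoSketch(s string) []string {
	if strings.TrimSpace(s) == "" {
		return nil
	}
	seen := map[string]bool{}
	var out []string
	for _, part := range strings.Split(s, ",") {
		f := strings.TrimSpace(part)
		if !validFieldsSketch[f] || seen[f] {
			return nil
		}
		seen[f] = true
		out = append(out, f)
	}
	return out
}

// resolveAlgoSketch tries the pod annotations first, then the node's.
// Reading from a nil map is safe in Go and yields "".
func resolveAlgoSketch(pod, node map[string]string) []string {
	for _, m := range []map[string]string{pod, node} {
		if got := parseAlgoSketch(m["ip-algo"]); got != nil {
			return got
		}
	}
	return nil
}

func main() {
	pod := map[string]string{"ip-algo": "app,app"}
	node := map[string]string{"ip-algo": "namespace"}
	fmt.Println(resolveAlgoSketch(pod, node))
}
```

Treating an invalid pod value as unset rather than as an error means a typo on one pod silently inherits the node setting, which is the trade-off the "relaxed" rule accepts.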
func TestParseAnnotations_CIDR(t *testing.T) { func TestParseAnnotations_CIDR(t *testing.T) {
a, err := ParseAnnotations(map[string]string{ a, err := ParseAnnotations(map[string]string{
annotationPrefix + "cidr6": "2602:817:3000:f001::/64, 2602:817:3000:f002::/64", annotationPrefix + "cidr6": "2602:817:3000:f001::/64, 2602:817:3000:f002::/64",
}) }, BuiltinFamilyDefaults())
if err != nil { if err != nil {
t.Fatal(err) t.Fatal(err)
} }
@@ -66,9 +285,140 @@ func TestParseAnnotations_CIDR(t *testing.T) {
} }
} }
func TestParseAnnotations_CIDR_FamilyMismatch(t *testing.T) {
// v4 prefix in a cidr6 annotation must not silently slip through.
if _, err := ParseAnnotations(map[string]string{
annotationPrefix + "cidr6": "10.0.0.0/8",
}, BuiltinFamilyDefaults()); err == nil {
t.Fatalf("expected family mismatch error")
}
if _, err := ParseAnnotations(map[string]string{
annotationPrefix + "cidr4": "2602:817::/32",
}, BuiltinFamilyDefaults()); err == nil {
t.Fatalf("expected family mismatch error")
}
}
func TestParseAnnotations_Anycast_Mixed(t *testing.T) {
// Anycast accepts both families together — typical for a service that
// advertises one v6 and one v4 anycast IP.
a, err := ParseAnnotations(map[string]string{
annotationPrefix + "anycast": "2602:817:3000:ac::1, 172.25.255.1",
}, BuiltinFamilyDefaults())
if err != nil {
t.Fatal(err)
}
if len(a.Anycast) != 2 {
t.Fatalf("anycast len=%d", len(a.Anycast))
}
}
func TestParseAnnotations_Addresses_Mixed(t *testing.T) {
// Plex's case: one v6 and one v4 supplied via addresses, both families
// enabled (built-in defaults). Both IPs are recorded; conflict check
// passes; later in handlers.Add they get peeled into primary slots.
a, err := ParseAnnotations(map[string]string{
annotationPrefix + "addresses": "2602:817:3000:c606::166, 142.202.202.166",
}, BuiltinFamilyDefaults())
if err != nil {
t.Fatal(err)
}
if len(a.Addresses) != 2 {
t.Fatalf("addresses len=%d", len(a.Addresses))
}
}
func TestParseAnnotations_Addresses_ConflictV4Disabled(t *testing.T) {
// addresses contains a v4 but the pod has explicitly opted out of v4.
// The IP would land on eth0 with no default v4 route, so reject at ADD.
_, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": "false",
annotationPrefix + "addresses": "142.202.202.166",
}, BuiltinFamilyDefaults())
if err == nil {
t.Fatal("want error for ipv4=false + addresses v4, got nil")
}
}
func TestParseAnnotations_Addresses_ConflictV6Disabled(t *testing.T) {
_, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv6": "false",
annotationPrefix + "ipv4": "true",
annotationPrefix + "addresses": "2602:817:3000:c606::166",
}, BuiltinFamilyDefaults())
if err == nil {
t.Fatal("want error for ipv6=false + addresses v6, got nil")
}
}
func TestParseAnnotations_Anycast_ConflictV4Disabled(t *testing.T) {
// Anycast on lo also requires the family enabled — replies need the
// in-pod default v4 route off eth0, which only exists when v4 is on.
_, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": "false",
annotationPrefix + "anycast": "172.25.255.1",
}, BuiltinFamilyDefaults())
if err == nil {
t.Fatal("want error for ipv4=false + anycast v4, got nil")
}
}
func TestParseAnnotations_Anycast_ConflictV6Disabled(t *testing.T) {
_, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv6": "false",
annotationPrefix + "ipv4": "true",
annotationPrefix + "anycast": "2602:817:3000:ac::1",
}, BuiltinFamilyDefaults())
if err == nil {
t.Fatal("want error for ipv6=false + anycast v6, got nil")
}
}
func TestParseAnnotations_Addresses_NodeDefaultV4Off(t *testing.T) {
// NodeConfig default opts v4 off for the node, and the pod has no
// explicit ipv4 annotation. addresses-v4 still conflicts because the
// resolved WantV4 is false. Operator must add `ipv4: "true"` on the
// pod to override the node default.
defaults := FamilyDefaults{WantV6: true, WantV4: false}
_, err := ParseAnnotations(map[string]string{
annotationPrefix + "addresses": "142.202.202.166",
}, defaults)
if err == nil {
t.Fatal("want error for NodeConfig v4=false + addresses v4, got nil")
}
}
func TestParseAnnotations_Addresses_NodeDefaultV4Off_PodOptsBackIn(t *testing.T) {
// Same as above but pod explicitly sets ipv4=true to override the node
// default. Conflict resolved; parse succeeds.
defaults := FamilyDefaults{WantV6: true, WantV4: false}
a, err := ParseAnnotations(map[string]string{
annotationPrefix + "ipv4": "true",
annotationPrefix + "addresses": "142.202.202.166",
}, defaults)
if err != nil {
t.Fatalf("expected ok, got %v", err)
}
if !a.WantV4 || len(a.Addresses) != 1 {
t.Fatalf("unexpected: %+v", a)
}
}
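Taken together, the ParseAnnotations family tests above imply a three-step resolution: start from the supplied defaults, let a strictly parsed `ipv6`/`ipv4` annotation override each family, then reject the result if neither family remains. The following is a sketch under those assumptions only; `parseStrictBoolSketch`, `familiesSketch`, and `resolveFamiliesSketch` are hypothetical, and the bare `"ipv6"`/`"ipv4"` keys stand in for the prefixed annotation keys.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// familiesSketch is a hypothetical stand-in for the resolved family choice.
type familiesSketch struct {
	WantV6 bool
	WantV4 bool
}

// parseStrictBoolSketch accepts only "true" or "false", ignoring case and
// surrounding whitespace. Values such as "1", "yes", or " " are errors.
func parseStrictBoolSketch(s string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "true":
		return true, nil
	case "false":
		return false, nil
	}
	return false, fmt.Errorf("invalid boolean %q (want true or false)", s)
}

// resolveFamiliesSketch overlays annotations on the defaults and rejects
// a result with no address family selected.
func resolveFamiliesSketch(ann map[string]string, d familiesSketch) (familiesSketch, error) {
	out := d
	if s, ok := ann["ipv6"]; ok {
		v, err := parseStrictBoolSketch(s)
		if err != nil {
			return familiesSketch{}, fmt.Errorf("ipv6: %w", err)
		}
		out.WantV6 = v
	}
	if s, ok := ann["ipv4"]; ok {
		v, err := parseStrictBoolSketch(s)
		if err != nil {
			return familiesSketch{}, fmt.Errorf("ipv4: %w", err)
		}
		out.WantV4 = v
	}
	if !out.WantV6 && !out.WantV4 {
		return familiesSketch{}, errors.New("no address family selected")
	}
	return out, nil
}

func main() {
	got, err := resolveFamiliesSketch(
		map[string]string{"ipv4": "false"},
		familiesSketch{WantV6: true, WantV4: true},
	)
	fmt.Printf("%+v %v\n", got, err)
}
```

The address and anycast conflict checks in the tests would then run against the resolved values, which is why a node default of v4=false conflicts with a v4 address until the pod sets `ipv4: "true"`.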
func TestParseCNIArgs(t *testing.T) { func TestParseCNIArgs(t *testing.T) {
args := ParseCNIArgs("IgnoreUnknown=1;K8S_POD_NAMESPACE=mail;K8S_POD_NAME=stalwart-0;K8S_POD_INFRA_CONTAINER_ID=abc123") args := ParseCNIArgs("IgnoreUnknown=1;K8S_POD_NAMESPACE=mail;K8S_POD_NAME=stalwart-0;K8S_POD_INFRA_CONTAINER_ID=abc123")
if args.PodNamespace != "mail" || args.PodName != "stalwart-0" || args.InfraID != "abc123" { if args.PodNamespace != "mail" || args.PodName != "stalwart-0" || args.InfraID != "abc123" {
t.Fatalf("ParseCNIArgs got %+v", args) t.Fatalf("ParseCNIArgs got %+v", args)
} }
} }
func TestParseCNIArgs_EmptyAndMalformed(t *testing.T) {
// Permissive: malformed entries are skipped, never crash.
a := ParseCNIArgs("")
if a.PodName != "" {
t.Fatalf("empty input should yield empty CNIArgs, got %+v", a)
}
a = ParseCNIArgs(";;K8S_POD_NAMESPACE=ns;noequalshere;=novalue;K8S_POD_NAME=p")
if a.PodNamespace != "ns" || a.PodName != "p" {
t.Fatalf("permissive parse failed: %+v", a)
}
}
+112
@@ -0,0 +1,112 @@
package agent
import (
"net"
"sort"
)
// anycastNexthop is one (host-side veth, pod-eth0-IP) pair the kernel route
// can use as a multipath nexthop.
type anycastNexthop struct {
hostIface string
via net.IP
}
// anycastTarget describes the kernel route shape for one advertised anycast
// IP. When more than one Ready pod on this node binds the same anycast IP,
// every Ready pod contributes a nexthop and the kernel does per-flow ECMP
// across them.
//
// nexthops is sorted by canonical(via) for deterministic comparison and
// stable kernel-route ordering across reconcile passes — the
// AnycastReconciler skips kernel writes when the new and old targets are
// equal, which only works if the slice order is stable.
type anycastTarget struct {
nexthops []anycastNexthop
}
// equal reports whether two targets describe the same kernel route.
// Both sides are expected to be sorted (the canonical constructor sorts).
func (t anycastTarget) equal(o anycastTarget) bool {
if len(t.nexthops) != len(o.nexthops) {
return false
}
for i := range t.nexthops {
if t.nexthops[i].hostIface != o.nexthops[i].hostIface {
return false
}
if !t.nexthops[i].via.Equal(o.nexthops[i].via) {
return false
}
}
return true
}
// resolveAnycastTargets walks the committed allocation set and returns the
// desired kernel-route shape for every anycast IP that has at least one
// Ready local pod binding it. Multiple Ready pods sharing the same anycast
// IP collapse into a single multi-nexthop target so the kernel can
// per-flow ECMP across them.
//
// Pure: no kernel calls, no informer access. Pods are surfaced via the
// isReady callback so the reconciler can plug in its informer; tests can
// pass any function that satisfies the signature.
//
// warn is invoked for human-facing skip reasons (e.g. anycast with no
// unicast of same family). nil-safe — pass nil to silently drop.
func resolveAnycastTargets(
allocations []Allocation,
isReady func(namespace, name string) bool,
warn func(string),
) map[string]anycastTarget {
if warn == nil {
warn = func(string) {}
}
out := map[string]anycastTarget{}
for _, a := range allocations {
if a.State != StateCommitted || (len(a.Anycast) == 0 && len(a.Addresses) == 0) {
continue
}
if !isReady(a.Namespace, a.PodName) {
continue
}
host := HostIfaceName(a.ContainerID)
via6 := net.ParseIP(a.IP6)
via4 := net.ParseIP(a.IP4)
// Anycast (lo-bound) and Addresses (eth0-bound) are advertised
// identically: /128 or /32 host route on the host, BGP via BIRD.
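// Clip a.Anycast's capacity so append always copies instead of
// writing into a backing array the Store snapshot may share.
for _, ipStr := range append(a.Anycast[:len(a.Anycast):len(a.Anycast)], a.Addresses...) {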
ip := net.ParseIP(ipStr)
if ip == nil {
continue
}
var via net.IP
if ip.To4() != nil {
via = via4
} else {
via = via6
}
if via == nil {
warn("anycast " + ipStr + " skipped: pod " +
a.Namespace + "/" + a.PodName +
" has no unicast of same family")
continue
}
key := canonical(ip)
t := out[key]
t.nexthops = append(t.nexthops, anycastNexthop{hostIface: host, via: via})
out[key] = t
}
}
// Sort each target's nexthops for stable comparison + stable kernel
// ordering. Sort key is canonical(via) — sufficient for stability
// because (host, via) pairs are 1:1 (one veth per pod, one v6+v4 per
// pod, so via uniquely identifies the nexthop).
for k, t := range out {
sort.Slice(t.nexthops, func(i, j int) bool {
return canonical(t.nexthops[i].via) < canonical(t.nexthops[j].via)
})
out[k] = t
}
return out
}
+127 -76
@@ -26,6 +26,11 @@ import (
// - Pod transitions to Ready=False or DELETE → remove kernel route, remove // - Pod transitions to Ready=False or DELETE → remove kernel route, remove
// from BIRD export. // from BIRD export.
// //
// When more than one Ready pod on this node binds the same anycast IP, the
// kernel route uses RTA_MULTIPATH so the kernel does per-flow ECMP across
// the contributing pods. This is the within-node companion to BGP-level
// ECMP across nodes.
//
// Reconcile is idempotent. Triggers: AfterCommit hook, Pod informer // Reconcile is idempotent. Triggers: AfterCommit hook, Pod informer
// UpdateFunc on Ready transitions, periodic 2s tick. // UpdateFunc on Ready transitions, periodic 2s tick.
type AnycastReconciler struct { type AnycastReconciler struct {
@@ -42,13 +47,6 @@ type AnycastReconciler struct {
trigger chan struct{} trigger chan struct{}
} }
// anycastTarget describes the kernel route shape for one advertised
// anycast IP: which veth, and which pod eth0 IP to use as next-hop.
type anycastTarget struct {
hostIface string
via net.IP
}
// NewAnycastReconciler returns a Reconciler ready to Run. // NewAnycastReconciler returns a Reconciler ready to Run.
func NewAnycastReconciler(node string, store *Store, pods *PodCache, nc *NodeConfigCache, bird *BirdManager, routerID string, logger *slog.Logger) *AnycastReconciler { func NewAnycastReconciler(node string, store *Store, pods *PodCache, nc *NodeConfigCache, bird *BirdManager, routerID string, logger *slog.Logger) *AnycastReconciler {
return &AnycastReconciler{ return &AnycastReconciler{
@@ -96,25 +94,26 @@ func (r *AnycastReconciler) reconcile() {
desired := r.computeDesired() desired := r.computeDesired()
// Install routes that should exist but don't (or whose target changed). // Install routes that should exist but don't, or whose nexthop set
// changed.
for ip, t := range desired { for ip, t := range desired {
if cur, ok := r.advertised[ip]; ok && cur.hostIface == t.hostIface && cur.via.Equal(t.via) { if cur, ok := r.advertised[ip]; ok && cur.equal(t) {
continue continue
} }
if err := installAnycastRoute(ip, t); err != nil { if err := installAnycastRoute(ip, t); err != nil {
r.Logger.Warn("anycast install", "ip", ip, "host", t.hostIface, "via", t.via, "err", err) r.Logger.Warn("anycast install", "ip", ip, "nexthops", len(t.nexthops), "err", err)
continue continue
} }
r.Logger.Info("anycast advertise", "ip", ip, "host", t.hostIface, "via", t.via) r.Logger.Info("anycast advertise", "ip", ip, "nexthops", describeNexthops(t))
r.advertised[ip] = t r.advertised[ip] = t
} }
// Remove routes that exist but shouldn't. // Remove routes that exist but shouldn't.
for ip, t := range r.advertised { for ip, t := range r.advertised {
if _, want := desired[ip]; !want { if _, want := desired[ip]; !want {
if err := removeAnycastRoute(ip, t); err != nil { if err := removeAnycastRoute(ip, t); err != nil {
r.Logger.Warn("anycast remove", "ip", ip, "host", t.hostIface, "err", err) r.Logger.Warn("anycast remove", "ip", ip, "err", err)
} else { } else {
r.Logger.Info("anycast withdraw", "ip", ip, "host", t.hostIface) r.Logger.Info("anycast withdraw", "ip", ip)
} }
delete(r.advertised, ip) delete(r.advertised, ip)
} }
@@ -124,44 +123,17 @@ func (r *AnycastReconciler) reconcile() {
r.renderBird(desired) r.renderBird(desired)
} }
// computeDesired walks the Store and returns the per-ip anycastTarget for // computeDesired delegates to the pure resolveAnycastTargets and plugs in
// every anycast advertisement that should be active right now. Each target // the live informer-based isReady callback.
// uses the pod's own eth0 IP (same family) as the route's `via` next-hop —
// that way kernel NDP/ARP resolves the eth0 address, which IS configured
// on the pod's eth0, so the pod responds normally without proxy_ndp.
func (r *AnycastReconciler) computeDesired() map[string]anycastTarget { func (r *AnycastReconciler) computeDesired() map[string]anycastTarget {
out := map[string]anycastTarget{} return resolveAnycastTargets(
for _, a := range r.Store.Snapshot() { r.Store.Snapshot(),
if a.State != StateCommitted || len(a.Anycast) == 0 { func(ns, name string) bool {
continue pod, ok := r.Pods.Get(ns, name)
} return ok && podAnycastEligible(pod)
pod, ok := r.Pods.Get(a.Namespace, a.PodName) },
if !ok || !podReady(pod) { func(s string) { r.Logger.Warn(s) },
continue )
}
host := HostIfaceName(a.ContainerID)
via6 := net.ParseIP(a.IP6)
via4 := net.ParseIP(a.IP4)
for _, ipStr := range a.Anycast {
ip := net.ParseIP(ipStr)
if ip == nil {
continue
}
var via net.IP
if ip.To4() != nil {
via = via4
} else {
via = via6
}
if via == nil {
r.Logger.Warn("anycast skipped: pod has no unicast IP of same family",
"pod", a.Namespace+"/"+a.PodName, "anycast", ipStr)
continue
}
out[canonical(ip)] = anycastTarget{hostIface: host, via: via}
}
}
return out
} }
func (r *AnycastReconciler) renderBird(desired map[string]anycastTarget) { func (r *AnycastReconciler) renderBird(desired map[string]anycastTarget) {
@@ -170,72 +142,139 @@ func (r *AnycastReconciler) renderBird(desired map[string]anycastTarget) {
return return
} }
var v6, v4 []string var v6, v4 []string
for ipStr := range desired { seen := map[string]struct{}{}
ip := net.ParseIP(ipStr) add := func(ip net.IP) {
if ip == nil { key := canonical(ip)
continue if _, dup := seen[key]; dup {
return
} }
seen[key] = struct{}{}
if ip.To4() != nil { if ip.To4() != nil {
v4 = append(v4, ip.To4().String()) v4 = append(v4, ip.To4().String())
} else { } else {
v6 = append(v6, ip.To16().String()) v6 = append(v6, ip.To16().String())
} }
} }
for ipStr := range desired {
if ip := net.ParseIP(ipStr); ip != nil {
add(ip)
}
}
// A pod IP that lives outside the node's BGP aggregate (e.g. an
// addresses-annotation IP promoted to be the pod's primary v4 — Plex's
// 142.202.202.166 against host004's 172.25.214.0/24) is not naturally
// covered by the aggregate, so it must be advertised individually as a
// /32 or /128. Anycast and addresses extras are already covered by the
// `desired` loop above; this sweep is for promoted-primary IPs which do
// not flow through the AnycastReconciler.
nodeV6, nodeV4 := parseNodeCIDRs(nc)
for _, a := range r.Store.Snapshot() {
if a.State != StateCommitted {
continue
}
if ip := net.ParseIP(a.IP6); ip != nil && !ipInAny(ip, nodeV6) {
add(ip)
}
if ip := net.ParseIP(a.IP4); ip != nil && !ipInAny(ip, nodeV4) {
add(ip)
}
}
if err := r.Bird.Render(nc, v6, v4, r.RouterID); err != nil { if err := r.Bird.Render(nc, v6, v4, r.RouterID); err != nil {
r.Logger.Warn("anycast bird render", "err", err) r.Logger.Warn("anycast bird render", "err", err)
} }
} }
// installAnycastRoute installs `<ipStr>/<128|32> via t.via dev t.hostIface`. // parseNodeCIDRs parses NodeConfig.Spec.CIDR6/4 strings into IPNets,
// silently dropping malformed entries (admission-time validation should
// have rejected them long before this point).
func parseNodeCIDRs(nc *flockv1alpha1.NodeConfig) (v6, v4 []*net.IPNet) {
for _, s := range nc.Spec.CIDR6 {
if _, n, err := net.ParseCIDR(s); err == nil {
v6 = append(v6, n)
}
}
for _, s := range nc.Spec.CIDR4 {
if _, n, err := net.ParseCIDR(s); err == nil {
v4 = append(v4, n)
}
}
return
}
func ipInAny(ip net.IP, nets []*net.IPNet) bool {
for _, n := range nets {
if n.Contains(ip) {
return true
}
}
return false
}
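The "outside the node aggregate" test that drives the extra /32 and /128 advertisements can be illustrated in isolation. This sketch uses `net/netip` rather than the `net.IPNet` values the file works with, and `outsideAggregatesSketch` is a hypothetical helper; the example addresses are the ones quoted in the renderBird comment.

```go
package main

import (
	"fmt"
	"net/netip"
)

// outsideAggregatesSketch reports whether ipStr falls outside every CIDR
// in cidrs. Malformed CIDRs are dropped, matching the "silently dropping
// malformed entries" behaviour described for parseNodeCIDRs. An
// unparseable IP is reported as not outside, so it is never advertised.
func outsideAggregatesSketch(ipStr string, cidrs []string) bool {
	ip, err := netip.ParseAddr(ipStr)
	if err != nil {
		return false
	}
	ip = ip.Unmap()
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			continue
		}
		if p.Contains(ip) {
			return false
		}
	}
	return true
}

func main() {
	nodeV4 := []string{"172.25.214.0/24"}
	// Promoted primary outside the aggregate: needs its own /32.
	fmt.Println(outsideAggregatesSketch("142.202.202.166", nodeV4))
	// Ordinary pod IP inside the aggregate: already covered.
	fmt.Println(outsideAggregatesSketch("172.25.214.7", nodeV4))
}
```

Only addresses for which this check is true need individual advertisement; everything else is already reachable through the node's aggregate route.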
// installAnycastRoute installs `<ipStr>/<128|32>` pointing at the
// nexthop set in t. With one nexthop the route is a plain via-route;
// with multiple, it's a multipath route using RTA_MULTIPATH so the
// kernel hashes flows across the constituent pods.
//
// Idempotent — RouteReplace overwrites a stale entry. // Idempotent — RouteReplace overwrites a stale entry.
func installAnycastRoute(ipStr string, t anycastTarget) error { func installAnycastRoute(ipStr string, t anycastTarget) error {
ip := net.ParseIP(ipStr) ip := net.ParseIP(ipStr)
if ip == nil { if ip == nil {
return fmt.Errorf("bad ip %q", ipStr) return fmt.Errorf("bad ip %q", ipStr)
} }
link, err := netlink.LinkByName(t.hostIface) if len(t.nexthops) == 0 {
if err != nil { return fmt.Errorf("anycast %s: no nexthops", ipStr)
return fmt.Errorf("lookup %s: %w", t.hostIface, err)
} }
prefix := 128 prefix := 128
if ip.To4() != nil { if ip.To4() != nil {
prefix = 32 prefix = 32
ip = ip.To4() ip = ip.To4()
} }
r := &netlink.Route{ r := &netlink.Route{Dst: cidrFor(ip, prefix)}
if len(t.nexthops) == 1 {
// Single nexthop — keep the route shape identical to today's
// production form. Functionally equivalent to a 1-element
// MultiPath but `ip route show` renders nicer for operators.
nh := t.nexthops[0]
link, err := netlink.LinkByName(nh.hostIface)
if err != nil {
return fmt.Errorf("lookup %s: %w", nh.hostIface, err)
}
r.LinkIndex = link.Attrs().Index
r.Gw = nh.via
} else {
hops := make([]*netlink.NexthopInfo, 0, len(t.nexthops))
for _, nh := range t.nexthops {
link, err := netlink.LinkByName(nh.hostIface)
if err != nil {
return fmt.Errorf("lookup %s: %w", nh.hostIface, err)
}
hops = append(hops, &netlink.NexthopInfo{
LinkIndex: link.Attrs().Index, LinkIndex: link.Attrs().Index,
Dst: cidrFor(ip, prefix), Gw: nh.via,
Gw: t.via, Hops: 0,
// SCOPE_UNIVERSE — the gateway is on a different "logical" subnet })
// than the local /128 route, but reachable on this veth. Linux is }
// happy as long as the veth has IPv6 forwarding on (it does — set r.MultiPath = hops
// in configureHostSide) and the pod's eth0 has the via address
// (also true — that's the pod's IP6/IP4 we allocated).
} }
return netlink.RouteReplace(r) return netlink.RouteReplace(r)
} }
// removeAnycastRoute deletes the host route. Missing routes / interfaces // removeAnycastRoute deletes the host route. Missing routes / interfaces
// are treated as success — DEL paths can race with veth teardown. // are treated as success — DEL paths can race with veth teardown.
func removeAnycastRoute(ipStr string, t anycastTarget) error { //
// Kernel route deletion matches by destination prefix; we don't need to
// re-specify the nexthop set.
func removeAnycastRoute(ipStr string, _ anycastTarget) error {
ip := net.ParseIP(ipStr) ip := net.ParseIP(ipStr)
if ip == nil { if ip == nil {
return nil return nil
} }
link, err := netlink.LinkByName(t.hostIface)
if err != nil {
return nil
}
prefix := 128 prefix := 128
if ip.To4() != nil { if ip.To4() != nil {
prefix = 32 prefix = 32
ip = ip.To4() ip = ip.To4()
} }
r := &netlink.Route{ r := &netlink.Route{Dst: cidrFor(ip, prefix)}
LinkIndex: link.Attrs().Index,
Dst: cidrFor(ip, prefix),
Gw: t.via,
}
if err := netlink.RouteDel(r); err != nil { if err := netlink.RouteDel(r); err != nil {
// ESRCH ("no such process") is netlink-speak for "no such route"; // ESRCH ("no such process") is netlink-speak for "no such route";
// treat as success. // treat as success.
@@ -247,5 +286,17 @@ func removeAnycastRoute(ipStr string, t anycastTarget) error {
return nil return nil
} }
// describeNexthops returns a compact string for log messages.
func describeNexthops(t anycastTarget) string {
var s string
for i, nh := range t.nexthops {
if i > 0 {
s += ","
}
s += nh.hostIface + "→" + nh.via.String()
}
return s
}
// _ = flockv1alpha1 to silence unused import warnings on minimal builds. // _ = flockv1alpha1 to silence unused import warnings on minimal builds.
var _ = flockv1alpha1.GroupName var _ = flockv1alpha1.GroupName
+227
@@ -0,0 +1,227 @@
package agent
import (
"net"
"strings"
"testing"
)
// allReady is a convenience isReady that says yes to every pod.
func allReady(_, _ string) bool { return true }
// readyOnly returns an isReady that only says yes to the named pods.
func readyOnly(want ...string) func(string, string) bool {
set := map[string]struct{}{}
for _, n := range want {
set[n] = struct{}{}
}
return func(_, name string) bool {
_, ok := set[name]
return ok
}
}
func TestResolveAnycastTargets_OnePodOneAnycast(t *testing.T) {
allocs := []Allocation{{
ContainerID: "c1", Namespace: "ns", PodName: "pod-a",
State: StateCommitted,
IP6: "2001:db8::1",
Anycast: []string{"2001:db8:a::1"},
}}
out := resolveAnycastTargets(allocs, allReady, nil)
if len(out) != 1 {
t.Fatalf("expected 1 anycast IP, got %d", len(out))
}
tgt, ok := out["2001:db8:a::1"]
if !ok {
t.Fatalf("missing target")
}
if len(tgt.nexthops) != 1 {
t.Fatalf("expected 1 nexthop, got %d", len(tgt.nexthops))
}
if !tgt.nexthops[0].via.Equal(net.ParseIP("2001:db8::1")) {
t.Fatalf("nexthop via wrong: %v", tgt.nexthops[0].via)
}
}
// Two pods on the same node binding the same anycast IP must produce a
// SINGLE target with TWO nexthops. The previous behaviour (overwriting)
// was the bug this whole change exists to fix.
func TestResolveAnycastTargets_TwoPodsSameAnycast_MultiNexthop(t *testing.T) {
allocs := []Allocation{
{ContainerID: "c1", Namespace: "ns", PodName: "pod-a",
State: StateCommitted, IP6: "2001:db8::2",
Anycast: []string{"2001:db8:a::1"}},
{ContainerID: "c2", Namespace: "ns", PodName: "pod-b",
State: StateCommitted, IP6: "2001:db8::1",
Anycast: []string{"2001:db8:a::1"}},
}
out := resolveAnycastTargets(allocs, allReady, nil)
tgt := out["2001:db8:a::1"]
if len(tgt.nexthops) != 2 {
t.Fatalf("expected 2 nexthops, got %d", len(tgt.nexthops))
}
// Order should be sorted by canonical(via) — ::1 before ::2.
if !tgt.nexthops[0].via.Equal(net.ParseIP("2001:db8::1")) {
t.Fatalf("nexthops not sorted by via; got %v first", tgt.nexthops[0].via)
}
if !tgt.nexthops[1].via.Equal(net.ParseIP("2001:db8::2")) {
t.Fatalf("nexthops not sorted by via; got %v second", tgt.nexthops[1].via)
}
// HostIface differs per pod (different containerID → different FNV).
if tgt.nexthops[0].hostIface == tgt.nexthops[1].hostIface {
t.Fatalf("expected distinct hostIfaces, both %q", tgt.nexthops[0].hostIface)
}
}
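The multi-nexthop behaviour this test pins down can be sketched as below. This is only an illustration under assumptions: `alloc`, `nexthop`, and `groupNexthops` are hypothetical stand-ins, not the agent's real `Allocation`, `anycastNexthop`, or `resolveAnycastTargets`, and readiness and state filtering are left out. The sort uses the textual form of `via`, which is what the tests call canonical(via); textual order is not numeric order in general.

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// nexthop and alloc are simplified, hypothetical stand-ins for the
// agent's anycastNexthop and Allocation types.
type nexthop struct {
	iface string
	via   net.IP
}

type alloc struct {
	iface   string
	ip6     string
	anycast []string
}

// groupNexthops appends one nexthop per contributing pod instead of
// overwriting the previous one, then sorts each set by the textual
// form of via so repeated calls yield the same order.
func groupNexthops(allocs []alloc) map[string][]nexthop {
	out := map[string][]nexthop{}
	for _, a := range allocs {
		via := net.ParseIP(a.ip6)
		if via == nil {
			continue // unparseable unicast address: nothing to route via
		}
		for _, target := range a.anycast {
			out[target] = append(out[target], nexthop{iface: a.iface, via: via})
		}
	}
	for _, nhs := range out {
		sort.Slice(nhs, func(i, j int) bool {
			return nhs[i].via.String() < nhs[j].via.String()
		})
	}
	return out
}

func main() {
	got := groupNexthops([]alloc{
		{iface: "flockaaaaaaaa", ip6: "2001:db8::2", anycast: []string{"2001:db8:a::1"}},
		{iface: "flockbbbbbbbb", ip6: "2001:db8::1", anycast: []string{"2001:db8:a::1"}},
	})
	for _, nh := range got["2001:db8:a::1"] {
		fmt.Println(nh.iface, nh.via)
	}
}
```

Appending keeps both pods behind the same anycast IP, and sorting afterwards makes the result independent of allocation order.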
// When one of the contributing pods goes NotReady, only the remaining
// Ready pod should appear in the target's nexthop set.
func TestResolveAnycastTargets_NotReadyDropped(t *testing.T) {
allocs := []Allocation{
{ContainerID: "c1", Namespace: "ns", PodName: "pod-a",
State: StateCommitted, IP6: "2001:db8::1",
Anycast: []string{"2001:db8:a::1"}},
{ContainerID: "c2", Namespace: "ns", PodName: "pod-b",
State: StateCommitted, IP6: "2001:db8::2",
Anycast: []string{"2001:db8:a::1"}},
}
out := resolveAnycastTargets(allocs, readyOnly("pod-a"), nil)
tgt := out["2001:db8:a::1"]
if len(tgt.nexthops) != 1 {
t.Fatalf("expected 1 nexthop after NotReady drop, got %d", len(tgt.nexthops))
}
if !tgt.nexthops[0].via.Equal(net.ParseIP("2001:db8::1")) {
t.Fatalf("wrong surviving nexthop: %v", tgt.nexthops[0].via)
}
}
// Pods that haven't reached Ready are excluded entirely from the target
// set. If no pod is Ready for an anycast IP, that IP is absent from the
// output (BIRD will withdraw from BGP, kernel route will be removed).
func TestResolveAnycastTargets_NoReadyPodsOmitsIP(t *testing.T) {
allocs := []Allocation{
{ContainerID: "c1", Namespace: "ns", PodName: "pod-a",
State: StateCommitted, IP6: "2001:db8::1",
Anycast: []string{"2001:db8:a::1"}},
}
out := resolveAnycastTargets(allocs, readyOnly( /* none */ ), nil)
if _, ok := out["2001:db8:a::1"]; ok {
t.Fatalf("anycast should be absent when no pod ready")
}
}
// Pending allocations (CNI ADD partway through) are skipped even if the
// pod is Ready — we don't program kernel routes for partial setups.
func TestResolveAnycastTargets_PendingSkipped(t *testing.T) {
allocs := []Allocation{
{ContainerID: "c1", Namespace: "ns", PodName: "pod-a",
State: StatePending, IP6: "2001:db8::1",
Anycast: []string{"2001:db8:a::1"}},
}
out := resolveAnycastTargets(allocs, allReady, nil)
if len(out) != 0 {
t.Fatalf("pending allocations must be skipped")
}
}
// Mixed v6+v4 anycast on the same pod produces two separate target
// entries, one per family, each anchored on the matching unicast IP.
func TestResolveAnycastTargets_MixedFamilies(t *testing.T) {
allocs := []Allocation{{
ContainerID: "c1", Namespace: "ns", PodName: "pod-a",
State: StateCommitted,
IP6: "2001:db8::1",
IP4: "10.0.0.1",
Anycast: []string{"2001:db8:a::1", "10.255.0.1"},
}}
out := resolveAnycastTargets(allocs, allReady, nil)
if !out["2001:db8:a::1"].nexthops[0].via.Equal(net.ParseIP("2001:db8::1")) {
t.Fatalf("v6 anycast should resolve via v6 unicast")
}
if !out["10.255.0.1"].nexthops[0].via.Equal(net.ParseIP("10.0.0.1").To4()) {
t.Fatalf("v4 anycast should resolve via v4 unicast")
}
}
// An anycast whose family has no matching unicast on the pod is skipped
// with a warning. Other anycast IPs on the same pod are unaffected.
func TestResolveAnycastTargets_FamilyMismatchWarns(t *testing.T) {
allocs := []Allocation{{
ContainerID: "c1", Namespace: "ns", PodName: "pod-a",
State: StateCommitted,
IP6: "2001:db8::1", // v6 only
Anycast: []string{"2001:db8:a::1", "10.255.0.1"},
}}
var warns []string
out := resolveAnycastTargets(allocs, allReady, func(s string) { warns = append(warns, s) })
if _, has := out["2001:db8:a::1"]; !has {
t.Fatalf("v6 anycast should have been programmed")
}
if _, has := out["10.255.0.1"]; has {
t.Fatalf("v4 anycast should have been skipped")
}
if len(warns) != 1 {
t.Fatalf("expected 1 warning, got %d: %v", len(warns), warns)
}
if !strings.Contains(warns[0], "10.255.0.1") {
t.Fatalf("warning should mention skipped IP: %q", warns[0])
}
}
// Determinism: the same input must produce nexthops in the same order.
func TestResolveAnycastTargets_Determinism(t *testing.T) {
allocs := []Allocation{
{ContainerID: "z-late", Namespace: "ns", PodName: "z",
State: StateCommitted, IP6: "2001:db8::5",
Anycast: []string{"2001:db8:a::1"}},
{ContainerID: "a-early", Namespace: "ns", PodName: "a",
State: StateCommitted, IP6: "2001:db8::3",
Anycast: []string{"2001:db8:a::1"}},
{ContainerID: "m-mid", Namespace: "ns", PodName: "m",
State: StateCommitted, IP6: "2001:db8::4",
Anycast: []string{"2001:db8:a::1"}},
}
a := resolveAnycastTargets(allocs, allReady, nil)
b := resolveAnycastTargets(allocs, allReady, nil)
if !a["2001:db8:a::1"].equal(b["2001:db8:a::1"]) {
t.Fatalf("same input produced unequal targets")
}
// Sorted by canonical(via): ::3, ::4, ::5
via := a["2001:db8:a::1"].nexthops
if !via[0].via.Equal(net.ParseIP("2001:db8::3")) ||
!via[1].via.Equal(net.ParseIP("2001:db8::4")) ||
!via[2].via.Equal(net.ParseIP("2001:db8::5")) {
t.Fatalf("nexthops not stably sorted: %v %v %v", via[0].via, via[1].via, via[2].via)
}
}
// equal()'s contract — different orderings are still considered equal
// AS LONG AS both sides have been canonicalised by resolveAnycastTargets.
// Across-call comparisons of resolver outputs must always match for the
// same logical input.
func TestAnycastTarget_Equal(t *testing.T) {
a := anycastTarget{nexthops: []anycastNexthop{
{hostIface: "f1", via: net.ParseIP("2001:db8::1")},
{hostIface: "f2", via: net.ParseIP("2001:db8::2")},
}}
b := anycastTarget{nexthops: []anycastNexthop{
{hostIface: "f1", via: net.ParseIP("2001:db8::1")},
{hostIface: "f2", via: net.ParseIP("2001:db8::2")},
}}
if !a.equal(b) {
t.Fatalf("equal targets reported unequal")
}
c := anycastTarget{nexthops: []anycastNexthop{
{hostIface: "f1", via: net.ParseIP("2001:db8::1")},
}}
if a.equal(c) {
t.Fatalf("targets with different lengths reported equal")
}
d := anycastTarget{nexthops: []anycastNexthop{
{hostIface: "f1", via: net.ParseIP("2001:db8::1")},
{hostIface: "f2", via: net.ParseIP("2001:db8::3")}, // diff IP
}}
if a.equal(d) {
t.Fatalf("targets with different vias reported equal")
}
}
+33
@@ -55,6 +55,12 @@ func (b *BirdManager) Render(nc *flockv1alpha1.NodeConfig, anycast6, anycast4 []
// the BGP peer. crt001 rejects IPv6 advertisements whose next-hop is
// link-local-only; an explicit `source address` makes BIRD use a
// global next-hop self, which Cisco accepts.
//
// Also derive the connected subnet (peer IP masked to /64 v6 / /24 v4)
// per family. Render uses it to install `import where net != <subnet>`
// on the BGP channel so the gateway can't readvertise our own connected
// /64 back to us — accepting it would override the kernel route and
// hairpin all inter-host traffic via the gateway.
for _, p := range nc.Spec.BGP.Peers {
fam := bird.FamilyOf(p.Address)
if fam == "" {
@@ -69,6 +75,14 @@ func (b *BirdManager) Render(nc *flockv1alpha1.NodeConfig, anycast6, anycast4 []
in.LocalV4 = local
}
}
if subnet := peerSubnet(p.Address); subnet != "" {
if fam == "v6" && in.LocalSubnetV6 == "" {
in.LocalSubnetV6 = subnet
}
if fam == "v4" && in.LocalSubnetV4 == "" {
in.LocalSubnetV4 = subnet
}
}
}
cfg, err := bird.Render(in)
@@ -165,6 +179,25 @@ func (b *BirdManager) SummaryRoutes(nc *flockv1alpha1.NodeConfig) error {
return nil
}
// peerSubnet returns the canonical CIDR of the assumed connected subnet
// containing `peer` — /64 for IPv6, /24 for IPv4. Returns "" if peer
// doesn't parse. Matches the assumption already baked into
// localAddrSameSubnet: fritzlab convention is /64 v6 and /24 v4.
func peerSubnet(peer string) string {
pip := net.ParseIP(peer)
if pip == nil {
return ""
}
var mask net.IPMask
if pip.To4() != nil {
mask = net.CIDRMask(24, 32)
} else {
mask = net.CIDRMask(64, 128)
}
n := &net.IPNet{IP: pip.Mask(mask), Mask: mask}
return n.String()
}
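To see what `peerSubnet` feeds into the import filter, the function can be run on its own. The body below is copied from the hunk above; the `main` wrapper is only illustrative, and the filter line is the form quoted in the commit message (`import where net != <subnet>;`), not the actual output of `bird.Render`.

```go
package main

import (
	"fmt"
	"net"
)

// peerSubnet is copied from the hunk above so the example is
// self-contained: /64 for IPv6 peers, /24 for IPv4 peers.
func peerSubnet(peer string) string {
	pip := net.ParseIP(peer)
	if pip == nil {
		return ""
	}
	var mask net.IPMask
	if pip.To4() != nil {
		mask = net.CIDRMask(24, 32)
	} else {
		mask = net.CIDRMask(64, 128)
	}
	n := &net.IPNet{IP: pip.Mask(mask), Mask: mask}
	return n.String()
}

func main() {
	for _, peer := range []string{"2602:817:3000:a25::104", "172.25.25.104", "not-an-ip"} {
		subnet := peerSubnet(peer)
		if subnet == "" {
			// No subnet could be derived; the commit message says the
			// renderer falls back to `import all;` in this case.
			fmt.Printf("%s: import all;\n", peer)
			continue
		}
		fmt.Printf("%s: import where net != %s;\n", peer, subnet)
	}
}
```

The /64 and /24 lengths are a site convention, as the function comment says; a peer on a differently sized subnet would get a filter that does not match its connected route.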
// localAddrSameSubnet finds an IP on a local interface that's in the same
// /64 (v6) or /24 (v4) as `peer`. Returns "" if none. Used to derive the
// `source address` for a BGP session.
+25
@@ -0,0 +1,25 @@
package agent
import "testing"
func TestPeerSubnet(t *testing.T) {
cases := []struct {
peer string
want string
}{
{"2602:817:3000:a25::1", "2602:817:3000:a25::/64"},
{"2602:817:3000:a25::104", "2602:817:3000:a25::/64"},
{"172.25.25.1", "172.25.25.0/24"},
{"172.25.25.104", "172.25.25.0/24"},
{"", ""},
{"not-an-ip", ""},
}
for _, tc := range cases {
t.Run(tc.peer, func(t *testing.T) {
got := peerSubnet(tc.peer)
if got != tc.want {
t.Fatalf("peerSubnet(%q) = %q, want %q", tc.peer, got, tc.want)
}
})
}
}
+22
@@ -0,0 +1,22 @@
// Package agent owns the in-process flock-agent runtime. The agent is a
// single Linux DaemonSet pod per node and holds:
//
// - the durable per-node allocation file at /var/lib/flock/allocations.json
// (see Store in state.go),
// - an in-memory IPAM seeded from NodeConfig CIDRs and reconciled against
// the allocation file at startup (see ipam.go),
// - dynamic informers watching the per-node NodeConfig CR (nodeconfig.go)
// and the local-node Pod set (podinfo.go),
// - an RPC server speaking to the lightweight CNI plugin binary
// (cmd/flock and pkg/cni), so kubelet's CNI invocations are answered by
// a long-lived process rather than spinning up a fresh binary per ADD,
// - the BirdManager that renders bird.conf and triggers `birdc reload`
// on changes (bird.go), and
// - the AnycastReconciler that programs per-pod /128 and /32 host routes
// gated on Pod readiness (anycast_linux.go).
//
// The package is split between platform-specific files (anycast_linux.go,
// netns_linux.go, runtime_linux.go) and stub files used on non-Linux build
// hosts so the rest of the package — IPAM, parsing, store, RPC plumbing —
// stays unit-testable on macOS and Windows CI.
package agent
+150 -5
@@ -3,14 +3,91 @@ package agent
import (
"context"
"fmt"
"log/slog"
"net"
"strings"
"time"
flockcni "code.fritzlab.net/fritzlab/flock/pkg/cni"
cnitypes "github.com/containernetworking/cni/pkg/types"
current "github.com/containernetworking/cni/pkg/types/100"
corev1 "k8s.io/api/core/v1"
)
// podTemplateHashLabel is the well-known label Kubernetes attaches to
// every Pod owned by a ReplicaSet so the ReplicaSet name can be
// reconstructed as "<deploy>-<hash>". We use it to peel the hash back off
// in deriveAppName.
const podTemplateHashLabel = "pod-template-hash"
// deriveAppName returns the stable workload identifier for a Pod — the
// name of the topmost stable controller, with the pod-template-hash
// stripped for ReplicaSet-owned pods.
//
// The rule maps to Kubernetes pod-name generation:
//
// Deployment → ReplicaSet → Pod pod owner is RS named "<deploy>-<hash>";
// strip the trailing "-<hash>" to recover
// the Deployment name.
// StatefulSet → Pod pod owner is the STS itself; use as-is.
// DaemonSet → Pod pod owner is the DS itself; use as-is.
// Job → Pod pod owner is the Job itself; use as-is.
// (bare pod) → Pod no controller owner; fall back to pod name.
//
// All replicas of the same workload converge on the same return value,
// which is the property the ip-algo `app` field needs.
func deriveAppName(pod *corev1.Pod) string {
owner := controllerOwner(pod)
if owner == nil {
return pod.Name
}
if owner.Kind == "ReplicaSet" {
if hash, ok := pod.Labels[podTemplateHashLabel]; ok && hash != "" {
suffix := "-" + hash
if strings.HasSuffix(owner.Name, suffix) {
return strings.TrimSuffix(owner.Name, suffix)
}
}
// Custom controller named the RS something that doesn't match
// the pod-template-hash convention. Falling back to the RS name
// keeps replicas of the same RS aligned, which is the second-
// best correctness we can offer.
return owner.Name
}
return owner.Name
}
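The naming rule can be tried without the Kubernetes API types. The sketch below restates it over plain strings; `appNameFromOwner` and its parameters are hypothetical and exist only for illustration, with an empty `ownerKind` standing in for "no controller owner".

```go
package main

import (
	"fmt"
	"strings"
)

// appNameFromOwner mirrors the rule described for deriveAppName, but
// takes plain strings instead of a *corev1.Pod. Hypothetical helper.
func appNameFromOwner(podName, ownerKind, ownerName, templateHash string) string {
	if ownerKind == "" {
		return podName // bare pod: no controller owner
	}
	if ownerKind == "ReplicaSet" && templateHash != "" {
		// TrimSuffix leaves the name untouched when the suffix is
		// absent, which matches the fall-back to the RS name.
		return strings.TrimSuffix(ownerName, "-"+templateHash)
	}
	return ownerName
}

func main() {
	fmt.Println(appNameFromOwner("traefik-789df685f-hqvfl", "ReplicaSet", "traefik-789df685f", "789df685f"))
	fmt.Println(appNameFromOwner("gitea-0", "StatefulSet", "gitea", ""))
	fmt.Println(appNameFromOwner("standalone", "", "", ""))
}
```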
// controllerOwner returns the OwnerReference flagged with Controller=true,
// or nil if none. Kubernetes guarantees at most one controller per object.
func controllerOwner(pod *corev1.Pod) *metav1OwnerLite {
for i := range pod.OwnerReferences {
o := &pod.OwnerReferences[i]
if o.Controller != nil && *o.Controller {
return &metav1OwnerLite{Kind: o.Kind, Name: o.Name}
}
}
return nil
}
// metav1OwnerLite is the slice of OwnerReference we actually consult,
// kept tiny so it can be returned by value-pointer cheaply.
type metav1OwnerLite struct {
Kind string
Name string
}
// podImageRef returns a deterministic image reference for the embed
// `image` field. We use the first container's spec'd image — this is
// stable across replicas of the same Deployment without requiring the
// runtime-resolved digest. Empty string if the pod has no containers,
// in which case the embed package falls back to FNV(containerID).
func podImageRef(pod *corev1.Pod) string {
if len(pod.Spec.Containers) == 0 {
return ""
}
return pod.Spec.Containers[0].Image
}
// PodHandler is the platform-agnostic ADD/DEL/CHECK implementation. It
// resolves the Pod from the informer cache, parses annotations, allocates
// from IPAM, programs netns (or skips on non-Linux build), and persists
@@ -22,6 +99,7 @@ type PodHandler struct {
IPAM *IPAM
Pods *PodCache
NodeConfig *NodeConfigCache
Logger *slog.Logger
// SetupFunc and TeardownFunc are injected at startup; in production
// they point at the Linux netlink ops, in tests they're fakes.
SetupFunc func(SetupRequest) error
@@ -49,25 +127,58 @@ func (h *PodHandler) Add(ctx context.Context, req flockcni.Request) (*current.Re
return nil, fmt.Errorf("lookup pod: %w", err)
}
- parsed, err := ParseAnnotations(pod.Annotations)
+ nc := h.NodeConfig.Load()
+ defaults := FamilyDefaultsFromNodeConfig(nc)
+ parsed, err := ParseAnnotations(pod.Annotations, defaults)
if err != nil {
return nil, fmt.Errorf("parse annotations: %w", err)
}
var nodeAnn map[string]string
if nc != nil {
nodeAnn = nc.GetAnnotations()
}
ipAlgo := ResolveIPAlgo(pod.Annotations, nodeAnn, h.Logger)
// addresses-annotation IPs replace IPAM allocation for any family they
// cover. Plex needs its public IPv4 to be the pod's primary v4 (default
// route source, on-link host route, /32 in BGP) — not just an extra IP
// layered on top of a private IPAM allocation. Peel one v6 + one v4 out
// of Addresses to use as the pod's primary IPs; anything beyond that
// stays in addrExtras and gets the existing layered behavior.
addrV6, addrV4, addrExtras := splitAddressesPrimary(parsed.Addresses)
allocReq := AllocRequest{
ContainerID: req.ContainerID,
Namespace: args.PodNamespace,
Pod: args.PodName,
- WantV6: parsed.WantV6,
- WantV4: parsed.WantV4,
+ App: deriveAppName(pod),
+ WantV6: parsed.WantV6 && addrV6 == nil,
+ WantV4: parsed.WantV4 && addrV4 == nil,
AnnCIDR6: parsed.CIDR6,
AnnCIDR4: parsed.CIDR4,
- IPAlgo: parsed.IPAlgo,
+ IPAlgo: ipAlgo,
+ Image: podImageRef(pod),
}
- res, err := h.IPAM.Allocate(allocReq)
+ var res AllocResult
+ if allocReq.WantV6 || allocReq.WantV4 {
+ var err error
+ res, err = h.IPAM.Allocate(allocReq)
if err != nil {
return nil, fmt.Errorf("ipam: %w", err)
}
+ }
// Promote the peeled addresses IPs into the primary slots. They get the
// IPAM-style routing path: bound to eth0 in configurePodSide, default
// route via fe80::1 / v4ProxyGW, on-link host route via setHostRoute.
// BGP advertisement of the /32/128 is handled by the AnycastReconciler
// via renderBird's outside-aggregate detection.
if addrV6 != nil {
res.IP6 = addrV6
}
if addrV4 != nil {
res.IP4 = addrV4
}
// Persist pending entry before any netlink work so a crash mid-ADD
// leaves recoverable state.
@@ -79,6 +190,7 @@ func (h *PodHandler) Add(ctx context.Context, req flockcni.Request) (*current.Re
IP6: ipString(res.IP6),
IP4: ipString(res.IP4),
Anycast: anycastStrings(parsed.Anycast),
Addresses: anycastStrings(addrExtras),
State: StatePending,
AllocatedAt: time.Now().UTC(),
}
@@ -95,6 +207,7 @@ func (h *PodHandler) Add(ctx context.Context, req flockcni.Request) (*current.Re
IP6: res.IP6,
IP4: res.IP4,
Anycast: parsed.Anycast,
Addresses: addrExtras,
}
if err := h.SetupFunc(setup); err != nil {
// Roll forward: leave pending entry in place so startup GC can clean
@@ -164,6 +277,11 @@ func resultFromAllocation(ifName string, a Allocation) *current.Result {
Address: net.IPNet{IP: ip4, Mask: net.CIDRMask(32, 32)},
})
}
// Addresses IPs are intentionally excluded from the CNI result.
// Kubernetes limits pod.status.podIPs to one IPv4 + one IPv6; any
// additional IPs returned here are silently dropped by kubelet. The
// addresses IPs are visible inside the pod on eth0 and advertised via
// BGP — that is sufficient for workload use.
return r
}
@@ -175,6 +293,33 @@ func ipString(ip net.IP) string {
return canonical(ip)
}
// splitAddressesPrimary peels off the first IPv6 and first IPv4 from the
// addresses list to use as the pod's primary IPs in place of an IPAM
// allocation. The remaining entries (anything beyond the first of each
// family) stay in extras for the existing layered eth0 binding via the
// AnycastReconciler's via-route path.
//
// Order of the input is preserved in extras. Either of v6/v4 may be nil
// when the addresses list contains no IP of that family — the caller falls
// back to IPAM allocation in that case.
func splitAddressesPrimary(ips []net.IP) (v6, v4 net.IP, extras []net.IP) {
for _, ip := range ips {
if ip.To4() != nil {
if v4 == nil {
v4 = ip.To4()
continue
}
} else {
if v6 == nil {
v6 = ip.To16()
continue
}
}
extras = append(extras, ip)
}
return
}
func anycastStrings(ips []net.IP) []string {
if len(ips) == 0 {
return nil
+186
@@ -0,0 +1,186 @@
package agent
import (
"net"
"testing"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
func ptrBool(b bool) *bool { return &b }
func mkPod(name string, owner *metav1.OwnerReference, labels map[string]string, image string) *corev1.Pod {
p := &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{Name: name, Labels: labels},
}
if owner != nil {
p.OwnerReferences = []metav1.OwnerReference{*owner}
}
if image != "" {
p.Spec.Containers = []corev1.Container{{Image: image}}
}
return p
}
func TestDeriveAppName_DeploymentReplicaSet(t *testing.T) {
owner := &metav1.OwnerReference{
Kind: "ReplicaSet",
Name: "traefik-789df685f",
Controller: ptrBool(true),
}
pod := mkPod("traefik-789df685f-hqvfl", owner,
map[string]string{podTemplateHashLabel: "789df685f"}, "")
if got := deriveAppName(pod); got != "traefik" {
t.Fatalf("got %q, want %q", got, "traefik")
}
}
func TestDeriveAppName_StatefulSet(t *testing.T) {
owner := &metav1.OwnerReference{
Kind: "StatefulSet",
Name: "gitea",
Controller: ptrBool(true),
}
pod := mkPod("gitea-0", owner, nil, "")
if got := deriveAppName(pod); got != "gitea" {
t.Fatalf("got %q, want %q", got, "gitea")
}
}
func TestDeriveAppName_DaemonSet(t *testing.T) {
owner := &metav1.OwnerReference{
Kind: "DaemonSet",
Name: "flock-agent",
Controller: ptrBool(true),
}
pod := mkPod("flock-agent-abcde", owner, nil, "")
if got := deriveAppName(pod); got != "flock-agent" {
t.Fatalf("got %q, want %q", got, "flock-agent")
}
}
func TestDeriveAppName_BarePod(t *testing.T) {
pod := mkPod("standalone", nil, nil, "")
if got := deriveAppName(pod); got != "standalone" {
t.Fatalf("got %q, want %q", got, "standalone")
}
}
// TestDeriveAppName_RSWithoutTemplateHash — ReplicaSet owners that don't
// follow the standard "<deploy>-<hash>" naming convention (e.g. a custom
// controller) keep the RS name as-is. All replicas of that RS still align,
// which is the second-best correctness offer.
func TestDeriveAppName_RSWithoutTemplateHash(t *testing.T) {
owner := &metav1.OwnerReference{
Kind: "ReplicaSet",
Name: "weird-rs-name",
Controller: ptrBool(true),
}
pod := mkPod("weird-rs-name-xyz", owner, nil, "")
if got := deriveAppName(pod); got != "weird-rs-name" {
t.Fatalf("got %q, want %q", got, "weird-rs-name")
}
}
func TestDeriveAppName_NonControllerOwnerIgnored(t *testing.T) {
// OwnerReference without Controller=true must be ignored — only the
// controller owner is the canonical workload.
owner := &metav1.OwnerReference{
Kind: "Foo",
Name: "irrelevant",
// Controller pointer left nil.
}
pod := mkPod("solo", owner, nil, "")
if got := deriveAppName(pod); got != "solo" {
t.Fatalf("got %q, want %q", got, "solo")
}
}
func TestPodImageRef(t *testing.T) {
pod := mkPod("p", nil, nil, "traefik:v3.5")
if got := podImageRef(pod); got != "traefik:v3.5" {
t.Fatalf("got %q, want %q", got, "traefik:v3.5")
}
empty := mkPod("p", nil, nil, "")
if got := podImageRef(empty); got != "" {
t.Fatalf("got %q, want \"\"", got)
}
}
func TestSplitAddressesPrimary_BothFamilies(t *testing.T) {
// Plex pattern: one v6 + one v4 → both peel out, no extras.
ips := []net.IP{
net.ParseIP("2602:817:3000:c606::166"),
net.ParseIP("142.202.202.166"),
}
v6, v4, extras := splitAddressesPrimary(ips)
if v6 == nil || v6.String() != "2602:817:3000:c606::166" {
t.Fatalf("v6 = %v", v6)
}
if v4 == nil || v4.String() != "142.202.202.166" {
t.Fatalf("v4 = %v", v4)
}
if len(extras) != 0 {
t.Fatalf("extras = %v, want empty", extras)
}
}
func TestSplitAddressesPrimary_OnlyV4(t *testing.T) {
v6, v4, extras := splitAddressesPrimary([]net.IP{net.ParseIP("142.202.202.166")})
if v6 != nil {
t.Fatalf("v6 should be nil, got %v", v6)
}
if v4 == nil || v4.String() != "142.202.202.166" {
t.Fatalf("v4 = %v", v4)
}
if len(extras) != 0 {
t.Fatalf("extras = %v", extras)
}
}
func TestSplitAddressesPrimary_OnlyV6(t *testing.T) {
v6, v4, extras := splitAddressesPrimary([]net.IP{net.ParseIP("2602:817:3000:c606::166")})
if v4 != nil {
t.Fatalf("v4 should be nil, got %v", v4)
}
if v6 == nil || v6.String() != "2602:817:3000:c606::166" {
t.Fatalf("v6 = %v", v6)
}
if len(extras) != 0 {
t.Fatalf("extras = %v", extras)
}
}
func TestSplitAddressesPrimary_Empty(t *testing.T) {
v6, v4, extras := splitAddressesPrimary(nil)
if v6 != nil || v4 != nil || extras != nil {
t.Fatalf("nil input should yield nil outputs, got v6=%v v4=%v extras=%v", v6, v4, extras)
}
}
func TestSplitAddressesPrimary_Extras(t *testing.T) {
// Multiple v4s — only the first peels into the primary slot; the rest
// stay in extras for layered-eth0 binding via the AnycastReconciler.
// (Not a current production use case, but the code should handle it
// without dropping IPs.)
ips := []net.IP{
net.ParseIP("142.202.202.166"),
net.ParseIP("2602:817:3000:c606::166"),
net.ParseIP("142.202.202.167"),
net.ParseIP("2602:817:3000:c606::167"),
}
v6, v4, extras := splitAddressesPrimary(ips)
if v4.String() != "142.202.202.166" {
t.Fatalf("v4 primary = %v, want 142.202.202.166", v4)
}
if v6.String() != "2602:817:3000:c606::166" {
t.Fatalf("v6 primary = %v, want 2602:817:3000:c606::166", v6)
}
if len(extras) != 2 {
t.Fatalf("extras len = %d, want 2", len(extras))
}
if extras[0].String() != "142.202.202.167" || extras[1].String() != "2602:817:3000:c606::167" {
t.Fatalf("extras order/content wrong: %v", extras)
}
}
+63
@@ -0,0 +1,63 @@
package agent
import (
"strings"
"testing"
)
func TestHostIfaceName_Format(t *testing.T) {
got := HostIfaceName("0123456789abcdef0123456789abcdef")
if !strings.HasPrefix(got, "flock") || len(got) != len("flock")+8 {
t.Fatalf("HostIfaceName=%q (want flock + 8 hex)", got)
}
}
func TestHostIfaceName_Determinism(t *testing.T) {
a := HostIfaceName("container-xyz")
b := HostIfaceName("container-xyz")
if a != b {
t.Fatalf("not deterministic: %s vs %s", a, b)
}
}
func TestHostIfaceName_DifferentInputs(t *testing.T) {
a := HostIfaceName("a")
b := HostIfaceName("b")
if a == b {
t.Fatalf("collision on trivial inputs")
}
}
// FuzzHostIfaceName ensures the host interface name generator never produces
// an output longer than IFNAMSIZ-1 (15 chars on Linux) and never panics.
// The name format is "flock" + 8 hex chars = 13 chars, always.
func FuzzHostIfaceName(f *testing.F) {
f.Add("")
f.Add("a")
f.Add("/var/run/netns/abc")
f.Add("0123456789abcdef0123456789abcdef")
f.Add(longString("x", 64*1024)) // very long containerID
f.Add("\x00\x00\x00")
f.Add("ünïcødé/контейнер")
f.Fuzz(func(t *testing.T, id string) {
got := HostIfaceName(id)
// Linux IFNAMSIZ is 16 (15 chars + NUL); ours must fit comfortably.
if len(got) > 15 {
t.Fatalf("HostIfaceName(%q)=%q exceeds 15 chars", id, got)
}
if !strings.HasPrefix(got, "flock") {
t.Fatalf("HostIfaceName(%q)=%q missing prefix", id, got)
}
// Suffix must be lowercase hex (8 chars).
suffix := got[len("flock"):]
if len(suffix) != 8 {
t.Fatalf("HostIfaceName(%q) suffix len=%d", id, len(suffix))
}
for _, c := range suffix {
if !((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')) {
t.Fatalf("HostIfaceName(%q)=%q has non-hex suffix", id, got)
}
}
})
}
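The properties the fuzz test enforces (fixed "flock" prefix, 8 lowercase hex characters, at most 15 characters in total) follow from hashing the container ID to 32 bits. The implementation of `HostIfaceName` is not part of this diff, so the sketch below is an assumption: it uses 32-bit FNV-1a because the tests mention FNV, and the helper name `hostIfaceNameSketch` is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hostIfaceNameSketch is a hypothetical reconstruction, not the real
// HostIfaceName: it hashes the container ID with 32-bit FNV-1a and
// formats the sum as 8 zero-padded lowercase hex digits.
func hostIfaceNameSketch(containerID string) string {
	h := fnv.New32a()
	h.Write([]byte(containerID)) // hash.Hash writes never return an error
	return fmt.Sprintf("flock%08x", h.Sum32())
}

func main() {
	name := hostIfaceNameSketch("0123456789abcdef0123456789abcdef")
	// 5 prefix chars + 8 hex chars = 13, within the 15-char limit
	// imposed by IFNAMSIZ on Linux.
	fmt.Println(name, len(name))
}
```

Because the output length is fixed by the format verb, arbitrarily long or non-UTF-8 container IDs cannot push the name past the interface-name limit.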
+57 -25
@@ -62,25 +62,36 @@ func (cryptoRand) PickIndex(n int) int {
}
// AllocRequest describes a pending allocation. Values come from Pod metadata
- // + annotations at CNI ADD time.
+ // + annotations at CNI ADD time, with per-node FamilyDefaults already merged
+ // in (see ParseAnnotations).
type AllocRequest struct {
ContainerID string
Namespace string
+ // Pod is the literal pod name (used for logging only — not embedded).
Pod string
- // WantV6 / WantV4 come from the ipv6 / ipv4 annotations (defaults in
- // design doc: ipv6=true, ipv4=false).
+ // App is the stable workload identity for the FieldApp embed field —
+ // typically the owning Deployment / StatefulSet / DaemonSet name.
+ // Computed by the handler; falls back to Pod when no usable owner is
+ // found (bare pods).
+ App string
+ // WantV6 / WantV4 are the post-merge address family selection (pod
+ // annotation > NodeConfig.Spec.Defaults > built-in baseline of
+ // dual-stack). At least one MUST be true; Allocate rejects the request
+ // otherwise.
WantV6 bool
WantV4 bool
// AnnCIDR6 / AnnCIDR4 come from the cidr6 / cidr4 annotations. Empty
// means "use any of the node's CIDRs".
AnnCIDR6 []*net.IPNet
AnnCIDR4 []*net.IPNet
- // IPAlgo comes from the ip-algo annotation. Empty means random IID.
+ // IPAlgo comes from the resolved ip-algo precedence chain. Empty means
+ // random IID.
IPAlgo []embed.Field
- // ImageDigest is the sha256 manifest digest (with or without "sha256:"
- // prefix). If empty, embed.Values.ImageFallback = ContainerID is used
- // for ip-algo fields that reference image.
- ImageDigest string
+ // Image is the spec'd image reference (typically
+ // pod.Spec.Containers[0].Image). When 64 hex chars, treated as a
+ // sha256 digest; otherwise FNV-1a-64'd as a string. Empty falls back
+ // to FNV(ContainerID) for ip-algo fields that reference image.
+ Image string
}
// AllocResult is what the IPAM hands back to the CNI ADD.
@@ -207,8 +218,8 @@ func (i *IPAM) allocV6(cidr *net.IPNet, req AllocRequest) (net.IP, error) {
} else {
ip, err = embed.Embed(cidr, req.IPAlgo, embed.Values{
Namespace: req.Namespace,
- Pod: req.Pod,
- Image: req.ImageDigest,
+ App: req.App,
+ Image: req.Image,
ImageFallback: req.ContainerID,
}, i.randSrc.NibbleN())
}
@@ -224,34 +235,36 @@ func (i *IPAM) allocV6(cidr *net.IPNet, req AllocRequest) (net.IP, error) {
// randomV6 picks a random /128 inside cidr. The network prefix bits are
// preserved from cidr.IP; the host bits are filled from the random source.
+ //
+ // Implementation: walk the 16 IPv6 bytes once. For each byte we ask whether
+ // it's entirely inside the network mask (skip), entirely inside the host
+ // portion (overwrite with random), or split (combine bits from both).
func (i *IPAM) randomV6(cidr *net.IPNet) (net.IP, error) {
ones, bits := cidr.Mask.Size()
if bits != 128 {
return nil, fmt.Errorf("cidr %s is not IPv6", cidr)
}
- out := make(net.IP, 16)
+ out := make(net.IP, net.IPv6len)
copy(out, cidr.IP.To16())
- hostBits := 128 - ones
- rnd := make([]byte, 16)
+ rnd := make([]byte, net.IPv6len)
i.randSrc.FillIID(rnd)
- // Merge rnd into out where mask bit is 0.
- for b := 0; b < 16; b++ {
- // Host bits start at bit index `ones`, byte `b`.
+ for b := 0; b < net.IPv6len; b++ {
byteStart := b * 8
byteEnd := byteStart + 8
- if byteEnd <= ones {
- continue // entirely network
- }
- if byteStart >= ones {
- out[b] = rnd[b] // entirely host
+ switch {
+ case byteEnd <= ones:
+ // Entirely inside the network prefix — leave untouched.
continue
- }
- // Split byte: top (ones-byteStart) bits are network, rest is host.
+ case byteStart >= ones:
+ // Entirely inside the host portion — fully randomise.
+ out[b] = rnd[b]
+ default:
+ // Split byte: top (ones-byteStart) bits are network, rest host.
networkBits := ones - byteStart
hostMask := byte(0xFF) >> uint(networkBits)
out[b] = (out[b] & ^hostMask) | (rnd[b] & hostMask)
}
- _ = hostBits
+ }
return out, nil
}
@@ -360,15 +373,34 @@ func toStringSlice(ns []*net.IPNet) []string {
return out
}
+ // canonical returns the textual form of ip in its native family, so the same
+ // host address is always represented identically regardless of whether it
+ // arrived as a 4-byte slice, a 16-byte v4-in-v6 slice, or a string-parsed
+ // net.IP. Used as the key for the in-use map.
+ //
+ // Returns "" for nil input — callers MUST treat the returned key as opaque
+ // and never use the empty string as a sentinel.
func canonical(ip net.IP) string {
+ if ip == nil {
+ return ""
+ }
if v4 := ip.To4(); v4 != nil {
return v4.String()
}
- return ip.To16().String()
+ if v16 := ip.To16(); v16 != nil {
+ return v16.String()
+ }
+ return ""
}
// ipToU32 reads a 4-byte IPv4 net.IP into a uint32. The caller is expected
// to have already validated that ip is an IPv4 address; mis-use returns 0
// rather than panicking.
func ipToU32(ip net.IP) uint32 { func ipToU32(ip net.IP) uint32 {
v4 := ip.To4() v4 := ip.To4()
if v4 == nil {
return 0
}
return uint32(v4[0])<<24 | uint32(v4[1])<<16 | uint32(v4[2])<<8 | uint32(v4[3]) return uint32(v4[0])<<24 | uint32(v4[1])<<16 | uint32(v4[2])<<8 | uint32(v4[3])
} }
+169
@@ -0,0 +1,169 @@
package agent
import (
"net"
"testing"
)
// FuzzIPAM_Allocate runs randomly-driven Allocate/Release sequences against
// a /120 IPv6 + /28 IPv4 IPAM so the fuzzer can hit address exhaustion.
//
// Properties checked:
//
// 1. Allocate never panics regardless of the action stream.
// 2. The set of in-use addresses never contains an address that has been
// released without a subsequent successful Allocate.
// 3. A successful v6 allocation always yields an address inside the
// configured /120, and a successful v4 always inside the configured /28.
// 4. No v4 allocation lands on .0 (network), .1 (gateway, reserved by
// convention), or .15 (broadcast) of the /28.
// 5. No address is handed out a second time while an earlier allocation
// of it is still live.
//
// The fuzzed bytes are interpreted as an opcode stream:
// - bytes[i] & 0x03 selects the action: 0=alloc-v6, 1=alloc-v4,
// 2=alloc-dual, 3=release-most-recent.
// - bytes[i]>>2 is fed into the deterministic random source so different
// fuzzed bytes drive different IID/index choices.
func FuzzIPAM_Allocate(f *testing.F) {
f.Add([]byte{0, 0, 0, 0})
f.Add([]byte{1, 1, 1, 1})
f.Add([]byte{2, 2, 2, 2})
f.Add([]byte{0, 1, 2, 3})
f.Add([]byte(longString("\x00\x01\x02\x03", 256)))
f.Fuzz(func(t *testing.T, ops []byte) {
ipam, err := NewIPAM(
[]string{"2001:db8::/120"}, // 256 host slots; 16 bytes of fuzzed nibbles
[]string{"10.0.0.0/28"}, // 13 usable hosts (.2..14); .0, .1 and .15 are reserved
)
if err != nil {
t.Fatal(err)
}
// Deterministic source: replay nibbles cycled from `ops`.
fr := &fakeRand{
nibbles: append([]byte{}, ops...),
iids: [][]byte{
// 16 bytes of "host portion" — only the last byte matters
// for a /120 prefix.
makeIID(ops, 0),
makeIID(ops, 1),
makeIID(ops, 2),
makeIID(ops, 3),
},
}
if len(fr.nibbles) == 0 {
fr.nibbles = []byte{0}
}
ipam.randSrc = fr
net6 := mustNet(t, "2001:db8::/120")
net4 := mustNet(t, "10.0.0.0/28")
var live []AllocResult
seen := map[string]struct{}{}
for idx, op := range ops {
req := AllocRequest{ContainerID: idStr(idx)}
switch op & 0x03 {
case 0:
req.WantV6 = true
case 1:
req.WantV4 = true
case 2:
req.WantV6, req.WantV4 = true, true
case 3:
if len(live) == 0 {
continue
}
rel := live[len(live)-1]
live = live[:len(live)-1]
ipam.Release(rel.IP6, rel.IP4)
delete(seen, canonical(rel.IP6))
delete(seen, canonical(rel.IP4))
continue
}
res, err := ipam.Allocate(req)
if err != nil {
continue // exhaustion is acceptable
}
if req.WantV6 {
if res.IP6 == nil {
t.Fatalf("requested v6 but got nil")
}
if !net6.Contains(res.IP6) {
t.Fatalf("v6 %s outside /120", res.IP6)
}
if _, dup := seen[canonical(res.IP6)]; dup {
t.Fatalf("v6 %s duplicated", res.IP6)
}
seen[canonical(res.IP6)] = struct{}{}
}
if req.WantV4 {
if res.IP4 == nil {
t.Fatalf("requested v4 but got nil")
}
if !net4.Contains(res.IP4) {
t.Fatalf("v4 %s outside /28", res.IP4)
}
v4 := res.IP4.To4()
if v4 == nil {
t.Fatalf("v4 result not 4-byte: %s", res.IP4)
}
// Skip .0 (network) and .15 (broadcast). The allocator
// should also skip .1 (gateway) by convention.
last := v4[3]
if last == 0 || last == 1 || last == 15 {
t.Fatalf("v4 %s in reserved range", res.IP4)
}
if _, dup := seen[canonical(res.IP4)]; dup {
t.Fatalf("v4 %s duplicated", res.IP4)
}
seen[canonical(res.IP4)] = struct{}{}
}
live = append(live, res)
}
})
}
// FuzzCanonical asserts that canonical never panics and is idempotent.
func FuzzCanonical(f *testing.F) {
f.Add([]byte{})
f.Add([]byte{1, 2, 3, 4})
f.Add([]byte{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0})
f.Add([]byte{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff, 10, 0, 0, 1}) // v4-mapped v6
f.Add([]byte{0xff})
f.Fuzz(func(t *testing.T, b []byte) {
ip := net.IP(b)
s1 := canonical(ip)
// Idempotent: re-canonicalising the parsed form yields the same
// string for any non-empty result.
if s1 != "" {
parsed := net.ParseIP(s1)
if parsed == nil {
t.Fatalf("canonical(%v)=%q is not parseable as IP", b, s1)
}
if got := canonical(parsed); got != s1 {
t.Fatalf("not idempotent: %q -> %q", s1, got)
}
}
})
}
func makeIID(seed []byte, salt byte) []byte {
out := make([]byte, net.IPv6len)
for i := range out {
if i < len(seed) {
out[i] = seed[i] ^ salt
} else {
out[i] = salt
}
}
return out
}
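// idStr encodes the full op index so container IDs stay unique past 256 ops.
func idStr(i int) string {
const hex = "0123456789abcdef"
out := []byte{hex[(i>>4)&0xF], hex[i&0xF]}
for i >>= 8; i > 0; i >>= 4 {
out = append([]byte{hex[i&0xF]}, out...)
}
return "c-" + string(out)
}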
+2 -2
@@ -148,8 +148,8 @@ func TestIPAM_AllocV6_WithEmbed(t *testing.T) {
 }
 i.randSrc = &fakeRand{nibbles: []byte{0xe}}
 res, err := i.Allocate(AllocRequest{
-ContainerID: "c1", Namespace: "mail", Pod: "stalwart-0", WantV6: true,
-IPAlgo: []embed.Field{embed.FieldNamespace, embed.FieldPod, embed.FieldImage},
+ContainerID: "c1", Namespace: "mail", Pod: "stalwart-0", App: "stalwart", WantV6: true,
+IPAlgo: []embed.Field{embed.FieldNamespace, embed.FieldApp, embed.FieldImage},
 })
 if err != nil {
 t.Fatalf("Allocate: %v", err)
+22
@@ -25,6 +25,11 @@ type SetupRequest struct {
 // Host /128 and /32 routes are NOT installed here — that happens once
 // the pod becomes Ready, see AnycastReconciler.
 Anycast []net.IP
+// Addresses are additional IPs to bind directly on pod eth0 (NOT lo).
+// BGP advertisement is handled identically to Anycast by the
+// AnycastReconciler. Use when the workload needs the IP on its primary
+// interface (e.g. Plex remote-access detection).
+Addresses []net.IP
 }
 // LinkLocalGW is the deterministic IPv6 LL gateway placed on every host
@@ -269,6 +274,23 @@ func configurePodSide(req SetupRequest) error {
 }
 }
+// Addresses: assign directly to pod eth0. Host routing and BGP
+// advertisement are handled identically to Anycast by the
+// AnycastReconciler (host route via pod-eth0-ip, /128+/32 in BIRD).
+for _, ip := range req.Addresses {
+var mask net.IPMask
+if ip.To4() != nil {
+mask = net.CIDRMask(32, 32)
+ip = ip.To4()
+} else {
+mask = net.CIDRMask(128, 128)
+}
+a := &netlink.Addr{IPNet: &net.IPNet{IP: ip, Mask: mask}, Scope: int(netlink.SCOPE_UNIVERSE)}
+if err := netlink.AddrAdd(eth0, a); err != nil && !errors.Is(err, os.ErrExist) {
+return fmt.Errorf("pod eth0 address %s: %w", ip, err)
+}
+}
 return nil
 })
 }
+1
@@ -16,6 +16,7 @@ type SetupRequest struct {
 IP6 net.IP
 IP4 net.IP
 Anycast []net.IP
+Addresses []net.IP
 }
 // Setup is unimplemented on non-Linux platforms; the agent only runs in
+85
@@ -0,0 +1,85 @@
//go:build linux
package netpol
import (
"bytes"
"context"
"fmt"
"os/exec"
"time"
)
// Applier hands rendered nft scripts to the kernel via `nft -f -`.
// nftables guarantees the entire script applies atomically — if any line
// is rejected, the previous ruleset stays intact.
//
// Applier maintains the last-applied script string and skips the exec
// when the new render is byte-identical, so a periodic reconcile tick on
// a quiet cluster is cheap.
type Applier struct {
// NftPath is the path to the nft binary. Empty means "look up `nft`
// on PATH". Tests set this to a fake.
NftPath string
// Timeout bounds an individual nft invocation; if zero, defaults to
// 5 seconds.
Timeout time.Duration
last string
}
// Apply runs `nft -f -` with the supplied script. Idempotent: if script
// equals the last successful application, this is a no-op.
//
// Returns an error from nft (with stderr captured) if the script is
// malformed or the kernel rejects it.
func (a *Applier) Apply(ctx context.Context, script string) error {
if script == a.last {
return nil
}
timeout := a.Timeout
if timeout == 0 {
timeout = 5 * time.Second
}
bin := a.NftPath
if bin == "" {
bin = "nft"
}
cctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
cmd := exec.CommandContext(cctx, bin, "-f", "-")
cmd.Stdin = bytes.NewBufferString(script)
var stderr bytes.Buffer
cmd.Stderr = &stderr
if err := cmd.Run(); err != nil {
return fmt.Errorf("nft -f -: %w: %s", err, stderr.String())
}
a.last = script
return nil
}
// Clear tears down the flock NetworkPolicy table. Graceful shutdown calls
// it so a stopping agent doesn't leave stale enforcement behind.
// Best-effort: every nft failure (missing binary, absent table, timeout)
// is swallowed and Clear returns nil.
func (a *Applier) Clear(ctx context.Context) error {
timeout := a.Timeout
if timeout == 0 {
timeout = 5 * time.Second
}
bin := a.NftPath
if bin == "" {
bin = "nft"
}
cctx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
cmd := exec.CommandContext(cctx, bin, "destroy", "table", "inet", "flock_netpol")
if err := cmd.Run(); err != nil {
// nft returns non-zero if the table doesn't exist — that's a
// success for our purposes.
return nil
}
a.last = ""
return nil
}
+16
@@ -0,0 +1,16 @@
//go:build !linux
package netpol
import (
"context"
"time"
)
// Applier is a no-op on non-Linux build hosts so unit tests run on macOS
// without nft. Field types match the Linux build so callers compile on both.
type Applier struct {
NftPath string
Timeout time.Duration
last string
}
func (a *Applier) Apply(_ context.Context, script string) error { a.last = script; return nil }
func (a *Applier) Clear(_ context.Context) error { a.last = ""; return nil }
+250
@@ -0,0 +1,250 @@
package netpol
import (
"net"
"strings"
"testing"
corev1 "k8s.io/api/core/v1"
netv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/util/intstr"
)
// These fixtures mirror the three NetworkPolicies live in the sjc001
// cluster on 2026-04-25. They serve as integration-shaped tests: the
// translator + renderer must produce a sensible nft script for each.
//
// Source of truth (refresh by running `kubectl get netpol -A -o yaml`):
//
// - calico-apiserver/allow-apiserver
// - remote-proxies/lodge-home-assistant-ingress
// - storage/garage-admin-restrict
// allowApiserverPolicy: TCP/5443 ingress to apiserver=true pods, no peer
// restriction (allow-from-anywhere on that port).
func allowApiserverPolicy() netv1.NetworkPolicy {
tcp := corev1.ProtocolTCP
port := intstr.FromInt32(5443)
return netv1.NetworkPolicy{
ObjectMeta: metav1.ObjectMeta{Namespace: "calico-apiserver", Name: "allow-apiserver"},
Spec: netv1.NetworkPolicySpec{
PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"apiserver": "true"}},
PolicyTypes: []netv1.PolicyType{netv1.PolicyTypeIngress},
Ingress: []netv1.NetworkPolicyIngressRule{{
Ports: []netv1.NetworkPolicyPort{{Protocol: &tcp, Port: &port}},
}},
},
}
}
// lodgeHomeAssistantPolicy: TCP/8080 from any pod in the `edge` namespace
// to pods labelled app=lodge-home-assistant.
func lodgeHomeAssistantPolicy() netv1.NetworkPolicy {
tcp := corev1.ProtocolTCP
port := intstr.FromInt32(8080)
return netv1.NetworkPolicy{
ObjectMeta: metav1.ObjectMeta{Namespace: "remote-proxies", Name: "lodge-home-assistant-ingress"},
Spec: netv1.NetworkPolicySpec{
PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"app": "lodge-home-assistant"}},
PolicyTypes: []netv1.PolicyType{netv1.PolicyTypeIngress},
Ingress: []netv1.NetworkPolicyIngressRule{{
From: []netv1.NetworkPolicyPeer{{
NamespaceSelector: &metav1.LabelSelector{
MatchLabels: map[string]string{"kubernetes.io/metadata.name": "edge"},
},
}},
Ports: []netv1.NetworkPolicyPort{{Protocol: &tcp, Port: &port}},
}},
},
}
}
// garageAdminPolicy: complex two-rule policy.
//
// 1. Allow TCP/{3900, 80, 3901} from anywhere.
// 2. Allow TCP/3903 only from pods in `edge` or `storage`.
func garageAdminPolicy() netv1.NetworkPolicy {
tcp := corev1.ProtocolTCP
p3900 := intstr.FromInt32(3900)
p80 := intstr.FromInt32(80)
p3901 := intstr.FromInt32(3901)
p3903 := intstr.FromInt32(3903)
return netv1.NetworkPolicy{
ObjectMeta: metav1.ObjectMeta{Namespace: "storage", Name: "garage-admin-restrict"},
Spec: netv1.NetworkPolicySpec{
PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{"app": "garage"}},
PolicyTypes: []netv1.PolicyType{netv1.PolicyTypeIngress},
Ingress: []netv1.NetworkPolicyIngressRule{
{
Ports: []netv1.NetworkPolicyPort{
{Protocol: &tcp, Port: &p3900},
{Protocol: &tcp, Port: &p80},
{Protocol: &tcp, Port: &p3901},
},
},
{
From: []netv1.NetworkPolicyPeer{
{NamespaceSelector: &metav1.LabelSelector{
MatchLabels: map[string]string{"kubernetes.io/metadata.name": "edge"},
}},
{NamespaceSelector: &metav1.LabelSelector{
MatchLabels: map[string]string{"kubernetes.io/metadata.name": "storage"},
}},
},
Ports: []netv1.NetworkPolicyPort{{Protocol: &tcp, Port: &p3903}},
},
},
},
}
}
// TestClusterFixture_AllowApiserver — pod selected by the policy gets
// isolated; the rendered script accepts TCP/5443 from anywhere.
func TestClusterFixture_AllowApiserver(t *testing.T) {
pod := Pod{
Namespace: "calico-apiserver",
Name: "calico-apiserver-1",
Labels: map[string]string{"apiserver": "true"},
HostIface: "flock00000001",
IPs: []net.IP{mustIP("2001:db8::1")},
}
out, err := Translate(Inputs{
LocalPods: []Pod{pod},
Policies: []netv1.NetworkPolicy{allowApiserverPolicy()},
}, nil)
if err != nil {
t.Fatal(err)
}
in, _ := isolationFor(out, "calico-apiserver/calico-apiserver-1")
if !in {
t.Fatalf("apiserver pod should be isolated for ingress")
}
script := Render(out)
if !strings.Contains(script, "tcp dport 5443 accept") {
t.Fatalf("expected TCP/5443 allow:\n%s", script)
}
// No peer filter — allow-all-on-port.
if strings.Contains(script, "ip6 saddr {") || strings.Contains(script, "ip saddr {") {
t.Fatalf("expected no peer filter for allow-from-anywhere:\n%s", script)
}
}
// TestClusterFixture_LodgeHomeAssistant — pod isolated; only TCP/8080
// from edge namespace is allowed.
func TestClusterFixture_LodgeHomeAssistant(t *testing.T) {
pod := Pod{
Namespace: "remote-proxies",
Name: "lodge-home-assistant-0",
Labels: map[string]string{"app": "lodge-home-assistant"},
HostIface: "flock00000002",
IPs: []net.IP{mustIP("2001:db8::2")},
}
traefik := PeerPod{
Namespace: "edge", Name: "traefik-0",
Labels: map[string]string{"app": "traefik"},
IPs: []net.IP{mustIP("2001:db8::aa")},
}
stranger := PeerPod{
Namespace: "default", Name: "random",
Labels: map[string]string{"app": "random"},
IPs: []net.IP{mustIP("2001:db8::bb")},
}
out, err := Translate(Inputs{
LocalPods: []Pod{pod},
PeerPods: []PeerPod{traefik, stranger},
Namespaces: []Namespace{
{Name: "edge", Labels: map[string]string{"kubernetes.io/metadata.name": "edge"}},
{Name: "default", Labels: map[string]string{"kubernetes.io/metadata.name": "default"}},
{Name: "remote-proxies", Labels: map[string]string{"kubernetes.io/metadata.name": "remote-proxies"}},
},
Policies: []netv1.NetworkPolicy{lodgeHomeAssistantPolicy()},
}, nil)
if err != nil {
t.Fatal(err)
}
if len(out.Rules) != 1 {
t.Fatalf("expected 1 rule, got %d", len(out.Rules))
}
r := out.Rules[0]
// Peer should be exactly traefik's IP, not stranger's.
got := map[string]bool{}
for _, c := range r.PeerCIDRs {
got[c.IP.String()] = true
}
if !got["2001:db8::aa"] {
t.Fatalf("traefik IP missing from rule: %v", got)
}
if got["2001:db8::bb"] {
t.Fatalf("stranger IP leaked into rule")
}
script := Render(out)
if !strings.Contains(script, "tcp dport 8080 accept") {
t.Fatalf("expected TCP/8080 allow:\n%s", script)
}
}
// TestClusterFixture_Garage — verifies the two-rule policy:
//
// 1. ports {3900, 80, 3901} accept from any peer
// 2. port 3903 accept only from edge or storage namespaces
func TestClusterFixture_Garage(t *testing.T) {
pod := Pod{
Namespace: "storage", Name: "garage-0",
Labels: map[string]string{"app": "garage"},
HostIface: "flock00000003",
IPs: []net.IP{mustIP("2001:db8::3")},
}
storagePeer := PeerPod{
Namespace: "storage", Name: "garage-1",
Labels: map[string]string{"app": "garage"},
IPs: []net.IP{mustIP("2001:db8::31")},
}
edgePeer := PeerPod{
Namespace: "edge", Name: "traefik-0",
Labels: map[string]string{"app": "traefik"},
IPs: []net.IP{mustIP("2001:db8::41")},
}
stranger := PeerPod{
Namespace: "default", Name: "random",
Labels: map[string]string{"app": "random"},
IPs: []net.IP{mustIP("2001:db8::ff")},
}
out, err := Translate(Inputs{
LocalPods: []Pod{pod},
PeerPods: []PeerPod{storagePeer, edgePeer, stranger},
Namespaces: []Namespace{
{Name: "edge", Labels: map[string]string{"kubernetes.io/metadata.name": "edge"}},
{Name: "storage", Labels: map[string]string{"kubernetes.io/metadata.name": "storage"}},
{Name: "default", Labels: map[string]string{"kubernetes.io/metadata.name": "default"}},
},
Policies: []netv1.NetworkPolicy{garageAdminPolicy()},
}, nil)
if err != nil {
t.Fatal(err)
}
// Two ingress rules in the source policy → two Rules out (one per
// peer set, ports inline).
if len(out.Rules) != 2 {
t.Fatalf("expected 2 rules (one per ingress entry), got %d", len(out.Rules))
}
script := Render(out)
for _, want := range []string{
"tcp dport 3900 accept",
"tcp dport 80 accept",
"tcp dport 3901 accept",
"tcp dport 3903 accept",
} {
if !strings.Contains(script, want) {
t.Errorf("missing %q in script:\n%s", want, script)
}
}
// The 3903 rule must carry a peer filter for both edge and storage
// peer IPs but not the stranger.
if !strings.Contains(script, "2001:db8::31/128") || !strings.Contains(script, "2001:db8::41/128") {
t.Fatalf("expected edge+storage peer IPs in 3903 rule:\n%s", script)
}
if strings.Contains(script, "2001:db8::ff/128") {
t.Fatalf("stranger IP must not appear:\n%s", script)
}
}
+44
@@ -0,0 +1,44 @@
// Package netpol implements Kubernetes NetworkPolicy enforcement for flock.
//
// # Model
//
// NetworkPolicy is a Kubernetes-native API (`networking.k8s.io/v1`) that
// describes which pods may receive traffic (Ingress) and / or initiate
// traffic (Egress). The semantics are isolation by selection: a pod that is
// selected by *any* NetworkPolicy in a given direction becomes default-deny
// in that direction, and the only traffic still permitted is the union of
// the "allow" rules from every policy that selects it. A pod selected by no
// policy is unrestricted.
//
// flock enforces these semantics with nftables. Each agent is responsible
// for the pods scheduled on its own node — peer addresses (from
// podSelector / namespaceSelector / ipBlock peers) come from a cluster-wide
// informer set so the agent can resolve peers that live elsewhere.
//
// # Pipeline
//
// The work is split into four stages with hard boundaries between them so
// each can be tested in isolation:
//
// 1. Informers (informers.go) — watch NetworkPolicies, Namespaces, and
// all Pods in the cluster. Maintain indices the translator can query.
//
// 2. Translator (translator.go) — pure function from
// (NetworkPolicy set, Namespace set, Pod set, local-node pod set) to
// []Rule. No I/O, no hidden state — straightforward to fuzz and unit
// test. Implements the default-deny semantics and the peer-resolution
// rules from the NetworkPolicy spec.
//
// 3. Renderer (render.go) — pure function from []Rule to an nft script
// (string). Output is deterministic so the apply stage can de-dupe.
//
// 4. Apply (apply_linux.go) — shell out to `nft -f -` for an atomic
// reconfiguration. nftables guarantees the whole script applies as a
// single transaction; partial failures roll back automatically.
//
// # Why nftables (and not eBPF)
//
// Atomic ruleset transactions, kernel-native, no userspace ebpf-loader to
// maintain, and behaviour an operator can read directly with
// `nft list ruleset`. The cost is that we walk per-pod chains in software,
// which is fine at the cluster sizes flock targets.
package netpol
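The isolation-by-selection model described in the package comment can be sketched in a few lines. This is a deliberately reduced illustration, not the translator: the `policy` type, its single-label selector, and `ingressAllowed` are hypothetical, and peers, namespaces, and egress are left out.

```go
package main

import "fmt"

// policy is a hypothetical, heavily simplified NetworkPolicy: it selects pods
// by one label and allows ingress on a list of TCP ports.
type policy struct {
	key, value string
	ports      []int
}

// ingressAllowed reports whether a pod with the given labels accepts port.
// A pod selected by no policy is unrestricted; a selected pod accepts only
// the union of the ports allowed by the policies that select it.
func ingressAllowed(labels map[string]string, policies []policy, port int) bool {
	selected := false
	for _, p := range policies {
		if v, ok := labels[p.key]; !ok || v != p.value {
			continue
		}
		selected = true
		for _, allowed := range p.ports {
			if allowed == port {
				return true
			}
		}
	}
	return !selected
}

func main() {
	policies := []policy{
		{key: "app", value: "garage", ports: []int{3900}},
		{key: "app", value: "garage", ports: []int{3903}},
	}
	garage := map[string]string{"app": "garage"}
	web := map[string]string{"app": "web"}
	fmt.Println(
		ingressAllowed(garage, policies, 3900), // allowed by the first policy
		ingressAllowed(garage, policies, 3903), // allowed by the second policy
		ingressAllowed(garage, policies, 22),   // selected, but no rule allows it
		ingressAllowed(web, policies, 22),      // not selected, so unrestricted
	)
	// prints true true false true
}
```

A policy with an empty port list still flips `selected`, which is how a default-deny policy isolates a pod without allowing anything.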
+222
@@ -0,0 +1,222 @@
package netpol
import (
"context"
"fmt"
"log/slog"
"net"
"sync"
"time"
corev1 "k8s.io/api/core/v1"
netv1 "k8s.io/api/networking/v1"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/cache"
)
// World aggregates the cluster-wide caches the reconciler queries on
// every pass: NetworkPolicies, Namespaces, and all Pods (for peer
// resolution). Each field is safe for concurrent reads.
type World struct {
logger *slog.Logger
mu sync.RWMutex
policies map[string]netv1.NetworkPolicy // key = ns/name
namespaces map[string]Namespace
peerPods map[string]PeerPod // key = ns/name
onChange []func()
}
// NewWorld returns an empty World. Callers should call Start to populate
// it; before Start, the snapshot accessors return empty slices.
func NewWorld(logger *slog.Logger) *World {
return &World{
logger: logger,
policies: map[string]netv1.NetworkPolicy{},
namespaces: map[string]Namespace{},
peerPods: map[string]PeerPod{},
}
}
// OnChange registers a callback fired (synchronously, inside the informer
// event handler) whenever any watched object changes. The reconciler
// uses this to debounce policy reloads.
func (w *World) OnChange(f func()) {
w.mu.Lock()
defer w.mu.Unlock()
w.onChange = append(w.onChange, f)
}
func (w *World) fireChange() {
w.mu.RLock()
cbs := append([]func(){}, w.onChange...)
w.mu.RUnlock()
for _, f := range cbs {
f()
}
}
// Start launches three informers (NetworkPolicy, Namespace, Pod) against
// the cluster API. It blocks until each cache reports synced. The caller
// is responsible for cancelling ctx on shutdown.
func (w *World) Start(ctx context.Context, cfg *rest.Config) error {
cs, err := kubernetes.NewForConfig(cfg)
if err != nil {
return fmt.Errorf("kubernetes client: %w", err)
}
factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
npInformer := factory.Networking().V1().NetworkPolicies().Informer()
nsInformer := factory.Core().V1().Namespaces().Informer()
podInformer := factory.Core().V1().Pods().Informer()
if _, err := npInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { w.onPolicy(obj, false) },
UpdateFunc: func(_, n interface{}) { w.onPolicy(n, false) },
DeleteFunc: func(obj interface{}) { w.onPolicy(obj, true) },
}); err != nil {
return fmt.Errorf("add netpol handler: %w", err)
}
if _, err := nsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { w.onNamespace(obj, false) },
UpdateFunc: func(_, n interface{}) { w.onNamespace(n, false) },
DeleteFunc: func(obj interface{}) { w.onNamespace(obj, true) },
}); err != nil {
return fmt.Errorf("add ns handler: %w", err)
}
if _, err := podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { w.onPod(obj, false) },
UpdateFunc: func(_, n interface{}) { w.onPod(n, false) },
DeleteFunc: func(obj interface{}) { w.onPod(obj, true) },
}); err != nil {
return fmt.Errorf("add pod handler: %w", err)
}
w.logger.Info("netpol informers starting")
factory.Start(ctx.Done())
if !cache.WaitForCacheSync(ctx.Done(),
npInformer.HasSynced, nsInformer.HasSynced, podInformer.HasSynced) {
return fmt.Errorf("netpol informer caches failed to sync")
}
w.logger.Info("netpol informers synced",
"netpols", len(w.snapshotPolicies()),
"namespaces", len(w.snapshotNamespaces()),
"peer_pods", len(w.snapshotPeerPods()))
return nil
}
// unwrapDFSU lifts a DeletedFinalStateUnknown wrapper if present.
func unwrapDFSU(obj interface{}) interface{} {
if d, ok := obj.(cache.DeletedFinalStateUnknown); ok {
return d.Obj
}
return obj
}
func (w *World) onPolicy(obj interface{}, deleted bool) {
p, ok := unwrapDFSU(obj).(*netv1.NetworkPolicy)
if !ok || p == nil {
return
}
key := p.Namespace + "/" + p.Name
w.mu.Lock()
if deleted {
delete(w.policies, key)
} else {
w.policies[key] = *p
}
w.mu.Unlock()
w.fireChange()
}
func (w *World) onNamespace(obj interface{}, deleted bool) {
ns, ok := unwrapDFSU(obj).(*corev1.Namespace)
if !ok || ns == nil {
return
}
w.mu.Lock()
if deleted {
delete(w.namespaces, ns.Name)
} else {
w.namespaces[ns.Name] = Namespace{Name: ns.Name, Labels: ns.Labels}
}
w.mu.Unlock()
w.fireChange()
}
func (w *World) onPod(obj interface{}, deleted bool) {
pod, ok := unwrapDFSU(obj).(*corev1.Pod)
if !ok || pod == nil {
return
}
key := pod.Namespace + "/" + pod.Name
w.mu.Lock()
if deleted {
delete(w.peerPods, key)
} else {
w.peerPods[key] = PeerPod{
Namespace: pod.Namespace,
Name: pod.Name,
Labels: pod.Labels,
IPs: podIPs(pod),
}
}
w.mu.Unlock()
w.fireChange()
}
// podIPs extracts every parseable PodIP from the status. Pods without
// status (still scheduling) yield an empty slice, which is safe for the
// translator.
func podIPs(p *corev1.Pod) []net.IP {
out := make([]net.IP, 0, len(p.Status.PodIPs))
for _, addr := range p.Status.PodIPs {
ip := net.ParseIP(addr.IP)
if ip == nil {
continue
}
out = append(out, ip)
}
if len(out) == 0 && p.Status.PodIP != "" {
// Older clusters may populate PodIP but not PodIPs; tolerate both.
if ip := net.ParseIP(p.Status.PodIP); ip != nil {
out = append(out, ip)
}
}
return out
}
// snapshotPolicies returns a defensive copy of the policy map's values.
func (w *World) snapshotPolicies() []netv1.NetworkPolicy {
w.mu.RLock()
defer w.mu.RUnlock()
out := make([]netv1.NetworkPolicy, 0, len(w.policies))
for _, p := range w.policies {
out = append(out, p)
}
return out
}
// snapshotNamespaces returns a defensive copy of the namespace map.
func (w *World) snapshotNamespaces() []Namespace {
w.mu.RLock()
defer w.mu.RUnlock()
out := make([]Namespace, 0, len(w.namespaces))
for _, n := range w.namespaces {
out = append(out, n)
}
return out
}
// snapshotPeerPods returns a defensive copy of the peer-pod map.
func (w *World) snapshotPeerPods() []PeerPod {
w.mu.RLock()
defer w.mu.RUnlock()
out := make([]PeerPod, 0, len(w.peerPods))
for _, p := range w.peerPods {
out = append(out, p)
}
return out
}
+115
@@ -0,0 +1,115 @@
package netpol
import (
"context"
"log/slog"
"sync"
"time"
)
// LocalPodSource produces the set of local pods (with their HostIface and
// IPs) the reconciler should enforce policy for. The agent's allocation
// store + pod informer is the natural implementer.
//
// The function is called inside the reconciler under no lock, so it must
// be safe for concurrent invocation.
type LocalPodSource func() []Pod
// Reconciler turns the World cache + LocalPodSource into nft rule
// applications. One reconcile pass:
//
// pods + policies + namespaces → Translate → Render → Apply
//
// The pass runs on:
//
// - World.OnChange (any informer event), debounced through a single
// coalescing channel,
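// - a periodic tick (default 30s) as a safety net that re-renders from
// the current caches. Applier.Apply skips byte-identical scripts, so
// the tick alone does not restore a ruleset that was changed out of
// band (e.g. someone manually `nft flush`d),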
// - and explicit Trigger() calls (the agent fires this from CNI ADD /
// DEL hooks so policy lands before pod traffic flows).
type Reconciler struct {
World *World
Local LocalPodSource
Applier *Applier
Logger *slog.Logger
Interval time.Duration
mu sync.Mutex
trigger chan struct{}
}
// NewReconciler returns a Reconciler ready to Run, with Interval set to
// 30s. Callers may override Interval before calling Run.
func NewReconciler(world *World, local LocalPodSource, applier *Applier, logger *slog.Logger) *Reconciler {
r := &Reconciler{
World: world,
Local: local,
Applier: applier,
Logger: logger,
Interval: 30 * time.Second,
trigger: make(chan struct{}, 1),
}
world.OnChange(r.Trigger)
return r
}
// Trigger requests one reconcile pass. Coalesces — if a pass is already
// pending, the call is a no-op.
func (r *Reconciler) Trigger() {
select {
case r.trigger <- struct{}{}:
default:
}
}
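The coalescing in Trigger comes from a channel with capacity one combined with a non-blocking send. The sketch below isolates that pattern; `coalescer` and `drain` are hypothetical names, and the real Reconciler consumes the channel from its Run loop instead.

```go
package main

import "fmt"

// coalescer holds at most one pending request, however often Trigger is called.
type coalescer struct{ ch chan struct{} }

func newCoalescer() *coalescer { return &coalescer{ch: make(chan struct{}, 1)} }

// Trigger never blocks: if a request is already pending the send is dropped.
func (c *coalescer) Trigger() {
	select {
	case c.ch <- struct{}{}:
	default:
	}
}

// drain counts how many pending requests can be received without blocking.
func (c *coalescer) drain() int {
	n := 0
	for {
		select {
		case <-c.ch:
			n++
		default:
			return n
		}
	}
}

func main() {
	c := newCoalescer()
	for i := 0; i < 5; i++ {
		c.Trigger()
	}
	fmt.Println(c.drain()) // prints 1

	c.Trigger() // a new request after the drain is accepted again
	fmt.Println(c.drain()) // prints 1
}
```

Because the sender never blocks, it is safe to call Trigger from inside an informer event handler, which is where World.OnChange fires its callbacks.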
// Run blocks until ctx is cancelled. Reconciles on Trigger or every
// Interval; calls Applier.Clear on shutdown.
func (r *Reconciler) Run(ctx context.Context) {
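// time.NewTicker panics on a non-positive duration, so fall back to 30s.
interval := r.Interval
if interval <= 0 {
interval = 30 * time.Second
}
t := time.NewTicker(interval)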
defer t.Stop()
r.reconcile(ctx) // initial pass
for {
select {
case <-ctx.Done():
// Best-effort: drop our table on graceful exit. If the agent
// crashed without doing this, the next agent's first apply
// will replace the stale table atomically anyway.
_ = r.Applier.Clear(context.Background())
return
case <-t.C:
r.reconcile(ctx)
case <-r.trigger:
r.reconcile(ctx)
}
}
}
func (r *Reconciler) reconcile(ctx context.Context) {
r.mu.Lock()
defer r.mu.Unlock()
in := Inputs{
LocalPods: r.Local(),
PeerPods: r.World.snapshotPeerPods(),
Namespaces: r.World.snapshotNamespaces(),
Policies: r.World.snapshotPolicies(),
}
out, err := Translate(in, func(s string) { r.Logger.Warn(s) })
if err != nil {
r.Logger.Warn("netpol translate failed", "err", err)
return
}
script := Render(out)
if err := r.Applier.Apply(ctx, script); err != nil {
r.Logger.Warn("netpol apply failed", "err", err)
return
}
if len(out.Isolated) > 0 {
r.Logger.Info("netpol applied",
"isolated_chains", len(out.Isolated),
"rules", len(out.Rules),
"local_pods", len(in.LocalPods),
"policies", len(in.Policies))
}
}
+160
@@ -0,0 +1,160 @@
package netpol
import (
"context"
"io"
"log/slog"
"net"
"strings"
"sync"
"sync/atomic"
"testing"
corev1 "k8s.io/api/core/v1"
netv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// fakeApplier captures Apply calls for assertion. It satisfies
// applierIface, which reconcileOnce accepts in place of *Applier.
type fakeApplier struct {
mu sync.Mutex
calls []string
last string
err error
}
func (f *fakeApplier) Apply(_ context.Context, script string) error {
f.mu.Lock()
defer f.mu.Unlock()
if f.err != nil {
return f.err
}
if script == f.last {
return nil // de-dup like the real Applier
}
f.last = script
f.calls = append(f.calls, script)
return nil
}
func (f *fakeApplier) Clear(_ context.Context) error { return nil }
func (f *fakeApplier) lastScript() string {
f.mu.Lock()
defer f.mu.Unlock()
return f.last
}
func (f *fakeApplier) callCount() int {
f.mu.Lock()
defer f.mu.Unlock()
return len(f.calls)
}
// applierIface is satisfied by *Applier and *fakeApplier. Reconciler.Applier
// is the concrete *Applier, so tests drive a pass through reconcileOnce,
// which accepts this interface instead.
type applierIface interface {
Apply(context.Context, string) error
Clear(context.Context) error
}
// reconcileOnce drives one pass synchronously without spinning a goroutine.
func reconcileOnce(t *testing.T, world *World, local LocalPodSource, app applierIface) {
t.Helper()
in := Inputs{
LocalPods: local(),
PeerPods: world.snapshotPeerPods(),
Namespaces: world.snapshotNamespaces(),
Policies: world.snapshotPolicies(),
}
out, err := Translate(in, nil)
if err != nil {
t.Fatal(err)
}
if err := app.Apply(context.Background(), Render(out)); err != nil {
t.Fatal(err)
}
}
// silentLogger returns a slog.Logger discarding everything — keeps test
// output tidy.
func silentLogger() *slog.Logger {
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{}))
}
func TestReconciler_NoIsolatedPods_ShortScript(t *testing.T) {
world := NewWorld(silentLogger())
local := func() []Pod { return nil }
app := &fakeApplier{}
reconcileOnce(t, world, local, app)
got := app.lastScript()
if !strings.Contains(got, "table inet flock_netpol") {
t.Fatalf("missing table:\n%s", got)
}
// Without any isolated pods the base chain has policy accept and no
// jumps. That's the desired "open" state.
if strings.Contains(got, "jump pod_") {
t.Fatalf("unexpected jump in open state:\n%s", got)
}
}
func TestReconciler_PolicyIsolatesLocalPod(t *testing.T) {
world := NewWorld(silentLogger())
// Seed a default-deny policy in ns1.
world.onPolicy(&netv1.NetworkPolicy{
ObjectMeta: metav1.ObjectMeta{Namespace: "ns1", Name: "deny-all"},
Spec: netv1.NetworkPolicySpec{
PodSelector: metav1.LabelSelector{},
PolicyTypes: []netv1.PolicyType{netv1.PolicyTypeIngress},
},
}, false)
local := func() []Pod {
return []Pod{{
Namespace: "ns1", Name: "web",
Labels: map[string]string{"app": "web"},
HostIface: "flock00000001",
IPs: []net.IP{mustIP("2001:db8::1")},
}}
}
app := &fakeApplier{}
reconcileOnce(t, world, local, app)
got := app.lastScript()
if !strings.Contains(got, "_ingress {") {
t.Fatalf("expected pod ingress chain:\n%s", got)
}
if !strings.Contains(got, "drop") {
t.Fatalf("expected default-deny drop:\n%s", got)
}
if !strings.Contains(got, `oifname "flock00000001" jump pod_`) {
t.Fatalf("expected base-chain jump anchored on veth:\n%s", got)
}
}
func TestReconciler_DedupesIdenticalRender(t *testing.T) {
world := NewWorld(silentLogger())
local := func() []Pod {
return []Pod{{
Namespace: "ns1", Name: "web", HostIface: "f1",
IPs: []net.IP{mustIP("2001:db8::1")},
}}
}
app := &fakeApplier{}
reconcileOnce(t, world, local, app)
reconcileOnce(t, world, local, app)
reconcileOnce(t, world, local, app)
if got := app.callCount(); got != 1 {
t.Fatalf("expected 1 unique apply, got %d", got)
}
}
func TestReconciler_OnChangeFiresTrigger(t *testing.T) {
world := NewWorld(silentLogger())
var triggered atomic.Int32
world.OnChange(func() { triggered.Add(1) })
world.onNamespace(&corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "foo"}}, false)
world.onPolicy(&netv1.NetworkPolicy{ObjectMeta: metav1.ObjectMeta{Namespace: "foo", Name: "p"}}, false)
if triggered.Load() != 2 {
t.Fatalf("expected 2 OnChange calls, got %d", triggered.Load())
}
}
+315
@@ -0,0 +1,315 @@
package netpol
import (
"fmt"
"hash/fnv"
"net"
"sort"
"strings"
)
// Render produces an nftables script that, when applied with `nft -f -`,
// installs the desired NetworkPolicy enforcement state for this node.
//
// Layout:
//
// table inet flock_netpol {
// chain forward { # base chain on hook forward
// type filter hook forward priority filter; policy accept;
// # one jump per (pod, direction) that has rules and/or isolation,
// # matched on the host-side veth name only (no address match)
// iifname "flock1a2b3c4d" jump pod_<hash>_egress
// oifname "flock1a2b3c4d" jump pod_<hash>_ingress
// }
// chain pod_<hash>_ingress { # one per isolated direction
// ct state established,related accept
// # explicit allow lines (empty for default-deny)
// drop
// }
// chain pod_<hash>_egress { ... }
// }
//
// The whole table is replaced atomically: a "destroy table" (which
// succeeds even when the table does not exist yet) followed by the table
// definition with its chains. nft executes the script as a single
// transaction; partial application is impossible.
//
// Output is deterministic: equal Output → byte-identical script. The
// reconciler relies on this for de-dup.
func Render(out Output) string {
var sb strings.Builder
sb.WriteString("# Generated by flock-agent netpol; do not edit by hand.\n")
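// Drop any previous table first. Unlike "delete table", "destroy table"
// does not fail when the table is absent (first run), so the script
// needs no error suppression. It does require an nftables release and
// kernel that support destroy. The table block below then recreates
// everything.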
sb.WriteString("destroy table inet flock_netpol\n")
sb.WriteString("table inet flock_netpol {\n")
// Build per-(pod, direction) chains. We need them defined BEFORE the
// base chain references them, so we render chains first.
chains := buildChains(out)
for _, c := range chains {
writeChain(&sb, c)
}
// Base chain emits jumps in a stable order (chain name asc).
sb.WriteString("\tchain forward {\n")
sb.WriteString("\t\ttype filter hook forward priority filter; policy accept;\n")
for _, c := range chains {
writeBaseJump(&sb, c)
}
sb.WriteString("\t}\n")
sb.WriteString("}\n")
return sb.String()
}
// chain is one rendered chain — one direction of one pod.
type chain struct {
name string // pod_<hash>_ingress / _egress
hostIface string
podIPs []net.IP
direction Direction
rules []Rule
policy string // "drop" or "accept"
}
// buildChains groups rules by (PodKey, Direction) and adds default-deny
// chains for isolated directions that received no explicit rules.
func buildChains(out Output) []chain {
type key struct {
podKey string
dir Direction
}
byKey := map[key]*chain{}
// Seed isolated directions with empty chains so default-deny lands
// even when no explicit allow rule was emitted for them.
for iso := range out.Isolated {
byKey[key{podKey: iso.PodKey, dir: iso.Direction}] = &chain{
direction: iso.Direction,
policy: "drop",
}
}
// Append rules into their chain. Rule.PodIPs and HostIface are
// authoritative — every rule for a given pod carries the same values
// (translator invariant), so we copy from the first.
for _, r := range out.Rules {
k := key{podKey: r.PodKey, dir: r.Direction}
c := byKey[k]
if c == nil {
// Rule for a non-isolated direction shouldn't happen in
// practice (translator only emits rules for selected pods)
// but be tolerant — the chain just gets policy accept.
c = &chain{direction: r.Direction, policy: "accept"}
byKey[k] = c
}
c.rules = append(c.rules, r)
if c.hostIface == "" {
c.hostIface = r.HostIface
c.podIPs = append([]net.IP(nil), r.PodIPs...)
}
}
// If a chain was created from Isolated only (no rules), look up the
// pod's HostIface + IPs from Output.Pods. This is the path a
// default-deny policy takes — no allow rules, only isolation.
for k, c := range byKey {
if c.hostIface != "" {
continue
}
if lp, ok := out.Pods[k.podKey]; ok {
c.hostIface = lp.HostIface
c.podIPs = append([]net.IP(nil), lp.IPs...)
continue
}
// Last resort: lift from any rule sharing the PodKey. Should
// not normally happen — the translator populates Pods for every
// isolated pod — but defends against partially-populated Output
// values constructed by tests.
for _, r := range out.Rules {
if r.PodKey == k.podKey {
c.hostIface = r.HostIface
c.podIPs = append([]net.IP(nil), r.PodIPs...)
break
}
}
}
// Materialise chain names and emit in deterministic order.
var chains []chain
for k, c := range byKey {
if c.hostIface == "" {
continue // can't jump to it; skip
}
c.name = chainName(k.podKey, c.direction)
chains = append(chains, *c)
}
sort.Slice(chains, func(i, j int) bool { return chains[i].name < chains[j].name })
return chains
}
// chainName produces a stable, name-safe chain identifier. Pod keys can
// contain characters nft doesn't allow in identifiers, so we hash them.
// Direction keeps ingress and egress separate.
func chainName(podKey string, dir Direction) string {
h := fnv.New64a()
_, _ = h.Write([]byte(podKey))
return fmt.Sprintf("pod_%016x_%s", h.Sum64(), dir)
}
// writeChain emits the chain definition. Chains without allow rules exist
// deliberately: the trailing drop rule IS the default-deny (regular chains
// cannot carry a policy, so the drop is an explicit final rule).
func writeChain(sb *strings.Builder, c chain) {
fmt.Fprintf(sb, "\tchain %s {\n", c.name)
// Stateful accept for return traffic. NetworkPolicy applies to the
// start of a new connection — reply packets for pod-initiated flows
// (egress) and follow-up packets of an established ingress flow must
// pass regardless of the explicit allow set, otherwise the chain's
// final drop kills ephemeral-port replies (e.g. pod → kube-apiserver).
sb.WriteString("\t\tct state established,related accept\n")
for _, r := range c.rules {
writeAllowRule(sb, r)
}
if c.policy == "drop" {
sb.WriteString("\t\tdrop\n")
}
sb.WriteString("\t}\n")
}
// writeAllowRule emits one accept line:
//
// [ip|ip6 saddr {peers}] [ip|ip6 saddr != {except}] [proto dport {port|port-end}] accept
//
// The saddr / daddr field flips based on direction (ingress = from peer →
// match saddr; egress = to peer → match daddr).
func writeAllowRule(sb *strings.Builder, r Rule) {
v6Peers, v4Peers := splitFamily(r.PeerCIDRs)
v6Except, v4Except := splitFamily(r.PeerExcept)
v6Pod, v4Pod := splitIPFamily(r.PodIPs)
hasPeerFilter := len(r.PeerCIDRs) > 0
emit := func(family string, peers, except []*net.IPNet, podIP net.IP) {
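if hasPeerFilter && len(peers) == 0 {
// Peer filter exists but has no CIDRs of this family, so the
// rule must not match anything for this family. Except entries
// alone do not count: emitting only "saddr != { … }" would
// accept every other address of the family.
return
}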
if podIP == nil {
// Pod has no address of this family; nothing to guard.
return
}
for _, port := range r.Ports {
sb.WriteString("\t\t")
// Peer (saddr/daddr) match: address is "peer's address",
// which is saddr on ingress and daddr on egress.
peerField := peerAddrField(family, r.Direction)
if hasPeerFilter && len(peers) > 0 {
fmt.Fprintf(sb, "%s { %s } ", peerField, joinCIDRs(peers))
}
if hasPeerFilter && len(except) > 0 {
fmt.Fprintf(sb, "%s != { %s } ", peerField, joinCIDRs(except))
}
// Port match.
writePortMatch(sb, port)
fmt.Fprintf(sb, "%s\n", r.Action)
}
}
emit("ip6", v6Peers, v6Except, v6Pod)
emit("ip", v4Peers, v4Except, v4Pod)
}
// peerAddrField returns "ip6 saddr" / "ip saddr" / "ip6 daddr" / "ip daddr"
// depending on family + direction. Ingress matches the peer as the source;
// egress matches the peer as the destination.
func peerAddrField(family string, dir Direction) string {
if dir == DirIngress {
return family + " saddr"
}
return family + " daddr"
}
// writePortMatch appends "tcp dport 80 " (single port) or
// "tcp dport 8000-8999 " (range), or nothing when port is "any".
func writePortMatch(sb *strings.Builder, p PortMatch) {
if p.Port == 0 && p.Protocol == "" {
return
}
proto := p.Protocol
if proto == "" {
proto = "tcp"
}
if p.Port == 0 {
// Protocol-only match. nft has `meta l4proto tcp`.
fmt.Fprintf(sb, "meta l4proto %s ", proto)
return
}
if p.EndPort > p.Port {
fmt.Fprintf(sb, "%s dport %d-%d ", proto, p.Port, p.EndPort)
return
}
fmt.Fprintf(sb, "%s dport %d ", proto, p.Port)
}
// writeBaseJump emits one line per (pod, direction) chain in the base
// `forward` chain. The match is anchored on the host-side veth name —
// the veth uniquely belongs to one pod, so anything traversing it is
// to/from that pod by definition.
//
// We deliberately don't filter on the pod's eth0 address: the pod can
// also receive traffic addressed to its anycast IP (or any other host
// route the operator has installed via flock-agent), and policy must
// apply uniformly to all of it.
func writeBaseJump(sb *strings.Builder, c chain) {
var iface string
if c.direction == DirEgress {
iface = "iifname"
} else {
iface = "oifname"
}
fmt.Fprintf(sb, "\t\t%s \"%s\" jump %s\n", iface, c.hostIface, c.name)
}
// splitFamily partitions CIDRs into (v6, v4) lists, preserving order
// within each family.
func splitFamily(cs []*net.IPNet) ([]*net.IPNet, []*net.IPNet) {
var v6, v4 []*net.IPNet
for _, c := range cs {
if c.IP.To4() != nil {
v4 = append(v4, c)
} else {
v6 = append(v6, c)
}
}
return v6, v4
}
// splitIPFamily picks one v6 and one v4 from a list of pod IPs (a pod has
// at most one of each in flock's model).
func splitIPFamily(ips []net.IP) (v6, v4 net.IP) {
for _, ip := range ips {
if ip == nil {
continue
}
if ip.To4() != nil {
if v4 == nil {
v4 = ip
}
} else {
if v6 == nil {
v6 = ip
}
}
}
return
}
func joinCIDRs(cs []*net.IPNet) string {
parts := make([]string, len(cs))
for i, c := range cs {
parts[i] = c.String()
}
sort.Strings(parts)
return strings.Join(parts, ", ")
}
+228
@@ -0,0 +1,228 @@
package netpol
import (
"net"
"strings"
"testing"
)
// TestRender_DefaultDeny — an isolated direction with no rules renders
// to a chain whose last action is "drop".
func TestRender_DefaultDeny(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirIngress}: {},
},
Rules: []Rule{
// Need at least one rule to give the chain its HostIface +
// PodIPs. Use an empty rule that selects the same chain.
{PodKey: "ns/web", HostIface: "flock00000001", PodIPs: []net.IP{mustIP("2001:db8::1")},
Direction: DirIngress, Action: ActionAccept,
Ports: []PortMatch{{}}},
},
}
got := Render(out)
if !strings.Contains(got, "table inet flock_netpol") {
t.Fatalf("missing table:\n%s", got)
}
if !strings.Contains(got, "type filter hook forward") {
t.Fatalf("missing base chain:\n%s", got)
}
if !strings.Contains(got, "drop") {
t.Fatalf("expected default-deny drop in chain:\n%s", got)
}
// Pod chain name must be deterministic-looking (pod_<hex>_ingress).
if !strings.Contains(got, "_ingress {") {
t.Fatalf("missing pod ingress chain:\n%s", got)
}
// Base chain jump anchored solely on veth — anycast must not bypass.
if !strings.Contains(got, `oifname "flock00000001" jump pod_`) {
t.Fatalf("missing veth-only ingress jump in base chain:\n%s", got)
}
// Stateful accept must be present so reply traffic for pod-initiated
// outbound (e.g. ephemeral-port replies from kube-apiserver) is not
// dropped by the chain's final drop. Regression guard: production hit
// this when garage's k8s-discovery → apiserver replies got dropped.
if !strings.Contains(got, "ct state established,related accept") {
t.Fatalf("missing ct state established,related accept:\n%s", got)
}
}
// TestRender_DualStack — dual-stack pod gets one veth-anchored jump per
// direction (no per-family jump; the chain handles both).
func TestRender_DualStack(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirIngress}: {},
},
Rules: []Rule{{
PodKey: "ns/web", HostIface: "f1",
PodIPs: []net.IP{mustIP("2001:db8::1"), mustIP("10.0.0.1")},
Direction: DirIngress, Action: ActionAccept,
Ports: []PortMatch{{Protocol: "tcp", Port: 80}},
}},
}
got := Render(out)
// Exactly one ingress jump line with no per-family daddr.
if n := strings.Count(got, `oifname "f1" jump`); n != 1 {
t.Fatalf("expected exactly one veth-only ingress jump, got %d:\n%s", n, got)
}
// No peer filter is set, so the chain carries no ip6/ip saddr match to
// assert on here; TestRender_AllowAllPeers covers that case.
}
// TestRender_PortAndPeer — a Rule with peer + port emits a syntactically
// well-formed allow line.
func TestRender_PortAndPeer(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirIngress}: {},
},
Rules: []Rule{{
PodKey: "ns/web", HostIface: "f1",
PodIPs: []net.IP{mustIP("2001:db8::1")},
Direction: DirIngress, Action: ActionAccept,
PeerCIDRs: []*net.IPNet{mustNet("2001:db8::a/128")},
Ports: []PortMatch{{Protocol: "tcp", Port: 80}},
}},
}
got := Render(out)
if !strings.Contains(got, "ip6 saddr { 2001:db8::a/128 } tcp dport 80 accept") {
t.Fatalf("expected ingress allow with v6 peer + tcp/80:\n%s", got)
}
}
// TestRender_PortRange — endPort renders as "8000-8999".
func TestRender_PortRange(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirIngress}: {},
},
Rules: []Rule{{
PodKey: "ns/web", HostIface: "f1",
PodIPs: []net.IP{mustIP("2001:db8::1")},
Direction: DirIngress, Action: ActionAccept,
PeerCIDRs: []*net.IPNet{mustNet("0.0.0.0/0"), mustNet("::/0")},
Ports: []PortMatch{{Protocol: "tcp", Port: 8000, EndPort: 8999}},
}},
}
got := Render(out)
if !strings.Contains(got, "tcp dport 8000-8999") {
t.Fatalf("expected port range:\n%s", got)
}
}
// TestRender_IPBlockExcept — except produces a "saddr != { … }" guard.
func TestRender_IPBlockExcept(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirIngress}: {},
},
Rules: []Rule{{
PodKey: "ns/web", HostIface: "f1",
PodIPs: []net.IP{mustIP("10.0.0.1")},
Direction: DirIngress, Action: ActionAccept,
PeerCIDRs: []*net.IPNet{mustNet("10.0.0.0/8")},
PeerExcept: []*net.IPNet{mustNet("10.99.0.0/16")},
Ports: []PortMatch{{}},
}},
}
got := Render(out)
if !strings.Contains(got, "ip saddr { 10.0.0.0/8 }") {
t.Fatalf("expected ipBlock cidr:\n%s", got)
}
if !strings.Contains(got, "ip saddr != { 10.99.0.0/16 }") {
t.Fatalf("expected ipBlock except:\n%s", got)
}
}
// TestRender_AllowAllPeers — empty PeerCIDRs/PeerExcept means "any peer";
// the rule should emit an unconditional accept (modulo port).
func TestRender_AllowAllPeers(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirIngress}: {},
},
Rules: []Rule{{
PodKey: "ns/web", HostIface: "f1",
PodIPs: []net.IP{mustIP("2001:db8::1")},
Direction: DirIngress, Action: ActionAccept,
Ports: []PortMatch{{Protocol: "tcp", Port: 443}},
}},
}
got := Render(out)
if !strings.Contains(got, "tcp dport 443 accept") {
t.Fatalf("expected unconditional tcp/443 allow:\n%s", got)
}
// Should NOT have a saddr/daddr filter (empty peers).
if strings.Contains(got, "ip6 saddr {") || strings.Contains(got, "ip saddr {") {
t.Fatalf("expected no peer filter:\n%s", got)
}
}
// TestRender_Determinism — same input → byte-identical output.
func TestRender_Determinism(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirIngress}: {},
{PodKey: "ns/db", Direction: DirEgress}: {},
},
Rules: []Rule{
{PodKey: "ns/web", HostIface: "f1", PodIPs: []net.IP{mustIP("2001:db8::1")},
Direction: DirIngress, Action: ActionAccept,
PeerCIDRs: []*net.IPNet{mustNet("2001:db8::5/128"), mustNet("2001:db8::3/128")},
Ports: []PortMatch{{Protocol: "tcp", Port: 80}}},
{PodKey: "ns/db", HostIface: "f2", PodIPs: []net.IP{mustIP("2001:db8::2")},
Direction: DirEgress, Action: ActionAccept,
PeerCIDRs: []*net.IPNet{mustNet("2001:db8::aa/128")},
Ports: []PortMatch{{}}},
},
}
a := Render(out)
b := Render(out)
if a != b {
t.Fatalf("Render not deterministic:\nA=\n%s\nB=\n%s", a, b)
}
// And peers in the rule must be sorted (we deliberately gave 5 then 3).
if strings.Index(a, "2001:db8::3/128") > strings.Index(a, "2001:db8::5/128") {
t.Fatalf("peer CIDRs not sorted within rule:\n%s", a)
}
}
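// TestRender_EgressDirection: egress jumps match iifname (traffic entering
// the host from the pod's veth) and egress peer filters match daddr (the
// peer is the destination).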
func TestRender_EgressDirection(t *testing.T) {
out := Output{
Isolated: map[Isolation]struct{}{
{PodKey: "ns/web", Direction: DirEgress}: {},
},
Rules: []Rule{{
PodKey: "ns/web", HostIface: "f1",
PodIPs: []net.IP{mustIP("2001:db8::1")},
Direction: DirEgress, Action: ActionAccept,
PeerCIDRs: []*net.IPNet{mustNet("2001:db8::aa/128")},
Ports: []PortMatch{{Protocol: "tcp", Port: 53}},
}},
}
got := Render(out)
// Base-chain jump for egress matches iifname only.
if !strings.Contains(got, `iifname "f1" jump pod_`) {
t.Fatalf("missing egress base-chain jump:\n%s", got)
}
// Peer filter for egress matches the *destination* (the peer is downstream).
if !strings.Contains(got, "ip6 daddr { 2001:db8::aa/128 }") {
t.Fatalf("expected daddr peer filter for egress:\n%s", got)
}
}
func mustNet(s string) *net.IPNet {
_, n, err := net.ParseCIDR(s)
if err != nil {
panic(err)
}
return n
}
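The direction handling exercised by the ingress and egress tests reduces to one small mapping. The sketch below restates it under stated assumptions: `direction`, its two constants and `matchFields` are hypothetical stand-ins, not the package's Direction type or its helpers.

```go
package main

import "fmt"

// direction is an illustrative stand-in for the package's Direction type.
type direction string

const (
	ingress direction = "ingress"
	egress  direction = "egress"
)

// matchFields returns the interface keyword used for the base-chain jump
// and the address field used for the peer filter.
//
//   - ingress: packets leave the host towards the pod, so the veth is the
//     output interface and the peer is the source.
//   - egress: packets enter the host from the pod, so the veth is the
//     input interface and the peer is the destination.
func matchFields(family string, dir direction) (ifaceKeyword, peerField string) {
	if dir == egress {
		return "iifname", family + " daddr"
	}
	return "oifname", family + " saddr"
}

func main() {
	for _, d := range []direction{ingress, egress} {
		kw, field := matchFields("ip6", d)
		fmt.Printf("%s: jump on %s, peer filter on %s\n", d, kw, field)
	}
}
```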
+443
@@ -0,0 +1,443 @@
package netpol
import (
"fmt"
"net"
"sort"
netv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/labels"
)
// Inputs is the world-view the translator consumes. All fields are owned
// by the caller; the translator does not mutate them.
type Inputs struct {
// LocalPods are the pods scheduled on this node that have a committed
// flock allocation. Only these pods get rules — peers may live
// elsewhere.
LocalPods []Pod
// PeerPods is the cluster-wide pod set used to resolve podSelector +
// namespaceSelector peers. It is fine to include the local pods here
// too; duplicates are deduped by (namespace, name).
PeerPods []PeerPod
// Namespaces is the cluster's full Namespace set. Used for
// namespaceSelector matching.
Namespaces []Namespace
// Policies is every NetworkPolicy in the cluster. The translator
// filters down to those that select at least one local pod.
Policies []netv1.NetworkPolicy
}
// Output is the result of one translation pass.
type Output struct {
// Rules is the flat ordered list of allow rules to render. The
// renderer groups them by (PodKey, Direction) into chains.
Rules []Rule
// Isolated is the set of (PodKey, Direction) pairs whose chain must
// have a default-deny policy. A pod selected by at least one policy
// in a given direction shows up here. The renderer uses this to
// decide whether to emit a chain at all and what its base policy is.
Isolated map[Isolation]struct{}
// Pods carries the HostIface + IPs for every local pod referenced
// by the policy world, including pods that produced only isolation
// (default-deny) without any allow rules. The renderer needs this
// because such a pod has no Rule to lift the HostIface from.
Pods map[string]LocalPod // key = namespace/name
}
// Isolation is the (PodKey, Direction) key of the Isolated map.
type Isolation struct {
PodKey string
Direction Direction
}
// Translate runs the translation pass. It is a pure function: same Inputs
// always produces semantically equal Output. (Order of slices is stable
// but Rules within a chain follow the order in which selecting policies
// appear, which is itself sorted; see canonicalisePolicies.)
//
// Errors are returned only for unrecoverable malformed input; per-rule
// translation errors are logged via warn and skipped so that a single
// broken policy can't take down enforcement for a whole node. The optional
// warn callback is invoked for each skipped sub-rule with a human-readable
// message. Pass nil to silently drop.
func Translate(in Inputs, warn func(string)) (Output, error) {
if warn == nil {
warn = func(string) {}
}
out := Output{
Isolated: map[Isolation]struct{}{},
Pods: map[string]LocalPod{},
}
policies := canonicalisePolicies(in.Policies)
nsByName := indexNamespaces(in.Namespaces)
peerPodsByNS := indexPeerPods(in.PeerPods)
for _, pod := range in.LocalPods {
if len(pod.IPs) == 0 {
continue // no allocation yet; translator skips
}
key := pod.Namespace + "/" + pod.Name
// Find every policy in pod.Namespace whose podSelector matches.
// Cross-namespace policies do not select pods outside their own
// namespace; that's how the NetworkPolicy spec defines it.
for _, p := range policies {
if p.Namespace != pod.Namespace {
continue
}
sel, err := metav1.LabelSelectorAsSelector(&p.Spec.PodSelector)
if err != nil {
warn(fmt.Sprintf("policy %s/%s: invalid podSelector: %v", p.Namespace, p.Name, err))
continue
}
if !sel.Matches(labels.Set(pod.Labels)) {
continue
}
ingress, egress := policyDirections(&p)
if ingress || egress {
out.Pods[key] = LocalPod{
PodKey: key,
HostIface: pod.HostIface,
IPs: append([]net.IP(nil), pod.IPs...),
}
}
if ingress {
out.Isolated[Isolation{PodKey: key, Direction: DirIngress}] = struct{}{}
}
if egress {
out.Isolated[Isolation{PodKey: key, Direction: DirEgress}] = struct{}{}
}
// Translate ingress rules.
if ingress {
for ri, r := range p.Spec.Ingress {
rules, err := buildIngressRules(pod, r, p.Namespace, nsByName, peerPodsByNS)
if err != nil {
warn(fmt.Sprintf("policy %s/%s ingress[%d]: %v", p.Namespace, p.Name, ri, err))
continue
}
out.Rules = append(out.Rules, rules...)
}
}
// Translate egress rules.
if egress {
for ri, r := range p.Spec.Egress {
rules, err := buildEgressRules(pod, r, p.Namespace, nsByName, peerPodsByNS)
if err != nil {
warn(fmt.Sprintf("policy %s/%s egress[%d]: %v", p.Namespace, p.Name, ri, err))
continue
}
out.Rules = append(out.Rules, rules...)
}
}
}
}
return out, nil
}
// policyDirections reports which directions a NetworkPolicy isolates.
//
// Per the spec, the PolicyTypes field is the source of truth when set;
// when omitted, isolation is inferred from which rule lists are populated
// (Ingress always; Egress only if Spec.Egress is non-empty).
func policyDirections(p *netv1.NetworkPolicy) (ingress, egress bool) {
if len(p.Spec.PolicyTypes) > 0 {
for _, t := range p.Spec.PolicyTypes {
switch t {
case netv1.PolicyTypeIngress:
ingress = true
case netv1.PolicyTypeEgress:
egress = true
}
}
return
}
ingress = true
egress = len(p.Spec.Egress) > 0
return
}
// buildIngressRules expands one NetworkPolicyIngressRule into Rule(s).
// One Rule per allowed peer-set; each Rule carries the full Ports filter
// from the source rule.
func buildIngressRules(
pod Pod,
r netv1.NetworkPolicyIngressRule,
policyNS string,
nsByName map[string]Namespace,
peerPodsByNS map[string][]PeerPod,
) ([]Rule, error) {
ports, err := translatePorts(r.Ports)
if err != nil {
return nil, err
}
peers, err := translatePeers(r.From, policyNS, nsByName, peerPodsByNS)
if err != nil {
return nil, err
}
return assembleRules(pod, DirIngress, peers, ports), nil
}
// buildEgressRules is the egress mirror of buildIngressRules.
func buildEgressRules(
pod Pod,
r netv1.NetworkPolicyEgressRule,
policyNS string,
nsByName map[string]Namespace,
peerPodsByNS map[string][]PeerPod,
) ([]Rule, error) {
ports, err := translatePorts(r.Ports)
if err != nil {
return nil, err
}
peers, err := translatePeers(r.To, policyNS, nsByName, peerPodsByNS)
if err != nil {
return nil, err
}
return assembleRules(pod, DirEgress, peers, ports), nil
}
// peerSet is the resolved peer information for one rule's From / To list.
type peerSet struct {
// allowAll is true when the rule has no peers at all (an empty From /
// To list, which the spec defines as "from anywhere"). It overrides
// CIDRs and Except.
allowAll bool
// CIDRs is the union of every IP / CIDR contributed by the rule's
// peer entries (resolved Pod IPs, namespace pods, and ipBlock.cidr).
CIDRs []*net.IPNet
// Except is the union of every ipBlock.except entry across the rule.
Except []*net.IPNet
}
// translatePeers resolves a list of NetworkPolicyPeer entries into a
// peerSet. Each peer entry contributes either CIDRs (resolved from
// pod / namespace selectors, or copied from ipBlock) or Except entries.
func translatePeers(
peers []netv1.NetworkPolicyPeer,
policyNS string,
nsByName map[string]Namespace,
peerPodsByNS map[string][]PeerPod,
) (peerSet, error) {
if len(peers) == 0 {
return peerSet{allowAll: true}, nil
}
out := peerSet{}
for i, p := range peers {
switch {
case p.IPBlock != nil:
_, cidr, err := net.ParseCIDR(p.IPBlock.CIDR)
if err != nil {
return peerSet{}, fmt.Errorf("peer[%d] ipBlock.cidr %q: %w", i, p.IPBlock.CIDR, err)
}
out.CIDRs = append(out.CIDRs, cidr)
for j, ex := range p.IPBlock.Except {
_, exNet, err := net.ParseCIDR(ex)
if err != nil {
return peerSet{}, fmt.Errorf("peer[%d] ipBlock.except[%d] %q: %w", i, j, ex, err)
}
out.Except = append(out.Except, exNet)
}
case p.PodSelector != nil || p.NamespaceSelector != nil:
ips, err := resolvePodNamespacePeer(p, policyNS, nsByName, peerPodsByNS)
if err != nil {
return peerSet{}, fmt.Errorf("peer[%d]: %w", i, err)
}
out.CIDRs = append(out.CIDRs, ips...)
default:
return peerSet{}, fmt.Errorf("peer[%d] is empty (must set ipBlock, podSelector, or namespaceSelector)", i)
}
}
return out, nil
}
// resolvePodNamespacePeer walks the cluster's peer-pod set and returns
// /128 (v6) and /32 (v4) CIDRs for each pod that matches the (possibly
// combined) pod + namespace selectors.
//
// Selector semantics from the NetworkPolicy spec:
//
// - podSelector + namespaceSelector both nil → handled upstream.
// - podSelector set, namespaceSelector nil → match in the policy's
// own namespace.
// - podSelector nil, namespaceSelector set → match every pod in
// namespaces that match the namespaceSelector.
// - both set → AND: pod must be in a matching namespace AND match
// the podSelector.
//
// An empty (non-nil) selector matches everything in scope.
func resolvePodNamespacePeer(
p netv1.NetworkPolicyPeer,
policyNS string,
nsByName map[string]Namespace,
peerPodsByNS map[string][]PeerPod,
) ([]*net.IPNet, error) {
var podSel, nsSel labels.Selector
if p.PodSelector != nil {
s, err := metav1.LabelSelectorAsSelector(p.PodSelector)
if err != nil {
return nil, fmt.Errorf("podSelector: %w", err)
}
podSel = s
}
if p.NamespaceSelector != nil {
s, err := metav1.LabelSelectorAsSelector(p.NamespaceSelector)
if err != nil {
return nil, fmt.Errorf("namespaceSelector: %w", err)
}
nsSel = s
}
// Decide which namespaces are in scope.
var inScope []string
if nsSel == nil {
// Pod-only selector → just the policy's own namespace.
inScope = []string{policyNS}
} else {
for name, ns := range nsByName {
if nsSel.Matches(labels.Set(ns.Labels)) {
inScope = append(inScope, name)
}
}
}
var out []*net.IPNet
for _, ns := range inScope {
for _, pp := range peerPodsByNS[ns] {
if podSel != nil && !podSel.Matches(labels.Set(pp.Labels)) {
continue
}
for _, ip := range pp.IPs {
out = append(out, ipToHostCIDR(ip))
}
}
}
return out, nil
}
// translatePorts converts NetworkPolicyPort entries into PortMatch.
//
// A nil/empty Ports list on a NetworkPolicy rule means "all ports" by
// spec; we represent that as a single zero-valued PortMatch (any proto,
// any port) so the renderer can emit a single rule rather than a chain
// of port-equality matches.
func translatePorts(ports []netv1.NetworkPolicyPort) ([]PortMatch, error) {
if len(ports) == 0 {
return []PortMatch{{}}, nil
}
var out []PortMatch
for i, p := range ports {
var protoStr string
if p.Protocol != nil {
switch *p.Protocol {
case "TCP":
protoStr = "tcp"
case "UDP":
protoStr = "udp"
case "SCTP":
protoStr = "sctp"
default:
return nil, fmt.Errorf("port[%d]: protocol %q not supported", i, *p.Protocol)
}
} else {
// Spec default: TCP. This applies whenever a port entry omits
// protocol, with or without a Port. The empty protocol ("any
// protocol") is produced only for a rule with no port entries
// at all; see the len(ports) == 0 case above.
protoStr = "tcp"
}
var port, endPort int
if p.Port != nil {
if p.Port.Type != 0 { // intstr.Int = 0; intstr.String = 1
return nil, fmt.Errorf("port[%d]: named ports are not yet supported", i)
}
port = int(p.Port.IntVal)
}
if p.EndPort != nil {
endPort = int(*p.EndPort)
if endPort < port {
return nil, fmt.Errorf("port[%d]: endPort %d < port %d", i, endPort, port)
}
}
out = append(out, PortMatch{Protocol: protoStr, Port: port, EndPort: endPort})
}
return out, nil
}
// assembleRules builds the Rule for one (peer-set, port list) pair. It
// emits a single Rule rather than a peer × port cross-product: the peer-set
// is the expensive shared field, and the ports go inline. allowAll peers
// result in a rule with no PeerCIDRs, which the renderer treats as "any
// peer" (any source on ingress, any destination on egress).
func assembleRules(pod Pod, dir Direction, peers peerSet, ports []PortMatch) []Rule {
if !peers.allowAll && len(peers.CIDRs) == 0 {
// Selector matched no peers (e.g. podSelector for a label that
// no live pod has). Emit nothing — the rule cannot allow any
// real traffic. The pod stays in default-deny for this rule.
return nil
}
r := Rule{
PodKey: pod.Namespace + "/" + pod.Name,
HostIface: pod.HostIface,
PodIPs: append([]net.IP(nil), pod.IPs...),
Direction: dir,
Action: ActionAccept,
Ports: append([]PortMatch(nil), ports...),
}
if !peers.allowAll {
r.PeerCIDRs = append([]*net.IPNet(nil), peers.CIDRs...)
r.PeerExcept = append([]*net.IPNet(nil), peers.Except...)
}
return []Rule{r}
}
// canonicalisePolicies sorts the policy slice by (namespace, name) so the
// translator's output is deterministic regardless of informer event order.
func canonicalisePolicies(p []netv1.NetworkPolicy) []netv1.NetworkPolicy {
out := append([]netv1.NetworkPolicy(nil), p...)
sort.Slice(out, func(i, j int) bool {
if out[i].Namespace != out[j].Namespace {
return out[i].Namespace < out[j].Namespace
}
return out[i].Name < out[j].Name
})
return out
}
func indexNamespaces(nss []Namespace) map[string]Namespace {
out := make(map[string]Namespace, len(nss))
for _, ns := range nss {
out[ns.Name] = ns
}
return out
}
func indexPeerPods(pods []PeerPod) map[string][]PeerPod {
out := map[string][]PeerPod{}
for _, p := range pods {
out[p.Namespace] = append(out[p.Namespace], p)
}
// Sort each namespace's pod list by (name) so the translator's IP
// ordering is stable.
for k := range out {
sort.Slice(out[k], func(i, j int) bool { return out[k][i].Name < out[k][j].Name })
}
return out
}
// ipToHostCIDR returns ip/32 (v4) or ip/128 (v6) — the smallest CIDR
// covering exactly that one address.
func ipToHostCIDR(ip net.IP) *net.IPNet {
if v4 := ip.To4(); v4 != nil {
return &net.IPNet{IP: v4, Mask: net.CIDRMask(32, 32)}
}
return &net.IPNet{IP: ip.To16(), Mask: net.CIDRMask(128, 128)}
}
+147
@@ -0,0 +1,147 @@
package netpol
import (
"net"
"strings"
"testing"
corev1 "k8s.io/api/core/v1"
netv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/util/intstr"
)
// FuzzTranslate_AndRender stitches the Translator and Renderer together
// against synthetic NetworkPolicies built from fuzzed bytes. We are not
// trying to produce *valid* policies — the goal is to confirm that:
//
// 1. Neither stage panics on weird input.
// 2. Render output is balanced (every "{" has a matching "}").
// 3. Rendering twice is byte-stable.
// 4. The Pods set in Output is consistent with Isolated (every isolated
// PodKey has a matching entry in Pods).
//
// A no-op warn callback is passed so the translator's warning path runs
// during fuzzing; the messages themselves are discarded.
func FuzzTranslate_AndRender(f *testing.F) {
type seed struct {
policyNS, policyName string
podSelectorKey, podSelValue string
peerSelectorKey, peerSelV string
peerNS, peerName, peerIP string
port uint16
ipBlockCIDR, ipBlockExcept string
}
for _, s := range []seed{
{policyNS: "ns1", policyName: "p1", podSelectorKey: "app", podSelValue: "web", port: 80},
{policyNS: "ns1", policyName: "p1", peerSelectorKey: "app", peerSelV: "client", peerNS: "ns1", peerName: "c1", peerIP: "2001:db8::aa", port: 443},
{policyNS: "ns1", policyName: "p1", ipBlockCIDR: "10.0.0.0/8", ipBlockExcept: "10.99.0.0/16", port: 0},
{policyNS: "", policyName: ""}, // pathological
{policyNS: "ns1", policyName: "p1", podSelectorKey: "app\x00", podSelValue: "web\nnewline"},
{policyNS: "ns1", policyName: "p1", port: 65535},
{policyNS: "ns1", policyName: "p1", port: 1},
} {
f.Add(s.policyNS, s.policyName, s.podSelectorKey, s.podSelValue,
s.peerSelectorKey, s.peerSelV, s.peerNS, s.peerName, s.peerIP,
s.port, s.ipBlockCIDR, s.ipBlockExcept)
}
f.Fuzz(func(t *testing.T,
policyNS, policyName,
podSelectorKey, podSelValue,
peerSelectorKey, peerSelV,
peerNS, peerName, peerIP string,
port uint16,
ipBlockCIDR, ipBlockExcept string,
) {
// Build a synthetic policy.
policy := netv1.NetworkPolicy{
ObjectMeta: metav1.ObjectMeta{Namespace: policyNS, Name: policyName},
Spec: netv1.NetworkPolicySpec{
PolicyTypes: []netv1.PolicyType{netv1.PolicyTypeIngress},
},
}
if podSelectorKey != "" {
policy.Spec.PodSelector = metav1.LabelSelector{
MatchLabels: map[string]string{podSelectorKey: podSelValue},
}
} else {
policy.Spec.PodSelector = metav1.LabelSelector{}
}
ingress := netv1.NetworkPolicyIngressRule{}
if peerSelectorKey != "" {
ingress.From = append(ingress.From, netv1.NetworkPolicyPeer{
PodSelector: &metav1.LabelSelector{
MatchLabels: map[string]string{peerSelectorKey: peerSelV},
},
})
}
if ipBlockCIDR != "" {
peer := netv1.NetworkPolicyPeer{
IPBlock: &netv1.IPBlock{CIDR: ipBlockCIDR},
}
if ipBlockExcept != "" {
peer.IPBlock.Except = []string{ipBlockExcept}
}
ingress.From = append(ingress.From, peer)
}
if port != 0 {
tcp := corev1.ProtocolTCP
p := intstr.FromInt32(int32(port))
ingress.Ports = append(ingress.Ports, netv1.NetworkPolicyPort{
Protocol: &tcp, Port: &p,
})
}
policy.Spec.Ingress = append(policy.Spec.Ingress, ingress)
// Local pod, possibly matching the policy.
pod := Pod{
Namespace: "ns1", Name: "web",
Labels: map[string]string{podSelectorKey: podSelValue, "app": "web"},
HostIface: "flock00000001",
IPs: []net.IP{mustIP("2001:db8::1")},
}
// Peer pod, possibly matching the peer selector.
var peers []PeerPod
if peerName != "" {
peerIPParsed := net.ParseIP(peerIP)
if peerIPParsed != nil {
peers = append(peers, PeerPod{
Namespace: peerNS, Name: peerName,
Labels: map[string]string{peerSelectorKey: peerSelV},
IPs: []net.IP{peerIPParsed},
})
}
}
out, err := Translate(Inputs{
LocalPods: []Pod{pod},
PeerPods: peers,
Namespaces: []Namespace{
{Name: "ns1", Labels: map[string]string{"kubernetes.io/metadata.name": "ns1"}},
},
Policies: []netv1.NetworkPolicy{policy},
}, func(string) {})
if err != nil {
return // any error is acceptable
}
// Property: every isolated PodKey appears in Output.Pods.
for iso := range out.Isolated {
if _, ok := out.Pods[iso.PodKey]; !ok {
t.Fatalf("isolated %s has no Pods entry", iso.PodKey)
}
}
script := Render(out)
// Property: balanced braces.
if got := strings.Count(script, "{") - strings.Count(script, "}"); got != 0 {
t.Fatalf("unbalanced braces (%d):\n%s", got, script)
}
// Property: deterministic (run again, compare).
script2 := Render(out)
if script != script2 {
t.Fatalf("Render not deterministic")
}
})
}
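The brace property in the fuzz target compares counts only, so a script such as "} {" would pass. A stricter check could track nesting depth instead. The sketch below is a hypothetical helper, not part of the package, and it assumes the rendered script never contains braces inside quoted strings.

```go
package main

import "fmt"

// bracesBalanced is a hypothetical, stricter variant of the fuzz property:
// it reports whether every "}" closes an earlier "{" and none stay open.
func bracesBalanced(s string) bool {
	depth := 0
	for _, r := range s {
		switch r {
		case '{':
			depth++
		case '}':
			depth--
			if depth < 0 {
				return false // a close appeared before its open
			}
		}
	}
	return depth == 0
}

func main() {
	fmt.Println(bracesBalanced("table inet t { chain c { accept } }"))
	fmt.Println(bracesBalanced("} {"))
}
```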
+452
@@ -0,0 +1,452 @@
package netpol
import (
"net"
"testing"
corev1 "k8s.io/api/core/v1"
netv1 "k8s.io/api/networking/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/util/intstr"
)
func mustIP(s string) net.IP {
ip := net.ParseIP(s)
if ip == nil {
panic("bad IP: " + s)
}
return ip
}
func newPolicy(ns, name string, mods ...func(*netv1.NetworkPolicy)) netv1.NetworkPolicy {
p := netv1.NetworkPolicy{
ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: name},
Spec: netv1.NetworkPolicySpec{},
}
for _, m := range mods {
m(&p)
}
return p
}
func tcpPort(port int) netv1.NetworkPolicyPort {
proto := corev1.ProtocolTCP
p := intstr.FromInt32(int32(port))
return netv1.NetworkPolicyPort{Protocol: &proto, Port: &p}
}
// emptySelector returns a selector with no constraints (`{}`), which
// matches every pod.
func emptySelector() *metav1.LabelSelector {
return &metav1.LabelSelector{}
}
func selectorMatching(kv map[string]string) *metav1.LabelSelector {
return &metav1.LabelSelector{MatchLabels: kv}
}
// isolationFor reports whether the given pod is isolated for ingress (in)
// and for egress (eg) according to out.Isolated.
func isolationFor(out Output, podKey string) (in, eg bool) {
if _, ok := out.Isolated[Isolation{PodKey: podKey, Direction: DirIngress}]; ok {
in = true
}
if _, ok := out.Isolated[Isolation{PodKey: podKey, Direction: DirEgress}]; ok {
eg = true
}
return
}
// TestTranslate_NoPolicies — pod with no matching policy is unrestricted.
func TestTranslate_NoPolicies(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "p1",
Labels: map[string]string{"app": "web"},
HostIface: "flock00000001",
IPs: []net.IP{mustIP("2001:db8::1")},
}
out, err := Translate(Inputs{LocalPods: []Pod{pod}}, nil)
if err != nil {
t.Fatal(err)
}
if len(out.Rules) != 0 {
t.Fatalf("expected no rules, got %d", len(out.Rules))
}
in, eg := isolationFor(out, "ns1/p1")
if in || eg {
t.Fatalf("pod should not be isolated: in=%v eg=%v", in, eg)
}
}
// TestTranslate_DefaultDenyIngress checks that a policy with empty Ingress
// and PolicyTypes = [Ingress] selects the pod and isolates it for ingress
// only; no allow rules are emitted.
func TestTranslate_DefaultDenyIngress(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web",
Labels: map[string]string{"app": "web"},
HostIface: "flock00000001",
IPs: []net.IP{mustIP("2001:db8::1")},
}
policy := newPolicy("ns1", "default-deny", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
})
out, err := Translate(Inputs{
LocalPods: []Pod{pod},
Policies: []netv1.NetworkPolicy{policy},
}, nil)
if err != nil {
t.Fatal(err)
}
if len(out.Rules) != 0 {
t.Fatalf("expected no rules from a deny-all, got %d", len(out.Rules))
}
in, eg := isolationFor(out, "ns1/web")
if !in {
t.Fatalf("ingress should be isolated")
}
if eg {
t.Fatalf("egress should NOT be isolated (policy only set ingress)")
}
}
// TestTranslate_DefaultDenyEgress_InferredFromEgressList — when
// PolicyTypes is omitted but Spec.Egress is non-empty, egress should
// also be isolated by inference.
func TestTranslate_DefaultDenyEgress_InferredFromEgressList(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web",
Labels: map[string]string{"app": "web"},
HostIface: "f1", IPs: []net.IP{mustIP("2001:db8::1")},
}
policy := newPolicy("ns1", "egress-rule", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.Egress = []netv1.NetworkPolicyEgressRule{{}}
})
out, _ := Translate(Inputs{LocalPods: []Pod{pod}, Policies: []netv1.NetworkPolicy{policy}}, nil)
in, eg := isolationFor(out, "ns1/web")
if !in || !eg {
t.Fatalf("both directions should be isolated: in=%v eg=%v", in, eg)
}
}
// TestTranslate_PodSelectorPeer checks the case where the peer is a single
// pod in the same namespace, identified by label.
func TestTranslate_PodSelectorPeer(t *testing.T) {
web := Pod{
Namespace: "ns1", Name: "web",
Labels: map[string]string{"app": "web"},
HostIface: "f1", IPs: []net.IP{mustIP("2001:db8::1")},
}
clientIP := mustIP("2001:db8::2")
peer := PeerPod{
Namespace: "ns1", Name: "client",
Labels: map[string]string{"app": "client"},
IPs: []net.IP{clientIP},
}
policy := newPolicy("ns1", "allow-from-client", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *selectorMatching(map[string]string{"app": "web"})
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
From: []netv1.NetworkPolicyPeer{{
PodSelector: selectorMatching(map[string]string{"app": "client"}),
}},
Ports: []netv1.NetworkPolicyPort{tcpPort(80)},
}}
})
out, err := Translate(Inputs{
LocalPods: []Pod{web},
PeerPods: []PeerPod{peer},
Policies: []netv1.NetworkPolicy{policy},
}, nil)
if err != nil {
t.Fatal(err)
}
if len(out.Rules) != 1 {
t.Fatalf("expected 1 rule, got %d: %+v", len(out.Rules), out.Rules)
}
r := out.Rules[0]
if r.PodKey != "ns1/web" || r.Direction != DirIngress {
t.Fatalf("rule has wrong subject: %+v", r)
}
if len(r.PeerCIDRs) != 1 || !r.PeerCIDRs[0].IP.Equal(clientIP) {
t.Fatalf("peer CIDR wrong: %+v", r.PeerCIDRs)
}
if len(r.Ports) != 1 || r.Ports[0].Protocol != "tcp" || r.Ports[0].Port != 80 {
t.Fatalf("port wrong: %+v", r.Ports)
}
}
// TestTranslate_NamespaceSelector — peer is "every pod in any namespace
// with label tier=trusted".
func TestTranslate_NamespaceSelector(t *testing.T) {
web := Pod{
Namespace: "ns1", Name: "web",
Labels: map[string]string{"app": "web"},
HostIface: "f1", IPs: []net.IP{mustIP("2001:db8::1")},
}
out, err := Translate(Inputs{
LocalPods: []Pod{web},
Namespaces: []Namespace{
{Name: "ns1", Labels: map[string]string{}},
{Name: "trusted-1", Labels: map[string]string{"tier": "trusted"}},
{Name: "trusted-2", Labels: map[string]string{"tier": "trusted"}},
{Name: "untrusted", Labels: map[string]string{"tier": "wild"}},
},
PeerPods: []PeerPod{
{Namespace: "trusted-1", Name: "a", IPs: []net.IP{mustIP("2001:db8::a")}},
{Namespace: "trusted-2", Name: "b", IPs: []net.IP{mustIP("2001:db8::b")}},
{Namespace: "untrusted", Name: "x", IPs: []net.IP{mustIP("2001:db8::ff")}},
},
Policies: []netv1.NetworkPolicy{newPolicy("ns1", "allow-trusted", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
From: []netv1.NetworkPolicyPeer{{
NamespaceSelector: selectorMatching(map[string]string{"tier": "trusted"}),
}},
}}
})},
}, nil)
if err != nil {
t.Fatal(err)
}
if len(out.Rules) != 1 {
t.Fatalf("expected 1 rule, got %d", len(out.Rules))
}
got := map[string]bool{}
for _, c := range out.Rules[0].PeerCIDRs {
got[c.IP.String()] = true
}
if !got["2001:db8::a"] || !got["2001:db8::b"] {
t.Fatalf("trusted pod IPs missing: %v", got)
}
if got["2001:db8::ff"] {
t.Fatalf("untrusted pod IP leaked into rule")
}
}
// TestTranslate_IPBlockWithExcept — ipBlock with an except range.
func TestTranslate_IPBlockWithExcept(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web", HostIface: "f1",
Labels: map[string]string{"app": "web"},
IPs: []net.IP{mustIP("10.0.0.1")},
}
policy := newPolicy("ns1", "ipblock", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
From: []netv1.NetworkPolicyPeer{{
IPBlock: &netv1.IPBlock{
CIDR: "10.0.0.0/8",
Except: []string{"10.99.0.0/16", "10.42.42.0/24"},
},
}},
}}
})
out, err := Translate(Inputs{
LocalPods: []Pod{pod},
Policies: []netv1.NetworkPolicy{policy},
}, nil)
if err != nil {
t.Fatal(err)
}
if len(out.Rules) != 1 {
t.Fatalf("expected 1 rule, got %d", len(out.Rules))
}
r := out.Rules[0]
if len(r.PeerCIDRs) != 1 || r.PeerCIDRs[0].String() != "10.0.0.0/8" {
t.Fatalf("peer CIDR wrong: %v", r.PeerCIDRs)
}
if len(r.PeerExcept) != 2 {
t.Fatalf("expected 2 except, got %d", len(r.PeerExcept))
}
}
// TestTranslate_AllowAllPeers — empty From list means "from anywhere".
func TestTranslate_AllowAllPeers(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web", HostIface: "f1",
Labels: map[string]string{"app": "web"},
IPs: []net.IP{mustIP("2001:db8::1")},
}
policy := newPolicy("ns1", "allow-all-on-port", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
Ports: []netv1.NetworkPolicyPort{tcpPort(443)},
}}
})
out, _ := Translate(Inputs{LocalPods: []Pod{pod}, Policies: []netv1.NetworkPolicy{policy}}, nil)
if len(out.Rules) != 1 {
t.Fatalf("expected 1 rule, got %d", len(out.Rules))
}
r := out.Rules[0]
if len(r.PeerCIDRs) != 0 || len(r.PeerExcept) != 0 {
t.Fatalf("expected allow-all peers, got CIDRs=%v Except=%v", r.PeerCIDRs, r.PeerExcept)
}
}
// TestTranslate_AllowAllPorts — empty Ports list means "all ports".
func TestTranslate_AllowAllPorts(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web", HostIface: "f1",
Labels: map[string]string{"app": "web"},
IPs: []net.IP{mustIP("2001:db8::1")},
}
policy := newPolicy("ns1", "allow-from-all", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
From: []netv1.NetworkPolicyPeer{{
PodSelector: emptySelector(),
}},
}}
})
peer := PeerPod{
Namespace: "ns1", Name: "x",
IPs: []net.IP{mustIP("2001:db8::aa")},
}
out, _ := Translate(Inputs{
LocalPods: []Pod{pod}, PeerPods: []PeerPod{peer},
Policies: []netv1.NetworkPolicy{policy},
}, nil)
if len(out.Rules) != 1 {
t.Fatalf("expected 1 rule, got %d", len(out.Rules))
}
r := out.Rules[0]
if len(r.Ports) != 1 || r.Ports[0] != (PortMatch{}) {
t.Fatalf("expected single any-port match, got %+v", r.Ports)
}
}
// TestTranslate_PortRange — endPort field.
func TestTranslate_PortRange(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web", HostIface: "f1",
Labels: map[string]string{"app": "web"},
IPs: []net.IP{mustIP("2001:db8::1")},
}
policy := newPolicy("ns1", "range", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
proto := corev1.ProtocolTCP
port := intstr.FromInt32(8000)
end := int32(8999)
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
Ports: []netv1.NetworkPolicyPort{{Protocol: &proto, Port: &port, EndPort: &end}},
}}
})
out, _ := Translate(Inputs{LocalPods: []Pod{pod}, Policies: []netv1.NetworkPolicy{policy}}, nil)
if len(out.Rules) != 1 || out.Rules[0].Ports[0].Port != 8000 || out.Rules[0].Ports[0].EndPort != 8999 {
t.Fatalf("range not preserved: %+v", out.Rules)
}
}
// TestTranslate_NamedPortRejected — named ports aren't supported yet;
// translator must skip the rule and warn.
func TestTranslate_NamedPortRejected(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web", HostIface: "f1",
Labels: map[string]string{"app": "web"},
IPs: []net.IP{mustIP("2001:db8::1")},
}
proto := corev1.ProtocolTCP
named := intstr.FromString("http")
policy := newPolicy("ns1", "named", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
Ports: []netv1.NetworkPolicyPort{{Protocol: &proto, Port: &named}},
}}
})
var warns []string
out, _ := Translate(Inputs{LocalPods: []Pod{pod}, Policies: []netv1.NetworkPolicy{policy}}, func(s string) {
warns = append(warns, s)
})
if len(out.Rules) != 0 {
t.Fatalf("expected named-port rule to be skipped")
}
if len(warns) == 0 {
t.Fatalf("expected a warning about named ports")
}
// The pod should still be isolated since the policy selected it.
in, _ := isolationFor(out, "ns1/web")
if !in {
t.Fatalf("pod should be isolated even when its rule is dropped")
}
}
// TestTranslate_PolicyScopedToNamespace checks that a policy in nsA does
// NOT select pods in nsB even if their labels match.
func TestTranslate_PolicyScopedToNamespace(t *testing.T) {
a := Pod{Namespace: "nsA", Name: "p", HostIface: "f1",
Labels: map[string]string{"app": "web"}, IPs: []net.IP{mustIP("2001:db8::1")}}
b := Pod{Namespace: "nsB", Name: "p", HostIface: "f2",
Labels: map[string]string{"app": "web"}, IPs: []net.IP{mustIP("2001:db8::2")}}
policy := newPolicy("nsA", "deny", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *selectorMatching(map[string]string{"app": "web"})
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
})
out, _ := Translate(Inputs{LocalPods: []Pod{a, b}, Policies: []netv1.NetworkPolicy{policy}}, nil)
inA, _ := isolationFor(out, "nsA/p")
inB, _ := isolationFor(out, "nsB/p")
if !inA {
t.Fatalf("nsA/p should be isolated")
}
if inB {
t.Fatalf("nsB/p must NOT be isolated by a policy in nsA")
}
}
// TestTranslate_PodWithoutAllocationSkipped — pod with no IPs is silently
// skipped (its rule could not match any traffic anyway).
func TestTranslate_PodWithoutAllocationSkipped(t *testing.T) {
pod := Pod{Namespace: "ns1", Name: "p", HostIface: "f1",
Labels: map[string]string{"app": "web"}}
policy := newPolicy("ns1", "deny", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
})
out, _ := Translate(Inputs{LocalPods: []Pod{pod}, Policies: []netv1.NetworkPolicy{policy}}, nil)
in, _ := isolationFor(out, "ns1/p")
if in {
t.Fatalf("pod without IP should not appear in output")
}
}
// TestTranslate_Determinism checks that translating the same Inputs twice
// yields the same number of rules, with matching PodKeys and peer-CIDR
// counts at each index. It does not compare CIDR values or Isolated.
func TestTranslate_Determinism(t *testing.T) {
pod := Pod{
Namespace: "ns1", Name: "web", HostIface: "f1",
Labels: map[string]string{"app": "web"},
IPs: []net.IP{mustIP("2001:db8::1")},
}
peers := []PeerPod{
{Namespace: "ns1", Name: "z", Labels: map[string]string{"app": "client"}, IPs: []net.IP{mustIP("2001:db8::2")}},
{Namespace: "ns1", Name: "a", Labels: map[string]string{"app": "client"}, IPs: []net.IP{mustIP("2001:db8::3")}},
}
policies := []netv1.NetworkPolicy{
newPolicy("ns1", "z-second", func(p *netv1.NetworkPolicy) {
p.Spec.PodSelector = *emptySelector()
p.Spec.PolicyTypes = []netv1.PolicyType{netv1.PolicyTypeIngress}
p.Spec.Ingress = []netv1.NetworkPolicyIngressRule{{
From: []netv1.NetworkPolicyPeer{{
PodSelector: selectorMatching(map[string]string{"app": "client"}),
}},
}}
}),
}
in := Inputs{LocalPods: []Pod{pod}, PeerPods: peers, Policies: policies}
a, _ := Translate(in, nil)
b, _ := Translate(in, nil)
if len(a.Rules) != len(b.Rules) {
t.Fatalf("rule count differs: %d vs %d", len(a.Rules), len(b.Rules))
}
for i := range a.Rules {
if a.Rules[i].PodKey != b.Rules[i].PodKey || len(a.Rules[i].PeerCIDRs) != len(b.Rules[i].PeerCIDRs) {
t.Fatalf("rule[%d] differs", i)
}
}
}
+147
@@ -0,0 +1,147 @@
package netpol
import "net"
// Direction is the NetworkPolicy direction, named from the *pod's*
// perspective (matching the NetworkPolicy API). "Ingress" is traffic
// arriving at the pod; "Egress" is traffic the pod initiates.
//
// Note that on the host this maps the opposite way at the veth: an
// Ingress rule matches packets whose oifname is the pod's host-side veth
// (the kernel is forwarding into the pod), and an Egress rule matches
// packets whose iifname is the pod's host-side veth (the kernel just
// received from the pod).
type Direction int
const (
DirIngress Direction = iota
DirEgress
)
// String returns the lower-case wire form ("ingress" / "egress").
func (d Direction) String() string {
if d == DirEgress {
return "egress"
}
return "ingress"
}
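The direction-to-veth mapping described above can be sketched as a tiny helper. ifaceMatch and its output format are assumptions for illustration; the actual renderer may emit the match differently.

```go
package main

import "fmt"

type direction int

const (
	dirIngress direction = iota
	dirEgress
)

// ifaceMatch is a hypothetical helper returning the nft interface match for
// a pod's host-side veth. Ingress (traffic forwarded into the pod) matches
// on oifname; egress (traffic received from the pod) matches on iifname.
func ifaceMatch(d direction, hostIface string) string {
	if d == dirEgress {
		return fmt.Sprintf("iifname %q", hostIface)
	}
	return fmt.Sprintf("oifname %q", hostIface)
}

func main() {
	fmt.Println(ifaceMatch(dirIngress, "flock1a2b3c4d"))
	fmt.Println(ifaceMatch(dirEgress, "flock1a2b3c4d"))
}
```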
// Pod is the local-pod information the translator needs. The reconciler
// populates this from its store of CNI allocations — every pod with a
// committed allocation on this node appears here.
type Pod struct {
// Namespace + Name uniquely identify the pod.
Namespace string
Name string
// Labels are the pod labels. NetworkPolicy.Spec.PodSelector matches
// against these.
Labels map[string]string
// HostIface is the host-side veth name (e.g. "flock1a2b3c4d"). All
// rules guarding this pod hook off iifname/oifname == HostIface.
HostIface string
// IPs are the pod's eth0 addresses (IPv6 and/or IPv4). Empty means
// the agent has no allocation for this pod yet — translator should
// skip such pods.
IPs []net.IP
}
// PeerPod is a (potentially remote) pod whose IPs may be referenced as a
// NetworkPolicy peer. The translator resolves podSelector +
// namespaceSelector peers to their IPs by walking the cluster-wide
// peer-pod set.
type PeerPod struct {
Namespace string
Name string
Labels map[string]string
IPs []net.IP
}
// Namespace carries just enough metadata for namespaceSelector matching.
type Namespace struct {
Name string
Labels map[string]string
}
// LocalPod is the renderer-visible subset of a local pod — just enough
// to anchor a base-chain jump. Carried in Output so the renderer can
// emit chains for default-deny pods that have no explicit allow rules.
type LocalPod struct {
PodKey string
HostIface string
IPs []net.IP
}
// PortMatch is one allowed (protocol, port) tuple. EndPort is inclusive;
// when zero the rule matches the single Port.
type PortMatch struct {
Protocol string // "tcp", "udp", "sctp"; empty means "any of the three"
Port int // 1..65535. Zero means "any port".
EndPort int // 0 if not a range; otherwise inclusive range end.
}
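The PortMatch field comments imply the following matching semantics. This is a sketch only: portMatch and matches are hypothetical local names, and enforcement in the real code happens in the rendered nft script rather than in Go.

```go
package main

import "fmt"

// portMatch is a local copy of the PortMatch shape for illustration.
type portMatch struct {
	Protocol string
	Port     int
	EndPort  int
}

// matches is a hypothetical evaluator for the documented semantics:
// empty Protocol means any protocol, zero Port means any port, and a
// non-zero EndPort makes [Port, EndPort] an inclusive range.
func matches(pm portMatch, proto string, port int) bool {
	if pm.Protocol != "" && pm.Protocol != proto {
		return false
	}
	if pm.Port == 0 {
		return true
	}
	if pm.EndPort == 0 {
		return port == pm.Port
	}
	return port >= pm.Port && port <= pm.EndPort
}

func main() {
	r := portMatch{Protocol: "tcp", Port: 8000, EndPort: 8999}
	fmt.Println(matches(r, "tcp", 8999), matches(r, "tcp", 9000))
}
```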
// Rule is the canonical intermediate representation between the translator
// and the renderer. One Rule is one accept-line in the rendered nft
// script. A pod's chain is the ordered concatenation of every Rule whose
// PodKey matches; any packet that falls off the end is denied by the
// trailing default-deny verdict (the chain has policy drop).
//
// PeerCIDRs are OR'd together, then PeerExcept is subtracted. Empty
// PeerCIDRs + empty PeerExcept means "any source/destination".
type Rule struct {
// PodKey is namespace/name of the pod this rule guards. Used by the
// renderer to slot the rule into the correct chain.
PodKey string
// HostIface is the pod's host-side veth name; the renderer uses it
// to anchor the base-chain jump.
HostIface string
// PodIPs are the pod's eth0 addresses. The base chain matches on
// (oifname == HostIface AND daddr ∈ PodIPs) for ingress, and
// (iifname == HostIface AND saddr ∈ PodIPs) for egress, so packets
// that aren't destined to / from the actual pod address don't get
// counted as policy-protected.
PodIPs []net.IP
// Direction is Ingress or Egress, named from the pod's perspective.
Direction Direction
// Action is "accept" for explicit allows; default-deny is implicit
// in the chain's policy drop and is not represented as a Rule.
// (Reserved for future deny-list semantics like AdminNetworkPolicy.)
Action Action
// PeerCIDRs are the addresses of allowed peers. OR'd together.
// Empty means "any peer".
PeerCIDRs []*net.IPNet
// PeerExcept narrows PeerCIDRs by subtracting these ranges. Only
// meaningful with non-empty PeerCIDRs (it comes from
// ipBlock.except, which requires ipBlock.cidr).
PeerExcept []*net.IPNet
// Ports is the set of allowed (protocol, port) tuples. Empty means
// "any port / any protocol".
Ports []PortMatch
}
// Action is the verdict emitted by a Rule.
type Action int
const (
// ActionAccept lets the packet through. The default-deny is implicit
// in the chain policy.
ActionAccept Action = iota
// ActionDrop is reserved for future use (AdminNetworkPolicy /
// BaselineAdminNetworkPolicy explicit denies). Not produced by the
// v1 translator.
ActionDrop
)
// String returns the nft-syntax verdict.
func (a Action) String() string {
if a == ActionDrop {
return "drop"
}
return "accept"
}
+56
@@ -0,0 +1,56 @@
package agent
import (
"net"
"code.fritzlab.net/fritzlab/flock/pkg/agent/netpol"
)
// collectLocalPods bridges the agent's allocation store + pod informer
// cache into the netpol-package input shape. It returns one Pod per
// committed allocation that has a matching pod in the informer cache;
// allocations whose pod was just deleted (DEL race) are skipped.
//
// Called on every netpol reconcile pass, so it must be cheap. The work
// here is O(allocations) and reads from in-memory maps only.
func collectLocalPods(store *Store, pods *PodCache) []netpol.Pod {
allocs := store.Snapshot()
out := make([]netpol.Pod, 0, len(allocs))
for _, a := range allocs {
if a.State != StateCommitted {
continue
}
pod, ok := pods.Get(a.Namespace, a.PodName)
if !ok {
// Pod evicted but DEL hasn't fired yet; nothing to enforce.
continue
}
ips := allocationIPs(a)
if len(ips) == 0 {
continue
}
out = append(out, netpol.Pod{
Namespace: a.Namespace,
Name: a.PodName,
Labels: pod.Labels,
HostIface: HostIfaceName(a.ContainerID),
IPs: ips,
})
}
return out
}
func allocationIPs(a Allocation) []net.IP {
var out []net.IP
if a.IP6 != "" {
if ip := net.ParseIP(a.IP6); ip != nil {
out = append(out, ip)
}
}
if a.IP4 != "" {
if ip := net.ParseIP(a.IP4); ip != nil {
out = append(out, ip)
}
}
return out
}
+38 -23
@@ -2,25 +2,37 @@ package agent
  import (
  "context"
- "encoding/json"
  "fmt"
  "log/slog"
  "time"
- corev1 "k8s.io/api/core/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/apimachinery/pkg/types"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/rest"
  )
+ // fieldManager identifies flock-agent in apiserver field-manager bookkeeping.
+ // Server-Side Apply only takes ownership of the fields we send, so other
+ // managers (kubelet, kcm) keep their conditions untouched between our writes.
+ const nodeStatusFieldManager = "flock-agent"
  // keepNetworkAvailable maintains a NetworkUnavailable=False condition on
  // the node's status. Calico-node sets this False while it owns CNI; on
  // shutdown it sets it to True with reason CalicoIsDown, which adds the
  // node.kubernetes.io/network-unavailable taint and blocks new scheduling.
- // Once flock-agent is in charge, we own the condition.
+ // Once flock-agent is in charge, we own that single condition.
  //
- // Re-applies every minute — heartbeat-style — so a stale condition from a
+ // Uses Server-Side Apply against the status subresource. NodeStatus.Conditions
+ // is a listType=map keyed by `type`, so SSA merges by type — our partial body
+ // declares ownership of just the NetworkUnavailable entry and leaves the
+ // kubelet-managed conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
+ // alone. A prior implementation used JSON merge-patch with a one-element
+ // conditions array, which the apiserver REPLACES (merge-patch on arrays is
+ // whole-array semantics) — that race-stripped the kubelet conditions every
+ // 60s and produced ~5s flickers in `kubectl get nodes`.
+ //
+ // Re-applies every minute (heartbeat-style) so a stale condition from a
  // previous CNI is overwritten without an explicit transition.
  func keepNetworkAvailable(ctx context.Context, cfg *rest.Config, node string, logger *slog.Logger) {
  cs, err := kubernetes.NewForConfig(cfg)
@@ -29,23 +41,29 @@ func keepNetworkAvailable(ctx context.Context, cfg *rest.Config, node string, lo
  return
  }
  apply := func() {
- now := metav1.Now()
- patch := map[string]interface{}{
- "status": map[string]interface{}{
- "conditions": []corev1.NodeCondition{{
- Type: corev1.NodeNetworkUnavailable,
- Status: corev1.ConditionFalse,
- Reason: "FlockReady",
- Message: "flock-agent owns CNI on this node",
- LastHeartbeatTime: now,
- LastTransitionTime: now,
- }},
- },
- }
- body, _ := json.Marshal(patch)
- _, err := cs.CoreV1().Nodes().Patch(ctx, node, types.MergePatchType, body, metav1.PatchOptions{}, "status")
+ now := metav1.Now().UTC().Format(time.RFC3339)
+ // Hand-build the SSA body so we only declare the fields we own.
+ // Force=true lets us reclaim the condition if a previous CNI's
+ // finalizer/cleanup left it owned by a different manager.
+ body := []byte(fmt.Sprintf(`{
+ "apiVersion": "v1",
+ "kind": "Node",
+ "metadata": {"name": %q},
+ "status": {"conditions": [{
+ "type": "NetworkUnavailable",
+ "status": "False",
+ "reason": "FlockReady",
+ "message": "flock-agent owns CNI on this node",
+ "lastHeartbeatTime": %q,
+ "lastTransitionTime": %q
+ }]}
+ }`, node, now, now))
+ force := true
+ _, err := cs.CoreV1().Nodes().Patch(ctx, node, types.ApplyPatchType, body,
+ metav1.PatchOptions{FieldManager: nodeStatusFieldManager, Force: &force},
+ "status")
  if err != nil {
- logger.Warn("network-condition: patch failed", "err", err)
+ logger.Warn("network-condition: ssa apply failed", "err", err)
  return
  }
  }
@@ -61,6 +79,3 @@ func keepNetworkAvailable(ctx context.Context, cfg *rest.Config, node string, lo
  }
  }
  }
- // silence unused-import warnings on non-Linux builds where this is unused.
- var _ = fmt.Sprintf
+15 -2
@@ -28,6 +28,16 @@ func podReady(pod *corev1.Pod) bool {
  return false
  }
+ // podAnycastEligible reports whether a pod should contribute its IP as a
+ // nexthop for its anycast IPs. A pod is eligible when it is Ready AND not
+ // being deleted. Once the apiserver sets DeletionTimestamp, kubelet has
+ // started teardown — kube-proxy will keep routing for terminationGracePeriod
+ // but the pod is on the way out; we should withdraw the nexthop immediately
+ // so BGP shifts traffic to a sibling before the pod actually exits.
+ func podAnycastEligible(pod *corev1.Pod) bool {
+ return pod.DeletionTimestamp == nil && podReady(pod)
+ }
  // PodCache exposes a Get(ns, name) lookup against a node-scoped Pod
  // informer. ADD/DEL handlers consult it to read annotations + labels for
  // IPAM and (later) NetworkPolicy. Callers can subscribe to Ready
@@ -58,7 +68,7 @@ func StartPodInformer(ctx context.Context, cfg *rest.Config, node string, logger
  _, _ = inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
  AddFunc: func(obj interface{}) {
- if pod, ok := obj.(*corev1.Pod); ok && podReady(pod) {
+ if pod, ok := obj.(*corev1.Pod); ok && podAnycastEligible(pod) {
  pc.fireReady()
  }
  },
@@ -68,7 +78,10 @@ func StartPodInformer(ctx context.Context, cfg *rest.Config, node string, logger
  if oldP == nil || newP == nil {
  return
  }
- if podReady(oldP) != podReady(newP) {
+ // Fire on Ready transition OR DeletionTimestamp transition.
+ // The latter catches "pod was Ready, now being deleted" so the
+ // reconciler withdraws the nexthop before the pod actually exits.
+ if podAnycastEligible(oldP) != podAnycastEligible(newP) {
  pc.fireReady()
  }
  },
+46
@@ -0,0 +1,46 @@
package agent
import (
"testing"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
func readyPod(deletionTimestamp *metav1.Time) *corev1.Pod {
return &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: deletionTimestamp},
Status: corev1.PodStatus{
Conditions: []corev1.PodCondition{
{Type: corev1.PodReady, Status: corev1.ConditionTrue},
},
},
}
}
func TestPodAnycastEligible(t *testing.T) {
now := metav1.Now()
cases := []struct {
name string
pod *corev1.Pod
want bool
}{
{"ready, not deleting", readyPod(nil), true},
{"ready, but deleting", readyPod(&now), false},
{
"not ready, not deleting",
&corev1.Pod{Status: corev1.PodStatus{Conditions: []corev1.PodCondition{
{Type: corev1.PodReady, Status: corev1.ConditionFalse},
}}},
false,
},
{"no conditions, not deleting", &corev1.Pod{}, false},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
if got := podAnycastEligible(c.pod); got != c.want {
t.Fatalf("got %v want %v", got, c.want)
}
})
}
}
+47 -1
@@ -6,9 +6,36 @@ import (
"context"
"fmt"
"net"
+"os"
"time"
+"code.fritzlab.net/fritzlab/flock/pkg/agent/netpol"
)
// hostMultipathHashSysctls is the set of node-level sysctls flock-agent
// best-effort writes at startup. Default policy 0 hashes only on
// (saddr, daddr); policy 1 adds L4 (sport, dport, proto), giving real
// per-connection ECMP across multipath nexthops — required for sensible
// distribution across multiple anycast pods on the same node.
var hostMultipathHashSysctls = map[string]string{
"/proc/sys/net/ipv4/fib_multipath_hash_policy": "1",
"/proc/sys/net/ipv6/fib_multipath_hash_policy": "1",
}
// applyHostSysctls writes the sysctls in m, logging but not failing on
// errors. flock-agent is privileged so this works in the production
// DaemonSet; in environments where it doesn't, single-pod-per-node
// anycast still works (this only affects the multi-pod-per-node case).
func applyHostSysctls(s *Server) {
for path, value := range hostMultipathHashSysctls {
if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
s.Logger.Warn("set host sysctl", "path", path, "value", value, "err", err)
continue
}
s.Logger.Info("host sysctl set", "path", path, "value", value)
}
}
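Why the L4 hash policy matters for several anycast pods on one node can be illustrated with a toy nexthop picker. This is a sketch under stated assumptions: the `flow` type and `pick` function are hypothetical, and FNV-1a is used only for illustration; the kernel's actual multipath hash is different.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flow is a hypothetical 5-tuple used only for this illustration.
type flow struct {
	Src, Dst     string
	SPort, DPort uint16
	Proto        uint8
}

// pick chooses one of n nexthops. With l4=false only the addresses are
// hashed (policy 0 style); with l4=true ports and protocol are mixed in
// (policy 1 style).
func pick(f flow, n int, l4 bool) int {
	h := fnv.New32a()
	h.Write([]byte(f.Src + "|" + f.Dst))
	if l4 {
		fmt.Fprintf(h, "|%d|%d|%d", f.SPort, f.DPort, f.Proto)
	}
	return int(h.Sum32() % uint32(n))
}

func main() {
	l3, l4 := map[int]int{}, map[int]int{}
	for sport := uint16(40000); sport < 40064; sport++ {
		f := flow{Src: "2001:db8::10", Dst: "2001:db8:a::1", SPort: sport, DPort: 443, Proto: 6}
		l3[pick(f, 2, false)]++
		l4[pick(f, 2, true)]++
	}
	fmt.Println("nexthops used with address-only hash:", len(l3))
	fmt.Println("nexthops used with L4 hash:", len(l4))
}
```

One client talking to one anycast address always lands on the same nexthop under the address-only hash, no matter how many connections it opens; including the ports lets separate connections spread across the local pods.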
// configureRuntime wires Pod informer, IPAM, netlink, and BIRD on a real
// Linux node. Steps:
//
@@ -21,6 +48,8 @@ import (
// 5. Build PodHandler and SetHandlers(add, del, check).
// 6. Install BIRD blackhole summary routes + render initial config.
func (s *Server) configureRuntime(ctx context.Context) error {
+applyHostSysctls(s)
if err := s.firstAvailableNodeConfig(ctx, 60*time.Second); err != nil {
return err
}
@@ -103,15 +132,32 @@ func (s *Server) configureRuntime(ctx context.Context) error {
}
}()
+// NetworkPolicy enforcement.
+world := netpol.NewWorld(s.Logger)
+if err := world.Start(ctx, s.restCfg); err != nil {
+return fmt.Errorf("netpol informers: %w", err)
+}
+npApplier := &netpol.Applier{}
+npReconciler := netpol.NewReconciler(world, func() []netpol.Pod {
+return collectLocalPods(s.Store, pods)
+}, npApplier, s.Logger)
+go npReconciler.Run(ctx)
handler := &PodHandler{
Node: s.Node,
Store: s.Store,
IPAM: ipam,
Pods: pods,
NodeConfig: s.NodeConfig,
+Logger: s.Logger,
SetupFunc: Setup,
TeardownFunc: Teardown,
-AfterCommit: anycast.Trigger,
+AfterCommit: func() {
+anycast.Trigger()
+// Re-evaluate policy on every CNI ADD/DEL so a brand-new
+// pod's chain lands before its first packet egresses.
+npReconciler.Trigger()
+},
}
s.RPC.SetHandlers(handler.Add, handler.Del, handler.Check)
s.Logger.Info("runtime ready",
+4 -3
@@ -1,6 +1,6 @@
-// Package agent owns the in-process flock-agent runtime: IPAM, netns, state,
-// anycast, and NetworkPolicy. This file implements the durable per-node
-// allocation file at /var/lib/flock/allocations.json.
+// This file implements the durable per-node allocation file at
+// /var/lib/flock/allocations.json. The package-level doc lives in doc.go.
package agent
import (
@@ -33,6 +33,7 @@ type Allocation struct {
IP6 string `json:"ip6,omitempty"`
IP4 string `json:"ip4,omitempty"`
Anycast []string `json:"anycast,omitempty"`
+Addresses []string `json:"addresses,omitempty"`
State AllocationState `json:"state"`
AllocatedAt time.Time `json:"allocated_at"`
}
+58 -2
@@ -1,3 +1,8 @@
// Package v1alpha1 contains the operator-facing API types for flock.
//
// Stability: alpha. The shape of these types may change in incompatible ways
// between minor releases. CRDs are versioned and the agent reads only its
// pinned version.
package v1alpha1
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@ -6,26 +11,77 @@ import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
//
// The agent reads this on startup and via informer for live updates. There is
// no controller and no auto-allocation — purely declarative input.
//
// A NodeConfig's name MUST equal the Kubernetes node name it configures
// (NodeConfigs are cluster-scoped). The agent ignores all NodeConfigs whose
// name does not match its own node.
type NodeConfigSpec struct {
// CIDR6 is the set of IPv6 CIDRs this node owns and advertises as BGP
-// aggregates. Pod IPv6 addresses are allocated from these.
+// aggregates. Pod IPv6 addresses are allocated from these. May be empty
+// only if Defaults disables IPv6 for every pod on this node.
CIDR6 []string `json:"cidr6,omitempty"`
// CIDR4 is the set of IPv4 CIDRs this node owns and advertises as BGP
-// aggregates. Pod IPv4 addresses are allocated from these.
+// aggregates. Pod IPv4 addresses are allocated from these. May be empty
+// when no pod on this node ever opts into IPv4.
CIDR4 []string `json:"cidr4,omitempty"`
// BGP configures the BGP sessions this node establishes upstream.
BGP BGPSpec `json:"bgp"`
// Defaults sets the per-node baseline for which address families a pod
// receives when its own annotations don't say. Pod-level
// `flock.fritzlab.net/ipv6` and `flock.fritzlab.net/ipv4` annotations
// always override these defaults.
//
// When a field is unset (nil), the agent falls back to its built-in
// baseline of IPv6=true, IPv4=true (dual-stack). When the whole Defaults
// block is nil, both built-in defaults apply.
//
// Typical uses:
// - dual-stack node (built-in default): omit Defaults entirely.
// - IPv6-only node: Defaults: { ipv6: true, ipv4: false }
// - IPv4-only node: Defaults: { ipv6: false, ipv4: true }
//
// Validation: at least one of IPv6 or IPv4 must end up true after merging
// (annotations + defaults + built-in baseline). The agent rejects pods
// that resolve to neither.
Defaults *FamilyDefaults `json:"defaults,omitempty"`
}
// FamilyDefaults is the per-node default for which address families a pod
// receives when its annotations don't specify. Each field is a pointer so
// "unset" is distinguishable from explicit "false".
type FamilyDefaults struct {
// IPv6 is the default value for the `flock.fritzlab.net/ipv6` annotation.
// nil → fall back to the built-in baseline (true).
IPv6 *bool `json:"ipv6,omitempty"`
// IPv4 is the default value for the `flock.fritzlab.net/ipv4` annotation.
// nil → fall back to the built-in baseline (true).
IPv4 *bool `json:"ipv4,omitempty"`
}
// BGPSpec describes this node's BGP speaker configuration. Each upstream peer
// becomes one BGP session in the rendered bird.conf.
type BGPSpec struct {
+// ASN is this node's local autonomous system number. flock uses private
+// ASNs in the 64512-65534 range by convention but accepts any value.
ASN uint32 `json:"asn"`
+// Peers is the set of upstream BGP neighbors. At least one is required
+// for BGP advertisement to function. Multiple peers of the same family
+// are allowed (multi-homing).
Peers []BGPPeer `json:"peers"`
}
+// BGPPeer is a single upstream BGP neighbor.
type BGPPeer struct {
+// Address is the peer's IP. May be IPv4 or IPv6. The agent picks an
+// appropriate local source address on the same subnet.
Address string `json:"address"`
+// ASN is the peer's remote ASN.
ASN uint32 `json:"asn"`
}
+45 -12
@@ -1,5 +1,5 @@
-// Package embed implements ip-algo: deterministic embedding of pod identity
-// (namespace, pod name, image digest) into the host portion of an IPv6
+// Package embed implements ip-algo: deterministic embedding of workload
+// identity (namespace, app name, image) into the host portion of an IPv6
// address. The mapping is operator-friendly cosmetics — NOT a security
// boundary. See dfritz-cni.md "IPv6 IID Embedding" for the full spec.
package embed
@@ -17,17 +17,26 @@ type Field string
const (
FieldNamespace Field = "namespace"
-FieldPod Field = "pod"
+FieldApp Field = "app"
FieldImage Field = "image"
)
-// Values carries the inputs for one embedding call. Image holds the SHA-256
-// manifest digest as 64 hex chars when known; otherwise pass the containerID
-// in ImageFallback and we'll FNV-1a-64 it.
+// Values carries the inputs for one embedding call.
+//
+// App is the stable workload identifier — typically the owning Deployment /
+// StatefulSet / DaemonSet name (callers strip the pod-template-hash from
+// ReplicaSet names before passing it in). Caller is responsible for picking
+// the right level of stability; this package just hashes whatever it gets.
+//
+// Image is whatever string the caller wants embedded for the image field;
+// the most common choice is pod.Spec.Containers[0].Image (the spec'd
+// reference). If the caller passes a 64-hex-char SHA-256 digest, the top
+// bits are taken as a hex value; otherwise it is FNV-1a-64'd as a plain
+// string. ImageFallback is used only when Image == "".
type Values struct {
Namespace string
-Pod string
+App string
-Image string // 64-char hex sha256 manifest digest, or empty
+Image string // sha256 hex (64 chars), or any string to FNV; empty → fallback
ImageFallback string // typically containerID, used when Image=="".
}
@@ -127,13 +136,22 @@ func fieldValue(f Field, v Values, bits int) (uint64, error) {
switch f {
case FieldNamespace:
return topBitsFNV(v.Namespace, bits), nil
-case FieldPod:
-return topBitsFNV(v.Pod, bits), nil
+case FieldApp:
+return topBitsFNV(v.App, bits), nil
case FieldImage:
-if v.Image != "" {
+if v.Image == "" {
+return topBitsFNV(v.ImageFallback, bits), nil
+}
+// SHA-256 manifest digests are exactly 64 hex chars (with optional
+// "sha256:" prefix). Anything else — image:tag references like
+// "traefik:v3", or short SHAs — gets FNV-1a-64'd as a string. This
+// preserves the original digest behaviour while letting callers
+// pass pod.Spec.Containers[0].Image directly.
+s := strings.TrimPrefix(v.Image, "sha256:")
+if len(s) == 64 && isHex(s) {
return topBitsHex(v.Image, bits)
}
-return topBitsFNV(v.ImageFallback, bits), nil
+return topBitsFNV(v.Image, bits), nil
default:
return 0, fmt.Errorf("unknown field %q", f)
}
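The three-way decision for the image field (empty, full digest, anything else) can be sketched in isolation. The helpers below are hypothetical re-implementations for illustration: the real topBitsFNV and topBitsHex are not shown in this diff, and it is an assumption here that the digest path takes the leading 64 bits of the hex string.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"strings"
)

// topBitsFNV is a stand-in: the top `bits` bits of FNV-1a-64(s).
func topBitsFNV(s string, bits int) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64() >> uint(64-bits)
}

// imageValue returns the embedded value and which branch produced it.
func imageValue(image, fallback string, bits int) (uint64, string) {
	if image == "" {
		return topBitsFNV(fallback, bits), "fallback"
	}
	s := strings.TrimPrefix(image, "sha256:")
	// A SHA-256 digest decodes to exactly 32 bytes; anything else is hashed.
	if raw, err := hex.DecodeString(s); err == nil && len(raw) == 32 {
		return binary.BigEndian.Uint64(raw[:8]) >> uint(64-bits), "digest"
	}
	return topBitsFNV(image, bits), "fnv"
}

func main() {
	digest := "sha256:abcdef0123456789aabbccddeeff00112233445566778899aabbccddeeff0011"
	v, kind := imageValue(digest, "", 20)
	fmt.Printf("%s %x\n", kind, v) // digest abcde

	_, kind = imageValue("traefik:v3.5", "ctr", 20)
	fmt.Println(kind) // fnv

	_, kind = imageValue("", "ctr", 20)
	fmt.Println(kind) // fallback
}
```

With 20 bits (5 nibbles) the digest case keeps the first five hex characters of the digest, which is what makes the resulting address recognisable to an operator.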
@@ -163,6 +181,21 @@ func topBitsHex(s string, bits int) (uint64, error) {
return v >> uint(64-bits), nil
}
// isHex reports whether every byte in s is a valid hex digit.
func isHex(s string) bool {
for i := 0; i < len(s); i++ {
c := s[i]
switch {
case c >= '0' && c <= '9':
case c >= 'a' && c <= 'f':
case c >= 'A' && c <= 'F':
default:
return false
}
}
return true
}
// writeNibble sets the (nibIdx)-th nibble of addr (0 = highest nibble of byte 0).
func writeNibble(addr net.IP, nibIdx int, nb byte) {
bytePos := nibIdx / 2
+104
@@ -0,0 +1,104 @@
package embed
import (
"net"
"testing"
)
// FuzzEmbed verifies that Embed never panics and that any successful return
// keeps the output address inside the requested network.
func FuzzEmbed(f *testing.F) {
type seed struct {
prefix string
fields string // comma-separated, mapped below to []Field
ns, app string
image string
fallback string
nNibble byte
}
for _, s := range []seed{
{"2602:817:3000:f001::/64", "namespace,app,image", "mail", "stalwart", "", "ctr", 0xe},
{"2001:db8::/64", "namespace", "ns", "a", "", "", 0},
{"2001:db8::/96", "app", "", "appname", "", "ctr", 0xf},
{"2001:db8::/48", "namespace,app", "ns", "a", "", "ctr", 0x1},
{"2001:db8::/120", "namespace", "n", "a", "", "ctr", 0x0}, // 8 host bits (2 nibbles)
{"2001:db8::/124", "namespace", "n", "a", "", "ctr", 0x0}, // 4 host bits (1 nibble)
{"2001:db8::/127", "namespace", "n", "a", "", "ctr", 0x0}, // not nibble-aligned
{"2001:db8::/63", "namespace", "n", "a", "", "ctr", 0x0}, // not nibble-aligned
{"2001:db8::/64", "namespace,app,image", "", "", "sha256:abcdef0123456789aabbccddeeff00112233445566778899aabbccddeeff0011", "", 0xa},
{"2001:db8::/64", "namespace,app,image", "", "", "traefik:v3.5", "ctr", 0xa},
{"2001:db8::/64", "namespace,app,image", "", "", "", "ctr", 0xa},
{"2001:db8::/64", "namespace", "🦆", "🐧", "", "", 0},
{"2001:db8::/64", "namespace", "ns\x00\x00", "a", "", "", 0},
} {
f.Add(s.prefix, s.fields, s.ns, s.app, s.image, s.fallback, s.nNibble)
}
f.Fuzz(func(t *testing.T, prefix, fieldsStr, ns, app, image, fallback string, nNibble byte) {
_, network, err := net.ParseCIDR(prefix)
if err != nil {
return
}
fields, ok := decodeFields(fieldsStr)
if !ok {
return
}
got, err := Embed(network, fields, Values{
Namespace: ns,
App: app,
Image: image,
ImageFallback: fallback,
}, nNibble)
if err != nil {
return
}
if !network.Contains(got) {
t.Fatalf("Embed(%s, %v) = %s, outside network", prefix, fields, got)
}
// Property: low nibble of last byte equals nNibble & 0x0F.
if want := nNibble & 0x0F; got[len(got)-1]&0x0F != want {
t.Fatalf("low nibble = %x, want %x", got[len(got)-1]&0x0F, want)
}
})
}
func decodeFields(s string) ([]Field, bool) {
if s == "" {
return nil, false
}
var out []Field
cur := []byte{}
flush := func() bool {
if len(cur) == 0 {
return true
}
switch string(cur) {
case string(FieldNamespace):
out = append(out, FieldNamespace)
case string(FieldApp):
out = append(out, FieldApp)
case string(FieldImage):
out = append(out, FieldImage)
default:
return false
}
cur = cur[:0]
return true
}
for i := 0; i < len(s); i++ {
if s[i] == ',' {
if !flush() {
return nil, false
}
continue
}
cur = append(cur, s[i])
}
if !flush() {
return nil, false
}
if len(out) == 0 {
return nil, false
}
return out, true
}
+6 -6
@@ -70,8 +70,8 @@ func TestEmbed_Slash64Deterministic(t *testing.T) {
// /64 with 3 fields: 5+5+5+1 nibbles = 64-bit IID.
net64 := mustCIDR(t, "2602:817:3000:f001::/64")
addr, err := Embed(net64,
-[]Field{FieldNamespace, FieldPod, FieldImage},
-Values{Namespace: "mail", Pod: "stalwart-0", ImageFallback: "container-abc"},
+[]Field{FieldNamespace, FieldApp, FieldImage},
+Values{Namespace: "mail", App: "stalwart", ImageFallback: "container-abc"},
0xe,
)
if err != nil {
@@ -79,8 +79,8 @@ func TestEmbed_Slash64Deterministic(t *testing.T) {
}
// Property: same inputs → same output (twice).
addr2, err := Embed(net64,
-[]Field{FieldNamespace, FieldPod, FieldImage},
-Values{Namespace: "mail", Pod: "stalwart-0", ImageFallback: "container-abc"},
+[]Field{FieldNamespace, FieldApp, FieldImage},
+Values{Namespace: "mail", App: "stalwart", ImageFallback: "container-abc"},
0xe,
)
if err != nil {
@@ -101,8 +101,8 @@ func TestEmbed_Slash64Deterministic(t *testing.T) {
func TestEmbed_DifferentInputsDifferentOutputs(t *testing.T) {
net64 := mustCIDR(t, "2602:817:3000:f001::/64")
-a, _ := Embed(net64, []Field{FieldNamespace, FieldPod}, Values{Namespace: "ns1", Pod: "p1"}, 0)
-b, _ := Embed(net64, []Field{FieldNamespace, FieldPod}, Values{Namespace: "ns2", Pod: "p1"}, 0)
+a, _ := Embed(net64, []Field{FieldNamespace, FieldApp}, Values{Namespace: "ns1", App: "p1"}, 0)
+b, _ := Embed(net64, []Field{FieldNamespace, FieldApp}, Values{Namespace: "ns2", App: "p1"}, 0)
if a.Equal(b) {
t.Fatalf("different namespace produced identical IID: %s", a)
}
+170 -8
@@ -9,6 +9,7 @@ import (
"fmt"
"net"
"sort"
+"strings"
"text/template"
)
@@ -25,6 +26,14 @@ type NodeBGP struct {
// hop self that crt001 accepts).
LocalV6 string
LocalV4 string
+// LocalSubnetV6 / LocalSubnetV4 are the directly-connected subnets
+// (CIDR) the BGP peers live on. When set, the per-peer ipv6 / ipv4
+// channel uses `import where net != <subnet>` so the gateway can't
+// re-advertise our own connected /64 (or /24) back to us — accepting
+// it would override the kernel-connected route and hairpin all
+// inter-host traffic via the gateway.
+LocalSubnetV6 string
+LocalSubnetV4 string
// CIDR6 / CIDR4 are the per-node summary aggregates the agent wants
// advertised. The agent installs blackhole kernel routes for each so
// BIRD's protocol kernel imports them.
@@ -91,7 +100,7 @@ protocol bgp upstream6_{{$i}} {
neighbor {{$p.Address}} as {{$p.ASN}};
graceful restart;
ipv6 {
-import all;
+{{if $.LocalSubnetV6}}import where net != {{$.LocalSubnetV6}};{{else}}import all;{{end}}
next hop self;
export filter {
{{range $cidr := $.CIDR6}}if net = {{$cidr}} then accept;
@@ -106,7 +115,7 @@ protocol bgp upstream4_{{$i}} {
neighbor {{$p.Address}} as {{$p.ASN}};
graceful restart;
ipv4 {
-import all;
+{{if $.LocalSubnetV4}}import where net != {{$.LocalSubnetV4}};{{else}}import all;{{end}}
next hop self;
export filter {
{{range $cidr := $.CIDR4}}if net = {{$cidr}} then accept;
@@ -118,28 +127,181 @@ protocol bgp upstream4_{{$i}} {
{{end}}{{end}}`
// Render produces the bird.conf text.
//
// The output is deterministic: the same NodeBGP input always produces the
// same string. CIDR lists, anycast lists, and peer lists are sorted before
// templating so that the only way the rendered config changes is when
// semantically meaningful inputs change. This stability matters because
// BirdManager compares Render output against the last-written config to
// avoid superfluous birdc reloads.
//
// Render validates every operator-supplied value that flows into the
// templated output (peer addresses, CIDRs, anycast IPs, source addresses)
// so a malformed NodeConfig or annotation cannot produce a malformed
// bird.conf — even one that BIRD would later reject.
func Render(in NodeBGP) (string, error) {
if in.RouterID == "" {
-return "", fmt.Errorf("RouterID is required")
+return "", fmt.Errorf("bird render: RouterID is required")
+}
+if net.ParseIP(in.RouterID) == nil {
+return "", fmt.Errorf("bird render: RouterID %q is not a valid IP", in.RouterID)
}
if in.LocalASN == 0 {
-return "", fmt.Errorf("LocalASN is required")
+return "", fmt.Errorf("bird render: LocalASN is required")
}
-// Stable order — important so config changes only when something real
-// changes (avoids needless birdc reloads).
+if err := validateLocalSource(in.LocalV6, "v6"); err != nil {
+return "", err
+}
+if err := validateLocalSource(in.LocalV4, "v4"); err != nil {
+return "", err
+}
+if err := validateLocalSubnet(in.LocalSubnetV6, "v6"); err != nil {
+return "", err
+}
+if err := validateLocalSubnet(in.LocalSubnetV4, "v4"); err != nil {
+return "", err
+}
+for i, p := range in.Peers {
+if err := validatePeer(p); err != nil {
+return "", fmt.Errorf("bird render: peer[%d]: %w", i, err)
+}
+}
+if err := validateCIDRs(in.CIDR6, "v6"); err != nil {
+return "", fmt.Errorf("bird render: cidr6: %w", err)
+}
+if err := validateCIDRs(in.CIDR4, "v4"); err != nil {
+return "", fmt.Errorf("bird render: cidr4: %w", err)
+}
+if err := validateAnycastIPs(in.Anycast6, "v6"); err != nil {
+return "", fmt.Errorf("bird render: anycast6: %w", err)
+}
+if err := validateAnycastIPs(in.Anycast4, "v4"); err != nil {
+return "", fmt.Errorf("bird render: anycast4: %w", err)
+}
in = normalize(in)
t, err := template.New("bird").Parse(tpl)
if err != nil {
-return "", err
+return "", fmt.Errorf("bird template parse: %w", err)
}
var buf bytes.Buffer
if err := t.Execute(&buf, in); err != nil {
-return "", err
+return "", fmt.Errorf("bird template execute: %w", err)
}
return buf.String(), nil
}
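The reason deterministic output matters can be shown with a small compare-before-reload sketch. Everything here is hypothetical: `manager` stands in for BirdManager (whose code is not part of this diff), `render` is a toy replacement for Render, and `sortedUnique` is re-implemented only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortedUnique returns a sorted copy of in with duplicates removed.
func sortedUnique(in []string) []string {
	cp := append([]string(nil), in...)
	sort.Strings(cp)
	var out []string
	for _, s := range cp {
		if len(out) == 0 || out[len(out)-1] != s {
			out = append(out, s)
		}
	}
	return out
}

// render is a toy renderer: one line per normalized CIDR.
func render(cidrs []string) string {
	return strings.Join(sortedUnique(cidrs), "\n")
}

// manager is a hypothetical stand-in for BirdManager.
type manager struct {
	last    string
	reloads int
}

// apply reloads only when the rendered text differs from the last one.
func (m *manager) apply(rendered string) bool {
	if rendered == m.last {
		return false
	}
	m.last = rendered
	m.reloads++ // a real manager would write bird.conf and reconfigure BIRD here
	return true
}

func main() {
	m := &manager{}
	fmt.Println(m.apply(render([]string{"2001:db8:2::/64", "2001:db8:1::/64"})))                    // true
	fmt.Println(m.apply(render([]string{"2001:db8:1::/64", "2001:db8:2::/64", "2001:db8:1::/64"}))) // false
	fmt.Println(m.reloads)                                                                          // 1
}
```

Without the sort-and-dedupe step, a reordered or duplicated input list would produce different text and trigger a reload even though nothing meaningful changed.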
// validatePeer checks that a peer entry has a parseable IP whose family
// matches its declared Family field, and a non-zero ASN.
func validatePeer(p Peer) error {
if p.ASN == 0 {
return fmt.Errorf("ASN must be non-zero")
}
ip := net.ParseIP(p.Address)
if ip == nil {
return fmt.Errorf("address %q is not a valid IP", p.Address)
}
isV4 := ip.To4() != nil
switch p.Family {
case "v6":
if isV4 {
return fmt.Errorf("address %q is IPv4 but Family is v6", p.Address)
}
case "v4":
if !isV4 {
return fmt.Errorf("address %q is IPv6 but Family is v4", p.Address)
}
default:
return fmt.Errorf("Family %q must be v6 or v4", p.Family)
}
return nil
}
// validateCIDRs parses each entry as a CIDR and rejects family mismatches.
// fam must be "v6" or "v4".
func validateCIDRs(cidrs []string, fam string) error {
for _, c := range cidrs {
_, n, err := net.ParseCIDR(c)
if err != nil {
return fmt.Errorf("invalid CIDR %q: %w", c, err)
}
isV4 := n.IP.To4() != nil
if fam == "v6" && isV4 {
return fmt.Errorf("CIDR %q is IPv4, expected IPv6", c)
}
if fam == "v4" && !isV4 {
return fmt.Errorf("CIDR %q is IPv6, expected IPv4", c)
}
}
return nil
}
// validateAnycastIPs parses each entry as a literal IP (no prefix) and rejects
// family mismatches.
func validateAnycastIPs(ips []string, fam string) error {
for _, s := range ips {
ip := net.ParseIP(s)
if ip == nil {
return fmt.Errorf("invalid IP %q", s)
}
isV4 := ip.To4() != nil
if fam == "v6" && isV4 {
return fmt.Errorf("IP %q is IPv4, expected IPv6", s)
}
if fam == "v4" && !isV4 {
return fmt.Errorf("IP %q is IPv6, expected IPv4", s)
}
}
return nil
}
// validateLocalSource validates an optional LocalV6/LocalV4 source address.
// Empty is allowed (BIRD picks its own); non-empty must be a parseable IP of
// the matching family.
func validateLocalSource(s, fam string) error {
if s == "" {
return nil
}
ip := net.ParseIP(s)
if ip == nil {
return fmt.Errorf("bird render: Local%s %q is not a valid IP", strings.ToUpper(fam), s)
}
isV4 := ip.To4() != nil
if fam == "v6" && isV4 {
return fmt.Errorf("bird render: LocalV6 %q is IPv4", s)
}
if fam == "v4" && !isV4 {
return fmt.Errorf("bird render: LocalV4 %q is IPv6", s)
}
return nil
}
// validateLocalSubnet validates an optional LocalSubnetV6/LocalSubnetV4 CIDR.
// Empty is allowed (no import filter); non-empty must be a parseable CIDR of
// the matching family in canonical form (host bits zero) so the BIRD `net !=`
// comparison matches the route the gateway re-advertises.
func validateLocalSubnet(s, fam string) error {
if s == "" {
return nil
}
ip, n, err := net.ParseCIDR(s)
if err != nil {
return fmt.Errorf("bird render: LocalSubnet%s %q is not a valid CIDR: %w", strings.ToUpper(fam), s, err)
}
if !ip.Equal(n.IP) {
return fmt.Errorf("bird render: LocalSubnet%s %q has non-zero host bits (want %s)", strings.ToUpper(fam), s, n.String())
}
isV4 := n.IP.To4() != nil
if fam == "v6" && isV4 {
return fmt.Errorf("bird render: LocalSubnetV6 %q is IPv4", s)
}
if fam == "v4" && !isV4 {
return fmt.Errorf("bird render: LocalSubnetV4 %q is IPv6", s)
}
return nil
}
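The commit message says the agent derives LocalSubnetV6/V4 by masking the peer's address to /64 (v6) or /24 (v4). A minimal sketch of that derivation follows; `localSubnet` is a hypothetical helper name, and the fixed prefix lengths are the fritzlab convention from the commit message rather than a general rule.

```go
package main

import (
	"fmt"
	"net"
)

// localSubnet masks a peer address to /24 (IPv4) or /64 (IPv6) and returns
// the canonical CIDR, i.e. with all host bits zero.
func localSubnet(peer string) (string, error) {
	ip := net.ParseIP(peer)
	if ip == nil {
		return "", fmt.Errorf("invalid peer address %q", peer)
	}
	if v4 := ip.To4(); v4 != nil {
		mask := net.CIDRMask(24, 32)
		n := net.IPNet{IP: v4.Mask(mask), Mask: mask}
		return n.String(), nil
	}
	mask := net.CIDRMask(64, 128)
	n := net.IPNet{IP: ip.Mask(mask), Mask: mask}
	return n.String(), nil
}

func main() {
	for _, peer := range []string{"2602:817:3000:a25::1", "172.25.25.1"} {
		s, err := localSubnet(peer)
		fmt.Println(s, err)
	}
	// Prints:
	// 2602:817:3000:a25::/64 <nil>
	// 172.25.25.0/24 <nil>
}
```

Because the mask is applied before formatting, the result always passes the host-bits-zero check in validateLocalSubnet, and it is the same string the `net !=` filter compares against.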
func normalize(in NodeBGP) NodeBGP {
cp := in
cp.CIDR6 = sortedUnique(in.CIDR6)
+101
@@ -0,0 +1,101 @@
package bird
import (
"strings"
"testing"
)
// FuzzRender drives the bird template with a wide range of inputs and
// confirms two safety properties:
//
// 1. Render never panics.
// 2. On nil-error return, the output is deterministic (calling Render
// twice with the same input yields byte-identical output) and contains
// no unbalanced braces (a smoke test for malformed template branches).
func FuzzRender(f *testing.F) {
type seed struct {
routerID string
asn uint32
peerAddr string
peerASN uint32
cidr6 string
cidr4 string
anycast6 string
anycast4 string
localV6 string
localV4 string
subnet6 string
subnet4 string
}
seeds := []seed{
{routerID: "10.0.0.1", asn: 65101, peerAddr: "2001:db8::1", peerASN: 65000, cidr6: "2001:db8:f001::/64"},
{routerID: "172.25.25.101", asn: 65101, peerAddr: "172.25.25.1", peerASN: 65000, cidr4: "172.25.210.0/24"},
{routerID: "10.0.0.1", asn: 65101, peerAddr: "2001:db8::1", peerASN: 65000, cidr6: "2001:db8:f001::/64", anycast6: "2001:db8:a::1"},
{routerID: "10.0.0.1", asn: 65101, peerAddr: "10.0.0.2", peerASN: 65000, cidr4: "10.0.0.0/24", anycast4: "10.255.0.1"},
{routerID: "10.0.0.1", asn: 65101}, // no peer, no cidrs
{routerID: "", asn: 65101, peerAddr: "10.0.0.2", peerASN: 1}, // empty routerID → expect error
{routerID: "10.0.0.1", asn: 0, peerAddr: "10.0.0.2", peerASN: 1}, // zero ASN → expect error
// Backtick-bearing inputs to defend the template against accidental
// closure of the raw-string literal.
{routerID: "10.0.0.1`", asn: 65101},
// Newlines and template-meta in user-supplied addresses
{routerID: "10.0.0.1", asn: 65101, peerAddr: "2001:db8::1\n{{kaboom}}", peerASN: 65000, cidr6: "2001:db8:f001::/64"},
// LocalSubnet filters set.
{routerID: "172.25.25.104", asn: 65104, peerAddr: "2602:817:3000:a25::1", peerASN: 65000, subnet6: "2602:817:3000:a25::/64", subnet4: "172.25.25.0/24"},
// Malformed subnet should be rejected by validation, not crash.
{routerID: "10.0.0.1", asn: 65101, subnet6: "not-a-cidr"},
}
for _, s := range seeds {
f.Add(s.routerID, s.asn, s.peerAddr, s.peerASN, s.cidr6, s.cidr4, s.anycast6, s.anycast4, s.localV6, s.localV4, s.subnet6, s.subnet4)
}
f.Fuzz(func(t *testing.T, routerID string, asn uint32, peerAddr string, peerASN uint32, cidr6, cidr4, anycast6, anycast4, localV6, localV4, subnet6, subnet4 string) {
in := NodeBGP{
RouterID: routerID,
LocalASN: asn,
LocalV6: localV6,
LocalV4: localV4,
LocalSubnetV6: subnet6,
LocalSubnetV4: subnet4,
}
// Add the peer in whichever family it belongs to, if any. FamilyOf
// returns "" for non-IPs; that test exercises the "skip unknown
// family" branch in the bird agent code path.
if fam := FamilyOf(peerAddr); fam != "" {
in.Peers = []Peer{{Family: fam, Address: peerAddr, ASN: peerASN}}
}
if cidr6 != "" {
in.CIDR6 = []string{cidr6}
}
if cidr4 != "" {
in.CIDR4 = []string{cidr4}
}
if anycast6 != "" {
in.Anycast6 = []string{anycast6}
}
if anycast4 != "" {
in.Anycast4 = []string{anycast4}
}
out, err := Render(in)
if err != nil {
return
}
// Determinism.
out2, err := Render(in)
if err != nil {
t.Fatalf("Render became flaky: first ok, second %v", err)
}
if out != out2 {
t.Fatalf("Render not deterministic on identical input")
}
// Smoke test for balanced braces. The template uses `{` and `}`
// as BIRD's block delimiters; if our template engine ever
// produced an unbalanced output we'd catch it here.
if got := strings.Count(out, "{") - strings.Count(out, "}"); got != 0 {
t.Fatalf("unbalanced braces: %d", got)
}
})
}
+83
@@ -75,6 +75,89 @@ func TestRender_StableOutput(t *testing.T) {
}
}
func TestRender_LocalSubnetImportFilter(t *testing.T) {
out, err := Render(NodeBGP{
RouterID: "172.25.25.104",
LocalASN: 65104,
Peers: []Peer{{Family: "v6", Address: "2602:817:3000:a25::1", ASN: 65000}, {Family: "v4", Address: "172.25.25.1", ASN: 65000}},
CIDR6: []string{"2602:817:3000:f004::/64"},
CIDR4: []string{"172.25.214.0/24"},
LocalSubnetV6: "2602:817:3000:a25::/64",
LocalSubnetV4: "172.25.25.0/24",
})
if err != nil {
t.Fatal(err)
}
for _, want := range []string{
"import where net != 2602:817:3000:a25::/64;",
"import where net != 172.25.25.0/24;",
} {
if !strings.Contains(out, want) {
t.Errorf("missing %q in output:\n%s", want, out)
}
}
// Each BGP peer block should use the import filter, not import all.
// Slice out just the `protocol bgp ...` stanzas to avoid catching the
// kernel proto's legitimate `import all;`.
for _, marker := range []string{"protocol bgp upstream6_", "protocol bgp upstream4_"} {
idx := strings.Index(out, marker)
if idx < 0 {
continue
}
end := strings.Index(out[idx:], "\n}")
if end < 0 {
continue
}
stanza := out[idx : idx+end]
if strings.Contains(stanza, "import all;") {
t.Errorf("BGP stanza still has `import all;`:\n%s", stanza)
}
}
}
func TestRender_LocalSubnetEmpty_FallsBackToImportAll(t *testing.T) {
out, err := Render(NodeBGP{
RouterID: "10.0.0.1",
LocalASN: 65101,
Peers: []Peer{{Family: "v6", Address: "2001:db8::1", ASN: 65000}},
CIDR6: []string{"2001:db8:f001::/64"},
})
if err != nil {
t.Fatal(err)
}
if !strings.Contains(out, "import all;") {
t.Errorf("expected `import all;` when LocalSubnetV6 unset:\n%s", out)
}
}
func TestRender_LocalSubnetValidation(t *testing.T) {
cases := []struct {
name string
v6, v4 string
wantErr string
}{
{name: "non-canonical v6", v6: "2602:817:3000:a25::1/64", wantErr: "non-zero host bits"},
{name: "non-canonical v4", v4: "172.25.25.1/24", wantErr: "non-zero host bits"},
{name: "v6 family mismatch", v6: "172.25.25.0/24", wantErr: "is IPv4"},
{name: "v4 family mismatch", v4: "2602:817:3000:a25::/64", wantErr: "is IPv6"},
{name: "garbage", v6: "not-a-cidr", wantErr: "not a valid CIDR"},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
_, err := Render(NodeBGP{
RouterID: "10.0.0.1",
LocalASN: 65101,
Peers: []Peer{{Family: "v6", Address: "2001:db8::1", ASN: 65000}},
LocalSubnetV6: tc.v6,
LocalSubnetV4: tc.v4,
})
if err == nil || !strings.Contains(err.Error(), tc.wantErr) {
t.Fatalf("want error containing %q, got %v", tc.wantErr, err)
}
})
}
}
func TestFamilyOf(t *testing.T) {
if FamilyOf("2001:db8::1") != "v6" {
t.Fatal("v6 detection broken")
@@ -0,0 +1,13 @@
go test fuzz v1
string("0")
uint32(65101)
string("0")
uint32(1)
string("")
string("")
string("")
string("}")
string("")
string("")
string("")
string("")