Files
Donavan Fritz a17d33e182
Build flock Image / build (push) Successful in 5m27s
agent: addresses annotation replaces IPAM allocation
When flock.fritzlab.net/addresses provides a v6 or v4, the IP becomes
the pod's primary IP for that family — bound to eth0, default route off
it, on-link host route via setHostRoute, and a per-pod /128 or /32 in
BGP. IPAM no longer allocates a private IP alongside it. The pod ends up
with exactly the operator-supplied addresses on eth0 (plus any extras
beyond the first-of-family, which keep the pre-existing layered
behavior).

This is the fix the original addresses-annotation work missed: bug #1
allocated a private IP next to the public one (so VPN-routed clients
could land on the private path on Plex). Promoting addresses-supplied
IPs into the IPAM-style routing slot keeps the public IP as the only
primary IP visible from outside.

Three pieces:
- annotations.go: reject pods whose addresses/anycast IP family is
  disabled (ipv6/ipv4 annotation or NodeConfig default). Both annotation
  types rely on the family being enabled for return-path routing.
- handlers.go: peel first v6 + first v4 from Addresses into res.IP6/IP4;
  suppress IPAM for those families; skip IPAM call entirely if both
  families are addresses-supplied.
- anycast_linux.go: extend renderBird to advertise any IPAM IP that's
  outside the node's BGP aggregate as a per-pod /32 or /128. This is
  what makes 142.202.202.166 reachable when host004's pod CIDR is
  172.25.214.0/24 — the addresses-promoted IP isn't covered by the
  aggregate.

Tests: 7 new annotation tests covering the conflict cases (ipv4=false +
addresses-v4, NodeConfig default + addresses-v4, etc.) plus 5 unit tests
for the splitAddressesPrimary helper.

README updated with the addresses-replaces-IPAM behavior, the
addresses-vs-anycast comparison, the conflict rule, and a Plex-style
example.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 09:46:48 -05:00

399 lines
18 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# flock
A small, opinionated Kubernetes CNI built around three ideas:
1. **Dual-stack, IPv6-friendly.** Every pod gets a globally routable IPv6
address by default. IPv4 is also enabled by default; either family can
be turned off per-node or per-pod when you really mean to.
2. **No tunnels, no NAT.** Pod addresses are the real packets on the wire.
Each node speaks BGP to its upstream router and advertises its own
per-node prefix. The pod network is just the LAN, plus host routes.
3. **Anycast as a primitive.** A pod can request an anycast address via
an annotation; flock binds it on the pod's loopback and advertises a
`/128` (or `/32`) over BGP, but only while the pod is `Ready`. Multiple
replicas advertise the same address from different nodes for ECMP load
balancing without a separate Service or external LB.
flock is built for clusters where every node already speaks BGP to one
or more upstream routers. It deliberately leaves out features you'd
expect from a general-purpose CNI — overlays, IPsec/Wireguard, IPAM
coordination across nodes, kube-proxy integration — so the moving parts
that remain are easy to reason about.
> **Status:** alpha. CRD shape and annotation keys may still change.
## Table of contents
- [How it works](#how-it-works)
- [Requirements](#requirements)
- [Quickstart](#quickstart)
- [NodeConfig CRD](#nodeconfig-crd)
- [Pod annotations](#pod-annotations)
- [Use cases](#use-cases)
- [Comparison vs Calico / Cilium](#comparison-vs-calico--cilium)
- [Limitations and non-goals](#limitations-and-non-goals)
- [Building and testing](#building-and-testing)
- [License](#license)
## How it works
Each node runs a single `flock-agent` DaemonSet pod with three containers:
- a privileged init container (`flock-installer`) that drops the CNI
plugin binary into `/opt/cni/bin/flock` and writes
`/etc/cni/net.d/01-flock.conflist`,
- the agent itself, which owns IPAM, programs veth pairs, and tracks
pod readiness, and
- a [BIRD2](https://bird.network.cz/) sidecar that the agent re-renders
and reloads when the per-node config or the active anycast set changes.
Each node has a `NodeConfig` CR (cluster-scoped, name = node name) that
declares its IPv6 and IPv4 prefixes, its local BGP ASN, and its upstream
peers. The agent reads the CR via a dynamic informer.
When kubelet runs the CNI plugin on `ADD`, the plugin opens a unix-socket
RPC to the agent. The agent allocates an address from the per-node
CIDRs, creates a veth pair, configures the pod side, persists the
allocation to `/var/lib/flock/allocations.json`, and returns the result.
There is no controller loop and no IPAM coordination across nodes — each
node owns a non-overlapping CIDR and allocates locally.
For anycast, the agent installs `<anycast-ip> via <pod-eth0-ip> dev <veth>`
host routes on the node and adds the anycast IP to BIRD's BGP export
filter. When a pod loses readiness, the agent withdraws the route from
both the kernel and BGP within one reconcile cycle (sub-second).
### Packet path
`pod.eth0` (a veth) ↔ host-side veth (with `addrgenmode none`,
`fe80::1/64`, proxy-ARP for the v4 default-via) ↔ host kernel ↔ uplink
NIC ↔ upstream router. No conntrack, no SNAT, no encapsulation.
For IPv6 the host side of every veth carries the deterministic link-local
gateway `fe80::1`, so every pod can use a fixed default route. For IPv4
the host side answers ARP for `169.254.1.1`, providing the same fixed
default route in v4.
## Requirements
- Linux nodes. flock has not been tested on, and does not target,
Windows nodes.
- Kubernetes ≥ 1.27.
- An upstream router (or pair) that accepts a BGP session from each
node. flock has been tested with Cisco IOS-XE, Arista EOS, and FRR
acting as the upstream; anything that speaks standard eBGP should work.
- Globally routable (or at least datacentre-routable) IPv6 prefix
delegated to the cluster, sliced into a per-node /64. IPv4 is
optional but supported.
- Each node must have a unique local ASN. Private ASNs (`6451265534`,
`42000000004294967294`) are typical.
## Quickstart
```sh
# 1. Install CRD + RBAC + DaemonSet (single bundled manifest):
kubectl apply -f deploy/install.yaml
# 2. Label the node(s) you want flock to manage:
kubectl label node <node-name> flock.fritzlab.net/agent=
# 3. Apply a NodeConfig CR for that node (see "NodeConfig CRD" below):
kubectl apply -f my-nodeconfig.yaml
# 4. Verify the agent is up:
kubectl -n kube-system get pod -l app=flock-agent -o wide
kubectl -n kube-system exec -it ds/flock-agent -c bird -- \
birdc -s /run/flock/bird.ctl show protocols
```
The DaemonSet is gated by the `flock.fritzlab.net/agent` node label, so
unlabelled nodes continue to use whatever CNI was installed before. This
lets you migrate node-by-node — start with one node, prove it works, then
proceed.
## NodeConfig CRD
A `NodeConfig` is the only operator-supplied input. One per node, name
matches the node name. Example:
```yaml
apiVersion: flock.fritzlab.net/v1alpha1
kind: NodeConfig
metadata:
name: node-a
spec:
cidr6:
- 2001:db8:f001::/64 # Pods on this node get addresses from here.
cidr4:
- 192.0.2.0/24 # IPv4 pool, used only when a pod opts in.
defaults:
ipv6: true # Optional. Built-in baseline if omitted.
ipv4: true # Optional. Built-in baseline if omitted.
bgp:
asn: 65101 # This node's local ASN.
peers:
- address: 2001:db8::1 # Upstream router (IPv6 session).
asn: 65000
- address: 192.0.2.1 # Same router, IPv4 session.
asn: 65000
```
### `spec.defaults`
`spec.defaults` controls which address families a pod *gets by default*
on this node — i.e. when the pod has no explicit `flock.fritzlab.net/ipv6`
or `flock.fritzlab.net/ipv4` annotation. Pod annotations always override.
If you omit `spec.defaults` (or any individual field inside it) flock
falls back to its built-in baseline of **dual-stack (IPv6 on, IPv4 on)**.
| Goal | `spec.defaults` |
|-----------------------------------|----------------------------------------|
| Dual-stack (the default) | omit, or `{ ipv6: true, ipv4: true }` |
| IPv6-only node | `{ ipv6: true, ipv4: false }` |
| IPv4-only (legacy node) | `{ ipv6: false, ipv4: true }` |
A NodeConfig that resolves to "neither family" is rejected at allocation
time, so misconfiguring both to false will surface as an error on the
first `CNI ADD`.
### `spec.bgp`
Each `peer` becomes one BGP session. The agent picks a node-local source
address on the same subnet as the peer; if there isn't one, BIRD uses
its default. Multi-homing (multiple peers per family — or per upstream
router pair) is allowed.
## Pod annotations
All annotations live under `flock.fritzlab.net/`. Every annotation is
optional; leave them off to inherit the per-node defaults.
| Annotation | Type | Purpose |
|-------------------------------------|--------|-----------------------------------------------------------------------------------------------|
| `flock.fritzlab.net/ipv6` | bool | Override `spec.defaults.ipv6` for this pod (`true`/`false`). |
| `flock.fritzlab.net/ipv4` | bool | Override `spec.defaults.ipv4` for this pod (`true`/`false`). |
| `flock.fritzlab.net/cidr6` | CIDRs | Restrict IPv6 allocation to a sub-range of the node's `cidr6`. Comma-separated. |
| `flock.fritzlab.net/cidr4` | CIDRs | Restrict IPv4 allocation to a sub-range of the node's `cidr4`. Comma-separated. |
| `flock.fritzlab.net/ip-algo` | list | Embed identity into the IPv6 IID. Subset of `namespace,pod,image`, in order, comma-separated. |
| `flock.fritzlab.net/anycast` | IPs | Bind these IPs on the pod's `lo`; advertise via BGP while pod is `Ready`. Mixed v6+v4 ok. |
| `flock.fritzlab.net/addresses` | IPs | Bind these IPs on the pod's `eth0`. The first v6 and first v4 **replace** IPAM allocation for that family — the addresses IP becomes the pod's primary IP. Mixed v6+v4 ok. Single-replica only in practice. |
Bool values must be the literal strings `"true"` or `"false"`
(case-insensitive, surrounding whitespace tolerated). Other values —
`1`, `0`, `yes`, `no` — are rejected so a typo can't silently flip
behaviour.
### `addresses` vs `anycast`
Both annotations bind operator-supplied IPs onto a pod and have flock
advertise `/128` (or `/32`) per-pod over BGP. The differences are
where the IP lands and what it's for:
| | `anycast` | `addresses` |
|----------------------------|----------------------------------------------------|-------------------------------------------------------------------|
| Bound on | pod `lo` | pod `eth0` |
| Multi-replica? | yes — every Ready replica advertises the same IP and the upstream router ECMPs across them | no — the same IP on multiple replicas is operator error |
| Replaces IPAM? | no — pod still has an IPAM-allocated unicast IP | **yes** — the first v6 + first v4 in the list become the pod's primary IPs in place of an IPAM allocation |
| Workload visibility | only the IPAM IP is on the primary interface | the public IP is `eth0`'s primary address — workloads that read their own NIC see it (e.g. Plex's remote-access detection) |
Use `anycast` for shared services with many replicas (DNS, ingress).
Use `addresses` when one specific pod needs a known public IP that the
workload itself must see on its primary interface.
### Conflict detection
`addresses` and `anycast` reject pods that supply an IP whose family is
disabled. If the resolved `WantV4` is false (via the pod's `ipv4`
annotation or the NodeConfig default) and any addresses- or
anycast-supplied IP is IPv4, the CNI ADD fails with an explicit error.
Same for v6. Both annotation types put IPs on a pod interface and rely
on the family being enabled for return-path routing — silently accepting
the IP would leave a non-functional pod.
### Outside-aggregate advertisement
When an `addresses` IP replaces IPAM (becomes the pod's primary IP) the
IP is typically **outside** the node's BGP aggregate (e.g. a public
`/32` on a node whose pod CIDR is private). flock notices this during
BGP rendering and advertises the IP individually as a per-pod `/32` or
`/128` so the upstream router has a route to it.
### Example pods
Default dual-stack — no annotations needed:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: minimal
```
IPv6 only — opt out of the default v4 allocation:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: v6-only
annotations:
flock.fritzlab.net/ipv4: "false"
```
Operator-friendly addressing — `fnv(namespace) | fnv(pod) | random`
packed into the host bits, so a pod's identity is recognisable from
its IP in `kubectl get pods -o wide`:
```yaml
metadata:
annotations:
flock.fritzlab.net/ip-algo: "namespace,pod"
```
Anycast service — three replicas, each advertising the same v6+v4
anycast pair from the node it lands on. The upstream router does ECMP
across the active set:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: dns
spec:
replicas: 3
template:
metadata:
annotations:
flock.fritzlab.net/anycast: "2001:db8:a::53, 192.0.2.53"
spec:
containers:
- name: coredns
image: coredns/coredns
readinessProbe:
httpGet: { path: /ready, port: 8181 }
periodSeconds: 1
failureThreshold: 1
```
Workload with a known public IP — single-replica pod whose application
inspects its own primary interface (Plex's remote-access flow). The
addresses become the pod's primary IPs in place of any IPAM allocation;
the pod's `eth0` ends up with exactly the supplied addresses, and BGP
advertises them as a `/128` and `/32`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: plex
spec:
replicas: 1
template:
metadata:
annotations:
flock.fritzlab.net/addresses: "2001:db8:c606::166, 192.0.2.166"
spec:
containers:
- name: plex
image: plexinc/pms-docker
```
## Use cases
**Highly-available DNS.** Run N CoreDNS replicas, each annotated with
the same `anycast` IP. Point client `/etc/resolv.conf` at the anycast
address. Each replica advertises a `/128` from its own node; the
upstream router does ECMP. Lose a pod, traffic fails over within a
probe cycle.
**Replacing a kube-proxy `ClusterIP`.** Headless Service plus an anycast
IP gives you a single stable address with load-balancing across pods,
without the DNAT-pinning that makes long-lived TCP keepalive connections
stick to one backend forever. ECMP routes each new flow independently.
**Per-pod public IPv6.** Because every pod has a globally routable IPv6
address and the cluster does no NAT, a pod's `eth0` IP is reachable from
the rest of the internet (subject to your firewall). Useful for things
like outgoing SMTP, where you want a stable from-address per pod, or for
peer-to-peer protocols that don't tolerate NAT.
**Fast pod identification in `kubectl`.** With
`flock.fritzlab.net/ip-algo: namespace,pod` the IPv6 host bits encode
the pod's namespace+name, so you can recognise a pod from its IP without
a lookup. Reverse-DNS via a wildcard zone makes those IPs human-readable
too.
**Static-IP migration.** Annotation-driven address allocation means you
can ask for a specific sub-CIDR (`cidr6: 2001:db8:f001::ab00/120`) for
services that previously needed pinned IPs (mail server, ingress
controller). When the static-IP requirement goes away, drop the
annotation and the pod gets a normal allocation.
## Comparison vs Calico / Cilium
| | flock | Calico | Cilium |
|--------------------------|-----------------------------|------------------------------|------------------------------|
| Default address family | dual (IPv6+IPv4) | IPv4 | dual |
| BGP | yes (BIRD) | yes | optional |
| Overlay (VXLAN/IPIP) | never | optional | yes (geneve) or native |
| NAT in datapath | never | masquerade by default | masquerade by default |
| Anycast pod addressing | first-class | manual | optional, via service mesh |
| eBPF datapath | no | optional | yes |
| NetworkPolicy | yes (nftables) | yes (Felix) | yes (eBPF) |
| Cluster size target | small (< 100 nodes) | thousands | thousands |
| Operational surface area | low (1 DaemonSet, 1 CRD) | medium | high |
| Production-ready | alpha | yes | yes |
flock is not trying to compete with Calico or Cilium. The right answer
for most clusters is one of those two — flock exists for clusters where
every node already speaks BGP, the operator wants real (no NAT) IPv6
addressing on every pod, and per-pod anycast is something they actually
want to use rather than work around.
## Limitations and non-goals
- NetworkPolicy supports `networking.k8s.io/v1` (ingress + egress, all
three peer types, numeric ports + port ranges). Named ports and
AdminNetworkPolicy are not yet implemented.
- No NAT, no masquerade, no SNAT-egress. Pods reach the wider internet
using their real cluster-routable addresses; if your IPv4 pool isn't
routable beyond your network, those pods can't reach v4-only hosts on
the public internet without help from your border router.
- No multi-cluster, no peering across clusters.
- Linux-only datapath.
- IPAM is per-node — there's no global allocator and no IP mobility.
When a pod moves to a different node it gets a new address.
- The agent is privileged. It mounts `/var/run/netns`, configures veth
pairs, manages kernel routes, and holds `CAP_NET_ADMIN`. This is
inherent to being a CNI; reducing privilege further is not a goal.
- If BIRD dies but the agent stays up, pods on that node stop being
reachable from off-node. The DaemonSet liveness probes catch this.
## Building and testing
```sh
# Unit tests + fuzz seed corpora (fast, ~1s):
go test ./...
# Targeted fuzz pass:
go test -run NEVERMATCH -fuzz=FuzzParseAnnotations -fuzztime=30s ./pkg/agent
go test -run NEVERMATCH -fuzz=FuzzRender -fuzztime=30s ./pkg/routing/bird
go test -run NEVERMATCH -fuzz=FuzzEmbed -fuzztime=30s ./pkg/embed
go test -run NEVERMATCH -fuzz=FuzzIPAM_Allocate -fuzztime=30s ./pkg/agent
# Build the container image (used by the DaemonSet):
docker build -t flock:dev .
```
The fuzz tests are also run as plain unit tests via their seed corpora,
so every `go test ./...` exercises the discovered edge cases as
regressions.
`pkg/agent` has Linux-only files (`*_linux.go`) for netlink and netns
work; the macOS/Windows build pulls in stubs from `*_stub.go` so tests
run cleanly on developer laptops.
## License
Apache 2.0 — see [LICENSE](LICENSE).