# flock

A small, opinionated Kubernetes CNI built around three ideas:

1. **Dual-stack, IPv6-friendly.** Every pod gets a globally routable IPv6 address by default. IPv4 is also enabled by default; either family can be turned off per-node or per-pod when you really mean to.
2. **No tunnels, no NAT.** Pod addresses are the real packets on the wire. Each node speaks BGP to its upstream router and advertises its own per-node prefix. The pod network is just the LAN, plus host routes.
3. **Anycast as a primitive.** A pod can request an anycast address via an annotation; flock binds it on the pod's loopback and advertises a `/128` (or `/32`) over BGP, but only while the pod is `Ready`. Multiple replicas advertise the same address from different nodes for ECMP load balancing without a separate Service or external LB.

flock is built for clusters where every node already speaks BGP to one or more upstream routers. It deliberately leaves out features you'd expect from a general-purpose CNI (overlays, IPsec/WireGuard, IPAM coordination across nodes, kube-proxy integration) so the moving parts that remain are easy to reason about.

> **Status:** alpha. CRD shape and annotation keys may still change.

## Table of contents

- [How it works](#how-it-works)
- [Requirements](#requirements)
- [Quickstart](#quickstart)
- [NodeConfig CRD](#nodeconfig-crd)
- [Pod annotations](#pod-annotations)
- [Use cases](#use-cases)
- [Comparison vs Calico / Cilium](#comparison-vs-calico--cilium)
- [Limitations and non-goals](#limitations-and-non-goals)
- [Building and testing](#building-and-testing)
- [License](#license)

## How it works

Each node runs a single `flock-agent` DaemonSet pod with three containers:

- a privileged init container (`flock-installer`) that drops the CNI plugin binary into `/opt/cni/bin/flock` and writes `/etc/cni/net.d/01-flock.conflist`,
- the agent itself, which owns IPAM, programs veth pairs, and tracks pod readiness, and
- a [BIRD2](https://bird.network.cz/) sidecar that the agent re-renders and reloads when the per-node config or the active anycast set changes.

Each node has a `NodeConfig` CR (cluster-scoped, name = node name) that declares its IPv6 and IPv4 prefixes, its local BGP ASN, and its upstream peers. The agent reads the CR via a dynamic informer.

When kubelet runs the CNI plugin on `ADD`, the plugin opens a unix-socket RPC to the agent. The agent allocates an address from the per-node CIDRs, creates a veth pair, configures the pod side, persists the allocation to `/var/lib/flock/allocations.json`, and returns the result. There is no controller loop and no IPAM coordination across nodes: each node owns a non-overlapping CIDR and allocates locally.

For anycast, the agent installs ` via dev ` host routes on the node and adds the anycast IP to BIRD's BGP export filter. When a pod loses readiness, the agent withdraws the route from both the kernel and BGP within one reconcile cycle (sub-second).

### Packet path

`pod.eth0` (a veth) ↔ host-side veth (with `addrgenmode none`, `fe80::1/64`, proxy-ARP for the v4 default-via) ↔ host kernel ↔ uplink NIC ↔ upstream router. No conntrack, no SNAT, no encapsulation.

For IPv6 the host side of every veth carries the deterministic link-local gateway `fe80::1`, so every pod can use a fixed default route. For IPv4 the host side answers ARP for `169.254.1.1`, providing the same fixed default route in v4.

## Requirements

- Linux nodes. flock has not been tested on, and does not target, Windows nodes.
- Kubernetes ≥ 1.27.
- An upstream router (or pair) that accepts a BGP session from each node. flock has been tested with Cisco IOS-XE, Arista EOS, and FRR acting as the upstream; anything that speaks standard eBGP should work.
- Globally routable (or at least datacentre-routable) IPv6 prefix delegated to the cluster, sliced into a per-node /64. IPv4 is optional but supported.
- Each node must have a unique local ASN. Private ASNs (`64512–65534`, `4200000000–4294967294`) are typical.

## Quickstart

```sh
# 1. Install CRD + RBAC + DaemonSet (single bundled manifest):
kubectl apply -f deploy/install.yaml

# 2. Label the node(s) you want flock to manage:
kubectl label node flock.fritzlab.net/agent=

# 3. Apply a NodeConfig CR for that node (see "NodeConfig CRD" below):
kubectl apply -f my-nodeconfig.yaml

# 4. Verify the agent is up:
kubectl -n kube-system get pod -l app=flock-agent -o wide
kubectl -n kube-system exec -it ds/flock-agent -c bird -- \
  birdc -s /run/flock/bird.ctl show protocols
```

The DaemonSet is gated by the `flock.fritzlab.net/agent` node label, so unlabelled nodes continue to use whatever CNI was installed before. This lets you migrate node-by-node: start with one node, prove it works, then proceed.

## NodeConfig CRD

A `NodeConfig` is the only operator-supplied input. There is one per node, and its name matches the node name. Example:

```yaml
apiVersion: flock.fritzlab.net/v1alpha1
kind: NodeConfig
metadata:
  name: node-a
spec:
  cidr6:
    - 2001:db8:f001::/64     # Pods on this node get addresses from here.
  cidr4:
    - 192.0.2.0/24           # IPv4 pool; used unless IPv4 is turned off for the node or pod.
  defaults:
    ipv6: true               # Optional. Built-in baseline if omitted.
    ipv4: true               # Optional. Built-in baseline if omitted.
  bgp:
    asn: 65101               # This node's local ASN.
    peers:
      - address: 2001:db8::1 # Upstream router (IPv6 session).
        asn: 65000
      - address: 192.0.2.1   # Same router, IPv4 session.
        asn: 65000
```

### `spec.defaults`

`spec.defaults` controls which address families a pod *gets by default* on this node, i.e. when the pod has no explicit `flock.fritzlab.net/ipv6` or `flock.fritzlab.net/ipv4` annotation. Pod annotations always override.

If you omit `spec.defaults` (or any individual field inside it) flock falls back to its built-in baseline of **dual-stack (IPv6 on, IPv4 on)**.

| Goal                     | `spec.defaults`                       |
|--------------------------|---------------------------------------|
| Dual-stack (the default) | omit, or `{ ipv6: true, ipv4: true }` |
| IPv6-only node           | `{ ipv6: true, ipv4: false }`         |
| IPv4-only (legacy node)  | `{ ipv6: false, ipv4: true }`         |

A NodeConfig that resolves to "neither family" is rejected at allocation time, so misconfiguring both to false will surface as an error on the first `CNI ADD`.

### `spec.bgp`

Each `peer` becomes one BGP session. The agent picks a node-local source address on the same subnet as the peer; if there isn't one, BIRD uses its default. Multi-homing (multiple peers per family, or per upstream router pair) is allowed.

## Pod annotations

All annotations live under `flock.fritzlab.net/`. Every annotation is optional; leave them off to inherit the per-node defaults.

| Annotation                   | Type  | Purpose                                                                                       |
|------------------------------|-------|-----------------------------------------------------------------------------------------------|
| `flock.fritzlab.net/ipv6`    | bool  | Override `spec.defaults.ipv6` for this pod (`true`/`false`).                                  |
| `flock.fritzlab.net/ipv4`    | bool  | Override `spec.defaults.ipv4` for this pod (`true`/`false`).                                  |
| `flock.fritzlab.net/cidr6`   | CIDRs | Restrict IPv6 allocation to a sub-range of the node's `cidr6`. Comma-separated.               |
| `flock.fritzlab.net/cidr4`   | CIDRs | Restrict IPv4 allocation to a sub-range of the node's `cidr4`. Comma-separated.               |
| `flock.fritzlab.net/ip-algo` | list  | Embed identity into the IPv6 IID. Subset of `namespace,pod,image`, in order, comma-separated. |
| `flock.fritzlab.net/anycast` | IPs   | Bind these IPs on the pod's `lo`; advertise via BGP while pod is `Ready`. Mixed v6+v4 ok.     |

Bool values must be the literal strings `"true"` or `"false"` (case-insensitive, surrounding whitespace tolerated). Other values, such as `1`, `0`, `yes`, or `no`, are rejected so a typo can't silently flip behaviour.

### Example pods

Default dual-stack, no annotations needed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: minimal
```

IPv6 only, opting out of the default v4 allocation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: v6-only
  annotations:
    flock.fritzlab.net/ipv4: "false"
```

Operator-friendly addressing: `fnv(namespace) | fnv(pod) | random` is packed into the host bits, so a pod's identity is recognisable from its IP in `kubectl get pods -o wide`:

```yaml
metadata:
  annotations:
    flock.fritzlab.net/ip-algo: "namespace,pod"
```

Anycast service: three replicas, each advertising the same v6+v4 anycast pair from the node it lands on. The upstream router does ECMP across the active set:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        flock.fritzlab.net/anycast: "2001:db8:a::53, 192.0.2.53"
    spec:
      containers:
        - name: coredns
          image: coredns/coredns
          readinessProbe:
            httpGet: { path: /ready, port: 8181 }
            periodSeconds: 1
            failureThreshold: 1
```

## Use cases

**Highly-available DNS.** Run N CoreDNS replicas, each annotated with the same `anycast` IP. Point client `/etc/resolv.conf` at the anycast address. Each replica advertises a `/128` from its own node; the upstream router does ECMP. If a pod is lost, traffic fails over within a probe cycle.

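
The advertisement rule behind the DNS use case (one host route per anycast address, and only while the pod is `Ready`) can be sketched in Go. This is a minimal illustration under stated assumptions, not flock's actual code: `anycastPrefixes` is a hypothetical helper, and the real agent may parse and validate the `flock.fritzlab.net/anycast` annotation differently.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// anycastPrefixes is a hypothetical helper (not part of flock's API).
// It parses a comma-separated anycast annotation value and returns one
// host prefix per address: /128 for IPv6, /32 for IPv4.
// It returns nil when the pod is not Ready, because nothing should be
// advertised in that case.
func anycastPrefixes(annotation string, ready bool) ([]netip.Prefix, error) {
	if !ready {
		return nil, nil
	}
	var out []netip.Prefix
	for _, field := range strings.Split(annotation, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate stray commas
		}
		addr, err := netip.ParseAddr(field)
		if err != nil {
			return nil, fmt.Errorf("invalid anycast address %q: %w", field, err)
		}
		// BitLen is 128 for IPv6 and 32 for IPv4, so this is a host route.
		out = append(out, netip.PrefixFrom(addr, addr.BitLen()))
	}
	return out, nil
}

func main() {
	prefixes, err := anycastPrefixes("2001:db8:a::53, 192.0.2.53", true)
	if err != nil {
		panic(err)
	}
	for _, p := range prefixes {
		fmt.Println(p)
	}
}
```

Run with the annotation value from the Deployment example, this prints `2001:db8:a::53/128` and `192.0.2.53/32`. With `ready` set to `false` it returns an empty set, which corresponds to withdrawing the routes.
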
**Replacing a kube-proxy `ClusterIP`.** Headless Service plus an anycast IP gives you a single stable address with load-balancing across pods, without the DNAT-pinning that makes long-lived TCP keepalive connections stick to one backend forever. ECMP routes each new flow independently.

**Per-pod public IPv6.** Because every pod has a globally routable IPv6 address and the cluster does no NAT, a pod's `eth0` IP is reachable from the rest of the internet (subject to your firewall). Useful for things like outgoing SMTP, where you want a stable from-address per pod, or for peer-to-peer protocols that don't tolerate NAT.

**Fast pod identification in `kubectl`.** With `flock.fritzlab.net/ip-algo: namespace,pod` the IPv6 host bits encode the pod's namespace+name, so you can recognise a pod from its IP without a lookup. Reverse-DNS via a wildcard zone makes those IPs human-readable too.

**Static-IP migration.** Annotation-driven address allocation means you can ask for a specific sub-CIDR (`cidr6: 2001:db8:f001::ab00/120`) for services that previously needed pinned IPs (mail server, ingress controller). When the static-IP requirement goes away, drop the annotation and the pod gets a normal allocation.

## Comparison vs Calico / Cilium

|                          | flock                    | Calico                | Cilium                     |
|--------------------------|--------------------------|-----------------------|----------------------------|
| Default address family   | dual (IPv6+IPv4)         | IPv4                  | dual                       |
| BGP                      | yes (BIRD)               | yes                   | optional                   |
| Overlay (VXLAN/IPIP)     | never                    | optional              | yes (geneve) or native     |
| NAT in datapath          | never                    | masquerade by default | masquerade by default      |
| Anycast pod addressing   | first-class              | manual                | optional, via service mesh |
| eBPF datapath            | no                       | optional              | yes                        |
| NetworkPolicy            | yes (nftables)           | yes (Felix)           | yes (eBPF)                 |
| Cluster size target      | small (< 100 nodes)      | thousands             | thousands                  |
| Operational surface area | low (1 DaemonSet, 1 CRD) | medium                | high                       |
| Production-ready         | alpha                    | yes                   | yes                        |

flock is not trying to compete with Calico or Cilium. The right answer for most clusters is one of those two. flock exists for clusters where every node already speaks BGP, the operator wants real (no NAT) IPv6 addressing on every pod, and per-pod anycast is something they actually want to use rather than work around.

## Limitations and non-goals

- NetworkPolicy supports `networking.k8s.io/v1` (ingress + egress, all three peer types, numeric ports + port ranges). Named ports and AdminNetworkPolicy are not yet implemented.
- No NAT, no masquerade, no SNAT-egress. Pods reach the wider internet using their real cluster-routable addresses; if your IPv4 pool isn't routable beyond your network, those pods can't reach v4-only hosts on the public internet without help from your border router.
- No multi-cluster, no peering across clusters.
- Linux-only datapath.
- IPAM is per-node: there's no global allocator and no IP mobility. When a pod moves to a different node it gets a new address.
- The agent is privileged. It mounts `/var/run/netns`, configures veth pairs, manages kernel routes, and holds `CAP_NET_ADMIN`. This is inherent to being a CNI; reducing privilege further is not a goal.
- If BIRD dies but the agent stays up, pods on that node stop being reachable from off-node. The DaemonSet liveness probes catch this.

## Building and testing

```sh
# Unit tests + fuzz seed corpora (fast, ~1s):
go test ./...

# Targeted fuzz pass:
go test -run NEVERMATCH -fuzz=FuzzParseAnnotations -fuzztime=30s ./pkg/agent
go test -run NEVERMATCH -fuzz=FuzzRender           -fuzztime=30s ./pkg/routing/bird
go test -run NEVERMATCH -fuzz=FuzzEmbed            -fuzztime=30s ./pkg/embed
go test -run NEVERMATCH -fuzz=FuzzIPAM_Allocate    -fuzztime=30s ./pkg/agent

# Build the container image (used by the DaemonSet):
docker build -t flock:dev .
```

The fuzz tests are also run as plain unit tests via their seed corpora, so every `go test ./...` exercises the discovered edge cases as regressions.

`pkg/agent` has Linux-only files (`*_linux.go`) for netlink and netns work; the macOS/Windows build pulls in stubs from `*_stub.go` so tests run cleanly on developer laptops.

## License

Apache 2.0. See [LICENSE](LICENSE).