# flock

A small, opinionated Kubernetes CNI built around three ideas:

1. **IPv6-first.** Every pod gets a globally routable IPv6 address. IPv4 is
   per-pod opt-in for legacy clients.
2. **No tunnels, no NAT.** Pod addresses are the real addresses in the
   packets on the wire. Each node speaks BGP to its upstream router and
   advertises its own per-node prefix. The pod network is just the LAN,
   plus host routes.
3. **Anycast as a primitive.** A pod can request an anycast address via
   an annotation; flock binds it on the pod's loopback and advertises a
   `/128` (or `/32`) over BGP, but only while the pod is `Ready`. Multiple
   replicas advertise the same address from different nodes for ECMP load
   balancing without a separate Service or external LB.

flock is built for clusters where every node already speaks BGP to one
or more upstream routers. It deliberately leaves out features you'd
expect from a general-purpose CNI (overlays, IPsec/WireGuard, IPAM
coordination across nodes, kube-proxy integration) so the moving parts
that remain are easy to reason about.

> **Status:** alpha. CRD shape and annotation keys may still change.

## Table of contents

- [How it works](#how-it-works)
- [Requirements](#requirements)
- [Quickstart](#quickstart)
- [NodeConfig CRD](#nodeconfig-crd)
- [Pod annotations](#pod-annotations)
- [Use cases](#use-cases)
- [Comparison vs Calico / Cilium](#comparison-vs-calico--cilium)
- [Limitations and non-goals](#limitations-and-non-goals)
- [Building and testing](#building-and-testing)
- [License](#license)

## How it works

Each node runs a single `flock-agent` DaemonSet pod with three containers:

- a privileged init container (`flock-installer`) that drops the CNI
  plugin binary into `/opt/cni/bin/flock` and writes
  `/etc/cni/net.d/01-flock.conflist`,
- the agent itself, which owns IPAM, programs veth pairs, and tracks
  pod readiness, and
- a [BIRD2](https://bird.network.cz/) sidecar that the agent re-renders
  and reloads when the per-node config or the active anycast set changes.

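The re-render step for the BIRD sidecar can be pictured as "render deterministically, reload only if the text changed". The sketch below is one way to do that with Go's `text/template`; the `Peer` and `Config` types, the `peerN` protocol names, and the emitted fragment are assumptions for illustration. The output is only BIRD-like (channels and export filters are left out), so it is not a drop-in configuration and may not match flock's real template.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"text/template"
)

// Peer and Config are hypothetical types for this sketch, not flock's API.
type Peer struct {
	Address string
	ASN     uint32
}

type Config struct {
	LocalASN uint32
	Peers    []Peer
	Anycast  []string // prefixes that currently have a Ready pod
}

// The fragment is BIRD-like only; channels and export filters are omitted.
var birdTmpl = template.Must(template.New("bird").Parse(
	`{{range .Anycast}}# anycast {{.}}
{{end}}{{range $i, $p := .Peers}}protocol bgp peer{{$i}} {
  local as {{$.LocalASN}};
  neighbor {{$p.Address}} as {{$p.ASN}};
}
{{end}}`))

// Render sorts the anycast set first so that equal inputs always produce
// byte-identical output, whatever order the pods were observed in.
func Render(cfg Config) (string, error) {
	sorted := append([]string(nil), cfg.Anycast...)
	sort.Strings(sorted)
	cfg.Anycast = sorted
	var b strings.Builder
	if err := birdTmpl.Execute(&b, cfg); err != nil {
		return "", err
	}
	return b.String(), nil
}

// NeedsReload reports whether the sidecar has to be told to reload.
func NeedsReload(prev, next string) bool { return prev != next }

func main() {
	cfg := Config{LocalASN: 65101, Peers: []Peer{{"2001:db8::1", 65000}}}
	before, _ := Render(cfg)
	cfg.Anycast = []string{"2001:db8:a::53/128"}
	after, _ := Render(cfg)
	fmt.Print(after)
	fmt.Println("reload needed:", NeedsReload(before, after))
}
```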

Each node has a `NodeConfig` CR (cluster-scoped, name = node name) that
declares its IPv6 and IPv4 prefixes, its local BGP ASN, and its upstream
peers. The agent reads the CR via a dynamic informer.

When the container runtime (on behalf of the kubelet) runs the CNI plugin
for `ADD`, the plugin opens a unix-socket RPC to the agent. The agent
allocates an address from the per-node CIDRs, creates a veth pair,
configures the pod side, persists the allocation to
`/var/lib/flock/allocations.json`, and returns the result. There is no
controller loop and no IPAM coordination across nodes: each node owns a
non-overlapping CIDR and allocates locally.

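Node-local allocation with a persisted table can be sketched as follows. This is a minimal illustration, assuming a first-free allocator and a JSON map keyed by container ID; the `Allocator` type, the file layout, and the allocation order are hypothetical and are not taken from flock's source.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/netip"
	"os"
	"path/filepath"
)

// Allocator is a hypothetical per-node allocator for one CIDR.
type Allocator struct {
	prefix netip.Prefix
	path   string                // persistence file; "" keeps the table in memory
	byID   map[string]netip.Addr // container ID -> address
}

func NewAllocator(cidr, path string) (*Allocator, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	return &Allocator{prefix: p.Masked(), path: path, byID: map[string]netip.Addr{}}, nil
}

// Allocate is idempotent per container ID. New IDs get the first free
// address after the network address, and the table is saved before returning.
func (a *Allocator) Allocate(id string) (netip.Addr, error) {
	if addr, ok := a.byID[id]; ok {
		return addr, nil
	}
	used := make(map[netip.Addr]bool, len(a.byID))
	for _, addr := range a.byID {
		used[addr] = true
	}
	for addr := a.prefix.Addr().Next(); a.prefix.Contains(addr); addr = addr.Next() {
		if used[addr] {
			continue
		}
		a.byID[id] = addr
		if err := a.save(); err != nil {
			delete(a.byID, id)
			return netip.Addr{}, err
		}
		return addr, nil
	}
	return netip.Addr{}, errors.New("prefix exhausted")
}

// save writes to a temporary file and renames it, so a crash mid-write
// does not leave a truncated table behind.
func (a *Allocator) save() error {
	if a.path == "" {
		return nil
	}
	data, err := json.MarshalIndent(a.byID, "", "  ")
	if err != nil {
		return err
	}
	tmp := a.path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, a.path)
}

func main() {
	dir, err := os.MkdirTemp("", "flock-ipam")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	a, err := NewAllocator("2001:db8:f001::/64", filepath.Join(dir, "allocations.json"))
	if err != nil {
		panic(err)
	}
	for _, id := range []string{"ctr-a", "ctr-b", "ctr-a"} {
		addr, err := a.Allocate(id)
		fmt.Println(id, addr, err)
	}
}
```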

For anycast, the agent installs `<anycast-ip> via <pod-eth0-ip> dev <veth>`
host routes on the node and adds the anycast IP to BIRD's BGP export
filter. When a pod loses readiness, the agent withdraws the route from
both the kernel and BGP within one reconcile cycle (sub-second).

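One reconcile cycle for anycast boils down to computing the set of addresses that still have a `Ready` pod and diffing it against what is installed. The sketch below shows that shape only; `PodState`, `Desired`, and `Diff` are hypothetical names, and it tracks addresses alone rather than the per-pod `via`/`dev` details of the real host routes.

```go
package main

import (
	"fmt"
	"sort"
)

// PodState is a hypothetical view of one pod running on this node.
type PodState struct {
	Name    string
	Ready   bool
	Anycast []string
}

// Desired returns the sorted anycast addresses backed by at least one Ready pod.
func Desired(pods []PodState) []string {
	set := map[string]bool{}
	for _, p := range pods {
		if !p.Ready {
			continue
		}
		for _, ip := range p.Anycast {
			set[ip] = true
		}
	}
	out := make([]string, 0, len(set))
	for ip := range set {
		out = append(out, ip)
	}
	sort.Strings(out)
	return out
}

// Diff returns the addresses to install and the addresses to withdraw.
func Diff(installed, desired []string) (add, withdraw []string) {
	want := map[string]bool{}
	for _, ip := range desired {
		want[ip] = true
	}
	have := map[string]bool{}
	for _, ip := range installed {
		have[ip] = true
		if !want[ip] {
			withdraw = append(withdraw, ip)
		}
	}
	for _, ip := range desired {
		if !have[ip] {
			add = append(add, ip)
		}
	}
	return add, withdraw
}

func main() {
	pods := []PodState{
		{Name: "dns-0", Ready: true, Anycast: []string{"2001:db8:a::53"}},
		{Name: "dns-1", Ready: false, Anycast: []string{"192.0.2.53"}},
	}
	add, withdraw := Diff([]string{"192.0.2.53"}, Desired(pods))
	fmt.Println("install:", add, "withdraw:", withdraw)
}
```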
### Packet path

`pod.eth0` (a veth) ↔ host-side veth (with `addrgenmode none`,
`fe80::1/64`, proxy-ARP for the v4 default-via) ↔ host kernel ↔ uplink
NIC ↔ upstream router. No conntrack, no SNAT, no encapsulation.

For IPv6 the host side of every veth carries the deterministic link-local
gateway `fe80::1`, so every pod can use a fixed default route. For IPv4
the host side answers ARP for `169.254.1.1`, providing the same fixed
default route in v4.

## Requirements

- Linux nodes. flock has not been tested on, and does not target,
  Windows nodes.
- Kubernetes ≥ 1.27.
- An upstream router (or pair) that accepts a BGP session from each
  node. flock has been tested with Cisco IOS-XE, Arista EOS, and FRR
  acting as the upstream; anything that speaks standard eBGP should work.
- A globally routable (or at least datacentre-routable) IPv6 prefix
  delegated to the cluster and sliced into one /64 per node. IPv4 is
  optional but supported.
- Each node must have a unique local ASN. Private ASNs (`64512–65534`,
  `4200000000–4294967294`) are typical.

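Two of the requirements above are easy to check mechanically when planning a cluster: whether an ASN falls in a private-use range, and which /64 a given node receives from the delegated prefix. The sketch below is a planning aid only; `NodePrefix` and its "node index" numbering are assumptions, since flock itself takes the per-node prefix from the `NodeConfig` and does not slice anything for you.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// IsPrivateASN reports whether asn lies in one of the private-use ranges
// quoted above (16-bit and 32-bit).
func IsPrivateASN(asn uint32) bool {
	return (asn >= 64512 && asn <= 65534) ||
		(asn >= 4200000000 && asn <= 4294967294)
}

// NodePrefix returns the index-th /64 inside the cluster prefix.
// The index scheme is hypothetical; any non-overlapping assignment works.
func NodePrefix(cluster string, index uint64) (string, error) {
	p, err := netip.ParsePrefix(cluster)
	if err != nil {
		return "", err
	}
	if !p.Addr().Is6() || p.Bits() > 64 {
		return "", fmt.Errorf("%s: need an IPv6 prefix of /64 or shorter", cluster)
	}
	span := 64 - p.Bits() // bits available for numbering nodes
	if span < 64 && index >= uint64(1)<<span {
		return "", fmt.Errorf("index %d does not fit in %s", index, cluster)
	}
	b := p.Masked().Addr().As16()
	hi := binary.BigEndian.Uint64(b[:8]) | index
	binary.BigEndian.PutUint64(b[:8], hi)
	return netip.PrefixFrom(netip.AddrFrom16(b), 64).String(), nil
}

func main() {
	fmt.Println(IsPrivateASN(65101), IsPrivateASN(64511))
	for i := uint64(0); i < 3; i++ {
		pfx, err := NodePrefix("2001:db8:f001::/48", i)
		fmt.Println(i, pfx, err)
	}
}
```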
## Quickstart

```sh
# 1. Install CRD + RBAC + DaemonSet (single bundled manifest):
kubectl apply -f deploy/install.yaml

# 2. Label the node(s) you want flock to manage:
kubectl label node <node-name> flock.fritzlab.net/agent=

# 3. Apply a NodeConfig CR for that node (see "NodeConfig CRD" below):
kubectl apply -f my-nodeconfig.yaml

# 4. Verify the agent is up:
kubectl -n kube-system get pod -l app=flock-agent -o wide
kubectl -n kube-system exec -it ds/flock-agent -c bird -- \
  birdc -s /run/flock/bird.ctl show protocols
```

The DaemonSet is gated by the `flock.fritzlab.net/agent` node label, so
unlabelled nodes continue to use whatever CNI was installed before. This
lets you migrate node by node: start with one node, prove it works, then
proceed.

## NodeConfig CRD

A `NodeConfig` is the only operator-supplied input. One per node, name
matches the node name. Example:

```yaml
apiVersion: flock.fritzlab.net/v1alpha1
kind: NodeConfig
metadata:
  name: node-a
spec:
  cidr6:
    - 2001:db8:f001::/64     # Pods on this node get addresses from here.
  cidr4:
    - 192.0.2.0/24           # IPv4 pool, used only when a pod opts in.
  defaults:
    ipv6: true               # Optional. Built-in baseline if omitted.
    ipv4: false              # Optional. Built-in baseline if omitted.
  bgp:
    asn: 65101               # This node's local ASN.
    peers:
      - address: 2001:db8::1 # Upstream router (IPv6 session).
        asn: 65000
      - address: 192.0.2.1   # Same router, IPv4 session.
        asn: 65000
```

### `spec.defaults`

`spec.defaults` controls which address families a pod *gets by default*
on this node, i.e. when the pod has no explicit `flock.fritzlab.net/ipv6`
or `flock.fritzlab.net/ipv4` annotation. Pod annotations always override.
If you omit `spec.defaults` (or any individual field inside it) flock
falls back to its built-in baseline of **IPv6 on, IPv4 off**.

| Goal                      | `spec.defaults`                        |
|---------------------------|----------------------------------------|
| IPv6-only (the default)   | omit, or `{ ipv6: true, ipv4: false }` |
| Dual-stack by default     | `{ ipv6: true, ipv4: true }`           |
| IPv4-only (legacy node)   | `{ ipv6: false, ipv4: true }`          |

A NodeConfig that resolves to "neither family" is rejected at allocation
time, so setting both to `false` by mistake surfaces as an error on the
first `CNI ADD`.

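The precedence described here (pod annotation, then `spec.defaults`, then the built-in baseline) can be written as a three-level lookup. The following is a sketch under assumed names: `Toggle`, `Families`, and `Resolve` are hypothetical, and the pod overrides are taken as already-parsed optional booleans.

```go
package main

import (
	"errors"
	"fmt"
)

// Families says which address families a pod ends up with.
type Families struct{ IPv6, IPv4 bool }

// Toggle holds optional per-family switches; nil means "not set".
// It is used both for spec.defaults and for pod-level overrides.
type Toggle struct{ IPv6, IPv4 *bool }

// choose applies the precedence: pod override, node default, baseline.
func choose(override, def *bool, baseline bool) bool {
	if override != nil {
		return *override
	}
	if def != nil {
		return *def
	}
	return baseline
}

// Resolve uses the baseline "IPv6 on, IPv4 off" and rejects the
// combination in which neither family is enabled.
func Resolve(nodeDefaults, podOverride Toggle) (Families, error) {
	f := Families{
		IPv6: choose(podOverride.IPv6, nodeDefaults.IPv6, true),
		IPv4: choose(podOverride.IPv4, nodeDefaults.IPv4, false),
	}
	if !f.IPv6 && !f.IPv4 {
		return Families{}, errors.New("neither IPv6 nor IPv4 is enabled")
	}
	return f, nil
}

func main() {
	yes := true
	fmt.Println(Resolve(Toggle{}, Toggle{}))           // baseline only
	fmt.Println(Resolve(Toggle{}, Toggle{IPv4: &yes})) // pod opts in to IPv4
}
```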
### `spec.bgp`

Each `peer` becomes one BGP session. The agent picks a node-local source
address on the same subnet as the peer; if there isn't one, BIRD uses
its default. Multi-homing (multiple peers per family, or per upstream
router pair) is allowed.

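The "same subnet as the peer" rule can be illustrated with `net/netip`. This is a sketch, assuming the node's interface addresses are available in CIDR form; `SourceFor` is a hypothetical helper, not a function from flock.

```go
package main

import (
	"fmt"
	"net/netip"
)

// SourceFor returns the first local address whose subnet contains peer.
// ok is false when nothing matches, in which case the caller would leave
// source selection to BIRD.
func SourceFor(peer string, local []string) (src string, ok bool, err error) {
	p, err := netip.ParseAddr(peer)
	if err != nil {
		return "", false, err
	}
	for _, cidr := range local {
		pfx, err := netip.ParsePrefix(cidr)
		if err != nil {
			return "", false, err
		}
		// Contains is false for mixed families, so a v4 interface never
		// matches a v6 peer and vice versa.
		if pfx.Masked().Contains(p) && pfx.Addr() != p {
			return pfx.Addr().String(), true, nil
		}
	}
	return "", false, nil
}

func main() {
	local := []string{"192.0.2.10/24", "2001:db8::10/64"}
	for _, peer := range []string{"2001:db8::1", "192.0.2.1", "198.51.100.1"} {
		src, ok, err := SourceFor(peer, local)
		fmt.Println(peer, src, ok, err)
	}
}
```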
## Pod annotations

All annotations live under `flock.fritzlab.net/`. Every annotation is
optional; leave them off to inherit the per-node defaults.

| Annotation                          | Type   | Purpose                                                                                       |
|-------------------------------------|--------|-----------------------------------------------------------------------------------------------|
| `flock.fritzlab.net/ipv6`           | bool   | Override `spec.defaults.ipv6` for this pod (`true`/`false`).                                  |
| `flock.fritzlab.net/ipv4`           | bool   | Override `spec.defaults.ipv4` for this pod (`true`/`false`).                                  |
| `flock.fritzlab.net/cidr6`          | CIDRs  | Restrict IPv6 allocation to a sub-range of the node's `cidr6`. Comma-separated.               |
| `flock.fritzlab.net/cidr4`          | CIDRs  | Restrict IPv4 allocation to a sub-range of the node's `cidr4`. Comma-separated.               |
| `flock.fritzlab.net/ip-algo`        | list   | Embed identity into the IPv6 IID. Subset of `namespace,pod,image`, in order, comma-separated. |
| `flock.fritzlab.net/anycast`        | IPs    | Bind these IPs on the pod's `lo`; advertise via BGP while pod is `Ready`. Mixed v6+v4 ok.     |

Bool values must be the literal strings `"true"` or `"false"`
(case-insensitive, surrounding whitespace tolerated). Other values,
such as `1`, `0`, `yes`, or `no`, are rejected so a typo can't silently
flip behaviour.

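The bool rule is stricter than Go's `strconv.ParseBool`, which accepts `1`, `0`, `t`, and `f`. A minimal sketch of the stricter behaviour described above, with `ParseBool` as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// ParseBool accepts only "true" or "false", ignoring case and
// surrounding whitespace. Everything else is an error.
func ParseBool(raw string) (bool, error) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "true":
		return true, nil
	case "false":
		return false, nil
	}
	return false, fmt.Errorf("invalid bool %q: want \"true\" or \"false\"", raw)
}

func main() {
	for _, in := range []string{"true", " False ", "yes", "1"} {
		v, err := ParseBool(in)
		fmt.Printf("%q -> %v, %v\n", in, v, err)
	}
}
```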
### Example pods

Default IPv6-only, no annotations needed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: minimal
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder; any image works
```

Dual-stack on a node whose default is IPv6-only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-client
  annotations:
    flock.fritzlab.net/ipv4: "true"
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # placeholder; any image works
```

Operator-friendly addressing: `fnv(namespace) | fnv(pod) | random` is
packed into the host bits, so a pod's identity is recognisable from
its IP in `kubectl get pods -o wide`:

```yaml
metadata:
  annotations:
    flock.fritzlab.net/ip-algo: "namespace,pod"
```

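To make the `fnv(namespace) | fnv(pod) | random` packing concrete, here is a sketch of one possible layout. The 16/16/32-bit split, the fold of a 32-bit FNV-1a hash down to 16 bits, and the `Embed` helper are all assumptions for illustration; the field widths flock really uses are not stated here, and the `image` component is left out.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"net/netip"
)

// hash16 folds a 32-bit FNV-1a hash into 16 bits.
func hash16(s string) uint16 {
	h := fnv.New32a()
	h.Write([]byte(s))
	sum := h.Sum32()
	return uint16(sum>>16) ^ uint16(sum)
}

// Embed builds an address inside a /64 whose interface ID is
// hash16(namespace) | hash16(pod) | nonce. The caller supplies the nonce,
// for example from a random source.
func Embed(prefix, namespace, pod string, nonce uint32) (netip.Addr, error) {
	p, err := netip.ParsePrefix(prefix)
	if err != nil {
		return netip.Addr{}, err
	}
	if !p.Addr().Is6() || p.Bits() != 64 {
		return netip.Addr{}, fmt.Errorf("%s: need an IPv6 /64", prefix)
	}
	b := p.Masked().Addr().As16()
	binary.BigEndian.PutUint16(b[8:10], hash16(namespace))
	binary.BigEndian.PutUint16(b[10:12], hash16(pod))
	binary.BigEndian.PutUint32(b[12:16], nonce)
	return netip.AddrFrom16(b), nil
}

func main() {
	for _, pod := range []string{"dns-0", "dns-1"} {
		addr, err := Embed("2001:db8:f001::/64", "kube-system", pod, 1)
		fmt.Println(pod, addr, err)
	}
}
```

With this layout, pods in the same namespace share the fifth hextet of their address, which is what makes them recognisable at a glance.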

Anycast service: three replicas, each advertising the same v6+v4
anycast pair from the node it lands on. The upstream router does ECMP
across the active set:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dns
  template:
    metadata:
      labels:
        app: dns
      annotations:
        flock.fritzlab.net/ipv4: "true"
        flock.fritzlab.net/anycast: "2001:db8:a::53, 192.0.2.53"
    spec:
      containers:
        - name: coredns
          image: coredns/coredns
          readinessProbe:
            httpGet: { path: /ready, port: 8181 }
            periodSeconds: 1
            failureThreshold: 1
```

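The `anycast` annotation value above mixes families in one comma-separated string. A sketch of how such a value could be split into the host prefixes that get advertised (`/128` for IPv6, `/32` for IPv4); `ParseAnycast` is a hypothetical helper, and skipping empty fields is an assumption:

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// ParseAnycast turns "ip, ip, ..." into host prefixes, one per address.
func ParseAnycast(raw string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, field := range strings.Split(raw, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate trailing commas (an assumption)
		}
		addr, err := netip.ParseAddr(field)
		if err != nil {
			return nil, fmt.Errorf("anycast %q: %w", field, err)
		}
		// BitLen is 32 for IPv4 and 128 for IPv6, giving a host route.
		out = append(out, netip.PrefixFrom(addr, addr.BitLen()))
	}
	return out, nil
}

func main() {
	prefixes, err := ParseAnycast("2001:db8:a::53, 192.0.2.53")
	fmt.Println(prefixes, err)
}
```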
## Use cases

**Highly-available DNS.** Run N CoreDNS replicas, each annotated with
the same `anycast` IP. Point client `/etc/resolv.conf` at the anycast
address. Each replica advertises a `/128` from its own node; the
upstream router does ECMP. Lose a pod and traffic fails over within a
probe cycle.

**Replacing a kube-proxy `ClusterIP`.** Headless Service plus an anycast
IP gives you a single stable address with load-balancing across pods,
without the DNAT-pinning that makes long-lived TCP keepalive connections
stick to one backend forever. ECMP hashes each new flow independently.

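How ECMP spreads flows is router-specific, but the general idea is a hash over the flow's 5-tuple that selects one of the advertised next hops. The sketch below uses FNV-1a and a simple modulo purely as an illustration; real routers use their own hash functions and often resilient hashing, and the `Flow` type and `NextHop` function are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Flow is the classic 5-tuple.
type Flow struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            uint8
}

// NextHop picks a next hop for the flow. The same flow always maps to the
// same hop while the hop list is unchanged; different flows spread out.
func NextHop(f Flow, hops []string) string {
	if len(hops) == 0 {
		return ""
	}
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%d", f.SrcIP, f.DstIP, f.SrcPort, f.DstPort, f.Proto)
	return hops[h.Sum32()%uint32(len(hops))]
}

func main() {
	hops := []string{"node-a", "node-b", "node-c"}
	for port := uint16(40000); port < 40005; port++ {
		f := Flow{"2001:db8::100", "2001:db8:a::53", port, 53, 17}
		fmt.Println(port, "->", NextHop(f, hops))
	}
}
```

Note that with plain modulo hashing, a change in the hop list can move existing flows to a different replica, which matters more for long-lived TCP than for DNS over UDP.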

**Per-pod public IPv6.** Because every pod has a globally routable IPv6
address and the cluster does no NAT, a pod's `eth0` IP is reachable from
the rest of the internet (subject to your firewall). Useful for things
like outgoing SMTP, where you want a stable from-address per pod, or for
peer-to-peer protocols that don't tolerate NAT.

**Fast pod identification in `kubectl`.** With
`flock.fritzlab.net/ip-algo: namespace,pod` the IPv6 host bits encode
the pod's namespace+name, so you can recognise a pod from its IP without
a lookup. Reverse-DNS via a wildcard zone makes those IPs human-readable
too.

**Static-IP migration.** Annotation-driven address allocation means you
can ask for a specific sub-CIDR (`cidr6: 2001:db8:f001::ab00/120`) for
services that previously needed pinned IPs (mail server, ingress
controller). When the static-IP requirement goes away, drop the
annotation and the pod gets a normal allocation.

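A requested sub-CIDR only makes sense if it sits inside the node's own prefix. The check below is a sketch of that containment test with `net/netip`; `WithinNode` is a hypothetical helper, and whether flock rejects or ignores an out-of-range request is not specified here.

```go
package main

import (
	"fmt"
	"net/netip"
)

// WithinNode reports whether sub is fully contained in node.
func WithinNode(node, sub string) (bool, error) {
	n, err := netip.ParsePrefix(node)
	if err != nil {
		return false, err
	}
	s, err := netip.ParsePrefix(sub)
	if err != nil {
		return false, err
	}
	n, s = n.Masked(), s.Masked()
	// sub must be at least as specific as node and start inside it.
	// Contains is false when the address families differ.
	return s.Bits() >= n.Bits() && n.Contains(s.Addr()), nil
}

func main() {
	node := "2001:db8:f001::/64"
	for _, sub := range []string{"2001:db8:f001::ab00/120", "2001:db8:f002::/120"} {
		ok, err := WithinNode(node, sub)
		fmt.Println(sub, ok, err)
	}
}
```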
## Comparison vs Calico / Cilium

|                          | flock                       | Calico                       | Cilium                       |
|--------------------------|-----------------------------|------------------------------|------------------------------|
| Default address family   | IPv6                        | IPv4                         | dual                         |
| BGP                      | yes (BIRD)                  | yes                          | optional                     |
| Overlay (VXLAN/IPIP)     | never                       | optional                     | VXLAN/Geneve or native       |
| NAT in datapath          | never                       | masquerade by default        | masquerade by default        |
| Anycast pod addressing   | first-class                 | manual                       | optional, via service mesh   |
| eBPF datapath            | no                          | optional                     | yes                          |
| NetworkPolicy            | not yet                     | yes (Felix)                  | yes (eBPF)                   |
| Cluster size target      | small (< 100 nodes)         | thousands                    | thousands                    |
| Operational surface area | low (1 DaemonSet, 1 CRD)    | medium                       | high                         |
| Production-ready         | alpha                       | yes                          | yes                          |

flock is not trying to compete with Calico or Cilium. The right answer
for most clusters is one of those two. flock exists for clusters where
every node already speaks BGP, the operator wants to think in IPv6-first
terms, and per-pod anycast is something they actually want to use rather
than work around.

## Limitations and non-goals

- No NetworkPolicy enforcement yet (planned).
- No NAT, no masquerade, no SNAT-egress. If your pods need to reach a
  legacy IPv4-only destination, give them an IPv4 address explicitly.
- No multi-cluster, no peering across clusters.
- Linux-only datapath.
- IPAM is per-node: there's no global allocator and no IP mobility.
  When a pod moves to a different node it gets a new address.
- The agent is privileged. It mounts `/var/run/netns`, configures veth
  pairs, manages kernel routes, and holds `CAP_NET_ADMIN`. This is
  inherent to being a CNI; reducing privilege further is not a goal.
- If BIRD dies but the agent stays up, pods on that node stop being
  reachable from off-node. The DaemonSet liveness probes catch this.

## Building and testing

```sh
# Unit tests + fuzz seed corpora (fast, ~1s):
go test ./...

# Targeted fuzz pass:
go test -run NEVERMATCH -fuzz=FuzzParseAnnotations -fuzztime=30s ./pkg/agent
go test -run NEVERMATCH -fuzz=FuzzRender -fuzztime=30s ./pkg/routing/bird
go test -run NEVERMATCH -fuzz=FuzzEmbed -fuzztime=30s ./pkg/embed
go test -run NEVERMATCH -fuzz=FuzzIPAM_Allocate -fuzztime=30s ./pkg/agent

# Build the container image (used by the DaemonSet):
docker build -t flock:dev .
```

The fuzz tests are also run as plain unit tests via their seed corpora,
so every `go test ./...` exercises the discovered edge cases as
regressions.

`pkg/agent` has Linux-only files (`*_linux.go`) for netlink and netns
work; the macOS/Windows build pulls in stubs from `*_stub.go` so tests
run cleanly on developer laptops.

## License

Apache 2.0; see [LICENSE](LICENSE).