Add flock.fritzlab.net/ip-algo as a node-wide default via NodeConfig
metadata.annotations. Pod-level annotation still wins. Empty, missing,
or invalid input at either level falls through to the next; invalid
values warn-log via the agent's slog. Both unset → fully random IID
(unchanged baseline).
ParseAnnotations no longer touches ip-algo; ResolveIPAlgo handles the
full precedence chain, called from PodHandler.Add with the cached
NodeConfig's annotations and the agent logger.
Tests: 9 new TestResolveIPAlgo_* cases covering pod-wins, all
fall-through paths, both-absent, nil node map, whitespace, and
duplicate-as-invalid. Fuzz target rebuilt without ip-algo input space
(now exercised by ResolveIPAlgo unit tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BuiltinFamilyDefaults() now returns {WantV6: true, WantV4: true}. Pods
that want a single family explicitly opt out via the
flock.fritzlab.net/ipv4 (or ipv6) annotation, or the operator narrows
the default at the node level via NodeConfig.Spec.Defaults.
Annotation precedence is unchanged: pod annotation > NodeConfig defaults
> built-in baseline. Tests updated to reflect the new baseline; the
"opt out of v4" path now has explicit coverage.
Docs updated:
- NodeConfig.Spec.Defaults Go doc + CRD descriptions reflect the new
baseline and its overrides
- README opening framing softened from "IPv6-first" to "dual-stack,
IPv6-friendly"; example pods + spec.defaults table flipped to
treat dual-stack as the default and v6/v4-only as overrides
- README NetworkPolicy line in the comparison table flipped to
"yes (nftables)" since v1 enforcement shipped
- Limitations note about IPv4-only destinations rewritten — every
pod has v4 by default now, so the question is whether your IPv4
pool is routable beyond your network
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Move pure resolver logic out of anycast_linux.go into anycast.go so it's
unit-testable on any host. Reshape anycastTarget from a single
{hostIface, via} into a sorted list of nexthops; multiple Ready pods on
the same node binding the same anycast IP now contribute one nexthop
each.
installAnycastRoute uses RTA_MULTIPATH (via netlink.Route.MultiPath)
when the target has more than one nexthop. Single-nexthop targets keep
the simple via-route shape so 1-pod-per-node keeps rendering identically
to today's production form in `ip route show`.
flock-agent writes net.ipv{4,6}.fib_multipath_hash_policy = 1 at
startup so the kernel hashes flows on (saddr, daddr, sport, dport, proto)
rather than just IPs. Best-effort — runs privileged in production, so
it works; falls back to L3 hash on environments where the write fails
(only matters for the multi-pod-per-node case anyway).
resolveAnycastTargets sorts nexthops by canonical(via) for stable
comparison so a quiet reconcile pass doesn't churn the kernel route.
8 new unit tests cover: 1-pod, 2-pods-same-anycast (multi-nexthop),
NotReady drop, no-Ready omits the IP, pending skipped, mixed v6+v4,
family mismatch warns, determinism.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The previous base-chain jump matched iifname/oifname AND saddr/daddr ==
pod eth0 IP. Anycast traffic has the anycast IP as daddr, not the pod's
eth0 unicast — so anycast packets skipped the policy chain entirely and
fell through to the forward chain's policy=accept.
The veth uniquely belongs to one pod. Anything traversing it is to or
from that pod by definition (anycast, unicast, future overlay routes).
Match on iifname/oifname alone; let the pod-side chain's accept lines +
trailing drop be the policy.
Validated end-to-end on host001: anycast nginx pod with default-deny
ingress NetPol now correctly drops traffic from any peer; adding an
allow-from-podSelector rule unblocks only the matched peer.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
New pkg/agent/netpol implementing standard networking.k8s.io/v1
NetworkPolicy. Pipeline:
pods + policies + namespaces → Translate → Render → Apply
Supports ingress + egress, all three peer types (podSelector,
namespaceSelector, ipBlock with except), numeric ports + port ranges,
default-deny semantics derived from PolicyTypes (or inferred from
non-empty Spec.Egress when unset).
Apply path is `nft -f -` shell-out — single transaction, atomic, kernel
guarantees partial-failure rollback. Idempotent dedup via last-applied
script. Reconcile triggers: informer events, 30s self-heal tick, every
CNI ADD/DEL.
Verified against the three live cluster NetPols (calico-apiserver,
remote-proxies/lodge-home-assistant, storage/garage-admin-restrict).
Fuzz target stitches Translate + Render with random selector and peer
inputs; 21 unit tests cover the policy semantics.
Named ports skip with a warn — deferred until kubelet exposes them in a
form that doesn't require shadowing pod state.
Dockerfile: + nftables.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
NodeConfig.Spec.Defaults adds per-node IPv6/IPv4 family defaults that pod
annotations can override; built-in baseline (v6=true, v4=false) still
applies when the field is omitted.
bird.Render now validates every operator-supplied value (peer addresses,
CIDRs, anycast IPs, source addresses) before templating — fuzz found a
peer address containing `}` produced unbalanced braces in bird.conf.
Failing input preserved as a regression seed.
Fuzz targets added for ParseAnnotations, ParseCNIArgs, HostIfaceName,
canonical, IPAM allocate sequences, embed.Embed, and bird.Render.
Hardened canonical/ipToU32 against nil and non-IPv4 inputs.
README rewritten for outside readers — quickstart, NodeConfig + annotation
reference with worked examples, anycast use cases, comparison vs Calico
and Cilium, requirements, limitations.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
A single Ready/NotReady transition no longer pays a 500ms reload wait —
the first call to scheduleReload fires birdc immediately; further calls
within 500ms are coalesced into one tail reload at the cooldown's end.
Burst behavior is the same as before: under heavy churn (deploy rolling
all replicas at once), at most one reload per 500ms.
Steady-state latency from pod Ready transition to crt001 BGP withdraw:
- probe period (set in pod spec, 1s minimum)
- ~ms informer + reconcile + birdc + BGP UPDATE
The 500ms hardcoded delay is gone.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Two coupled changes that fix the anycast advertisement path:
1. Add anycast /128 + /32 prefixes as `route … blackhole` lines in the
protocol static stanzas. BIRD's master tables pick them up at
preference 200 — higher than kernel-learned routes — so they're the
ones the BGP export filter sees.
2. The kernel protocol's export filter now rejects RTS_STATIC. Without
this, BIRD would push its blackhole back into the kernel, clobbering
the agent-installed `<anycast> via <pod-eth0> dev flock<8hex>` route
that's actually responsible for forwarding to the pod.
Result: BIRD has the route to advertise via BGP; the kernel has the
right route to forward; nothing fights over the kernel table.
Replaces the abandoned `gateway recursive` attempt — that's a BIRD 1.x
keyword, not BIRD 2.15.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Default is `gateway direct` — BIRD silently rejects kernel routes whose
via address isn't on a directly-connected network interface. Our anycast
host routes use a pod /128 (or /32) as via, which is itself a kernel
route on a flock veth, not a connected network. With `gateway
recursive`, BIRD does a recursive lookup and accepts the kernel route.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Reverts the eth0-placement hack from e1e9544. The design doc's lo
placement is correct.
Real fix: the host's anycast /128 (or /32) route now uses the pod's own
eth0 unicast IP (same family) as the route's `via` next-hop. The kernel
then does NDP/ARP for that eth0 IP — which IS configured on the pod's
eth0 — so the pod responds normally with no proxy_ndp / proxy_arp
trickery on the anycast IP itself.
ip -6 route add <anycast>/128 via <pod-eth0-v6> dev flock<8hex>
ip -4 route add <anycast>/32 via <pod-eth0-v4> dev flock<8hex>
Validation: an anycast IP whose family the pod doesn't have a unicast
for is skipped with a warn (an v4 anycast on an IPv6-only pod cannot be
NDP-resolved this way; require dual-stack).
Bonus cleanup: ESRCH from RouteDel is treated as success (idempotent).
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
The design doc's lo placement was motivated by avoiding NDP/ARP DAD
conflicts "across nodes advertising the same IP" — but flock pods each
sit on their own /64 veth subnet. DAD on eth0 only sees the host peer,
no cross-node L2.
With the IP on lo, the pod kernel doesn't reply to NDP solicits arriving
on eth0 (Linux default: answer NDP only for addresses on the receiving
interface). The host route `<ip>/128 dev flock<8hex>` causes the host
to do NDP for the destination on the veth; pod ignores; packet drops
silently between forwarding decision and transmit. Symptom: v4 anycast
works (proxy_arp=1 on the host veth handles ARP), v6 anycast doesn't.
Putting on eth0 makes NDP just work.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Cisco IOS rejects IPv6 BGP advertisements whose next-hop is link-local-
only. BIRD2 was synthesising a link-local next-hop for kernel-learned
routes whose dev had no via gateway (our anycast /128s). Symptom: v4
anycast worked (Cisco doesn't have the same constraint for /32s), v6
anycast didn't make it past crt001.
- pkg/routing/bird/config.go: NodeBGP.LocalV6/LocalV4. Template now
emits `local <addr> as <asn>` and `next hop self;` in the BGP
channel for both families, mirroring Calico's `source address` +
`next hop self` pattern.
- pkg/agent/bird.go: localAddrSameSubnet picks an interface address
on the peer's /64 or /24 to use as source.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
CNI ADD now adds anycast IPs to the pod's lo interface (NOT eth0 — design
doc rationale: avoid NDP/ARP DAD conflicts when N replicas share an IP).
Allocation persists the anycast list.
AnycastReconciler:
desired = { ip → flock<8hex> } from
committed allocations × pod.Status.PodReady=True
diff against advertised, install/remove host /128 (v6) or /32 (v4)
re-render bird.conf with the active set
Triggers: 2s tick, AfterCommit (per ADD/DEL), Pod informer Ready
transitions (PodCache.OnReadyChange callback).
The bird template already supported Anycast6/Anycast4 via the export
filter — this turn finally drives those slices from runtime.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When Calico shuts down on a flock-labeled node, calico-node sets
NetworkUnavailable=True with reason CalicoIsDown. Nothing replaces it,
so kubelet's NodeController applies node.kubernetes.io/network-
unavailable:NoSchedule and new pods can't land.
flock-agent now patches Status.Conditions every 60s with
NetworkUnavailable=False (reason=FlockReady). RBAC: nodes/status patch.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
BIRD2's protocol kernel does not import kernel routes by default; the
import filter on the channel is just for what BIRD has already learned.
Added `learn;` so the kernel-installed blackholes (from the agent's
SummaryRoutes) are picked up.
Also added explicit `protocol static static6/static4` with one
`route <cidr> blackhole;` per NodeConfig CIDR. This is belt-and-
suspenders: even if `learn` doesn't capture the kernel blackhole, BIRD
has the route directly and exports it via the BGP filter.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Calico fenced off via Tigera Installation CR (apps@2121892). flock-agent
now renders bird.conf with the per-node BGP peers; bird sidecar reloads
on changes (debounced 500ms). Re-render tick every 15s reacts to
NodeConfig updates.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Calico's calico-node still runs on every node (Tigera-Operator-managed
via ArgoCD with selfHeal). Two birds with the same ASN can't peer to
crt001 from the same source. Use a manual static route on crt001 for
the flock /64 for the first cutover; switch to live BGP after Calico is
fenced off flock-labeled nodes.
The bird sidecar stays running with the bootstrap config (kernel +
device only, no BGP), so flipping live BGP on later is a single-line
change in runtime_linux.go.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Locks the wire format between /opt/cni/bin/flock and flock-agent. ADD
returns a CNI Result, DEL returns success/error, CHECK returns
success/error. Connection-per-RPC, newline-delimited JSON.
- pkg/cni/rpc.go: shared Op + Request + Response + framed encode/decode.
- pkg/cni/rpc_client.go: net.Dial + EncodeRequest + DecodeResponse;
rpcSocket overridable for tests.
- pkg/cni/plugin.go: real implementations of CmdAdd/Del/Check that call
through, mapping agent errors to types.Error.
- pkg/agent/rpc.go: rpcServer with swappable AddHandler/DelHandler/
CheckHandler (defaults: not-implemented for ADD; idempotent-no-op for
DEL/CHECK so kubelet teardown of a never-ADDed pod doesn't fail).
- pkg/agent/server.go: replaces the M1 accept-and-close placeholder
with rpcServer.serve(ctx, listener); listener closes on ctx cancel.
Tests cover: Request/Response JSON roundtrip, end-to-end client →
unix-socket → fake server, agent error → CNI types.Error mapping.
ADD remains "not implemented" until netlink + IPAM wire-up — the agent
returns an error and kubelet will fail pod sandbox creation IF a node
were configured to use this CNI. host001's CNI plane is still 100%
Calico, so this changes nothing observable on the cluster.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Core building block for M2 CNI ADD. Pure logic (no netlink), mutex-
serialized, seedable from committed state via MarkInUse. Hooks into
pkg/embed for ip-algo IID derivation.
- resolveEffective() implements the design-doc cidr6/cidr4 annotation
rules: equal→node, supernet→node, subnet→ann, disjoint→error.
First-match-wins across multiple annotation CIDRs.
- allocV6() random IID within the effective CIDR; on ip-algo, defers
to embed.Embed. 16-retry on collision (regenerates IID or N nibble).
- allocV4() linear scan skipping .0 (network), .1 (gateway), .<last>
(broadcast). Smallest supported block: /30 with 1 usable address.
- Deterministic fakeRand in tests covers: intersection matrix, random
IID, embed path, collision→retry, v4 skip-gateway, v4 exhaustion,
dual-stack, release-then-reallocate, family mismatch rejection.
No agent Run-loop integration yet — NewIPAM(nc.Spec.CIDR6, nc.Spec.CIDR4)
will be called from Server.Run once netlink + RPC are in place.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Agent now watches nodeconfigs.flock.fritzlab.net via a client-go dynamic
informer, filters events to its own node name, and caches the typed
NodeConfig in memory (NodeConfigCache, atomic pointer). M2's IPAM will
read from that cache.
- pkg/agent/nodeconfig.go: informer + JSON-round-trip decode (avoids
hand-written DeepCopy + scheme registration for this small a use).
- pkg/agent/server.go: starts the informer goroutine; Run terminates if
the informer returns.
- pkg/api/v1alpha1: switch placeholder TypeMeta/ObjectMeta to metav1.
- deploy/rbac: get/list/watch on nodeconfigs.
- cmd/flock-agent: --kubeconfig flag for out-of-cluster runs (tests).
Satisfies M1 verified-by: "kubectl apply NodeConfig; agent logs read it".
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>