anycast: kernel multipath route + L4 hash for multi-pod-per-node

Move pure resolver logic out of anycast_linux.go into anycast.go so it's
unit-testable on any host. Reshape anycastTarget from a single
{hostIface, via} into a sorted list of nexthops; multiple Ready pods on
the same node binding the same anycast IP now contribute one nexthop
each.

installAnycastRoute uses RTA_MULTIPATH (via netlink.Route.MultiPath)
when the target has more than one nexthop. Single-nexthop targets keep
the simple via-route shape so 1-pod-per-node keeps rendering identically
to today's production form in `ip route show`.

flock-agent writes net.ipv{4,6}.fib_multipath_hash_policy = 1 at
startup so the kernel hashes flows on (saddr, daddr, sport, dport, proto)
rather than just IPs. The write is best-effort: the agent runs
privileged in production, so it succeeds there; where the write fails,
the kernel keeps its L3 hash, which only matters for the
multi-pod-per-node case.

resolveAnycastTargets sorts nexthops by canonical(via) for stable
comparison so a quiet reconcile pass doesn't churn the kernel route.

8 new unit tests cover: 1-pod, 2-pods-same-anycast (multi-nexthop),
NotReady drop, no-Ready omits the IP, pending skipped, mixed v6+v4,
family mismatch warns, determinism.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Donavan Fritz
2026-04-25 09:57:32 -05:00
parent 5d9b6bfeec
commit a7dc7bf1f4
4 changed files with 436 additions and 73 deletions
@@ -0,0 +1,110 @@
package agent

import (
	"net"
	"sort"
)

// anycastNexthop is one (host-side veth, pod-eth0-IP) pair the kernel route
// can use as a multipath nexthop.
type anycastNexthop struct {
	hostIface string
	via       net.IP
}

// anycastTarget describes the kernel route shape for one advertised anycast
// IP. When more than one Ready pod on this node binds the same anycast IP,
// every Ready pod contributes a nexthop and the kernel does per-flow ECMP
// across them.
//
// nexthops is sorted by canonical(via) for deterministic comparison and
// stable kernel-route ordering across reconcile passes: the
// AnycastReconciler skips kernel writes when the new and old targets are
// equal, which only works if the slice order is stable.
type anycastTarget struct {
	nexthops []anycastNexthop
}

// equal reports whether two targets describe the same kernel route.
// Both sides are expected to be sorted (the canonical constructor sorts).
func (t anycastTarget) equal(o anycastTarget) bool {
	if len(t.nexthops) != len(o.nexthops) {
		return false
	}
	for i := range t.nexthops {
		if t.nexthops[i].hostIface != o.nexthops[i].hostIface {
			return false
		}
		if !t.nexthops[i].via.Equal(o.nexthops[i].via) {
			return false
		}
	}
	return true
}

// resolveAnycastTargets walks the committed allocation set and returns the
// desired kernel-route shape for every anycast IP that has at least one
// Ready local pod binding it. Multiple Ready pods sharing the same anycast
// IP collapse into a single multi-nexthop target so the kernel can
// per-flow ECMP across them.
//
// Pure: no kernel calls, no informer access. Pods are surfaced via the
// isReady callback so the reconciler can plug in its informer; tests can
// pass any function that satisfies the signature.
//
// warn is invoked for human-facing skip reasons (e.g. anycast with no
// unicast of same family). nil-safe: pass nil to silently drop.
func resolveAnycastTargets(
	allocations []Allocation,
	isReady func(namespace, name string) bool,
	warn func(string),
) map[string]anycastTarget {
	if warn == nil {
		warn = func(string) {}
	}
	out := map[string]anycastTarget{}
	for _, a := range allocations {
		if a.State != StateCommitted || len(a.Anycast) == 0 {
			continue
		}
		if !isReady(a.Namespace, a.PodName) {
			continue
		}
		host := HostIfaceName(a.ContainerID)
		via6 := net.ParseIP(a.IP6)
		via4 := net.ParseIP(a.IP4)
		for _, ipStr := range a.Anycast {
			ip := net.ParseIP(ipStr)
			if ip == nil {
				continue
			}
			var via net.IP
			if ip.To4() != nil {
				via = via4
			} else {
				via = via6
			}
			if via == nil {
				warn("anycast " + ipStr + " skipped: pod " +
					a.Namespace + "/" + a.PodName +
					" has no unicast of same family")
				continue
			}
			key := canonical(ip)
			t := out[key]
			t.nexthops = append(t.nexthops, anycastNexthop{hostIface: host, via: via})
			out[key] = t
		}
	}
	// Sort each target's nexthops for stable comparison and stable kernel
	// ordering. Sort key is canonical(via); that is sufficient for
	// stability because (host, via) pairs are 1:1 (one veth per pod, one
	// v6+v4 per pod, so via uniquely identifies the nexthop).
	for k, t := range out {
		sort.Slice(t.nexthops, func(i, j int) bool {
			return canonical(t.nexthops[i].via) < canonical(t.nexthops[j].via)
		})
		out[k] = t
	}
	return out
}