anycast: kernel multipath route + L4 hash for multi-pod-per-node

Move pure resolver logic out of anycast_linux.go into anycast.go so it's
unit-testable on any host. Reshape anycastTarget from a single
{hostIface, via} into a sorted list of nexthops; multiple Ready pods on
the same node binding the same anycast IP now contribute one nexthop
each.

installAnycastRoute uses RTA_MULTIPATH (via netlink.Route.MultiPath)
when the target has more than one nexthop. Single-nexthop targets keep
the simple via-route shape so 1-pod-per-node keeps rendering identically
to today's production form in `ip route show`.
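
The two route shapes described above can be pictured as the iproute2 commands that would produce them. The sketch below is illustrative only: `ipRouteArgs` and `ipRouteCommand` are hypothetical helpers that are not part of flock, the /32 destination and the veth names are assumptions, and the real installAnycastRoute goes through netlink rather than shelling out to `ip`.

```go
package main

import "strings"

// nexthop mirrors the commit's anycastNexthop, with string fields for brevity.
type nexthop struct {
	hostIface string // host-side veth name
	via       string // pod unicast IP used as the gateway
}

// ipRouteArgs renders an illustrative `ip route replace` argument list for
// one anycast destination. Hypothetical helper, not part of flock.
// Callers are expected to pass at least one nexthop.
func ipRouteArgs(dst string, nhs []nexthop) []string {
	args := []string{"route", "replace", dst}
	if len(nhs) == 1 {
		// Single nexthop: plain via-route, same shape as before the change.
		return append(args, "via", nhs[0].via, "dev", nhs[0].hostIface)
	}
	// Several nexthops: one `nexthop` clause per Ready pod (multipath).
	for _, nh := range nhs {
		args = append(args, "nexthop", "via", nh.via, "dev", nh.hostIface)
	}
	return args
}

// ipRouteCommand joins the arguments into a printable command line.
func ipRouteCommand(dst string, nhs []nexthop) string {
	return "ip " + strings.Join(ipRouteArgs(dst, nhs), " ")
}
```

Calling `ipRouteCommand` with one nexthop yields a plain `via ... dev ...` route, while two or more nexthops yield repeated `nexthop via ... dev ...` clauses, which is the form the kernel treats as a multipath route.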

flock-agent writes net.ipv{4,6}.fib_multipath_hash_policy = 1 at
startup so the kernel hashes flows on (saddr, daddr, sport, dport, proto)
rather than just IPs. The write is best-effort: the agent runs
privileged in production, so it succeeds there. Where the write fails,
the kernel keeps its L3 hash, which only matters for the
multi-pod-per-node case anyway.

resolveAnycastTargets sorts nexthops by canonical(via) for stable
comparison so a quiet reconcile pass doesn't churn the kernel route.

8 new unit tests cover: 1-pod, 2-pods-same-anycast (multi-nexthop),
NotReady drop, no-Ready omits the IP, pending skipped, mixed v6+v4,
family mismatch warns, determinism.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,110 @@
package agent

import (
	"net"
	"sort"
)

// anycastNexthop is one (host-side veth, pod-eth0-IP) pair the kernel route
// can use as a multipath nexthop.
type anycastNexthop struct {
	hostIface string
	via       net.IP
}

// anycastTarget describes the kernel route shape for one advertised anycast
// IP. When more than one Ready pod on this node binds the same anycast IP,
// every Ready pod contributes a nexthop and the kernel does per-flow ECMP
// across them.
//
// nexthops is sorted by canonical(via) for deterministic comparison and
// stable kernel-route ordering across reconcile passes. The
// AnycastReconciler skips kernel writes when the new and old targets are
// equal, which only works if the slice order is stable.
type anycastTarget struct {
	nexthops []anycastNexthop
}

// equal reports whether two targets describe the same kernel route.
// Both sides are expected to be sorted (the canonical constructor sorts).
func (t anycastTarget) equal(o anycastTarget) bool {
	if len(t.nexthops) != len(o.nexthops) {
		return false
	}
	for i := range t.nexthops {
		if t.nexthops[i].hostIface != o.nexthops[i].hostIface {
			return false
		}
		if !t.nexthops[i].via.Equal(o.nexthops[i].via) {
			return false
		}
	}
	return true
}

// resolveAnycastTargets walks the committed allocation set and returns the
// desired kernel-route shape for every anycast IP that has at least one
// Ready local pod binding it. Multiple Ready pods sharing the same anycast
// IP collapse into a single multi-nexthop target so the kernel can
// per-flow ECMP across them.
//
// Pure: no kernel calls, no informer access. Pod readiness is surfaced via
// the isReady callback so the reconciler can plug in its informer; tests
// can pass any function that satisfies the signature.
//
// warn is invoked for human-facing skip reasons (e.g. anycast with no
// unicast of same family). nil-safe: pass nil to silently drop warnings.
func resolveAnycastTargets(
	allocations []Allocation,
	isReady func(namespace, name string) bool,
	warn func(string),
) map[string]anycastTarget {
	if warn == nil {
		warn = func(string) {}
	}
	out := map[string]anycastTarget{}
	for _, a := range allocations {
		if a.State != StateCommitted || len(a.Anycast) == 0 {
			continue
		}
		if !isReady(a.Namespace, a.PodName) {
			continue
		}
		host := HostIfaceName(a.ContainerID)
		via6 := net.ParseIP(a.IP6)
		via4 := net.ParseIP(a.IP4)
		for _, ipStr := range a.Anycast {
			ip := net.ParseIP(ipStr)
			if ip == nil {
				continue
			}
			var via net.IP
			if ip.To4() != nil {
				via = via4
			} else {
				via = via6
			}
			if via == nil {
				warn("anycast " + ipStr + " skipped: pod " +
					a.Namespace + "/" + a.PodName +
					" has no unicast of same family")
				continue
			}
			key := canonical(ip)
			t := out[key]
			t.nexthops = append(t.nexthops, anycastNexthop{hostIface: host, via: via})
			out[key] = t
		}
	}
	// Sort each target's nexthops for stable comparison + stable kernel
	// ordering. Sort key is canonical(via), which is sufficient for
	// stability because (host, via) pairs are 1:1 (one veth per pod, one
	// v6+v4 per pod, so via uniquely identifies the nexthop).
	for k, t := range out {
		sort.Slice(t.nexthops, func(i, j int) bool {
			return canonical(t.nexthops[i].via) < canonical(t.nexthops[j].via)
		})
		out[k] = t
	}
	return out
}
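
The family-matching step inside the inner loop can be isolated as a small pure function. The sketch below is an illustration rather than code from this commit: `pickVia` is a hypothetical name, and it takes strings directly instead of an Allocation.

```go
package main

import "net"

// pickVia returns the pod unicast address whose family matches the anycast
// IP, or nil when the anycast IP is unparseable or the pod has no unicast
// address of that family. Hypothetical helper mirroring the loop body of
// resolveAnycastTargets.
func pickVia(anycast, ip4, ip6 string) net.IP {
	ip := net.ParseIP(anycast)
	if ip == nil {
		return nil
	}
	if ip.To4() != nil {
		// IPv4 anycast (including IPv4-mapped IPv6 literals, for which
		// To4 also returns non-nil) uses the pod's IPv4 address.
		return net.ParseIP(ip4)
	}
	return net.ParseIP(ip6)
}
```

A nil result for a parseable anycast IP corresponds to the "no unicast of same family" warning path; for example, an IPv6 anycast address on a pod that only has an IPv4 address yields nil.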