MetalLB ARP vs BGP: The Production Decision Guide You Actually Need

Hey folks! 👋 I'm Vikash Kumar, a seasoned DevOps Engineer navigating the thrilling landscapes of DevOps and Cloud ☁️. My passion? Simplifying and automating processes to enhance our tech experiences. By day, I'm a Terraform wizard; by night, a Kubernetes aficionado crafting ingenious solutions with the latest DevOps methodologies 🚀. From troubleshooting deployment snags to orchestrating seamless CI/CD pipelines, I've got your back. Fluent in scripts and infrastructure as code. With AWS ☁️ expertise, I'm your go-to guide in the cloud. And when it comes to monitoring and observability 📊, Prometheus and Grafana are my trusty allies. In the realm of source code management, I'm at ease with GitLab, Bitbucket, and Git. Eager to stay ahead of the curve 📚, I'm committed to exploring the ever-evolving domains of DevOps and Cloud. Let's connect and embark on this journey together! Drop me a line at thenameisvikash@gmail.com.
📘 Production On-Prem Kubernetes Series - Part 1
This is Part 1 of my series on deploying production-grade Kubernetes clusters on-premise. This article covers the critical MetalLB architecture decision you need to make BEFORE deployment.
Series outline:
Part 1: MetalLB L2 vs BGP Decision Guide (you are here)
Part 2: Control Plane HA - Keepalived vs Kube-VIP (coming next)
Part 3: Full Deployment with Kubespray + Kube-VIP
Context: I'm running Kubernetes for an energy trading platform in India. Last year, we faced a decision: MetalLB Layer 2 or BGP? Here's what I learned after deploying both in production.
What this covers:
Why MetalLB exists (the real problem)
How L2 (ARP) actually works—and where it breaks
How BGP actually works—and why it scales
Real configs, real thresholds, real decisions
When to use each (with honest trade-offs)
Read time: 12 minutes that save you months
1. The Problem MetalLB Solves (Not the Marketing Version)
In cloud Kubernetes, this works instantly:
apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: LoadBalancer
  ports:
    - port: 443
AWS gives you an ELB. GCP gives you a Load Balancer. Done.
On-prem?
$ kubectl get svc webapp
NAME     TYPE           EXTERNAL-IP   PORT(S)
webapp   LoadBalancer   <pending>     443:32156/TCP
That <pending> means nothing in the cluster is handing out external IPs: Kubernetes on its own doesn't know how to expose this.
You need three things:
IP allocation - Where does the external IP come from?
IP announcement - How do clients find this IP?
Traffic routing - How do packets reach the right node?
MetalLB solves all three.
But HOW it solves them creates entirely different architectures.
2. Two Ways MetalLB Announces IPs
MetalLB must answer one question:
"When traffic comes for this IP, who should receive it?"
There are two fundamentally different answers:
Layer 2 (ARP): Broadcasting "I'm here!" to everyone
BGP: Telling the router "Route to me" directly
That fundamental difference drives everything else.
Let's break down each approach properly.
3. MetalLB Layer 2 Mode: The Mechanics
How It Works
MetalLB uses ARP (Address Resolution Protocol) to announce IPs.

The flow:
MetalLB assigns IP 192.168.1.100 to your service
One node becomes the "speaker" (leader election)
Speaker sends ARP announcement: "192.168.1.100 is at MAC aa:bb:cc:dd:ee:ff"
Network switch updates its table
All traffic flows to that speaker node
Client → Switch → Speaker Node → kube-proxy → Pod
My Staging Config:
# Legacy ConfigMap format (MetalLB releases before v0.13)
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: staging-pool
      protocol: layer2
      addresses:
      - 10.50.100.10-10.50.100.50
Applied it, worked instantly. No router configuration needed.
That simplicity is intoxicating—and dangerous.
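The ConfigMap above is MetalLB's older configuration format. Releases from v0.13 onward are configured through custom resources instead, so on a current install the same staging pool would look roughly like the sketch below. The name `staging-l2` is made up for illustration, and the API versions are the ones recent releases use as far as I know, so check them against the version you run.

```yaml
# Sketch only: CRD equivalent of the staging ConfigMap (assumes MetalLB v0.13+).
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: staging-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.50.100.10-10.50.100.50
---
# Without an L2Advertisement, addresses get allocated but never announced.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: staging-l2            # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - staging-pool
```

The split into two resources is deliberate: the pool decides which addresses exist, the advertisement decides how they are announced.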
4. Where ARP Breaks: The Deep Dive
Let's go beyond "it doesn't scale" and understand precisely why.
❌ Problem 1: The Broadcast Domain Wall
ARP only works within a single broadcast domain (one subnet/VLAN).
Real scenario from our setup:
Kubernetes cluster: 10.50.0.0/24
Office network: 192.168.1.0/24
Client network: 172.16.0.0/16
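With Layer 2 mode, only hosts in the nodes' broadcast domain (10.50.0.0/24 here) can resolve the service IP through ARP.
Clients on other networks reach it only if a router sits in that broadcast domain and routes the pool to it. Where that isn't in place, they need:
Proxy servers
VPN tunnels
Manual routing rules
Why? ARP is a link-layer broadcast, and routers do not forward broadcasts between subnets.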
❌ Problem 2: Single Speaker Bottleneck
Only one node owns each IP at a time.
Traffic pattern:
LoadBalancer IP: 10.50.100.10
Pods distributed across: Node1, Node2, Node3
ALL inbound traffic enters through: Node2 (speaker)
Real throughput limits:
| NIC Speed | Theoretical Max | Real-World Max | Your Bottleneck |
| --- | --- | --- | --- |
| 1 Gbps | 125 MB/s | ~90 MB/s | Single node NIC |
| 10 Gbps | 1.25 GB/s | ~900 MB/s | Single node NIC |
Even with 10 nodes and 100 Gbps total capacity, inbound traffic for each service IP is limited to one node's NIC.
We hit this at ~80 MB/s sustained during peak traffic hours (energy meter data ingestion).
❌ Problem 3: The Failover Gap (SLA Killer)
When the speaker node dies, here's the timeline:
T+0s: Node crashes
T+2s: MetalLB detects failure
T+3s: New speaker elected
T+4s: New speaker sends gratuitous ARP
T+4-30s: Switch ARP cache updates (vendor-dependent)
T+30s: Traffic restored
That 30-second gap destroyed our staging environment twice during node maintenance.
| Network Equipment | Observed Failover Time |
| --- | --- |
| Our Cisco switches | 15-25 seconds |
| Generic enterprise switches | 30-45 seconds |
| Older switches | 60-120 seconds |
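For SLA math:
15 seconds downtime per node failover
4 planned maintenances/year = 60 seconds/year
99.99% allows ~52.6 minutes/year, so this alone fits. 99.999% allows only ~5.3 minutes/year, and those 60 seconds are already ~19% of that budget
On older switches (60-120 seconds per failover), the same four events cost 4-8 minutes and can break 99.999% on their own, before counting a single unplanned failure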
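The downtime budget behind an SLA figure is easy to derive and worth checking by hand, since minutes and seconds get mixed up easily. A minimal derivation, assuming a 365-day year and counting only the failover time during the four planned maintenances:

```latex
% Yearly downtime budget D(A) for an availability target A.
\[
  T_{\text{year}} = 365 \times 24 \times 3600 = 31\,536\,000 \text{ s},
  \qquad
  D(A) = (1 - A)\, T_{\text{year}}
\]
\[
  D(0.9999) = 3153.6 \text{ s} \approx 52.6 \text{ min},
  \qquad
  D(0.99999) = 315.36 \text{ s} \approx 5.3 \text{ min}
\]
% Four planned maintenances at 15 s of failover each:
\[
  4 \times 15 \text{ s} = 60 \text{ s},
  \qquad
  \frac{60}{3153.6} \approx 1.9\%,
  \qquad
  \frac{60}{315.36} \approx 19\%
\]
```

So planned maintenance alone uses about 2% of a four-nines budget and about 19% of a five-nines budget. Unplanned failures and slower switches come on top of that.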
❌ Problem 4: ARP Noise at Scale
Every time a service IP moves to a different node, the new speaker sends gratuitous ARP broadcasts.
Our staging cluster:
12 LoadBalancer services
5 nodes
~20 pod migrations per day (autoscaling, deployments)
Result: ~50 ARP announcements per day
Calculation for 50 services across 20 nodes:
3 speaker changes/service/day (rolling updates)
= 150 ARP broadcasts/day
= ~6 per hour
Modern switches handle this, but it adds:
Control plane CPU load
Log noise
Unpredictable behavior during storms
❌ Problem 5: Security Blocks
Many environments treat gratuitous ARP as suspicious (it's a classic spoofing technique).
Blocked in:
AWS/GCP/Azure virtual networks (they don't pass arbitrary ARP announcements between instances)
Enterprise networks with ARP inspection
Firewalls with strict MAC filtering
We learned this the hard way: Our client's enterprise network rejected MetalLB L2 entirely due to security policies.
✅ When L2 Mode Actually Works
Don't overthink it if you have:
Nodes: 3-5
Services: < 10
Network: Single subnet
Environment: Dev/staging
Downtime tolerance: 30-60 seconds OK
Perfect for:
MVP/prototype clusters
Internal developer platforms
Testing environments
Small teams with no network engineer
5. MetalLB BGP Mode: The Grown-Up Solution
What Changes?
Instead of broadcasting (ARP), nodes talk to routers using BGP.

The flow:
MetalLB establishes BGP sessions with your router
Advertises: "To reach 10.50.100.10, route to Node2 (10.50.0.4)"
Router updates its routing table
Traffic gets routed via Layer 3
Client → Router → Best Node → kube-proxy → Pod
My Production Config:
MetalLB side:
# Legacy ConfigMap format (MetalLB releases before v0.13)
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.50.0.1   # Router IP
      peer-asn: 64500           # Router's AS
      my-asn: 64501             # MetalLB's AS
    address-pools:
    - name: production-pool
      protocol: bgp
      addresses:
      - 10.50.100.100-10.50.100.200
Router side (VyOS example):
set protocols bgp 64500 neighbor 10.50.0.3 remote-as 64501
set protocols bgp 64500 neighbor 10.50.0.4 remote-as 64501
set protocols bgp 64500 neighbor 10.50.0.5 remote-as 64501
Setup time: ~2 hours (including router access approval from network team)
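On MetalLB v0.13 or later, the MetalLB side is expressed as custom resources rather than a ConfigMap. A rough equivalent of the config above might look like this. The names `core-router` and `production-bgp` are placeholders I've made up, and the field names are the ones recent releases use as far as I know, so verify them against your version.

```yaml
# Sketch only: CRD equivalent of the BGP ConfigMap (assumes MetalLB v0.13+).
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: core-router           # hypothetical name
  namespace: metallb-system
spec:
  peerAddress: 10.50.0.1      # Router IP
  peerASN: 64500              # Router's AS
  myASN: 64501                # MetalLB's AS
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.50.100.100-10.50.100.200
---
# Announces the pool's addresses over the configured BGP sessions.
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-bgp        # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
```

The router side stays the same either way: it only sees BGP sessions from the node IPs.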
6. Why BGP Scales: The Technical Advantages
✅ Multi-Path Load Balancing (ECMP)
Multiple nodes can advertise the same IP:
Router sees:
10.50.100.10 → Node1 (10.50.0.3)
10.50.100.10 → Node2 (10.50.0.4)
10.50.100.10 → Node3 (10.50.0.5)
Router distributes traffic across all three
Our production results:
| Metric | Layer 2 (Staging) | BGP (Production) |
| --- | --- | --- |
| Inbound paths | 1 node | 12 nodes |
| Max throughput | ~80 MB/s | ~960 MB/s (12x) |
| Bottleneck | Speaker NIC | Application logic |
We went from NIC-limited to application-limited. That's the difference.
✅ Faster Failover (Seconds, Not Tens of Seconds)
BGP uses keepalive and hold timers to detect failures:
Our tuned config:
Keepalive: 3 seconds
Hold time: 9 seconds
Failure timeline:
T+0s: Node dies
T+9s: Router detects (missed 3 keepalives)
T+9.2s: Router removes routes
T+9.5s: Traffic flows to healthy nodes
Measured failover: 5-10 seconds (vs 30-60 seconds with ARP)
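Those timers have to be set on the MetalLB side as well as on the router, because BGP peers settle on the lower of the two hold times they propose. A sketch of how the tuned values might be expressed on a CRD-based install (MetalLB v0.13+ assumed; `core-router` is a placeholder name):

```yaml
# Sketch only: the tuned timers on a BGPPeer (assumes MetalLB v0.13+).
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: core-router           # hypothetical name
  namespace: metallb-system
spec:
  peerAddress: 10.50.0.1
  peerASN: 64500
  myASN: 64501
  keepaliveTime: 3s           # send a keepalive every 3 seconds
  holdTime: 9s                # declare the peer dead after 9 seconds of silence
```

Keeping the keepalive at one third of the hold time is the usual convention, so one or two lost keepalives don't tear the session down.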
✅ Cross-VLAN/Subnet Support
BGP works across:
Multiple VLANs
Different subnets
WAN links
Routed networks
Our actual topology:
Kubernetes cluster: 10.50.0.0/24
Office network: 192.168.1.0/24
Remote site: 172.16.0.0/16
Client network: 203.x.x.x/29 (via firewall)
With BGP, all networks reach services via standard IP routing.
With ARP, we'd need VPNs or proxies for every remote network.

✅ Zero Broadcast Noise
BGP uses point-to-point TCP sessions (port 179).
No broadcasts. No ARP storms. No switch overhead.
Scaling characteristics:
| Services | Nodes | ARP Packets/Hour (L2) | BGP Updates (BGP) |
| --- | --- | --- | --- |
| 10 | 5 | ~30 | 0 (steady state) |
| 50 | 20 | ~200 | 0 (steady state) |
| 100 | 50 | ~500 | 0 (steady state) |
BGP updates only during actual changes (new service, scaling events).
✅ Production Observability
On router:
$ show ip bgp summary
Neighbor     AS      State         Uptime
10.50.0.3    64501   Established   12d 4h
10.50.0.4    64501   Established   12d 4h
10.50.0.5    64501   Established   3h 22m   # Recently restarted
$ show ip bgp 10.50.100.10
Network        Next Hop    Metric   Status
10.50.100.10   10.50.0.3   0        Active
10.50.100.10   10.50.0.4   0        Active
10.50.100.10   10.50.0.5   0        Active
In Kubernetes:
$ kubectl logs -n metallb-system speaker-xyz | grep BGP
level=info msg="BGP session established" peer=10.50.0.1
With L2, there is no session state to inspect on the network side. You're left with speaker logs and checking ARP tables by hand.
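Session state is also easy to alert on. The sketch below assumes the Prometheus Operator is installed and already scraping the MetalLB speakers. It relies on the `metallb_bgp_session_up` metric and its `peer` label as MetalLB documents them, so confirm both against your speaker's /metrics output. The rule and alert names are made up.

```yaml
# Sketch only: alert when a MetalLB BGP session drops.
# Assumes Prometheus Operator CRDs are installed and MetalLB metrics are scraped.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: metallb-bgp-sessions  # hypothetical name
  namespace: metallb-system
spec:
  groups:
    - name: metallb.bgp
      rules:
        - alert: MetalLBBGPSessionDown   # hypothetical alert name
          expr: metallb_bgp_session_up == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "BGP session to {{ $labels.peer }} is down"
```

A one-minute `for` window keeps a planned speaker restart from paging anyone while still catching a session that stays down.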
7. "But BGP is Complex/Expensive/Public" (Myths Destroyed)
Myth 1: BGP Requires Enterprise Routers
False.
| Router | Cost | BGP Support |
| --- | --- | --- |
| Cisco Catalyst | $5,000+ | Yes |
| Mikrotik RouterBOARD | $200-800 | Yes ✅ |
| VyOS (VM) | Free | Yes ✅ |
| pfSense + FRR | Free | Yes ✅ |
We use VyOS (free) in staging and Mikrotik (~$600) in production.
Myth 2: BGP Means Public Internet
Absolutely not.
BGP advertises reachability, not exposure.
Our setup:
IPs: 10.50.100.0/24 (RFC1918 private)
AS: 64501 (note: 64496-64511 is the range reserved for documentation; for a real deployment pick a private ASN from 64512-65534)
Peers: Internal routers only
No external exposure. BGP is internal routing.
Myth 3: You Need Deep BGP Knowledge
What you actually need:
Understand what an AS number is (2 minutes)
A few lines of router config (one neighbor statement per node)
How to read the output of show ip bgp summary
What you don't need:
BGP path selection algorithms
AS-path manipulation
BGP communities
For MetalLB, BGP is simple—you're not building the internet.
8. Quick Mental Model
Think of it like giving directions:
Layer 2 (ARP):
Shouting "I'm at 123 Main Street!" to everyone in the neighborhood
Works great in a small neighborhood
Chaos in a city
BGP:
Updating Google Maps: "123 Main Street is here"
Google routes everyone correctly
Scales to millions of addresses
That's the fundamental difference.
| Aspect | Layer 2 | BGP |
| --- | --- | --- |
| Announcement | Broadcast (everyone hears) | Unicast (router only) |
| Failure handling | Network relearns (30-60s) | Router updates (5-10s) |
| Scale limit | Network broadcasts | Routing table size |
| Production use | Small clusters only | Enterprise standard |
9. Decision Framework (Pin This)
IF
    single subnet
    AND < 5 nodes
    AND < 10 services
    AND downtime tolerance > 30 seconds
THEN
    Layer 2 is acceptable
ELSE IF
    production environment
    OR multi-rack/multi-VLAN
    OR > 10 nodes
    OR > 20 services
    OR SLA > 99.9%
THEN
    Use BGP from day 1
ELSE
    Start with Layer 2
    Plan for BGP migration within 6 months
10. Scale Thresholds (Real Numbers)
Based on our production experience + research:
| Cluster Profile | Nodes | Services | Network | Mode | Rationale |
| --- | --- | --- | --- | --- | --- |
| Small | 1-5 | < 10 | Single subnet | L2 OK | Failover acceptable, simple setup |
| Medium | 5-20 | 10-50 | Single/multi-rack | BGP preferred | L2 starts showing cracks |
| Large | 20-50 | 50-100 | Multi-VLAN/site | BGP mandatory | L2 fundamentally broken |
| Enterprise | 50+ | 100+ | Complex routed | BGP only | L2 won't work |
Our actual journey:
Staging (L2): 5 nodes, 12 services → works, but 15-25s failover hurts
Production (BGP): 12 nodes, 28 services → 5-10s failover, zero issues
11. Migration Path: L2 → BGP (Proven Strategy)
Good news: No downtime required.
Phase 1: Dual-Mode Config
address-pools:
- name: legacy-l2
  protocol: layer2
  addresses:
  - 10.50.100.10-10.50.100.50
- name: new-bgp
  protocol: bgp
  addresses:
  - 10.50.100.100-10.50.100.200
peers:
- peer-address: 10.50.0.1
  peer-asn: 64500
  my-asn: 64501
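On a CRD-based install (MetalLB v0.13+ assumed), the dual-mode setup needs one extra bit of care: an advertisement that names no pool applies to every pool, so each advertisement should be scoped explicitly. A sketch, with made-up advertisement names, and with the BGPPeer for the router left out for brevity:

```yaml
# Sketch only: both pools side by side as CRDs (assumes MetalLB v0.13+).
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: legacy-l2
  namespace: metallb-system
spec:
  addresses:
    - 10.50.100.10-10.50.100.50
  autoAssign: false           # stop handing L2 addresses to new services
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: new-bgp
  namespace: metallb-system
spec:
  addresses:
    - 10.50.100.100-10.50.100.200
---
# Scope each advertisement to its own pool.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: legacy-l2-adv         # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - legacy-l2
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: new-bgp-adv           # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - new-bgp
```

Setting `autoAssign: false` on the old pool is optional, but it keeps new services from landing on L2 by accident. Test in staging that existing services keep their addresses before relying on it.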
Phase 2: Service-by-Service Migration
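For new services: Use the BGP pool (request it explicitly, or set auto-assign: false on the L2 pool so new allocations come from the BGP pool)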
For existing services:
# 1. Create new service (it must draw its IP from the BGP pool)
kubectl expose deployment app --type=LoadBalancer \
  --name=app-bgp --port=443
# 2. Update DNS: point app.example.com at the new BGP IP (10.50.100.101 here)
# 3. Wait for the DNS TTL to expire (24-48 hours)
# 4. Delete old L2 service
kubectl delete svc app-l2
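If you'd rather not depend on which pool MetalLB happens to pick, step 1 can be written declaratively with the pool pinned by annotation. This is a sketch: the selector label and target port are placeholders, and newer MetalLB releases may expect a different annotation prefix than `metallb.universe.tf`, so check the docs for your version.

```yaml
# Sketch only: declarative version of step 1, pinned to the BGP pool.
apiVersion: v1
kind: Service
metadata:
  name: app-bgp
  annotations:
    metallb.universe.tf/address-pool: new-bgp   # request an IP from this pool
spec:
  type: LoadBalancer
  selector:
    app: app                  # hypothetical label; match your deployment's pods
  ports:
    - port: 443
      targetPort: 443         # hypothetical container port
```

Keeping this manifest in version control also leaves a record of which services have already moved.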
Phase 3: Complete Cutover
Remove L2 pool once all services migrated.
Our timeline: 3 weeks for 12 services (low-priority, phased approach)
12. Troubleshooting (Lessons from Production)
Issue 1: BGP Session Won't Establish
$ kubectl logs -n metallb-system speaker-xyz
level=error msg="BGP session failed" error="connection refused"
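Fix: Usually TCP port 179 is blocked, or the router has no neighbor entry for this node. The speaker normally initiates the connection, so check the router's firewall and neighbor list first.
# On each node, if a host firewall filters BGP traffic from the router
sudo iptables -A INPUT -p tcp --dport 179 -s 10.50.0.1 -j ACCEPT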
Issue 2: Service Gets IP But Unreachable
$ kubectl get svc
NAME EXTERNAL-IP PORT(S)
app 10.50.100.10 443:31234/TCP
$ curl -k https://10.50.100.10
# Timeout
Debug checklist:
# 1. Check MetalLB speaker logs
kubectl logs -n metallb-system speaker-xyz
# 2. Verify BGP routes on router
show ip bgp 10.50.100.10
# 3. Check if pods exist and are ready
kubectl get pods -l app=myapp
kubectl describe svc app   # Check endpoints
# 4. Test the NodePort from a node directly
curl -k https://localhost:31234   # Should work
Common causes:
Pods not ready (missing readiness probes)
Router not advertising to client networks
Firewall between router and clients
Issue 3: Intermittent Connectivity
Traffic works 50% of the time.
Root cause: ECMP distributing to node with unready pod.
Fix: Proper readiness probes
spec:
  containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 3
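With externalTrafficPolicy: Local, MetalLB only advertises from nodes with ready pods. With the default (Cluster), every node advertises and kube-proxy forwards to ready pods wherever they run.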
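The traffic policy is set on the Service itself. A minimal sketch of a Service that limits announcements to nodes running ready pods (the selector label is a placeholder):

```yaml
# Sketch only: restrict BGP announcements to nodes that run ready pods.
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # default is Cluster (every node announces)
  selector:
    app: myapp                   # hypothetical label
  ports:
    - port: 443
      targetPort: 8080           # same port the readiness probe checks
```

The trade-off: Local preserves the client source IP and avoids an extra hop, but the router balances per node, so unevenly spread pods get uneven load.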
13. The Honest Recommendation
For 80% of production clusters: Use BGP.
Why?
Clusters grow unpredictably
You plan for 5 nodes. 8 months later you have 20.
BGP setup is one-time effort
2-4 hours upfront. Saves painful migration later.
L2 fails silently
Works fine until scale hits. Then suddenly doesn't.
Network teams already know BGP
It's standard practice, not a Kubernetes oddity.
Exception: Use L2 if:
Genuinely small cluster (< 5 nodes, < 10 services)
Dev/staging only
Hard blocker on router access
Prototyping/MVP phase
Then plan BGP migration before you hit 10 services.
14. Pre-Deployment Checklist
Before deploying MetalLB:
Network:
[ ] Dedicated IP range allocated (minimum /28 = 16 addresses, 14 if you skip the network and broadcast addresses)
[ ] Single subnet or multi-subnet? (Multi = BGP required)
[ ] VLANs or flat network?
[ ] Switch vendor/model documented
Scale:
[ ] Current node count: ___
[ ] 12-month projection: ___
[ ] Current service count: ___
[ ] 12-month projection: ___
BGP Readiness:
[ ] BGP-capable router available? (Check: Mikrotik, VyOS, pfSense, Cumulus)
[ ] Network team approval process understood
[ ] Private AS number available (64512-65534 range)
[ ] Router config change lead time: ___
Decision:
[ ] Layer 2 chosen (document migration trigger)
[ ] BGP chosen (router config scheduled)
15. Quick Reference Tables
Comparison Matrix:
| Feature | Layer 2 (ARP) | BGP |
| --- | --- | --- |
| Setup complexity | ⭐ Simple | ⭐⭐⭐ Moderate |
| Router config needed | ❌ No | ✅ Yes |
| Failover time | 30-60s | 5-10s |
| Cross-VLAN support | ❌ No | ✅ Yes |
| Multi-node load balancing | ❌ No (single speaker) | ✅ Yes (ECMP) |
| Broadcast overhead | ⚠️ Yes | ✅ No |
| Production-ready | ⚠️ Small scale only | ✅ Yes |
| Cloud compatibility | ❌ No | ⚠️ Depends on the provider |
Cost Comparison:
| Approach | Hardware Cost | Setup Time | Operational Risk |
| --- | --- | --- | --- |
| L2 (staging) | $0 | 30 minutes | Medium (failover delays) |
| BGP (cheap router) | $200-600 (Mikrotik) | 2-4 hours | Low |
| BGP (enterprise) | $3,000+ | 1-2 days | Very low |
| BGP (virtual) | $0 (VyOS/pfSense) | 2-4 hours | Low |
17. Final Thought
MetalLB isn't magic. ARP isn't evil. BGP isn't scary.
They're tools—and scale decides which one hurts less.
Design simple. Upgrade deliberately.
The best time to set up BGP was at deployment.
The second-best time is now.
What's Next?
This article helped you decide L2 vs BGP. Now you need to deploy.
Upcoming in this series:
Part 2: Kubernetes control plane HA - Keepalived vs Kube-VIP (coming next week)
Part 3: Production deployment with Kubespray + Kube-VIP + MetalLB (coming soon)
Follow me for the next parts!
Questions? Real production experiences to share? Drop them in the comments. I spent 8 months figuring this out for our energy platform, and I'm happy to discuss specifics.
If this helped you avoid an outage, share it with your team.


