MetalLB ARP vs BGP: The Production Decision Guide You Actually Need


Hey folks! 👋 I'm Vikash Kumar, a seasoned DevOps Engineer navigating the thrilling landscapes of DevOps and Cloud ☁️. My passion? Simplifying and automating processes to enhance our tech experiences. By day, I'm a Terraform wizard; by night, a Kubernetes aficionado crafting ingenious solutions with the latest DevOps methodologies 🚀. From troubleshooting deployment snags to orchestrating seamless CI/CD pipelines, I've got your back. Fluent in scripts and infrastructure as code. With AWS ☁️ expertise, I'm your go-to guide in the cloud. And when it comes to monitoring and observability 📊, Prometheus and Grafana are my trusty allies. In the realm of source code management, I'm at ease with GitLab, Bitbucket, and Git. Eager to stay ahead of the curve 📚, I'm committed to exploring the ever-evolving domains of DevOps and Cloud. Let's connect and embark on this journey together! Drop me a line at thenameisvikash@gmail.com.

📘 Production On-Prem Kubernetes Series - Part 1

This is Part 1 of my series on deploying production-grade Kubernetes clusters on-premise. This article covers the critical MetalLB architecture decision you need to make BEFORE deployment.

Series outline:

  • Part 1: MetalLB L2 vs BGP Decision Guide (you are here)

  • Part 2: Control Plane HA - Keepalived vs Kube-VIP (coming next)

  • Part 3: Full Deployment with Kubespray + Kube-VIP


MetalLB Layer 2 vs BGP: The Production Decision Guide You Actually Need

Context: I'm running Kubernetes for an energy trading platform in India. Last year, we faced a decision: MetalLB Layer 2 or BGP? Here's what I learned after deploying both in production.

What this covers:

  • Why MetalLB exists (the real problem)

  • How L2 (ARP) actually works—and where it breaks

  • How BGP actually works—and why it scales

  • Real configs, real thresholds, real decisions

  • When to use each (with honest trade-offs)

Read time: 12 minutes that save you months


1. The Problem MetalLB Solves (Not the Marketing Version)

In cloud Kubernetes, this works instantly:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: LoadBalancer
  ports:
  - port: 443

AWS gives you an ELB. GCP gives you a Load Balancer. Done.

On-prem?

$ kubectl get svc webapp
NAME     TYPE           EXTERNAL-IP   PORT(S)
webapp   LoadBalancer   <pending>     443:32156/TCP

That <pending> means: Kubernetes doesn't know how to expose this.

You need three things:

  1. IP allocation - Where does the external IP come from?

  2. IP announcement - How do clients find this IP?

  3. Traffic routing - How do packets reach the right node?

MetalLB solves all three.

But HOW it solves them creates entirely different architectures.


2. Two Ways MetalLB Announces IPs

MetalLB must answer one question:

"When traffic comes for this IP, who should receive it?"

There are two fundamentally different answers:

Layer 2 (ARP): Broadcasting "I'm here!" to everyone
BGP: Telling the router "Route to me" directly

That fundamental difference drives everything else.

Let's break down each approach properly.


3. MetalLB Layer 2 Mode: The Mechanics

How It Works

MetalLB uses ARP (Address Resolution Protocol) to announce IPs.

The flow:

  1. MetalLB assigns IP 192.168.1.100 to your service

  2. One node's speaker is elected leader for that IP (every node runs a speaker, but only one answers per IP)

  3. Speaker sends ARP announcement: "192.168.1.100 is at MAC aa:bb:cc:dd:ee:ff"

  4. Network switch updates its table

  5. All traffic flows to that speaker node

Client → Switch → Speaker Node → kube-proxy → Pod

My Staging Config:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: staging-pool
      protocol: layer2
      addresses:
      - 10.50.100.10-10.50.100.50

Applied it, worked instantly. No router configuration needed.

That simplicity is intoxicating—and dangerous.
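
A note on versions: the ConfigMap format shown above was dropped in MetalLB v0.13, which moved configuration to custom resources. Here is a minimal sketch of the same staging pool in the CRD style, written to a file from the shell so it can be reviewed before applying. The file name and the `staging-l2` resource name are placeholders for illustration; the address range is the one from the config above.

```shell
# Write the CRD-style equivalent of the staging pool to a file.
cat > metallb-l2-staging.yaml <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: staging-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.50.100.10-10.50.100.50
---
# Without an L2Advertisement the pool is allocated from but never announced.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: staging-l2          # placeholder name
  namespace: metallb-system
spec:
  ipAddressPools:
  - staging-pool
EOF

# Apply once MetalLB v0.13 or newer is installed:
# kubectl apply -f metallb-l2-staging.yaml
```

If you are still on an older release, the ConfigMap above remains the right format; check your installed version before picking one.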


4. Where ARP Breaks: The Deep Dive

Let's go beyond "it doesn't scale" and understand precisely why.

❌ Problem 1: The Broadcast Domain Wall

ARP only works within a single broadcast domain (one subnet/VLAN).

Real scenario from our setup:

Kubernetes cluster: 10.50.0.0/24
Office network:     192.168.1.0/24
Client network:     172.16.0.0/16

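With Layer 2 mode, only hosts on the same VLAN as the nodes can resolve the service IPs directly. Clients elsewhere get through only if a router with an interface on that VLAN forwards for them.

Without that, others need:

  • Proxy servers

  • VPN tunnels

  • Manual routing rules

Why? ARP is a link-local protocol. Routers do not forward ARP frames, so an announcement never leaves its own broadcast domain.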


❌ Problem 2: Single Speaker Bottleneck

Only one node owns each IP at a time.

Traffic pattern:

LoadBalancer IP: 10.50.100.10
Pods distributed across: Node1, Node2, Node3
ALL inbound traffic enters through: Node2 (speaker)

Real throughput limits:

| NIC Speed | Theoretical Max | Real-World Max | Your Bottleneck |
| --- | --- | --- | --- |
| 1 Gbps | 125 MB/s | ~90 MB/s | Single node NIC |
| 10 Gbps | 1.25 GB/s | ~900 MB/s | Single node NIC |

Even with 10 nodes and 100 Gbps total capacity, you're limited to one node's NIC.

We hit this at ~80 MB/s sustained during peak traffic hours (energy meter data ingestion).
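
The ceiling is easy to sanity-check. This is a rough sketch of the arithmetic, assuming decimal units (1 Gbps = 1000 Mbit/s) and ignoring protocol overhead; the function names are made up for illustration.

```shell
# Convert a NIC line rate in Gbps to MB/s (theoretical, no overhead).
nic_mb_per_s() {
  awk -v gbps="$1" 'BEGIN { printf "%d\n", gbps * 1000 / 8 }'
}

# L2 mode: ingress is capped by the speaker's NIC; node count ($1) is ignored.
l2_ingress_ceiling() {    # args: node_count nic_gbps
  nic_mb_per_s "$2"
}

# Upper bound if every node could accept ingress traffic (the ECMP case).
ecmp_ingress_ceiling() {  # args: node_count nic_gbps
  awk -v n="$1" -v gbps="$2" 'BEGIN { printf "%d\n", n * gbps * 1000 / 8 }'
}

echo "L2,   10 nodes x 10 Gbps: $(l2_ingress_ceiling 10 10) MB/s"
echo "ECMP, 10 nodes x 10 Gbps: $(ecmp_ingress_ceiling 10 10) MB/s"
```

Adding nodes changes the second number and leaves the first untouched, which is the whole problem.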


❌ Problem 3: The Failover Gap (SLA Killer)

When the speaker node dies, here's the timeline:

T+0s:   Node crashes
T+2s:   MetalLB detects failure
T+3s:   New speaker elected
T+4s:   New speaker sends gratuitous ARP
T+4-30s: Switch ARP cache updates (vendor-dependent)
T+30s:  Traffic restored

That 30-second gap destroyed our staging environment twice during node maintenance.

| Network Equipment | Observed Failover Time |
| --- | --- |
| Our Cisco switches | 15-25 seconds |
| Generic enterprise switches | 30-45 seconds |
| Older switches | 60-120 seconds |

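For SLA math:

  • 15 seconds downtime per node failure

  • 4 planned maintenances/year = 60 seconds

  • 99.99% over a full year allows about 52.6 minutes, so these 60 seconds fit. Measured per day, the same 99.99% allows only ~8.6 seconds, and 99.999% per month allows ~26 seconds, so a single 15-30 second failover breaks either budget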
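
Downtime budgets are worth computing rather than remembering. This small sketch assumes a 365-day year and a 30-day month; `downtime_budget_s` is a helper name invented here.

```shell
# Allowed downtime in seconds for an availability target over a period.
# args: availability_percent period_seconds
downtime_budget_s() {
  awk -v a="$1" -v p="$2" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * p }'
}

YEAR_S=31536000    # 365 days
MONTH_S=2592000    # 30 days

echo "99.99%  per year:  $(downtime_budget_s 99.99 "$YEAR_S") s"
echo "99.99%  per month: $(downtime_budget_s 99.99 "$MONTH_S") s"
echo "99.999% per month: $(downtime_budget_s 99.999 "$MONTH_S") s"
```

Compare the output with your measured failover time multiplied by the number of node failures and maintenances you expect in the same window.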


❌ Problem 4: ARP Noise at Scale

Every service change triggers ARP broadcasts.

Our staging cluster:

  • 12 LoadBalancer services

  • 5 nodes

  • ~20 pod migrations per day (autoscaling, deployments)

Result: ~50 ARP announcements per day

Calculation for 50 services across 20 nodes:

  • 3 speaker changes/service/day (rolling updates)

  • = 150 ARP broadcasts/day

  • = ~6 per hour

Modern switches handle this, but it adds:

  • Control plane CPU load

  • Log noise

  • Unpredictable behavior during storms


❌ Problem 5: Security Blocks

Many environments treat gratuitous ARP as suspicious (it's a classic spoofing attack).

Blocked in:

  • AWS/GCP/Azure (all cloud environments)

  • Enterprise networks with ARP inspection

  • Firewalls with strict MAC filtering

We learned this the hard way: Our client's enterprise network rejected MetalLB L2 entirely due to security policies.


✅ When L2 Mode Actually Works

Don't overthink it if your setup looks like this:

Nodes: 3-5
Services: < 10
Network: Single subnet
Environment: Dev/staging
Downtime tolerance: 30-60 seconds OK

Perfect for:

  • MVP/prototype clusters

  • Internal developer platforms

  • Testing environments

  • Small teams with no network engineer


5. MetalLB BGP Mode: The Grown-Up Solution

What Changes?

Instead of broadcasting (ARP), nodes talk to routers using BGP.

The flow:

  1. MetalLB establishes BGP sessions with your router

  2. Advertises: "To reach 10.50.100.10, route to Node2 (10.50.0.5)"

  3. Router updates its routing table

  4. Traffic gets routed via Layer 3

Client → Router → Best Node → kube-proxy → Pod

My Production Config:

MetalLB side:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.50.0.1  # Router IP
      peer-asn: 64500           # Router's AS
      my-asn: 64501             # MetalLB's AS
    address-pools:
    - name: production-pool
      protocol: bgp
      addresses:
      - 10.50.100.100-10.50.100.200
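
As with the staging config, the ConfigMap above is the pre-v0.13 format. On current MetalLB releases the same intent is split across three custom resources. A sketch, with the same ASNs, peer address and range as above; the file name and the `core-router` and `production-bgp` resource names are placeholders.

```shell
# Write the CRD-style equivalent of the BGP config to a file.
cat > metallb-bgp-production.yaml <<'EOF'
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: core-router         # placeholder name
  namespace: metallb-system
spec:
  myASN: 64501              # MetalLB's AS
  peerASN: 64500            # Router's AS
  peerAddress: 10.50.0.1    # Router IP
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.50.100.100-10.50.100.200
---
# Ties the pool to BGP announcement; without it nothing is advertised.
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-bgp      # placeholder name
  namespace: metallb-system
spec:
  ipAddressPools:
  - production-pool
EOF

# kubectl apply -f metallb-bgp-production.yaml
```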

Router side (VyOS example):

set protocols bgp 64500 neighbor 10.50.0.3 remote-as 64501
set protocols bgp 64500 neighbor 10.50.0.4 remote-as 64501
set protocols bgp 64500 neighbor 10.50.0.5 remote-as 64501

Setup time: ~2 hours (including router access approval from network team)


6. Why BGP Scales: The Technical Advantages

✅ Multi-Path Load Balancing (ECMP)

Multiple nodes can advertise the same IP:

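Router sees:
  10.50.100.10 → Node1 (10.50.0.3)
  10.50.100.10 → Node2 (10.50.0.4)
  10.50.100.10 → Node3 (10.50.0.5)

Router distributes traffic across all three, provided BGP multipath (ECMP) is enabled on it; many routers install only a single best path by default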

Our production results:

| Metric | Layer 2 (Staging) | BGP (Production) |
| --- | --- | --- |
| Inbound paths | 1 node | 12 nodes |
| Max throughput | ~80 MB/s | ~960 MB/s (12x) |
| Bottleneck | Speaker NIC | Application logic |

We went from NIC-limited to application-limited. That's the difference.


✅ Faster Failover (Seconds, Not Tens of Seconds)

BGP uses keepalive timers to detect failures:

Our tuned config:
  Keepalive: 3 seconds
  Hold time: 9 seconds

Failure timeline:

T+0s:   Node dies
T+9s:   Router detects (missed 3 keepalives)
T+9.2s: Router removes routes
T+9.5s: Traffic flows to healthy nodes

Measured failover: 5-10 seconds (vs 30-60 seconds with ARP)
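
Why a range rather than a flat 9 seconds? The router restarts its hold timer every time a keepalive arrives, so how long detection takes depends on when in the keepalive cycle the node died. A tiny sketch of that reasoning (the function name is invented for illustration):

```shell
# Detection window after a node dies, in seconds.
# Best case: node dies just before its next keepalive -> hold - keepalive.
# Worst case: node dies right after sending one       -> hold.
# args: keepalive_s hold_s
bgp_detection_window() {
  echo "$(( $2 - $1 ))-$2"
}

echo "Detection window: $(bgp_detection_window 3 9) s"
```

With the 3 s / 9 s timers above that gives a 6 to 9 second window before route withdrawal, which lines up roughly with the 5-10 seconds we measured. In the CRD-style config these timers are, as far as I know, the `keepaliveTime` and `holdTime` fields on the BGPPeer resource.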


✅ Cross-VLAN/Subnet Support

BGP works across:

  • Multiple VLANs

  • Different subnets

  • WAN links

  • Routed networks

Our actual topology:

Kubernetes cluster: 10.50.0.0/24
Office network:      192.168.1.0/24
Remote site:         172.16.0.0/16
Client network:      203.x.x.x/29 (via firewall)

With BGP, all networks reach services via standard IP routing.

With ARP, we'd need VPNs or proxies for every remote network.


✅ Zero Broadcast Noise

BGP uses point-to-point TCP sessions (port 179).

No broadcasts. No ARP storms. No switch overhead.

Scaling characteristics:

| Services | Nodes | ARP Packets/Hour (L2) | BGP Updates (BGP) |
| --- | --- | --- | --- |
| 10 | 5 | ~30 | 0 (steady state) |
| 50 | 20 | ~200 | 0 (steady state) |
| 100 | 50 | ~500 | 0 (steady state) |

BGP updates only during actual changes (new service, scaling events).


✅ Production Observability

On router:

$ show ip bgp summary
Neighbor    AS      State         Uptime
10.50.0.3   64501   Established   12d 4h
10.50.0.4   64501   Established   12d 4h
10.50.0.5   64501   Established   3h 22m  # Recently restarted

$ show ip bgp 10.50.100.10
Network        Next Hop      Metric   Status
10.50.100.10   10.50.0.3     0        Active
10.50.100.10   10.50.0.4     0        Active
10.50.100.10   10.50.0.5     0        Active

In Kubernetes:

$ kubectl logs -n metallb-system speaker-xyz | grep BGP
level=info msg="BGP session established" peer=10.50.0.1

With L2, you get... nothing. Just hope ARP works.


7. "But BGP is Complex/Expensive/Public" (Myths Destroyed)

Myth 1: BGP Requires Enterprise Routers

False.

| Router | Cost | BGP Support |
| --- | --- | --- |
| Cisco Catalyst | $5,000+ | Yes |
| Mikrotik RouterBOARD | $200-800 | Yes |
| VyOS (VM) | Free | Yes |
| pfSense + FRR | Free | Yes |

We use VyOS (free) in staging and Mikrotik (~$600) in production.


Myth 2: BGP Means Public Internet

Absolutely not.

BGP advertises reachability, not exposure.

Our setup:

  • IPs: 10.50.100.0/24 (RFC1918 private)

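  • AS: 64501 (note: this sits in the documentation range 64496-64511; the 16-bit private range is 64512-65534, so pick from there for a new deployment)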

  • Peers: Internal routers only

No external exposure. BGP is internal routing.
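
One thing worth checking before you copy AS numbers from any guide, this one included: 16-bit ASNs 64496-64511 are reserved for documentation (RFC 5398), while 64512-65534 are the private-use range (RFC 6996). A quick classifier, limited to those two 16-bit ranges; `asn_class` is a name made up for this sketch.

```shell
# Classify a 16-bit AS number against the two reserved ranges that matter here.
asn_class() {
  asn=$1
  if [ "$asn" -ge 64496 ] && [ "$asn" -le 64511 ]; then
    echo "documentation"      # RFC 5398: examples only
  elif [ "$asn" -ge 64512 ] && [ "$asn" -le 65534 ]; then
    echo "private"            # RFC 6996: safe for internal peering
  else
    echo "other"              # public or otherwise reserved
  fi
}

for asn in 64500 64501 64512 65000; do
  echo "$asn: $(asn_class "$asn")"
done
```

Nothing stops you from running documentation-range ASNs on a closed network, but the private range is the conventional choice.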


Myth 3: You Need Deep BGP Knowledge

What you actually need:

  • Understand what an AS number is (2 minutes)

  • One neighbor line of router config per node

  • How to check show ip bgp summary

What you don't need:

  • BGP path selection algorithms

  • AS-path manipulation

  • BGP communities

For MetalLB, BGP is simple—you're not building the internet.


8. Quick Mental Model

Think of it like giving directions:

Layer 2 (ARP):

  • Shouting "I'm at 123 Main Street!" to everyone in the neighborhood

  • Works great in a small neighborhood

  • Chaos in a city

BGP:

  • Updating Google Maps: "123 Main Street is here"

  • Google routes everyone correctly

  • Scales to millions of addresses

That's the fundamental difference.

| Aspect | Layer 2 | BGP |
| --- | --- | --- |
| Announcement | Broadcast (everyone hears) | Unicast (router only) |
| Failure handling | Network relearns (30-60s) | Router updates (5-10s) |
| Scale limit | Network broadcasts | Routing table size |
| Production use | Small clusters only | Enterprise standard |

9. Decision Framework (Pin This)

IF
  single subnet
  AND < 5 nodes
  AND < 10 services
  AND downtime tolerance > 30 seconds
THEN
  Layer 2 is acceptable

ELSE IF
  production environment
  OR multi-rack/multi-VLAN
  OR > 10 nodes
  OR > 20 services
  OR SLA > 99.9%
THEN
  Use BGP from day 1

ELSE
  Start with Layer 2
  Plan for BGP migration within 6 months
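
The same rules as a shell function, in case you want to drop them into a runbook or a pre-flight script. This is a literal transcription of the framework above, with the branches evaluated in the same order; the function name, argument order and output labels are my own choices for this sketch.

```shell
# args: subnets nodes services downtime_tolerance_s production(yes|no) sla_percent
choose_metallb_mode() {
  subnets=$1 nodes=$2 services=$3 tolerance_s=$4 production=$5 sla=$6

  if [ "$subnets" -eq 1 ] && [ "$nodes" -lt 5 ] && [ "$services" -lt 10 ] \
      && [ "$tolerance_s" -gt 30 ]; then
    echo "layer2"
  elif [ "$production" = "yes" ] || [ "$subnets" -gt 1 ] || [ "$nodes" -gt 10 ] \
      || [ "$services" -gt 20 ] || awk -v s="$sla" 'BEGIN { exit !(s > 99.9) }'; then
    echo "bgp"
  else
    # In-between clusters: start simple, schedule the migration.
    echo "layer2-then-migrate"
  fi
}

choose_metallb_mode 1 3 8 60 no 99.0      # small dev cluster
choose_metallb_mode 1 12 28 10 yes 99.95  # our production profile
choose_metallb_mode 1 8 15 20 no 99.5     # the in-between case
```

The SLA comparison goes through awk because shell arithmetic is integer-only.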

10. Scale Thresholds (Real Numbers)

Based on our production experience + research:

| Cluster Profile | Nodes | Services | Network | Mode | Rationale |
| --- | --- | --- | --- | --- | --- |
| Small | 1-5 | < 10 | Single subnet | L2 OK | Failover acceptable, simple setup |
| Medium | 5-20 | 10-50 | Single/multi-rack | BGP preferred | L2 starts showing cracks |
| Large | 20-50 | 50-100 | Multi-VLAN/site | BGP mandatory | L2 fundamentally broken |
| Enterprise | 50+ | 100+ | Complex routed | BGP only | L2 won't work |

Our actual journey:

  • Staging (L2): 5 nodes, 12 services → works, but 15-25s failover hurts

  • Production (BGP): 12 nodes, 28 services → 5-10s failover, zero issues


11. Migration Path: L2 → BGP (Proven Strategy)

Good news: No downtime required.

Phase 1: Dual-Mode Config

address-pools:
- name: legacy-l2
  protocol: layer2
  addresses:
  - 10.50.100.10-10.50.100.50

- name: new-bgp
  protocol: bgp
  addresses:
  - 10.50.100.100-10.50.100.200

peers:
- peer-address: 10.50.0.1
  peer-asn: 64500
  my-asn: 64501

Phase 2: Service-by-Service Migration

For new services: Use BGP pool (automatic)

For existing services:

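# 1. Create a second Service for the same deployment, pinned to the BGP pool.
#    Without the address-pool annotation MetalLB may allocate from either pool.
kubectl expose deployment app --type=LoadBalancer \
  --name=app-bgp --port=443 --dry-run=client -o yaml \
  | kubectl annotate --local -f - \
      metallb.universe.tf/address-pool=new-bgp -o yaml \
  | kubectl apply -f -

# 2. Update DNS to the new BGP address, e.g.
#    app.example.com → 10.50.100.101

# 3. Wait for the old record's TTL to expire (24-48 hours in our case)

# 4. Delete the old L2 service
kubectl delete svc app-l2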

Phase 3: Complete Cutover

Remove L2 pool once all services migrated.

Our timeline: 3 weeks for 12 services (low-priority, phased approach)


12. Troubleshooting (Lessons from Production)

Issue 1: BGP Session Won't Establish

$ kubectl logs -n metallb-system speaker-xyz
level=error msg="BGP session failed" error="connection refused"

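Fix: Something is blocking or refusing TCP port 179. First confirm the router has a neighbor entry for this node's IP, then check host firewalls: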

# On each node
sudo iptables -A INPUT -p tcp --dport 179 -s 10.50.0.1 -j ACCEPT

Issue 2: Service Gets IP But Unreachable

$ kubectl get svc
NAME   EXTERNAL-IP      PORT(S)
app    10.50.100.10     443:31234/TCP

$ curl -k https://10.50.100.10   # port 443, as exposed by the service
# Timeout

Debug checklist:

# 1. Check MetalLB speaker logs
kubectl logs -n metallb-system speaker-xyz

# 2. Verify BGP routes on router
show ip bgp 10.50.100.10

# 3. Check if pods exist and are ready
kubectl get pods -l app=myapp
kubectl describe svc app  # Check endpoints

# 4. Test from node directly
curl -k https://localhost:31234  # NodePort from the PORT(S) column; should work

Common causes:

  • Pods not ready (missing readiness probes)

  • Router not advertising to client networks

  • Firewall between router and clients


Issue 3: Intermittent Connectivity

Traffic works 50% of the time.

Root cause: ECMP distributing to node with unready pod.

Fix: Proper readiness probes

spec:
  containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 3

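With externalTrafficPolicy: Local, MetalLB only advertises a service from nodes that have a ready pod for it. With the default Cluster policy every node advertises, and kube-proxy forwards to a ready pod wherever it runs.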
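
Which nodes advertise depends on the Service's `externalTrafficPolicy`, so it is worth setting explicitly. Here is a sketch of a Service that opts in to `Local`, assuming the `app=myapp` label from the debug checklist and the container port 8080 from the probe above; the file name is a placeholder.

```shell
# Write a Service manifest that limits announcements to nodes with ready pods.
cat > app-service-local.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # only nodes with a ready local pod announce
  selector:
    app: myapp                   # assumed label
  ports:
  - port: 443
    targetPort: 8080             # assumed container port
EOF

# kubectl apply -f app-service-local.yaml
```

The trade-off: `Local` preserves the client source IP and avoids an extra hop, but traffic is balanced per node, not per pod, so uneven pod placement means uneven load.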


13. The Honest Recommendation

For 80% of production clusters: Use BGP.

Why?

  1. Clusters grow unpredictably
    You plan for 5 nodes. 8 months later you have 20.

  2. BGP setup is one-time effort
    2-4 hours upfront. Saves painful migration later.

  3. L2 fails silently
    Works fine until scale hits. Then suddenly doesn't.

  4. Network teams already know BGP
    It's standard practice, not a Kubernetes oddity.

Exception: Use L2 if:

  • Genuinely small cluster (< 5 nodes, < 10 services)

  • Dev/staging only

  • Hard blocker on router access

  • Prototyping/MVP phase

Then plan BGP migration before you hit 10 services.


14. Pre-Deployment Checklist

Before deploying MetalLB:

Network:

  • [ ] Dedicated IP range allocated (minimum /28 = 16 addresses, 14 usable if carved out as its own subnet)

  • [ ] Single subnet or multi-subnet? (Multi = BGP required)

  • [ ] VLANs or flat network?

  • [ ] Switch vendor/model documented

Scale:

  • [ ] Current node count: ___

  • [ ] 12-month projection: ___

  • [ ] Current service count: ___

  • [ ] 12-month projection: ___

BGP Readiness:

  • [ ] BGP-capable router available? (Check: Mikrotik, VyOS, pfSense, Cumulus)

  • [ ] Network team approval process understood

  • [ ] Private AS number available (64512-65534 range)

  • [ ] Router config change lead time: ___

Decision:

  • [ ] Layer 2 chosen (document migration trigger)

  • [ ] BGP chosen (router config scheduled)
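
For the "Dedicated IP range" item, here is a quick way to count how many addresses a MetalLB-style `start-end` range actually gives you. It is a sketch for IPv4 ranges only, with helper names invented here.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the address into four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Count addresses in an inclusive "start-end" range.
pool_size() {
  range_start=${1%-*}
  range_end=${1#*-}
  echo $(( $(ip_to_int "$range_end") - $(ip_to_int "$range_start") + 1 ))
}

echo "staging pool:    $(pool_size 10.50.100.10-10.50.100.50) addresses"
echo "production pool: $(pool_size 10.50.100.100-10.50.100.200) addresses"
```

Each LoadBalancer service consumes one address, so compare the result against your 12-month service projection rather than today's count.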


15. Quick Reference Tables

Comparison Matrix:

| Feature | Layer 2 (ARP) | BGP |
| --- | --- | --- |
| Setup complexity | ⭐ Simple | ⭐⭐⭐ Moderate |
| Router config needed | ❌ No | ✅ Yes |
| Failover time | 30-60s | 5-10s |
| Cross-VLAN support | ❌ No | ✅ Yes |
| Multi-node load balancing | ❌ No (single speaker) | ✅ Yes (ECMP) |
| Broadcast overhead | ⚠️ Yes | ✅ No |
| Production-ready | ⚠️ Small scale only | ✅ Yes |
| Cloud compatibility | ❌ No | ⚠️ Only where the network offers a BGP peer |

Cost Comparison:

| Approach | Hardware Cost | Setup Time | Operational Risk |
| --- | --- | --- | --- |
| L2 (staging) | $0 | 30 minutes | Medium (failover delays) |
| BGP (cheap router) | $200-600 (Mikrotik) | 2-4 hours | Low |
| BGP (enterprise) | $3,000+ | 1-2 days | Very low |
| BGP (virtual) | $0 (VyOS/pfSense) | 2-4 hours | Low |


17. Final Thought

MetalLB isn't magic. ARP isn't evil. BGP isn't scary.

They're tools—and scale decides which one hurts less.

Design simple. Upgrade deliberately.

The best time to set up BGP was at deployment.
The second-best time is now.


What's Next?

This article helped you decide L2 vs BGP. Now you need to deploy.

Upcoming in this series:

  • Part 2: Kubernetes control plane HA - Keepalived vs Kube-VIP (coming next week)

  • Part 3: Production deployment with Kubespray + Kube-VIP + MetalLB (coming soon)

Follow me for the next parts!


Questions? Real production experiences to share? Drop them in the comments. I spent 8 months figuring this out for our energy platform, and I'm happy to discuss specifics.

If this helped you avoid an outage, share it with your team.
