📘 Production On-Prem Kubernetes Series - Part 1

This is Part 1 of my series on deploying production-grade Kubernetes clusters on-premises. This article covers the critical MetalLB architecture decision you need to make BEFORE deployment.

Series outline:

Part 1: MetalLB L2 vs BGP Decision Guide (you are here)

Part 2: Control Plane HA - Keepalived vs Kube-VIP (coming next)

Part 3: Full Deployment with Kubespray + Kube-VIP

MetalLB Layer 2 vs BGP: The Production Decision Guide You Actually Need

Context: I'm running Kubernetes for an energy trading platform in India. Last year, we faced a decision: MetalLB Layer 2 or BGP? Here's what I learned after deploying both in production.

What this covers:

Why MetalLB exists (the real problem)
How L2 (ARP) actually works—and where it breaks
How BGP actually works—and why it scales
Real configs, real thresholds, real decisions
When to use each (with honest trade-offs)

Read time: 12 minutes that save you months

1. The Problem MetalLB Solves (Not the Marketing Version)

In cloud Kubernetes, this works instantly:

apiVersion: v1
kind: Service
metadata:
  name: webapp
spec:
  type: LoadBalancer
  ports:
  - port: 443

AWS gives you an ELB. GCP gives you a Load Balancer. Done.

On-prem?

$ kubectl get svc webapp
NAME     TYPE           EXTERNAL-IP   PORT(S)
webapp   LoadBalancer   <pending>     443:32156/TCP

That <pending> means no controller in the cluster implements LoadBalancer services, so nothing ever assigns an external IP.

You need three things:

IP allocation - Where does the external IP come from?
IP announcement - How do clients find this IP?
Traffic routing - How do packets reach the right node?

MetalLB solves all three.

But HOW it solves them creates entirely different architectures.

2. Two Ways MetalLB Announces IPs

MetalLB must answer one question:

"When traffic comes for this IP, who should receive it?"

There are two fundamentally different answers:

Layer 2 (ARP): Broadcasting "I'm here!" to everyone
BGP: Telling the router "Route to me" directly

That fundamental difference drives everything else.

Let's break down each approach properly.

3. MetalLB Layer 2 Mode: The Mechanics

How It Works

MetalLB uses ARP (Address Resolution Protocol) to announce IPs.

The flow:

MetalLB assigns IP 192.168.1.100 to your service
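Every node runs a MetalLB speaker; one of them is elected to own the IP (per-service leader election)
That speaker sends a gratuitous ARP announcement: "192.168.1.100 is at MAC aa:bb:cc:dd:ee:ff"
Switches relearn where that MAC lives, and hosts on the subnet update their ARP caches
All traffic for that IP flows to that one speaker node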

Client → Switch → Speaker Node → kube-proxy → Pod
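To make the "one owner per IP" idea concrete, here is a toy sketch of a deterministic election. It is an illustration only, not MetalLB's actual implementation: the function name `elect_owner`, the node names, and the choice of SHA-256 are all my assumptions. The point is that every node can compute the same winner independently, and that a new winner appears as soon as the old one drops out of the healthy set.

```python
import hashlib


def elect_owner(service_ip, healthy_nodes):
    """Pick one node to announce service_ip (toy model, not MetalLB's code).

    Every node hashes (node name, service IP) and the smallest hash wins,
    so all nodes agree on the owner without talking to each other.
    """
    if not healthy_nodes:
        raise ValueError("no healthy nodes left to announce the IP")
    return min(
        healthy_nodes,
        key=lambda node: hashlib.sha256(f"{node}#{service_ip}".encode()).hexdigest(),
    )


nodes = ["node1", "node2", "node3"]  # hypothetical node names
owner = elect_owner("192.168.1.100", nodes)

# Simulate the owner crashing: the survivors elect a replacement.
survivors = [n for n in nodes if n != owner]
new_owner = elect_owner("192.168.1.100", survivors)
print(f"owner={owner}, after failure={new_owner}")
```

Whichever node wins, it is still a single node per IP, which is what causes the bottleneck and failover behavior described next.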

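My Staging Config (legacy ConfigMap format, for MetalLB releases before v0.13; newer releases express the same pool as IPAddressPool and L2Advertisement resources):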

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: staging-pool
      protocol: layer2
      addresses:
      - 10.50.100.10-10.50.100.50

Applied it, worked instantly. No router configuration needed.

That simplicity is intoxicating—and dangerous.

4. Where ARP Breaks: The Deep Dive

Let's go beyond "it doesn't scale" and understand precisely why.

❌ Problem 1: The Broadcast Domain Wall

ARP only works within a single broadcast domain (one subnet/VLAN).

Real scenario from our setup:

Kubernetes cluster: 10.50.0.0/24
Office network:     192.168.1.0/24
Client network:     172.16.0.0/16

With Layer 2 mode, only clients in 10.50.0.0/24 can reach services directly.

Others need:

Proxy servers
VPN tunnels
Manual routing rules

Why? ARP is a link-layer broadcast, and routers do not forward broadcasts between subnets. That is how Layer 2 is designed.

❌ Problem 2: Single Speaker Bottleneck

Only one node owns each IP at a time.

Traffic pattern:

LoadBalancer IP: 10.50.100.10
Pods distributed across: Node1, Node2, Node3
ALL inbound traffic enters through: Node2 (speaker)

Real throughput limits:

NIC Speed	Theoretical Max	Real-World Max	Your Bottleneck
1 Gbps	125 MB/s	~90 MB/s	Single node NIC
10 Gbps	1.25 GB/s	~900 MB/s	Single node NIC

Even with 10 nodes and 100 Gbps total capacity, you're limited to one node's NIC.

We hit this at ~80 MB/s sustained during peak traffic hours (energy meter data ingestion).
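The arithmetic behind that ceiling is simple enough to sketch. The helper names below are mine, and the numbers are theoretical line rates, not the real-world figures from the table; treat it as a back-of-the-envelope model.

```python
def line_rate_mb_per_s(nic_gbps):
    """Convert a NIC line rate in Gbps to MB/s (decimal units, 8 bits per byte)."""
    return nic_gbps * 1000 / 8


def ingress_ceiling_mb_per_s(node_count, nic_gbps, announcing_nodes):
    """Theoretical inbound ceiling for one service IP.

    Only nodes that announce the IP receive inbound traffic for it,
    so the ceiling is bounded by how many nodes announce, not by cluster size.
    """
    return min(node_count, announcing_nodes) * line_rate_mb_per_s(nic_gbps)


# 10 nodes with 10 Gbps NICs each
print(ingress_ceiling_mb_per_s(10, 10, announcing_nodes=1))   # 1250.0 (Layer 2: one speaker)
print(ingress_ceiling_mb_per_s(10, 10, announcing_nodes=10))  # 12500.0 (every node announcing)
```

With Layer 2 the `announcing_nodes` value is always 1 per IP, no matter how many nodes you add.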

❌ Problem 3: The Failover Gap (SLA Killer)

When the speaker node dies, here's the timeline:

T+0s:   Node crashes
T+2s:   MetalLB detects failure
T+3s:   New speaker elected
T+4s:   New speaker sends gratuitous ARP
T+4-30s: Switches relearn the MAC; routers and hosts refresh their ARP caches (vendor-dependent)
T+30s:  Traffic restored

That 30-second gap destroyed our staging environment twice during node maintenance.

Network Equipment	Observed Failover Time
Our Cisco switches	15-25 seconds
Generic enterprise switches	30-45 seconds
Older switches	60-120 seconds

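For SLA math:

15 seconds downtime per node failure
4 planned maintenances/year = 60 seconds
A 99.99% SLA allows about 52 minutes of downtime per year, so this fits; a 99.999% SLA allows only about 5 minutes (315 seconds), and these 60 seconds spend roughly a fifth of it before a single unplanned failure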
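If you want to check a downtime budget yourself, the calculation is a one-liner. This is a minimal sketch assuming a 365-day year; the function name is mine, and the 15-second and 4-maintenance figures are the ones from the list above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000


def downtime_budget_seconds(availability_percent):
    """Seconds of downtime per year allowed by an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_seconds(target)
    print(f"{target}% -> {budget:.0f} s/year ({budget / 60:.1f} min)")

failover_seconds = 15        # per node failure, from the list above
maintenances_per_year = 4
planned_downtime = failover_seconds * maintenances_per_year
print(f"planned maintenance alone: {planned_downtime} s/year")
```

Compare `planned_downtime` against the budget for your own target, and remember that unplanned failures come on top of it.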

❌ Problem 4: ARP Noise at Scale

Every change of the announcing node for a service IP triggers gratuitous ARP broadcasts.

Our staging cluster:

12 LoadBalancer services
5 nodes
~20 pod migrations per day (autoscaling, deployments)

Result: ~50 ARP announcements per day

Calculation for 50 services across 20 nodes:

50 services × 3 speaker changes/service/day (rolling updates)
= 150 ARP broadcasts/day
= ~6 per hour

Modern switches handle this, but it adds:

Control plane CPU load
Log noise
Unpredictable behavior during storms

❌ Problem 5: Security Blocks

Many environments treat gratuitous ARP as suspicious, because ARP spoofing attacks look the same on the wire.

Blocked in:

AWS/GCP/Azure (all cloud environments)
Enterprise networks with ARP inspection
Firewalls with strict MAC filtering

We learned this the hard way: Our client's enterprise network rejected MetalLB L2 entirely due to security policies.

✅ When L2 Mode Actually Works

Don't overthink it if your setup looks like this:

Nodes: 3-5
Services: < 10
Network: Single subnet
Environment: Dev/staging
Downtime tolerance: 30-60 seconds OK

Perfect for:

MVP/prototype clusters
Internal developer platforms
Testing environments
Small teams with no network engineer

5. MetalLB BGP Mode: The Grown-Up Solution

What Changes?

Instead of broadcasting (ARP), nodes talk to routers using BGP.

The flow:

Each node's MetalLB speaker establishes a BGP session with your router
Advertises: "To reach 10.50.100.10, route to Node2 (10.50.0.4)"
Router updates its routing table
Traffic gets routed via Layer 3

Client → Router → Best Node → kube-proxy → Pod

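My Production Config (legacy ConfigMap format, for MetalLB releases before v0.13; newer releases use BGPPeer, IPAddressPool and BGPAdvertisement resources instead):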

MetalLB side:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 10.50.0.1  # Router IP
      peer-asn: 64500           # Router's AS (example value)
      my-asn: 64501             # MetalLB's AS (example value; use 64512-65534 for a real private AS)
    address-pools:
    - name: production-pool
      protocol: bgp
      addresses:
      - 10.50.100.100-10.50.100.200

Router side (VyOS example):

set protocols bgp 64500 neighbor 10.50.0.3 remote-as 64501
set protocols bgp 64500 neighbor 10.50.0.4 remote-as 64501
set protocols bgp 64500 neighbor 10.50.0.5 remote-as 64501

Setup time: ~2 hours (including router access approval from network team)

6. Why BGP Scales: The Technical Advantages

✅ Multi-Path Load Balancing (ECMP)

Multiple nodes can advertise the same IP:

Router sees:
  10.50.100.10 → Node1 (10.50.0.3)
  10.50.100.10 → Node2 (10.50.0.4)
  10.50.100.10 → Node3 (10.50.0.5)

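Router distributes traffic across all three (BGP multipath/ECMP must be enabled on the router; by default most routers install only one best path)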
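How does the router decide which node gets which packet? Typically it hashes each flow's addresses and ports and uses the result to pick a next hop, so all packets of one connection land on the same node. The sketch below illustrates that idea only: real routers use their own vendor-specific hash functions, and `pick_next_hop`, the SHA-256 choice, and the client addresses are my assumptions.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Choose a next hop for a flow by hashing its 5-tuple.

    The same flow always maps to the same next hop, which keeps
    a TCP connection pinned to one node.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


next_hops = ["10.50.0.3", "10.50.0.4", "10.50.0.5"]

# 300 hypothetical client flows: (src IP, src port, dst IP, dst port, protocol)
flows = [(f"172.16.0.{i % 250 + 1}", 40000 + i, "10.50.100.10", 443, "tcp")
         for i in range(300)]

counts = {hop: 0 for hop in next_hops}
for flow in flows:
    counts[pick_next_hop(flow, next_hops)] += 1
print(counts)
```

One consequence worth knowing: when the set of next hops changes (a node joins or leaves), a simple modulo hash like this remaps many existing flows, so long-lived connections can be reset during node changes.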

Our production results:

Metric	Layer 2 (Staging)	BGP (Production)
Inbound paths	1 node	12 nodes
Max throughput	~80 MB/s	~960 MB/s (12x)
Bottleneck	Speaker NIC	Application logic

We went from NIC-limited to application-limited. That's the difference.

✅ Faster, Tunable Failover

BGP uses keepalive timers to detect failures:

Our tuned config:
  Keepalive: 3 seconds
  Hold time: 9 seconds

Failure timeline:

T+0s:   Node dies
T+9s:   Router detects (missed 3 keepalives)
T+9.2s: Router removes routes
T+9.5s: Traffic flows to healthy nodes

Measured failover: 5-10 seconds (vs 30-60 seconds with ARP)
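The detection time follows directly from the two timers. A rough model, assuming the router restarts its hold timer on every keepalive it receives and ignoring route-processing time (the function name is mine):

```python
def detection_window(keepalive_s, hold_s):
    """Return (best, worst) seconds from node death to hold-timer expiry.

    Worst case: the node dies right after sending a keepalive, so the
    router waits the full hold time. Best case: it dies just before the
    next keepalive was due, so the timer has already run for keepalive_s.
    """
    if hold_s <= keepalive_s:
        raise ValueError("hold time must be longer than the keepalive interval")
    return hold_s - keepalive_s, hold_s


print(detection_window(3, 9))  # (6, 9)
```

With our 3s/9s timers that gives a 6-9 second detection window, which lines up with the 5-10 seconds we measured.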

✅ Cross-VLAN/Subnet Support

BGP works across:

Multiple VLANs
Different subnets
WAN links
Routed networks

Our actual topology:

Kubernetes cluster: 10.50.0.0/24
Office network:      192.168.1.0/24
Remote site:         172.16.0.0/16
Client network:      203.x.x.x/29 (via firewall)

With BGP, all networks reach services via standard IP routing.

With ARP, we'd need VPNs or proxies for every remote network.

✅ Zero Broadcast Noise

BGP uses point-to-point TCP sessions (port 179).

No broadcasts. No ARP storms. No switch overhead.

Scaling characteristics:

Services	Nodes	ARP Packets/Hour (L2)	BGP Updates (BGP)
10	5	~30	0 (steady state)
50	20	~200	0 (steady state)
100	50	~500	0 (steady state)

BGP updates only during actual changes (new service, scaling events).

✅ Production Observability

On router:

$ show ip bgp summary
Neighbor    AS      State         Uptime
10.50.0.3   64501   Established   12d 4h
10.50.0.4   64501   Established   12d 4h
10.50.0.5   64501   Established   3h 22m  # Recently restarted

$ show ip bgp 10.50.100.10
Network        Next Hop      Metric   Status
10.50.100.10   10.50.0.3     0        Active
10.50.100.10   10.50.0.4     0        Active
10.50.100.10   10.50.0.5     0        Active

In Kubernetes:

$ kubectl logs -n metallb-system speaker-xyz | grep BGP
level=info msg="BGP session established" peer=10.50.0.1

With L2, you get... nothing. Just hope ARP works.

7. "But BGP is Complex/Expensive/Public" (Myths Destroyed)

Myth 1: BGP Requires Enterprise Routers

False.

Router	Cost	BGP Support
Cisco Catalyst	$5,000+	Yes
Mikrotik RouterBOARD	$200-800	Yes ✅
VyOS (VM)	Free	Yes ✅
pfSense + FRR	Free	Yes ✅

We use VyOS (free) in staging and Mikrotik (~$600) in production.

Myth 2: BGP Means Public Internet

Absolutely not.

BGP advertises reachability, not exposure.

Our setup:

IPs: 10.50.100.0/24 (RFC1918 private)
AS: 64501 in these examples (a documentation-reserved number; for a real deployment pick from the private range 64512-65534)
Peers: Internal routers only

No external exposure. BGP is internal routing.
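Picking a safe AS number is the one place people slip. Here is a small helper that classifies an ASN using the ranges from RFC 6996 (private use), RFC 5398 (documentation), and the reserved values. The function name is mine; the ranges are from the RFCs.

```python
def classify_asn(asn):
    """Classify an AS number as 'private', 'documentation', 'reserved' or 'public'."""
    if not 0 <= asn <= 4294967295:
        raise ValueError("AS numbers are 32-bit unsigned integers")
    if 64496 <= asn <= 64511 or 65536 <= asn <= 65551:
        return "documentation"   # RFC 5398: for examples and docs only
    if 64512 <= asn <= 65534 or 4200000000 <= asn <= 4294967294:
        return "private"         # RFC 6996: safe for internal use
    if asn in (0, 23456, 65535, 4294967295):
        return "reserved"
    return "public"              # needs to be assigned by a registry


for asn in (64500, 64501, 64512, 65534, 65535):
    print(asn, classify_asn(asn))
```

For an internal MetalLB deployment you want both the router's AS and MetalLB's AS to come back as "private".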

Myth 3: You Need Deep BGP Knowledge

What you actually need:

Understand what an AS number is (2 minutes)
3 lines of router config per node
How to check show ip bgp summary

What you don't need:

BGP path selection algorithms
AS-path manipulation
BGP communities

For MetalLB, BGP is simple—you're not building the internet.

8. Quick Mental Model

Think of it like giving directions:

Layer 2 (ARP):

Shouting "I'm at 123 Main Street!" to everyone in the neighborhood
Works great in a small neighborhood
Chaos in a city

BGP:

Updating Google Maps: "123 Main Street is here"
Google routes everyone correctly
Scales to millions of addresses

That's the fundamental difference.

Aspect	Layer 2	BGP
Announcement	Broadcast (everyone hears)	Unicast (router only)
Failure handling	Network relearns (30-60s)	Router updates (5-10s)
Scale limit	Network broadcasts	Routing table size
Production use	Small clusters only	Enterprise standard

9. Decision Framework (Pin This)

IF
  single subnet
  AND < 5 nodes
  AND < 10 services
  AND downtime tolerance > 30 seconds
THEN
  Layer 2 is acceptable

ELSE IF
  production environment
  OR multi-rack/multi-VLAN
  OR > 10 nodes
  OR > 20 services
  OR SLA > 99.9%
THEN
  Use BGP from day 1

ELSE
  Start with Layer 2
  Plan for BGP migration within 6 months
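If you prefer something you can drop into a planning script, the framework above can be sketched as a function. It follows the rules in the same order they are written; the parameter names are mine, and I'm treating "multi-rack/multi-VLAN" as simply "not a single subnet", which is a simplification.

```python
def choose_mode(nodes, services, single_subnet, downtime_tolerance_s,
                production=False, sla_percent=0.0):
    """Return 'layer2', 'bgp', or 'layer2-then-migrate' per the framework above."""
    # Rule 1: small, single-subnet, downtime-tolerant clusters can use Layer 2.
    if (single_subnet and nodes < 5 and services < 10
            and downtime_tolerance_s > 30):
        return "layer2"
    # Rule 2: any one of these signals means BGP from day 1.
    if (production or not single_subnet or nodes > 10
            or services > 20 or sla_percent > 99.9):
        return "bgp"
    # Rule 3: everything in between starts on Layer 2 with a migration plan.
    return "layer2-then-migrate"


print(choose_mode(3, 5, single_subnet=True, downtime_tolerance_s=60))
print(choose_mode(12, 28, single_subnet=False, downtime_tolerance_s=10,
                  production=True, sla_percent=99.95))
print(choose_mode(8, 15, single_subnet=True, downtime_tolerance_s=60))
```

The three calls cover a small dev cluster, our production cluster, and a mid-sized cluster that falls into the "start with L2, plan the migration" bucket.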

10. Scale Thresholds (Real Numbers)

Based on our production experience + research:

Cluster Profile	Nodes	Services	Network	Mode	Rationale
Small	1-5	< 10	Single subnet	L2 OK	Failover acceptable, simple setup
Medium	5-20	10-50	Single/multi-rack	BGP preferred	L2 starts showing cracks
Large	20-50	50-100	Multi-VLAN/site	BGP mandatory	L2 fundamentally broken
Enterprise	50+	100+	Complex routed	BGP only	L2 won't work

Our actual journey:

Staging (L2): 5 nodes, 12 services → works, but 15-25s failover hurts
Production (BGP): 12 nodes, 28 services → 5-10s failover, zero issues

11. Migration Path: L2 → BGP (Proven Strategy)

Good news: No downtime required.

Phase 1: Dual-Mode Config

address-pools:
- name: legacy-l2
  protocol: layer2
  addresses:
  - 10.50.100.10-10.50.100.50

- name: new-bgp
  protocol: bgp
  addresses:
  - 10.50.100.100-10.50.100.200

peers:
- peer-address: 10.50.0.1
  peer-asn: 64500
  my-asn: 64501
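Running two pools side by side only works if their ranges never overlap. A quick sanity check before applying the config, as a sketch (the helper names are mine; it assumes pools are written as start-end ranges like the ones above):

```python
import ipaddress


def parse_range(spec):
    """Turn '10.50.100.10-10.50.100.50' into (start, end) integers."""
    start_text, end_text = spec.split("-")
    start = int(ipaddress.ip_address(start_text.strip()))
    end = int(ipaddress.ip_address(end_text.strip()))
    if end < start:
        raise ValueError(f"range end is before start: {spec}")
    return start, end


def ranges_overlap(first, second):
    """True if the two inclusive address ranges share at least one IP."""
    first_start, first_end = parse_range(first)
    second_start, second_end = parse_range(second)
    return first_start <= second_end and second_start <= first_end


legacy_l2 = "10.50.100.10-10.50.100.50"
new_bgp = "10.50.100.100-10.50.100.200"
print(ranges_overlap(legacy_l2, new_bgp))  # False
```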

Phase 2: Service-by-Service Migration

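For new services: Use the BGP pool (select it with the metallb.universe.tf/address-pool: new-bgp annotation, or set auto-assign: false on the legacy pool so new services land in the BGP pool by default)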

For existing services:

# 1. Create new service with BGP IP
kubectl expose deployment app --type=LoadBalancer \
  --name=app-bgp --port=443

# 2. Update DNS
app.example.com → 10.50.100.101 (BGP)

# 3. Wait for TTL (24-48 hours)

# 4. Delete old L2 service
kubectl delete svc app-l2

Phase 3: Complete Cutover

Remove L2 pool once all services migrated.

Our timeline: 3 weeks for 12 services (low-priority, phased approach)

12. Troubleshooting (Lessons from Production)

Issue 1: BGP Session Won't Establish

$ kubectl logs -n metallb-system speaker-xyz
level=error msg="BGP session failed" error="connection refused"

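Fix: Something is rejecting TCP port 179 between the node and the router. Check that the router has a neighbor entry for this node and that no firewall blocks the port. If the nodes run a restrictive host firewall, allow BGP traffic from the router:

# On each node (-I inserts the rule ahead of any existing DROP/REJECT rules)
sudo iptables -I INPUT -p tcp --dport 179 -s 10.50.0.1 -j ACCEPT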
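Before changing firewall rules, it helps to confirm whether the router's BGP port is reachable from the node at all. A minimal check using only the standard library (the function name is mine; it tests TCP reachability only, not whether the BGP session itself would be accepted):

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable networks.
        return False


# Example, run on a Kubernetes node against the router from this article:
#   tcp_port_open("10.50.0.1", 179)
```

A quick False usually means an active rejection (no BGP listener or a REJECT rule), while a False that takes the full timeout points to packets being silently dropped.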

Issue 2: Service Gets IP But Unreachable

$ kubectl get svc
NAME   EXTERNAL-IP      PORT(S)
app    10.50.100.10     443:31234/TCP

$ curl -k https://10.50.100.10   # service port is 443; -k skips certificate checks for this test
# Timeout

Debug checklist:

# 1. Check MetalLB speaker logs
kubectl logs -n metallb-system speaker-xyz

# 2. Verify BGP routes on router
show ip bgp 10.50.100.10

# 3. Check if pods exist and are ready
kubectl get pods -l app=myapp
kubectl describe svc app  # Check endpoints

# 4. Test from node directly
curl -k https://localhost:31234  # NodePort; should work (assumes the app serves TLS)

Common causes:

Pods not ready (readiness probes failing)
Router not advertising to client networks
Firewall between router and clients

Issue 3: Intermittent Connectivity

Traffic works 50% of the time.

Root cause: ECMP is sending some flows to a node whose pod is not ready.

Fix: Proper readiness probes

spec:
  containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 3

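With externalTrafficPolicy: Local, MetalLB only advertises a service IP from nodes that have a ready pod for it. With the default Cluster policy, every node advertises the IP and kube-proxy forwards to ready pods wherever they run.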

13. The Honest Recommendation

For 80% of production clusters: Use BGP.

Why?

Clusters grow unpredictably. You plan for 5 nodes; 8 months later you have 20.
BGP setup is a one-time effort. 2-4 hours upfront saves a painful migration later.
L2 fails silently. It works fine until scale hits, then suddenly doesn't.
Network teams already know BGP. It's standard practice, not a Kubernetes oddity.

Exception: Use L2 if:

Genuinely small cluster (< 5 nodes, < 10 services)
Dev/staging only
Hard blocker on router access
Prototyping/MVP phase

Then plan BGP migration before you hit 10 services.

14. Pre-Deployment Checklist

Before deploying MetalLB:

Network:

[ ] Dedicated IP range allocated (minimum /28 = 16 addresses, 14 usable if you exclude the network and broadcast addresses)
[ ] Single subnet or multi-subnet? (Multi = BGP required)
[ ] VLANs or flat network?
[ ] Switch vendor/model documented
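To size the IP range for the first checklist item, Python's ipaddress module does the counting for you. A small sketch (the `pool_size` helper is mine; the ranges are the staging and production pools from this article):

```python
import ipaddress


def pool_size(range_spec):
    """Count the addresses in an inclusive 'start-end' range."""
    start_text, end_text = range_spec.split("-")
    start = ipaddress.ip_address(start_text.strip())
    end = ipaddress.ip_address(end_text.strip())
    if end < start:
        raise ValueError(f"range end is before start: {range_spec}")
    return int(end) - int(start) + 1


print(pool_size("10.50.100.10-10.50.100.50"))    # 41
print(pool_size("10.50.100.100-10.50.100.200"))  # 101

# A /28 block: 16 addresses in total, 14 if network and broadcast are excluded.
block = ipaddress.ip_network("10.50.100.0/28")
print(block.num_addresses, len(list(block.hosts())))  # 16 14
```

Every LoadBalancer service consumes one address, so compare the pool size against your 12-month service projection, not today's count.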

Scale:

[ ] Current node count: ___
[ ] 12-month projection: ___
[ ] Current service count: ___
[ ] 12-month projection: ___

BGP Readiness:

[ ] BGP-capable router available? (Check: Mikrotik, VyOS, pfSense, Cumulus)
[ ] Network team approval process understood
[ ] Private AS number available (64512-65534 range)
[ ] Router config change lead time: ___

Decision:

[ ] Layer 2 chosen (document migration trigger)
[ ] BGP chosen (router config scheduled)

15. Quick Reference Tables

Comparison Matrix:

Feature	Layer 2 (ARP)	BGP
Setup complexity	⭐ Simple	⭐⭐⭐ Moderate
Router config needed	❌ No	✅ Yes
Failover time	30-60s	5-10s
Cross-VLAN support	❌ No	✅ Yes
Multi-node load balancing	❌ No (single speaker)	✅ Yes (ECMP)
Broadcast overhead	⚠️ Yes	✅ No
Production-ready	⚠️ Small scale only	✅ Yes
Cloud compatibility	❌ No	⚠️ Limited (provider-dependent)

Cost Comparison:

Approach	Hardware Cost	Setup Time	Operational Risk
L2 (staging)	$0	30 minutes	Medium (failover delays)
BGP (cheap router)	$200-600 (Mikrotik)	2-4 hours	Low
BGP (enterprise)	$3,000+	1-2 days	Very low
BGP (virtual)	$0 (VyOS/pfSense)	2-4 hours	Low

16. Resources That Actually Help

Official Docs:

BGP Basics:

Router-Specific:

17. Final Thought

MetalLB isn't magic. ARP isn't evil. BGP isn't scary.

They're tools—and scale decides which one hurts less.

Design simple. Upgrade deliberately.

The best time to set up BGP was at deployment.
The second-best time is now.

What's Next?

This article helped you decide L2 vs BGP. Now you need to deploy.

Upcoming in this series:

Part 2: Kubernetes control plane HA - Keepalived vs Kube-VIP (coming next week)
Part 3: Production deployment with Kubespray + Kube-VIP + MetalLB (coming soon)

Follow me for the next parts!

Questions? Real production experiences to share? Drop them in the comments. I spent 8 months figuring this out for our energy platform, and I'm happy to discuss specifics.

If this helped you avoid an outage, share it with your team.
