**TL;DR:** Migrated the 50+ TCP services from my previous posts to Envoy Gateway. The migration itself was smooth - the Helm chart from my Envoy Gateway post handled it. What wasn't smooth: AWS security groups default to a quota of 60 inbound rules per group, and when multiple EnvoyProxy resources hang off the same Gateway, all of their NLBs share a single security group. We hit the limit and Envoy froze completely - no errors, no logs, nothing. Fixed it by disabling automatic security group management and opening the port range manually.
## Context
If you’ve been following along:
- Running 50+ TCP Services on EKS - the multi-NLB architecture with ingress-nginx
- GitOps Self-Service - the GitHub Actions + ArgoCD management layer
- Replacing ingress-nginx with Envoy Gateway - the Helm chart for Gateway API resources
This post is the intersection: taking the TCP services from posts 1 and 2, and running them through the Envoy Gateway setup from post 3. Plus a nasty AWS gotcha that caused downtime.
## What Changed
The old architecture used multiple ingress-nginx controllers with dedicated IngressClasses, each behind its own NLB. The new architecture uses Envoy Gateway’s TCP support with TLS passthrough:
```mermaid
flowchart TB
    subgraph "Before: ingress-nginx"
        CLIENT1[Clients] --> NLB1A[NLB tcp01<br/>Ports 6100-6149]
        CLIENT1 --> NLB1B[NLB tcp02<br/>Ports 6150-6199]
        NLB1A --> ING1[ingress-nginx<br/>tcp01]
        NLB1B --> ING2[ingress-nginx<br/>tcp02]
        ING1 --> SVC1[Services 1-50]
        ING2 --> SVC2[Services 51-100]
    end
    style NLB1A fill:#ff6b6b,color:#fff
    style NLB1B fill:#ff6b6b,color:#fff
    style ING1 fill:#ff6b6b,color:#fff
    style ING2 fill:#ff6b6b,color:#fff
```

```mermaid
flowchart TB
    subgraph "After: Envoy Gateway"
        CLIENT2[Clients] --> NLB2[NLB tcpstaging<br/>All TCP ports]
        NLB2 --> |"PROXY protocol v2"| ENVOY[Envoy Proxy<br/>TLS passthrough]
        ENVOY --> GW[TCP Gateway]
        GW --> TLS1[TLSRoute :6101]
        GW --> TLS2[TLSRoute :6102]
        GW --> TLS3[TLSRoute :6135]
        GW --> MORE[... 20+ routes]
        TLS1 --> SVC3[Service A]
        TLS2 --> SVC4[Service B]
        TLS3 --> SVC5[Service C]
        CTP[ClientTrafficPolicy<br/>PROXY Protocol v2] -.-> |"targets"| GW
        BTP[BackendTrafficPolicy<br/>PROXY Protocol v2] -.-> |"targets"| GW
    end
    style ENVOY fill:#4ecdc4,color:#000
    style CTP fill:#96ceb4,color:#000
    style BTP fill:#45b7d1,color:#000
```
Key difference: one TCP Gateway replaces multiple ingress-nginx controllers. Envoy Gateway handles TLS passthrough natively via TLSRoutes, and each listener maps directly to a backend service. No more IngressClass juggling.
## The TCP Gateway Config
The TCP Gateway is a separate Gateway resource from the HTTPS one. It gets its own EnvoyProxy (and therefore its own NLB), because these services need different networking characteristics - PROXY protocol in both directions, TLS passthrough instead of termination.
### Values
```yaml
tcpGateways:
  - name: envoy-tcp
    enabled: true
    envoyProxy:
      name: eg-tcp-proxy
      replicas: 2
      service:
        type: LoadBalancer
        externalTrafficPolicy: Local
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-name: tcp-services
          service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
          service.beta.kubernetes.io/aws-load-balancer-type: external
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: tcp
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "30"
          service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
          service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
          service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: proxy_protocol_v2.enabled=true
          service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-aaa,subnet-bbb
          service.beta.kubernetes.io/aws-load-balancer-eip-allocations: eipalloc-aaa,eipalloc-bbb
    # PROXY protocol v2 in both directions:
    # NLB → Envoy (client policy) and Envoy → backend (backend policy)
    clientTrafficPolicy:
      proxyProtocol:
        version: V2
    backendTrafficPolicy:
      proxyProtocol: true
    # Each listener maps a port to a backend service
    listeners:
      - name: tcp-6101
        port: 6101
        backendNamespace: services
        backendName: service-alpha
        backendPort: 6101
      - name: tcp-6102
        port: 6102
        backendNamespace: services
        backendName: service-beta
        backendPort: 6102
      - name: tcp-6135
        port: 6135
        backendNamespace: services
        backendName: service-gamma
        backendPort: 6135
      # ... 20+ more listeners
```
Compare this to the old ingress-nginx TCP configmap:
```yaml
# Old: ingress-nginx
tcp:
  "6101": services/service-alpha:6101
  "6102": services/service-beta:6102
  "6135": services/service-gamma:6135
```
The Gateway API version is more verbose, but each listener is a first-class object. You get individual TLSRoutes per service, which means you can target policies at specific routes instead of the entire controller.
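As a sketch of what that per-route targeting buys you: a policy scoped to one TLSRoute instead of the whole gateway. Treat this as illustrative - the route name and the timeout tweak are made up, and whether `BackendTrafficPolicy` accepts a `TLSRoute` target depends on your Envoy Gateway version, so check the v1alpha1 API reference for your release:

```yaml
# Illustrative: scope a policy to a single route, not the controller.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: service-gamma-slow-connect   # hypothetical name
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: TLSRoute
      name: envoy-tcp-tcp-6135       # just this route
  timeout:
    tcp:
      connectTimeout: 15s            # e.g. one slow backend, longer connect timeout
```

With ingress-nginx's TCP ConfigMap there was no per-service hook like this - any tuning applied to the whole controller.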
## PROXY Protocol: Both Directions
These services need the real client IP before their TLS handshake. That means PROXY protocol has to flow end-to-end:
```mermaid
flowchart LR
    CLIENT[Client<br/>203.0.113.10] --> NLB[NLB<br/>PROXY v2 enabled]
    NLB --> |"PROXY header<br/>+ original payload"| ENVOY[Envoy Proxy]
    ENVOY --> |"PROXY header<br/>+ TLS passthrough"| POD[Service Pod<br/>extracts client IP]
    style NLB fill:#4ecdc4,color:#000
    style ENVOY fill:#45b7d1,color:#000
    style POD fill:#96ceb4,color:#000
```
The ClientTrafficPolicy tells Envoy to parse incoming PROXY protocol headers from the NLB. The BackendTrafficPolicy tells Envoy to forward PROXY protocol headers to the backend pods. Without the backend policy, the services would only see Envoy’s pod IP.
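On the receiving end, "extracts client IP" means reading the PROXY v2 header off the socket before the TLS bytes. Here's a minimal sketch of the TCP-over-IPv4 case - real services should use a library or their framework's built-in support; this handles exactly one address family and does no error recovery:

```python
import socket
import struct

# PROXY protocol v2: 12-byte signature, version/command byte, family byte,
# 2-byte payload length (big-endian), then the address block. For TCP over
# IPv4 the address block is src IP, dst IP, src port, dst port.
PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"

def parse_proxy_v2_tcp4(data: bytes):
    """Return (client_ip, client_port, payload) from a PROXY v2 header.

    Only the PROXY command + TCP-over-IPv4 case (0x21, 0x11) is handled.
    `payload` is whatever follows the header - in the passthrough setup
    above, the untouched TLS ClientHello.
    """
    if data[:12] != PP2_SIGNATURE:
        raise ValueError("not a PROXY protocol v2 header")
    ver_cmd, family, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd != 0x21 or family != 0x11:
        raise ValueError("only PROXY command + TCP over IPv4 handled here")
    src_ip, _dst_ip, src_port, _dst_port = struct.unpack("!4s4sHH", data[16:28])
    return socket.inet_ntoa(src_ip), src_port, data[16 + length:]

# A header as the NLB/Envoy chain would emit it for client 203.0.113.10:54321
header = (
    PP2_SIGNATURE
    + struct.pack("!BBH", 0x21, 0x11, 12)
    + socket.inet_aton("203.0.113.10")  # client (source) address
    + socket.inet_aton("10.0.0.5")      # destination address
    + struct.pack("!HH", 54321, 6101)   # source and destination ports
)

ip, port, rest = parse_proxy_v2_tcp4(header + b"\x16\x03\x01...")
print(ip, port)  # 203.0.113.10 54321
```

The bytes after the header are the client's original stream, which is why TLS passthrough still works end-to-end: Envoy prepends the header and forwards the ClientHello untouched.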
## The Templates
The TCP Gateway templates generate several resources per gateway:
EnvoyProxy - configures the data plane (NLB annotations, replicas):
```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: {{ .envoyProxy.name }}
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: {{ .envoyProxy.service.type | default "LoadBalancer" }}
        externalTrafficPolicy: {{ .envoyProxy.service.externalTrafficPolicy }}
        annotations:
          {{- toYaml .envoyProxy.service.annotations | nindent 10 }}
      envoyDeployment:
        replicas: {{ .envoyProxy.replicas }}
{{- end }}
{{- end }}
```
Gateway - TLS passthrough listeners per port:
```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: {{ .name }}
spec:
  gatewayClassName: {{ $.Values.gateway.className | default "envoy" }}
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: {{ .envoyProxy.name }}
  listeners:
    {{- range .listeners }}
    - name: {{ .name }}
      protocol: TLS
      port: {{ .port }}
      tls:
        mode: Passthrough
      allowedRoutes:
        namespaces:
          from: "All"
    {{- end }}
{{- end }}
{{- end }}
```
TLSRoutes - one per listener, routing to the backend:
```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
{{- $gwName := .name }}
{{- range .listeners }}
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: {{ $gwName }}-{{ .name }}
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ $gwName }}
      sectionName: {{ .name }}
  rules:
    - backendRefs:
        - kind: Service
          name: {{ .backendName }}
          namespace: {{ .backendNamespace }}
          port: {{ .backendPort }}
{{- end }}
{{- end }}
{{- end }}
```
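The chart also renders the two policies from the `clientTrafficPolicy`/`backendTrafficPolicy` values. Roughly like this - the field names follow the v1alpha1 APIs as I understand them, but the PROXY protocol knobs have shifted across Envoy Gateway releases, so verify against your installed CRDs rather than copying this verbatim:

```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: {{ .name }}-client
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ .name }}
  # Parse PROXY protocol v2 headers arriving from the NLB
  enableProxyProtocol: true
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: {{ .name }}-backend
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ .name }}
  # Forward PROXY protocol v2 headers to the backend pods
  proxyProtocol:
    version: V2
{{- end }}
{{- end }}
```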
ReferenceGrants - since the TLSRoutes live in the gateway namespace but reference Services in application namespaces:
```yaml
{{- $uniqueNamespaces := dict }}
{{- range .Values.tcpGateways }}
{{- if .enabled }}
{{- range .listeners }}
{{- $_ := set $uniqueNamespaces .backendNamespace true }}
{{- end }}
{{- end }}
{{- end }}
{{- range $namespace, $_ := $uniqueNamespaces }}
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-tcp-from-{{ $.Release.Namespace }}
  namespace: {{ $namespace }}
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: TLSRoute
      namespace: {{ $.Release.Namespace }}
  to:
    - group: ""
      kind: Service
{{- end }}
```
The unique namespace collection is a nice trick - it deduplicates automatically, so even with 50 listeners all pointing to the same namespace, you get one ReferenceGrant.
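Rendered with every listener pointing at a single `services` namespace - and assuming the chart is released into `envoy-gateway-system`, as in the `kubectl` examples later in this post - the whole loop collapses to one manifest:

```yaml
# Illustrative render: 50 listeners, one backend namespace, one grant.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-tcp-from-envoy-gateway-system
  namespace: services
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: TLSRoute
      namespace: envoy-gateway-system
  to:
    - group: ""
      kind: Service
```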
## The Security Group Gotcha
This is why I’m writing this post.
The migration went smoothly in staging. Everything deployed, routes worked, PROXY protocol flowed end-to-end. Then we added more listeners and things stopped working. Not gradually - completely.
### What Happened
Our setup uses a single Envoy Gateway Gateway resource with multiple EnvoyProxy resources attached to it - one for the HTTPS data plane, one for the TCP data plane. Each EnvoyProxy provisions its own NLB via the AWS Load Balancer Controller. Separate NLBs, separate listeners, separate Elastic IPs. Sounds isolated, right?
It’s not. When multiple EnvoyProxy resources hang off the same Gateway, Envoy Gateway creates one shared security group and attaches it to all the resulting NLBs. The AWS LB Controller then adds an individual inbound rule per listener port to that shared group. Every HTTPS listener, every TCP port - all accumulating rules in one place.
```mermaid
flowchart TD
    subgraph "What We Expected"
        GW_E[Gateway Resource] --> EP1_E[EnvoyProxy: HTTPS]
        GW_E --> EP2_E[EnvoyProxy: TCP]
        EP1_E --> NLB1_E[NLB: HTTPS]
        EP2_E --> NLB2_E[NLB: TCP]
        NLB1_E --> SG1_E[Security Group A<br/>Rules: 80, 443]
        NLB2_E --> SG2_E[Security Group B<br/>Rules: 6101, 6102, ...]
    end
    subgraph "What Actually Happens"
        GW_A[Gateway Resource] --> SG_SHARED[ONE Shared<br/>Security Group]
        GW_A --> EP1_A[EnvoyProxy: HTTPS]
        GW_A --> EP2_A[EnvoyProxy: TCP]
        EP1_A --> NLB1_A[NLB: HTTPS] --> SG_SHARED
        EP2_A --> NLB2_A[NLB: TCP] --> SG_SHARED
        SG_SHARED --> R1[Rule: 80]
        SG_SHARED --> R2[Rule: 443]
        SG_SHARED --> R3[Rule: 6101]
        SG_SHARED --> R4[Rule: 6102]
        SG_SHARED --> R5[...]
        SG_SHARED --> R6[Rule: 6199<br/>COMBINED 60+ 💥]
    end
    style SG1_E fill:#96ceb4,color:#000
    style SG2_E fill:#96ceb4,color:#000
    style SG_SHARED fill:#ff6b6b,color:#fff
    style R6 fill:#ff6b6b,color:#fff
    style GW_A fill:#ffd93d,color:#000
```
Here’s the chain of events:
- Envoy Gateway reconciles the `Gateway` resource and its associated `EnvoyProxy` configs
- Because the proxies share a parent `Gateway`, Envoy Gateway provisions one security group and attaches it to all the NLBs
- The AWS LB Controller adds an inbound rule per listener port to that shared group - across all proxies
- AWS has a default quota of 60 rules per security group
- The HTTPS proxy contributes its listener ports (80, 443 per hostname). The TCP proxy contributes 20+ individual port rules. Combined, they blow past 60
- When you exceed 60 rules, the security group update fails
- The LB Controller doesn't surface this error in its own logs
- Envoy doesn't surface it either - it just… stops
No error logs in Envoy. No error logs in the LB Controller. The NLB listener creation succeeds, but the security group can’t be updated to allow traffic on the new ports. Existing connections keep working, but no new listeners get traffic.
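The arithmetic is easy to sanity-check. A back-of-the-envelope sketch - the listener counts below are illustrative, plug in your own:

```python
# Back-of-the-envelope check for the shared security group: the LB
# Controller adds one inbound rule per listener port, and the rules
# accumulate across every proxy attached to the same Gateway.

DEFAULT_RULE_QUOTA = 60  # AWS default: inbound rules per security group

def remaining_rule_budget(*rule_counts, quota=DEFAULT_RULE_QUOTA):
    """Rules left on the shared group, given per-proxy rule counts.
    Negative means the next security group update will fail."""
    return quota - sum(rule_counts)

# Illustrative counts (plug in your own):
https_rules = 40  # e.g. 80 + 443 across 20 hostnames on the HTTPS proxy
tcp_rules = 22    # one rule per TCP listener port

print(remaining_rule_budget(https_rules, tcp_rules))  # -2: over the quota
```

Either proxy alone fits comfortably; it's the sum on the shared group that goes negative - which is exactly why "well under 50 listeners per NLB" offers no protection.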
This is the part that caught us off guard. Each EnvoyProxy has its own NLB, its own set of listeners, its own target groups, its own Elastic IPs - everything looks independent at the NLB level. But because they share a parent Gateway resource, Envoy Gateway ties them together through one security group. You might be well under 50 listeners per NLB and still hit the 60-rule quota because the rules are cumulative across all your proxies.
### Where the Error Actually Was
Kubernetes Events. Not pod logs, not controller logs - events:
```shell
kubectl get events -n envoy-gateway-system --field-selector reason=SyncLoadBalancerFailed
```
That’s where the security group quota exceeded error showed up. If you weren’t watching events, you’d never know.
### The Impact
This is what made it dangerous: Envoy completely froze. It wasn’t just the new listeners that failed - the entire data plane stopped processing configuration updates. No new routes, no policy changes, nothing. The Gateway resource showed Programmed: True because the Gateway API reconciliation succeeded. The actual data plane was stuck.
```mermaid
flowchart LR
    subgraph "What You See"
        GW_STATUS[Gateway Status:<br/>Programmed ✓]
        ENVOY_LOGS[Envoy Logs:<br/>nothing unusual]
        LBC_LOGS[LB Controller Logs:<br/>nothing unusual]
    end
    subgraph "What's Actually Happening"
        SG_FAIL[Security Group:<br/>quota exceeded]
        EVENTS[K8s Events:<br/>SyncLoadBalancerFailed]
        TRAFFIC[New listeners:<br/>no traffic]
    end
    style GW_STATUS fill:#96ceb4,color:#000
    style ENVOY_LOGS fill:#96ceb4,color:#000
    style LBC_LOGS fill:#96ceb4,color:#000
    style SG_FAIL fill:#ff6b6b,color:#fff
    style TRAFFIC fill:#ff6b6b,color:#fff
    style EVENTS fill:#ffd93d,color:#000
```
### The Fix
Disable automatic security group management and manage the port range manually:
```yaml
envoyProxy:
  service:
    annotations:
      # ... existing annotations ...
      # Disable automatic security group management
      service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: "false"
```
Then create a security group with a single rule covering your entire port range:
```text
Inbound Rule:
  Protocol:   TCP
  Port Range: 6100-6200
  Source:     0.0.0.0/0
```
One rule instead of 50+. The NLB still only forwards traffic for ports with actual listeners configured in the Envoy Gateway, so you’re not exposing unused ports - the NLB acts as the access control.
```mermaid
flowchart TD
    subgraph "Before: Shared SG, Per-Port Rules (hits quota)"
        SG1[Shared Security Group]
        SG1 --> R0A[Allow TCP 80<br/>HTTPS GW]
        SG1 --> R0B[Allow TCP 443<br/>HTTPS GW]
        SG1 --> R1[Allow TCP 6101<br/>TCP GW]
        SG1 --> R2[Allow TCP 6102<br/>TCP GW]
        SG1 --> R3[...]
        SG1 --> R4[Allow TCP 6199<br/>COMBINED 60+ 💥]
    end
    subgraph "After: Manual SG, Range Rule"
        SG2[Manual Security Group]
        SG2 --> RANGE[Allow TCP 6100-6200]
        SG2 --> HTTPS[Allow TCP 80, 443]
        RANGE --> NLB_TCP[TCP NLB only forwards<br/>ports with listeners]
        HTTPS --> NLB_HTTPS[HTTPS NLB]
    end
    style R4 fill:#ff6b6b,color:#fff
    style RANGE fill:#96ceb4,color:#000
    style HTTPS fill:#96ceb4,color:#000
    style NLB_TCP fill:#4ecdc4,color:#000
    style NLB_HTTPS fill:#4ecdc4,color:#000
```
This actually reduced complexity. The LB Controller was creating and deleting individual security group rules across all NLBs every time a listener was added or removed. Now the security group is static, and each NLB’s listener configuration determines which ports actually accept traffic. Fewer moving parts.
### Other Options
Disabling automatic security group management and using a port range rule is what worked for us. There are other approaches - do your own due diligence on what fits your environment:
- **Request a quota increase.** The 60-rule limit is a default, not a hard ceiling. You can request an increase through AWS Service Quotas. This lets you keep automatic security group management, but you're still one scaling event away from hitting the new limit.
- **Separate security groups per NLB.** Pre-create dedicated security groups and assign them via the `aws-load-balancer-security-groups` annotation on each EnvoyProxy Service. This overrides the shared group that Envoy Gateway creates and gives each NLB its own rule budget. You take on the management overhead, but each proxy's NLB is fully isolated.
- **Separate Gateway resources.** Instead of multiple EnvoyProxy resources under one Gateway, use independent Gateway resources. Each Gateway gets its own security group by default. More Gateway resources to manage, but you sidestep the shared security group entirely.
- **Split across multiple TCP Gateways.** Same pattern as the old multi-NLB ingress-nginx setup - keep each Gateway under the rule limit. More NLBs, more cost, but no shared security group issues.
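As a sketch of the second option, the values-level change would be one extra annotation per proxy. The security group ID is a placeholder, and you should confirm your AWS Load Balancer Controller version supports security groups on NLBs - that support landed relatively recently:

```yaml
envoyProxy:
  service:
    annotations:
      # Pre-created group dedicated to this NLB - its own 60-rule budget,
      # isolated from whatever the other proxies accumulate
      service.beta.kubernetes.io/aws-load-balancer-security-groups: sg-0123456789abcdef0
```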
The right answer depends on how many services you’re running, how often the listener set changes, and how locked down your security groups need to be. The key insight is knowing the shared security group behavior exists in the first place - once you’re aware of it, you can pick the mitigation that fits.
## What the Full Setup Looks Like
With both the HTTPS Gateway (from the previous post) and the TCP Gateway running side by side:
```mermaid
flowchart TB
    subgraph "HTTPS Traffic"
        H_CLIENT[Web Clients] --> H_NLB[NLB: https-gateway]
        H_NLB --> H_ENVOY[Envoy Proxy<br/>TLS termination]
        H_ENVOY --> H_GW[Gateway: envoy]
        H_GW --> HR1[HTTPRoute: app-a]
        H_GW --> HR2[HTTPRoute: app-b]
        H_GW --> HR3[HTTPRoute: app-c]
        HR1 --> H_SVC1[Web App A]
        HR2 --> H_SVC2[Web App B]
        HR3 --> H_SVC3[Web App C]
    end
    subgraph "TCP Traffic"
        T_CLIENT[TCP Clients] --> T_NLB[NLB: tcp-services]
        T_NLB --> T_ENVOY[Envoy Proxy<br/>TLS passthrough]
        T_ENVOY --> T_GW[TCP Gateway: envoy-tcp]
        T_GW --> TR1[TLSRoute :6101]
        T_GW --> TR2[TLSRoute :6102]
        T_GW --> MORE[... 20+ routes]
        TR1 --> T_SVC1[Service Alpha]
        TR2 --> T_SVC2[Service Beta]
    end
    subgraph "Shared Infrastructure"
        CI[ClusterIssuer<br/>Let's Encrypt]
        ARGO[ArgoCD<br/>GitOps sync]
    end
    ARGO -.-> H_GW
    ARGO -.-> T_GW
    CI -.-> H_GW
    style H_ENVOY fill:#4ecdc4,color:#000
    style T_ENVOY fill:#45b7d1,color:#000
    style ARGO fill:#ffd93d,color:#000
```
The HTTPS Gateway handles web traffic with TLS termination, certificates, security policies, and rate limiting. The TCP Gateway handles the service fleet with TLS passthrough and bidirectional PROXY protocol. Same Envoy Gateway controller manages both.
## ingress-nginx vs Envoy Gateway for TCP
| Aspect | ingress-nginx | Envoy Gateway |
|---|---|---|
| TCP routing | ConfigMap (`tcp` key) | TLSRoute per service |
| Multiple NLBs | Separate Helm releases, unique IngressClasses | Separate Gateway resources |
| PROXY protocol | Global config, one direction | Per-gateway policy, bidirectional |
| Per-service policies | Not possible | BackendTrafficPolicy targets individual routes |
| Port limit | 50 per NLB (NLB listener limit) | 50 per NLB (same limit, different bottleneck) |
| Security group management | Same issue if auto-managed | Same issue - disable and use range rules |
| Adding a service | Add line to ConfigMap, redeploy | Add listener + TLSRoute via values |
The NLB 50-listener limit still applies - that’s an AWS constraint, not an ingress controller one. But with Envoy Gateway, you’re more likely to hit the security group 60-rule quota first if you let the LB Controller manage rules automatically.
## Takeaways
- **Multiple EnvoyProxy resources under one Gateway share a single security group.** Envoy Gateway creates one security group for the Gateway and attaches it to all NLBs provisioned by its proxies. Listener ports from every proxy accumulate as rules on that one group. You can be well under 50 listeners per NLB and still hit the 60-rule quota.
- **When the quota is exceeded, Envoy freezes silently.** No error logs in Envoy, no error logs in the LB Controller. The error only shows up in Kubernetes events. Add event monitoring to your alerting if you haven't already.
- **Disable automatic security group management or assign dedicated security groups per NLB.** Use `aws-load-balancer-manage-backend-security-group-rules: "false"` with a port range rule, or use `aws-load-balancer-security-groups` to give each proxy its own group. Either way, understand the shared security group default before it bites you.
- **The NLB is your access control layer.** With security group management disabled, the NLB's listener configuration determines which ports accept traffic. No listener, no traffic - even if the security group allows the port range.
- **TLSRoutes are the Gateway API equivalent of the TCP ConfigMap.** More verbose, but each route is independently targetable with policies. Worth the trade-off.
- **PROXY protocol needs to flow both ways for TCP services.** ClientTrafficPolicy parses it from the NLB, BackendTrafficPolicy forwards it to the pods. Miss the backend policy and your services see Envoy's IP instead of the client's.
- **Check Kubernetes events. Seriously.** Pod logs and controller logs don't surface every failure. `kubectl get events` with field selectors should be part of your debugging playbook.
The migration from ingress-nginx to Envoy Gateway for TCP services was straightforward. The silent freeze from a shared security group hitting AWS quotas was not. If you’re running multiple EnvoyProxy resources under one Gateway, understand how security groups are shared before you learn this lesson the hard way.