**TL;DR:** Migrated the 50+ TCP services from my previous posts to Envoy Gateway. The migration itself was smooth - the Helm chart from my Envoy Gateway post handled it. What wasn't smooth: AWS security groups default to a quota of 60 inbound rules per group, and when multiple EnvoyProxy resources hang off the same Gateway, all of their NLBs share a single security group. We hit the limit and Envoy froze completely - no errors, no logs, nothing. Fixed it by disabling automatic security group management and opening the port range manually.
## Context
If you’ve been following along:
- Running 50+ TCP Services on EKS - the multi-NLB architecture with ingress-nginx
- GitOps Self-Service - the GitHub Actions + ArgoCD management layer
- Replacing ingress-nginx with Envoy Gateway - the Helm chart for Gateway API resources
This post is the intersection: taking the TCP services from posts 1 and 2, and running them through the Envoy Gateway setup from post 3. Plus a nasty AWS gotcha that caused downtime.
## What Changed
The old architecture used multiple ingress-nginx controllers with dedicated IngressClasses, each behind its own NLB. The new architecture uses Envoy Gateway’s TCP support with TLS passthrough:
```mermaid
flowchart TB
    subgraph "Before: ingress-nginx"
        CLIENT1[Clients] --> NLB1A[NLB tcp01<br/>Ports 6100-6149]
        CLIENT1 --> NLB1B[NLB tcp02<br/>Ports 6150-6199]
        NLB1A --> ING1[ingress-nginx<br/>tcp01]
        NLB1B --> ING2[ingress-nginx<br/>tcp02]
        ING1 --> SVC1[Services 1-50]
        ING2 --> SVC2[Services 51-100]
    end
    style NLB1A fill:#ff6b6b,color:#fff
    style NLB1B fill:#ff6b6b,color:#fff
    style ING1 fill:#ff6b6b,color:#fff
    style ING2 fill:#ff6b6b,color:#fff
```

```mermaid
flowchart TB
    subgraph "After: Envoy Gateway"
        CLIENT2[Clients] --> NLB2[NLB tcpstaging<br/>All TCP ports]
        NLB2 --> |"PROXY protocol v2"| ENVOY[Envoy Proxy<br/>TLS passthrough]
        ENVOY --> GW[TCP Gateway]
        GW --> TLS1[TLSRoute :6101]
        GW --> TLS2[TLSRoute :6102]
        GW --> TLS3[TLSRoute :6135]
        GW --> MORE[... 20+ routes]
        TLS1 --> SVC3[Service A]
        TLS2 --> SVC4[Service B]
        TLS3 --> SVC5[Service C]
        CTP[ClientTrafficPolicy<br/>PROXY Protocol v2] -.-> |"targets"| GW
        BTP[BackendTrafficPolicy<br/>PROXY Protocol v2] -.-> |"targets"| GW
    end
    style ENVOY fill:#4ecdc4,color:#000
    style CTP fill:#96ceb4,color:#000
    style BTP fill:#45b7d1,color:#000
```
Key difference: one TCP Gateway replaces multiple ingress-nginx controllers. Envoy Gateway handles TLS passthrough natively via TLSRoutes, and each listener maps directly to a backend service. No more IngressClass juggling.
## The TCP Gateway Config
The TCP Gateway is a separate Gateway resource from the HTTPS one. It gets its own EnvoyProxy (and therefore its own NLB), because these services need different networking characteristics - PROXY protocol in both directions, TLS passthrough instead of termination.
### Values
```yaml
tcpGateways:
  - name: envoy-tcp
    enabled: true
    envoyProxy:
      name: eg-tcp-proxy
      replicas: 2
      service:
        type: LoadBalancer
        externalTrafficPolicy: Local
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-name: tcp-services
          service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
          service.beta.kubernetes.io/aws-load-balancer-type: external
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: tcp
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
          service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "30"
          service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
          service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
          service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: proxy_protocol_v2.enabled=true
          service.beta.kubernetes.io/aws-load-balancer-subnets: subnet-aaa,subnet-bbb
          service.beta.kubernetes.io/aws-load-balancer-eip-allocations: eipalloc-aaa,eipalloc-bbb
    # PROXY protocol v2 in both directions:
    # NLB → Envoy (client policy) and Envoy → backend (backend policy)
    clientTrafficPolicy:
      proxyProtocol:
        version: V2
    backendTrafficPolicy:
      proxyProtocol: true
    # Each listener maps a port to a backend service
    listeners:
      - name: tcp-6101
        port: 6101
        backendNamespace: services
        backendName: service-alpha
        backendPort: 6101
      - name: tcp-6102
        port: 6102
        backendNamespace: services
        backendName: service-beta
        backendPort: 6102
      - name: tcp-6135
        port: 6135
        backendNamespace: services
        backendName: service-gamma
        backendPort: 6135
      # ... 20+ more listeners
```
Compare this to the old ingress-nginx TCP configmap:
```yaml
# Old: ingress-nginx
tcp:
  "6101": services/service-alpha:6101
  "6102": services/service-beta:6102
  "6135": services/service-gamma:6135
```
The Gateway API version is more verbose, but each listener is a first-class object. You get individual TLSRoutes per service, which means you can target policies at specific routes instead of the entire controller.
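As a sketch of what that per-route targeting buys you: a policy scoped to one TLSRoute instead of the whole gateway. Treat this as illustrative - the route name and the timeout tweak are made up, and whether `BackendTrafficPolicy` accepts a `TLSRoute` target depends on your Envoy Gateway version, so check the v1alpha1 API reference for your release:

```yaml
# Illustrative: scope a policy to a single route, not the controller.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: service-gamma-slow-connect   # hypothetical name
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: TLSRoute
      name: envoy-tcp-tcp-6135       # just this route
  timeout:
    tcp:
      connectTimeout: 15s            # e.g. one slow backend, longer connect timeout
```

With ingress-nginx's TCP ConfigMap there was no per-service hook like this - any tuning applied to the whole controller.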
## PROXY Protocol: Both Directions
These services need the real client IP before their TLS handshake. That means PROXY protocol has to flow end-to-end:
```mermaid
flowchart LR
    CLIENT[Client<br/>203.0.113.10] --> NLB[NLB<br/>PROXY v2 enabled]
    NLB --> |"PROXY header<br/>+ original payload"| ENVOY[Envoy Proxy]
    ENVOY --> |"PROXY header<br/>+ TLS passthrough"| POD[Service Pod<br/>extracts client IP]
    style NLB fill:#4ecdc4,color:#000
    style ENVOY fill:#45b7d1,color:#000
    style POD fill:#96ceb4,color:#000
```
The ClientTrafficPolicy tells Envoy to parse incoming PROXY protocol headers from the NLB. The BackendTrafficPolicy tells Envoy to forward PROXY protocol headers to the backend pods. Without the backend policy, the services would only see Envoy’s pod IP.
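On the receiving end, "extracts client IP" means reading the PROXY v2 header off the socket before the TLS bytes. Here's a minimal sketch of the TCP-over-IPv4 case - real services should use a library or their framework's built-in support; this handles exactly one address family and does no error recovery:

```python
import socket
import struct

# PROXY protocol v2: 12-byte signature, version/command byte, family byte,
# 2-byte payload length (big-endian), then the address block. For TCP over
# IPv4 the address block is src IP, dst IP, src port, dst port.
PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"

def parse_proxy_v2_tcp4(data: bytes):
    """Return (client_ip, client_port, payload) from a PROXY v2 header.

    Only the PROXY command + TCP-over-IPv4 case (0x21, 0x11) is handled.
    `payload` is whatever follows the header - in the passthrough setup
    above, the untouched TLS ClientHello.
    """
    if data[:12] != PP2_SIGNATURE:
        raise ValueError("not a PROXY protocol v2 header")
    ver_cmd, family, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd != 0x21 or family != 0x11:
        raise ValueError("only PROXY command + TCP over IPv4 handled here")
    src_ip, _dst_ip, src_port, _dst_port = struct.unpack("!4s4sHH", data[16:28])
    return socket.inet_ntoa(src_ip), src_port, data[16 + length:]

# A header as the NLB/Envoy chain would emit it for client 203.0.113.10:54321
header = (
    PP2_SIGNATURE
    + struct.pack("!BBH", 0x21, 0x11, 12)
    + socket.inet_aton("203.0.113.10")  # client (source) address
    + socket.inet_aton("10.0.0.5")      # destination address
    + struct.pack("!HH", 54321, 6101)   # source and destination ports
)

ip, port, rest = parse_proxy_v2_tcp4(header + b"\x16\x03\x01...")
print(ip, port)  # 203.0.113.10 54321
```

The bytes after the header are the client's original stream, which is why TLS passthrough still works end-to-end: Envoy prepends the header and forwards the ClientHello untouched.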
## The Templates
The TCP Gateway templates generate several resources per gateway:
EnvoyProxy - configures the data plane (NLB annotations, replicas):
```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: {{ .envoyProxy.name }}
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: {{ .envoyProxy.service.type | default "LoadBalancer" }}
        externalTrafficPolicy: {{ .envoyProxy.service.externalTrafficPolicy }}
        annotations:
          {{- toYaml .envoyProxy.service.annotations | nindent 10 }}
      envoyDeployment:
        replicas: {{ .envoyProxy.replicas }}
{{- end }}
{{- end }}
```
Gateway - TLS passthrough listeners per port:
```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: {{ .name }}
spec:
  gatewayClassName: {{ $.Values.gateway.className | default "envoy" }}
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: {{ .envoyProxy.name }}
  listeners:
    {{- range .listeners }}
    - name: {{ .name }}
      protocol: TLS
      port: {{ .port }}
      tls:
        mode: Passthrough
      allowedRoutes:
        namespaces:
          from: "All"
    {{- end }}
{{- end }}
{{- end }}
```
TLSRoutes - one per listener, routing to the backend:
```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
{{- $gwName := .name }}
{{- range .listeners }}
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: {{ $gwName }}-{{ .name }}
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ $gwName }}
      sectionName: {{ .name }}
  rules:
    - backendRefs:
        - kind: Service
          name: {{ .backendName }}
          namespace: {{ .backendNamespace }}
          port: {{ .backendPort }}
{{- end }}
{{- end }}
{{- end }}
```
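The chart also renders the two policies from the `clientTrafficPolicy`/`backendTrafficPolicy` values. Roughly like this - the field names follow the v1alpha1 APIs as I understand them, but the PROXY protocol knobs have shifted across Envoy Gateway releases, so verify against your installed CRDs rather than copying this verbatim:

```yaml
{{- range .Values.tcpGateways }}
{{- if .enabled }}
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: {{ .name }}-client
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ .name }}
  # Parse PROXY protocol v2 headers arriving from the NLB
  enableProxyProtocol: true
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: {{ .name }}-backend
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ .name }}
  # Forward PROXY protocol v2 headers to the backend pods
  proxyProtocol:
    version: V2
{{- end }}
{{- end }}
```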
ReferenceGrants - since the TLSRoutes live in the gateway namespace but reference Services in application namespaces:
```yaml
{{- $uniqueNamespaces := dict }}
{{- range .Values.tcpGateways }}
{{- if .enabled }}
{{- range .listeners }}
{{- $_ := set $uniqueNamespaces .backendNamespace true }}
{{- end }}
{{- end }}
{{- end }}
{{- range $namespace, $_ := $uniqueNamespaces }}
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-tcp-from-{{ $.Release.Namespace }}
  namespace: {{ $namespace }}
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: TLSRoute
      namespace: {{ $.Release.Namespace }}
  to:
    - group: ""
      kind: Service
{{- end }}
```
The unique namespace collection is a nice trick - it deduplicates automatically, so even with 50 listeners all pointing to the same namespace, you get one ReferenceGrant.
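Rendered with every listener pointing at a single `services` namespace - and assuming the chart is released into `envoy-gateway-system`, as in the `kubectl` examples later in this post - the whole loop collapses to one manifest:

```yaml
# Illustrative render: 50 listeners, one backend namespace, one grant.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-tcp-from-envoy-gateway-system
  namespace: services
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: TLSRoute
      namespace: envoy-gateway-system
  to:
    - group: ""
      kind: Service
```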
## The Security Group Gotcha
This is why I’m writing this post.
The migration went smoothly in staging. Everything deployed, routes worked, PROXY protocol flowed end-to-end. Then we added more listeners and things stopped working. Not gradually - completely.
### What Happened
Our setup uses a single Envoy Gateway Gateway resource with multiple EnvoyProxy resources attached to it - one for the HTTPS data plane, one for the TCP data plane. Each EnvoyProxy provisions its own NLB via the AWS Load Balancer Controller. Separate NLBs, separate listeners, separate Elastic IPs. Sounds isolated, right?
It’s not. When multiple EnvoyProxy resources hang off the same Gateway, Envoy Gateway creates one shared security group and attaches it to all the resulting NLBs. The AWS LB Controller then adds an individual inbound rule per listener port to that shared group. Every HTTPS listener, every TCP port - all accumulating rules in one place.
```mermaid
flowchart TD
    subgraph "What We Expected"
        GW_E[Gateway Resource] --> EP1_E[EnvoyProxy: HTTPS]
        GW_E --> EP2_E[EnvoyProxy: TCP]
        EP1_E --> NLB1_E[NLB: HTTPS]
        EP2_E --> NLB2_E[NLB: TCP]
        NLB1_E --> SG1_E[Security Group A<br/>Rules: 80, 443]
        NLB2_E --> SG2_E[Security Group B<br/>Rules: 6101, 6102, ...]
    end
    subgraph "What Actually Happens"
        GW_A[Gateway Resource] --> SG_SHARED[ONE Shared<br/>Security Group]
        GW_A --> EP1_A[EnvoyProxy: HTTPS]
        GW_A --> EP2_A[EnvoyProxy: TCP]
        EP1_A --> NLB1_A[NLB: HTTPS] --> SG_SHARED
        EP2_A --> NLB2_A[NLB: TCP] --> SG_SHARED
        SG_SHARED --> R1[Rule: 80]
        SG_SHARED --> R2[Rule: 443]
        SG_SHARED --> R3[Rule: 6101]
        SG_SHARED --> R4[Rule: 6102]
        SG_SHARED --> R5[...]
        SG_SHARED --> R6[Rule: 6199<br/>COMBINED 60+ 💥]
    end
    style SG1_E fill:#96ceb4,color:#000
    style SG2_E fill:#96ceb4,color:#000
    style SG_SHARED fill:#ff6b6b,color:#fff
    style R6 fill:#ff6b6b,color:#fff
    style GW_A fill:#ffd93d,color:#000
```
Here’s the chain of events:
- Envoy Gateway reconciles the `Gateway` resource and its associated `EnvoyProxy` configs
- Because the proxies share a parent `Gateway`, Envoy Gateway provisions one security group and attaches it to all the NLBs
- The AWS LB Controller adds an inbound rule per listener port to that shared group - across all proxies
- AWS has a default quota of 60 rules per security group
- The HTTPS proxy contributes its listener ports (80, 443 per hostname). The TCP proxy contributes 20+ individual port rules. Combined, they blow past 60
- When you exceed 60 rules, the security group update fails
- The LB Controller doesn't surface this error in its own logs
- Envoy doesn't surface it either - it just… stops
No error logs in Envoy. No error logs in the LB Controller. The NLB listener creation succeeds, but the security group can’t be updated to allow traffic on the new ports. Existing connections keep working, but no new listeners get traffic.
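The arithmetic is easy to sanity-check. A back-of-the-envelope sketch - the listener counts below are illustrative, plug in your own:

```python
# Back-of-the-envelope check for the shared security group: the LB
# Controller adds one inbound rule per listener port, and the rules
# accumulate across every proxy attached to the same Gateway.

DEFAULT_RULE_QUOTA = 60  # AWS default: inbound rules per security group

def remaining_rule_budget(*rule_counts, quota=DEFAULT_RULE_QUOTA):
    """Rules left on the shared group, given per-proxy rule counts.
    Negative means the next security group update will fail."""
    return quota - sum(rule_counts)

# Illustrative counts (plug in your own):
https_rules = 40  # e.g. 80 + 443 across 20 hostnames on the HTTPS proxy
tcp_rules = 22    # one rule per TCP listener port

print(remaining_rule_budget(https_rules, tcp_rules))  # -2: over the quota
```

Either proxy alone fits comfortably; it's the sum on the shared group that goes negative - which is exactly why "well under 50 listeners per NLB" offers no protection.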
This is the part that caught us off guard. Each EnvoyProxy has its own NLB, its own set of listeners, its own target groups, its own Elastic IPs - everything looks independent at the NLB level. But because they share a parent Gateway resource, Envoy Gateway ties them together through one security group. You might be well under 50 listeners per NLB and still hit the 60-rule quota because the rules are cumulative across all your proxies.
### Where the Error Actually Was
Kubernetes Events. Not pod logs, not controller logs - events:
```shell
kubectl get events -n envoy-gateway-system --field-selector reason=SyncLoadBalancerFailed
```
That’s where the security group quota exceeded error showed up. If you weren’t watching events, you’d never know.
### The Impact
This is what made it dangerous: Envoy completely froze. It wasn’t just the new listeners that failed - the entire data plane stopped processing configuration updates. No new routes, no policy changes, nothing. The Gateway resource showed Programmed: True because the Gateway API reconciliation succeeded. The actual data plane was stuck.
```mermaid
flowchart LR
    subgraph "What You See"
        GW_STATUS[Gateway Status:<br/>Programmed ✓]
        ENVOY_LOGS[Envoy Logs:<br/>nothing unusual]
        LBC_LOGS[LB Controller Logs:<br/>nothing unusual]
    end
    subgraph "What's Actually Happening"
        SG_FAIL[Security Group:<br/>quota exceeded]
        EVENTS[K8s Events:<br/>SyncLoadBalancerFailed]
        TRAFFIC[New listeners:<br/>no traffic]
    end
    style GW_STATUS fill:#96ceb4,color:#000
    style ENVOY_LOGS fill:#96ceb4,color:#000
    style LBC_LOGS fill:#96ceb4,color:#000
    style SG_FAIL fill:#ff6b6b,color:#fff
    style TRAFFIC fill:#ff6b6b,color:#fff
    style EVENTS fill:#ffd93d,color:#000
```
### The Fix
Disable automatic security group management and manage the port range manually:
```yaml
envoyProxy:
  service:
    annotations:
      # ... existing annotations ...
      # Disable automatic security group management
      service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: "false"
```
Then create a security group with a single rule covering your entire port range:
```text
Inbound Rule:
  Protocol:   TCP
  Port Range: 6100-6200
  Source:     0.0.0.0/0
```
One rule instead of 50+. The NLB still only forwards traffic for ports with actual listeners configured in the Envoy Gateway, so you’re not exposing unused ports - the NLB acts as the access control.
```mermaid
flowchart TD
    subgraph "Before: Shared SG, Per-Port Rules (hits quota)"
        SG1[Shared Security Group]
        SG1 --> R0A[Allow TCP 80<br/>HTTPS GW]
        SG1 --> R0B[Allow TCP 443<br/>HTTPS GW]
        SG1 --> R1[Allow TCP 6101<br/>TCP GW]
        SG1 --> R2[Allow TCP 6102<br/>TCP GW]
        SG1 --> R3[...]
        SG1 --> R4[Allow TCP 6199<br/>COMBINED 60+ 💥]
    end
    subgraph "After: Manual SG, Range Rule"
        SG2[Manual Security Group]
        SG2 --> RANGE[Allow TCP 6100-6200]
        SG2 --> HTTPS[Allow TCP 80, 443]
        RANGE --> NLB_TCP[TCP NLB only forwards<br/>ports with listeners]
        HTTPS --> NLB_HTTPS[HTTPS NLB]
    end
    style R4 fill:#ff6b6b,color:#fff
    style RANGE fill:#96ceb4,color:#000
    style HTTPS fill:#96ceb4,color:#000
    style NLB_TCP fill:#4ecdc4,color:#000
    style NLB_HTTPS fill:#4ecdc4,color:#000
```
This actually reduced complexity. The LB Controller was creating and deleting individual security group rules across all NLBs every time a listener was added or removed. Now the security group is static, and each NLB’s listener configuration determines which ports actually accept traffic. Fewer moving parts.
### Other Options
Disabling automatic security group management and using a port range rule is what worked for us. There are other approaches - do your own due diligence on what fits your environment:
- **Request a quota increase.** The 60-rule limit is a default, not a hard ceiling. You can request an increase through AWS Service Quotas. This lets you keep automatic security group management, but you're still one scaling event away from hitting the new limit.
- **Separate security groups per NLB.** Pre-create dedicated security groups and assign them via the `aws-load-balancer-security-groups` annotation on each EnvoyProxy Service. This overrides the shared group that Envoy Gateway creates and gives each NLB its own rule budget. You take on the management overhead, but each proxy's NLB is fully isolated.
- **Separate Gateway resources.** Instead of multiple EnvoyProxy resources under one Gateway, use independent Gateway resources. Each Gateway gets its own security group by default. More Gateway resources to manage, but you sidestep the shared security group entirely.
- **Split across multiple TCP Gateways.** Same pattern as the old multi-NLB ingress-nginx setup - keep each Gateway under the rule limit. More NLBs, more cost, but no shared security group issues.
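As a sketch of the second option, the values-level change would be one extra annotation per proxy. The security group ID is a placeholder, and you should confirm your AWS Load Balancer Controller version supports security groups on NLBs - that support landed relatively recently:

```yaml
envoyProxy:
  service:
    annotations:
      # Pre-created group dedicated to this NLB - its own 60-rule budget,
      # isolated from whatever the other proxies accumulate
      service.beta.kubernetes.io/aws-load-balancer-security-groups: sg-0123456789abcdef0
```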
The right answer depends on how many services you’re running, how often the listener set changes, and how locked down your security groups need to be. The key insight is knowing the shared security group behavior exists in the first place - once you’re aware of it, you can pick the mitigation that fits.
## What the Full Setup Looks Like
With both the HTTPS Gateway (from the previous post) and the TCP Gateway running side by side:
```mermaid
flowchart TB
    subgraph "HTTPS Traffic"
        H_CLIENT[Web Clients] --> H_NLB[NLB: https-gateway]
        H_NLB --> H_ENVOY[Envoy Proxy<br/>TLS termination]
        H_ENVOY --> H_GW[Gateway: envoy]
        H_GW --> HR1[HTTPRoute: app-a]
        H_GW --> HR2[HTTPRoute: app-b]
        H_GW --> HR3[HTTPRoute: app-c]
        HR1 --> H_SVC1[Web App A]
        HR2 --> H_SVC2[Web App B]
        HR3 --> H_SVC3[Web App C]
    end
    subgraph "TCP Traffic"
        T_CLIENT[TCP Clients] --> T_NLB[NLB: tcp-services]
        T_NLB --> T_ENVOY[Envoy Proxy<br/>TLS passthrough]
        T_ENVOY --> T_GW[TCP Gateway: envoy-tcp]
        T_GW --> TR1[TLSRoute :6101]
        T_GW --> TR2[TLSRoute :6102]
        T_GW --> MORE[... 20+ routes]
        TR1 --> T_SVC1[Service Alpha]
        TR2 --> T_SVC2[Service Beta]
    end
    subgraph "Shared Infrastructure"
        CI[ClusterIssuer<br/>Let's Encrypt]
        ARGO[ArgoCD<br/>GitOps sync]
    end
    ARGO -.-> H_GW
    ARGO -.-> T_GW
    CI -.-> H_GW
    style H_ENVOY fill:#4ecdc4,color:#000
    style T_ENVOY fill:#45b7d1,color:#000
    style ARGO fill:#ffd93d,color:#000
```
The HTTPS Gateway handles web traffic with TLS termination, certificates, security policies, and rate limiting. The TCP Gateway handles the service fleet with TLS passthrough and bidirectional PROXY protocol. Same Envoy Gateway controller manages both.
## ingress-nginx vs Envoy Gateway for TCP
| Aspect | ingress-nginx | Envoy Gateway |
|---|---|---|
| TCP routing | ConfigMap (`tcp` key) | TLSRoute per service |
| Multiple NLBs | Separate Helm releases, unique IngressClasses | Separate Gateway resources |
| PROXY protocol | Global config, one direction | Per-gateway policy, bidirectional |
| Per-service policies | Not possible | BackendTrafficPolicy targets individual routes |
| Port limit | 50 per NLB (NLB listener limit) | 50 per NLB (same limit, different bottleneck) |
| Security group management | Same issue if auto-managed | Same issue - disable and use range rules |
| Adding a service | Add line to ConfigMap, redeploy | Add listener + TLSRoute via values |
The NLB 50-listener limit still applies - that’s an AWS constraint, not an ingress controller one. But with Envoy Gateway, you’re more likely to hit the security group 60-rule quota first if you let the LB Controller manage rules automatically.
## Takeaways
- **Multiple EnvoyProxy resources under one Gateway share a single security group.** Envoy Gateway creates one security group for the Gateway and attaches it to all NLBs provisioned by its proxies. Listener ports from every proxy accumulate as rules on that one group. You can be well under 50 listeners per NLB and still hit the 60-rule quota.
- **When the quota is exceeded, Envoy freezes silently.** No error logs in Envoy, no error logs in the LB Controller. The error only shows up in Kubernetes events. Add event monitoring to your alerting if you haven't already.
- **Disable automatic security group management or assign dedicated security groups per NLB.** Use `aws-load-balancer-manage-backend-security-group-rules: "false"` with a port range rule, or use `aws-load-balancer-security-groups` to give each proxy its own group. Either way, understand the shared security group default before it bites you.
- **The NLB is your access control layer.** With security group management disabled, the NLB's listener configuration determines which ports accept traffic. No listener, no traffic - even if the security group allows the port range.
- **TLSRoutes are the Gateway API equivalent of the TCP ConfigMap.** More verbose, but each route is independently targetable with policies. Worth the trade-off.
- **PROXY protocol needs to flow both ways for TCP services.** ClientTrafficPolicy parses it from the NLB, BackendTrafficPolicy forwards it to the pods. Miss the backend policy and your services see Envoy's IP instead of the client's.
- **Check Kubernetes events. Seriously.** Pod logs and controller logs don't surface every failure. `kubectl get events` with field selectors should be part of your debugging playbook.
The migration from ingress-nginx to Envoy Gateway for TCP services was straightforward. The silent freeze from a shared security group hitting AWS quotas was not. If you’re running multiple EnvoyProxy resources under one Gateway, understand how security groups are shared before you learn this lesson the hard way.