Scenario 2 - Stateless-lb-frontend pods are not ready

Problem Statement

The user tries to send traffic to the application, target-a, which is exposed via IP address 20.0.0.1 and port 4000. However, the traffic doesn't reach the intended destination, so troubleshooting is requested.

Known Inputs

Cluster resources

  • Spire is deployed in the namespace spire
  • NSM is deployed in the namespace nsm

Meridio configuration

  • Meridio version: v1.0.0
  • TAPA version: v1.0.0
  • Meridio is deployed in the namespace red
  • Meridio components:
    • 1 Trench trench-a
    • 1 Attractor attractor-a-1
    • 2 Gateways gateway-v4-a/gateway-v6-a
    • 2 Vips vip-a-1-v4/vip-a-1-v6
    • 1 Conduit conduit-a-1
    • 1 Stream stream-a-i
    • 1 Flow flow-a-z-tcp

Gateway configuration

  • Interface
    • VLAN: VLAN ID 100
    • The VLAN network is based on the network the Kubernetes worker nodes are attached to via eth0
  • IPv4 169.254.100.150/24 and IPv6 100:100::150/64
  • Routing protocol
    • BGP + BFD
    • local-asn: 8103
    • remote-asn: 4248829953
    • local-port: 10179
    • remote-port: 10179

Target configuration

  • Deployment: target-a in the namespace red
  • Stream stream-a-i in conduit-a-1 in trench-a is open in all the target pods

Error

  • Sending traffic gives the following output:
docker exec -it trench-a mconnect -address 20.0.0.1:4000 -nconn 400 -timeout 2s
Failed connects; 400
Failed reads; 0
docker exec -it trench-a mconnect -address [2000::1]:4000 -nconn 400 -timeout 2s
Failed connects; 400
Failed reads; 0

Solution

  1. Check the status of all the Spire and NSM pods running in the namespaces spire and nsm, respectively. (A scripted version of this check is sketched after the output below.)
    • All pods should be in the Running state, meaning every pod has been bound to a node and all of its containers have been created. All containers inside the pods should also be in the ready state.
kubectl get pods -n=spire
NAME                READY   STATUS    RESTARTS   AGE
spire-agent-4wj68   1/1     Running   0          2m1s
spire-agent-q77lz   1/1     Running   0          2m4s
spire-server-0      2/2     Running   0          2m6s
kubectl get pods -n=nsm
NAME                                    READY   STATUS    RESTARTS   AGE
admission-webhook-k8s-b9589cbcb-g9wxs   1/1     Running   0          7m33s
forwarder-vpp-d4sqt                     1/1     Running   0          7m33s
forwarder-vpp-n5l2p                     1/1     Running   0          7m33s
nsm-registry-5b5b897645-6bcs2           1/1     Running   0          7m33s
nsmgr-6msjz                             2/2     Running   0          7m33s
nsmgr-8kh7t                             2/2     Running   0          7m33s
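As a scripted alternative to inspecting the READY column manually, kubectl wait can confirm that every pod in a namespace reports the Ready condition; the 60-second timeout below is an arbitrary choice:
kubectl wait --for=condition=Ready pods --all -n=spire --timeout=60s
kubectl wait --for=condition=Ready pods --all -n=nsm --timeout=60s
The command exits non-zero if any pod fails to become Ready within the timeout, which makes it convenient for scripted checks.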
  1. Check the status of all the Meridio pods running in the namespace red. (A jq-based shortcut for spotting not-ready containers is sketched after the listing below.)
    • All pods should be in the Running state, meaning every pod has been bound to a node and all of its containers have been created. All containers inside the pods should also be in the ready state.
    • In the current scenario, all the pods in the namespace red are in the Running state; however, some containers inside a few pods are not in the ready state: 2 containers in the stateless-lb-frontend-attractor-a-1 pods and 1 container in the proxy-conduit-a-1 pods.
kubectl get pods -n=red
NAME                                                  READY   STATUS    RESTARTS   AGE
ipam-trench-a-0                                       1/1     Running   0          34m
meridio-operator-596d7f88b8-v5gf9                     1/1     Running   0          34m
nse-vlan-attractor-a-1-5cf67947d5-f8slj               1/1     Running   0          34m
nsp-trench-a-0                                        1/1     Running   0          34m
proxy-conduit-a-1-6vdxl                               0/1     Running   0          34m
proxy-conduit-a-1-hv4m5                               0/1     Running   0          34m
stateless-lb-frontend-attractor-a-1-d8db96c8f-6kv6l   1/3     Running   0          34m
stateless-lb-frontend-attractor-a-1-d8db96c8f-shvfd   1/3     Running   0          34m
target-a-77b5b48457-6xkcj                             2/2     Running   0          33m
target-a-77b5b48457-jxbdv                             2/2     Running   0          33m
target-a-77b5b48457-pl9fz                             2/2     Running   0          33m
target-a-77b5b48457-szzzw                             2/2     Running   0          33m
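To list only the pods that have at least one container not in the ready state, a small jq filter over the pod JSON can help (this assumes jq is installed on the machine running kubectl). Given the listing above, it should print the two proxy-conduit-a-1 pods and the two stateless-lb-frontend-attractor-a-1 pods:
kubectl get pods -n=red -o=json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .ready == false)) | .metadata.name'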
  1. As the proxy-conduit-a-1 start-up depends on the successful start-up of stateless-lb-frontend-attractor-a-1, the first step is to check the details of the stateless-lb-frontend-attractor-a-1 pods. This can give hints about why the containers fail to become ready.
    • Going through the detailed description of stateless-lb-frontend-attractor-a-1 presented below, a few things can be observed:
      • The containers that are not in the ready state (Ready: False): frontend and stateless-lb.
      • The Events history contains the following warning: Readiness probe failed: service unhealthy (responded with "NOT_SERVING").
kubectl describe pods stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red

Containers:
  stateless-lb:
    Container ID:   ...
    Image:          ...
    Image ID:       ...
    Port:           ...
    Host Port:      ...
    State:          Running
      Started:      Tue, 21 Feb 2023 21:49:20 +0000
    Ready:          False
    Restart Count:  0
    Liveness:       exec [/bin/grpc_health_probe -addr=unix:///tmp/health.sock -service= -connect-timeout=250ms -rpc-timeout=350ms] delay=0s timeout=3s period=10s #success=1 #failure=5
    Readiness:      exec [/bin/grpc_health_probe -addr=unix:///tmp/health.sock -service=Readiness -connect-timeout=250ms -rpc-timeout=350ms] delay=0s timeout=3s period=10s #success=1 #failure=5
    Startup:        exec [/bin/grpc_health_probe -addr=unix:///tmp/health.sock -service= -connect-timeout=250ms -rpc-timeout=350ms] delay=0s timeout=2s period=2s #success=1 #failure=30

---

  nsc:
    Container ID:   ...
    Image:          ...
    Image ID:       ...
    Port:           ...
    Host Port:      ...
    State:          Running
      Started:      Tue, 21 Feb 2023 21:49:43 +0000
    Ready:          True
    Restart Count:  0

---

  frontend:
    Container ID:   ...
    Image:          ...
    Image ID:       ...
    Port:           ...
    Host Port:      ...
    State:          Running
      Started:      Tue, 21 Feb 2023 21:49:52 +0000
    Ready:          False
    Restart Count:  0
    Liveness:       exec [/bin/grpc_health_probe -addr=unix:///tmp/health.sock -service= -connect-timeout=250ms -rpc-timeout=350ms] delay=0s timeout=3s period=10s #success=1 #failure=5
    Readiness:      exec [/bin/grpc_health_probe -addr=unix:///tmp/health.sock -service=Readiness -connect-timeout=250ms -rpc-timeout=350ms] delay=0s timeout=3s period=10s #success=1 #failure=5
    Startup:        exec [/bin/grpc_health_probe -addr=unix:///tmp/health.sock -service= -connect-timeout=250ms -rpc-timeout=350ms] delay=0s timeout=2s period=2s #success=1 #failure=30

Events:
  Warning  Unhealthy  62s (x4 over 64s)  kubelet  Readiness probe failed: service unhealthy (responded with "NOT_SERVING")
  Warning  Unhealthy  53s (x5 over 64s)  kubelet  Readiness probe failed: service unhealthy (responded with "NOT_SERVING")
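If the full describe output is too verbose, the same probe warnings can be pulled directly from the event stream, sorted by time:
kubectl get events -n=red --field-selector=involvedObject.name=stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc --sort-by='.lastTimestamp'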
  1. From the previous step, it can be concluded that the two containers, frontend and stateless-lb, are not ready because their readiness probes fail. To debug further, check the logs of the two containers for any errors.
    • The logs of the stateless-lb container contain no error messages.
    • The logs of the frontend container contain one error message, which states "error":"gateway down".
kubectl logs stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red -c=stateless-lb | grep "\"severity\":\"error\""

...

kubectl logs stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red -c=frontend | grep "\"severity\":\"error\""
{"severity":"error","timestamp":"2023-02-21T21:49:56.733+00:00","service_id":"Meridio-frontend","message":"connectivity","version":"1.0.0","extra_data":{"class":"FrontEndService","func":"Monitor","status":16,"out":["BIRD 2.0.8 ready.","Name Proto Table State Since Info","NBR-gateway-v4-a BGP --- start 21:49:53.760 Active Neighbor address: 169.254.100.150%ext-vlan0","NBR-gateway-v6-a BGP --- start 21:49:53.760 Active Neighbor address: 100:100::150%ext-vlan0",""],"error":"gateway down"}}
  1. Furthermore, it is useful to check the state of the BGP sessions and the BIRD logs. (A polling sketch follows the logs below.)
    • In a normal case, the BGP state should be ESTABLISHED, and the IPv4 and IPv6 channels' states should be UP. However, as can be seen below, the BGP state is ACTIVE (refer to 'Helpful Resources: BGP' for more information about the states), the IPv4 and IPv6 channels are DOWN, and the last error is Connection refused.
kubectl exec -it stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red -c=frontend -- birdc -s /var/run/bird/bird.ctl show protocol all
BIRD 2.0.8 ready.
Name              Proto      Table      State  Since         Info
device1           Device     ---        up     10:06:22.017

kernel1           Kernel     master4    up     10:06:22.017
  Channel ipv4
    State:          UP
    Table:          master4
    Preference:     10
    Input filter:   REJECT
    Output filter:  default_rt
    Routes:         0 imported, 0 exported, 0 preferred
    Route change stats:     received   rejected   filtered    ignored   accepted
      Import updates:              0          0          0          0          0
      Import withdraws:            0          0        ---          0          0
      Export updates:              0          0          0        ---          0
      Export withdraws:            0        ---        ---        ---          0

kernel2           Kernel     master6    up     10:06:22.017
  Channel ipv6
    State:          UP
    Table:          master6
    Preference:     10
    Input filter:   REJECT
    Output filter:  default_rt
    Routes:         0 imported, 0 exported, 0 preferred
    Route change stats:     received   rejected   filtered    ignored   accepted
      Import updates:              0          0          0          0          0
      Import withdraws:            0          0        ---          0          0
      Export updates:              0          0          0        ---          0
      Export withdraws:            0        ---        ---        ---          0

NBR-gateway-v4-a  BGP        ---        start  10:06:22.326  Active        Socket: Connection refused
  BGP state:          Active
    Neighbor address: 169.254.100.150%ext-vlan0
    Neighbor AS:      4248829953
    Local AS:         8103
    Connect delay:    2.089/5
    Last error:       Socket: Connection refused
  Channel ipv4
    State:          DOWN
    Table:          master4
    Preference:     100
    Input filter:   default_rt
    Output filter:  cluster_e_static
  Channel ipv6
    State:          DOWN
    Table:          master6
    Preference:     100
    Input filter:   REJECT
    Output filter:  REJECT

NBR-gateway-v6-a  BGP        ---        start  10:06:22.326  Active        Socket: Connection refused
  BGP state:          Active
    Neighbor address: 100:100::150%ext-vlan0
    Neighbor AS:      4248829953
    Local AS:         8103
    Connect delay:    0.463/5
    Last error:       Socket: Connection refused
  Channel ipv4
    State:          DOWN
    Table:          master4
    Preference:     100
    Input filter:   REJECT
    Output filter:  REJECT
  Channel ipv6
    State:          DOWN
    Table:          master6
    Preference:     100
    Input filter:   default_rt
    Output filter:  cluster_e_static

NBR-BFD           BFD        ---        up     10:06:22.017
kubectl exec -it stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red -c=frontend -- cat /var/log/bird.log

2023-02-22 10:24:45.650 <TRACE> NBR-gateway-v4-a: Connecting to 169.254.100.150 from local address 169.254.100.2
2023-02-22 10:24:45.652 <TRACE> NBR-gateway-v4-a: Connection lost (Connection refused)
2023-02-22 10:24:45.652 <TRACE> NBR-gateway-v4-a: Connect delayed by 5 seconds
2023-02-22 10:24:46.141 <TRACE> NBR-gateway-v6-a: Connecting to 100:100::150 from local address 100:100::2
2023-02-22 10:24:46.141 <TRACE> NBR-gateway-v6-a: Connection lost (Connection refused)
2023-02-22 10:24:46.141 <TRACE> NBR-gateway-v6-a: Connect delayed by 5 seconds
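Because BIRD retries the connection every few seconds, it can be useful to poll the protocol summary and observe whether the sessions ever leave the Active state; a simple sketch, assuming the watch utility is available where kubectl runs:
watch -n 2 kubectl exec stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red -c=frontend -- birdc -s /var/run/bird/bird.ctl show protocols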
  1. One possible reason for a BGP session stuck in the ACTIVE state is a BGP configuration error. Therefore, check the BGP configuration for any errors.
    • The BGP configuration contains information about the gateways. Commonly, the error originates in the Gateway custom resource configuration. (A grep shortcut for the port settings is sketched after the configuration below.)
kubectl exec -it stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red -c=frontend -- cat /etc/bird/bird-fe-meridio.conf

log "/var/log/bird.log" 20000 "/var/log/bird.log.backup" { debug, trace, info, remote, warning, error, auth, fatal, bug };
log stderr all;

protocol device {
}

filter default_rt {
  if ( net ~ [ 0.0.0.0/0 ] ) then accept;
  if ( net ~ [ 0::/0 ] ) then accept;
  else reject;
}

filter cluster_e_static {
  if ( net ~ [ 0.0.0.0/0 ] ) then reject;
  if ( net ~ [ 0::/0 ] ) then reject;
  if source = RTS_STATIC && dest != RTD_BLACKHOLE then accept;
  else reject;
}

template bgp LINK {
  debug {events, states, interfaces};
  direct;
  hold time 3;
  bfd off;
  graceful restart off;
  setkey off;
  ipv4 {
    import none;
    export none;
    next hop self;
  };
  ipv6 {
    import none;
    export none;
    next hop self;
  };
}

protocol kernel {
  ipv4 {
    import none;
    export filter default_rt;
  };
  kernel table 4096;
  merge paths on;
}

protocol kernel {
  ipv6 {
    import none;
    export filter default_rt;
  };
  kernel table 4096;
  merge paths on;
}

protocol bgp 'NBR-gateway-v4-a' from LINK {
  interface "ext-vlan0";
  local port 10180 as 8103;
  neighbor 169.254.100.150 port 10180 as 4248829953;
  bfd {
    min rx interval 300ms;
    min tx interval 300ms;
    multiplier 5;
  };
  hold time 24;
  ipv4 {
    import filter default_rt;
    export filter cluster_e_static;
  };
}

protocol bgp 'NBR-gateway-v6-a' from LINK {
  interface "ext-vlan0";
  local port 10180 as 8103;
  neighbor 100:100::150 port 10180 as 4248829953;
  bfd {
    min rx interval 300ms;
    min tx interval 300ms;
    multiplier 5;
  };
  hold time 24;
  ipv6 {
    import filter default_rt;
    export filter cluster_e_static;
  };
}

protocol bfd 'NBR-BFD' {
  interface "ext-vlan0" {
  };
}
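Since this file is generated from the Gateway custom resources, a quick grep for the port statements makes it easy to compare the rendered values against the intended ones (10179 according to the known inputs):
kubectl exec -it stateless-lb-frontend-attractor-a-1-d8db96c8f-q92dc -n=red -c=frontend -- grep -n "port" /etc/bird/bird-fe-meridio.conf
Here the output would show local port 10180 and neighbor port 10180 for both BGP protocol blocks, which does not match the expected 10179.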
  1. As mentioned in the previous step, there might be an issue with the Gateway custom resource configuration. Therefore, check the details of gateway-v4-a and gateway-v6-a for any misconfiguration. (A compact way to compare the port fields across both gateways is sketched after the YAML below.)
    • Several properties of the Gateway custom resource should be checked thoroughly, since they are a common source of errors when configured incorrectly. In particular, verify that namespace, trench, address, bgp.local-asn, bgp.remote-asn, bgp.local-port, and bgp.remote-port are set to the correct values according to the specified configuration (reference: Known Inputs).
    • Going through the detailed descriptions of gateway-v4-a and gateway-v6-a presented below and verifying the mentioned properties, it can be noticed that the bgp.local-port and bgp.remote-port fields are misconfigured. These properties should be set to 10179; however, in the current deployment they are set to 10180. This misconfiguration prevents the ingress traffic from reaching the intended destination.
kubectl get gateway gateway-v4-a -n=red -o=yaml

apiVersion: meridio.nordix.org/v1
kind: Gateway
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"meridio.nordix.org/v1","kind":"Gateway","metadata":{"annotations":{},"labels":{"trench":"trench-a"},"name":"gateway-v4-a","namespace":"red"},"spec":{"address":"169.254.100.150","bgp":{"bfd":{"min-rx":"300ms","min-tx":"300ms","multiplier":5,"switch":true},"hold-time":"24s","local-asn":8103,"local-port":10180,"remote-asn":4248829953,"remote-port":10180}}}
  creationTimestamp: "2023-02-21T21:49:04Z"
  generation: 2
  labels:
    trench: trench-a
  name: gateway-v4-a
  namespace: red
  ownerReferences:
  - apiVersion: meridio.nordix.org/v1
    kind: Trench
    name: trench-a
    uid: 713fd582-0f86-4c64-94e7-bcdcaa581a92
  resourceVersion: "1185"
  uid: 37cf0655-62f4-44eb-83dc-ec420f452f04
spec:
  address: 169.254.100.150
  bgp:
    bfd:
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 5
      switch: true
    hold-time: 24s
    local-asn: 8103
    local-port: 10180
    remote-asn: 4248829953
    remote-port: 10180
  protocol: bgp
  static:
    bfd: {}
kubectl get gateway gateway-v6-a -n=red -o=yaml

apiVersion: meridio.nordix.org/v1
kind: Gateway
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"meridio.nordix.org/v1","kind":"Gateway","metadata":{"annotations":{},"labels":{"trench":"trench-a"},"name":"gateway-v6-a","namespace":"red"},"spec":{"address":"100:100::150","bgp":{"bfd":{"min-rx":"300ms","min-tx":"300ms","multiplier":5,"switch":true},"hold-time":"24s","local-asn":8103,"local-port":10180,"remote-asn":4248829953,"remote-port":10180}}}
  creationTimestamp: "2023-02-21T21:49:04Z"
  generation: 2
  labels:
    trench: trench-a
  name: gateway-v6-a
  namespace: red
  ownerReferences:
  - apiVersion: meridio.nordix.org/v1
    kind: Trench
    name: trench-a
    uid: 713fd582-0f86-4c64-94e7-bcdcaa581a92
  resourceVersion: "1219"
  uid: b259507f-b40c-4d5e-848b-bb1b2b020fdc
spec:
  address: 100:100::150
  bgp:
    bfd:
      min-rx: 300ms
      min-tx: 300ms
      multiplier: 5
      switch: true
    hold-time: 24s
    local-asn: 8103
    local-port: 10180
    remote-asn: 4248829953
    remote-port: 10180
  protocol: bgp
  static:
    bfd: {}
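Instead of reading the full YAML of each gateway, the relevant fields can be extracted for all gateways at once with custom columns (assuming the Gateway CRD registers the plural name gateways):
kubectl get gateways -n=red -o=custom-columns=NAME:.metadata.name,LOCAL-PORT:.spec.bgp.local-port,REMOTE-PORT:.spec.bgp.remote-port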
  1. Changing bgp.local-port and bgp.remote-port to 10179 for both gateway-v4-a and gateway-v6-a resolves the traffic issue reported by the user.
docker exec -it trench-a mconnect -address 20.0.0.1:4000 -nconn 400 -timeout 2s
Failed connects; 0
Failed reads; 0
target-a-77b5b48457-n46p4 91
target-a-77b5b48457-qsrwg 89
target-a-77b5b48457-pdtq5 107
target-a-77b5b48457-72vlp 113

docker exec -it trench-a mconnect -address [2000::1]:4000 -nconn 400 -timeout 2s
Failed connects; 0
Failed reads; 0
target-a-77b5b48457-pdtq5 90
target-a-77b5b48457-72vlp 109
target-a-77b5b48457-qsrwg 110
target-a-77b5b48457-n46p4 91

Helpful Resources