I ran into a bit of a problem with an NSX deployment I was recently working on… Sadly, human error caused the issue, but as the old saying goes, “better to find problems during testing than in production” (I just made that up), or “DCYCJ” (Double Check Your Config, James).
A user got in touch after deploying a simple web server, saying they could only intermittently connect to their site from their desktop.
To give you guys some background on this setup, there are two ESGs set up as an ECMP pair (I suspect some people are already screaming out what the problem is). When I was setting up the ESGs I diligently disabled the local ESG firewall… on one of the two ESGs.
The local ESG firewall is stateful. With ECMP, traffic sent north through one ESG can return southbound through the other, so ~50% of the return traffic arrived at the ESG that still had its local firewall enabled. A stateful firewall checks the first packet of a session against its rules; if a rule permits it, the packet is passed and an entry is added to the state table. In my case that part worked, as I had an any-any allow rule in place on the ESG firewall. From that point forward, packets in that particular session match the existing state table entry and are allowed through without any further inspection. The trouble was that traffic was traversing one ESG on the way out and returning via the other, whose firewall had no entry for the session in its state table, so the packets were dropped.
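To make the failure mode a bit more concrete, here is a minimal toy model in Python. It is only a sketch of the general idea (a stateful firewall plus asymmetric paths), not anything resembling the NSX API: the `Edge` class, the `simulate` function, the ESG names and the example addresses are all made up for illustration.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Edge:
    """Hypothetical stand-in for an ESG with an optional stateful firewall."""
    name: str
    firewall_enabled: bool
    state_table: set = field(default_factory=set)

    def handle(self, flow, starts_connection):
        """Return True if the packet is passed, False if it is dropped."""
        if not self.firewall_enabled:
            return True  # no firewall, everything is forwarded
        if starts_connection:
            # Assumed any-any allow rule: permit and remember the session.
            self.state_table.add(flow)
            return True
        # Mid-session packet: only passed if this edge saw the session start.
        return flow in self.state_table


def simulate(firewall_settings):
    """Try every outbound/return path pair, with fresh state for each pair."""
    results = {}
    for out_name, back_name in product(list(firewall_settings), repeat=2):
        edges = {n: Edge(n, on) for n, on in firewall_settings.items()}
        flow = ("10.0.0.10", 51000, "192.0.2.20", 80)  # made-up session
        passed_out = edges[out_name].handle(flow, starts_connection=True)
        passed_back = passed_out and edges[back_name].handle(
            flow, starts_connection=False
        )
        results[(out_name, back_name)] = passed_back
    return results


if __name__ == "__main__":
    # My misconfiguration: firewall disabled on one ESG only.
    for path, ok in simulate({"esg1": False, "esg2": True}).items():
        print(path, "works" if ok else "DROPPED")
```

In this model the only path pair that fails is the one that leaves via the ESG with the firewall disabled and comes back via the ESG with it enabled, because that firewall never saw the session start. Passing `{"esg1": False, "esg2": False}` instead makes every pair work, which is the state I should have configured in the first place.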