Here is the story of all the work I did this week, and it is so odd I feel it needs to be shared... but lets talk about the world the problem can be found in.
It is a µservice (that is the ohh, look how smart I am to use a symbol for micro) written with DropWizard and deployed in a Docker container, with Traefik in front of that. To hit it you go through a ingress controller and a load balancer.
+---------------+ | | | Internet | | | +---------------+ | | | +---------------+ | | | Load Balancer | | | +---------------+ | | | +---------------+ | | | Ingress | | | +---------------+ | | | +---------------+ | | | Traefik | | | +---------------+ | | | +---------------+ | | | Microservice | | | +---------------+
The microservice acts a BFF (backend for front end), so it does some auth fun and makes calls to an internal API and manipulates them (e.g. changes the data structure). We have a number of different REST style calls across GET, POST, PUT and DELETE.
In terms of environments, we obviously have a production environment and we have an integration environment which are setup the same way. We have a stubs environment where we fake out the internal API. Lastly we can run the microservice on our laptops, but there is just the microservice... no traefik etc...
When we run our load test from outside everything works except 1 call which fails 98% of the time with a HTTP 500. Other calls (even the same method) all work. Load tests run against the stubs environment and the same call works perfectly on our laptops, in integration and production.
We can even run the fake internal API on the laptops, with the load tests and it works fine there. Basically, one call fails most of the time in one environment... 🤔
When we pull the logs for the microservice in stubs things get weirder... it is returning an HTTP 200 😐 This is the same experience we get everywhere else, it works... except in stubs to the load tests it gets a 500
+---------------+ 500 Here | | | Internet | | | +-------+-------+ | | / | | +-------+-------+ 200 Here | | | Microservice | | | +---------------+
Further pulling of logs show 500s in the load balancer and the ingress so somewhere between the microservice and traefik the HTTP 200 becomes a HTTP 500... but we don't have logs on traefik we can pull which makes this a bit harder to determine...
Logging onto Traefik
Next we logged onto Traefik and decided to curl the microservice directly and lo and behold... we get a 500 🤯 and to make it more interesting that the microservice logs still show it is returning a 200 - like what the actual?! Could there be an network issue or magic?
Interesting the 500 came with an error saying "insufficient content written"
Insufficient content written
This led me to looking at the content and looking at what we are sending and I see we are sending
Content-Length and the body and guess what... the length of the body does not equal the
Content-Length... oh 💩
This is a client side HTTP error... where the server sends the incorrect amount of body, so the client goes, well the server is wrong and raises a 500. I've always thought 500 errors were server error and thus could only be raised on the server.
The fix is simple, in our server we were using the
Response.fromResponse to map internal API to the public API and so it was copying the
Content-Length from the internal API and we were sending that along.
This meant the fix was to delete the
Content-Length header before we call
fromResponse to ensure it would rebuild the header and be correct.
The reason why it didn't fail else where, the version of the mock API we use added
Content-Length but newer versions and the real APIs used chunked encoding which never set the header so there is no issue there.
This was a long road to understand the issue, and one line to fix it, and totally a new experience in learning that server errors can occur client side too.