tales of distributed systems

Here’s a story I posted elsewhere about a year ago. I dont work there anymore.

tfw programming against a distributed framework scares the daylights out of u

At work our marathon cluster that manages our webapp containers has been going bananas. thankfully in integration and not anything that real users touch, every time we deploy multiple web apps at once, the leader nodes’ cpu quickly spikes to 100%. (it fails over to the next master***** within minutes and then the same thing happens there).

this happens on EVERY deployment operation (in mesos, more than half the operations count as ‘deployments’, including scaling apps, deleting apps, relaunching apps to pick up new configs). deployments* in marathon are all API calls and api calls are KNOWN to be cpu intensive[1]. so we’re thinking about the problem: what about some conditions of deployments causing this cpu-crushing flood of API calls? and how are we creating those conditions?

my super experienced coworker thinks it’s because the flood of traffic hits service discovery containers dynamically configuring and reloading nginx [for routing to other services] and these loadbalancers and other apps, under load, start to fail the built in automatic healthchecks which just ping a URL in the webapp. we’re running load tests to test high traffic (pretty frequently these days to gear up for open enrollment- we’re a healthcare company and the beginning of the year is our highest traffic week) and these load test generate a lot of internal loadbalancer requests AND external incoming requests. once the container fails the healthcheck (in this case because it’s slow) marathon’s behavior is to add more instances [of the containers] and scale down the old, apparently “failing” ones. so then those scaling operations (up and down) are all deployment ops as well which could be a cause of a ton of api requests.

meanwhile, the service discovery / internal loadbalancer services are going nuts too. i think they are generating a lot of api calls against the master node because we just have too many apps. a diff coworker has a good guess that the bottleneck here is the playjson library from the scala play framework that marathon uses to parse this a gigantic dictionary of app config and state that it keeps in apache zookeeper. apparently that particular library just isn’t great with huge json files. (in our case with 1500ish apps on 26x 16-cpu marathon “worker” nodes, the dictionary size is 1.5 MB gzipped compressed and whittled down to just the apps. 7 MB unzipped. the full state dict, which the master nodes likely have to search through the entirety of in the worst case, is likely over 20 MB).

to figure out what’s going on, last night i got some coworkers to help me kick off a load test and run a ton of deployments and capture stack traces, syscall traces, running cpu loadavgs, thread dumps, thread counts, connection pool counts, and incoming http requests from the leader master nodes. as well as loadavgs from the worker nodes as i kicked off deployments. so i can look at what’s causing cpu spikes in both.

…however. I wrote this plugin for it to grab secrets from an encrypted source, like if an app needs to grab a database password or something, and whenever a container starts, the deployment process grabs these secrets and puts them as environment variables in the container before doing anything else-- so a webapp can have these passwords delivered from an encrypted place right away, before it starts booting up, all orchestrated by the framework as a scala plugin.

so i wrote this plugin and i wrote comprehensive unit tests for it but when 2 dev teams started using it 1 of them had troubles getting their app to start. it’s super unlikely that my plugin does anything api related. but i do know that all instances of every plugin run on the master, and these were just put into the environment 3 weeks ago. and it could be slowing down the master node when deployments happen because it runs on master everytime a container deployment happens. MAYBE there’s some bug in marathon plugins that causes the cpu to spike. so maybe my plugin’s the reason this complicated problem is happening and I’m TERRIFIED to look at the thread dump and find out that some unexpected side effect of this plugin causes the marathon masters nodes to slow down. writing plugins for the first time for a distributed system that someone else wrote for the first time is SO yikes

[1] Google Groups “some of the API calls that the UI uses are very expensive and we are working on improving performance there… (especially in the frontend code).”

**** SIDE BAR anyone else HATE some of the terminology we have to use for distributed systems???

I wrote that late 2017, shortly before leaving the company. so… it turned out that my plugin was indeed compounding an already bad problem. Everytime any app state was changed- starts, stops, deploys, config changes- marathon would parse through the 7 MB json state file, with the super slow play.json, and then of course, the secrets plugin was invoked, it would parse AGAIN through the 7 megs of json. Enough concurrent deploys and the primary node would spend more of its time parsing JSON than it would answering API requests. Didnt matter that app healthchecks were returning just fine, because the healthchecks that the other nodes would request were… still going through the Marathon API.

So a primary would bog itself down in api request handling after about 20-some deploys when something to the tune of 1k apps were already running (on an odd and >2 number of xlarge-scale EC2 instances), all because of pesky JSON state handling in what should have been a predictably scaleable dimension. These healthcheck api requests originating from the other primary nodes time out, so they go through leader election. As soon as the new primary takes over, it starts handling all of the piled up deployment requests, parsing all that delicious JSON goodness, grinds to a halt as well, very quickly you realize that you have a “rotating single master” problem. CPU graphs were just triangles cycling colors.

My plugin wasnt the root cause, the root cause could be traced down to a bad JSON implementation, or perhaps a design flaw, or perhaps an unwillingness/nonviability for the company who claimed custody over the framework to work on the open-source parts of their software in favor of the wildly expensive enterprise variant. (not saying kubernetes’ stewardship is in any better hands, its just in hands less tied by VC money). Dont really know. I pulled the plug on the plugin project and cleaned everything up before I left. I talked to my manager. wrote a really long informative email with bold keywords and concise executive summaries and technical details and still got a one line “dont we need this to ship X?” (X being the name of their one-click-spin-up-stack automation project) email from the head of engineering to all 200 engineers and managers further cementing my opinion that if you want to be a successful CTO you should probably forget how to read


what is marathon?

edit: Marathon: A container orchestration platform for Mesos and DC/OS

but that doesn’t help haha

its a scala thing that runs on top of mesos, which lets u take any computers and turn them into a compute cluster. marathon then goes and adds a scheduler and some other stuff. this lets u have like 40 computers in a garage or in aws or whatever and I could request 15 instances of my app and it’ll spread them out over the computers and (with some of the finagling described above) loadbalance between them. I could then launch 40 instances of maybe another service and it’ll find the room for those, sometimes being roommates with those previous instances I listed above or even each other. basically mesos can abstract away your resources (disk, memory, compute from ur X computers) into a big pool, and marathon will tell ur apps what lane to swim in.

it’ll also let you tag versions, which means it’ll let you canary (test a new version on one instance of a service) before upgrading all of them; this also means u can do blue-green deployments (a pattern for upgrading all your versions with zero downtime)

basically kubernetes before kubernetes came in and started doing a bunch of this and more out of the box.

idk distributed systems are cool bc they can make ur system of multiple apps super scaleable but also come with a, uh, host of interesting problems and patterns. high availability! fault tolerance! consensus problems! with distributed web apps youre forced to think about networking and load balancing and configs and statelessness. throw in containerization and most of my fave parts of operational computing r there, just add code

1 Like