wmnnd

Failing Big with Elixir and LiveView - A Post-Mortem

Here’s the story of how one of the world’s first production deployments of LiveView came to be - and how trying to improve it almost caused a political party in Germany to cancel its convention.

I wrote this post just a few days after the event took place. As annoying as it was, it was a good teachable moment. And soon I’ll write an update with a tutorial on how to scale to 5,000 concurrent LiveView users on a single VPS :slight_smile:

OvermindDL1

Oooo, this looks like an interesting read!

  1. Participants poll the GenServer for updates every second.

/me twitches

That seems… inefficient compared to just pushing updates as they happen instead of polling, perhaps with a debouncer? Phoenix makes it easy to push updates to a channel from any process, bypassing the majority of the message passing costs. This is foreboding, lol.

Everything was great - except for one problem: The party kept growing, and thus the number of participants in these events kept growing, too.

And yep, this seems to confirm…

The frequent polling intervals of the first iteration ended up maxing out all eight CPU cores of a t3a.2xlarge AWS EC2 instance.

And yep, that seems even heavier than expected for just polling on the BEAM, I wonder what other costs were involved…

So I decided to switch from constant polling to a Pub/Sub model. This is also quite easy to do with Elixir and Phoenix: Phoenix comes with its own easy-to-use PubSub module.

Yay! Hopefully straight to the socket processes and not re-rendering with LiveView (which does it so incredibly inefficiently compared to some other libraries).

A three-day convention packed with votes and almost 3,000 eligible members in Germany.

Didn’t stress test it first?!? Still though, 3k doesn’t sound like much, I’ve stress-tested Drab at work to over 40k connections on a single core without issues.

It was like watching a trainwreck: As soon as the server was up again, RAM usage immediately started climbing, and climbing … until the inevitable out-of-memory crash.

Oooo I can see so many possible causes…

The LiveView controller process would then receive these messages, set the @participants assign and render an updated view:

…oh wow, right, LiveView stores the changes inside each liveview process instead of shared data or just pushing it to the client to handle like you can in Drab (I still say Drab is overall better designed than LiveView, trivial not to cause this kind of issue in it, where LiveView encourages these issues…)…

With dozens of these updates happening per second as participants were joining the convention, messages were piling up in the inbox of the LiveView admin controller processes faster than they could be handled.

Eh, I wouldn’t think so. When a process on the BEAM sends a message to another process on the same system, there is backpressure: if the mailbox grows, the sender process gets scheduled less and less often until it is practically paused… Though if PubSub were used to talk to intermediary processes I could see issues…
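Either way, the mailbox theory is something that can be measured rather than argued about. Here is a minimal sketch using only the Elixir standard library: a process that never reads its mailbox stands in for an overloaded LiveView process, and `Process.info/2` reports how many messages are waiting. The module name `MailboxDemo` and the message shape are made up for illustration; they are not from the article.

```elixir
defmodule MailboxDemo do
  # Hypothetical stand-in for a process that cannot keep up:
  # it sleeps forever and never calls receive on its messages.
  def start_slow_consumer do
    spawn(fn -> Process.sleep(:infinity) end)
  end

  # Sends `count` messages to `pid` without waiting for any reply.
  def flood(pid, count) do
    for n <- 1..count, do: send(pid, {:participants_updated, n})
    :ok
  end

  # Returns the number of messages currently waiting in the mailbox.
  def queue_len(pid) do
    {:message_queue_len, len} = Process.info(pid, :message_queue_len)
    len
  end
end

pid = MailboxDemo.start_slow_consumer()
MailboxDemo.flood(pid, 10_000)
IO.inspect(MailboxDemo.queue_len(pid), label: "queued messages")
```

Running the same `Process.info/2` check against a real LiveView pid under load would show whether its mailbox is actually growing.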

My laptop crashed, the theory had been confirmed!

  1. Why on earth would the laptop crash from a single process consuming excess memory?!? What on earth was the OS being used?!
  2. No, I still think it was something other than the mailbox… Like using LiveView to re-render huge swaths of things instead of a better Drab-like model of pushing updates to the client to handle. Still should have debounced the changed data, which Drab would have automatically done by just broadcasting straight to the clients from the change process instead of an intermediary process with its own memory and mailbox and stack and all.

I then wanted the LiveView process to occasionally check if this other assign had been modified and, if so, also update @participants .

More polling? Why not a timeout message when a change comes in? Or better yet broadcast straight to the clients instead of going through intermediary processes per client (that sounds so heavy for shared data…).
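To make the "timeout message when a change comes in" idea concrete, here is a rough sketch using a plain GenServer. The first change of a batch arms a timer with `Process.send_after/3`, later changes are just collected, and one combined message goes out when the timer fires. Strictly speaking this is batching over a fixed window, not a debounce that resets its timer on every change. The module name `ChangeBatcher`, its options, and the message shapes are all hypothetical, and a plain `send/2` stands in for whatever broadcast the real app would use.

```elixir
defmodule ChangeBatcher do
  use GenServer

  # Hypothetical module, not from the article: collects changes and
  # forwards them to `subscriber` as one message per `interval` ms.

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def notify(server, change), do: GenServer.cast(server, {:change, change})

  @impl true
  def init(opts) do
    state = %{
      subscriber: Keyword.fetch!(opts, :subscriber),
      interval: Keyword.get(opts, :interval, 1_000),
      pending: [],
      timer: nil
    }

    {:ok, state}
  end

  @impl true
  def handle_cast({:change, change}, state) do
    # Arm the timer only for the first change of a batch. While nothing
    # changes, no timer runs, unlike a fixed once-per-second check.
    timer = state.timer || Process.send_after(self(), :flush, state.interval)
    {:noreply, %{state | pending: [change | state.pending], timer: timer}}
  end

  @impl true
  def handle_info(:flush, state) do
    # In a Phoenix app, a single PubSub broadcast could go here instead.
    send(state.subscriber, {:changes, Enum.reverse(state.pending)})
    {:noreply, %{state | pending: [], timer: nil}}
  end
end

{:ok, server} = ChangeBatcher.start_link(subscriber: self(), interval: 50)
for n <- 1..1_000, do: ChangeBatcher.notify(server, {:joined, n})

batch =
  receive do
    {:changes, changes} -> changes
  after
    1_000 -> []
  end

IO.puts("received #{length(batch)} changes in one message")
```

The subscriber sees a handful of messages instead of a thousand, which is the point: the throttling happens at the sender, before anything lands in a per-client mailbox.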

With thousands of updates coming in at the same time, neither Firefox nor Chromium stood a chance.

Debouncing and batching!

I implemented a mechanism to do so at most once every second.

Close enough to debouncing, though more costly when no updates are happening, lol.

  1. Avoid large payloads in Phoenix.PubSub if possible

Yep, best to send only changes, and let the pubsub go straight to the client socket process to be handled on the client instead of intermediary re-rendering processes.

  1. Throttle PubSub events at the sender level to avoid clogged process inboxes

Yeah, pubsub doesn’t backpressure as much as one would hope, this is why sending directly to the socket processes would be far better (which use pubsub internally anyway, still debounce your data!).

  1. Using assign/3 in LiveView always causes an update via Websocket, even if no changes were made

And LiveView has no ability to push updates to the client without sending updated DOM either unless you want to manually craft javascript and all, it really needs to take a few of Drab’s features (especially since Drab predated LiveView by about 2 years! I still don’t know why LiveView was made instead of just working on Drab…).

AstonJ

Nice one Philipp - I enjoyed reading your story and I look forward to the follow-up! :+1:

wmnnd

Thanks for your super detailed feedback, that was very interesting to read!

A timeout message at which level? At the LiveView level? And how could changes be broadcast directly to the clients without going through the LV process?

How would you implement debouncing then? In my current solution, I update the state to keep track of it needing to be updated, so this call that happens once every second is not really costly at all :smiley:

True, using some kind of diffing would obviously be ideal here. But again, I’m using LiveView, so it kinda has to go through that. Can you recommend a way to do diffing in Elixir?
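On the diffing question, one simple option (only a sketch, and it assumes participants can be identified by a unique id) is to compare the old and new sets with `MapSet.difference/2` and send only who joined and who left. `ParticipantDiff` and the `:joined`/`:left` keys are names invented for this example. For ordered lists, the standard library also has `List.myers_difference/2`.

```elixir
defmodule ParticipantDiff do
  # Hypothetical helper: compares two lists of participant ids and
  # returns only the ids that were added or removed.
  def diff(old_ids, new_ids) do
    old = MapSet.new(old_ids)
    new = MapSet.new(new_ids)

    %{
      joined: new |> MapSet.difference(old) |> Enum.sort(),
      left: old |> MapSet.difference(new) |> Enum.sort()
    }
  end
end

changes = ParticipantDiff.diff([1, 2, 3], [2, 3, 4, 5])
# changes.joined is [4, 5] and changes.left is [1]
IO.inspect(changes)
```

The payload then scales with the number of changes, not with the total number of participants.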

I might have overstated what happened by using the term “crash” :smiley: It froze for a few seconds until the OOM killer came in.
