There’s something rather unique in Erlang in how it approaches failure compared to most other programming languages. There’s this common way of thinking where the language, programming environment, and methodology do everything possible to prevent errors. If you can deal with some errors rather than preventing them at all cost, then most undefined behaviours of a program can go in that "deal with it" approach.
Stuff Goes Bad: Erlang in Anger by Fred Hébert. 93 pages. 2017.
By far, the most common cause of failure I’ve encountered in real-world scenarios is due to the node running out of memory. Furthermore, it is usually related to message queues going out of bounds. There are plenty of ways to deal with this, but knowing which one to use will require a decent understanding of the system you’re working on.
The safe margin of error you established when designing the system slowly erodes as more people use it. It’s important to consider the tradeoffs your business can tolerate from that point of view, because users will tend not to appreciate seeing their allowed usage go down all the time, possibly even more so than seeing the system go down entirely from time to time.
If you need to react to old events before they are too old, then things become more complex, as you can’t know about it without looking deep in the stack each time, and dropping from the bottom of the stack in a constant manner gets to be inefficient.
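As a rough illustration of the simpler case the passage builds on, here is a minimal sketch of a size-bounded stack buffer: the newest event is served first, and when the buffer is full the oldest event is dropped from the bottom. It is written in Python rather than Erlang purely for illustration, and the names `StackBuffer`, `push`, `pop`, and `dropped` are hypothetical, not taken from the book. It does not cover the harder age-based case, where entries would have to be inspected for staleness.

```python
from collections import deque


class StackBuffer:
    """Bounded LIFO buffer: newest item out first, oldest item dropped when full.

    Illustrative sketch only; names and behaviour are assumptions,
    not an API from the book or from Erlang/OTP.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # A deque with maxlen discards from the opposite end on overflow,
        # so appending on the right evicts the oldest item on the left.
        self._items = deque(maxlen=capacity)
        self.dropped = 0  # count of events lost to overflow

    def push(self, item):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the append below evicts the bottom (oldest) item
        self._items.append(item)

    def pop(self):
        """Return the newest item, or None when the buffer is empty."""
        return self._items.pop() if self._items else None

    def __len__(self):
        return len(self._items)


buf = StackBuffer(3)
for event in range(5):
    buf.push(event)
newest = buf.pop()  # events 0 and 1 were dropped; 4 is the newest
```

A double-ended queue makes the bottom drop a constant-time operation for a purely size-based limit. Trimming by age is a different problem, since the oldest entries would need timestamps and a check on each operation.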
A practical approach to growing a system and keeping it healthy in production is to make sure all angles are observable: in the large, and in the small. There’s no generic recipe to tell in advance what is going to be normal or not. You want to keep a lot of data and to look at it from time to time to form an idea about what your system looks like under normal circumstances. The day something goes awry, you will have all these angles you’ve grown to know, and it will be simpler to find what is off and needs fixing.
Maintaining and debugging software never ends. New bugs and confusing behaviours will keep popping up around the place all the time. There would probably be enough stuff out there to fill out dozens of manuals like this one, even when dealing with the cleanest of all systems.
Lots of good advice specific to Erlang, and lots more for anyone writing high-volume, non-stop applications.
WyCash Plus handled running out of memory well by having graduated fallback strategies for any failure. All dynamic allocations were attached to windows. When reports stopped showing numbers, users figured out for themselves that they were asking too much of the system and started closing windows.