I'm in the process of learning Elixir and the observation I'm about to make may be completely off-base. With that said...
Seems like the section on "When to use processes" is selling processes short a bit. Certain kinds of state management would seem to call for processes to either manage or even hold state... but processes (as I understand them) are also key in designing for reliability. So I would think I may well want to organize my code relative to processes as I'm also figuring out the supervision trees and various failure scenarios. And yes, concurrency issues as well. If I'm wrong on this, I'd be happy to be set straight.
Anyway, yes, the section I speak of does get to some of the other parts of what I mention, but the emphasis on state management seems to distort and unbalance the view of what you might want as a process.
That's something I've been a bit worried about with people coming to Elixir thinking it's basically a faster/robust Ruby: are they absorbing the full Erlang Weltanschauung?
a particular philosophy or view of life; the worldview of an individual or group.
--
Personally I came for the faster/robust Ruby first and then over time discovered (and am still discovering) the other powerful pieces. I think that's ok.
You won't have Erlang-style reliability (predictable recovery after localized errors) without links, monitors, and supervisors, which themselves are built with links and monitors.
Plus immutability, plus message passing paradigm (and the associated distribution).
The immutability prevents entire classes of bugs and helps enforce predictability in your code. It's also assumed, at least in part, by message passing (otherwise, what happens if you change a referenced binary after passing it in a message? Does the message change too, or stay the same? Either way, added complexity).
The message-passing paradigm for interprocess communication ensures that the developer is forced to ask "what happens if this isn't received?", i.e., if the process is down or missing. Any sort of 'return', via a message sent back, might never arrive; this is akin to a called function throwing, which again requires the developer to ask "what happens if we never get this response back?" The answer is never to wait indefinitely; it's to... what? Other languages make it very easy to ignore possible exceptions; in Erlang, not only are they likely handled (if you have your supervisor structure in place), but you're forced to decide how such a failure in another process affects the current process (if at all).
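The "never wait indefinitely" point above can be sketched in a few lines; the names here (`ask/2`, `:ping`/`:pong`) are invented for illustration:

```elixir
defmodule Ask do
  # Send a request to another process and handle all three outcomes:
  # a reply, the callee dying, or no answer within the timeout.
  def ask(pid, timeout \\ 1_000) do
    ref = Process.monitor(pid)       # learn if the other process goes down
    send(pid, {:ping, self(), ref})

    receive do
      {:pong, ^ref} ->
        Process.demonitor(ref, [:flush])
        :ok

      {:DOWN, ^ref, :process, _pid, reason} ->
        {:error, {:down, reason}}    # the callee died before replying
    after
      timeout ->
        Process.demonitor(ref, [:flush])
        {:error, :timeout}           # don't wait forever; decide what a
                                     # missing reply means here
    end
  end
end
```

The `receive ... after` clause is what forces the "what if no reply ever comes?" decision to be made explicitly at the call site.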
Because of the message-passing paradigm, the distribution story is simplified; it's largely transparent whether the process you're sending a message to is local or remote. This allows you to build in redundancy across nodes without many of the complexities (and thus, room for errors) that other languages give you, and without the lying abstractions many languages offer (such as RMI).
And having the distribution Erlang does helps with the reliability side of things, as it makes it comparatively straightforward (still complex, but far less so than most other languages) to get solid handling in the event of machine failure.
My understanding is that Agent[1] was designed specifically to keep state. But then again, an Agent is a GenServer is a process, so I guess the post just served as an introduction to the concept of an Elixir process.
Yes. That is correct. I am going to cover Agents, Tasks and OTP in the following articles.
I thought the article would be too big if I covered all three topics at once.
Maybe I'm thinking about this incorrectly but when it comes to web application development and concurrency, the things I would typically want to run in a separate process are very important tasks.
For example, let's say you're sending emails out.
In Rails, Flask, etc. you would typically offload this to a background worker like Sidekiq or Celery. These are dedicated libraries that run in their own process and handle job processing.
Both tools allow you to see if a job succeeded or failed, and they deal with retries. They also use Redis as a back-end, so your state persists outside of the application code.
If you just willy-nilly spawn a process and start sending stuff through it, how do you ensure everything works as expected, and what happens to all of your state if the Erlang VM goes down?
I love the idea of doing this, but in real world practice, it sounds like you would still need the Elixir equiv. of Sidekiq / Celery to handle ensuring these spawned tasks are trackable, right?
Of course you could make such a system in Elixir, and I'd always recommend doing things within the Erlang VM (BEAM) instead of figuring out how to deploy another service (like installing Sidekiq with all its dependencies).
However, processes in Erlang/Elixir are used ubiquitously. Most of the time, they run forever like some sort of service and are restarted when they crash. Sometimes, you won't care about the result.
That's the reason why features such as tracking success/failure in a persistent data store are not core features. You could easily add them to your processes, though; the necessary API is already there.
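As a hedged sketch of "the necessary API is already there": `spawn_monitor/1` alone gives you a completion/failure notification with no external dependencies (the `JobRunner` name is invented; persisting the result somewhere durable is left out):

```elixir
defmodule JobRunner do
  # Run a job in its own process and report whether it exited
  # cleanly or crashed, using the DOWN message from the monitor.
  def run(fun) do
    {pid, ref} = spawn_monitor(fun)

    receive do
      {:DOWN, ^ref, :process, ^pid, :normal} -> :success
      {:DOWN, ^ref, :process, ^pid, reason}  -> {:failure, reason}
    end
  end
end

JobRunner.run(fn -> :ok end)          #=> :success
JobRunner.run(fn -> raise "boom" end) # returns {:failure, reason} carrying the exception
```

From here, writing `:success`/`{:failure, reason}` to a database or a dashboard is ordinary application code rather than a feature you need a separate job system for.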
The alternative is to use message queues that are durable. Only ack the message when you successfully handled it. If your application crashes before the task was completed, the task will be redelivered the next time the application starts.
I definitely wouldn't want to be responsible for writing this code myself. Something like Sidekiq has been worked on for 5+ years by hundreds of people.
Does anything exist in the Elixir world that's as battle hardened and has comparable features to Sidekiq / Celery?
When you're coming from Rails/Sidekiq world a process can seem very fragile because you're usually thinking of it like a thread or a Sidekiq job.
With a Sidekiq job, the jobs are kept in Redis and what's actively being worked on is tracked by Sidekiq in Redis. Even if you switch to a different store like Mongo for job details, you still have to have Redis available just for the progress tracking.
This background job will usually run through and execute a lot of sequential steps. When writing a Sidekiq job you are supposed to build it so that if it fails at some point along the way, the job can run again. Failures will trigger retries or exhaustion failure cases and log the error out to Redis for your Sidekiq dashboard so that you can do something different with it at that point.
The job is running in its own thread, which doesn't have a clean tracking mechanism via Sidekiq, so the only option to retry is re-queuing the job. Your database connection pool has to be increased to match your concurrency level in most cases as well, and the concurrency has to be closely watched to track resource usage by job type across multiple job queues, to ensure that too many heavy jobs don't take over and prevent other things like emails or push notifications from getting out on time.
That's with everything that's built in plus a few extensions.
Now, compare that to how processes work with Elixir/Erlang.
Processes are so small (~0.5KB compared with at least 1MB per thread) that each process is created with a supervisor as a standard mode of operation. Each job has its own personal supervisor that will see if an error happened and immediately restart the job (with backoff). No external dependency, re-queuing, or special tracking needed (although there are plenty of options).
Because the processes are so small, in most cases you won't have a single large background job; you'll have many smaller ones, breaking the entire thing into smaller pieces under their own supervisors, so that at the point of an error only that piece needs to restart.
Worrying about different job sizes or resource usage largely goes out the window as well, because the BEAM's scheduler limits the execution time of every process, ensuring that a single large job can't take over a machine. All of the smaller ones will keep executing just as if the big one weren't running.
Because the entire thing is built for concurrency, you also don't have to worry about the connection-per-job pattern: a connection is checked out to make a query and then checked right back in. Since the scheduler could switch context at any point, letting a single process hold a connection for longer than the database transaction in progress doesn't make sense.
If you really want to use the Sidekiq approach, there is a library for it available (https://github.com/akira/exq). But what you get out of the box with Elixir/BEAM is already pretty far ahead of what Sidekiq gives you.
I'm definitely open to new ways, because the whole Sidekiq approach does work, but it's a very brittle system. Not because it's a bad tool, but because there are just so many moving parts, and a million things that need to happen at many layers for it all to work.
Do you happen to have a blog post or code sample that ties together creating a production-ready "Elixir job"? Out in the wild it's reasonable to expect code to fail for reasons you can't control, and being able to control what happens when it does, or how it retries, is important.
There should be plenty out there. Handling failure is one of the main selling points of the Erlang/OTP ecosystem.
A big part of the difference is that in a typical web application in most other languages, you'll think of your system in two parts: web and background.
With Elixir/Erlang, you tend to think of the web as just one interface to the application itself. The idea of background jobs in those contexts somewhat goes out the window when you have an environment where spinning up millions of processes at the same time is the expectation and not a terrifying scaling scenario.
Think of responding to an incoming web request as just another job, essentially.
Because everything is just message passing and there is no global state, the restart just looks like a function call. The supervisor will start a new process calling the function with the arguments that were passed to it when it was created.
Here's an example that will try to explain things.
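To make that concrete, here's a hedged sketch (the `Counter` module is invented for illustration) showing that a supervisor's "restart" is literally calling `start_link/1` again with the original argument:

```elixir
defmodule Counter do
  use GenServer

  # The supervisor will call this function (with the same argument)
  # every time it starts or restarts the process.
  def start_link(initial) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  def init(initial), do: {:ok, initial}

  def handle_call(:bump, _from, n), do: {:reply, n + 1, n + 1}
end

children = [{Counter, 0}]
Supervisor.start_link(children, strategy: :one_for_one)

# If Counter crashes, the supervisor simply calls Counter.start_link(0)
# again: the restart is just the original function call with its original
# argument, putting the process back in a known-good state.
```

There's no serialized job state to restore; the "known-good state" is whatever the start function produces from its arguments.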
As I said, using a message queue like RabbitMQ is basically the same thing. Granted, you don't get as fancy a UI, but you do get the promise that an enqueued task will eventually be executed.
I don't think so. First, just to make sure there aren't any points of confusion... Erlang/Elixir processes are not OS processes (https://elixir-lang.github.io/getting-started/processes.html). As such, we really can't speak of processes in Erlang/Elixir and other kinds of systems on any sort of equal footing. In Erlang, processes are much lighter weight than OS processes. And you do expect them to go wrong... so Erlang has the idea of a "supervision tree", where there are Supervisor processes whose job it is to monitor other processes and to manage their failure when it occurs. There can be multiple Supervisors which influence one another, or not, as you design them to (thus the "tree" bit). Naturally you plan for these sorts of failures in designing what is/is not a process, what dependencies there may be, what any Supervisor watches, and how the Supervisors relate to one another.
Erlang/Elixir also seems to have one of the strongest availability and concurrency stories out there on its own. Part of this comes from the aforementioned built in assumptions around isolating and managing failed processes, but the other is in terms of the relative ease of starting processes as needed, and accessing the processes of other Erlang VMs (same server or not). It's clear you have to architect and build things correctly to take full advantage of these capabilities, but I think you can get to the place you describe without necessarily taking on a bundle of different applications for different purposes.
For example, for many of the use cases I think of Redis, Erlang offers a number of different ways to deal with such task out of the box: https://blog.codeship.com/elixir-ets-vs-redis/ Again, you have to think about how you architect things, but you can use the same toolbox for much of the work.
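For instance, a minimal ETS sketch (the table and key names are invented), standing in for what you might otherwise reach to Redis for:

```elixir
# ETS ships with the VM: an in-memory key/value store, no external service.
:ets.new(:cache, [:set, :public, :named_table])
:ets.insert(:cache, {:greeting, "hello"})
[{:greeting, value}] = :ets.lookup(:cache, :greeting)
# value is now "hello", read straight out of VM memory
```

It's not a drop-in Redis replacement (no persistence or cross-machine sharing out of the box), but for a lot of caching and shared-lookup use cases it's already in the toolbox.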
I'm still new to Elixir/Erlang, but I've been studying capabilities and architecture for several months now. I think there are sufficient differences in Erlang's approach as compared to other more common stacks that it really does benefit one to come at it fresh and not make very many comparative assumptions. In my own learning I've found that to just get the basic feel for what this Erlang/Elixir animal really is, that the Elixir side is more approachable. The documentation tends to be a little more direct and tends to be more beginner friendly. After I got the very basics down, I was able to look at the Erlang side and understand really what they were trying to say.
Very cool. I'm adding this to my list of easy concurrency tools. The main thing I like is the statelessness. That's where most people screw up parallel programs.
Seems like there are many languages/libraries trying to make concurrency easier to implement in practice. Most notably for C++ (my fav since I have to use it for most of my work projects): Intel TBB (definitely the go-to for most things), and RaftLib (saw it at C++Now last year) is probably the easiest to understand (same theme as this post, super easy concurrency for C++). Even Java seems to make concurrency rather easy with its thread pools and relatively straightforward synchronized sections.
Okay, taking a step back - this isn't just another concurrency tool. This is a language built for concurrency.
Taking another step back - this isn't just a language built for concurrency. This is a language built for -resiliency-.
Taking another step back - this isn't just a language built for resiliency, this is a language built atop a VM built for resiliency -over 30 years ago- (the Erlang VM, aka, the BEAM, which Elixir runs on).
Okay. Why does this matter? Well, the thing is, resiliency encompasses concurrency and distribution, both. It prioritizes error minimization and, more importantly, the ability to recover from errors. This isn't just a try/catch; this is a "something you completely failed to even expect caused things to fail in a way you can't even imagine, and the system still handled it".
It achieves that via immutability, concurrency, and distribution. Ensure your data is immutable, so that state has to be very explicit (it's not stateless...a process has state. But it's very explicit state; as a developer you can't help but handle it and be very aware of it). Ensure bad states are dropped, and the system can recover the execution unit from a good state. If it can't, allow a user defined subsystem to fail, as the intricacies between the entire subsystem are implicitly stateful, and restart the whole thing from a known good state. If even that fails, keep climbing the supervisor tree, restarting larger and larger subsystems, until you restart the entire -application-, assuming that the intricacies across subsystems have gotten into a bad state, and again, restart from a good state.
These principles have been around a long time, but Erlang is one of the first languages to put them into practice, again, over 30 years ago. There's been a lot of time since to see that they actually work, and to further refine them. The difficulty with concurrency is not actually being concurrent (per your post, there are a LOT of ways to implement concurrency); the difficulty is doing it in a way that behaves how you want, even in the face of users doing things you don't anticipate, external resources doing things you don't anticipate, your own code doing things you don't anticipate, etc. The design decisions that went into Erlang focused on minimizing errors... in so doing, it provides a way that most kinds of errors are handled transparently (from logic bugs to actual machine failure), while making it much harder to do things that it can't recover from (memory leaks are comparatively difficult to cause, as are deadlocks, for instance).
To give you an idea, the first commercial product built with Erlang boasted (famously) 9 9s of uptime. Meaning something on the order of ~30ms of downtime a year. That includes planned downtime, visible errors, etc.
I've seen Erlang systems in production...even with a rather critical bug in one, the system just -worked- for -years-, before someone noted an oddity in the logs, dug into it, and went "Oh my God" over how severe the issue was. But, again, Erlang's supervisor process just restarted it, and it was never noticed.
I helped write CNN's current video ingest system in Erlang. It's been working without issue for years, despite no maintenance or attention (to where even most of the developers have left, but it still just... works). Even much less complex Ruby, Java, and JavaScript systems are plagued with constant bugs. I would not say the devs on this project were just that much better (though the process was a little different, with little product owner involvement), but that the language we picked was so much better geared toward fault tolerance.
I recommend taking a look at .Net/C#'s concurrency machinery. It makes it terribly easy to use its thread-pool, and its async/await stuff is very cool. Think TBB tasks, but with very nice sugar.
My understanding is that Java's current concurrency offerings are much like C#'s, but... worse.