Parallel processing with PHP

The problem

PHP is single threaded. No way around it (unless you can bring Facebook's Hack to the table). This also means that almost every command or function in PHP is a blocking operation: the main program (script) always has to wait for the results of e.g. database operations, webservices, command line jobs and many more.
Although this is usually not a problem and actually keeps some complexity out (arguably, I know), one can sometimes feel the impact as a cascade of degrading performance as soon as an external source is a bit busy.

There are multiple ways to optimize. For example, if there is a slow MySQL query, one can use the asynchronous queries offered by the mysqli module together with mysqlnd. This assumes of course that there is something else which can run during that time.
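A minimal sketch of such an asynchronous query (the credentials, the query and doSomethingElse() are placeholders):

$mysqli = new mysqli('localhost', 'user', 'secret', 'shop');
$mysqli->query('SELECT COUNT(*) FROM orders', MYSQLI_ASYNC);

doSomethingElse(); // anything that does not need the query result yet

// block only now, when the result is actually needed
do {
    $links = $errors = $rejected = [$mysqli];
    $ready = mysqli_poll($links, $errors, $rejected, 1); // wait up to 1 second per iteration
} while ($ready === 0);

$result = $mysqli->reap_async_query();
$row = $result->fetch_row();
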
If you are calling an external webservice which can be a bit slow, you can use guzzlehttp/guzzle, which supports asynchronous requests via promises - an approach that boils down to "block as late as possible".
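Again just a sketch, with made-up endpoint URLs:

$client = new \GuzzleHttp\Client();
$promises = [
    'invoices'  => $client->getAsync('https://api.example.com/invoices'),
    'customers' => $client->getAsync('https://api.example.com/customers'),
];

// block as late as possible: both requests are transferred concurrently here
$responses = \GuzzleHttp\Promise\unwrap($promises);
echo $responses['invoices']->getBody();
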
For other tasks, one solution can be to queue jobs and have the user collect the result later. This keeps the main process quite short and has the added benefit of doing the heavy lifting on possibly even a different server and, if the concept of the software allows it, even during idle times - e.g. at night.
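Enqueueing such a job can be as simple as pushing a description of it into a Redis list (the job structure here is of course made up):

$redis = new Redis();
$redis->connect('127.0.0.1');
$redis->lPush('queue', json_encode(['task' => 'generatePdf', 'userId' => 42]));
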
But sometimes you need the result within a few seconds or faster, and within the main process itself, while that process must not take too long either.

Solutions

In this blog post, we're looking solely into how to handle background jobs which we call directly from the parent job and which run in parallel to it. No queues involved - not that queues are bad, they are just another topic and we want to focus. PHP has multiple ways which allow us to spawn a child process and use its output. The manual describes them but lacks a bit of hinting on when and how to use which of them. And there are even more options, which I will describe briefly first.

AJAX

Obviously, loading results asynchronously with one or more subsequent requests which contain only the data can and most likely will speed up the initial request, because it no longer needs to do the processing itself. If your application is focused on the frontend, this is a good way to go; there are many frameworks supporting you, and you can even build SPAs (Single Page Applications) which load only one page and from then on just request the data. However, this uses the frontend to trigger parallel running processes on one or more servers, so it is out of scope for this article.

file_get_contents, fopen or curl of "http://localhost/..."

One could think that what we can do in the frontend, we can also do in the backend, right? So assembling a website by calling a few data endpoints and then combining the results doesn't seem too bad, right?
Well, don't do it. Just don't. First of all, you don't win anything here - the operations are blocking, which means that if you do this e.g. three times in a request, the user waits as long as for four sequential HTTP requests: the three internal ones plus the original one.
But more importantly, you will exhaust the webserver's resources. The webserver and/or the php-fpm pool is configured with a maximum number of children or requests it will process in parallel. Not only will you reach this limit twice as fast with that approach, you will most likely cause a situation close to a deadlock.
Imagine we have set that limit to 10. Now 9 processes are already busy and your new request gets the 10th slot. It then tries to open another request. The webserver will queue that request until one of the workers is free to accept it - more precisely, one of the other nine, because your own request will wait (almost) indefinitely for that moment and of course cannot finish before its child request does.
The more such child-requests your application sends and the higher the load on the webserver, the more processes will hang around waiting for a free slot, until almost all of them are busy doing nothing.
Spawning such requests as curl background processes with the methods described below does not help here either, since you still connect to localhost. I have seen this take whole websites offline more than once over the years. Please don't go there. Not even just a bit.

exec / shell_exec / backtick-operator / passthru / system

All of them take a string as a parameter which is basically the command one would enter on the command line. The differences between these PHP functions are mostly about how they handle the output and the return status of the command. What they have in common is that they are blocking operations by default. Let me explain with a bit of code:

// getting a large file listing
$output = [];
exec("ls -la /var/tmp", $output);

// Program code
// [...]

// template code
$template->assign("filelist", $output);

This will sadly not have the "ls" run in the background while the rest of the application goes about its business and then magically provide the result when we first use it and assign it to a template variable.
The script will actually wait until the full listing has been computed and handed over to the PHP process. It is then available as a list of output lines in the $output variable.

exec in combination with &

I didn't know this "fire-and-forget" solution until a few years ago. If you use exec(), put the & sign at the end of the command, redirect the output to a file or /dev/null and don't pass an array to receive the output of the command, the command will run in the background and the PHP script will continue.

exec("pngcrush --brute background.png > /tmp/out.txt &");

This comes with two drawbacks:
You can't control anything and you can't use the output. Of course you can script a bit around it, look up the process ID and read the content of the file the output is redirected into, but that's a bit cumbersome and error-prone.
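If you do want to script around it, a rough sketch could look like this (the pngcrush call and the file name are just examples; posix_kill() needs the posix extension):

$outFile = '/tmp/pngcrush-' . getmypid() . '.txt';
exec(sprintf('pngcrush --brute background.png > %s 2>&1 & echo $!', $outFile), $lines);
$childPid = (int) $lines[0]; // "echo $!" prints the PID of the background job

// later: is the child still running?
$stillRunning = posix_kill($childPid, 0); // signal 0 only checks whether the process exists

if (!$stillRunning) {
    $crushOutput = file_get_contents($outFile);
}
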
The second drawback is even worse: if the main process is terminated (through an error or just by finishing regularly), the child process is terminated immediately as well. So you either need to sleep() and periodically check that the child process is still around - which artificially prolongs script execution - or offload only so few and such small jobs that they are guaranteed to be finished before the main process is.
You can cheat a bit here by returning the response to the user early and then secretly waiting during script shutdown, but that is still a bit risky. For these cases, a queue and decoupled worker processes might be the better solution.

proc_open

This set of functions allows for a more fine-grained management of child processes. You can actually have the processes run in the background and, at times of your choosing, communicate with them by writing data to them or reading from them - much like working with files. The manual page for proc_open gives a good example of how to use it. Generally it is a good idea to wrap all of this into a class so that the application around it does not need to know about the internals.
When using this to retrieve data from a slow source, be aware that any output the child process generates is written into a buffer which is usually only a few kilobytes in size. As soon as this buffer is full, the child process will halt. That means: if your child process creates output, you have to read from it periodically to keep it running. If you have multiple such processes, it is also advisable to create some logic that looks after all of them and lets the main process run until they are all finished.
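A rough sketch of that pattern - not the library code itself, and generateReport.php is just a stand-in for any slow job:

$descriptors = [
    0 => ['pipe', 'r'], // child's stdin
    1 => ['pipe', 'w'], // child's stdout
    2 => ['pipe', 'w'], // child's stderr
];
$process = proc_open('php generateReport.php', $descriptors, $pipes);
stream_set_blocking($pipes[1], false);

$output = '';
do {
    $output .= stream_get_contents($pipes[1]); // drain the pipe so the child never blocks
    // ... do other work here instead of just sleeping ...
    usleep(10000);
    $status = proc_get_status($process);
} while ($status['running']);
$output .= stream_get_contents($pipes[1]); // collect whatever was left when the child exited

fclose($pipes[0]);
fclose($pipes[1]);
fclose($pipes[2]);
proc_close($process);
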
At my current employer we had these use cases multiple times, so I created a tiny library called "parallel-process-dispatcher" to handle them in a clean and reusable way.
I was thankfully allowed to open source it recently, so you can actually find it on github.com and packagist.org and use it yourself.

fastbill/parallel-process-dispatcher

With this library you can easily cover different use cases, a few of which I want to describe here:

per-user cronjobs

You can split the execution of a cronjob by e.g. user ID and run the parts in parallel instead of going over the payload one by one. With this you can put all the CPUs of the server to work and finish the job approximately $numberOfCpus times faster. This has the added benefit of reducing the impact of memory leaks, because after every processed user the child process exits and frees its memory.

Single-threaded linear approach
foreach (getAllUserIds() as $userId) {
    doCronjobWorkload($userId);
}
Parallel approach
if ($argc === 1) {  // no cli parameter = parent process invocation
    $dispatcher = new Dispatcher(4);
    foreach (getAllUserIds() as $userId) {
        $dispatcher->addProcess(new Process('php cronjob.php ' . $userId));
    }
    $dispatcher->dispatch();
    die();
}

doCronjobWorkload($argv[1]);  // child invocation: process a single user

background workers listening on a queue

When using a queue - for example in Redis - you can have one master process spawn an exact number of workers which either listen on the queue or do work supplied by it. Depending on the workload it is usually advisable to set the maximum number of running workers to the number of CPUs. Start the following script via cron every minute and you will have 4 workers active - or listening on an empty queue, with a short pause of up to a minute after every 1000 spawned workers.

if ($argc === 1) {  // no cli parameter = master process invocation
    if (file_exists('master.lock')) {
        die();
    }
    touch('master.lock');

    $dispatcher = new Dispatcher(4);
    for ($i = 0; $i < 1000; $i++) {
        $dispatcher->addProcess(new Process('php worker.php ' . $i));
    }
    $dispatcher->dispatch();

    unlink('master.lock');
    die();
}

// worker invocation: $redis is assumed to be a connected Redis client;
// block for up to 60 seconds waiting for an item from the queue
$work = $redis->brpoplpush('queue', 'stack-' . $argv[1], 60);
if (false === $work) { // empty queue - after 1 minute of waiting
    die();
}

doWork($work);

heavy load background tasks of web applications or web services

The list of possible sub-tasks for this use case is probably quite long. At FastBill, for example, we mostly move the generation of PDF files and the execution of webhooks into the background so that the rest of the script can continue. Although the results of these jobs matter, there is more for the main script to do, so we wrap them into processes and only block the main process at the moment we need the result.

// early in the program we start the job:
$pdfProcess = new Process('php generatePdf.php ' . $id);
$pdfProcess->start();

// business logic skipped

// later we wait for and collect the result
while (!$pdfProcess->isFinished()) {
    usleep(100000);
}

$result = $pdfProcess->getOutput();

In case of multiple such processes it might be advisable to use a Dispatcher instance to make sure that the server resources are not exhausted.
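For example - only a sketch, assuming that processes added to the dispatcher still expose their output via getOutput() once dispatch() has returned, and fireWebhooks.php is made up:

$pdfProcess = new Process('php generatePdf.php ' . $id);
$webhookProcess = new Process('php fireWebhooks.php ' . $id);

$dispatcher = new Dispatcher(2); // never run more than two children at once
$dispatcher->addProcess($pdfProcess);
$dispatcher->addProcess($webhookProcess);
$dispatcher->dispatch(); // blocks until both are finished

$pdf = $pdfProcess->getOutput();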

OpenSource

So this was the first project I published as open source and it felt good. I should have done it earlier. My thanks go out to my employer FastBill, especially to @roritharr for supporting open source software and allowing my colleagues and me to publish our work, and to my colleague @arnebahlo for walking me through the process and showing me a few shortcuts.
And I can already say that there will be another project out soon; it uses fastbill/parallel-process-dispatcher, so we had to open source that one first. It will be a reverse code coverage tool which can help greatly when refactoring a huge legacy application. Stay tuned...

Holger Segnitz, 2018-05-22