array_unique - as bad as possible

Introduction

A phenomenon I encounter quite often is programmers spending hours reinventing a feature, function or construct of the programming language or framework they use. In PHP - from what I've seen so far - the most common one is string concatenation from array elements:

$string = '';
for ($i = 0; $i < count($array); $i++) {
    $string .= $array[$i];
    if ($i < count($array) - 1) {
        $string .= ';';
    }
}

// instead of

$string = implode(';', $array);

Even though this code is not exactly wrong, the habit can cause a lot of problems further down the road:

Code becomes unnecessarily large and thus less readable. Since code is read many times more often than it is written[citation needed], that is where the real time is wasted - not in writing it. Instead of "turning an array into a string with ';' as the separator", which the name "implode" conveys at a glance, you have to read the whole code block, trace through its steps and only then figure out what it does with the array.

Oftentimes the code inside the loop gets enriched with more code that does jobs for later parts of the program and is then only used on special occasions - but hey, we saved ourselves a loop somewhere else!

$string = '';
$sum = 0;
for ($i = 0; $i < count($array); $i++) {
    $string .= $array[$i];
    $sum += $array[$i];
    if ($i < count($array) - 1) {
        $string .= ';';
    }
}

// instead of

$string = implode(';', $array);
$sum = array_sum($array);

In this category I have found a very nice but also quite large specimen. The following "alternative" to the PHP function "array_unique" or the SQL keyword "DISTINCT" suggests that the developer tried by all means to drive execution time and memory consumption as high as possible for the task. To make it any slower, you would have to add a sleep() somewhere, but that would be cheating! ;)

The following code only illustrates the approach. In the original, the pieces were spread out a lot more, and the variable names carried a bit more information - but not too much. It would be boring if they did...

Here be Dragons!

The ValueObject

First of all, we have a value object with only public members (pre-initialized with values - that's going to be another article) and getters and setters, nothing more.

class ValueObject
{
    var $value1 = '-';
    var $value2 = '00.00.0000';

    /** about 30 more of those */

    function getValue1()
    {
        return $this->value1;
    }

    function setValue1($value)
    {
        $this->value1 = $value;
    }

    /** 31 times more */
}

The Data Class

The "Data" class executes a very long and complicated query: 6 JOINs, then a UNION SELECT with another 7 JOINs. This query hides too much business logic in conditions and string formatting, and when I took over it also lacked the proper indexes and whatnot - but for this article it's not really the cause of the problems. For a result of 20,000 rows it only took about 1 second, so we'll ignore it here. The result is put into a list of instances of the ValueObject class above - in most cases without even calling all the setters.

class DataClass
{
    public function getResult()
    {
        $sql = 'SELECT ............................................ ';

        $result = mysql_query($sql);

        $data = array();
        while ($dbRow = mysql_fetch_row($result)) {
            $row = new ValueObject();
            $row->setValue1($dbRow[0]);
            $row->setValue2($dbRow[1]);
            /** 30 times more */

            $data[] = $row;
        }
        return $data;
    }
}

At this point I should mention what this code is for. The collected data is displayed and used much like in a spreadsheet application such as LibreOffice Calc or Microsoft Excel. The application features pagination, so there are only, let's say, 100 rows per page (it's adjustable), but it still reads and processes the whole result. That happens for a reason: like the spreadsheet applications, our web application also features column filters in the form of a select box that lists all distinct values of the column, alphabetically ordered.
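
To make the goal concrete: a minimal sketch of how such a filter could be rendered, assuming $column1 already holds the distinct values of a column (the variable and the parameter name filter_col1 are made up for illustration):

sort($column1);

echo '<select name="filter_col1">';
echo '<option value="">(all)</option>';
foreach ($column1 as $value) {
    $value = htmlspecialchars($value);
    echo '<option value="' . $value . '">' . $value . '</option>';
}
echo '</select>';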

The Loop in a loop in a loop

To build the lists of values for those column filters, the following function was created. It checks whether an array - passed as the second parameter - contains a certain value - passed as the first parameter. If not, it appends the value to the array; in both cases it returns the (new) array.

function arrayHasValue ($value, $array) {
    $found = false;
    for ($i = 0; $i < count($array); $i++) {
        if ($array[$i] == $value) {
            $found = true;
        }
    }

    if (! $found) {
        $array[] = $value;
    }

    return $array;
}

The Loop outside of the loop in a loop in a loop

And here we see the whole set of aforementioned code in use:

$databaseResult = $dataClass->getResult();

$column1 = array();
$column2 = array();
/** and 30 more */

for ($i = 0; $i < count($databaseResult); $i++) {
    $column1 = arrayHasValue($databaseResult[$i]->getValue1(), $column1);
    $column2 = arrayHasValue($databaseResult[$i]->getValue2(), $column2);
    /** and 30 more */
}

If your database result is just a few rows, the above solution would be just another case of "you can do it that way, but you don't have to". But if you get a result in, let's say, the mid-hundreds, it will most certainly start to affect the performance of your web application. Since the application I found this in had results of up to 30,000 rows, we can do a bit of math to see what actually happens when it has to handle an average result of 20,000 rows with 30 columns.

Analysis

  a) 20,000 instances of the ValueObject class are created
  b) 600,000 setters are called (inside DataClass::getResult())
  c) 600,000 times, a value object is picked out of the big result array by its numerical index
  d) 600,000 getters are called
  e) 1,200,000 times, an ever-growing array is passed by value (= copied) into a function or back as its return value
  f) approx. 5,000,000,000 array accesses via numerical index [1]
  g) approx. 5,000,000,000 string comparisons [1]
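
For b) to e) the arithmetic is straightforward; f) and g) are estimated in footnote [1]:

20,000 rows times 30 columns = 600,000 setter calls, index lookups and getter calls each
600,000 function calls times 2 array copies (argument + return value) = 1,200,000 copies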

Sounds like a lot? Well, it is. For such a result set it took between 30 seconds and two minutes to render the page. Before I created the missing indexes, it took the application more than 10 minutes to render certain views. People actually scheduled their work days around this! But how can we improve it? Let's look at it step by step.

Solution(s)

Caching

Caching wouldn't help for the first visit, and since every user was only allowed to see a different subset of the data, which changed quite frequently, we would also need to either clear the cache frequently or live with slightly stale data. It would help reduce the average load time of the pages, but it wouldn't be the same application in the end.

Temporary Variables

Problem c) could be lowered from 600,000 array index accesses to just 20,000 by adding a single line of code that uses a temporary variable:

for ($i = 0; $i < count($databaseResult); $i++) {
    $row = $databaseResult[$i];
    $column1 = arrayHasValue($row->getValue1(), $column1);
    $column2 = arrayHasValue($row->getValue2(), $column2);
    /** and 30 more */
}

foreach

Problems c) and f) could be eliminated by using foreach loops - in the outer loop as well as inside arrayHasValue. foreach is not only easier to write (and read), in this case it is also a lot faster. The actual problem is the array access via numerical index: every time you write something like $result[1234], PHP has to internally look up that specific entry in the whole array. It does not walk over each element until it finds the right one, but this internal lookup is still far more expensive than simply going over the array one element at a time, as a foreach loop does. If you are interested in more details, I suggest you read up on hash tables.

Imagine a mailman who can only carry one letter at a time from his cart to the mailboxes of an apartment building, instead of taking all the letters for the building in one trip.

foreach ($databaseResult as $row) {
    $column1 = arrayHasValue($row->getValue1(), $column1);
    $column2 = arrayHasValue($row->getValue2(), $column2);
    /** and 30 more */
}
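
If you want to measure the difference yourself, a rough micro-benchmark could look like this (only a sketch - the absolute numbers depend heavily on PHP version and hardware):

$data = range(1, 1000000);

$start = microtime(true);
$sum = 0;
for ($i = 0; $i < count($data); $i++) {
    $sum += $data[$i];          // hash table lookup on every access
}
echo 'for:     ', microtime(true) - $start, "s\n";

$start = microtime(true);
$sum = 0;
foreach ($data as $value) {
    $sum += $value;             // sequential walk, no per-element lookup
}
echo 'foreach: ', microtime(true) - $start, "s\n";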

early exit

Problems f) and g) can be reduced - by how much depends on the actual data - by returning early inside the loop when a match is found, instead of walking over the whole array every time. If we find a match, we won't find another one - and even if we did, it wouldn't matter at all.

function arrayHasValue ($value, $array) {
    foreach ($array as $row) {
        if ($row == $value) {
            return $array;
        }
    }

    $array[] = $value;

    return $array;
}

pass by reference

Problem e) can be reduced by either not encapsulating the code in a function at all, or by passing the array by reference.

In PHP 5 and newer, if you pass a non-object to a function, it is passed by value by default (in PHP 4 this was even the case for objects). That means: as a copy. If you change the content of the variable inside the function, it does not affect the original variable's value after the function call.

"Copy" usually means "twice as much RAM", plus the CPU operations for allocating the necessary memory and copying the variable's content. Since that would make function calls very expensive, PHP comes with a "copy on write" implementation. It essentially means that the variable is treated as a reference as long as the data is only being read, and it is copied as soon as something is changed.

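You can observe this behaviour yourself - a minimal sketch using memory_get_usage() (the exact numbers vary by PHP version):

$a = range(1, 100000);                   // a reasonably large array
$before = memory_get_usage();

$b = $a;                                 // assignment alone does not copy yet
echo memory_get_usage() - $before, "\n"; // close to 0 - still shared

$b[] = 42;                               // the first write triggers the real copy
echo memory_get_usage() - $before, "\n"; // now roughly the size of $a
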
In our case this only helps on the very few occasions where an actual duplicate entry is found. In all other cases the array is modified - one entry is added - so with almost every function call the full array is copied in RAM, only to be discarded a tiny fraction of a second later. This can run more efficiently if we pass the array by reference so that the function can work on the original.

function arrayHasValue ($value, &$array) {
    foreach ($array as $row) {
        if ($row == $value) {
            return;
        }
    }

    $array[] = $value;
}

// [...]

foreach ($databaseResult as $row) {
    arrayHasValue($row->getValue1(), $column1);
    arrayHasValue($row->getValue2(), $column2);
    /** and 30 more */
}

2D array instead of ValueObjects

Problems a), b) and d) can be reduced by keeping the database result as a plain array. I usually advocate typed objects over passing around all kinds of arrays, especially inside business logic, but there it's usually a matter of a few single entities or small lists/rowsets. This, however, is just internal code which in the end spits out the column filter lists.

class DataClass
{
    public function getResult()
    {
        $sql = 'SELECT ............................................ ';

        $result = mysql_query($sql);

        $data = array();
        while ($row = mysql_fetch_assoc($result)) {
            $data[] = $row;
        }
        return $data;
    }
}

Hash table lookups instead of string comparisons in loops

Earlier I explained that hash table lookups are slow and that one should therefore use foreach loops. Well, in the following case we will use them as a speedy alternative to string comparisons in loops, and get rid of the millions of loop iterations and hash table lookups of e), f) and g) in one step. For this we ditch the function arrayHasValue completely and finish the task within just the outer loop.

We use the fact that in an associative array (PHP's name for an ordered hash map) every key can only exist once. So we use each value in a column as a key and set it to some value - in this case true. If we encounter a duplicate entry, it is simply set to true a second time (read: nothing happens). After the loop we thus have a list of unique entries, and we only have to turn the keys back into values, which a single call to array_keys() achieves. This is a lot faster because we don't have to compare every string with all the others.

$databaseResult = $dataClass->getResult();

$column1 = array();
$column2 = array();
/** and 30 more */

foreach ($databaseResult as $row) {
    $column1[$row['col1']] = true;
    $column2[$row['col2']] = true;
    /** and 30 more */
}

$column1 = array_keys($column1);
$column2 = array_keys($column2);
/** and 30 more */
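
One caveat with the key trick: PHP casts array keys, so numeric strings without leading zeros become integers, which changes the type of the values you get back from array_keys(). A quick illustration:

$column = array();
foreach (array('007', '7', 'abc') as $value) {
    $column[$value] = true;  // '7' is cast to the integer key 7, '007' stays a string
}

var_dump(array_keys($column)); // string('007'), int(7), string('abc')

For the column filter lists this is usually harmless, but it can matter if the values are compared strictly later on.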

Architectural Changes

Depending on the application architecture, the size of the data, server settings, available PHP modules or 3rd-party software, there are a few more options. Since they are a bit more complex - but might bring the page load time down to 1-2 seconds - I'll only outline them here. If you want me to go deeper, tell me in the comments; I might write about it in one of the next articles.

array_unique

If you can somehow get each column's values as an array (with duplicates) - e.g. by using PDO and its PDO::FETCH_COLUMN mode, which would require querying the database multiple times with the same query (the query cache can help here, but only if the settings are right) - then there is a PHP function for exactly that: array_unique. It does internally what we built above, only faster - though the extra queries might slow things down again, so this has to be load tested.
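
A minimal sketch of that variant (the DSN, credentials, table and column names are placeholders):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');

// fetch just the one column, then let PHP deduplicate it internally
$stmt = $pdo->query('SELECT col1 FROM my_table');
$column1 = array_unique($stmt->fetchAll(PDO::FETCH_COLUMN));
sort($column1);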

SELECT DISTINCT

If the data and the initial query allow it, one could just fire a very quick query per column to get its unique values straight from the database and let it do the filtering - databases are optimized for exactly such operations ("SELECT DISTINCT" and "GROUP BY" are the keywords to look up). If the database server is on a different machine, the duplicate entries are never transmitted over the network, which might save some time too. If you use mysqli with mysqlnd (since PHP 5.4), you can even send multiple queries asynchronously to the same server or to multiple read slaves at once to speed things up.
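
Sticking with the old mysql extension the article already uses, such a query per column could look like this (table and column names are placeholders):

$result = mysql_query('SELECT DISTINCT col1 FROM my_table ORDER BY col1');

$column1 = array();
while ($row = mysql_fetch_row($result)) {
    $column1[] = $row[0];
}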

Triggers and Temporary Tables or Redis, memcached, APCu

If your business logic is not built into IFs in database queries and doesn't otherwise resemble a big bowl of spaghetti, you can quite easily keep redundant copies of these lists lying around, either in temporary tables, in local RAM (APCu, memcached) or on a Redis server. Whenever something is added to the original table, you add it to the temporary lists as well - doing the duplicate check for just that one value - and whenever something is removed, you check whether it was the last entry of its kind in the original table and, if so, remove it from the list too. Changes require both operations. Within a modern database server you can use triggers to make this happen automatically. If you only have one web application server, you can use local memory with APCu. And if your database is already quite loaded, you might want to give Redis a try - a NoSQL in-memory database which, amongst other types, offers a hash type that is a perfect fit for this use case.
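
As a rough sketch with the phpredis extension (the key name filter:col1 is made up; a Redis set via sAdd()/sMembers() would work just as well as the hash type):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// whenever a row is inserted into the original table:
$redis->hSet('filter:col1', $newValue, 1); // duplicate fields are simply overwritten

// when building the filter list for the page:
$column1 = $redis->hKeys('filter:col1');
sort($column1);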

Footnotes

[1]: This is based on the assumption that 10% of the data is duplicated. Since we can't predict whether the duplicates occur at the start or the end of the array, it remains an estimate:

20,000 rows, reduced by 10% => 18,000,
SUM (1...18,000) = (18,000^2 + 18,000) / 2 = 162,009,000
times 30 columns: 4,860,270,000

The actual counts for f) and g) might be even higher, because the original loop never exits early, even when a match has been found.

Holger Segnitz, 2017-08-17