strpos and the joys of dynamic typing

Dynamic Typing

PHP is one of the languages which interpret the type of the content of a variable the moment it is used. Something unfitting will be converted by a set of rules so that the operation can be executed. So, through this, we can add a string to an integer, if the string is numeric, it will work and result in either an integer or a float. Of course this has limits and pitfalls but in general, if one knows what they are doing, it makes programming a bit easier - arguably of course - because you don't have to convert between types all the time.
However, sometimes it causes really odd and unexpected behaviour and if you're unlucky it's even extra hard to debug - in the following case to be perfectly honest partially because of my own bias, didn't check the right place because I really didn't expect the problem there.

Introduction

strpos

This function returns the position of one string within another. If it doesn't find the supplied string it returns false. Here comes the first pitfall. Loosely compared, false equals 0 - so if you do not use the result in a typesafe manner, finding the supplied string at the very start of the target might be confused with not finding it at all. That's why one should use the === or !== operators which also check the type of two variables or expressions. This is quite obvious and also written quite prominently in the manual.
However, knowing this didn't help for what's about to come...

The Feature

At FastBill we introduced a new functionality for our banking module last year. If our customers link their bank account to our service, we compare attributes of their bank transactions with attributes of their invoices and suggest the best matching one as a possible link and the next best ones as further suggestions. This is then fed into machine learning algorithms to train the machine.
This involves a scoring mechanism where we try find certain pieces of information in an increasingly fuzzy way where a match scores less the fuzzier it is.
For example, if we precisely match the full name of a client, the transaction is way more probable to be the right one compared to only matching the initials of that client. The same goes for dates. A client might write the invoice date into the reference line of the bank transfer, so we search for the invoice date in that field. Since this field can have a lot of information in it, we check for the date in various formats giving less matching points for less common formats which might not even be a date.
Easiest way to do this is to use the date, format it in all possible ways and loop over that list where certain formats give different scores.

class Scoring
{
    /**
     * @param  DateTime $invoiceDate
     * @param  string   $referenceLine
     * @return int                     [0..10]
     */
    public function scoreInvoicedateInReferenceLine(DateTime $invoiceDate, $referenceLine)
    {
        // list is shortened
        $variants = [
            $invoiceDate->format('Y-m-d') => 10,  // 2017-05-08
            $invoiceDate->format('Ymd')   =>  8,  // 20170508
            $invoiceDate->format('m/d/y') =>  6,  // 05/08/17
        ];

        // we loop over the formats and the first hit determines the score, the rest of the targets are ignored.
        foreach ($variants as $format => $points) {
            if (false !== strpos($referenceLine, $format)) {
                return $points;
            }
        }
        return 0;
    }
}

The Bug

Looks simple enough, right? So, writing a unit test to actually check all the possible outcomes behave as expected, should be child's play, right?

class ScoringTest extends PHPUnit_Framework_TestCase
{
    public function targets()
    {
        return [
            'iso'      => ['asdf2017-05-08jkloe', 10],
            'isoShort' => ['asdf20170508jkloe',    8],
            'american' => ['asdf05/08/17jkloe',    6],
        ];
    }

    /**
     * @param string $subject
     * @param int    $expectedScore
     * @dataProvider targets
     */
    public function testScoring($subject, $expectedScore)
    {
        $scoring = new Scoring();
        $this->assertEquals(
            $expectedScore,
            $scoring->scoreInvoicedateInReferenceLine(new DateTime('2017-05-08'), $subject)
        );
    }
}

...and running them should be all good and we can surely continue with the next task....

Failed asserting that 0 matches expected 8.
Expected: 8
Actual  : 0

wait... what!?

The reason

Okay... one of the three tested formats does not behave as expected... happens... so, you start looking for the obvious reasons. A typo in the test code, a typo in the format string etc., but everything is fine there. Manual output of the formats resulted in exactly the expected formats. I quickly tried a few different PHP versions, all behaved the same. I read the manual again. And only then I became aware that strpos has a lesser known behaviour: While most PHP functions would convert an integer to it's base-10 representation in string form when said integer is passed instead of a string, strpos interprets it as the ordinal value of a character and then looks for that character instead. So, for example, if you pass int(97), strpos will look for "a".
But DateTime::format() returns a string and all the formats are strings there, right? Well, yes AND no. Here comes another common pitfall of PHPs dynamic typing. Arrays can be numeric or associative. That means they can be simple lists, usually starting from 0 to 1 and 2 and so forth or they are (ordered) hash tables or maps, like ['a' => 'orange', 'b' => 'apple'].
The array in the scoring function should be clearly the latter, since we use strings as the index, but in our case it is actually a mix of both.

var_dump($variants);

array(3) {
  ["2017-05-08"]=>int(10)
  [20170508]=>int(8)
  ["05/08/17"]=>int(6)
}

When writing to an array, PHP converts a numeric string into an integer when it is used as the key. Well, at least not all numeric strings, only the ones which look like an integer. This usually is not a problem, but I managed to find one of the edge cases here. After coming to this enlightment, I fixed it with casting to string, left a comment in the code why and tried not to cry openly until I was home, knowing I spent some time well wasted... ;)

Holger Segnitz, 2018-07-12