Dynamic Typing
if one knows what they are doing, it makes programming a bit
easier - arguably of course - because you don't have to convert between types all the time.
However, sometimes it causes really odd and unexpected behaviour and if you're unlucky it's even extra hard to debug - in the following case to be perfectly
honest partially because of my own bias, didn't check the right place because I really didn't expect the problem there.
Introduction
strpos
This function returns the position of one string within another. If it doesn't find the supplied string it returns false. Here comes the first
pitfall. Loosely compared, false equals 0 - so if you do not use the result in a typesafe manner, finding the supplied string at the very start of the
target might be confused with not finding it at all. That's why one should use the === or !== operators which also check the type of two variables or
expressions. This is quite obvious and also written quite prominently in the
manual.
However, knowing this didn't help for what's about to come...
The Feature
At FastBill we introduced a new functionality for our banking module last year. If our customers
link their bank account to our service, we compare attributes of their bank transactions with attributes of their invoices and suggest the best matching one
as a possible link and the next best ones as further suggestions. This is then fed into machine learning algorithms to train the machine.
This involves a scoring mechanism where we try find certain pieces of information in an increasingly fuzzy way where a match scores less the fuzzier it
is.
For example, if we precisely match the full name of a client, the transaction is way more probable to be the right one compared to only matching the
initials of that client. The same goes for dates. A client might write the invoice date into the reference line of the bank transfer, so we search for the
invoice date in that field. Since this field can have a lot of information in it, we check for the date in various formats giving less matching points
for less common formats which might not even be a date.
Easiest way to do this is to use the date, format it in all possible ways and loop over that list where certain formats give different scores.
class Scoring { /** * @param DateTime $invoiceDate * @param string $referenceLine * @return int [0..10] */ public function scoreInvoicedateInReferenceLine(DateTime $invoiceDate, $referenceLine) { // list is shortened $variants = [ $invoiceDate->format('Y-m-d') => 10, // 2017-05-08 $invoiceDate->format('Ymd') => 8, // 20170508 $invoiceDate->format('m/d/y') => 6, // 05/08/17 ]; // we loop over the formats and the first hit determines the score, the rest of the targets are ignored. foreach ($variants as $format => $points) { if (false !== strpos($referenceLine, $format)) { return $points; } } return 0; } }
The Bug
Looks simple enough, right? So, writing a unit test to actually check all the possible outcomes behave as expected, should be child's play, right?
class ScoringTest extends PHPUnit_Framework_TestCase { public function targets() { return [ 'iso' => ['asdf2017-05-08jkloe', 10], 'isoShort' => ['asdf20170508jkloe', 8], 'american' => ['asdf05/08/17jkloe', 6], ]; } /** * @param string $subject * @param int $expectedScore * @dataProvider targets */ public function testScoring($subject, $expectedScore) { $scoring = new Scoring(); $this->assertEquals( $expectedScore, $scoring->scoreInvoicedateInReferenceLine(new DateTime('2017-05-08'), $subject) ); } }
...and running them should be all good and we can surely continue with the next task....
Failed asserting that 0 matches expected 8. Expected: 8 Actual : 0
wait... what!?
The reason
Okay... one of the three tested formats does not behave as expected... happens... so, you start looking for the obvious reasons. A typo in the test code, a
typo in the format string etc., but everything is fine there. Manual output of the formats resulted in exactly the expected formats. I quickly tried a few
different PHP versions, all behaved the same. I read the manual again. And only then I became aware that strpos has a lesser known behaviour: While most PHP
functions would convert an integer to it's base-10 representation in string form when said integer is passed instead of a string, strpos interprets it as
the ordinal value of a character and then looks for that character instead. So, for example, if you pass int(97), strpos will look for "a".
But DateTime::format() returns a string and all the formats are strings there, right? Well, yes AND no. Here comes another common pitfall of PHPs dynamic
typing. Arrays can be numeric or associative. That means they can be
simple lists, usually starting from 0 to 1 and 2 and so forth or they are (ordered) hash tables or maps, like ['a' => 'orange', 'b' => 'apple'].
The array in the scoring function should be clearly the latter, since we use strings as the index, but in our case it is actually a mix of both.
var_dump($variants); array(3) { ["2017-05-08"]=>int(10) [20170508]=>int(8) ["05/08/17"]=>int(6) }
When writing to an array, PHP converts a numeric string into an integer when it is used as the key. Well, at least not all numeric strings, only the ones which look like an integer. This usually is not a problem, but I managed to find one of the edge cases here. After coming to this enlightment, I fixed it with casting to string, left a comment in the code why and tried not to cry openly until I was home, knowing I spent some time well wasted... ;)