The magic empty string that is not empty
I just helped one of our developers with a weird problem. The piece of code he was working on contained roughly this:
if ($str!='') { echo "Hello $str"; }
He was importing a CSV file that didn't contain a value for $str, so $str was an empty string. So it should skip the echo statement according to the above code, right?
Wrong. The script echoed "Hello ". The $str!='' check evaluated to true, even though the string looked empty.
In a debugger, we watched the value for $str and watched it step through the code. We clearly saw that while $str was empty (""), it executed the next line. Almost seems like a bug in the != operator but that obviously can't be the case.
This kind of baffled us.
To investigate, we var_dumped the value of $str, and this gave a very weird output:
string(3) ""
An empty string, with a length of 3?
Eventually we noticed there were some weird control characters in there that translate to an empty string in output, but that do have a length. (Somehow this reminds me of black holes and dark matter).
We removed the character from the file. Apparently, in a text editor, you can press 'delete' on it; this has no visible effect, because the character is invisible in a text editor as well, but it did work and we were able to process the file.
Lesson learned: not every empty string is an empty string. Control characters can have very weird effects. They can make a string look empty while it's not.
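A quick way to see what is really inside such a string is to dump its bytes as hex. This is only a sketch: it assumes the three invisible bytes were EF BB BF (a UTF-8 byte order mark), which the debugging session above did not confirm.

```php
<?php
// Assumption: the three invisible bytes are the UTF-8 BOM, EF BB BF.
$str = "\xEF\xBB\xBF";

// The string looks empty when printed, but the comparison sees three bytes.
var_dump($str != '');      // bool(true)

// strlen() counts bytes and bin2hex() shows exactly which ones they are.
echo strlen($str), "\n";   // 3
echo bin2hex($str), "\n";  // efbbbf
```

If the hex dump shows anything at all, the string is not empty, whatever it looks like on screen.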
By the way: the people who created this file did so on a Mac and uploaded it to a Linux server. The weird chars were only at the beginning of the file, only on the first line. Does anybody know if there is some Mac/Linux/Windows conversion that could cause these chars to appear?
Tags: charsets, control characters, PHP, strings



May 14th, 2008 at 1:51 pm
I remember I had this before, characters that are there but not shown. An example would be when $str = “”;
This would give the exact result you got with your example.
May 14th, 2008 at 1:53 pm
And of course input validation b0rked my example. Let’s say $str = “backslash zero backslash zero backslash zero”
May 14th, 2008 at 2:23 pm
Or trim($str) != ''
May 14th, 2008 at 2:24 pm
Sounds like you picked up a UTF-8 file with a BOM (the non exploding kind ;)) - http://unicode.org/faq/utf_bom.html#BOM
May 14th, 2008 at 2:28 pm
Sounds like UTF-8 BOM to me too, although UTF-8 BOM usually shows up whenever you’re displaying it in a non-UTF8 encoding.
From wikipedia:
> The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1
> characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
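If it is a BOM, removing it before any further processing could look roughly like this. The function name strip_utf8_bom is made up for illustration; it only handles the UTF-8 form of the BOM and leaves every other string untouched.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";

    // strncmp() compares only the first three bytes of both strings.
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }

    return $text;
}

var_dump(strip_utf8_bom("\xEF\xBB\xBF") === '');         // bool(true)
var_dump(strip_utf8_bom("\xEF\xBB\xBFname") === 'name'); // bool(true)
```

With the BOM gone, a check like $str != '' behaves as expected again.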
May 14th, 2008 at 2:58 pm
Oh, have I fought with the BOM once in a design where I had a weird margin type of space underneath a div.
ï»¿ is indeed what gave me the headache.
Arno
May 14th, 2008 at 3:04 pm
This is the non-breaking space character in UTF-8 encoding (latin1/Unicode char 160).
May 14th, 2008 at 3:29 pm
Which is why a hex-capable editor should be part of every programmer's toolkit.
It's even more fun when it's been through multi-byte character systems like EBCDIC.
May 14th, 2008 at 3:40 pm
I’ve had a similar problem to this. I noticed it when I copied the output to Notepad and used the arrow keys to move around. On one line, two presses of the right arrow didn’t move the cursor, because it was stepping over invisible characters.
If that makes sense?
May 14th, 2008 at 3:51 pm
I recently ran into a similar problem. I was importing CSV files when I noticed that some of the strings were “strange” in a similar manner to yours. I output some of them and found they were sets of one or two strange characters, but when viewing the CSV in OpenOffice Calc or MS Excel they’re represented as spaces.
I was not able to solve this problem in my code, so I had to catch these baddies using OOCalc. Thankfully, all of these characters were at the beginning of the text in a cell, so when the file is opened with a CSV viewer it can be clearly seen that cells with content like this: ” Vilnius” or ” Kaunas” need to be checked twice.
May 14th, 2008 at 4:34 pm
As others have already suggested, it’s most certainly a BOM.
May 14th, 2008 at 4:35 pm
hi,
if ($str!='')
{
echo "Hello $str";
}
returns
Hello $str
not Hello (what in the var is)
if ($str!='')
{
echo "Hello ".$str;
}
is what u mean^^
Sorry if anyone has said this already.
May 14th, 2008 at 6:43 pm
If you wanted to find out conclusively, just urlencode() the string and check out what its binary representation is.
What you really should be doing is checking the character encoding of incoming strings with the u PCRE flag or iconv.
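Both suggestions can be tried in a few lines. This is a sketch, assuming the mystery string holds the three BOM bytes; is_valid_utf8 is a made-up helper name.

```php
<?php
// Assumption: the mystery string holds the three UTF-8 BOM bytes.
$str = "\xEF\xBB\xBF";

// urlencode() percent-encodes every byte that is not alphanumeric or one of
// "-", "_", ".", which exposes the raw bytes.
$encoded = urlencode($str);
echo $encoded, "\n"; // %EF%BB%BF

// Hypothetical helper: with the u flag, preg_match() rejects malformed UTF-8
// subjects and returns false for them instead of 1.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8($str));   // bool(true): a BOM is well-formed UTF-8
var_dump(is_valid_utf8("\xFF")); // bool(false): a lone FF byte is never valid UTF-8
```

Note that the encoding check alone would not have caught this case, since a BOM is perfectly valid UTF-8; it only tells you whether the bytes are well formed.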
May 14th, 2008 at 8:57 pm
I had a similar problem once. trim() would remove it either. As Derick Rethans says, it was a non-breaking space. I found out that a client had copied and pasted a table from a website into Excel and then saved it as a CSV file. The table contained &nbsp;, which Excel translated to a non-breaking space.
May 14th, 2008 at 8:58 pm
dam it that was supposed to say trim() wouldn’t remove it.
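The reason trim() leaves it alone is that, by default, it only strips a handful of ASCII characters (space, tab, newline, carriage return, NUL and vertical tab). A rough sketch of a wider trim, assuming the input is UTF-8; trim_invisible is a made-up name, and the character list is just the two troublemakers mentioned in this thread.

```php
<?php
// Hypothetical helper: trim ASCII whitespace plus the non-breaking space
// (U+00A0) and the BOM / zero-width no-break space (U+FEFF) from both ends.
function trim_invisible(string $s): string
{
    $result = preg_replace('/^[\s\x{00A0}\x{FEFF}]+|[\s\x{00A0}\x{FEFF}]+$/u', '', $s);

    // preg_replace() returns null on error, e.g. when $s is not valid UTF-8;
    // in that case hand the input back unchanged.
    return $result === null ? $s : $result;
}

$cell = "\xC2\xA0Vilnius "; // C2 A0 is the UTF-8 encoding of U+00A0

var_dump(trim($cell) === "\xC2\xA0Vilnius");  // bool(true): the NBSP survives trim()
var_dump(trim_invisible($cell) === 'Vilnius'); // bool(true)
```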
May 14th, 2008 at 11:18 pm
Thanks for all the input!
@picco: you’re wrong; it does output “Hello ”, because PHP interpolates variables inside double-quoted strings. But that wasn’t the point of the message.
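For anyone who wants to check the quoting behaviour themselves, here is a two-line comparison (the value 'world' is just an example):

```php
<?php
$str = 'world';

// Double quotes: the variable is interpolated.
$double = "Hello $str";
// Single quotes: the text is taken literally.
$single = 'Hello $str';

echo $double, "\n"; // Hello world
echo $single, "\n"; // Hello $str
```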
May 15th, 2008 at 6:30 am
You might want to check for strlen(…) == 0 instead; this should give you the expected result even when the string contains non-printable characters (and Unicode even has whitespace that has no length, because some Asian languages do not leave space between the individual words).
Stefan
May 15th, 2008 at 10:00 am
Yeah, I’ve had this a lot, as I do a lot of importing from external CSVs. It invariably is a BOM. If I’m in BBEdit, you can switch to UTF8-NO_BOM mode, and it also has a handy ‘Zap Gremlins’ function, which gets rid of this kind of thing.
It’s on my list to work out a way of better dealing with this in the upload.
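One possible way to handle it at import time is to peek at the first three bytes of the uploaded file and skip them if they are a BOM, before handing the stream to fgetcsv(). This is a sketch only: read_csv_rows is a made-up name, the delimiter and quote settings are assumptions, and it expects a seekable stream such as a regular file.

```php
<?php
// Hypothetical helper: read all CSV rows from an open, seekable stream,
// skipping a leading UTF-8 BOM if there is one.
function read_csv_rows($handle): array
{
    // Read the first three bytes; if they are not a BOM, go back to the start.
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        rewind($handle);
    }

    $rows = [];
    // Assumed format: comma-delimited, double-quote enclosure, backslash escape.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }

    return $rows;
}

// Usage with an in-memory stream standing in for the uploaded file.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "\xEF\xBB\xBFname,city\nIvo,Vilnius\n");
rewind($handle);

$rows = read_csv_rows($handle);
fclose($handle);

var_dump($rows[0][0] === 'name'); // bool(true): no BOM stuck to the first cell
```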
May 15th, 2008 at 1:31 pm
Hi Ivo,
As others have indicated, this is the UTF-8 BOM (byte order mark). Here’s how you can reproduce the above:
<?php
header('Content-Type: text/html; charset=utf-8');
$string = chr(0xEF) . chr(0xBB) . chr(0xBF);
var_dump($string);