The magic empty string that is not empty

May 14th, 2008 by Ivo

I just helped one of our developers with a weird problem. The piece of code he was working on contained roughly this:

 
if ($str!='')
{
   echo "Hello $str";
}
 

He was importing a CSV file that didn't contain a value for $str, so $str was an empty string. So it should skip the echo statement according to the above code, right?

Wrong. The script echoed "Hello ". The $str!='' check evaluated to true, even though $str looked empty.

In a debugger, we watched the value for $str and watched it step through the code. We clearly saw that while $str was empty (""), it executed the next line. Almost seems like a bug in the != operator but that obviously can't be the case.

This kind of baffled us.

To investigate, we var_dumped the value of $str, and this gave a very weird output:

 
   string(3) ""
 

An empty string, with a length of 3?

Eventually we noticed that the string contained some weird control characters that render as nothing in output but still count toward its length. (Somehow this reminds me of black holes and dark matter.)
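One way to see what such a string really contains is to dump its bytes as hex. The sketch below assumes, purely for illustration, that the three invisible bytes are EF BB BF; the bytes in the actual file may have been different.

```php
<?php
// Hypothetical input: three bytes that many editors and terminals render as nothing.
$str = "\xEF\xBB\xBF";

// The comparison from the article: true, because the string is not empty.
var_dump($str != '');

// strlen() counts bytes, not visible characters, so it reports 3.
var_dump(strlen($str));

// bin2hex() makes the invisible bytes visible.
echo bin2hex($str), "\n";
```

With this input the script prints bool(true), int(3) and efbbbf, which matches the string(3) "" we saw in var_dump.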

We removed the characters from the file. Apparently you can press Delete on them in a text editor: this had no visible effect, since they are invisible there as well, but it did work and we were able to process the file.

Lesson learned: not every empty-looking string is an empty string. Control characters can have very weird effects; they can make a string look empty when it is not.

By the way: the people who created this file did so on a Mac and uploaded it to a Linux server. The weird chars were only at the beginning of the file, on the first line. Does anybody know if there is some Mac/Linux/Windows conversion that could cause these chars to appear?

19 Responses to “The magic empty string that is not empty”

  1. May 14, 2008 at 1:51 pm, Felix said:

    I remember I had this before, characters that are there but not shown. An example would be when $str = “”;
    This would give the exact result you got with your example.

  2. May 14, 2008 at 1:53 pm, Felix said:

    And of course input validation b0rked my example. Let’s say $str = “backslash zero backslash zero backslash zero” ;-)

  3. May 14, 2008 at 2:23 pm, Mike said:

    Or trim($str) != ''

  4. May 14, 2008 at 2:24 pm, Daan Broekhof said:

    Sounds like you picked up a UTF-8 file with a BOM (the non exploding kind ;) ) – http://unicode.org/faq/utf_bom.html#BOM

  5. May 14, 2008 at 2:28 pm, Mathieu Kooiman said:

    Sounds like UTF-8 BOM to me too, although UTF-8 BOM usually shows up whenever you’re displaying it in a non-UTF8 encoding.

    From wikipedia:
    > The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1
    > characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.

  6. May 14, 2008 at 2:58 pm, Arno Zijlstra said:

    Oh, have I fought with the BOM once in a design where I had a weird margin type of space underneath a div.

    ï»¿ is indeed what gave me the headache.

    Arno

  7. May 14, 2008 at 3:04 pm, Derick said:

    This is the non-breaking space character in UTF-8 encoding (latin1/Unicode char 160).

  8. May 14, 2008 at 3:29 pm, Dave said:

    Which is why a hex-capable editor should be part of every programmer's toolkit.

    It's even more fun when it's been through multi-byte character systems like EBCDIC.

  9. May 14, 2008 at 3:40 pm, Dougal said:

    I’ve had a similar problem to this. I noticed it when I copied the output to Notepad and used the arrow keys to move around. On one line, two presses of the right arrow didn’t move the cursor, because it was stepping over invisible characters.

    if that makes sense?

  10. May 14, 2008 at 3:51 pm, Sepa said:

    I’ve recently run into a similar problem. I was importing CSV files when I noticed that some of the strings were “strange” in a similar manner to yours. I output some of them and found that they contain one or two strange characters, but when viewing the CSV in OpenOffice Calc or MS Excel they’re displayed as spaces.

    I was not able to solve this problem in my code, so I had to catch these baddies :) using OOCalc. Thankfully, all of these characters were at the beginning of the text in a cell, so when the file is opened in a CSV viewer it can be clearly seen that cells with content like ” Vilnius” or ” Kaunas” need to be checked twice.

  11. May 14, 2008 at 4:34 pm, Lorenzo Alberton said:

    As others have already suggested, it’s most certainly a BOM.

  12. May 14, 2008 at 4:35 pm, Picco said:

    hi,
    if ($str!='')
    {
    echo "Hello $str";
    }
    returns
    Hello $str
    not Hello (what in the var is)

    if ($str!='')
    {
    echo "Hello ".$str;
    }
    is what u mean^^
    me sorry if anyone has said this already

  13. May 14, 2008 at 6:43 pm, Edward Z. Yang said:

    If you wanted to find out conclusively, just urlencode() the string and check out what its binary representation is.

    What you really should be doing is checking the character encoding of incoming strings with the u PCRE flag or iconv.
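As a rough sketch of both suggestions: urlencode() exposes the raw bytes, and preg_match() with the u flag reports whether a string is well-formed UTF-8. The sample value and the helper name isValidUtf8 are assumptions made for this example. Note that a BOM is itself valid UTF-8, so a validity check alone will not flag it.

```php
<?php
// Hypothetical value: the three bytes EF BB BF and nothing else.
$str = "\xEF\xBB\xBF";

// urlencode() percent-encodes these bytes, so they show up as %EF%BB%BF.
$encoded = urlencode($str);
echo $encoded, "\n";

// Hypothetical helper: an empty pattern with the u flag matches (returns 1)
// only if the subject is well-formed UTF-8; on malformed input preg_match()
// returns false.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(isValidUtf8($str));   // the BOM bytes are well-formed UTF-8
var_dump(isValidUtf8("\xFF")); // a lone FF byte is not
```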

  14. May 14, 2008 at 8:57 pm, James Dempster said:

    I had a similar problem once. trim() would remove it either. As Derick Rethans says, it was a non-breaking space. I found out that a client had copied and pasted a table from a website into Excel and then saved it as a CSV file. The table contained &nbsp;, which Excel translated to a non-breaking space.

  15. May 14, 2008 at 8:58 pm, James Dempster said:

    dam it that was supposed to say trim() wouldn’t remove it.
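A small sketch of why trim() does not help here: by default it strips only space, tab, newline, carriage return, NUL and vertical tab. The example assumes the data is UTF-8, where a non-breaking space is the two bytes C2 A0; trimUnicode is a hypothetical helper, not a PHP built-in, and a file saved in another encoding would need converting first.

```php
<?php
// Hypothetical cell value: a UTF-8 non-breaking space (C2 A0) before the text.
$cell = "\xC2\xA0Vilnius";

// trim() leaves the non-breaking space in place.
var_dump(trim($cell) === $cell);

// Hypothetical helper: strip whitespace, non-breaking spaces (U+00A0) and
// BOM characters (U+FEFF) from both ends of a UTF-8 string.
function trimUnicode(string $s): string
{
    $result = preg_replace('/^[\s\x{00A0}\x{FEFF}]+|[\s\x{00A0}\x{FEFF}]+$/u', '', $s);

    // preg_replace() returns null if $s is not valid UTF-8; keep the input then.
    return $result ?? $s;
}

var_dump(trimUnicode($cell));
```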

  16. May 14, 2008 at 11:18 pm, Ivo said:

    Thanks for all the input!

    @picco: you’re wrong; it does output "Hello ", because PHP interpolates variables inside double-quoted strings. But that wasn’t the point of the message. :)

  17. May 15, 2008 at 6:30 am, Stefan Priebsch said:

    You might want to check for strlen(…) == 0 instead, this should give you the expected result even when the string contains non-printable characters (and Unicode even has whitespace that has no length, because in some Asian languages they do not leave space between the individual words).

    Stefan

  18. May 15, 2008 at 10:00 am, fantata said:

    yeah, i’ve had this a lot as i do a lot of importing from external csv’s. It invariably is a BOM. If I’m in BBEdit, you can switch to UTF8-NO_BOM mode, and it also has a handy ‘Zap Gremlins’ function, which gets rid of this kind of thing.

    It’s on my list to work out a way of better dealing with this in the upload.
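One possible way to handle this at upload time is to drop a leading UTF-8 BOM before parsing the first line. This is a minimal sketch under that assumption; stripUtf8Bom and the sample line are made up for the example, and other invisible characters would need their own handling.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 BOM (bytes EF BB BF) if present.
function stripUtf8Bom(string $text): string
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// Simulated first line of an uploaded CSV: the first field holds only the BOM.
$firstLine = "\xEF\xBB\xBF,Ivo\n";

// Remove the line ending and the BOM, then parse the line as CSV.
$fields = str_getcsv(stripUtf8Bom(rtrim($firstLine, "\r\n")), ',', '"', '\\');

// The first field is now a genuinely empty string.
var_dump($fields[0] === '');
```

Only the first line of a file needs this treatment, since a BOM appears only at the very start.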

  19. May 15, 2008 at 1:31 pm, Andries Seutens said:

    Hi Ivo,

    As others have indicated, this is the UTF-8 BOM (byte order mark). Here’s how you can reproduce the above:

    <?php

    header('Content-Type: text/html; charset=utf-8');

    $string = chr(0xEF) . chr(0xBB) . chr(0xBF);
    var_dump($string);