Sharovatov’s Weblog

PHP loadHTMLFile and an HTML file without DOCTYPE

Posted in php, web-development by sharovatov on 1 November 2009

I just noticed that when you parse an HTML file with DOMDocument’s loadHTMLFile method and the file has no DOCTYPE, PHP loads it without any complaint, yet getElementById finds nothing, as if the DOM document were empty.

Just try saving the following in a test.html file:

<html><body><div id="toc">wtf</div></body></html>

And then run the following php code:

$doc = new DOMDocument();
if ($doc->loadHTMLFile('test.html')) {
  echo 'loadHTMLFile was successfully executed<br>';
  $toc = $doc->getElementById('toc');
  echo 'now trying to var_dump the $toc:<br>';
  var_dump($toc);
}

The var_dump call prints NULL, as if getElementById couldn’t find the node.

Interesting?

Citing php.net,

The function parses the HTML document in the file named filename. Unlike loading XML, HTML does not have to be well-formed to load.

Does this imply that the DOCTYPE may be omitted? I think so. But then the code above wouldn’t dump NULL for $toc. Unfortunately, my experiment shows that a DOCTYPE is required; even an HTML5-style <!DOCTYPE html> will do the job.

But why on earth doesn’t loadHTMLFile throw a warning or at least return false as it should according to the documentation? Nobody knows.

So if you notice that your DOM-based php script acts in a weird way, check if you have a DOCTYPE defined on the HTML document you’re trying to parse.
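
If adding a DOCTYPE to the file isn’t an option, one possible workaround is to look the element up by its id attribute through XPath, which doesn’t depend on the parser knowing which attributes count as IDs. This is only a minimal sketch, assuming PHP 7.1 or later; findById is a made-up helper name, not part of the DOM extension, and the markup is loaded from a string with loadHTML just to keep the example self-contained:

```php
<?php
// Hypothetical helper: find the first element whose id attribute equals $id.
// It queries the attribute directly instead of relying on getElementById.
function findById(DOMDocument $doc, string $id): ?DOMElement
{
    $xpath = new DOMXPath($doc);
    // Select every element carrying an id attribute and compare in PHP,
    // so $id never has to be quoted inside the XPath expression.
    foreach ($xpath->query('//*[@id]') as $node) {
        if ($node instanceof DOMElement && $node->getAttribute('id') === $id) {
            return $node;
        }
    }
    return null;
}

// Same markup as test.html above, without a DOCTYPE.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="toc">wtf</div></body></html>');

$toc = findById($doc, 'toc');
echo $toc === null ? "not found\n" : $toc->nodeValue . "\n";
```

The helper walks every element that has an id attribute, so it is slower than a real ID lookup on large documents, but it behaves the same whether or not the file declares a DOCTYPE.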

Hope this saves someone some time.

P.S. More bugs to come: if you have an HTML file saved as UTF-8 with a BOM, loadHTMLFile will raise the following E_WARNING:

Warning: DOMDocument::loadHTMLFile() [function.DOMDocument-loadHTMLFile]: Misplaced DOCTYPE declaration in test-BOM.html, line: 1 in /home/test/www/test-DOMDocument.php on line 3

Remove the BOM and everything works fine. Apparently, loadHTMLFile doesn’t know that a BOM usually indicates that the document is saved in UTF-8, UTF-16 or UTF-32. Weird.

P.P.S. Another issue. Try pointing loadHTMLFile at an HTML document saved in UTF-8 that contains some international characters (Russian words, in my case). Then get a node with international characters and do echo $node->nodeValue. Are you getting corrupted characters? I was, even though the whole project is in UTF-8 and every single file is saved in UTF-8.

I added <meta http-equiv="Content-type" content="text/html;charset=utf-8" /> to the head section, and the characters started showing in the correct encoding, but the following warning appeared:

Warning: DOMDocument::loadHTMLFile() [function.DOMDocument-loadHTMLFile]: Input is not proper UTF-8, indicate encoding ! in /home/test/www/test-russian.html, line: 65 in /home/test/www/test-DOMDocument.php on line 29

The only way I found to properly get rid of this warning was to add

<?xml version="1.0" encoding="UTF-8"?>

as the first line of the HTML document. After that it finally worked without any warnings or issues. Awesome. An XML declaration must be present for loadHTMLFile to handle UTF-8 properly. Way too buggy to use.
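
If you can’t edit the HTML files themselves, the two fixes from the postscripts (dropping the BOM and putting the XML declaration on the first line) can be applied in memory before parsing. The following is a rough sketch under a few assumptions: prepareUtf8Html is a hypothetical helper name, the input is assumed to be UTF-8 already, and whether the declaration actually silences the encoding warning depends on your PHP and libxml2 versions, so test it against your own setup:

```php
<?php
// Hypothetical helper: prepare raw UTF-8 HTML for DOMDocument::loadHTML().
function prepareUtf8Html(string $raw): string
{
    // Drop a leading UTF-8 byte order mark (bytes EF BB BF) if present.
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        $raw = substr($raw, 3);
    }
    // Put the XML declaration from the post first, unless one is already there.
    if (strncmp($raw, '<?xml', 5) !== 0) {
        $raw = '<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $raw;
    }
    return $raw;
}

// Example input: a BOM followed by a document with Russian text.
$raw = "\xEF\xBB\xBF"
     . '<!DOCTYPE html><html><body><p>Привет</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML(prepareUtf8Html($raw));
echo $doc->getElementsByTagName('p')->item(0)->nodeValue, "\n";
```

For a file on disk you would read it with file_get_contents, pass the contents through the helper and call loadHTML on the result, instead of calling loadHTMLFile directly.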


5 Responses

  1. patgod said, on 2 November 2009 at 6:50 am

    Thanks

  2. selau said, on 20 December 2009 at 3:46 am

    Agreed, I have been using DOMDocument for 1 year.
    It’s way too buggy to use.
    Unpredictable behaviour.

    Considering moving on to Rails

  3. Markus said, on 20 March 2011 at 9:00 am

    Old article – still fantastic! Thanks for pointing out the BOM issue. I found the function below on the web; it helped me use DOMDocument::loadHTML() without the “Misplaced DOCTYPE declaration” warning.

    function removeBOM($str = "") {
        // Strip a leading UTF-8 byte order mark (bytes EF BB BF), if present.
        if (substr($str, 0, 3) == pack("CCC", 0xef, 0xbb, 0xbf)) {
            $str = substr($str, 3);
        }
        return $str;
    }

    Strange the BOM issue wasn’t mentioned on php.net!

    Reference

    http://www.codingforums.com/showthread.php?t=129314

  4. lazos said, on 2 December 2011 at 4:37 am

    $doc = new DOMDocument();
    if ($doc->loadHTMLFile('test.html')) echo $doc->saveHTML();

  5. outis said, on 7 January 2012 at 3:55 am

    In the first example, the document isn’t empty. DOMDocument->getElementById simply can’t return the element you expect. This is documented in the manual page for getElementById:

    For this function to work, you will need either to set some ID attributes with DOMElement::setIdAttribute or a DTD which defines an attribute to be of type ID. In the later case, you will need to validate your document with DOMDocument::validate or DOMDocument->validateOnParse before using this function.

