XPath queries on imported HTML don't work?

Post by **panter** » Sun Jan 23, 2005 4:53 pm

Hi there,

On a newly installed oxygen XML 5.1 on a PowerBook I try the following:

1) Import some HTML: "File -> Import HTML...", enter "http://www.google.com" and select "XHTML 1.0 Transitional". This will load the Google homepage and transform it to XHTML. The resulting document is a well-formed XHTML document. (But it does not quite validate - it complains about some attributes, but never mind...)

2) Using the XPath 1.0 text box on the top right of the editor window, try to find some portion of the document.

My problem is, no matter what expression I search with, the result is always a popup reporting that "The XPath query returned no results". Even if I search for elements that obviously are in the document, such as "//html", "//body", "//table", etc.

My question is, why don't the XPath queries find anything?

Regards,

panter

Post by **george** » Sun Jan 23, 2005 5:13 pm

Hi,

This is because XPath 1.0 expressions have no notion of a default namespace. When you write //table, the expression looks for a table element in no namespace, while the table elements in your document are all in the http://www.w3.org/1999/xhtml namespace. How is that so? Because the XHTML DTD declares a fixed xmlns attribute with that value. I admit this is difficult to see; here is the fragment from the DTD:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd


<!ELEMENT html (head, body)>
<!ATTLIST html
  %i18n;
  id          ID             #IMPLIED
  xmlns       %URI;          #FIXED 'http://www.w3.org/1999/xhtml'
  >

oXygen determines the default namespace and maps it to the first available prefix from {default, default1, default2, etc.}.

In your case, if you use //default:table you should get all the table elements from the document.

Also if you remove the DTD then you will get all the elements in no namespace and you should be able to use //table.
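The same behaviour can be reproduced outside oXygen. The following is only a rough sketch using Python's standard `xml.etree.ElementTree`, which supports a subset of XPath; it is not how oXygen evaluates queries. It assumes the namespace is written directly on the `html` element (ElementTree does not read the `#FIXED` default from the DTD), and the sample documents and the `count_matches` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumption: the namespace is declared explicitly here. In the thread it
# comes from the DTD's #FIXED xmlns attribute, which ElementTree ignores.
namespaced_doc = (
    f'<html xmlns="{XHTML_NS}"><body><table/><table/></body></html>'
)

# Same structure with no namespace, as after removing the DOCTYPE.
plain_doc = "<html><body><table/><table/></body></html>"


def count_matches(xml_text, path, namespaces=None):
    """Parse xml_text and return how many elements match path."""
    root = ET.fromstring(xml_text)
    return len(root.findall(path, namespaces))


# "default" is just the prefix chosen for this query; any prefix bound to
# the XHTML namespace URI would work the same way.
ns = {"default": XHTML_NS}

print(count_matches(namespaced_doc, ".//table"))              # 0
print(count_matches(namespaced_doc, ".//default:table", ns))  # 2
print(count_matches(plain_doc, ".//table"))                   # 2
```

The unprefixed path finds nothing in the namespaced document because it only matches elements in no namespace; binding a prefix to the XHTML namespace URI makes the same query succeed.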

Best Regards,
George

Post by **panter** » Sun Jan 23, 2005 6:55 pm

Hi George,

Thanks for the swift reply. I hadn't considered the namespace issues...

You're right, using "//default:XXX" allows me to find elements with name XXX in the document.

However, if I remove the DTD, as you suggested (the "<!DOCTYPE ...>" declaration) and then try to search for "//body" I get an error. It complains that my XPath expression is invalid (???) because an entity ("nbsp" in my case) is referenced in the document but not declared. What's the reason for this odd message? Why does my XPath expression become invalid when there's an undeclared entity in the document?

Post by **george** » Mon Jan 24, 2005 11:56 am

Hi,

If a document is not well-formed, then it is not XML. A document with undeclared entities is not well-formed. XPath can be applied only to XML documents, that is, documents that pass the well-formedness check.
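To illustrate the point with a small sketch (again using Python's standard `xml.etree.ElementTree` rather than oXygen, with invented sample documents and a hypothetical `is_well_formed` helper): a document that references `nbsp` without any declaration fails at parse time, before any XPath expression could be evaluated. Replacing the entity with the numeric character reference `&#160;` is one way to make such a document parse without the DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# No DOCTYPE, so the nbsp entity is referenced but never declared.
broken = "<html><body><p>a&nbsp;b</p></body></html>"

# Same content using a numeric character reference, which needs no DTD.
fixed = "<html><body><p>a&#160;b</p></body></html>"

for name, doc in (("broken", broken), ("fixed", fixed)):
    if is_well_formed(doc):
        # Only now does it make sense to run a query against the document.
        hits = ET.fromstring(doc).findall(".//p")
        print(name, "is well-formed,", len(hits), "match(es) for .//p")
    else:
        print(name, "is not well-formed, so no query can run")
```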

I agree that the error message starts with the wrong wording: it is not the XPath expression that is invalid but the document. The rest of the message clarifies that. Anyway, I filed a Bugzilla entry for this so we can give a better message when the problem is in the document.

Best Regards,
George