Native pdf -> XML -> Excel . How to map an xml file

Post by **Pippo3** » Mon Feb 18, 2019 11:31 am

Hello,

I’m trying to use XML to convert a PDF data file into an Excel file. I don’t know if this is the right path, so please feel free to tell me if it’s not.

The problem
I have a data-rich PDF, full of tables and numbers. The XML file I can generate automatically from it is extremely accurate. In fact, I wouldn’t be surprised if the PDF itself had been generated from an XML database. I have no control over the database, though, so I can only work from the PDF.

So let’s say, I have page one with this set of data:

Lots of text
Category
Date
A table with 8 useless columns, and just two relevant ones.

I’d like to isolate only four values: the category, the date, and the two relevant cells in the table, and map them into an Excel file.

The problem is, the following page might have the category in the same place, the date also, but the table would have 20 columns, and I still need only two values out of it.

The two values I need are clearly “mappable” given the combination of row and column headers, but they are not identifiable given the combination of row and column number (so, it’s always the “total” row in the “Pieces” column, but sometimes this can be cell C3, sometimes G5, etc).

If this were HTML, I could very easily map the fields with a scraper and get the work done. The result would be a CSV that I can easily import into Excel, with all the data I need.

Is it a good idea to try to get the work done with XML? Is the task clear at all?

Please let me know if you have some hint, or some keyword I can use for further research on the topic.

Thanks

Nick

Post by **adrian** » Tue Feb 19, 2019 2:58 pm

Hi,

With what tool do you make the conversion from PDF to XML (or other format)?

The two values I need are clearly “mappable” given the combination of row and column headers, but they are not identifiable given the combination of row and column number (so, it’s always the “total” row in the “Pieces” column, but sometimes this can be cell C3, sometimes G5, etc).

If you already have the means to convert everything into XML, just do that, grab the whole table into XML and use the same rule within the XML to identify what data you actually need.

Regards,
Adrian

Post by **Pippo3** » Fri Feb 22, 2019 11:13 pm

Dear Adrian,

I'm using Adobe Acrobat X to do the conversion. As I previously wrote, the conversion is perfect and the resulting XML is flawless.

Could you please elaborate on this?

use the same rule within the XML to identify what data you actually need

Is there a way to do that programmatically? I mean, extract the values and get them into a table.

With which tool?

Thanks

Nick

Post by **adrian** » Wed Feb 27, 2019 12:51 pm

Hi,

After obtaining the document in XML format you can use various XML tools to inspect the document. Note that you're on the Oxygen XML Editor forum, so you can take a wild guess what tool you could use.

I quoted the "rule" that you used to determine what cell you need.

it’s always the “total” row in the “Pieces” column, but sometimes this can be cell C3, sometimes G5, etc

Inspect the XML and see if an XML rule (e.g. XPath selector) can be composed to identify this cell.

I'm not sure what the table in your converted XML looks like, so it's difficult to provide an XPath example. Solutions vary a lot depending on how the table is defined in the XML document. But you should be able to write an XPath expression that pinpoints your cell.
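As a rough illustration of the "total row, Pieces column" rule, here is a minimal Python sketch using only the standard library. Everything about the XML layout is an assumption: the element names (`Document`, `Category`, `Date`, `Table`, `TR`, `TH`, `TD`), the sample values, and the helper `cell_by_headers` are hypothetical and would need adjusting to whatever the converted file actually contains. The idea is to find the column index from the header row and the row from its first cell, so the cell's position (C3, G5, ...) does not matter.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the real export may use different element names.
SAMPLE = """
<Document>
  <Category>Widgets</Category>
  <Date>2019-02-18</Date>
  <Table>
    <TR><TH>Item</TH><TH>Weight</TH><TH>Pieces</TH></TR>
    <TR><TD>Bolts</TD><TD>1.5</TD><TD>40</TD></TR>
    <TR><TD>Nuts</TD><TD>0.7</TD><TD>60</TD></TR>
    <TR><TD>total</TD><TD>2.2</TD><TD>100</TD></TR>
  </Table>
</Document>
"""


def cell_text(cell):
    """Return all text inside a cell, including nested elements, trimmed."""
    return "".join(cell.itertext()).strip()


def cell_by_headers(table, row_label, col_label):
    """Find a cell by its row label (first cell) and column header.

    Assumes the first TR holds the column headers. Returns None if
    either label is missing.
    """
    rows = table.findall("TR")
    if not rows:
        return None
    header = [cell_text(c) for c in rows[0]]
    if col_label not in header:
        return None
    col = header.index(col_label)
    for row in rows[1:]:
        cells = [cell_text(c) for c in row]
        if cells and cells[0] == row_label and col < len(cells):
            return cells[col]
    return None


# For this assumed layout, a roughly equivalent XPath 1.0 expression is:
#   //Table/TR[TD[1]='total']
#     /TD[count(ancestor::Table/TR[1]/TH[.='Pieces']/preceding-sibling::TH) + 1]
root = ET.fromstring(SAMPLE)
table = root.find(".//Table")
pieces_total = cell_by_headers(table, "total", "Pieces")

# Write one CSV row per page; Excel can open the resulting file directly.
buf = io.StringIO()
csv.writer(buf).writerow(
    [root.findtext("Category"), root.findtext("Date"), pieces_total]
)
print(buf.getvalue())
```

The same function works unchanged whether the table has 8 columns or 20, since it never hard-codes a column number. For a real run you would loop over the pages or tables in the converted file and write to a file instead of an in-memory buffer.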

Regards,
Adrian

Post by **Pippo3** » Thu Feb 28, 2019 12:16 pm

Thanks Adrian!
That was exactly the advice I needed.
Let's see what I can do

Nick
