Native PDF -> XML -> Excel: how to map an XML file
Posted: Mon Feb 18, 2019 11:31 am
Hello,
I’m trying to use XML to convert a PDF data file into an Excel file. I don’t know if this is the right path, so please feel free to advise me if it’s not.
The problem
I have a data-rich PDF, full of tables and numbers. The XML file I can generate automatically from it is extremely accurate. In fact, I wouldn’t be surprised if the PDF itself had been generated from an XML database. I have no control over that database, though; I can only start from the PDF.
So let’s say, I have page one with this set of data:
Lots of text
Category
Date
A table with 8 useless columns, and just two relevant ones.
I’d like to isolate only four pieces of data: the category, the date, and the two relevant cells of the table, mapping them into an Excel file.
The problem is that the following page might have the category and the date in the same places, but the table might have 20 columns, and I still need only two values out of it.
The two values I need are clearly “mappable” given the combination of row and column headers, but they are not identifiable given the combination of row and column number (so, it’s always the “total” row in the “Pieces” column, but sometimes this can be cell C3, sometimes G5, etc).
If this were HTML, I could very easily map the fields with a scraper and get the work done. The result would be a CSV that I could easily import into Excel, with all the data I need.
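To make the idea concrete, here is a minimal Python sketch of the header-based lookup I have in mind, using only the standard library. The tag names (`page`, `category`, `date`, `table`, `row`, `cell`) and the sample values are hypothetical; the real structure depends on whatever the PDF-to-XML converter emits:

```python
import csv
import xml.etree.ElementTree as ET

# Hypothetical XML structure -- the real tag names depend on
# the converter's output, so adjust the find() calls accordingly.
sample = """
<page>
  <category>Widgets</category>
  <date>2019-02-18</date>
  <table>
    <row><cell>Item</cell><cell>Weight</cell><cell>Pieces</cell></row>
    <row><cell>A</cell><cell>10</cell><cell>4</cell></row>
    <row><cell>Total</cell><cell>10</cell><cell>4</cell></row>
  </table>
</page>
"""

root = ET.fromstring(sample)
category = root.findtext("category")
date = root.findtext("date")

# Flatten the table into a list of rows, each a list of cell texts.
rows = [[cell.text for cell in row] for row in root.find("table")]

# Locate the target cell by its headers, not by its position:
# the "Pieces" column can be anywhere, and so can the "Total" row.
col = rows[0].index("Pieces")
total_row = next(r for r in rows if r[0] == "Total")
pieces_total = total_row[col]

# One output row per page: category, date, and the looked-up value.
with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerow([category, date, pieces_total])
```

The key point is that the lookup keys off the header text, so it keeps working whether the value sits in C3 or G5.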
Is it a good idea to try to get this done with XML? Is the task clear at all?
Please let me know if you have some hint, or some keyword I can use for further research on the topic.
Thanks
Nick