Native pdf -> XML -> Excel . How to map an xml file
- Pippo3
- Posts: 3
- Joined: Mon Feb 18, 2019 11:09 am
Native pdf -> XML -> Excel . How to map an xml file
Hello,
I’m trying to use XML to convert a PDF data file into an Excel file. I don’t know if this is the right path; please feel free to tell me if it’s not.
The problem
I have a data-rich PDF, full of tables and numbers. The XML file I can generate automatically from it is extremely accurate. In fact, I wouldn’t be surprised if the PDF itself had been generated from an XML database. I have no control over the database, though, so I can only start from the PDF.
So let’s say, I have page one with this set of data:
Lots of text
Category
Date
A table with 8 useless columns, and just two relevant ones.
I’d like to isolate only four values: category, date and the two relevant cells in the table, and map them into an Excel file.
The problem is, the following page might have the category in the same place, the date also, but the table would have 20 columns, and I still need only two values out of it.
The two values I need are clearly “mappable” given the combination of row and column headers, but they are not identifiable given the combination of row and column number (so, it’s always the “total” row in the “Pieces” column, but sometimes this can be cell C3, sometimes G5, etc).
If this were HTML, I could very easily map the fields with a scraper and get the work done. The result would be a CSV that I can easily import into Excel, with all the data I need.
Is it a good idea to try to get the work done with XML? Is the task clear at all?
Please let me know if you have some hint, or some keyword I can use for further research on the topic.
Thanks
Nick
- adrian
- Posts: 2893
- Joined: Tue May 17, 2005 4:01 pm
Re: Native pdf -> XML -> Excel . How to map an xml file
Hi,
With what tool do you make the conversion from PDF to XML (or other format)?
Pippo3 wrote:
> The two values I need are clearly “mappable” given the combination of row and column headers, but they are not identifiable given the combination of row and column number (so, it’s always the “total” row in the “Pieces” column, but sometimes this can be cell C3, sometimes G5, etc).
If you already have the means to convert everything into XML, just do that: grab the whole table into XML and use the same rule within the XML to identify what data you actually need.
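To make "use the same rule within the XML" concrete, here is a minimal Python sketch of a lookup by row and column header instead of by cell position. The thread never shows the XML that Acrobat produces, so the `Table`/`TR`/`TH`/`TD` element names, the sample data and the helper `cell_by_headers` are all assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the real export may use different element names.
NARROW = """
<Table>
  <TR><TH>Item</TH><TH>Weight</TH><TH>Pieces</TH></TR>
  <TR><TD>Bolts</TD><TD>1.5</TD><TD>40</TD></TR>
  <TR><TD>Nuts</TD><TD>0.5</TD><TD>60</TD></TR>
  <TR><TD>Total</TD><TD>2.0</TD><TD>100</TD></TR>
</Table>
"""

def cell_by_headers(table, row_label, col_label):
    """Return the text of the cell where the labelled row and column meet."""
    rows = table.findall("TR")
    if not rows:
        return None
    # The column position is read from the header row, not hard-coded.
    headers = [(cell.text or "").strip().lower() for cell in rows[0]]
    if col_label.lower() not in headers:
        return None
    col = headers.index(col_label.lower())
    for row in rows[1:]:
        cells = list(row)
        # The row is identified by the label in its first cell.
        if cells and (cells[0].text or "").strip().lower() == row_label.lower():
            if col < len(cells):
                return (cells[col].text or "").strip()
            return None
    return None

table = ET.fromstring(NARROW)
print(cell_by_headers(table, "total", "Pieces"))
```

Because the column index is derived from the header row each time, the same call should keep working when the next page has a table with 20 columns and "Pieces" sits somewhere else.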
Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
- Pippo3
- Posts: 3
- Joined: Mon Feb 18, 2019 11:09 am
Re: Native pdf -> XML -> Excel . How to map an xml file
Dear Adrian,
I'm using Adobe Acrobat X to do the conversion. As I previously wrote, the conversion is perfect and the resulting XML is flawless.
Could you please elaborate on this?
adrian wrote:
> use the same rule within the XML to identify what data you actually need
Is there a way to do that programmatically? I mean, extract them and get them into a table.
With which tool?
Thanks
Nick
- adrian
- Posts: 2893
- Joined: Tue May 17, 2005 4:01 pm
Re: Native pdf -> XML -> Excel . How to map an xml file
Hi,
After obtaining the document in XML format you can use various XML tools to inspect the document. Note that you're on the Oxygen XML Editor forum, so you can take a wild guess what tool you could use.
I quoted the "rule" that you used to determine what cell you need.
Pippo3 wrote:
> it’s always the “total” row in the “Pieces” column, but sometimes this can be cell C3, sometimes G5, etc
Inspect the XML and see if an XML rule (e.g. XPath selector) can be composed to identify this cell.
Not sure what the table from your converted XML would look like, so it's difficult to provide an XPath example. Solutions vary a lot depending on how the table is defined in the XML document. But you should be able to write an XPath expression that pinpoints your cell.
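As a rough illustration only, the sketch below shows how such a selector could feed a CSV that Excel opens directly. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`. The element names (`Document`, `Page`, `Category`, `Date`, `Table`, `TR`, `TH`, `TD`), the second column "Weight", the sample data and the helper `total_of` are all hypothetical, since the real structure of the converted XML is not shown in this thread.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical two-page export; adjust the names to the real XML.
DOC = """<Document>
  <Page>
    <Category>Fasteners</Category>
    <Date>2019-02-18</Date>
    <Table>
      <TR><TH>Item</TH><TH>Pieces</TH><TH>Weight</TH></TR>
      <TR><TD>Bolts</TD><TD>40</TD><TD>1.5</TD></TR>
      <TR><TD>Total</TD><TD>40</TD><TD>1.5</TD></TR>
    </Table>
  </Page>
  <Page>
    <Category>Washers</Category>
    <Date>2019-02-19</Date>
    <Table>
      <TR><TH>Item</TH><TH>Colour</TH><TH>Weight</TH><TH>Bin</TH><TH>Pieces</TH></TR>
      <TR><TD>Flat</TD><TD>Grey</TD><TD>0.2</TD><TD>B7</TD><TD>25</TD></TR>
      <TR><TD>Spring</TD><TD>Black</TD><TD>0.3</TD><TD>B8</TD><TD>5</TD></TR>
      <TR><TD>Total</TD><TD></TD><TD>0.5</TD><TD></TD><TD>30</TD></TR>
    </Table>
  </Page>
</Document>"""

def total_of(page, column):
    """Value in the 'Total' row under the given column header, or ''."""
    headers = [(th.text or "").strip() for th in page.findall("./Table/TR/TH")]
    # XPath-subset predicate: a TR that has a TD child whose text is 'Total'.
    row = page.find("./Table/TR[TD='Total']")
    if row is None or column not in headers:
        return ""
    cells = row.findall("TD")
    idx = headers.index(column)
    return (cells[idx].text or "").strip() if idx < len(cells) else ""

root = ET.fromstring(DOC)
rows = [
    [
        page.findtext("Category", default=""),
        page.findtext("Date", default=""),
        total_of(page, "Pieces"),
        total_of(page, "Weight"),
    ]
    for page in root.findall("Page")
]

# Write one CSV line per page.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["category", "date", "pieces_total", "weight_total"])
writer.writerows(rows)
print(buf.getvalue())
```

In a full XPath 1.0 engine the same lookup can be written as a single expression, for example `//Table/TR[TD[1]='Total']/TD[count(../../TR[1]/TH[.='Pieces']/preceding-sibling::TH) + 1]`, again under the assumed element names.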
Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com