
Docx to xml

Posted: Mon May 22, 2017 5:57 pm
by apzkhan
Hi

I'm still new to this, so apologies in advance if what I'm asking about is painfully obvious.

I have to publish content using a content management system that validates the content against an XML schema. However, the content is originally written in Word documents, so at the moment we manually add the tags needed for each document to be validated and published.

I wanted to ask if there is an easier way to do this: is it possible to convert a .docx file into an XML file that uses the tags defined by the XML schema?

Many Thanks

Re: Docx to xml

Posted: Tue May 23, 2017 8:23 am
by Radu
Hi,

One possibility is to save the Word document as HTML from MS Office, convert the HTML to XHTML (Oxygen has a File->Import feature which can do that), and then use XSLT processing to convert the XHTML to your target vocabulary.
Alternatively, you can open the DOCX file in the Oxygen Archive Browser view, open the "document.xml" file inside it, which contains the main document's contents, and then apply a custom XSLT stylesheet to convert it to another XML format.
For certain target XML vocabularies like DITA and DocBook, Oxygen has a special feature called "Smart Paste" (based on a set of predefined internal XSLT stylesheets) which can help with the conversion:

http://blog.oxygenxml.com/2016/05/how-to-migrate-from-word-to-dita.html
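
To give a rough idea of what the second option involves outside of Oxygen: a DOCX file is a ZIP archive, and the main text lives in "word/document.xml". The Python sketch below is only an illustration, not a replacement for a proper XSLT stylesheet. The target element names (article, title, para) and the rule that Word styles starting with "Heading" become titles are assumptions made up for the example; you would replace them with the tags from your own schema.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def docx_to_target_xml(docx_path):
    """Convert a .docx file into a hypothetical <article> vocabulary."""
    # A .docx file is a ZIP archive; the main text is in word/document.xml
    with zipfile.ZipFile(docx_path) as archive:
        source = ET.fromstring(archive.read("word/document.xml"))

    article = ET.Element("article")  # hypothetical root element
    # Only top-level body paragraphs are handled; tables are ignored here
    for para in source.iterfind("w:body/w:p", NS):
        # Join all text runs (w:t) that make up the paragraph
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        if not text.strip():
            continue  # skip empty paragraphs
        style = para.find("w:pPr/w:pStyle", NS)
        style_name = style.get(f"{{{W}}}val", "") if style is not None else ""
        # Assumed mapping: "Heading..." styles become <title>, the rest <para>
        tag = "title" if style_name.startswith("Heading") else "para"
        ET.SubElement(article, tag).text = text
    return ET.tostring(article, encoding="unicode")
```

Calling docx_to_target_xml("input.docx") returns the converted XML as a string, which you would still need to validate against your schema. Lists, tables, images and inline formatting are not covered by this sketch.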

Regards,
Radu

Re: Docx to xml

Posted: Wed May 24, 2017 12:08 pm
by apzkhan
Thanks, I'll give them a go and see how far I can get.