Batch Documents Converter Add-on

Oxygen XML offers an add-on that contributes actions in the following submenus:

Batch Documents Converter submenu located in the Tools menu and the contextual menu of resources in the Project view.
Additional conversions submenu located in File > Import/Convert.
Import submenu located in the Append child, Insert Before, and Insert After submenus from the contextual menu of the DITA Maps Manager view when the opened DITA map is a local file.

The first time you invoke any of these actions, Oxygen XML asks whether you want to install the add-on and offers a wizard to guide you through the installation.

Once the add-on is installed, restart Oxygen XML. Those same actions will then contain the list of available conversions. Selecting an action from the submenu opens a dialog box where you can configure the options for the corresponding conversion. You can batch convert between the following formats:

HTML to XHTML
HTML to DITA
HTML to DocBook4
HTML to DocBook5
Markdown to XHTML
Markdown to DITA
Markdown to DocBook4
Markdown to DocBook5
Word (.doc or .docx) to XHTML
Word (.doc or .docx) to DITA
Word (.doc or .docx) to DocBook4
Word (.doc or .docx) to DocBook5
Excel to DITA
Confluence to DITA
DocBook to DITA
OpenAPI to DITA
JSON to XML
XML to JSON
JSON to YAML
YAML to JSON
YAML to XML
XML to YAML
XSD to JSON Schema (version 2020-12)

When actions are invoked from the contextual menu of the DITA Maps Manager view, the resulting documents from the conversion are automatically inserted in the map as follows:

Actions from Append child insert map nodes as children of the currently selected node.
Actions from Insert Before insert map nodes as siblings of the currently selected node, above the current node in the map.
Actions from Insert After insert map nodes as siblings of the currently selected node, below the current node in the map.
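
The three insertion modes can be pictured with a short Python sketch. It is only an illustration: the MapNode class and the function names are hypothetical and are not part of any Oxygen API, and placing an appended child last is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by content
class MapNode:
    """Hypothetical stand-in for a node in a DITA map."""
    name: str
    children: List["MapNode"] = field(default_factory=list)
    parent: Optional["MapNode"] = None


def append_child(selected: MapNode, new: MapNode) -> None:
    # Append child: the new node becomes a child of the selected node
    # (added at the end here, which is an assumption).
    new.parent = selected
    selected.children.append(new)


def insert_before(selected: MapNode, new: MapNode) -> None:
    # Insert Before: the new node becomes a sibling placed above the selection.
    siblings = selected.parent.children
    new.parent = selected.parent
    siblings.insert(siblings.index(selected), new)


def insert_after(selected: MapNode, new: MapNode) -> None:
    # Insert After: the new node becomes a sibling placed below the selection.
    siblings = selected.parent.children
    new.parent = selected.parent
    siblings.insert(siblings.index(selected) + 1, new)


root = MapNode("map")
intro, usage = MapNode("intro"), MapNode("usage")
append_child(root, intro)
append_child(root, usage)
insert_before(usage, MapNode("converted-a"))
insert_after(usage, MapNode("converted-b"))
append_child(usage, MapNode("converted-c"))
```

After these calls, the children of the root are intro, converted-a, usage, and converted-b, and converted-c is a child of usage.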

Quick Installation

You can drag the following Install button and drop it into the main editor in Oxygen to quickly initiate the installation process:

Install

Manual Installation

To manually install the Batch Documents Converter add-on:

Go to Help > Install new add-ons to open an add-on selection dialog box.
Enter or paste https://www.oxygenxml.com/InstData/Addons/default/updateSite.xml in the Show add-ons from field or select it from the drop-down menu.
Note:
If you have issues connecting to the default update site, you can download the add-on package, unzip it, then use the Browse for local files action in the Install new add-ons dialog box to locate the downloaded addon.xml file.
Select the Batch Documents Converter add-on and click Next.
Read the end-user license agreement. Then select the I accept all terms of the end-user license agreement option and click Finish.
Restart the application.

Result: A Batch Documents Converter submenu will now be available in the Tools menu and in the contextual menu. This submenu will contain a list of the various types of available conversions. Selecting one of the types of conversions will open a dialog box where you can configure options for the conversion.

Configuration

Options for configuring the conversions can be found in the preferences page of the add-on (Options > Preferences > Plugins > Batch Documents Converter) or in the conversion dialog box.

Conversions from Word (Word Styles Mapping)

The conversions from MS Word work best if you only use the MS Word styles to semantically mark up your document. It is important that sections from the Word document are well defined using the heading styles.

Use the Word styles mapping option from the Batch Documents Converter preferences page to configure any of the Word conversions (Word to XHTML, Word to DITA, Word to DocBook4, and Word to DocBook5) by mapping Word elements and styles to the corresponding HTML elements.

If the Word document contains paragraphs formatted with custom styles that are not based on default styles, they have to be set in the Word styles mapping configuration. Those that are not set will be converted into simple paragraphs.

The styles mapping configuration is inherited between styles. If you use a custom style that is based on a default style, the default style mapping configuration will be inherited and also used for the custom style. The mapping from the base style is not inherited if the custom style has a mapping defined in the Word styles mapping configuration.
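
The inheritance rules above can be sketched in Python as a lookup that walks up the chain of base styles. This is an illustration only, not the add-on's implementation: the style names, the BASED_ON and MAPPINGS tables, and the resolve_mapping function are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical data: style name -> name of the style it is based on.
BASED_ON: Dict[str, Optional[str]] = {
    "Heading 1": None,
    "Chapter Title": "Heading 1",  # custom style based on a default style
    "Callout": "Heading 1",        # custom style that has its own mapping
    "Pull Quote": None,            # custom style not based on a default style
}

# Hypothetical mapping table: style name -> HTML element.
MAPPINGS: Dict[str, str] = {
    "Heading 1": "h1",
    "Callout": "p.callout",
}


def resolve_mapping(style: str) -> str:
    """Return the HTML element for a style, inheriting from base styles."""
    current: Optional[str] = style
    seen = set()  # guards against a cyclic based-on chain
    while current is not None and current not in seen:
        if current in MAPPINGS:
            # A mapping on the style itself wins over any inherited mapping.
            return MAPPINGS[current]
        seen.add(current)
        current = BASED_ON.get(current)
    return "p"  # unmapped styles become simple paragraphs
```

With this data, Chapter Title inherits h1 from Heading 1, Callout uses its own p.callout mapping, and Pull Quote falls back to a simple paragraph.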

To define a mapping in the Word Styles Mapping table, you can use the already defined default configuration. For example, if you use a custom Word style named Document Title that is not based on a default style, you can map this to the HTML "h1" element:

Word element	Word style	HTML elements
p	Document Title	h1:fresh

The resulting 'h1' element will be transformed into the corresponding element when converting to DITA, DocBook 4, and DocBook 5.

The Word styles mapping table contains the following columns:

Word element

This column allows one of the following Word elements:

p - Word paragraph
r - Word run
b - bold text
i - italicized text
u - underlined text
strike - strikethrough text
table - table
p:unordered-list(x) - unordered list (where 'x' is the nesting level of the list)
p:ordered-list(x) - ordered list (where 'x' is the nesting level of the list)

Word style

This column can be used to map a paragraph, run, or table with a specific style (referenced by name).

Styles can also be referenced by style ID. This is the ID used internally in the .docx file. To map a paragraph or run with a specific style ID, append a dot followed by the style ID in the Word element column (for example: p.Heading1).

HTML elements

This column can be used to map the resulting HTML elements. It allows a single element or multiple nested elements.

The nested elements can be declared by using the '>' character (for example: ul > li).

The class attribute can be specified on the resulting HTML elements by appending a dot followed by the class value, after the element (for example: p.myClass).

When converting Word to DITA, these class attributes are automatically converted to outputclass attributes. This may be useful if you want to apply extra processing on the resulting DITA document using a custom XSLT stylesheet.

The :fresh syntax can be used to create new elements. If it is not used, the converter will try to reuse the element and close it only when it is necessary.

For example, if the following configuration is set:

Word element	Word style	HTML elements
p	Heading 1	h1

When the converter finds consecutive Word paragraphs with the style named Heading 1, these will be converted into a single h1 element that contains the text appended from all of the Word paragraphs.

If h1:fresh is set in the last column, the converter will create separate h1 elements.

The :separator('separator_string') syntax can be used to specify a separator between paragraphs that are merged when :fresh syntax is not specified.

For example, if the following configuration is set:

Word element	Word style	HTML elements
p	Code Block	pre:separator('\n')

When the converter finds multiple consecutive paragraphs styled with the Code Block style, it will merge them into a single <pre> HTML element and the "\n" (new line) separator will be used between the merged text.
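
The difference between merging, :fresh, and :separator can be sketched in Python. This is a simplified model under stated assumptions, not the converter's actual code: the merge_paragraphs function is hypothetical, paragraphs are represented as (style, text) pairs, unmapped paragraphs are emitted as plain p elements, and no HTML escaping is done.

```python
from typing import List, Tuple


def merge_paragraphs(paragraphs: List[Tuple[str, str]], style: str, element: str,
                     fresh: bool = False, separator: str = "") -> List[str]:
    """Convert (style, text) pairs into HTML strings.

    Paragraphs in `style` map to `element`; all others become plain <p>.
    """
    output: List[str] = []
    run: List[str] = []  # texts of the current run of consecutive `style` paragraphs

    def flush() -> None:
        # Close the open element, joining the collected texts with the separator.
        if run:
            output.append(f"<{element}>{separator.join(run)}</{element}>")
            run.clear()

    for para_style, text in paragraphs:
        if para_style == style:
            run.append(text)
            if fresh:
                flush()  # :fresh closes the element after every paragraph
        else:
            flush()
            output.append(f"<p>{text}</p>")
    flush()
    return output


doc = [("Code Block", "a = 1"), ("Code Block", "b = 2"), ("Normal", "Done.")]
merged = merge_paragraphs(doc, "Code Block", "pre", separator="\n")
separate = merge_paragraphs(doc, "Code Block", "pre", fresh=True)
```

Here merged holds one pre element containing both lines joined by a new line, while separate holds two pre elements, one per paragraph.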

To ignore elements, the '!' character can be added in the HTML elements column.
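
The syntax of the HTML elements column can be summarized with a small Python parser. The grammar below is inferred only from the examples on this page ('>', '.class', ':fresh', ':separator(...)', and '!'), so the add-on may accept forms that this sketch rejects. The Step class and parse_html_elements function are hypothetical names, and a separator string that itself contains '>' is not supported here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One step of a path such as "ul > li.item:fresh".
STEP = re.compile(
    r"^(?P<name>[A-Za-z][\w-]*)"             # element name
    r"(?:\.(?P<cls>[\w-]+))?"                # optional .class
    r"(?P<fresh>:fresh)?"                    # optional :fresh
    r"(?::separator\('(?P<sep>[^']*)'\))?$"  # optional :separator('...')
)


@dataclass
class Step:
    name: str
    css_class: Optional[str] = None
    fresh: bool = False
    separator: Optional[str] = None  # kept exactly as written in the table


def parse_html_elements(value: str) -> Optional[List[Step]]:
    """Return the nested steps, or None when the value is '!' (ignore)."""
    value = value.strip()
    if value == "!":
        return None
    steps: List[Step] = []
    for part in value.split(">"):
        match = STEP.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognized step: {part!r}")
        steps.append(Step(match["name"], match["cls"],
                          match["fresh"] is not None, match["sep"]))
    return steps
```

For example, parse_html_elements("ul > li") returns two steps named ul and li, and parse_html_elements("!") returns None to signal that the element is ignored.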

The Export button can be used to export the Word styles configuration to an XML file. This exported file can be used to configure the MS Word Dynamic Conversion from Oxygen XML by copying the file into the DITA-OT plugin directory: [OXYGEN_INSTALL_DIR]/frameworks/dita/DITA-OT/plugins/com.oxygenxml.dynamic.resources.converter.

The Import button allows you to import the Word styles configuration from an exported XML file.

Note:

The Word styles mapping configuration applies only to MS Word files saved in the newer Microsoft Office Open XML (DOCX) format.

Maximum Heading Level for Creating Topics

The Maximum heading level for creating topics option from the Batch Documents Converter preferences page allows you to set the maximum heading level that the converter will process as DITA topics. Headings at a deeper level than this maximum will be converted to section elements.

When the output is a DITA topic, this option sets the maximum heading level that will be converted as a nested topic in the document.

When the output is a DITA map, this option sets the maximum heading level that will be extracted as a DITA topic file and referenced in the DITA map hierarchy.

Note:

This option only applies to the HTML to DITA and Word to DITA conversions.
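
The effect of this option can be sketched in Python. The classify_heading function and the sample headings are hypothetical and only illustrate the rule described above: heading levels up to and including the maximum become topics, and deeper levels become sections.

```python
def classify_heading(level: int, max_topic_level: int) -> str:
    """Return 'topic' for headings up to the maximum level, else 'section'."""
    return "topic" if level <= max_topic_level else "section"


# Sample outline as (heading level, title) pairs, with the option set to 2.
headings = [(1, "Install"), (2, "Windows"), (3, "Silent mode")]
plan = [(title, classify_heading(level, 2)) for level, title in headings]
```

With the maximum set to 2, Install and Windows become topics and Silent mode becomes a section.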

Word to DITA

The Create DITA maps from Word documents containing multiple headings option from the conversion dialog box allows you to decide whether the output will be a DITA map or a DITA topic. When this option is selected, the sections from your Word document marked by titles or headings will be separated into individual DITA topics and referenced in a DITA map. If the Word document does not contain multiple sections, the output will be a single topic. When this option is not selected, the output will be a topic with nested topics and sections according to the number of titles and headings from the Word document.

Note:

Mathematical equations in Word documents should be automatically converted to MathML equations if they are in Office Math Markup Language (OMML) format. If the mathematical equations are in Microsoft Equation Editor format, they first need to be converted to the newer OMML format. See: https://support.microsoft.com/en-us/office/editing-equations-created-using-microsoft-equation-editor-08a44b8c-ae15-41a7-bc15-7239890c0cec.

Markdown to DITA

The Create DITA maps from Markdown documents containing multiple headings option from the conversion dialog box allows you to decide whether the output will be a DITA map or a DITA topic. When this option is selected, all headings from your Markdown document will be separated into individual DITA topics and referenced in a DITA map. If the Markdown document does not contain multiple headings, the output will be a single topic. When this option is not selected, the output will be a topic with nested topics or sections according to the number of headings from the document.

The Create short description elements option from the conversion dialog box allows you to decide whether or not the shortdesc elements are created in the output DITA document. When this option is selected, the first paragraph before the headings from the Markdown document will be converted into DITA short description elements. When this option is not selected, the output will not contain the short description element.

HTML to DITA

The Create DITA maps from HTML documents containing multiple headings option from the conversion dialog box allows you to decide whether the output will be a DITA map or a DITA topic. When this option is selected, the headings from your HTML document will be separated into individual DITA topics and referenced in a DITA map. If the HTML document does not contain multiple sections, the output will be a single topic. When this option is not selected, the output will be a topic with nested topics or sections according to the number of headings from the document.

The Ignore HTML 'div' elements option from the conversion dialog box allows you to decide whether or not the <div> elements will be ignored. When this option is selected, all <div> elements will be ignored. When this option is not selected, only <div> elements that include the @class or @id attribute will be handled by the converter.
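
The rule for div elements can be expressed as a simple predicate in Python. The div_is_handled function is a hypothetical name; it only restates the two cases described above and says nothing about what the converter does with the content of an ignored div.

```python
from typing import Mapping


def div_is_handled(attributes: Mapping[str, str], ignore_divs: bool) -> bool:
    """Decide whether a <div> with the given attributes is handled."""
    if ignore_divs:
        # Option selected: every <div> is ignored.
        return False
    # Option not selected: only <div> elements with @class or @id are handled.
    return "class" in attributes or "id" in attributes
```

For example, div_is_handled({"class": "note"}, ignore_divs=False) returns True, while the same call with ignore_divs=True returns False.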

Confluence to DITA

The Confluence to DITA conversion processes the HTML content generated by the Atlassian® Confluence (see https://www.atlassian.com/software/confluence) export process. To export Confluence content to HTML, log in to your Atlassian® Confluence account and navigate to the specific space that you want to export. Then go to Space Settings > Export space and choose to export it as HTML. The resulting index.html file must be provided in the Input files list from the conversion dialog box.

DocBook to DITA

The Create DITA maps from DocBook documents containing multiple sections option from the conversion dialog box allows you to decide whether the output will be a DITA map or a DITA topic. When this option is selected, the sections from your DocBook document will be converted into individual DITA topics and referenced in a DITA map. When this option is not selected, the output will be a single topic with nested topics.

OpenAPI to DITA

The OpenAPI to DITA conversion can be used to convert JSON or YAML files that use and conform to the OpenAPI specification (versions 2.0, 3.0, or 3.1) into DITA documents. The Create DITA maps from OpenAPI documents option from the conversion dialog box allows you to decide whether the output will be a DITA map or a DITA topic. When this option is selected, the converter will create separate DITA topics for the introduction (including OAS 'Info', 'Server', 'Security Requirement' and 'External Documentation' objects), 'Tag', 'Operation', 'Callback', and 'Components' objects. These topics will be referenced in a DITA map. When this option is not selected, the output will be a single topic with nested topics.

Word to DITA Conversion Notes

The following are some notes about the Word to DITA conversion:

Paragraphs styled with default Word heading styles (or with custom styles based on default Word heading styles) are handled as topics or sections in the converted DITA output.
You can choose whether the converted output is a DITA map with referenced topics or a single DITA topic. See the Create DITA maps from Word documents containing multiple headings option.
You can choose the level of headings that are converted as topics or sections. See the Maximum Heading Level for Creating Topics option.
You can customize the conversion by adding mappings from your own Word styles to HTML elements. The configured HTML element is converted to the proper DITA element. The @class attribute is transformed to the DITA @outputclass attribute. See the Conversions from Word (Word Styles Mapping) section.
Ordered and unordered lists are converted to DITA and the list level is preserved.
Bold, italic, underline, strikethrough, superscript, and subscript styles are converted to the corresponding DITA elements.
The formatting of table properties (such as borders) is currently ignored, but the formatting of the text inside the table is treated the same as in the rest of the document. Only the header row formatting is taken into account when converting tables to DITA.
Footnotes and endnotes are converted.
Images embedded in Word documents are saved to separate files and referenced in the generated DITA topics.
Links (cross-references and external links) are converted.
Line breaks are taken into account.
Mathematical equations in Word documents are automatically converted to MathML equations if they are in Office Math Markup Language (OMML) format. If the mathematical equations are in Microsoft Equation Editor format, they first need to be converted to the newer OMML format.
Symbols are converted.
Index entries are converted.
The Table of Contents is ignored in the DITA result.

Resources

For more information about the Batch Documents Converter add-on, as well as details regarding other popular add-ons that extend the functionality of Oxygen XML, see the following resources: