Synchronizing XML formatting between Oxygen and external tools (+ Git)

Post by chrispitude » Mon Sep 07, 2020 9:57 pm

Hi all,

We have some external tools that we run from Oxygen to update our DITA topics and maps. For example,
  • Update topics with target text for cross-book xrefs
  • Update maps with corrected subtopic structure for nested topic structures in a single file
These external utilities are currently written in Perl, although I'd like to rewrite them in Python in the near future.

Our DITA content is stored in a Git repo, which means that Git notices changes in XML formatting as file differences. When I drag-and-drop a file into the DITA Maps Manager in Oxygen, then make it conditional, long attributes are placed on separate lines as follows:

Code: Select all

  <chapter>
    <topicref
      href="ptug/using_primetime_with_spice/simlink/correlating_arc_based_coupled_primetime_si_spice_analysis.dita"
      keys="correlating_arc_based_coupled_primetime_si_spice_analysis"
      product="library(LC)"/>
  </chapter>
Our external tool uses the Perl XML::Twig package to read/write XML. When the content round-trips through the utility, the attributes are placed on the same line:

Code: Select all

  <chapter>
    <topicref href="ptug/using_primetime_with_spice/simlink/correlating_arc_based_coupled_primetime_si_spice_analysis.dita" keys="correlating_arc_based_coupled_primetime_si_spice_analysis" product="library(LC)"/>    
  </chapter>
Oxygen does quite a commendable job of leaving existing XML structures intact. However, changes in structure can cause the content to get reformatted to multiple lines again, such as if the topic is reordered or surrounding structure is changed.

These XML formatting differences are causing Git conflicts. As a first step, I'd like to align the XML formatting between Oxygen and my external tools as much as possible.

In Oxygen's preferences, I see the following settings:

[Screenshot: image.png]

but the setting that sounds like it would keep the attributes on the same line is already unchecked. Does adding topics in the DITA Maps Manager bypass this setting?

Post by Radu » Tue Sep 08, 2020 7:19 am

Hi Chris,

In the Oxygen Preferences->"Editor / Format" page there is a "Line width" setting. Oxygen tries to add line breaks so that no line exceeds that number of characters. You can experiment with setting a very large line width there.
So even with the "Break line before attribute" checkbox unchecked, Oxygen will still break lines between attributes if the line would otherwise be longer than the maximum line width set in the preferences. I will try to explain this behavior more clearly in the user's manual for the next release.
The "Break line before attribute" setting mainly takes effect when the element's start tag, along with its attributes, fits within the maximum line width. So if you have a small element like:

Code: Select all

<elem a="b" c="d">
setting and unsetting the "Break line before attribute" checkbox will take effect on it because the element's start tag does not overflow the maximum line width.
If you set a very large "Line width" value, Oxygen will still add line breaks in element-only content, but keep in mind that the setting will also influence the DITA topics.
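
To make the interaction between "Line width" and attribute breaking more concrete, here is a toy model in Python. It only illustrates greedy, width-driven wrapping and is not Oxygen's actual serializer; the function name `wrap_start_tag`, the two-space continuation indent, and the default width of 80 are all assumptions made for this example.

```python
def wrap_start_tag(name, attrs, line_width=80, indent="  ", self_closing=True):
    """Lay out a start tag, moving to a new line whenever the width would be exceeded.

    Hypothetical helper for illustration only; real serializers apply more rules.
    """
    closer = "/>" if self_closing else ">"
    lines = ["<" + name]
    for key, value in attrs.items():
        token = '%s="%s"' % (key, value)
        if len(lines[-1]) + 1 + len(token) <= line_width:
            lines[-1] += " " + token        # attribute still fits on the current line
        else:
            lines.append(indent + token)    # otherwise start a continuation line
    if len(lines[-1]) + len(closer) <= line_width:
        lines[-1] += closer
    else:
        lines.append(indent + closer)       # even the closing "/>" can end up wrapped
    return "\n".join(lines)


if __name__ == "__main__":
    print(wrap_start_tag("elem", {"a": "b", "c": "d"}, self_closing=False))
    print(wrap_start_tag("topicref",
                         {"href": "topics/intro.dita", "keys": "intro"},
                         line_width=30))
```

With a generous width the small element stays on one line; with a narrow width each attribute moves to its own line, and in edge cases only the closing "/>" wraps.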

Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com

Post by chrispitude » Tue Sep 08, 2020 2:14 pm

Thanks Radu, this is exactly what I needed!

Does the "Format and Indent" operation behave identically to new content/element creation in the topic editor and the DITA Maps Manager? If so, then what I can do is
  1. Run Format and Indent on every file in the repo
  2. Write a simple perl script to read/write the XML
  3. Compare #1 and #2 for differences
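
Steps 2 and 3 could be prototyped along these lines in Python. This is only a sketch: `minidom` stands in for the external serializer (the real tool uses XML::Twig), only the root element is re-serialized (so an XML declaration or DOCTYPE would be dropped), and the function names `roundtrip` and `formatting_diff` are made up for the example.

```python
import difflib
from xml.dom import minidom


def roundtrip(xml_text):
    """Parse the text and re-serialize its root element with minidom."""
    return minidom.parseString(xml_text).documentElement.toxml()


def formatting_diff(xml_text):
    """Return unified-diff lines between the input and its round-tripped form."""
    return list(difflib.unified_diff(
        xml_text.splitlines(),
        roundtrip(xml_text).splitlines(),
        fromfile="formatted-by-oxygen",
        tofile="round-tripped",
        lineterm="",
    ))


if __name__ == "__main__":
    sample = '<chapter>\n  <topicref\n    href="a.dita"\n    keys="a"/>\n</chapter>'
    print("\n".join(formatting_diff(sample)))
```

An empty result means the serializer reproduced the formatting of that file exactly; any diff lines show where the two styles disagree.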
Another tricky aspect is that our current DITA content comes from our FrameMaker-to-DITA conversion script, which means that it is formatted according to the default writing behavior of the Perl XML::Twig package. As authors add new content or modify existing content, it is formatted or reformatted to Oxygen's default behavior. I've resisted doing a global Format and Indent to Oxygen's style because I don't want to obscure the "Git Blame" history of our content.

I'll follow up here with my progress!

Post by chrispitude » Wed Sep 09, 2020 1:10 am

Hi Radu,

I ran into my first obstacle. Let's say I start with this DITA content:

Code: Select all

    <p><xref href="#creating_the_virtual_top_level_netlist/fig_ipq_mw3_qlb" type="fig"/>
      shows a 2D and 3D view of a top-level design with an SoC and memory design.</p>
    <fig id="fig_ipq_mw3_qlb">
      <title>2D and 3D views of a Top-Level Design</title>
    </fig>
In the Author mode, if I define @format="dita" in the Attributes view, the underlying XML turns into this:

Code: Select all

    <p><xref format="dita" href="#creating_the_virtual_top_level_netlist/fig_ipq_mw3_qlb" type="fig"
      /> shows a 2D and 3D view of a top-level design with an SoC and memory design.</p>
    <fig id="fig_ipq_mw3_qlb">
      <title>2D and 3D views of a Top-Level Design</title>
    </fig>
Note that the closing "/>" of the <xref> element is wrapped to the next line. The Perl XML package I'm using does not support breaking and wrapping just this "/>" fragment onto its own line. Is there a way to disable this in Oxygen? I don't see anything in Editor / Format / XML that pertains specifically to these tag-closing fragments.

Post by Radu » Wed Sep 09, 2020 9:35 am

Hi Chris,

About this question:
Does the "Format and Indent" operation behave identically to new content/element creation in the topic editor and the DITA Maps Manager?
The Author visual editing mode and the DITA Maps Manager share the same internal structure, so they have the same serialization behavior.
As for Oxygen sometimes adding line breaks, for example before the "/>", in order to obey the maximum line width specified in its settings: we do not have a setting to control this. Oxygen considers the resulting XML to be data-wise equivalent to the original, and it is. I'm sorry, but I'm not sure what we can do about this; we cannot guarantee that our serialization has all the settings needed to make it behave exactly like the serialization of another tool that you are using.
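
The "data-wise equivalent" point can be checked mechanically. As a minimal sketch, assuming Python 3.8 or later, the standard library's `canonicalize` function (C14N 2.0) produces the same output for two documents that differ only in whitespace inside tags; the helper name `same_xml_data` is made up for this example.

```python
from xml.etree.ElementTree import canonicalize


def same_xml_data(text_a, text_b):
    """Return True if both XML strings have the same canonical (C14N 2.0) form."""
    return canonicalize(xml_data=text_a) == canonicalize(xml_data=text_b)


if __name__ == "__main__":
    wrapped = '<p><xref format="dita" href="#t/fig_a" type="fig"\n      /> shows a view.</p>'
    flat = '<p><xref format="dita" href="#t/fig_a" type="fig"/> shows a view.</p>'
    print(same_xml_data(wrapped, flat))
```

Whitespace inside text content is still compared, so re-indented element-only content counts as different unless `strip_text=True` is passed to `canonicalize`.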

Regards,
Radu

Post by chrispitude » Wed Sep 09, 2020 3:11 pm

Hi Radu,

Is it even feasible to ask for an "Allow line breaks before />" option in your serializer? I realize I'm the only person asking for this. I don't know if you're using your own serializer or a standardized one.

Most serializers accessible from Perl/Python/etc. provide control over indent, line length, and keep spaces. But this "/>" behavior is something I cannot emulate in them.

Right now I read the XML file twice - once in XML tree form, and once as a single large string. I use the tree form to structurally explore the content and figure out what elements to modify, then I attempt to find the same elements using regex and element IDs to update them in string form. It is just as awful as it sounds. :) But, it ensures I modify only the areas I want, while leaving everything else precisely identical.
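
In Python, the same two-view idea might look roughly like the sketch below. It is an illustration rather than the actual utility: the function name `add_attribute_in_place` is hypothetical, and it assumes that ids are unique and double-quoted and that no attribute value contains "<" or ">".

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def add_attribute_in_place(xml_text, elem_id, name, value):
    """Add name="value" to the element with the given id, editing only its start tag."""
    # Step 1: use the parsed tree to decide whether an edit is needed at all.
    root = ET.fromstring(xml_text)
    target = next((e for e in root.iter() if e.get("id") == elem_id), None)
    if target is None or name in target.attrib:
        return xml_text

    # Step 2: patch the raw string so that everything else stays byte-identical.
    start_tag = re.compile(
        r'(<[\w.:-]+\s(?:[^<>]*\s)?id="%s"[^<>]*?)(\s*/?>)' % re.escape(elem_id)
    )
    addition = " %s=%s" % (name, quoteattr(value))
    return start_tag.sub(lambda m: m.group(1) + addition + m.group(2),
                         xml_text, count=1)


if __name__ == "__main__":
    sample = ('<body>\n    <p><xref href="#t/fig_a" type="fig"\n'
              '      /> shows a view.</p>\n    <fig id="fig_a">\n'
              '      <title>Views</title>\n    </fig>\n</body>')
    print(add_attribute_in_place(sample, "fig_a", "outputclass", "wide"))
```

The tree is used only to make the decision; the edit itself happens in the string, which is why the wrapped "/>" of the xref survives untouched.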

Post by Radu » Thu Sep 10, 2020 7:40 am

Hi Chris,

As far as I know there is no standard when it comes to serializing XML, so we use our own code along with the settings to format it. I added an internal issue for your request (EXM-46298 - Format and indent setting to avoid line break between attribute and end of tag), but I cannot guarantee a timeline for it.
About your external update tools: in Oxygen 22.1 we added an API to load XML content in memory in an Author-mode node structure (but without any visual aspects of it), modify that structure, and then save it back to disk. So at some point you could consider migrating your Perl scripts to an Oxygen Java-based plugin which adds, for example, a contextual menu action in the Project view and processes all content as if it were loaded and modified in the Author visual editing mode.

Regards,
Radu

Post by chrispitude » Fri Sep 11, 2020 5:04 pm

Hi Radu,

Thanks for filing the low-priority enhancement!

My Perl utility reads in a ditamap, makes some changes, and writes it back out. As a workaround, I implemented this:
  1. Hash all actual element strings (with linefeeds, etc.) in the input file by a normalized-whitespace version of the element.
  2. Search for all actual element strings in the output file, replacing each with the actual string from #1 if a normalized-whitespace string match exists.
So basically, I stuff the existing element strings back into my XML wherever possible. The code isn't smart, but it handles small contained changes well. For bigger changes, like when the utility moves nested XML structures deeper, it copies the wrong indent level over. Still, at least only the parts of the file modified by the utility get noticed by Git/diff now.

Here's the Perl code for reference if it helps anyone:

Code: Select all

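# Collapse every run of whitespace (including line breaks) to a single space.
sub normalize_whitespace { return (shift =~ s!\s+! !gr); }

# Write $contents to $filename, reusing the original formatting of any start tag
# that is unchanged apart from whitespace. Requires Perl 5.14+ for the /r flag.
sub write_differences {
 my ($filename, $contents) = @_;
 # Map normalized start tag => original start tag as it appears on disk.
 # Only tags with attributes match; assumes no literal ">" in attribute values.
 my %orig_elements = map {normalize_whitespace($_) => $_} (read_entire_file($filename) =~ m#(<[\w\-]+\s[^>]*>)#gs);
 my $get_element = sub {
  my $e = normalize_whitespace(shift);
  return defined($orig_elements{$e}) ? $orig_elements{$e} : $e;  # return original element if possible
 };
 $contents =~ s#(<[\w\-]+\s[^>]*>)#$get_element->($1)#gse;
 write_entire_file($filename, $contents);
 return 1;
}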
along with some helper functions:

Code: Select all

sub read_entire_file {
 my $filename = shift;
 # three-argument open with a lexical filehandle; decode as strict UTF-8
 open(my $fh, '<:encoding(UTF-8)', $filename) or die "can't open $filename for read: $!";
 local $/ = undef;  # slurp mode: read the whole file in one go
 my $contents = <$fh>;
 close $fh;
 return $contents;
}

sub write_entire_file {
 my ($filename, $contents) = @_;
 $contents =~ s!\n?$!\n!s;  # add LF if needed
 # :raw keeps LFs from being converted to CR/LF on Windows; then encode as strict UTF-8
 open(my $fh, '>:raw:encoding(UTF-8)', $filename) or die "can't open $filename for write: $!";
 print $fh $contents;
 close $fh or die "can't close $filename after write: $!";
 return;
}
Then you can do something like this:

Code: Select all

my $file_contents = read_entire_file($filename);
# reformat/modify XML inside $file_contents
write_differences($filename, $file_contents);
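
Since the plan mentioned at the top of the thread is to move these utilities to Python, here is a rough Python equivalent of the restore step. It is only a sketch under the same assumptions as the Perl version (only start tags with attributes are matched, and no attribute value contains a literal ">"); the names `TAG_RE` and `restore_original_tags` are made up for this example, and file reading and writing are left out.

```python
import re

# Start tags that carry at least one attribute, as in the Perl regex.
TAG_RE = re.compile(r"<[\w\-]+\s[^>]*>")


def normalize_whitespace(text):
    """Collapse every run of whitespace (including line breaks) to one space."""
    return re.sub(r"\s+", " ", text)


def restore_original_tags(original_text, new_text):
    """Put the original spelling of each start tag back into new_text where possible."""
    # Map normalized tag -> tag exactly as it appeared in the original file.
    originals = {normalize_whitespace(tag): tag
                 for tag in TAG_RE.findall(original_text)}

    def pick(match):
        key = normalize_whitespace(match.group(0))
        # Like the Perl code, fall back to the normalized tag when there is no match.
        return originals.get(key, key)

    return TAG_RE.sub(pick, new_text)


if __name__ == "__main__":
    before = '<chapter>\n  <topicref\n    href="a.dita"\n    keys="a"/>\n</chapter>\n'
    after = ('<chapter>\n  <topicref href="a.dita" keys="a"/>\n'
             '  <topicref href="b.dita"/>\n</chapter>\n')
    print(restore_original_tags(before, after))
```

As with the Perl version, tags that were moved to a different nesting depth get their old indentation copied back, so the result is best suited to small, contained changes.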

Post by Radu » Mon Sep 14, 2020 7:47 am

Hi Chris,

Thanks for posting details about your current approach.

Regards,
Radu

Post by chrispitude » Mon Sep 14, 2020 2:32 pm

I agree that integrated Java-based solutions would be best. I wish I had the Java knowledge and free time to pursue this! The DITA migration is not even my full-time job. I'm just a technical writer with a spare-time investigation into DITA that somehow turned into a full multi-group migration effort.

Maybe when I retire some day (some day??!), I can learn how to create useful Oxygen add-ons. :)
