[oXygen-user] Why can't Oxygen format and indent a 200MB file in 30 minutes whereas SAXON can do it in 15 seconds?

Oxygen XML Editor Support (Adrian Buza) support at oxygenxml.com
Thu Aug 19 09:11:45 CDT 2021


Hello,

As the saying goes, it depends...

> I opened the file in Oxygen and clicked on the format and indent 
> button. After 30 minutes of processing Oxygen gave up with an error 
> message. 
You haven't mentioned what the error message was, but I'm pretty sure 
Oxygen ran out of memory. Most likely it ran out of memory in the first 
few seconds of formatting, and the Java VM then spent the rest of the 30 
minutes struggling before admitting it.
So, first of all, it depends on how much memory Oxygen had available 
(Help > About, JVM Memory ... Total).

To cut a long story short, if you want Oxygen to format and indent a 
large file as fast as Saxon without risking running out of memory in the 
process, it should do this either without opening the document or at 
document opening time.

A. Without opening the document
Use Tools > "Format and Indent Files" or right click on the document in 
the Project view and "Format and Indent Files".

B. At document opening time
1. Set Options > Preferences >  Editor / Format, [x] "Format and indent 
the document on open".
2. Close the document.
3. Reopen the document (File > Reopen last closed editor / Ctrl+Alt+T)
4. Afterwards, clear the box for [ ] "Format and indent the document on 
open", because otherwise it will apply to every document you open.


Read on for the juicy details...

There is actually a huge difference between how Oxygen (an IDE) and 
Saxon (a CLI tool) achieve this and what each of them requires to do it, 
even though the result may be the same.

I can't really speak for Saxon's inner workings, but it might not even 
build an XML model in memory, depending on Saxon's optimizations and on 
whether Saxon streaming is used.
In theory, if you use an input stream that reads and parses the XML one 
chunk at a time, and an output stream that writes the formatted XML as 
the first one reads, you don't actually have to load the entire thing 
into memory for the purpose of formatting it.
Using Saxon streaming would probably be even faster than your result and 
could work for a file of any size, but I digress.
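
To make the "read one chunk, write one chunk" idea concrete, here is a 
minimal sketch in Python using the standard xml.sax module. It is only 
an illustration of the principle, not how Saxon or Oxygen implement 
formatting. The names IndentingWriter, format_stream and format_bytes 
are made up for this example. It also cuts corners: comments, 
processing instructions and the DOCTYPE are dropped, empty elements are 
written as a start/end tag pair, and whitespace in mixed content is not 
preserved faithfully.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class IndentingWriter(xml.sax.ContentHandler):
    """Write each SAX event to `out` as soon as it arrives.

    Only the nesting depth and the kind of the last event are kept, so
    memory use does not grow with the size of the input file.
    (Hypothetical helper, written for this illustration only.)
    """

    def __init__(self, out, indent="  "):
        super().__init__()
        self.out = out        # text stream; open real files with encoding="utf-8"
        self.indent = indent
        self.depth = 0
        self.last = None      # "start", "text" or "end"

    def startDocument(self):
        self.out.write('<?xml version="1.0" encoding="UTF-8"?>')

    def startElement(self, name, attrs):
        # Every start tag goes on a new line, indented by the current depth.
        self.out.write("\n" + self.indent * self.depth + "<" + name)
        for key, value in attrs.items():
            self.out.write(" " + key + "=" + quoteattr(value))
        self.out.write(">")
        self.depth += 1
        self.last = "start"

    def characters(self, content):
        # Drop whitespace-only chunks between tags (the old indentation),
        # but keep whitespace that arrives in the middle of a text run.
        if self.last != "text" and not content.strip():
            return
        self.out.write(escape(content))
        self.last = "text"

    def endElement(self, name):
        self.depth -= 1
        if self.last == "end":
            # The element had child elements: close it on its own line.
            self.out.write("\n" + self.indent * self.depth)
        self.out.write("</" + name + ">")
        self.last = "end"

    def endDocument(self):
        self.out.write("\n")


def format_stream(source, out):
    """Parse `source` (file name or binary stream), write indented XML to `out`."""
    xml.sax.parse(source, IndentingWriter(out))


def format_bytes(data):
    """Convenience wrapper for small inputs and quick experiments."""
    out = io.StringIO()
    format_stream(io.BytesIO(data), out)
    return out.getvalue()
```

For a real file you would pass an input file opened in binary mode and 
an output file opened for writing with encoding="utf-8" to 
format_stream. The parser feeds the handler as it reads, so at no point 
does the whole document have to sit in memory.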


By Oxygen's standards, 200MB is a large file (anything over 30MB counts 
as large). That means some optimizations are enforced to accommodate a 
file of this size. [1]
For 300MB or more, Oxygen has a "huge files" mode that no longer loads 
the entire document in memory and has some more severe limitations. [2]
So your file is closer to Oxygen's "huge" limit than to the "large" limit.

Because Oxygen is an IDE, it loads the document in memory as text (with 
the exceptions/optimizations mentioned above) and then builds various 
specialized models from the document so that you have all those editing 
helpers (Outline, Attributes, Model) or a much more complex model if you 
switch to Author mode.
When you format the document while it is already open in Text mode, 
Oxygen parses the XML and serializes it with the configured formatting 
options. Due to the way the model of a text editor is updated, it is not 
feasible to turn this into a stream and repeatedly update parts of the 
file (e.g. line by line), so the entire document content is replaced 
when the formatting ends. This causes a duplication of the entire 
document in memory. Oxygen also provides Undo for that formatting in 
case you don't like it or have triggered it accidentally, so it also has 
to keep the old document. All of this comes at a high price with regard 
to memory, and running out of memory is what Oxygen usually stumbles 
upon when working with large files.

So, as much as we would like to make it work with large files, the 
amount of memory required to achieve this within an IDE is several times 
larger than what Saxon needs (even assuming Saxon actually builds the 
entire XML model of that document). The solution is to try to serialize 
some of the pieces of the puzzle to disk in order to free memory. This 
is actually what some of the large/huge mode optimizations do, but with 
limited success.

Regards,
Adrian

[1] 
https://www.oxygenxml.com/doc/versions/23.1/ug-editor/topics/large-file-editor.html
[2] 
https://www.oxygenxml.com/doc/versions/23.1/ug-editor/topics/huge-file-editor.html

Adrian Buza
oXygen XML Editor and Author Support

On 18.08.2021 17:28, Roger L Costello wrote:
> Hi Folks,
>
> I have (by today's standards) a medium sized XML file that is 200MB in size. It is unformatted (no indentation). I opened the file in Oxygen and clicked on the format and indent button. After 30 minutes of processing Oxygen gave up with an error message. So I wrote a simple 1-line XSLT program (below) to do the indentation, it took about 15 seconds and was done. Why is it that Oxygen can't indent the file in 30 minutes whereas an XSLT processor (Saxon) can do it in 15 seconds?  /Roger
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>      xmlns:xs="http://www.w3.org/2001/XMLSchema"
>      exclude-result-prefixes="xs"
>      version="2.0">
>      <xsl:output method="xml" indent="yes" />
>      
>      <xsl:template match="/">
>          <xsl:copy-of select="/" />
>      </xsl:template>
>      
> </xsl:stylesheet>


