[oXygen-user] Why can't Oxygen format and indent a 200MB file in 30 minutes whereas SAXON can do it in 15 seconds?
Oxygen XML Editor Support (Adrian Buza)
support at oxygenxml.com
Thu Aug 19 09:11:45 CDT 2021
Hello,
As the saying goes, it depends...
> I opened the file in Oxygen and clicked on the format and indent
> button. After 30 minutes of processing Oxygen gave up with an error
> message.
You haven't mentioned what the error message was, but I'm pretty sure
Oxygen ran out of memory. Basically, it ran out of memory in the first
few seconds of formatting, and then the Java VM struggled to admit this
fact for the rest of the 30 minutes.
So, first it depends on how much memory Oxygen had available (Help >
About, JVM Memory ... Total).
To keep a long story short, if you want Oxygen to format and indent a
large file as fast as Saxon and not risk running out of memory in the
process, it should do this either without opening the document or at
document opening time.
A. Without opening the document
Use Tools > "Format and Indent Files", or right-click the document in
the Project view and choose "Format and Indent Files".
B. At document opening time
1. Set Options > Preferences > Editor / Format, [x] "Format and indent
the document on open".
2. Close the document.
3. Reopen the document (File > Reopen last closed editor / Ctrl+Alt+T)
4. Afterwards, consider clearing the [ ] "Format and indent the document
on open" box, because the option applies to every document you open.
Read on for the juicy details...
There is actually a huge difference between how Oxygen (an IDE) and
Saxon (a CLI tool) achieve this and what their requirements are, even
though the result may be the same.
I can't really speak for Saxon's inner workings, but it might not even
build an XML model in memory, depending on Saxon's optimizations and on
whether Saxon streaming is used.
In theory, if you use an input stream that reads and parses the XML one
chunk at a time, and an output stream that writes the XML model as the
first one reads, you don't actually have to load the entire thing into
memory for the purpose of formatting it.
Using Saxon streaming would probably be faster than your result and
could work for a file of any size, but I digress.
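
The chunk-at-a-time idea can be sketched in a few lines. This is only an
illustration using Python's standard xml.sax parser, not how Saxon or
Oxygen actually work. The names IndentingHandler and format_stream are
made up for this example. It also cuts corners: it trims whitespace
around text (so it is not safe for mixed content), writes empty elements
as start/end tag pairs, and drops comments, processing instructions and
the XML declaration.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class IndentingHandler(xml.sax.ContentHandler):
    """Hypothetical handler: writes every parse event to `out` right away,
    so memory use follows element depth, not document size."""

    def __init__(self, out, indent="  "):
        super().__init__()
        self.out = out
        self.indent = indent
        self.depth = 0
        self.text = []     # chunks of the current text run only
        self.last = None   # last thing written: "start", "text" or "end"

    def _flush_text(self):
        # Simplification: surrounding whitespace is trimmed.
        text = "".join(self.text).strip()
        self.text.clear()
        if text:
            self.out.write(escape(text))
            self.last = "text"

    def startElement(self, name, attrs):
        self._flush_text()
        if self.last is not None:
            self.out.write("\n")
        self.out.write(self.indent * self.depth + "<" + name)
        for key, value in attrs.items():
            self.out.write(" " + key + "=" + quoteattr(value))
        self.out.write(">")
        self.depth += 1
        self.last = "start"

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.depth -= 1
        # Only put the end tag on its own line after a child element.
        if self.last == "end":
            self.out.write("\n" + self.indent * self.depth)
        self.out.write("</" + name + ">")
        self.last = "end"


def format_stream(source, out):
    """Parse `source` (file name or binary file object) incrementally and
    write the indented result to the text stream `out`."""
    xml.sax.parse(source, IndentingHandler(out))
    out.write("\n")
```

Called with a file name as the source and an output file opened for
writing in text mode, format_stream never holds the whole document in
memory, only the current text run and the nesting depth.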
By Oxygen's standards 200MB is a large file (> 30MB). That means some
optimizations are enforced to accommodate a file of this size. [1]
For 300MB or more, Oxygen has a "huge files" mode that no longer loads
the entire document in memory and has some more severe limitations. [2]
So this file is closer to Oxygen's "huge" limit than to the "large" one.
Because Oxygen is an IDE, it loads the document in memory as text (with
the exceptions/optimizations mentioned above) and then builds various
specialized models from the document so that you have all those editing
helpers (Outline, Attributes, Model) or a much more complex model if you
switch to Author mode.
When you format the document while it is already open in Text mode,
Oxygen parses the XML and serializes it with the configured formatting
options. Due to the way the model of a text editor is updated, it is not
feasible to make this into a stream and repeatedly update parts of the
file (e.g. line by line), so the entire document content is replaced
when the formatting ends. This causes a duplication of the entire
document in memory. Oxygen also provides Undo for that formatting in
case you don't like it or have triggered it accidentally, so it also has
to keep the old document. All of this comes at a high price in memory,
and running out of memory is what Oxygen usually stumbles on when
working with large files.
So, as much as we would want to make it work with large files, the
amount of memory required to achieve this within an IDE is several times
larger than Saxon's (assuming Saxon would actually build the entire XML
model of that document). The solution is to try to serialize some of the
pieces of the puzzle to disk in order to free memory. This is actually
what some of the large/huge mode optimizations do, but with limited
success.
Regards,
Adrian
[1]
https://www.oxygenxml.com/doc/versions/23.1/ug-editor/topics/large-file-editor.html
[2]
https://www.oxygenxml.com/doc/versions/23.1/ug-editor/topics/huge-file-editor.html
Adrian Buza
oXygen XML Editor and Author Support
On 18.08.2021 17:28, Roger L Costello wrote:
> Hi Folks,
>
> I have (by today's standards) a medium sized XML file that is 200MB in size. It is unformatted (no indentation). I opened the file in Oxygen and clicked on the format and indent button. After 30 minutes of processing Oxygen gave up with an error message. So I wrote a simple 1-line XSLT program (below) to do the indentation, it took about 15 seconds and was done. Why is it that Oxygen can't indent the file in 30 minutes whereas an XSLT processor (Saxon) can do it in 15 seconds? /Roger
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:xs="http://www.w3.org/2001/XMLSchema"
> exclude-result-prefixes="xs"
> version="2.0">
> <xsl:output method="xml" indent="yes" />
>
> <xsl:template match="/">
> <xsl:copy-of select="/" />
> </xsl:template>
>
> </xsl:stylesheet>