
Re: [xsl] "Heap" of trouble handling input file of 500 MByte


Subject: Re: [xsl] "Heap" of trouble handling input file of 500 MByte
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Tue, 22 Feb 2011 09:02:30 +0000

> IIRC some time back the recommendation used to be 10x Mike?
> If that's correct, what's changed please? Just Saxon getting smarter?

I think I used to say 10x before the TinyTree came along, but that's a very long time ago. Since the introduction of the TinyTree any improvements have been relatively minor (e.g. whitespace compression). 4x is probably the best you'll achieve, but I've seen a number of people report achieving it. A more detailed sizing (assuming no attribute nodes, no type information, no backwards navigation, and no keys) is:


19 bytes per element node
19 bytes for a whitespace text node
19 + 2x bytes for a non-whitespace text node, where x is the number of characters


It's not unusual to see documents where most of the lines are, say, 40 characters long, and each accounts for one element, one whitespace text node, and one 20-character text node. That means 40 bytes of source translates to 19 + 19 + (19 + 2x20) = 97 bytes of TinyTree space, giving an expansion factor of roughly 2.4.
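
For anyone wanting to redo this arithmetic for their own documents, here is a minimal Python sketch of the sizing rule above. The function name and the way node counts are passed in are my own assumptions for illustration, not anything Saxon provides. It covers only the simplified case described (no attribute nodes, no type information, no backwards navigation, no keys).

```python
# Per-node costs as quoted in the message above (TinyTree, simplified case).
ELEMENT_BYTES = 19
WHITESPACE_TEXT_BYTES = 19
TEXT_NODE_BASE_BYTES = 19
BYTES_PER_CHAR = 2


def estimate_tinytree_bytes(elements, whitespace_text_nodes, text_lengths):
    """Rough TinyTree size estimate in bytes (hypothetical helper).

    elements: number of element nodes
    whitespace_text_nodes: number of whitespace-only text nodes
    text_lengths: character counts, one per non-whitespace text node
    """
    total = elements * ELEMENT_BYTES
    total += whitespace_text_nodes * WHITESPACE_TEXT_BYTES
    for length in text_lengths:
        total += TEXT_NODE_BASE_BYTES + BYTES_PER_CHAR * length
    return total


# The 40-byte line from the example: one element, one whitespace
# text node, and one 20-character text node.
line_bytes = estimate_tinytree_bytes(1, 1, [20])
expansion = line_bytes / 40
print(line_bytes, expansion)
```

For the example line this gives 97 bytes, a ratio a little over 2.4 and in line with the rough figure quoted above. Real documents with attributes or keys would need more, so treat the result as a lower bound.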

In my IEEE Data Engineering paper a couple of years ago (http://sites.computer.org/debull/A08dec/saxonica.pdf), I measured the memory occupied by the 100 Mbyte XMark test document at 327 Mbytes, an expansion factor of about 3.3, and this agreed well with the theoretical sizing.

Michael Kay
Saxonica

