Extracting all values, all text and all nodes with only wildcards

Brian_donovan
Posts: 3
Joined: Tue Sep 17, 2019 10:27 pm

Extracting all values, all text and all nodes with only wildcards

Post by Brian_donovan »

Hi, a question: I have a 400 MB XML file and I need to extract all values, all text and all nodes from it in blocks, as efficiently as possible. Here is an example of my XML tags:

Code: Select all

<Big report>
	<block something1="A" something2="B" something3="C" something4="D" something5="E"/>
		<inner block>F</inner block>
		<inner block>G</inner block>
		<inner block>"H"</inner block>
		<inner block>
			<inner inner block something1="I" something2="J" something3="K"/>
		</inner block>
	</block>
	<block something1="L" something2="M" something3="N" something4="O" something5="P"/>
		<inner block>"Q"</inner block>
		<inner block>
			<inner inner block something1="R" something2="S" something3="T"/>
		</inner block>
	</block>
	<something else>
	</something else>
<Big report>
What I need is the nodes, values and text from every block. IMPORTANT: I do not know how many blocks, how many somethings and how many inner blocks are in the document, and I do not know the names either, so everything needs to be extracted with only wildcards. Here is the code I have up to this point (obviously it is not perfect). Please help :(

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/">
    <xsl:for-each select="Big report/block">
		<xsl:value-of select="*/@*[3]"/>
		<xsl:value-of select="child::*[.]"/>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
The output I am looking for is the following:

Code: Select all

ABCDEFGHIJK
LMNOPQRST
The "child::*[.]" expression is working nicely, but "*/@*[3]" is not. I do not know how to use a wildcard instead of the 3, and I cannot repeat the code from 1 to 100; there must be a better way. I have also tried "//*", but I just can't make it work right. Any help will be appreciated, thank you all.
Radu
Posts: 9018
Joined: Fri Jul 09, 2004 5:18 pm

Re: Extracting all values, all text and all nodes with only wildcards

Post by Radu »

Hi Brian,

Ideally, when you post sample XML fragments they should be well-formed; that makes it easier for somebody to construct an example based on them.
An XSLT stylesheet which lists all attribute values and all text nodes could look something like this:

Code: Select all

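<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
    <!-- Emit plain text rather than XML -->
    <xsl:output method="text"/>
    <!-- Start at the document node and descend -->
    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>
    <!-- Any element: process its attributes, then its children -->
    <xsl:template match="*">
        <xsl:apply-templates select="@* | node()"/>
    </xsl:template>
    <!-- Any attribute: output its value (text nodes are output by the built-in rule) -->
    <xsl:template match="@*">
        <xsl:value-of select="."/>
    </xsl:template>
</xsl:stylesheet>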
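To illustrate the point about well-formedness, a well-formed variant of the sample from the question might look like the sketch below. The underscores in the element names and the "block" start tags that are not self-closed are assumptions made only so the fragment parses; the real names are unknown.

```xml
<Big_report>
  <block something1="A" something2="B" something3="C" something4="D" something5="E">
    <inner_block>F</inner_block>
    <inner_block>G</inner_block>
    <inner_block>"H"</inner_block>
    <inner_block>
      <inner_inner_block something1="I" something2="J" something3="K"/>
    </inner_block>
  </block>
  <block something1="L" something2="M" something3="N" something4="O" something5="P">
    <inner_block>"Q"</inner_block>
    <inner_block>
      <inner_inner_block something1="R" something2="S" something3="T"/>
    </inner_block>
  </block>
  <something_else>
  </something_else>
</Big_report>
```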
For generic XSLT questions there is also an XSLT users list, which may be a good place to ask:

https://www.mulberrytech.com/xsl/xsl-list/

Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com
Martin Honnen
Posts: 96
Joined: Tue Aug 19, 2014 12:04 pm

Re: Extracting all values, all text and all nodes with only wildcards

Post by Martin Honnen »

The built-in default processing (see https://www.w3.org/TR/xslt-30/#built-in ... -only-copy for XSLT 3; the declarative way it is specified there with "xsl:mode" is backwards compatible with XSLT 1 and 2) copies all attribute values and text node values. So there is not much you need to do beyond relying on it and ensuring that the attributes are processed. If the result of each "block" should form one line (which works as long as the data doesn't contain line breaks), then

Code: Select all


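<xsl:output method="text"/>

<xsl:strip-space elements="*"/>

<!-- Descendants of a block: output attribute values and text, without a line break -->
<xsl:template match="block//*">
  <xsl:apply-templates select="@* | node()"/>
</xsl:template>

<!-- Each block: same processing, then end the line -->
<xsl:template match="block">
  <xsl:apply-templates select="@* | node()"/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>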
suffices.
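
Since the question asks for wildcards only, the same idea could be sketched without any element names, as a complete XSLT 1.0 stylesheet. This is a minimal sketch that assumes every child of the document element counts as one "block"; that assumption is not confirmed by the original post, and it means an element such as "something else" with no content contributes an empty line. In this sketch only the block-level template emits the line break, and a separate template handles the descendants.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Drop whitespace-only text nodes used for indentation -->
  <xsl:strip-space elements="*"/>

  <!-- Any element below a "block": output attribute values, then text and children -->
  <xsl:template match="/*/*//*">
    <xsl:apply-templates select="@* | node()"/>
  </xsl:template>

  <!-- Assumption: each child of the document element is one "block", i.e. one output line -->
  <xsl:template match="/*/*">
    <xsl:apply-templates select="@* | node()"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The built-in rules for attribute and text nodes output their values, so no explicit template for "@*" or "text()" is needed here. Quotation marks that are part of the text, as in "H", are copied unchanged.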

To do it efficiently (in terms of low memory consumption for a huge document): if you use oXygen, where you can set up the latest Saxon 9.9 or 9.8 EE for streaming, then use

Code: Select all


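<xsl:mode streamable="yes"/>

<xsl:output method="text"/>

<xsl:strip-space elements="*"/>

<!-- Descendants of a block: attributes first, then children, without a line break -->
<xsl:template match="block//*">
  <xsl:apply-templates select="@*"/>
  <xsl:apply-templates/>
</xsl:template>

<!-- Each block: same processing, then end the line -->
<xsl:template match="block">
  <xsl:apply-templates select="@*"/>
  <xsl:apply-templates/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>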