Re: [xsl] Hanging regex
Subject: Re: [xsl] Hanging regex
From: David Carlisle <davidc@xxxxxxxxx>
Date: Sat, 17 Nov 2012 12:40:35 +0000
On 17/11/2012 12:05, Ihe Onwuka wrote:
First let me dissect the regex
<xsl:analyze-string select="." flags="x" regex="(.+?) ((-?\d*\s*)+$)"
is targeted at lines of balance sheet text such as below where we do not know how many amounts will occur
1. Total Quick Assets 1,511 2,829 1,694 4,429
(.+?) lazily matches the non-financial half of the line - in this case it will gobble up 1. Total Quick Assets
((-?\d*\s*)+$) captures the financial half - allowing for a leading minus sign - the inner brackets are for grouping not capture.
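To make the dissection above concrete, here is a minimal sketch in Python rather than XSLT. Python's `re` module is only a stand-in for the XPath 2.0 regex dialect, and `re.VERBOSE` is assumed to play the role of flags="x": under both, the literal space between the two groups is ignored. The sample line is assumed to have had its thousands separators removed already.

```python
import re

# The pattern from the post. re.VERBOSE mirrors flags="x", so the literal
# space between the two groups is dropped before matching.
PATTERN = re.compile(r"(.+?) ((-?\d*\s*)+$)", re.VERBOSE)

# Sample line from the post, commas already stripped (assumption).
line = "1. Total Quick Assets 1511 2829 1694 4429"

m = PATTERN.search(line)
label = m.group(1)            # the non-financial half of the line
amounts = m.group(2).split()  # the financial half, split on whitespace

print(label)
print(amounts)
```

Because the space in the pattern is ignored, the second group starts with the blank that separates the label from the first amount; splitting on whitespace discards it.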
Here is some test data - a file containing the following
I. Current Assets 1,871 2,829 1,694 4,429
1. Total Quick Assets 1,511 2,829 1,694 4,429
Short-term financial instrument 31 16 45 -
2. Total Inventories 359 - - -
II. Leased Housing Assets - - - -
III. Deferred Liabilities - - - -
III.Capital Adjustments - - -28 -30
V. Retained Earnings -2,840 -4,664 -4,363 -4,383
**********************************************************************************************************
FINANCIAL INFORMATION 1. Financial Statements
Income Statement ------------------ (Unit : KRW million)
**********************************************************************************************************
Here is the stylesheet
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="xs" version="2.0">
  <xsl:output indent="yes"/>
  <xsl:param name="input" as="xs:string" required="yes"/>

  <xsl:template match="/">
    <!-- read in text whilst removing comma punctuation from monetary fields -->
    <xsl:for-each select="tokenize(replace(unparsed-text($input, 'iso-8859-1'),',(\d{3}\D)','$1'), '\r?\n')">

      <!-- Delete lines that don't contain alphanumeric text -->
      <xsl:if test="matches(.,'\w') and matches(.,'(-|\d)+')">
        <line>
          <xsl:analyze-string select="." flags="x" regex="(.+?) ((-?\d*\s*)+$)">
That's a wildly expensive regex. If I change it to
<xsl:analyze-string select="." flags="x" regex="(.+?) ((-|\d|\s)+$)">
I get identical output for your input, and adding the extra line doesn't make it take appreciably longer.
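For anyone who wants to check the equivalence outside an XSLT processor, the following is a rough sketch in Python. It is not the stylesheet itself: Python's `re` stands in for the XPath regex engine, `re.VERBOSE` for flags="x", and the helper names `split_lines` and `parse` are made up for illustration. It repeats the comma stripping, tokenizing and filtering steps from the stylesheet, then compares what the two patterns capture on the amount-bearing sample lines.

```python
import re

# The two patterns from this thread; re.VERBOSE stands in for flags="x".
ORIGINAL = re.compile(r"(.+?) ((-?\d*\s*)+$)", re.VERBOSE)
SIMPLER = re.compile(r"(.+?) ((-|\d|\s)+$)", re.VERBOSE)

SAMPLE = (
    "I. Current Assets 1,871 2,829 1,694 4,429\n"
    "1. Total Quick Assets 1,511 2,829 1,694 4,429\n"
    "Short-term financial instrument 31 16 45 -\n"
    "2. Total Inventories 359 - - -\n"
    "II. Leased Housing Assets - - - -\n"
    "III. Deferred Liabilities - - - -\n"
    "III.Capital Adjustments - - -28 -30\n"
    "V. Retained Earnings -2,840 -4,664 -4,363 -4,383\n"
)


def split_lines(text):
    """Strip thousands separators, split into lines, drop unusable lines."""
    no_commas = re.sub(r",(\d{3}\D)", r"\1", text)
    lines = re.split(r"\r?\n", no_commas)
    # Same two tests as the xsl:if in the stylesheet.
    return [ln for ln in lines
            if re.search(r"\w", ln) and re.search(r"(-|\d)+", ln)]


def parse(pattern, line):
    """Return (label, amounts) for a line, or None if the pattern fails."""
    m = pattern.search(line)
    return (m.group(1), m.group(2).split()) if m else None


rows_original = [parse(ORIGINAL, ln) for ln in split_lines(SAMPLE)]
rows_simpler = [parse(SIMPLER, ln) for ln in split_lines(SAMPLE)]

for row in rows_simpler:
    print(row)
print(rows_original == rows_simpler)
```

The comparison covers only lines that end in amounts. The two patterns are not strictly interchangeable in general: the second group of the original can match an empty string, whereas the replacement needs at least one hyphen, digit or whitespace character before the end of the line. No timing claim is made here, since backtracking cost depends on the regex engine in use.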
David