
RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl


Subject: RE: [xsl] Converting CSV to XML without hardcoding schema details in xsl
From: "Pantvaidya, Vishwajit" <vpantvai@xxxxxxxxxxxxx>
Date: Tue, 27 Jun 2006 14:48:17 -0700

>From: Michael Kay [mailto:mike@xxxxxxxxxxxx]
>Sent: Saturday, June 24, 2006 12:41 AM
>> >
>> >There's a lot of potential backtracking here: it might be better to
>> >replace each "(.*)," with "([^,]*)," or with "(.*?),".
>>
>> [Pantvaidya, Vishwajit] Does "[^,]*" work the same as "(.*),"
>> - I understand that ^ is start of line metachar. How does the
>> former match the alphabet chars?
>
>No, within square brackets, ^ means "not". So [^,]* matches a sequence of
>any characters except comma.
>
>The problem with your expression is that (.*) matches as many characters as
>it can. Then it sees ",", so it backtracks to find the last comma. Then it
>sees the next (.*), and has to backtrack again; and so on.
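
To make the difference concrete, here is a small sketch in Python. That is an assumption on my part: the thread is about XSLT 2.0 regular expressions, and Python's re module is a different flavour, though greedy, lazy and negated-class matching behave the same way for these patterns. The sample line and the helper names are made up for illustration.

```python
import re

# Hypothetical sample line; every field ends with a comma, as in the thread.
LINE = "alpha,beta,gamma,"

def first_field(pattern, line=LINE):
    """Return what the first capture group holds after matching from the start."""
    return re.match(pattern, line).group(1)

def three_fields(pattern, line=LINE):
    """Return all three capture groups for a three-field pattern."""
    return re.match(pattern, line).groups()

# Greedy: (.*) swallows the whole line, then backs up to the LAST comma.
print(first_field(r"(.*),"))      # alpha,beta,gamma
# Lazy: (.*?) grows one character at a time and stops at the FIRST comma.
print(first_field(r"(.*?),"))     # alpha
# Negated class: [^,]* cannot cross a comma, so no backing up is needed.
print(first_field(r"([^,]*),"))   # alpha

# With three groups all variants find the same fields; the greedy one just
# does more work backtracking to get there.
print(three_fields(r"(.*),(.*),(.*),"))
print(three_fields(r"([^,]*),([^,]*),([^,]*),"))
```

The results are identical once the pattern is forced to account for all three commas; the point of the negated class is that it gets there without the repeated backtracking described above.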
>>
>> >
>> >My own instinct would be to use something like:
>> >
>> >([^"]*,|"[^"]*",)*
>> >
>>
>> [Pantvaidya, Vishwajit] Oxygen would not accept this regex as
>> "it matches a zero-length string".
>
>Perhaps then you want to change the final "*" to a "+".
>
[Pantvaidya, Vishwajit] That is the first thing I tried when the * did not
work, but even then it does not seem to be working.
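
For reference, the zero-length complaint can be reproduced outside Oxygen. The sketch below uses Python's re module, which is an assumption (a different regex flavour from XSLT 2.0), and the helper name is made up. It only shows why the "*" form matches an empty string and the "+" form does not; it does not explain why the "+" form still failed in the stylesheet.

```python
import re

# The regex suggested in the thread, with "*" and with "+" as the final quantifier.
STAR = re.compile(r'([^"]*,|"[^"]*",)*')
PLUS = re.compile(r'([^"]*,|"[^"]*",)+')

def matches_empty(pattern):
    """True if the pattern can match a zero-length string."""
    return pattern.fullmatch("") is not None

# "*" allows zero repetitions, so the empty string matches.
print(matches_empty(STAR))   # True
# "+" needs at least one repetition, and each alternative requires a comma.
print(matches_empty(PLUS))   # False
```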

>> Anyway, how does this regex work - it does not seem to have
>> anything that matches the alphabet chars.
>
>See above: [^"] matches everything except quotes.
>
>> And does the ,|" match comma or double quotes - because
>> actually some field will have both.
>
>The first alternative, [^"]*, matches any field that ends with a comma and
>doesn't contain a quotation mark. The second alternative, "[^"]*", matches
>any field that begins and ends with quotes (followed by a comma), and might
>contain a comma between the quotes.
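
A rough sketch of the two alternatives at work, again in Python as an assumed stand-in for the XSLT 2.0 regex engine. One deliberate change that is not in the thread: the first alternative uses [^",]* rather than [^"]*, so that an unquoted field stops at its own comma instead of running on through the following unquoted fields. The function name and sample data are hypothetical.

```python
import re

# Group 1: unquoted field followed by a comma.
# Group 2: quoted field (which may contain commas) followed by a comma.
FIELD = re.compile(r'([^",]*),|"([^"]*)",')

def split_fields(line):
    """Split a line in which EVERY field, including the last, ends with a comma."""
    fields = []
    for m in FIELD.finditer(line):
        bare, quoted = m.group(1), m.group(2)
        # Exactly one of the two groups took part in each match.
        fields.append(quoted if quoted is not None else bare)
    return fields

print(split_fields('a,b,"c,d",e,'))   # ['a', 'b', 'c,d', 'e']
print(split_fields('a,,b,'))          # ['a', '', 'b']
```

Matching one field per match, rather than repeating a group over the whole line, keeps every field available; a repeated capture group only retains its last repetition.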
>
>It's very hard to find out what the exact rules for CSV files used by a
>particular product are: for example, how it represents a field that contains
>quotation marks as well as commas. (That's one of the great advantages of
>XML: you can find a specification!) If you know the exact rules for your
>particular flavour of CSV, you can adapt the regex to match (well, you can
>if you study a bit more about regular expressions).
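
As an illustration of adapting the regex to one particular flavour, the sketch below assumes a convention in which a quotation mark inside a quoted field is written as two quotation marks (""). That convention is an assumption; the thread does not say which rules the poster's CSV follows. It is again Python rather than XSLT, with a hypothetical function name, and it keeps the thread's premise that every field ends with a comma.

```python
import re

# Group 1: unquoted field. Group 2: quoted field whose body is any run of
# non-quote characters or doubled quotes ("").
FIELD = re.compile(r'([^",]*),|"((?:[^"]|"")*)",')

def split_fields_doubled_quotes(line):
    """Split a comma-terminated line, assuming "" escapes a quote inside quotes."""
    fields = []
    for m in FIELD.finditer(line):
        if m.group(2) is not None:
            # Collapse each doubled quote back to a single quote.
            fields.append(m.group(2).replace('""', '"'))
        else:
            fields.append(m.group(1))
    return fields

print(split_fields_doubled_quotes('x,"say ""hi"", ok",y,'))
# ['x', 'say "hi", ok', 'y']
```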
>>
>>
>> Maybe this conversion is easier done with some Java code.
>>
>I'm sure it can be done using regular expressions but it looks as if you
>need to do some learning in this area.
>
[Pantvaidya, Vishwajit] Thanks a lot for all the clarifications and help.
Actually I did look at the regex documentation in the XSLT2 spec, but not
very exhaustively. The info on back-references I found there made me feel
that they could be useful here, e.g. to tell the regex that if a
starting quote is found, it should look for an ending one. But the more I
look into it, the more it seems like I may not be able to use them.

Thanks and regards,

Vish.

