Validating regular expressions in schema 1.0 with oXygen

Posted: Thu Nov 24, 2016 1:22 pm
by mhGLEIF
Hello all,

I'm creating a schema with a number of regular expressions restricting some of the datatypes.

The regexes themselves appear to work: I've tested them separately outside oXygen and they do what I'd expect.

oXygen generates test XML instances from the schema including some characters / combinations that shouldn't validate against those regex though.

Is this oXygen, the regex or the schema that's not stopping the bad data?

The basic token type allows between 1 and 500 characters. It may not contain carriage return (#xD), line feed (#xA) or tab (#x9) characters, may not begin or end with a space (#x20) character, and may not contain a sequence of two or more adjacent space characters:

<xs:simpleType name="Tokenized500Type">
<xs:restriction base="xs:string">
<xs:maxLength value="500"/>
<xs:minLength value="1"/>
<xs:pattern value="\S+( \S+)*"/>
</xs:restriction>
</xs:simpleType>
>>> I want all the token type (above) restrictions to apply at once i.e. AND - can anyone confirm that's how the combination of maxLength, minLength and pattern will apply?

>>> It should not be necessary to split them out into multiple xs:restriction tags?

The transliteration type below should only allow ASCII characters.

<xs:simpleType name="TransliteratedStringType">
<xs:annotation>
<xs:documentation> can only contain non-control characters drawn from the “invariant subset”
of ISO 646 (i.e. ASCII). </xs:documentation>
</xs:annotation>
<xs:restriction base="example:Tokenized500Type">
<xs:pattern
value="(!|&quot;|%|&amp;|'|\(|\)|\*|\+|,|-|.|\/|0|1|2|3|4|5|6|7|8|9|:|;|&lt;|=|&gt;|\?|A|B|C|D|E|F|G|H|I|J|K|L| |M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+"
/>
</xs:restriction>
</xs:simpleType>
>>> The above transliteration restriction should apply to the token type even if the token type is not functioning correctly, right?

The following element uses the transliteration type above. The element's type is built as follows:

<xs:complexType name="TransliteratedNameType">
<xs:simpleContent>
<xs:extension base="example:TransliteratedStringType">
<xs:attribute ref="xml:lang" use="optional">
<xs:annotation>
<xs:documentation>The language of this element's text content. An IETF Language Code
conforming to the latest RFC from IETF BCP 47. Note that the first characters of an
IETF Language Code, up to the hyphen (if any), are all lowercase, and those following
the hyphen (if any) are all uppercase.<br/>
</xs:documentation>
</xs:annotation>
</xs:attribute>
</xs:extension>
</xs:simpleContent>
</xs:complexType>

...

<xs:complexType name="TransliteratedOtherEntityNameType">
<xs:complexContent>
<xs:extension base="lei:TransliteratedNameType">
<xs:attribute name="type" type="lei:TransliteratedEntityNameTypeEnum" use="required">
<xs:annotation>
<xs:documentation>Type of alternative name for the legal entity.</xs:documentation>

</xs:annotation>
</xs:attribute>
</xs:extension>
</xs:complexContent>
</xs:complexType>
Here's the data I'm testing in oXygen:

<TransliteratedOtherEntityName type="AUTO_ASCII_TRANSLITERATED_LEGAL_NAME" xml:lang="de-DE">Dr. Bäcker</TransliteratedOtherEntityName>
>>> Why does the data in the element validate? The "ä" should cause it to fail...

Re: Validating regular expressions in schema 1.0 with oXygen

Posted: Thu Nov 24, 2016 6:52 pm
by adrian
Hello,
>>> I want all the token type (above) restrictions to apply at once i.e. AND - can anyone confirm that's how the combination of maxLength, minLength and pattern will apply?

>>> It should not be necessary to split them out into multiple xs:restriction tags?
That is the correct way to combine a restriction of minimum/maximum length + pattern. No, you don't have to split it.
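
To make the AND behaviour concrete, here is a minimal sketch of the same type with a few sample values. It is a fragment meant to sit inside an xs:schema element, and the values in the comment are made up for illustration rather than taken from the post. One related detail from the XML Schema datatypes specification: several xs:pattern facets inside the same xs:restriction are ORed, whereas patterns added in successive derivation steps (as with a type derived from this one) are ANDed.

```xml
<xs:simpleType name="Tokenized500Type">
  <xs:restriction base="xs:string">
    <!-- A value is valid only if it satisfies minLength AND maxLength AND pattern. -->
    <xs:minLength value="1"/>
    <xs:maxLength value="500"/>
    <xs:pattern value="\S+( \S+)*"/>
  </xs:restriction>
</xs:simpleType>

<!-- Illustrative values (assumptions, not from the original post):
       "Dr. Becker"   valid
       " Dr. Becker"  invalid: leading space
       "Dr.  Becker"  invalid: two adjacent spaces
       ""             invalid: fails minLength and the pattern
-->
```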

    <xs:restriction base="example:Tokenized500Type">
<xs:pattern
value="(!|&quot;|%|&amp;|'|\(|\)|\*|\+|,|-|.|\/|0|1|2|3|4|5|6|7|8|9|:|;|&lt;|=|&gt;|\?|A|B|C|D|E|F|G|H|I|J|K|L| |M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+"
/>
</xs:restriction>
>>> The above transliteration restriction should apply to the token type even if the token type is not functioning correctly, right?
Yes, but that is not the case here: the base type seems just fine.
BTW, "/" does not need escaping ("\/" in your pattern). Saxon-EE actually treats this unnecessary escaping as an error, so I would remove it.
>>> Why does the data in the element validate? The "ä" should cause it to fail...
You forgot to escape the "." in the pattern, so it matches anything, thus making the rest of the expression redundant.
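
As a rough sketch of one way to apply both fixes, the long alternation could be collapsed into a single character class. This is a rewrite for illustration, not the pattern used in the thread, and it assumes the intended character set is exactly the one listed in the original alternation (including the space). The minimal fix is simply to change "." to "\." and "\/" to "/" in the existing pattern.

```xml
<xs:simpleType name="TransliteratedStringType">
  <xs:restriction base="example:Tokenized500Type">
    <!-- Character-class form (an assumption, not the pattern from the thread).
         Inside [...] the "." is a literal full stop, "/" needs no escape,
         and "-" is written as "\-". -->
    <xs:pattern value="[!&quot;%&amp;'()*+,\-./0-9:;&lt;=&gt;?A-Z_a-z ]+"/>
  </xs:restriction>
</xs:simpleType>
```

With either form, a value such as "Dr. Bäcker" should no longer validate, because "ä" is not in the allowed set and the "." no longer matches arbitrary characters.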

Regards,
Adrian

Re: Validating regular expressions in schema 1.0 with oXygen

Posted: Fri Nov 25, 2016 1:19 pm
by mhGLEIF
Thank you, this gave the correct behaviour!