Validating regular expressions in schema 1.0 with oXygen

Post by **mhGLEIF** » Thu Nov 24, 2016 1:22 pm

Hello all,

I'm creating a schema with a number of regular expressions restricting some of the datatypes.

The regexes themselves appear to work: I've tested them separately outside oXygen and they behave as I'd expect.

However, the test XML instances that oXygen generates from the schema include characters and combinations that should not validate against those regexes.

Is it oXygen, the regexes or the schema that is failing to stop the bad data?

The basic token type allows between 1 and 500 characters. It may not contain carriage return (#xD), line feed (#xA) or tab (#x9) characters, may not begin or end with a space (#x20) character, and may not contain a sequence of two or more adjacent space characters:

Code: Select all

<xs:simpleType name="Tokenized500Type">
    <xs:restriction base="xs:string">
        <xs:maxLength value="500"/>
        <xs:minLength value="1"/>
        <xs:pattern value="\S+( \S+)*"/>
    </xs:restriction>
</xs:simpleType>

>>> I want all the token type (above) restrictions to apply at once i.e. AND - can anyone confirm that's how the combination of maxLength, minLength and pattern will apply?

>>> It should not be necessary to split them out into multiple xs:restriction tags?

The transliteration type below should only allow ASCII characters.

Code: Select all

<xs:simpleType name="TransliteratedStringType">
    <xs:annotation>
        <xs:documentation> can only contain non-control characters drawn from the “invariant subset”
            of ISO 646 (i.e. ASCII). </xs:documentation>
    </xs:annotation>
    <xs:restriction base="example:Tokenized500Type">
        <xs:pattern
            value="(!|&quot;|%|&amp;|'|\(|\)|\*|\+|,|-|.|\/|0|1|2|3|4|5|6|7|8|9|:|;|&lt;|=|>|\?|A|B|C|D|E|F|G|H|I|J|K|L| |M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+"
        />
    </xs:restriction>
</xs:simpleType>

>>> The above transliteration restriction should apply to the token type even if the token type is not functioning correctly, right?

The following element uses the transliteration type above. Its type is built as follows:

Code: Select all


<xs:complexType name="TransliteratedNameType">
    <xs:simpleContent>
      <xs:extension base="example:TransliteratedStringType">
        <xs:attribute ref="xml:lang" use="optional">
          <xs:annotation>
            <xs:documentation>The language of this element's text content. An IETF Language Code
              conforming to the latest RFC from IETF BCP 47. Note that the first characters of an
              IETF Language Code, up to the hyphen (if any), are all lowercase, and those following
              the hyphen (if any) are all uppercase.<br/>
            </xs:documentation>
          </xs:annotation>
        </xs:attribute>
      </xs:extension>
    </xs:simpleContent>
</xs:complexType>

...

<xs:complexType name="TransliteratedOtherEntityNameType">
    <xs:complexContent>
      <xs:extension base="lei:TransliteratedNameType">
        <xs:attribute name="type" type="lei:TransliteratedEntityNameTypeEnum" use="required">
          <xs:annotation>
            <xs:documentation>Type of alternative name for the legal entity.</xs:documentation>
          </xs:annotation>
        </xs:attribute>
      </xs:extension>
    </xs:complexContent>
</xs:complexType>

Here's the data I'm testing in oXygen:

Code: Select all


<TransliteratedOtherEntityName type="AUTO_ASCII_TRANSLITERATED_LEGAL_NAME" xml:lang="de-DE">Dr. Bäcker</TransliteratedOtherEntityName>

>>> Why does the data in the element validate? The "ä" should cause it to fail...

Post by **adrian** » Thu Nov 24, 2016 6:52 pm

Hello,

>>> I want all the token type (above) restrictions to apply at once i.e. AND - can anyone confirm that's how the combination of maxLength, minLength and pattern will apply?

>>> It should not be necessary to split them out into multiple xs:restriction tags?

That is the correct way to combine a restriction of minimum/maximum length + pattern. No, you don't have to split it.
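
As a rough illustration of how the facets combine, here is a small sketch of instance values checked against Tokenized500Type. The Samples and Token element names are hypothetical and do not appear in the schema from this thread; the point is only that a value is valid when it satisfies minLength, maxLength and the pattern at the same time.

```xml
<!-- Hypothetical wrapper and element names; assume Token is declared with type Tokenized500Type. -->
<Samples>
  <!-- Valid: 1 to 500 characters, runs of non-space characters separated by single spaces. -->
  <Token>Dr. Baecker</Token>
  <!-- Invalid: leading space, so the pattern \S+( \S+)* does not match. -->
  <Token> Dr. Baecker</Token>
  <!-- Invalid: two adjacent spaces between the words. -->
  <Token>Dr.  Baecker</Token>
  <!-- Invalid: a tab (character reference 9) is whitespace, so it is neither \S nor the literal space. -->
  <Token>Dr.&#9;Baecker</Token>
  <!-- Invalid: the empty value fails both minLength and the pattern. -->
  <Token></Token>
</Samples>
```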

Code: Select all

    <xs:restriction base="example:Tokenized500Type">
        <xs:pattern
            value="(!|&quot;|%|&amp;|'|\(|\)|\*|\+|,|-|.|\/|0|1|2|3|4|5|6|7|8|9|:|;|&lt;|=|>|\?|A|B|C|D|E|F|G|H|I|J|K|L| |M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)+"
        />
    </xs:restriction>

>>> The above transliteration restriction should apply to the token type even if the token type is not functioning correctly, right?

Yes, but that is not the case here; the base type seems just fine.
BTW, "/" does not need escaping ("\/" in your pattern). Saxon-EE actually treats this unnecessary escape as an error, so I would remove it.

>>> Why does the data in the element validate? The "ä" should cause it to fail...

You forgot to escape the "." in the pattern. An unescaped "." matches any single character other than a line break, including "ä", which makes the rest of the alternation redundant. Write it as "\." instead.
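
A minimal sketch of the corrected type, assuming the same set of allowed characters as the original alternation. The fix discussed here is simply "\." in place of "." (and "/" in place of "\/"); the character class below is an equivalent, more compact rewrite that is not taken from either post. The urn:example:placeholder namespace and the Name element are placeholders added only to make the fragment a complete schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the target namespace and the Name element are placeholders. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:example="urn:example:placeholder"
           targetNamespace="urn:example:placeholder"
           elementFormDefault="qualified">

  <!-- Base type as given in the first post. -->
  <xs:simpleType name="Tokenized500Type">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
      <xs:maxLength value="500"/>
      <xs:pattern value="\S+( \S+)*"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- A value must match this pattern AND the base type's pattern,
       because patterns from different derivation steps are combined with AND. -->
  <xs:simpleType name="TransliteratedStringType">
    <xs:restriction base="example:Tokenized500Type">
      <!-- Inside a character class "." is literal; the hyphen is escaped so it is not read as a range. -->
      <xs:pattern value="[!&quot;%&amp;'()*+,\-./0-9:;&lt;=&gt;?A-Z_a-z ]+"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical element for trying the type out. -->
  <xs:element name="Name" type="example:TransliteratedStringType"/>
</xs:schema>
```

With this pattern, a value such as "Dr. Baecker" should validate, while "Dr. Bäcker" should be rejected because "ä" is not in the character class.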

Regards,
Adrian

Post by **mhGLEIF** » Fri Nov 25, 2016 1:19 pm

Thank you, this gave the correct behaviour!
