Re: [xml-dev] A dandy little technique for constraining your strings to A

If the schema is a published interface for a large system, rather than some changeable part inside a black box, then you should consider having a two-layer schema.

The first layer is almost a data dictionary or vocabulary: it carries only the absolutely necessary constraints, such as "this element is an xs:string", plus the most minimal containment and optionality rules.
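As a rough sketch of what such a first layer might look like, here is a minimal vocabulary-style schema. It borrows the Test, Name, and Description element names from the schema quoted further down; the use of xs:all (any order) and the optional Description are my assumptions for illustration, not part of the original.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical first-layer ("vocabulary") schema: names and base types only. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="Test">
        <xs:complexType>
            <!-- xs:all places no constraint on the order of the children -->
            <xs:all>
                <!-- No length or pattern facets at this level, just the base type -->
                <xs:element name="Name" type="xs:string" />
                <!-- Assumption: Description is optional in the vocabulary layer -->
                <xs:element name="Description" type="xs:string" minOccurs="0" />
            </xs:all>
        </xs:complexType>
    </xs:element>

</xs:schema>
```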

The second layer is the pragmatic one: lengths, restrictions, order, grouping, cardinality, system limitations, business rules. It could be derived by restriction from the first using XSD, or overlaid using Schematron.
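To illustrate the Schematron option, here is a minimal sketch of an overlay. The element names come from the schema quoted further down; the specific rules (element order, a 10-character limit, ASCII only) and the pattern id are assumptions chosen for the example. It uses queryBinding="xslt2" so that the XPath 2.0 matches() function is available.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical second-layer overlay; the rules below are illustrative assumptions. -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">

    <sch:pattern id="pragmatic-limits">

        <sch:rule context="Test">
            <!-- Order: no Name may appear after a Description -->
            <sch:assert test="not(Description/following-sibling::Name)">Name must precede Description.</sch:assert>
        </sch:rule>

        <sch:rule context="Test/Name">
            <!-- System limitation: length -->
            <sch:assert test="string-length(.) &lt;= 10">Name must be at most 10 characters.</sch:assert>
            <!-- Business rule: ASCII (Basic Latin block) only -->
            <sch:assert test="matches(., '^\p{IsBasicLatin}*$')">Name must contain only ASCII characters.</sch:assert>
        </sch:rule>

    </sch:pattern>

</sch:schema>
```

Because the overlay lives in its own file, it can be replaced without touching the vocabulary schema.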

Systems then create data using the second-level schema but accept data using the first-level schema if possible. (If the receiving or generating system must pick one schema for data binding, then a transform can be provided on the input or output that makes documents valid against the first schema also valid against the second, as far as order goes.)
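Such a reordering transform could be sketched in XSLT 1.0 roughly as follows. It assumes the Test, Name, and Description elements from the schema quoted further down, and assumes the second-level schema wants Name before Description. Note that it drops any other children of Test, including whitespace text, which may or may not be what you want.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical normalizing transform: puts the children of Test into a fixed order. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity template: copy everything else unchanged -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <!-- Emit Name first, then Description, whatever their order in the input -->
    <xsl:template match="Test">
        <xsl:copy>
            <xsl:apply-templates select="@*" />
            <xsl:apply-templates select="Name" />
            <xsl:apply-templates select="Description" />
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
```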

If you don't have the two layers, you don't have any objective way to be conservative in what you send and generous in what you receive.

If there is a need to internationalize, then a new second-level schema is constructed. But systems that were built to cope with the first-level schema will not need to be changed.

On 22/10/2015 4:07 AM, "Costello, Roger L." <costello@mitre.org> wrote:

Hi Folks,

So, you’ve created an XML schema. And it contains a lot of elements and attributes of type string.

You want each string constrained to just ASCII characters. Use the pattern facet for that.

Here’s a dandy little technique you can use:

At the top of your schema, place this named entity declaration:

<!DOCTYPE xs:schema [
<!ENTITY ASCII "[\p{IsBasicLatin}]*">
]>

The entity ( ASCII ) can then be referenced in each pattern facet:

<xs:simpleType name="NameType">
    <xs:restriction base="xs:string">
        <xs:maxLength value="10" />
        <xs:pattern value="&ASCII;" />
    </xs:restriction>
</xs:simpleType>

<xs:simpleType name="DescriptionType">
    <xs:restriction base="xs:string">
        <xs:maxLength value="20" />
        <xs:pattern value="&ASCII;" />
    </xs:restriction>
</xs:simpleType>

At parse-time the XML parser will substitute each entity reference ( &ASCII; ) with its replacement text ( [\p{IsBasicLatin}]* ).

The entity provides useful documentation; i.e., I assert that this:

<xs:pattern value="&ASCII;" />

is more readable than this:

<xs:pattern value="[\p{IsBasicLatin}]*" />

Here’s a complete schema to illustrate the technique:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xs:schema [
<!ENTITY ASCII "[\p{IsBasicLatin}]*">
]>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="Test">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Name" type="NameType" />
                <xs:element name="Description" type="DescriptionType" />
            </xs:sequence>
        </xs:complexType>
    </xs:element>

    <xs:simpleType name="NameType">
        <xs:restriction base="xs:string">
            <xs:maxLength value="10" />
            <xs:pattern value="&ASCII;" />
        </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="DescriptionType">
        <xs:restriction base="xs:string">
            <xs:maxLength value="20" />
            <xs:pattern value="&ASCII;" />
        </xs:restriction>
    </xs:simpleType>

</xs:schema>
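
For illustration, an instance document along these lines should validate against the schema above. The element values are made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance: both values are ASCII and within the maxLength limits. -->
<Test>
    <Name>Widget</Name>
    <Description>A small ASCII part</Description>
    <!-- A Name such as "Café" would be rejected by the pattern facet,
         because é (U+00E9) lies outside the Basic Latin block. -->
</Test>
```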

/Roger