RE: XML Schema regex
- From: Robin Cover <robin@isogen.com>
- To: "Bullard, Claude L (Len)" <clbullar@ingr.com>
- Date: Thu, 16 Aug 2001 08:54:59 -0500 (CDT)
Of possible interest WRT Len's question on regexes:
"Regular expressions for checking dates"
By Eric Howland and David Niergarth
In Markup Languages: Theory and Practice
ISSN 1099-6621
http://mitpress.mit.edu/MLANG
Volume 2, Issue 2 (Spring 2000), pages 126-132
WRT MLTP Contest on writing the shortest correct
regular expressions for dates...
* Date checking regular expression that catches all bad dates: 245 characters long
* Same regular expression without \D characters: 277 characters
* Two regular expressions, the first of which must match to ensure that
the expression is well formed and the second of which must not match to catch
all the bad numbers: 184 characters total
Below we present several regular expressions for checking dates
including leap years. These expressions were inspired by the article
by C.M. Sperberg-McQueen in Markup Languages (**see [Sperberg-McQueen
1999]). Specifically they were inspired by the challenge at the end
of the article (and the date on which that challenge expires) to
shorten the long regular expression generated by lex.
** http://xml.coverpages.org/mltpTOC14.html#MLTP-14Sperberg
The regular expression offered here is, unfortunately, not
deterministic but it is more than an order of magnitude shorter than
the regular expression generated by lex. The expression is the inverse
of the lex expression and actually finds incorrect dates rather than
correct dates. The proposed expression uses the \D convention of Perl
and Python to detect characters that are not numbers and the
.{1,8} notation to indicate a string of one to eight characters. Note
that a somewhat longer (and arguably less readable) version of the
regular expression is also included in case you find those conventions
distasteful.
Also note that about 40% of this expression is dedicated to finding
poorly formed dates (dates not in the nnnn-nn-nn format where n is a
number). This implies that a much shorter total expression is
possible if one allows two passes (one pass to ensure that the
potential date is well-formed and the second pass to detect incorrect
dates). The percentage saved by using two passes is even larger when
the Python conventions are not allowed. Using two passes is, however,
a less aesthetically pleasing response to the challenge.
The expression is a series of tests for errors OR'ed
together. Perhaps the easiest way to understand this expression is to
see how it is built up from the various types of possible errors. This
approach turns out to be effective, but it is hard to guarantee that
all possible errors have been found.
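As a rough illustration of that build-up (not the expression from the article, which is not reproduced here), the following Python sketch uses the two-pass variant: one pattern checks the nnnn-nn-nn shape, and a second pattern, assembled by OR'ing one alternative per type of error, must not match. The names WELL_FORMED, NOT_DIV4, ERROR_TESTS, BAD_DATE and is_valid are made up for this example, and it assumes Gregorian leap-year rules applied to every year, including 0000.

```python
import re

# Pass 1: the string must have the nnnn-nn-nn shape (used with fullmatch).
WELL_FORMED = re.compile(r"\d{4}-\d{2}-\d{2}")

# Two-digit numbers that are NOT divisible by 4.
NOT_DIV4 = r"(?:[02468][1235679]|[13579][01345789])"

# Pass 2: one alternative per type of error, assuming pass 1 succeeded.
ERROR_TESTS = [
    r"-(?:00|1[3-9]|[2-9]\d)-",                    # month 00 or 13..99
    r"-(?:00|3[2-9]|[4-9]\d)$",                    # day 00 or 32..99
    r"-(?:02|04|06|09|11)-31$",                    # day 31 in a shorter month
    r"-02-30$",                                    # February 30
    rf"^(?:\d\d{NOT_DIV4}|{NOT_DIV4}00)-02-29$",   # February 29, non-leap year
]
BAD_DATE = re.compile("|".join(ERROR_TESTS))


def is_valid(text):
    """True if text is well formed and no error test matches it."""
    return WELL_FORMED.fullmatch(text) is not None and BAD_DATE.search(text) is None


if __name__ == "__main__":
    for sample in ("2000-02-29", "1900-02-29", "2001-04-31", "2001-1-01"):
        print(sample, is_valid(sample))
```

Each alternative can be read and tested on its own, which is the appeal of this style; the drawback is the one noted above, that nothing in the construction itself proves the list of alternatives is complete.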
Because the challenge specifies a well defined (and enforceable) format
for the input to be tested, it is possible to exhaustively test for
errors. A Python program has been created (the second listing below)
that exhaustively tests all dates of the form nnnn-nn-nn (where n is a
number) using both algorithmic and regular expression based tests. A
comparison of the results from these two methods exposes any errors in
the regular expression and guarantees that the regular expression is
as accurate as the algorithm.
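The authors' listing is elided in this message, so what follows is only a minimal sketch of the same idea in Python, not their program. It walks every nnnn-nn-nn string in a chosen range of years and reports the strings on which a regular expression and an algorithm disagree. All names here (CANDIDATE_BAD, valid_by_algorithm, valid_by_regex, find_disagreements) are hypothetical, the candidate pattern is deliberately naive (it never rejects February 29), and year 0000 is assumed to follow the ordinary Gregorian leap-year rule.

```python
import re
from itertools import product

DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

# Hypothetical candidate that matches BAD dates. It is deliberately
# incomplete: it has no test for February 29 in non-leap years.
CANDIDATE_BAD = re.compile(
    r"-(?:00|1[3-9]|[2-9]\d)-"      # month 00 or 13..99
    r"|-(?:00|3[2-9]|[4-9]\d)$"     # day 00 or 32..99
    r"|-(?:02|04|06|09|11)-31$"     # day 31 in a shorter month
    r"|-02-30$"                     # February 30
)


def is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def valid_by_algorithm(year, month, day):
    """Reference check done with plain arithmetic."""
    if not 1 <= month <= 12:
        return False
    limit = 29 if month == 2 and is_leap(year) else DAYS_IN_MONTH[month - 1]
    return 1 <= day <= limit


def valid_by_regex(text, bad_re=CANDIDATE_BAD):
    """A date is accepted when the bad-date pattern does not match."""
    return bad_re.search(text) is None


def find_disagreements(years=range(10000), bad_re=CANDIDATE_BAD):
    """List every nnnn-nn-nn string on which the two methods differ."""
    mismatches = []
    for year, month, day in product(years, range(100), range(100)):
        text = f"{year:04d}-{month:02d}-{day:02d}"
        if valid_by_regex(text, bad_re) != valid_by_algorithm(year, month, day):
            mismatches.append(text)
    return mismatches


if __name__ == "__main__":
    # A small range keeps the run short; the default covers all 10**8 strings.
    print(find_disagreements(range(1999, 2005)))
```

An empty result over the full range of years would mean the pattern and the algorithm agree on every well-formed input. With the naive candidate above, the disagreements are exactly the February 29 dates in non-leap years.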
[...]
A Python program to check date-checking regular expressions:
A program to check all possible numeric dates in the form
nnnn-nn-nn. Compares two methods of validating such dates,
one based on a regular expression and one based on an algorithm.
Substituting a different regular expression for whole_re would
allow it to be checked for accuracy.
[...]
On Thu, 16 Aug 2001, Bullard, Claude L (Len) wrote:
> Does anyone know of a repository of more or less
> reusable regexes, e.g., international phone numbers,
> file paths etc, for XML Schema?
Best wishes,
Robin Cover
XML Cover Pages
http://xml.coverpages.org/