- From: agreene@bitstream.com (Andrew Greene)
- To: xml-dev@ic.ac.uk
- Date: Thu, 28 Aug 1997 12:52:15 -0400
But perl doesn't have to break $_ on newlines. Whenever I do SGML
"parsing" with perl, I start off with
$/ = "<";
which says "the record-break character is '<', instead of newline."
Then, within my while (<>) loop, each $_ contains a single tag followed
by some content, roughly matching the regexp:
    ($etagP, $gi, $attlist, $content) =
        /(\/?)(\w+)\s*([^>]*)>(.*)/s;   # /s lets (.*) span newlines in content
[For purists only: Yes, GIs can contain a different set of characters
than \w+, and attributes can contain > if it's enclosed in quotation
marks, and this doesn't chop off the trailing '<' that ends every
record except the last one, and so on and so forth.... For SGML, it
assumes that the first character of ETAGO is the same as STAGO; for
XML, it doesn't handle the /> syntax... but it's simplified to make a
point.]
The point is that perl doesn't care whether you have whitespace or
not, and if your perl script is splitting on newlines then you're
probably not going to correctly handle tags that contain newlines,
such as
<book
id=TWENTYKDOWN
authorid=VERNEJ
pubid=PENGUIN
><title>20,000 Leagues Under the Sea</title
></book
>
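
A minimal sketch of the approach described above, run against the
<book> example. Several things here are assumptions rather than part
of the original post: the helper name split_tags is hypothetical, the
input is read from an in-memory string instead of <>, chomp is used to
drop the trailing '<' from each record, and the /s modifier is added so
that content may span newlines. It carries the same simplifications
listed in the "purists" note.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input: the <book> example, held in a string so the sketch
# is self-contained.
my $doc = <<'END';
<book
id=TWENTYKDOWN
authorid=VERNEJ
pubid=PENGUIN
><title>20,000 Leagues Under the Sea</title
></book
>
END

# split_tags is a hypothetical helper name, not something from the post.
# It returns one [etagP, gi, attlist, content] array ref per tag found.
sub split_tags {
    my ($text) = @_;
    my @records;
    local $/ = "<";                 # record separator is '<', not newline
    open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";
    while (my $rec = <$fh>) {
        chomp $rec;                 # chomp strips a trailing $/, i.e. the '<'
        my ($etagP, $gi, $attlist, $content) =
            $rec =~ /(\/?)(\w+)\s*([^>]*)>(.*)/s
            or next;                # skip records that contain no tag
        push @records, [$etagP, $gi, $attlist, $content];
    }
    close $fh;
    return @records;
}

for my $r (split_tags($doc)) {
    my ($etagP, $gi) = @$r;
    print(($etagP ? "end" : "start"), " tag: $gi\n");
}
```

Because the records are delimited by '<' rather than by newline, the
line breaks inside the <book ...> start tag simply end up inside
$attlist, and the tag is still seen as one unit.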
- Andrew Greene
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@ic.ac.uk)