
   Re: [xml-dev] Character Entities: An XML Core WG View


Pardon the possible self-promotion, but I think a brief description of
Ents might offer some hope to those who find declaring character
entities in internal subsets unpleasant.

Ents is a Java FilterReader that looks through a document and either
replaces entity names with character references or character references
with entity names.  It does this in the text of the document, so this
processing can be either fed into a parser or poured back out as XML for
later processing.  (There is also a SAXFilter for the skippedEntity()
event.)
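
For readers who want to picture the text-level approach, here is a
minimal sketch of a FilterReader that turns entity names into decimal
character references.  It is not the Ents source: the class name, the
namesToRefs() helper, and the Map-based rule table are all hypothetical,
and for brevity it reads the whole underlying stream before converting,
and it does not treat CDATA sections or comments specially.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only; not the actual Ents implementation.
class EntityNameFilterReader extends FilterReader {
    // Roughly matches &name; (a simplification of the XML Name production).
    private static final Pattern ENTITY_REF =
            Pattern.compile("&([A-Za-z_][A-Za-z0-9._-]*);");

    private final Map<String, Integer> rules; // entity name -> code point
    private Reader converted;                 // built on first read

    EntityNameFilterReader(Reader in, Map<String, Integer> rules) {
        super(in);
        this.rules = rules;
    }

    // Replace every known &name; with &#NNN;, leaving unknown names alone.
    static String namesToRefs(String text, Map<String, Integer> rules) {
        Matcher m = ENTITY_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            Integer codePoint = rules.get(m.group(1));
            out.append(codePoint != null ? "&#" + codePoint + ";" : m.group());
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Slurp the underlying reader once, convert it, and serve the result.
    private Reader converted() throws IOException {
        if (converted == null) {
            StringBuilder text = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                text.append(buf, 0, n);
            }
            converted = new StringReader(namesToRefs(text.toString(), rules));
        }
        return converted;
    }

    @Override public int read() throws IOException {
        return converted().read();
    }

    @Override public int read(char[] cbuf, int off, int len) throws IOException {
        return converted().read(cbuf, off, len);
    }

    @Override public long skip(long n) throws IOException {
        return converted().skip(n);
    }

    @Override public boolean ready() throws IOException {
        return converted().ready();
    }

    @Override public boolean markSupported() {
        return false; // mark/reset are not supported in this sketch
    }
}
```

Wrapping a FileReader in this class and handing the result to a parser
(or copying it to a Writer) gives the two uses described above.  Names
missing from the rule table, including the predefined ones such as
&amp;, pass through unchanged.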

The rules files, while not the best XML structure I've created, are
pretty simple at their core:
<equal ent="iexcl" ref="#161">inverted exclamation mark, U+00A1
ISOnum</equal>
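
To show how little a rule like this asks of a consumer, here is a
sketch that reads one <equal> element with the standard JAXP DOM
parser.  The EntsRule class and its field names are my own invention
for illustration, and I am assuming the ref attribute always has the
decimal "#NNN" form shown above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical holder for one rule; not part of Ents itself.
class EntsRule {
    final String name;        // e.g. "iexcl"
    final int codePoint;      // e.g. 161
    final String description; // free text inside the element

    EntsRule(String name, int codePoint, String description) {
        this.name = name;
        this.codePoint = codePoint;
        this.description = description;
    }

    // Parse a single <equal ent="..." ref="#NNN">...</equal> element.
    static EntsRule parse(String xml) {
        try {
            Element e = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String ref = e.getAttribute("ref");
            if (!"equal".equals(e.getTagName()) || !ref.startsWith("#")) {
                throw new IllegalArgumentException("not an <equal> rule: " + xml);
            }
            // Assumes a decimal reference; hex is not handled here.
            int codePoint = Integer.parseInt(ref.substring(1));
            // Collapse the line break inside the description to one space.
            String description = e.getTextContent().trim().replaceAll("\\s+", " ");
            return new EntsRule(e.getAttribute("ent"), codePoint, description);
        } catch (IllegalArgumentException ex) {
            throw ex;
        } catch (Exception ex) {
            // Wrap parser and I/O failures so callers see one exception type.
            throw new IllegalArgumentException("could not parse rule", ex);
        }
    }
}
```

Feeding it the iexcl rule above should give the name "iexcl", the code
point 161, and the description text with its line break collapsed.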

There's also room for some descriptions, an identification of the source
for these references, etc.  It's not a particularly bright tool at the
moment, as I never got around to teaching it about hex to decimal
conversion, but it does let you round-trip entities to character
references and back.  Humans can enjoy the (relative) convenience of
named entities, while parsers can enjoy the simpler processing of
character references.

Features coming soon include the hexadecimal support mentioned above, as
well as support for putting the characters directly into or out of text,
not just character refs.  I'm integrating Ents with my Gorille work on
Unicode, and should have something to show in the next couple of weeks.  
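
For the curious, the two planned features might look roughly like the
sketch below.  This is a guess at the shape of the thing, not code
from Ents or Gorille: the class and method names are hypothetical, and
the second method deliberately leaves references to < and & alone so
that the result remains well-formed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the planned features; not Ents or Gorille code.
class CharRefSketch {
    // Digit counts are capped so Integer.parseInt cannot overflow.
    private static final Pattern HEX_REF = Pattern.compile("&#x([0-9A-Fa-f]{1,6});");
    private static final Pattern DEC_REF = Pattern.compile("&#([0-9]{1,7});");

    // Rewrite hexadecimal references (&#xA1;) as decimal ones (&#161;).
    static String hexRefsToDecimal(String text) {
        Matcher m = HEX_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append("&#").append(Integer.parseInt(m.group(1), 16)).append(';');
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    // Replace decimal references with the characters themselves.
    static String decimalRefsToCharacters(String text) {
        Matcher m = DEC_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            int cp = Integer.parseInt(m.group(1));
            // Keep markup-significant and invalid code points as references.
            if (cp == '<' || cp == '&' || !Character.isValidCodePoint(cp)) {
                out.append(m.group());
            } else {
                out.appendCodePoint(cp);
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

Running hexRefsToDecimal() first and decimalRefsToCharacters() second
would put the characters directly into the text whichever reference
form the document started with.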

It seems like pretty much all XML development to date has been at the
parser level or above, but there's a lot of useful work to be done on
the text.  It's unfortunate that the parsing model described in XML 1.0
puts a lot of layers into a single processing context, but maybe we can
start breaking out those layers and take advantage of having all this
accessible text.


-------------
Simon St.Laurent - SSL is my TLA
http://simonstl.com may be my URI
http://monasticxml.org may be my ascetic URI
urn:oid:1.3.6.1.4.1.6320 is another possibility altogether




 


Copyright 2001 XML.org. This site is hosted by OASIS