- To: XML Developers List <xml-dev@lists.xml.org>
- Subject: converting character entities to us-ascii /equivalents/
- From: Robert Koberg <rob@koberg.com>
- Date: Wed, 06 Oct 2004 14:55:58 -0700
- User-agent: Mozilla Thunderbird 0.7 (Macintosh/20040616)
Hi,
I need to output several versions of a page (through XSL
transformations), one of which is us-ascii (for email). But, the content
might contain some characters that are not supported by us-ascii (like
em dash - —).
I want the character entities to remain in the content. When
transforming to us-ascii, I want to replace the entities with some ascii
text equivalent: For example, '—' would get converted to '--'.
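As an illustration of the kind of mapping described here, the following is a minimal sketch and not an existing library API. The class name `AsciiFallback`, the method `toAscii`, the contents of the table, and the choice of `?` for unmapped characters are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AsciiFallback {
    // Hypothetical replacement table; only the two characters mentioned
    // in this message are listed, and more could be added as needed.
    private static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put('\u2014', "--");                     // em dash (U+2014)
        REPLACEMENTS.put('\u00AE', "(Registered Trademark)"); // registered sign (U+00AE)
    }

    /** Returns a copy of input in which every char is in the us-ascii range. */
    public static String toAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);   // known non-ASCII character
            } else if (c < 128) {
                out.append(c);             // already us-ascii, keep as-is
            } else {
                out.append('?');           // assumption: placeholder for unmapped characters
            }
        }
        return out.toString();
    }
}
```

Because the source content is left untouched, the same input can still feed the non-ASCII outputs; only the us-ascii version would pass its text through `toAscii`.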
The XML is pulled into the transformation through the document function
using a custom URIResolver.
Is there an existing solution to this?
Do Apache FOP and its text renderer handle this type of thing?
I have tried to set a ContentHandler (actually a DefaultHandler) on the
XMLReader and tried to replace a character entity, but I am doing
something wrong and am confused about how to proceed. Using the code below, I
get a recoverable error with saxon/aelfred and a failure with
saxon/xerces.
Here is a snippet from the URIResolver:
    InputSource in = new InputSource(file.getAbsolutePath());
    SAXSource source = new SAXSource(in);
    XMLReader reader = null;
    try {
        reader = XMLReaderFactory.createXMLReader("com.icl.saxon.aelfred.SAXDriver");
        //reader = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    } catch (SAXException e) {
        System.err.println(e.getMessage());
    }
    reader.setContentHandler(new AsciiHandler());
    source.setXMLReader(reader);
    return source;
And the DefaultHandler has one method:
    public void characters(char[] text, int start, int length) {
        String str = new String(text, start, length);
        if (str.indexOf(174) > -1) {
            str.replaceAll("\u00AE", "(Registered Trademark)");
        }
        text = str.toCharArray();
    }
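One possible direction, offered as a sketch under assumptions and not a confirmed fix for the errors described: in the handler above, the result of `replaceAll` is discarded (Java strings are immutable), and assigning to the `text` parameter does not change what any other handler sees. In addition, a transformer typically installs its own ContentHandler on the XMLReader it is given, which would replace one set beforehand. A SAX filter built on `org.xml.sax.helpers.XMLFilterImpl` sits between the parser and whatever handler is installed later, and can forward modified character data. The class name `AsciiFilter` and the two replacements are hypothetical.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class AsciiFilter extends XMLFilterImpl {

    public AsciiFilter(XMLReader parent) {
        super(parent); // events from the parent reader pass through this filter
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String text = new String(ch, start, length);
        // String.replace returns a new string, so the result must be kept.
        String replaced = text
                .replace("\u00AE", "(Registered Trademark)")
                .replace("\u2014", "--");
        char[] out = replaced.toCharArray();
        // Forward the rewritten characters to the downstream ContentHandler.
        super.characters(out, 0, out.length);
    }
}
```

In the URIResolver snippet, this would presumably be wired in with `source.setXMLReader(new AsciiFilter(reader));` in place of the `setContentHandler` call, since `XMLFilterImpl` is itself an `XMLReader`.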
How can I do this? Is there a better way to handle this type of thing?
thanks,
-Rob