OASIS Mailing List Archives
Re: [xml-dev] It's too late to improve XML ... lessons learned?

Off the top of my head, there are two quite good reasons for the markup to use a different script than the content, such as using ASCII markup with CJK content, or even Greek or Cyrillic.

First, because it is so visually distinct, which is a good characteristic for markup.

Second, because it may be marginally easier in some editing applications: if you search for a substring in an element name, you don't need to worry about false hits in the data. And it may be that when typing, after flipping your input mode to "half width" to enter "<" (STAGO), it is easier to stick in that mode rather than having to flip back to fullwidth, kana, kanji, etc.

So the point about native language markup is not that people MUST use it, but that an international standard (of any kind) MUST NOT prevent anyone from using it (unless there are other compelling technical factors). No nannying.*

Sitting here in Australia, in my mask in the sunshine, why is it my business to dictate yea or nay to anyone in a country I am foreign to?  When Murata Sensei (whose dedication and contribution are making him a Living International Treasure IMHO: I hope he is recognised more) creates a schema for some Japanese use, the decision/negotiation about which characters to use should be his (and his team's) alone!

I can remember Murata-san showing me some Japanese address schema, and saying how there was no Western equivalent of the Japanese cho-me, and that spelling it out in alphabetic letters would frustrate the user, who would think it was supposed to be a foreign word spelled out. We should not think that vocabularies from different language families are simply inter-translatable.

...Some caveats: I suspect that auto-generated IDs are still better off being ASCII only (which kinda happens by default because so many IDs are constructed with a UUID suffix). And names that you might want to bind as symbols (LHS names) in a programming language, too. (Oh, and I expect it can be useful to borrow HTML tag names rather than have neologisms in markup.) And with so much outsourcing, the rationale for native language markup has an extra consideration too.
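The UUID point can be seen in a two-line sketch (the "sec" prefix here is purely illustrative):

```python
import uuid

# Generated IDs end up ASCII-only by construction: an ASCII prefix
# plus a UUID, whose canonical form is only hex digits and hyphens.
def make_id(prefix: str = "sec") -> str:
    return f"{prefix}-{uuid.uuid4()}"

print(make_id())  # e.g. sec-9f1c2e3a-...
```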

I am interested to know what gotchas people have found in real deployments, in the last 20 years, with XML with non-ASCII data and markup. And also, whether modern Unicode is actually good enough now for professional, quality publishing in CJK and other national or major scripts. 

For example, are PUA characters used much in XML, or is Unihan plus markup good enough, or do people need to embed actual glyph information? How are new ideographs handled when you cannot wait for the Unicode Consortium process? Is the situation different with JSON?
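(For what it's worth, the PUA question is at least easy to ask of a corpus mechanically; a sketch, using the three Private Use Area ranges from the Unicode standard:)

```python
# The three Private Use Area ranges defined by Unicode:
# the BMP PUA and the two supplementary PUA planes.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def has_pua(text: str) -> bool:
    """True if any code point in the text is in a Private Use Area."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in PUA_RANGES)

print(has_pua("普通の文章"))  # False: ordinary text
print(has_pua("\uE000"))      # True: a BMP PUA code point
```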

Cheers 
Rick

* But aren't restrictions on characters allowed in names "nannying"? 

For the specific restrictions, there is a trade-off between benefits (detecting control characters that indicate wrong encoding info, preventing spoofed tags where fullwidth characters are used, limiting non-descriptive markup, keeping a separation between tags and data, preventing processing/transmission errors on legacy systems, preventing invisible markup, etc.) and costs (how fine-grained do the checks need to be? is there some SIMD optimization or folding that we can enable to reduce cost?). And I see no reason to expect the trade-off wouldn't shift over time.

Personally, I think coarse-grained restrictions on names (at least at Unicode block granularity) are justifiable, because markup languages have a tacit software engineering methodology and value proposition: XML names are names: visible and "readable".
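A block-granularity check of the kind I mean could be sketched like this (the allowed ranges here are illustrative only, not XML's actual Name production):

```python
# Illustrative allow-list of Unicode blocks for name characters:
# Basic Latin letters, Hiragana, Katakana, and the main CJK block.
ALLOWED_BLOCKS = [
    (0x0041, 0x005A), (0x0061, 0x007A),  # A-Z, a-z
    (0x3040, 0x309F),                    # Hiragana
    (0x30A0, 0x30FF),                    # Katakana
    (0x4E00, 0x9FFF),                    # CJK Unified Ideographs
]

def name_ok(name: str) -> bool:
    """Coarse check: every character must fall in an allowed block."""
    return bool(name) and all(
        any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_BLOCKS)
        for ch in name
    )

print(name_ok("住所"))       # True: ideographic name
print(name_ok("addr"))       # True: ASCII name
print(name_ok("ad\u3000r"))  # False: ideographic space is not a letter
```

A coarse check like this is cheap (a handful of range comparisons per character, amenable to SIMD), which is part of the trade-off mentioned above.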

But I also think that users should be allowed an alternative, if they choose. In this case, that alternative would be to also allow string literals for element and attribute names, and character references. So

<"i am an element"></"i am an element">

Or the horrible
<"1235&#x3000;&#x0000;"/>
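Decoding those references shows why it is horrible: two of the six characters in the name are invisible (an ideographic space and a NUL):

```python
# The character references in the name above decode to these code
# points; U+3000 (ideographic space) and U+0000 (NUL) are invisible.
name = "1235\u3000\u0000"
print([f"U+{ord(ch):04X}" for ch in name])
# ['U+0031', 'U+0032', 'U+0033', 'U+0035', 'U+3000', 'U+0000']
```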

(Using this could be like resorting to Processing Instructions: it is a sign that our attempts to presage and squeeze all possible future requirements into strict schemas have failed and we are sheepishly having to adopt a Plan B. "You are only finite humans, not supermen or gods," Processing Instructions whisper to our former pride's cries of "Behold my tags and tremble!") :-)

On Fri, 14 Jan 2022, 4:37 am Michael Kay, <mike@saxonica.com> wrote:
>
> It's hard to believe now, but Unicode was once a very hard technology to sell.
>
>

I was working closely with Japanese colleagues at the time. I think the sentiment was (a) they had spent years in standards committees negotiating compromises that met the requirements of all the Asian countries, and (b) they weren't going to accept something dreamt up by Apple and Xerox in California that ignored all the cultural nuances they had spent years refining.

But as so often happens with standards, the adequate solution today proved better than the perfect solution tomorrow.

Michael Kay
Saxonica


