I think there are some easy wins to be had. As others have said, the nature
of the document will drive a lot of the approaches-- but more importantly,
the nature of the grammar will drive them. For instance, working
with DTDs in an editor makes offering up allowed items at a given position
child's play. Because DTDs have a limited notion of context (i.e., one global
context), determining the allowed entities, elements, attributes, and
attribute values is a snap.
(1) Parse the grammar. In my editor I cache a variety of (configurable)
common DTDs and store the parsed results in arrays/lists for quick access. This
parsing is done with a modified SAX parser and validator-- but only barely
(2) When the completion tool is invoked, I parse backward from the cursor to get:
a) current desired context (e.g., attribute, attribute value, start
element, end element)
b) current document context (e.g., previous siblings, previous attributes
in the element, parent element)
(3) Because DTDs offer a single global context for element declarations,
knowing the parent element solves the problem: determining the next
allowable element or elements is fairly simple and very localized.
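
Step (2) above can be sketched roughly as follows. This is a minimal illustration in Python, not the Pascal routine from my editor; the names `desired_context` and `document_context` are made up for the example. It assumes the text before the cursor contains no comments, CDATA sections, or ">" characters inside attribute values, all of which a real backward parser has to handle.

```python
import re

# Matches a complete start, end, or empty-element tag (hypothetical helper).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)[^<>]*?(/?)>")

def desired_context(text, cursor):
    """Classify what the user is about to type at the cursor."""
    before = text[:cursor]
    lt = before.rfind("<")
    if lt == -1 or before.rfind(">") > lt:
        return "content"              # not inside an unfinished tag
    tag = before[lt:]
    if tag.startswith("</"):
        return "end element"
    if not any(ch.isspace() for ch in tag):
        return "start element"        # still typing the element name
    quote = None                      # track whether a quote is open
    for ch in tag:
        if quote is None and ch in "\"'":
            quote = ch
        elif ch == quote:
            quote = None
    return "attribute value" if quote else "attribute"

def document_context(text, cursor):
    """Scan complete tags backward; return (parent, preceding siblings)."""
    depth = 0
    siblings = []
    for m in reversed(list(TAG.finditer(text[:cursor]))):
        closing, name, empty = m.groups()
        if closing:
            depth += 1                # entering a closed subtree
        elif empty:
            if depth == 0:
                siblings.append(name)
        elif depth == 0:
            siblings.reverse()
            return name, siblings     # first unmatched start tag is the parent
        else:
            depth -= 1
            if depth == 0:
                siblings.append(name) # a whole sibling subtree was skipped
    siblings.reverse()
    return None, siblings
```

With the text `<foo><BarA/><BarB>x</BarB><` and the cursor at the end, the first function reports a start element and the second reports parent `foo` with preceding siblings `BarA`, `BarB`.
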
Now, with that said-- I am not selling my editor so there were some
shortcuts I took. For instance the backwards parsing algorithm isn't perfect
by any stretch of the imagination. Also, I only parse backward on invoke,
so attributes already used after the current cursor position still appear in the list.
Additionally, there are pathological cases-- e.g., an XML document with one
root node and thousands upon thousands of child nodes that would all be
treated as siblings. If I were to handle this, I would probably do a better
job of predetermining the structural possibilities in the grammar and
create an internal list of "break" cases.
An example of this would be an element foo with the content model
(BarA, BarB*, BarC). If it is possible to determine that BarB is only
referenced in this declaration, then the parent context is not needed:
any time the preceding sibling is BarB, it can be inferred immediately that
BarB | BarC is allowed next.
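
To make the foo example concrete, here is a hedged sketch in Python. It only handles flat sequence models such as (BarA, BarB*, BarC), with no choice groups or nesting, and every name in it (`allowed_next`, `allowed_after`, `break_cases`, the `other` element) is invented for illustration.

```python
# Content models as flat sequences of (name, occurrence); occurrence is
# "", "?", "*" or "+". Choice groups and nesting are deliberately left out.
GRAMMAR = {
    "foo": [("BarA", ""), ("BarB", "*"), ("BarC", "")],
    "other": [("BarA", "")],   # hypothetical second declaration using BarA
}

def _closure(model, states):
    """Add states reached by moving past an optional or satisfied item."""
    seen, stack = set(states), list(states)
    while stack:
        i, started = stack.pop()
        if i < len(model) and (started or model[i][1] in ("?", "*")):
            nxt = (i + 1, False)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def _candidates(model, states):
    """Element names that may legally come next from any of the states."""
    return {model[i][0] for i, started in states
            if i < len(model) and (not started or model[i][1] in ("*", "+"))}

def allowed_next(model, siblings):
    """Allowed next elements given the parent's model and all prior siblings."""
    states = _closure(model, {(0, False)})
    for name in siblings:
        states = _closure(model, {
            (i, True) for i, started in states
            if i < len(model) and model[i][0] == name
            and (not started or model[i][1] in ("*", "+"))})
    return _candidates(model, states)

def allowed_after(model, name):
    """Shortcut for a "break" case: look only at the immediately preceding sibling."""
    hits = {(i, True) for i, item in enumerate(model) if item[0] == name}
    return _candidates(model, _closure(model, hits))

def break_cases(grammar):
    """Map each child referenced in exactly one declaration to that parent."""
    parents = {}
    for parent, model in grammar.items():
        for name, _ in model:
            parents.setdefault(name, set()).add(parent)
    return {n: next(iter(p)) for n, p in parents.items() if len(p) == 1}

# allowed_next(GRAMMAR["foo"], ["BarA"]) == {"BarB", "BarC"}
# break_cases(GRAMMAR)["BarB"] == "foo", so no backward scan to the parent is needed:
# allowed_after(GRAMMAR["foo"], "BarB") == {"BarB", "BarC"}
```

BarA is not a break case in this toy grammar because two declarations reference it, so the parent still has to be found for it.
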
Unfortunately, modern schema systems bring a number of new
problems, notably namespaces. Additional problems are created by the
introduction of multiple contexts for element names, in which case a single
parent context is not sufficient. For XML Schema, a list of predefined
"break" cases is imperative-- but the range of pathological documents
In terms of trees-- I have one that can be turned off. It is always off for
me. It is not that good : ) In terms of parsing and validation, I haven't
spent much time here. I simply have a delay built in: if the user doesn't
move the cursor for X seconds, the editor does a SAX-based parse and validation,
if those features are turned on. All of this runs in a separate thread so that it can
be interrupted immediately.
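
The delay-then-parse idea might look something like the following in Python. This is a sketch under assumptions, not my editor's code: `IdleChecker`, `cursor_moved`, and the `on_result` callback are hypothetical names, and the standard-library SAX parser used here only checks well-formedness (it does not validate against a DTD).

```python
import threading
import xml.sax

class IdleChecker:
    """Run a SAX parse after `delay` seconds without cursor movement."""

    def __init__(self, delay, on_result):
        self.delay = delay
        self.on_result = on_result      # called with None or an error message
        self._timer = None
        self._cancel = threading.Event()
        self._lock = threading.Lock()

    def cursor_moved(self, text):
        """Restart the idle countdown and interrupt any parse in progress."""
        with self._lock:
            self._cancel.set()
            if self._timer is not None:
                self._timer.cancel()
            self._cancel = threading.Event()
            self._timer = threading.Timer(
                self.delay, self._run, args=(text, self._cancel))
            self._timer.daemon = True
            self._timer.start()

    def _run(self, text, cancel):
        parser = xml.sax.make_parser()
        parser.setContentHandler(xml.sax.ContentHandler())
        error = None
        try:
            # Feed in chunks so the parse can be abandoned between chunks.
            for start in range(0, len(text), 4096):
                if cancel.is_set():
                    return
                parser.feed(text[start:start + 4096])
            parser.close()
        except xml.sax.SAXParseException as exc:
            error = str(exc)
        if not cancel.is_set():
            self.on_result(error)
```

The editor would call `cursor_moved(buffer_text)` on every cursor movement; only the most recent call ever reports a result.
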
A far better solution would be to combine this with detection of changes--
e.g., if the user typed comment data that didn't contain "--", there is no reason to
re-validate. I imagine, maybe wrongly, that there is some elegant solution
involving merged validation based on diffs. But I suspect most editors simplify
this a little and validate only a local context, determining that context in a way
that is similar to the completion proposal strategy above.
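
As a rough illustration of the comment case only, a check like the one below could decide whether an insertion can skip re-validation. It is a sketch with a hypothetical function name, it treats any "<!--" before the cursor as a real comment opener (so it would be fooled by one inside a CDATA section), and it errs toward re-validating whenever it is unsure.

```python
def can_skip_revalidation(text, pos, inserted):
    """True if inserting `inserted` at `pos` stays inside a legal comment.

    `text` is the document before the edit, assumed to have validated
    already. Any doubtful case returns False (re-validate).
    """
    open_at = text.rfind("<!--", 0, pos)
    if open_at == -1:
        return False                    # no comment opener before the cursor
    body_start = open_at + 4
    close_at = text.find("-->", body_start)
    if close_at == -1 or close_at < pos:
        return False                    # unterminated, or cursor is past the comment
    new_body = text[body_start:pos] + inserted + text[pos:close_at]
    # XML forbids "--" inside a comment and a "-" directly before "-->".
    return "--" not in new_body and not new_body.endswith("-")
```

Note that the check runs on the comment body after the insertion, so typing a single "-" next to an existing "-" is also caught.
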
Of course, like everyone else, I am eager to hear new solutions or
critiques. Also, I am willing to share my reverse parsing algorithm (in
Pascal, no less)-- with the caveat that it would probably
need work before it was ready for prime time.