Re: [xml-dev] Convert an XML Schema validation task into a form that is suitable for running on Graphics Processing Units (GPUs)?
Here is one simplified validation method for GPUs that I looked at. We start by validating markup pairs only (i.e. tags, not elements), thereby allowing the document to be treated as a linear stream. Then we convert our schema into a tabular form with weaker constraints than the full regular content models. Apologies for typos.
Parse the document into a list of items (throwing away comments, PIs, attributes, and text nodes containing only whitespace), where each item is a hash produced by one of a small set of hash functions:
start-tag(X), end-tag(X), cdata()
Process the schema into an N x N table of 0/1 values, indexed by pairs of hashes, allowing lookup of whether any pair is allowed. E.g., if element title may be the first item under a section, then the value at table[ start-tag(title), start-tag(section) ] is 1.
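The table construction can be sketched as follows. This is a Python sketch with an invented two-element schema fragment; the item names and sequential integer codes are illustrative stand-ins for real hash functions:

```python
# Sketch: build the pair-allowed table for a tiny invented schema in
# which <section> starts with <title>, and <title> holds character data.
# Codes are sequential integers here; a real system would hash.

def build_codes(*names):
    """Assign a small integer code to each distinct item kind."""
    return {name: i for i, name in enumerate(names)}

codes = build_codes(
    "start(section)", "end(section)",
    "start(title)", "end(title)", "cdata()",
)

N = len(codes)
# table[cur][prev] == 1 means item `cur` may immediately follow `prev`
# (same index order as table[ start-tag(title), start-tag(section) ]).
table = [[0] * N for _ in range(N)]

def allow(prev, cur):
    table[codes[cur]][codes[prev]] = 1

allow("start(section)", "start(title)")  # title may open first in section
allow("start(title)", "cdata()")
allow("cdata()", "end(title)")
allow("end(title)", "end(section)")

print(table[codes["start(title)"]][codes["start(section)"]])  # 1 = allowed
```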
Validation is done by each GPU thread checking one item (hash), verifying that it has an allowed preceding hash. I.e., for one warp:
for-each chunk = 0 .. list.length / number.of.warps
    for-each warp w
        cursor = chunk * number.of.warps + w
        item_result = table[ list[cursor], list[cursor-1] ]
    summarize-warp()
summarize-all()
where summarize-warp() reduces the item_result of each processor in the warp to a single value. (Different GPUs allow different operations for this.)
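The loop can be simulated serially to show what is being computed. A minimal Python sketch, where a logical AND over all item results is one hedged choice for the summarize steps (real kernels would use whatever warp reduction the hardware offers):

```python
# Serial simulation of the pairwise check the pseudocode distributes
# over warps: each "thread" checks that its item may follow the item
# before it, and the reductions AND all the results together.

def validate_pairs(items, table):
    """items: list of hash codes; table[cur][prev] == 1 if allowed."""
    # Item 0 has no predecessor; real code would use a sentinel code.
    results = [table[items[i]][items[i - 1]] for i in range(1, len(items))]
    return all(r == 1 for r in results)

# Tiny invented table: code 0 may follow 1, and 1 may follow 0.
table = [[0, 1], [1, 0]]
print(validate_pairs([0, 1, 0, 1], table))  # True
print(validate_pairs([0, 0], table))        # False
```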
So this gives a basic parallel validation, and the trick is to augment it to cover more schema constraints efficiently.
For greater efficiency, you make each process do more:
for-each chunk = 0 .. list.length / number.of.warps / 4
    for-each warp w
        cursor = chunk * number.of.warps * 4 + 4 * w
        item_result = table[ list[cursor],   list[cursor-1] ]
                    + table[ list[cursor+1], list[cursor]   ]
                    + table[ list[cursor+2], list[cursor+1] ]
                    + table[ list[cursor+3], list[cursor+2] ]
    summarize-warp()
summarize-all()
To augment this to cover attributes (presence, not values):
1) add hash functions attribute(A) and start-tag-close(X)
2) in the list, the items for attributes appear in some fixed order (e.g. sorted by hash, or alphabetically)
So this means <section b="c"><title> becomes
start-tag(section), attribute(b), start-tag-close(section), start-tag(title)
The table is constructed accordingly. The validator algorithm is unchanged.
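The fixed ordering can be sketched like so. Python again, with string "hashes" for readability; the function name and the alphabetical ordering choice are mine:

```python
# Sketch: emit the item list for a start-tag with attributes, sorting
# the attribute items into a fixed (here alphabetical) order so the
# pair table only ever needs to allow one ordering per element.

def start_tag_items(name, attrs):
    items = [f"start-tag({name})"]
    items += [f"attribute({a})" for a in sorted(attrs)]
    items.append(f"start-tag-close({name})")
    return items

# <section b="c"><title> becomes:
items = start_tag_items("section", ["b"]) + ["start-tag(title)"]
print(items)
# ['start-tag(section)', 'attribute(b)',
#  'start-tag-close(section)', 'start-tag(title)']
```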
So now we have a parallel validator that checks:
* known and unknown element and attribute names
* known and allowed attributes for each element
* allowed initial and final element children for an element
* mixed content, element content, empty elements
* whether two adjacent sibling elements are ever allowed together in any context (i.e. quite weak, except that mixed content models tend to be much the same)
* required consecutive sibling elements, and required initial or final child elements.
You can continue to expand. E.g., instead of the hash function start-tag(name) you can have start-tag(name, parent-name), which then allows validation of allowed siblings in the context of their parent elements. Or (better for concise tables) the complex type name. (If you extend it to further levels, you can start to get closer to my old Hook schema toy idea.)
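Producing contextual hashes only needs a stack of open elements during the linear pass. A sketch, with invented names and a "#root" sentinel of my own choosing:

```python
# Sketch: contextual item codes. Instead of hashing the name alone,
# hash (name, parent-name) via a stack of open elements, so the pair
# table can distinguish e.g. <title> under <section> from <title>
# under some other parent.

def contextual_items(events):
    """events: ('start'|'end', name) pairs from a linear parse."""
    stack = ["#root"]  # sentinel parent for the document element
    items = []
    for kind, name in events:
        if kind == "start":
            items.append(f"start-tag({name}, {stack[-1]})")
            stack.append(name)
        else:
            stack.pop()
            items.append(f"end-tag({name}, {stack[-1]})")
    return items

events = [("start", "section"), ("start", "title"),
          ("end", "title"), ("end", "section")]
print(contextual_items(events))
# ['start-tag(section, #root)', 'start-tag(title, section)',
#  'end-tag(title, section)', 'end-tag(section, #root)']
```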
Note that if you decide not to have unique hash codes, to restrict the size of the table against explosions, you won't get false negatives: a valid document is never rejected, though hash collisions may let some invalid documents pass.
To get better validation, you can make a second validation table that works on allowed pairs of hashes of items that are two apart (or three, etc.), rather than consecutive. This may help validate sequences with common and uncommon elements that otherwise could mask each other, getting closer to the constraints of the XML Schema content model. (E.g., if the content model was an ALL model, the 1-table would detect a,a,b,c as invalid, but not a,b,a,c. The 2-table would detect the reverse. Neither would detect a,b,c,a, which would need a 3-table.)
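The ALL-model example can be checked mechanically. A sketch under one simplifying assumption of mine: for an ALL model over {a, b, c}, any pair of distinct siblings is allowed at any distance, so the only forbidden pair at every distance k is a repeated name:

```python
# Distance-k pair check for an ALL content model over {a, b, c}:
# the k-table here forbids exactly the pairs (x, x).

def ok_at_distance(items, k):
    """True if no item equals the item k positions before it."""
    return all(items[i] != items[i - k] for i in range(k, len(items)))

print(ok_at_distance(list("aabc"), 1))  # False: 1-table catches a,a,b,c
print(ok_at_distance(list("abac"), 1))  # True:  1-table misses a,b,a,c
print(ok_at_distance(list("abac"), 2))  # False: 2-table catches it
print(ok_at_distance(list("abca"), 1),
      ok_at_distance(list("abca"), 2))  # True True: both miss a,b,c,a
print(ok_at_distance(list("abca"), 3))  # False: needs the 3-table
```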
Anyway, that is one way that GPUs could be utilized. I think.
Cheers
Rick