Re: [xml-dev] Convert an XML Schema validation task into a form that is suitable for running on Graphics Processing Units (GPUs)?
Here is one simplified validation method for GPUs that I looked at. We start by validating markup pairs only (i.e. tags, not elements), thereby allowing the document to be treated as a linear stream. Then we convert our schema into a tabular form with weaker constraints than the full regular content models. Apologies for typos.
Parse the document into a list of items (throwing away comments, PIs, attributes, and text nodes containing only whitespace), where each item is a hash produced by one of a small set of hash functions:
start-tag(X), end-tag(X), cdata()
Process the schema into an N x N table of 0/1 values, indexed by pairs of hashes, allowing lookup of whether any pair is allowed. E.g., if element title may be the first item under a section, then the value at table[ start-tag(title), start-tag(section) ] is 1.
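The table construction can be sketched as follows. This is a Python sketch with an invented two-element schema fragment; the item names and sequential integer codes are illustrative stand-ins for real hash functions:

```python
# Sketch: build the pair-allowed table for a tiny invented schema in
# which <section> starts with <title>, and <title> holds character data.
# Codes are sequential integers here; a real system would hash.

def build_codes(*names):
    """Assign a small integer code to each distinct item kind."""
    return {name: i for i, name in enumerate(names)}

codes = build_codes(
    "start(section)", "end(section)",
    "start(title)", "end(title)", "cdata()",
)

N = len(codes)
# table[cur][prev] == 1 means item `cur` may immediately follow `prev`
# (same index order as table[ start-tag(title), start-tag(section) ]).
table = [[0] * N for _ in range(N)]

def allow(prev, cur):
    table[codes[cur]][codes[prev]] = 1

allow("start(section)", "start(title)")  # title may open first in section
allow("start(title)", "cdata()")
allow("cdata()", "end(title)")
allow("end(title)", "end(section)")

print(table[codes["start(title)"]][codes["start(section)"]])  # 1 = allowed
```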
Validation is done by each GPU thread checking one item (hash), verifying that it has an allowed preceding hash. I.e., for one warp:
for-each chunk = 0 .. list.length / number.of.warps
    for-each warp w
        cursor = chunk * number.of.warps + w
        item_result = table[ list[cursor], list[cursor-1] ]
    summarize-warp()
summarize-all()
where summarize-warp() reduces the item_result of each processor in the warp to a single value. (Different GPUs allow different operations for this.)
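The loop can be simulated serially to show what is being computed. A minimal Python sketch, where a logical AND over all item results is one hedged choice for the summarize steps (real kernels would use whatever warp reduction the hardware offers):

```python
# Serial simulation of the pairwise check the pseudocode distributes
# over warps: each "thread" checks that its item may follow the item
# before it, and the reductions AND all the results together.

def validate_pairs(items, table):
    """items: list of hash codes; table[cur][prev] == 1 if allowed."""
    # Item 0 has no predecessor; real code would use a sentinel code.
    results = [table[items[i]][items[i - 1]] for i in range(1, len(items))]
    return all(r == 1 for r in results)

# Tiny invented table: code 0 may follow 1, and 1 may follow 0.
table = [[0, 1], [1, 0]]
print(validate_pairs([0, 1, 0, 1], table))  # True
print(validate_pairs([0, 0], table))        # False
```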
So this gives a basic parallel validation, and the trick is to augment it to cover more schema constraints efficiently.
For greater efficiency, you make each process do more:
for-each chunk = 0 .. list.length / number.of.warps / 4
    for-each warp w
        cursor = chunk * number.of.warps * 4 + 4 * w
        item_result = table[ list[cursor],   list[cursor-1] ]
                    + table[ list[cursor+1], list[cursor]   ]
                    + table[ list[cursor+2], list[cursor+1] ]
                    + table[ list[cursor+3], list[cursor+2] ]
    summarize-warp()
summarize-all()
To augment this to cover attributes (presence, not values):
1) add hash functions attribute(A) and start-tag-close(X)
2) in the list, the items for attributes appear in some fixed order (e.g. sorted by hash, or alphabetically)
So this means <section b="c"><title> becomes
start-tag(section), attribute(b), start-tag-close(section), start-tag(title)
The table is constructed accordingly. The validator algorithm is unchanged.
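The fixed ordering can be sketched like so. Python again, with string "hashes" for readability; the function name and the alphabetical ordering choice are mine:

```python
# Sketch: emit the item list for a start-tag with attributes, sorting
# the attribute items into a fixed (here alphabetical) order so the
# pair table only ever needs to allow one ordering per element.

def start_tag_items(name, attrs):
    items = [f"start-tag({name})"]
    items += [f"attribute({a})" for a in sorted(attrs)]
    items.append(f"start-tag-close({name})")
    return items

# <section b="c"><title> becomes:
items = start_tag_items("section", ["b"]) + ["start-tag(title)"]
print(items)
# ['start-tag(section)', 'attribute(b)',
#  'start-tag-close(section)', 'start-tag(title)']
```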
So now we have a parallel validator that checks:
* known and unknown element and attribute names
* known and allowed attributes for each element
* allowed initial and final element children for an element
* mixed content, element content, empty elements
* whether two adjacent sibling elements are ever allowed together in any context (i.e. quite weak, except that mixed content models tend to be much the same)
* required consecutive sibling elements, and required initial or final child elements.
You can continue to expand. E.g., instead of the hash function start-tag(name) you can have start-tag(name, parent-name), which then allows validation of allowed siblings in the context of their parent elements. Or (better for concise tables) the complex type name. (If you extend it to further levels, you can start to get closer to my old Hook schema toy idea.)
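Producing contextual hashes only needs a stack of open elements during the linear pass. A sketch, with invented names and a "#root" sentinel of my own choosing:

```python
# Sketch: contextual item codes. Instead of hashing the name alone,
# hash (name, parent-name) via a stack of open elements, so the pair
# table can distinguish e.g. <title> under <section> from <title>
# under some other parent.

def contextual_items(events):
    """events: ('start'|'end', name) pairs from a linear parse."""
    stack = ["#root"]  # sentinel parent for the document element
    items = []
    for kind, name in events:
        if kind == "start":
            items.append(f"start-tag({name}, {stack[-1]})")
            stack.append(name)
        else:
            stack.pop()
            items.append(f"end-tag({name}, {stack[-1]})")
    return items

events = [("start", "section"), ("start", "title"),
          ("end", "title"), ("end", "section")]
print(contextual_items(events))
# ['start-tag(section, #root)', 'start-tag(title, section)',
#  'end-tag(title, section)', 'end-tag(section, #root)']
```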
Note that if you decide not to have unique hash codes, to restrict the size of the table against explosions, you won't get false negatives: a valid document is never rejected, though hash collisions may let some invalid documents pass.
To get better validation, you can make a second validation table that works on allowed pairs of hashes of items that are two apart (or three, etc.), rather than consecutive. This may help validate sequences with common and uncommon elements that otherwise could mask each other, getting closer to the constraints of the XML Schema content model. (E.g., if the content model was an ALL model, the 1-table would detect a,a,b,c as invalid, but not a,b,a,c. The 2-table would detect the reverse. Neither would detect a,b,c,a, which would need a 3-table.)
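The ALL-model example can be checked mechanically. A sketch under one simplifying assumption of mine: for an ALL model over {a, b, c}, any pair of distinct siblings is allowed at any distance, so the only forbidden pair at every distance k is a repeated name:

```python
# Distance-k pair check for an ALL content model over {a, b, c}:
# the k-table here forbids exactly the pairs (x, x).

def ok_at_distance(items, k):
    """True if no item equals the item k positions before it."""
    return all(items[i] != items[i - k] for i in range(k, len(items)))

print(ok_at_distance(list("aabc"), 1))  # False: 1-table catches a,a,b,c
print(ok_at_distance(list("abac"), 1))  # True:  1-table misses a,b,a,c
print(ok_at_distance(list("abac"), 2))  # False: 2-table catches it
print(ok_at_distance(list("abca"), 1),
      ok_at_distance(list("abca"), 2))  # True True: both miss a,b,c,a
print(ok_at_distance(list("abca"), 3))  # False: needs the 3-table
```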
Anyway, that is one way that GPUs could be utilized. I think.
Cheers
Rick