OASIS Mailing List Archives (XML.org)
Re: [xml-dev] 3 approaches to expressing filter rules

If your SME is not computer savvy, don't you need to rule out almost anything that involves a computer language?  You need some kind of web page where they can select rules from a list, which is then converted to some computer language (e.g. Schematron QuickFix, JavaScript).

You might use some intermediate language:
 
1. JSON

A different approach I have found useful for allowing fuzzy sorting and matching of tabular data is to have a configuration file that enables or disables particular transforms for each row (and, potentially, for each field).  Here is the kind of thing in JSON; you can see that it could easily be generated by some web page.
{
   "fuzz": [
      1,
      "zero",
      4
   ],
   "redact": [
      2,
      "del",
      "secret"
   ],
   "zero": [
      3,
      "zero",
      0
   ],
   "default": [
      "*",
      "case-insensitive",
      0
   ]
}

where you are saying that on field 1 you apply some zero() function with a parameter of 4, on field 2 you apply some del() function with parameter "secret", on field 3 you apply some zero() function with parameter 0, and on any other field you do a case-insensitive match.   In the old days, we would have made some "little language" for this.
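A minimal Python sketch of how such a configuration might drive the transforms. The meanings given to zero(), del() and "case-insensitive" below are assumptions made for illustration (the configuration does not define them), as are the 1-based field numbers and the sample row.

```python
import csv
import io
import json

CONFIG = json.loads("""
{
   "fuzz":    [1, "zero", 4],
   "redact":  [2, "del", "secret"],
   "zero":    [3, "zero", 0],
   "default": ["*", "case-insensitive", 0]
}
""")


def zero(value, n):
    # Assumed meaning: n > 0 zeroes the last n digits;
    # n == 0 turns an empty field into "0".
    if n == 0:
        return value if value.strip() else "0"
    out, left = [], n
    for ch in reversed(value):
        if ch.isdigit() and left > 0:
            out.append("0")
            left -= 1
        else:
            out.append(ch)
    return "".join(reversed(out))


def delete(value, word):
    # "del" is a Python keyword, hence the longer name.
    # Assumed meaning: drop the word and tidy the whitespace.
    return " ".join(value.replace(word, "").split())


def passthrough(value, _param):
    # Assumption: "case-insensitive" only affects matching, so the value is kept.
    return value


FUNCTIONS = {"zero": zero, "del": delete, "case-insensitive": passthrough}


def apply_rules(row, config):
    # Each config entry is [field, function name, parameter].
    by_field = {f: (fn, p) for f, fn, p in config.values() if f != "*"}
    default = next(((fn, p) for f, fn, p in config.values() if f == "*"),
                   ("case-insensitive", 0))
    result = []
    for index, value in enumerate(row, start=1):  # fields assumed 1-based
        fn, param = by_field.get(index, default)
        result.append(FUNCTIONS[fn](value, param))
    return result


rows = list(csv.reader(io.StringIO("555-841-9087,top secret plan,,Alice\n")))
filtered = [apply_rules(row, CONFIG) for row in rows]
print(filtered)
```

The point of the design is that the SME only ever touches the small table of [field, function, parameter] entries; the functions themselves stay with the developer.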

In other words, if you are dealing with tabular data, why does it need any kind of path or selector mechanism (let alone schemas)?  The natural thing is to use the table's own names: if necessary, the spreadsheet column letters (A-Z, AA-ZZ) would be better than integer numbers, and the column header names would be better still.   The user needs to configure based on the presentation they see the data in and the particular tools they use, not some abstraction: for tabular data, the natural tool is the spreadsheet.
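As a rough sketch of that idea in Python, a field reference in the configuration could be resolved from whichever form the user sees: a column header name, a spreadsheet column letter, or a plain number. The function name column_index and the sample header are hypothetical.

```python
def column_index(ref, header):
    """Resolve a field reference to a 1-based column number.

    ref may be an integer, a header name found in `header`,
    or spreadsheet-style letters such as "A" or "AA".
    """
    if isinstance(ref, int):
        return ref
    if ref in header:
        # Header names win over letters if both could apply.
        return header.index(ref) + 1
    number = 0
    for ch in ref.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"unknown column reference: {ref!r}")
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number


header = ["name", "telephone", "amount"]
print(column_index("amount", header), column_index("B", header), column_index("AA", header))
```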

If you must use a selector, you can hide it as

{
   "CONTEXT-ID": "this-table-id",
   "fuzz": [
      1,
     "zero",
     ...

or whatever. 

2. Schematron
Another approach is to have the SME write the rules as a plain list, which you then translate into Schematron and write code to handle; the SME can then perhaps tweak the Schematron.

<sch:pattern id="FuzzyRules">
 <sch:rule context="telephone" > 
    <sch:p>Example: Fuzz the telephone number 555-841-9087 to 555-841-0000 </sch:p>
    <sch:report test="." role="fuzz" />
 </sch:rule>

 <sch:rule context="amount" > 
   <sch:p>Example: If the field labeled “amount” is empty, set it to 0</sch:p>
   <sch:report test="." role="zero"/>
 </sch:rule>

 <sch:rule context="text()" > 
   <sch:p>Example: Remove the word “secret” from the data </sch:p>
   <sch:report test="." role="redact"  properties="secret-word" />
 </sch:rule>
 
</sch:pattern>
...
<sch:properties>
 <sch:property id="secret-word">
   <s>secret</s>
 </sch:property>
</sch:properties>

Running this produces SVRL output with an svrl:successful-report element for each match, carrying the XPath of the matched node, the role (used to select some function), and any necessary parameters (e.g. the "s" element).
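A minimal Python sketch of the consuming side, assuming the SVRL has already been produced by some Schematron processor. The sample SVRL below is hand-written for illustration, not real processor output; actual location XPaths vary by implementation, and the function name actions_from_svrl is hypothetical.

```python
import xml.etree.ElementTree as ET

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Hand-written sample; a real report also contains fired-rule elements etc.
SAMPLE_SVRL = """\
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:successful-report test="." role="fuzz"
      location="/bribe-table/person[1]/telephone"/>
  <svrl:successful-report test="." role="zero"
      location="/bribe-table/person[1]/amount"/>
</svrl:schematron-output>
"""


def actions_from_svrl(svrl_text):
    """Return (role, location) pairs, one per successful-report."""
    root = ET.fromstring(svrl_text)
    return [
        (report.get("role"), report.get("location"))
        for report in root.iterfind(".//svrl:successful-report", SVRL_NS)
    ]


actions = actions_from_svrl(SAMPLE_SVRL)
for role, location in actions:
    # The role would select the filter function to run at that location.
    print(role, "->", location)
```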

The developer could flesh it out to make it remotely efficient:

<sch:pattern id="FuzzyRules">
 <sch:rule context="bribe-table/person/telephone" > 
    <sch:p>Example: Fuzz the telephone number 555-841-9087 to 555-841-0000 </sch:p>
    <sch:report test="." role="fuzz" />
 </sch:rule>

 <sch:rule context="bribe-table/person/amount" > 
   <sch:p>Example: If the field labeled “amount” is empty, set it to 0</sch:p>
   <sch:report test="string-length(.) = 0" role="zero"/>
 </sch:rule>

 <sch:rule context="bribe-table/person/*/text()" > 
   <sch:p>Example: Remove the word “secret” from the data </sch:p>
   <sch:report test="contains(., 'secret')" role="redact"  properties="secret-word" />
 </sch:rule>
 
</sch:pattern>
...
<sch:properties>
 <sch:property id="secret-word">
   <s>secret</s>
 </sch:property>
</sch:properties>
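To make the refined tests concrete, here is a rough Python equivalent of what the three rules select, written directly against a small document rather than through a Schematron processor. The sample document, the note element and the function name matches are invented for illustration.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<bribe-table><person>"
    "<telephone>555-841-9087</telephone>"
    "<amount></amount>"
    "<note>the secret meeting</note>"
    "</person></bribe-table>"
)


def matches(doc):
    """Return (role, element name) pairs mirroring the three report tests."""
    found = []
    for person in doc.iterfind("person"):
        for field in person:
            text = field.text or ""
            if field.tag == "telephone":               # test="."
                found.append(("fuzz", field.tag))
            if field.tag == "amount" and len(text) == 0:  # string-length(.) = 0
                found.append(("zero", field.tag))
            if "secret" in text:                       # contains(., 'secret')
                found.append(("redact", field.tag))
    return found


print(matches(DOC))
```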

Rick

On Tue, Oct 12, 2021 at 6:27 AM Roger L Costello <costello@mitre.org> wrote:

Hi Folks,

I need to perform basic filtering on data, such as:

  • Fuzz. Example: fuzz the telephone number 555-841-9087 to 555-841-0000
  • Redact. Example: remove the word “secret” from the data
  • Zero. Example: if the field labeled “amount” is empty, set it to 0

 

I need to create a language (or use some subset of an existing language) for expressing filter rules. The language must be usable by domain experts who aren’t necessarily computer savvy. The input data comes in a variety of formats: some is formatted as CSV, some as vCard, some as iCalendar, etc. Input data may be text or binary. In other words, the input data is not necessarily XML.

As I see it, there are three approaches to expressing filter rules, as exemplified by XSLT, CSS, and DFDL.

XSLT Approach to Expressing Filter Rules

In this approach filter rules are expressed in a filter rules document. The language used to express filter rules must provide a navigation/path language for navigating through the data to identify the data item to be filtered. XSLT uses this approach. Here’s a graphic that illustrates the approach:

CSS Approach to Expressing Filter Rules

In this approach filter rules are expressed in a filter rules document. Each data item in the input data has a unique identifier. Each filter rule specifies a unique identifier and the filter rule. CSS uses this approach. Here’s a graphic that illustrates the approach:

DFDL Approach to Expressing Filter Rules

In this approach the logical structure of the input data is described by an XML Schema and filter rules are expressed in annotations in the XML Schema. DFDL uses this approach. Here’s a graphic that illustrates the approach:

 
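As a rough comparison, the three approaches can be sketched in Python as three ways of looking up the filter for the same data item. Everything here (the record layout, the ids t1 and a1, the "filter" annotation key, the function names) is a hypothetical stand-in, not XSLT, CSS or DFDL themselves.

```python
RECORD = {
    "person": {
        "telephone": {"id": "t1", "value": "555-841-9087"},
        "amount": {"id": "a1", "value": ""},
    }
}

# 1. Path-addressed rules, kept in a separate rules document (XSLT-like).
PATH_RULES = {"person/telephone": "fuzz", "person/amount": "zero"}

# 2. ID-addressed rules, kept in a separate rules document (CSS-like).
ID_RULES = {"t1": "fuzz", "a1": "zero"}

# 3. Rules stored as annotations inside a description of the data (DFDL-like).
ANNOTATED_SCHEMA = {
    "person": {
        "telephone": {"type": "string", "filter": "fuzz"},
        "amount": {"type": "string", "filter": "zero"},
    }
}


def leaves(record):
    """Yield (path, group, name, item) for every data item."""
    for group, fields in record.items():
        for name, item in fields.items():
            yield f"{group}/{name}", group, name, item


def by_path(record):
    return {p: PATH_RULES[p] for p, _, _, _ in leaves(record) if p in PATH_RULES}


def by_id(record):
    return {p: ID_RULES[item["id"]]
            for p, _, _, item in leaves(record) if item["id"] in ID_RULES}


def by_schema(record):
    return {p: ANNOTATED_SCHEMA[g][n]["filter"]
            for p, g, n, _ in leaves(record)
            if "filter" in ANNOTATED_SCHEMA.get(g, {}).get(n, {})}


print(by_path(RECORD), by_id(RECORD), by_schema(RECORD))
```

In this toy form all three select the same filters; they differ in what the rule author has to know (paths, identifiers, or the schema) and in where the rules live.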

QUESTION #1: I think these three approaches are fundamentally different. Do you agree?

QUESTION #2: Are there other approaches which are fundamentally different from the three listed above?

Let’s now look at the advantages and disadvantages of each approach.

XSLT Approach to Expressing Filter Rules

Advantages

  • Global (bird’s-eye, top-down) perspective on the data. Consider this filter rule: fuzz the latitude/longitude data in the iPhone so that the person’s location is known only to within a 15-mile radius. Such a filter rule requires changes to both the latitude and the longitude data items; they cannot be changed individually, in isolation. This means that a higher-level view of the data is needed.
  • No need to add unique identifiers to the data items in the input.

 

Disadvantages

  • Requiring a navigation/path language means there are more things to learn. Not only must users learn the language for expressing filter rules, but they must also learn the navigation/path language. For users who are experts in the domain but not language experts, this might be a step too high.

 

CSS Approach to Expressing Filter Rules

Advantages

  • No navigation/path language needed.
  • Simple. Good for domain experts who are not language experts.

 

Disadvantages

  • Must add unique identifiers to the data items in the input. Uniquely identifying each data item in the input might not be possible.
  • Local/narrow view of the data.

 

DFDL Approach to Expressing Filter Rules

Advantages

  • No navigation/path language needed.
  • No need to add unique identifiers to the data items in the input.

 

Disadvantages

  • An XML Schema must be constructed to describe the input data and users need to understand the XML Schema to know where to place the filter rules. For users who are experts in the domain but not language experts, this might be a step too high.
  • Local/narrow view of the data.

 

QUESTION #3: Have I missed any advantages/disadvantages?

/Roger




Copyright 1993-2007 XML.org. This site is hosted by OASIS