Re: [xml-dev] Copying text from a source, then converting to XML


From: Daniel Gresh <dgresh@lle.rochester.edu>
To: xml-dev@lists.xml.org
Date: Fri, 14 Jul 2006 08:39:27 -0400

Mark Novembrino (novembri) wrote:

Hi, Daniel.
There are probably many ways to do this.
One way, perhaps a little crude, would be to use a text/macro editor to
process the files in batch mode first. I've often used Vedit for these
sorts of things. (http://www.vedit.com) The program doesn't do these
kinds of batch things out of the box. You'd have to write a macro, but
the macro language is easy to work with. Of course, you could also do
the same thing in Perl or another scripting language.
Once you've extracted the text you want from each file, the conversion to
XML is another matter. That would depend on *which* XML you mean, i.e.,
what DTD, what sort of text, what are the mapping rules you want to use
and how do you want to tag the resulting XML output. You could continue
to use the text editor for this sort of thing, or if you want a more
"official" method, use XSLT to do the transform.
Hope this helps.
I'm not sure of your level of programming expertise. If you need any more info
(and nobody else on the list comes up with any better answers), I'd be
glad to help with any small scripts/macros. I don't know Perl very well,
but I probably have some Vedit and/or VBScripts floating around
somewhere that could do the job.
- Mark Novembrino
-----Original Message-----
From: Daniel Gresh [mailto:dgresh@lle.rochester.edu]
Sent: Thursday, July 13, 2006 1:12 PM
To: xml-dev@lists.xml.org
Subject: [xml-dev] Copying text from a source, then converting to XML

I have a question about this. Some of the question may not pertain to XML, but if anyone knows a method, that'd be great.

Basically, I want to automatically search a large number of documents for certain keywords. When I find a keyword, I want the paragraph that contains it (not the whole page) to be copied and pasted somewhere. After that, I want to convert the pasted text to XML.

Does anyone know a method for doing either of these tasks? Copying certain paragraphs or substrings of text that have certain phrases in them, then converting to XML? Perhaps there is a script of some sort? Or a free program?
Any help would be appreciated.

You're going to have to forgive my lack of knowledge regarding the subject, but I am not all that familiar with XSLT. As for extracting the text, I've looked around a bit, and it does look like a script of some sort would be useful; I'll look around for an example before I try to make one from scratch.
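
For the extraction step, a minimal Python sketch might look like the following. It assumes the documents are already plain text and that paragraphs are separated by blank lines; the element names (extracts, paragraph) and the attribute names (source, keyword) are made up for illustration and do not come from any particular DTD.

```python
import re
import xml.etree.ElementTree as ET


def split_paragraphs(text):
    """Split plain text into paragraphs separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def matching_paragraphs(text, keywords):
    """Yield (keyword, paragraph) for each paragraph containing a keyword.

    Matching is a case-insensitive substring test.
    """
    for para in split_paragraphs(text):
        lowered = para.lower()
        for kw in keywords:
            if kw.lower() in lowered:
                yield kw, para
                break  # record each paragraph once, under the first keyword that hits


def build_xml(docs, keywords):
    """docs maps a document name to its text; returns an <extracts> element.

    The element and attribute names are hypothetical, chosen for this sketch.
    """
    root = ET.Element("extracts")
    for name, text in docs.items():
        for kw, para in matching_paragraphs(text, keywords):
            el = ET.SubElement(root, "paragraph", {"source": name, "keyword": kw})
            el.text = " ".join(para.split())  # collapse line breaks inside the paragraph
    return root


if __name__ == "__main__":
    sample = {"doc1.txt": "This paragraph mentions the keyword.\n\nThis one does not.\n"}
    print(ET.tostring(build_xml(sample, ["keyword"]), encoding="unicode"))
```

In practice the docs dictionary would be filled by reading each file from disk. Formats other than plain text (PDF, Word and so on) would need their own text-extraction step first.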

As for what type of XML I'm converting to, I guess I should have been a little more specific. I'm not even sure if this is possible, but I'm really crossing my fingers and hoping it is, because it would make this task a whole lot easier: I want to somehow extract the text and use it with an ontology built in RDF/OWL. Is that ... possible? I would guess that converting it directly to RDF/OWL format is impossible, because in OWL and RDF one needs to predefine the classes and such. Even so, I figured converting to an XML format would be a first step in the right direction.
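
As a rough illustration of what "using the text with an ontology" could mean at the XML level, here is a hedged Python sketch that writes extracted paragraphs out as RDF/XML, each one an individual of a class. The namespace http://example.org/onto# and the names Paragraph, source, keyword and text are hypothetical. In a real setup they would have to match classes and properties declared in the actual OWL ontology.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/onto#"  # hypothetical ontology namespace, for illustration only

# Ask ElementTree to serialize with readable prefixes instead of ns0, ns1, ...
ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)


def paragraphs_to_rdf(paragraphs):
    """Build an rdf:RDF element from (source, keyword, text) tuples.

    Each tuple becomes one typed node: an individual of the hypothetical
    class ex:Paragraph, with three literal-valued properties.
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    for i, (source, keyword, text) in enumerate(paragraphs, start=1):
        node = ET.SubElement(
            root,
            f"{{{EX}}}Paragraph",
            {f"{{{RDF}}}about": f"{EX}para{i}"},
        )
        ET.SubElement(node, f"{{{EX}}}source").text = source
        ET.SubElement(node, f"{{{EX}}}keyword").text = keyword
        ET.SubElement(node, f"{{{EX}}}text").text = text
    return root


if __name__ == "__main__":
    rdf = paragraphs_to_rdf([("doc1.txt", "keyword", "This paragraph mentions the keyword.")])
    print(ET.tostring(rdf, encoding="unicode"))
```

This only produces the instance data. The class and property definitions themselves would still live in the ontology file, which is the part that has to be designed by hand.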

I'm sort of digressing here, and I apologize, but I simply don't know where else to ask this: is there some way to extract large amounts of text from a large number of documents, then access it in some way by applying metadata to it and using RDF/OWL? Extracting the text can be accomplished with scripts, as mentioned earlier, or by using XSLT, although I am not familiar with that method. Putting it into an ontology is a different matter. I was thinking of organizing the text according to the keywords and areas I extract, and then using something to search through it, but that's not really what I need; I could just use XQuery or something similar for that. Does anyone have any thoughts? Again, I apologize for the off-topic subject; I just haven't found any other place to ask this.

Thanks for all the help,
Dan
