[comp.text.sgml] Re: how to locate hit using Opentext

	Best regards, -- Boris.

---- Begin included message ----
Jakob Fix wrote:
> We are using Opentext5 (Pat) to index and query our text collections.
> However, Pat knows only about the byte offset of a hit, not about the
> position of the hit within the SGML tree structure.

I'm assuming you have used sgmlregion5 as well as patbuild5, so you
do have a .P index of the markup as well as an index of the content.

> querying Pat would return the byte offset of "hit", say, 1234.  It would
> not return information about the position within the tree, such as,
>   "'hit' occurs in 2nd <P> in 1st <DIV2> in 1st <DIV1> ... "

This is exactly the problem we identified when we started the CURIA
(now CELT) project: how to return a canonical reference (either in
SGML markup terms, or interpreting that using STEP or STATE to return
"chapter and verse"). It is not easy, but we have a pilot working at

It was _very_ hard to explain to vendors that we did not want to
"search" as they envisaged it, where you already know the chapter
and verse, and you just want to find the 5th LG within the 4th DIV2.
The kind of search we are talking about here is from the inside
outwards, and it's why I was griping about search engines a few weeks
ago.

> or something similar.  We need this information to gradually expand the
> context around a hit, and also to locate a hit more precisely than just by
> byte offset.

That's it exactly.

> Are there any other Opentext users that have/had similar problems and
> found a solution?

Our script writes a PAT control file and then runs it through PAT,
finds the hit we want, logs the byte offset to a file, exits, munges
the byte offset into square brackets and adds more commands, then
runs PAT again, outputting a variety of things, but essentially the 
salient elements which we know we want for referencing. 

Example: search for "hand-bag" (in Wilde's _Importance_). This is how
it would look if done manually:

>> hand-bag
  1: 18 matches

>> pr
    49523, ../speaker><p>A hand-bag?</p></sp><pb n="466"><sp><speaker
    [etc for another 17 hits: pick the one we want]

>> [49523]
  2: one match

>> region SP incl %
  3: one match 

>> pr.region.SP
    49480, .. <sp><speaker>Lady Bracknell.</speaker><p>A hand-bag?</p>

>> page = region "<pb".."<pb"
  4: page = 535 matches

>> *page incl 3
    46866, ..<pb n="465"><sp><speaker>Jack.</speaker><p>Well, I own a
    [etc, the rest of the page, including the quote sought]

The trick is to output this lot to a file with {savefile "name"}
so it can be torn apart afterwards by the script. In fact the script
we use does a whole bunch of region FOO incl nnn commands to locate
and save to disk SP, STAGE, P, L, LG, DIV5..4..3..2..1, LIST, 
all we could want to identify the text and the quote. 
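
As a rough illustration of the "torn apart afterwards" step, here is a
minimal Python sketch (not the script we actually use) that pulls the
byte offset and the context string out of saved print output. It
assumes the saved file contains lines shaped like the ones in the
transcript above, i.e. leading whitespace, the offset, a comma, then
the text; the function name parse_hits and the regular expression are
purely illustrative.

```python
import re

# Assumed shape of a hit line, copied from the transcript above:
#     49523, ../speaker><p>A hand-bag?</p></sp>...
HIT_LINE = re.compile(r"^\s*(\d+),\s?(.*)$")


def parse_hits(saved_text):
    """Return (byte_offset, context) pairs found in saved PAT print output."""
    hits = []
    for line in saved_text.splitlines():
        match = HIT_LINE.match(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Sample input taken from the session shown earlier in this message.
sample = '''>> pr
    49523, ../speaker><p>A hand-bag?</p></sp><pb n="466"><sp><speaker
>> pr.region.SP
    49480, .. <sp><speaker>Lady Bracknell.</speaker><p>A hand-bag?</p>
'''

for offset, context in parse_hits(sample):
    print(offset, context)
```

Result-count lines such as "1: 18 matches" are skipped because the
number there is followed by a colon, not a comma.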

The BIG problem is, there is no way in PAT to do this recursively
for many hits: it only works for a single hit. To do it for many,
you'd have to write a script to create a control file for each hit,
and then process them all one after another.
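
A sketch of what such a wrapper might look like, in Python. It only
generates the control files, one per byte offset; how each file is fed
to PAT is left out because it depends on the local installation. The
PAT command lines are copied from the manual session above, but their
ordering, the placement of the {savefile ...} line, and the file names
(hit-NNN.pat, hit-NNN.out) are assumptions made for illustration, not
a description of our script.

```python
from pathlib import Path

# Region names mentioned in this message; the order is arbitrary.
REGIONS = ["SP", "STAGE", "P", "L", "LG",
           "DIV5", "DIV4", "DIV3", "DIV2", "DIV1", "LIST"]


def control_file_text(offset, regions=REGIONS):
    """Build control-file text for one hit, following the manual session."""
    # Assumption: the savefile command goes first and uses this syntax.
    lines = ['{savefile "hit-%d.out"}' % offset]
    for region in regions:
        # Re-select the hit each time so that % refers to it.
        lines.append("[%d]" % offset)
        lines.append("region %s incl %%" % region)
        lines.append("pr.region.%s" % region)
    return "\n".join(lines) + "\n"


def write_control_files(offsets, directory):
    """Write one control file per byte offset and return the paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for offset in offsets:
        path = directory / ("hit-%d.pat" % offset)
        path.write_text(control_file_text(offset))
        paths.append(path)
    return paths


if __name__ == "__main__":
    print(control_file_text(49523, ["SP"]), end="")
```

Each generated file would then be run through PAT in turn, and the
saved output for each hit parsed separately.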

A smaller problem is we haven't yet sorted out searching for multiple
words via the Web interface, so if you try it, it's hand-bag, not
hand fby bag. The script ends up by transforming the data into HTML,
so the user has a wait of about 30 secs, but gets a nice clean 
interface. We can't match Margaret Rutherford, though...

Basically, it works, but it's kludgy. OpenText were zero help: they
simply weren't interested, and now they're all hung up on webserver
indexing. PAT is a dead duck for serious research, which is a pity,
just as it was getting usable for it.

I have a draft paper on searching at
http://www.ucc.ie/celt/doc/searching.html which explains the problem
in more detail.

DTDs are not common knowledge because programming students are not
taught markup.  A markup language is not a programming language.
---- End included message ----