[Ferret-talk] Indexing an XML/HTML File

S D sd.codewarrior at gmail.com
Sat Apr 12 00:45:55 EDT 2008


I'm planning to index XML/HTML files. I only want to index the text
contained in the files, not the elements or tags. I just finished
reading Chapter 6 of "Ferret" (Balmain/O'Reilly), which presents a solution
to this problem: parse the XML/HTML and extract the text content with a
parser such as Hpricot. My concern is that this approach will not support
highlighting of the results (correct me if I'm wrong here), since the
indexed field will contain only the extracted text, without the elements
and tags needed to locate that text in the original document.

Question: wouldn't a better approach be to implement a tokenizer that
ignores XML/HTML tags but preserves the positions of the indexed terms in
the original markup? If this is indeed a good approach, does such a
solution already exist? If not, how can I contribute mine once I have
implemented it?
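
To make the idea concrete, here is a minimal sketch of the kind of
tokenizer I have in mind, written in plain Ruby with no Ferret calls. The
names Token, tag_skipping_tokens and highlight are my own placeholders, not
anything from Ferret or the book, and I have not checked how this would plug
into Ferret's analyzer interface. It also makes simplifying assumptions: a
tag is anything between "<" and the next ">", and entities, comments, CDATA
and script/style content are not handled.

```ruby
require 'strscan'

# Hypothetical token type: the term plus its character offsets in the
# ORIGINAL markup (end_offset is exclusive).
Token = Struct.new(:text, :start_offset, :end_offset)

# Hypothetical helper: return tokens for the words outside of tags.
# Tags are skipped but still advance the scanner, so offsets stay
# relative to the original string.
def tag_skipping_tokens(markup)
  scanner = StringScanner.new(markup)
  tokens = []
  until scanner.eos?
    if scanner.scan(/<[^>]*>/)
      # A tag: emit nothing, the scanner position has already moved on.
    elsif (word = scanner.scan(/[[:alnum:]]+/))
      finish = scanner.charpos
      tokens << Token.new(word.downcase, finish - word.length, finish)
    else
      scanner.getch # whitespace, punctuation, or a stray "<"
    end
  end
  tokens
end

# Hypothetical helper: wrap every occurrence of term in <b>...</b> inside
# the original markup. Working from the last match backwards keeps the
# earlier offsets valid while the string grows.
def highlight(markup, tokens, term)
  out = markup.dup
  tokens.select { |t| t.text == term.downcase }.reverse_each do |t|
    out.insert(t.end_offset, '</b>')
    out.insert(t.start_offset, '<b>')
  end
  out
end

markup = '<p>Hello <em>big</em> world</p>'
tokens = tag_skipping_tokens(markup)
puts highlight(markup, tokens, 'big')
```

Because each token carries offsets into the original markup, a match can be
highlighted in place without disturbing the surrounding tags, which is what
the extract-the-text-first approach seems to lose.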

Regards,
John
aka sd.codewarrior