[Ferret-talk] Need some information about Ferret
Lyes Amazouz
lyesjob at gmail.com
Mon Dec 1 05:36:09 EST 2008
Hello Erik
> Thanks for the feedback. If you don't mind elaborating further, what kind
> of documents are you indexing (database rows? file system files? other?),
> how many documents do you have, and how are you indexing it?
>
> Thanks,
>
> Erik
>
We are currently indexing file system files: HTML pages (85%), images (10%;
for these we index only the meta information), PDF (2%), Word (2%) and plain
text (1%). We have 100,000,000 documents to index, of which 10% is already
done. As for the last question, I didn't exactly understand what you mean by
"how are you indexing it". What I can say is that before we index documents
that are not plain text (PDF, Word and HTML), we run a content extraction
step (using pdftotext, antiword and the 'hpricot' Ruby library). We also
extract the metadata related to each document we index.
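
The extraction step described above could be sketched roughly as follows.
This is only an illustration, not the actual code in use: the EXTRACTORS
table and the helper names (extractor_for, strip_html, run_tool,
extract_text) are hypothetical, the file extensions are assumed, and a crude
regex-based tag stripper stands in for hpricot so the sketch needs nothing
beyond the Ruby standard library. It assumes pdftotext and antiword are on
the PATH.

```ruby
require "open3"

# Hypothetical mapping from file extension to an extraction strategy.
EXTRACTORS = {
  ".pdf"  => :pdf,
  ".doc"  => :word,
  ".html" => :html,
  ".htm"  => :html,
  ".txt"  => :text
}.freeze

# Pick a strategy from the file name; unknown types return nil.
def extractor_for(path)
  EXTRACTORS[File.extname(path).downcase]
end

# Crude tag stripping, standing in for a real HTML parser such as hpricot.
def strip_html(html)
  html.gsub(/<(script|style)\b.*?<\/\1>/mi, " ") # drop script/style bodies
      .gsub(/<[^>]+>/, " ")                      # drop remaining tags
      .gsub(/\s+/, " ")                          # collapse whitespace
      .strip
end

# Run an external command and return its stdout, or nil on failure.
def run_tool(*cmd)
  out, status = Open3.capture2(*cmd)
  status.success? ? out : nil
rescue Errno::ENOENT
  nil # tool not installed
end

# Return the plain text of a file, or nil if it cannot be extracted.
def extract_text(path)
  case extractor_for(path)
  when :pdf  then run_tool("pdftotext", path, "-") # "-" sends text to stdout
  when :word then run_tool("antiword", path)
  when :html then strip_html(File.read(path))
  when :text then File.read(path)
  end
end
```

The text returned by extract_text, together with the metadata, would then be
handed to the indexer as the fields of one document.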
--
===========
| Lyes Amazouz
| USTHB, Algiers
===========