[Ferret-talk] Handling Carriage Returns
S D
sd.codewarrior at gmail.com
Wed Apr 30 01:47:24 EDT 2008
That was it. Stupid mistake on my part.
Thanks!
John
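
For anyone finding this thread in the archive, here is a minimal sketch of why the two read methods give different offsets. It does not use Ferret at all: word_offsets is a hypothetical stand-in for a tokenizer, and it assumes (as the thread suggests) that an array passed as a field value is tokenized one element at a time.

```ruby
require "tempfile"

# Hypothetical stand-in for a tokenizer (NOT Ferret's API): returns
# [word, start_byte_offset] pairs for a single input string.
def word_offsets(str)
  result = []
  str.scan(/\S+/) do |word|
    # Bytes preceding the match give the start offset of this word.
    result << [word, Regexp.last_match.pre_match.bytesize]
  end
  result
end

# Writes the two-line example from the thread to a temp file and returns the
# start offset of "sentence" for each way of reading the file.
def compare_reads
  Tempfile.create("ferret_example") do |f|
    f.write("this is a\nsentence\n")
    f.flush

    # File.readlines returns one string per line; tokenizing each string
    # separately restarts the offsets at 0 on every line.
    per_line = File.readlines(f.path).flat_map { |line| word_offsets(line) }

    # File.read returns one string, so offsets are relative to the whole text.
    whole = word_offsets(File.read(f.path))

    { readlines: per_line.assoc("sentence")[1], read: whole.assoc("sentence")[1] }
  end
end

p compare_reads
```

With the array from File.readlines, "sentence" starts at offset 0 of its own line; with the single string from File.read it starts at offset 10 of the whole text, which is the position needed for highlighting.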
On Mon, Apr 28, 2008 at 6:37 AM, Jens Kraemer <jk at jkraemer.net> wrote:
> Hi,
>
> File.readlines returns an array which I think is the root cause of the
> problem.
> Just using File.read instead should solve your problem.
>
> Cheers,
> Jens
>
> On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> > It's my understanding that the tokens in a token_stream consist of text
> > along with start/stop positions that represent the byte positions of the
> > text within the corresponding document field. The documentation I've been
> > reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte
> > positions represent positions within the entire field but based on my
> > testing it appears that the byte positions are with respect to the line
> > that contains the corresponding text within the field. I read my fields
> > following Brian McCallister:
> >
> >     index.add_document :file => path,
> >                        :content => file.readlines
> >
> >
> > Hence, if I have a file that contains carriage returns, the token positions
> > will be reset with each new line. For example, the following file contents
> > (File A)
> >     this is a sentence
> > will result in a token for the text "sentence" with start position equal to
> > 10 (assume "this" starts in position 0) while a file with a carriage return
> >     this is a
> >     sentence
> > will result in a token for the text "sentence" with start position equal to
> > 0. I get the same results for my custom tokenizer as well as
> > StandardTokenizer. The above does not seem consistent with the documentation
> > but more importantly, it seems that global positions are more useful than
> > line-based positions (e.g., for highlighting).
> >
> > Digging a little deeper it seems that the tokenizer's initialize method is
> > called each time the token_stream method of the containing analyzer is
> > called:
> >
> >     class CustomAnalyzer
> >       def token_stream(field, str)
> >         ts = StandardTokenizer.new(str)
> >       end
> >     end
> >
> > Am I missing something here? Are the start/stop byte positions intended to
> > be with respect to the line? Is there a way for token_stream to only be
> > called once for an entire string sequence (even if carriage returns are
> > contained)?
> >
> > Thanks,
> > John
>
> > _______________________________________________
> > Ferret-talk mailing list
> > Ferret-talk at rubyforge.org
> > http://rubyforge.org/mailman/listinfo/ferret-talk
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/ - The new free film database