[Ferret-talk] Handling Carriage Returns
Jens Kraemer
jk at jkraemer.net
Mon Apr 28 06:37:21 EDT 2008
Hi,
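File.readlines returns an array of lines rather than a single string, which
I think is the root cause of the problem: each array element is probably
tokenized on its own, so the offsets start over with every line.
Just using File.read instead, which returns the whole file as one string,
should solve your problem.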
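As a rough illustration of the difference (plain Ruby only, no Ferret
involved; the temp file name and variable names are made up for this sketch),
compare where "sentence" sits within its own line versus within the whole
file:

```ruby
require "tempfile"

# Hypothetical sample file with the two-line content from the quoted message.
sample = Tempfile.new("ferret_sample")
sample.write("this is a\nsentence\n")
sample.close

lines = File.readlines(sample.path)  # Array: ["this is a\n", "sentence\n"]
text  = File.read(sample.path)       # String: "this is a\nsentence\n"

# Offset of "sentence" relative to its own line vs. the whole file.
offset_in_line = lines.last.index("sentence")  # 0
offset_in_file = text.index("sentence")        # 10

puts "within its line: #{offset_in_line}"
puts "within the file: #{offset_in_file}"

# With Ferret, the call from the quoted message would then become
# (not executed here, since it needs the ferret gem and an open index):
#   index.add_document :file => path, :content => File.read(path)

sample.unlink
```

If the line-based offsets you are seeing match the first number, that would
support the array being the cause.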
Cheers,
Jens
On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> It's my understanding that the tokens in a token_stream consist of text
> along with start/stop positions that represent the byte positions of the
> text within the corresponding document field. The documentation I've been
> reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte
> positions represent positions within the entire field but based on my
> testing it appears that the byte positions are with respect to the line that
> contains the corresponding text within the field. I read my fields following
> Brian McCallister:
>
>     index.add_document :file => path,
>                        :content => file.readlines
>
>
> Hence, if I have a file that contains carriage returns, the token positions
> will be reset with each new line. For example, the following file contents
> (File A)
>     this is a sentence
> will result in a token for the text "sentence" with start position equal to
> 10 (assume "this" starts in position 0) while a file with a carriage return
>     this is a
>     sentence
> will result in a token for the text "sentence" with start position equal to
> 0. I get the same results for my custom tokenizer as well as
> StandardTokenizer. The above does not seem consistent with the documentation
> but more importantly, it seems that global positions are more useful than
> line-based positions (e.g., for highlighting).
>
> Digging a little deeper it seems that the tokenizer's initialize method is
> called each time the token_stream method of the containing analyzer is
> called:
>
>     class CustomAnalyzer
>       def token_stream(field, str)
>         ts = StandardTokenizer.new(str)
>       end
>     end
>
> Am I missing something here? Are the start/stop byte positions intended to
> be with respect to the line? Is there a way for token_stream to only be
> called once for an entire string sequence (even if carriage returns are
> contained)?
>
> Thanks,
> John
--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database