From u.alberton at gmail.com Fri Apr 4 09:27:10 2008 From: u.alberton at gmail.com (Bira) Date: Fri, 4 Apr 2008 10:27:10 -0300 Subject: [Ferret-talk] Bug Report: Segmentation Fault when indexing with a specific set of FieldInfos. Message-ID: I'm submitting this through the mailing list because Trac won't let me use its bug report form... Is there some more appropriate way of submitting bugs if Trac doesn't work? This is the Trac error message: 500 Internal Server Error (Submission rejected as potential spam (IP 127.0.0.1 blacklisted by bsb.empty.us, sc.surbl.org, Maximum number of posts per hour for this IP exceeded)) And this is the bug description: I'm indexing e-mail messages, and using a specific FieldInfos configuration for this. Unfortunately, when given certain (spammy) messages using this configuration, Ferret segfaults. I've tested this in several places. In my local development environment, it works just fine. The segfaults happen in the remote EC2 servers used by the project. I managed to isolate a test case, that both makes the defect easier to see and proves this is a problem with Ferret as opposed to all the code that was layered on top of it. Here's the information on each environment I ran this with: '''My local environment''': * Linux 2.6.23-gentoo-r6 x86_64 AMD Athlon(tm) 64 Processor 3500+ AuthenticAMD GNU/Linux * ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux] * ferret (0.11.6) * Results: Test code runs without error. 
'''Remote Server 1''': * Linux 2.6.16-xenU SMP i686 GNU/Linux * ruby 1.8.6 (2007-09-23 patchlevel 110) [i686-linux] (compiled from source) * ferret (0.11.6) Results: /home/sonian/lib/ruby/gems/1.8/gems/ferret-0.11.6/lib/ferret/index.rb:298: [BUG] Segmentation fault ruby 1.8.6 (2007-09-23) [i686-linux] Aborted '''Remote Server 2''': * Linux 2.6.18-xenU-ec2-v1.0 SMP i686 GNU/Linux * ruby 1.8.6 (2008-03-03 patchlevel 114) [i486-linux] (installed through apt-get) * ferret (0.11.6) Results: *** stack smashing detected ***: ruby terminated ======= Backtrace: ========= /lib/libc.so.6(__fortify_fail+0x4b)[0xb7d8f81b] /lib/libc.so.6(__fortify_fail+0x0)[0xb7d8f7d0] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b6bb74] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b13a61] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(mb_lcf_next+0x23)[0xb7b11d13] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b11659] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b11e9e] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(dw_invert_field+0x134)[0xb7b40ab4] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(dw_add_doc+0xa8)[0xb7b40ff8] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(iw_add_doc+0x3a)[0xb7b4116a] /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b617ce] /usr/lib/libruby1.8.so.1.8[0xb7e88592] /usr/lib/libruby1.8.so.1.8[0xb7e90bbf] /usr/lib/libruby1.8.so.1.8[0xb7e90e78] /usr/lib/libruby1.8.so.1.8[0xb7e96dcf] /usr/lib/libruby1.8.so.1.8[0xb7e9b9b6] /usr/lib/libruby1.8.so.1.8[0xb7e971d5] /usr/lib/libruby1.8.so.1.8[0xb7e9b9b6] /usr/lib/libruby1.8.so.1.8[0xb7e971d5] /usr/lib/libruby1.8.so.1.8[0xb7e99d73] /usr/lib/libruby1.8.so.1.8[0xb7e90b0e] /usr/lib/libruby1.8.so.1.8[0xb7e90e78] /usr/lib/libruby1.8.so.1.8[0xb7e96f0b] /usr/lib/libruby1.8.so.1.8[0xb7e9a181] /usr/lib/libruby1.8.so.1.8[0xb7e99b38] /usr/lib/libruby1.8.so.1.8[0xb7e90b0e] /usr/lib/libruby1.8.so.1.8[0xb7e90e78] 
/usr/lib/libruby1.8.so.1.8[0xb7e96dcf] /usr/lib/libruby1.8.so.1.8[0xb7e9a181] /usr/lib/libruby1.8.so.1.8[0xb7e90b0e] /usr/lib/libruby1.8.so.1.8[0xb7e90e78] /usr/lib/libruby1.8.so.1.8[0xb7e96dcf] /usr/lib/libruby1.8.so.1.8[0xb7e9e857] /usr/lib/libruby1.8.so.1.8(ruby_exec+0x22)[0xb7e9e8a2] /usr/lib/libruby1.8.so.1.8(ruby_run+0x2f)[0xb7e9e8df] ruby[0x80486bd] /lib/libc.so.6(__libc_start_main+0xe0)[0xb7ccc450] ruby[0x8048601] -- Bira http://compexplicita.wordpress.com http://compexplicita.tumblr.com -------------- next part -------------- A non-text attachment was scrubbed... Name: ferret_test.tar.bz2 Type: application/x-bzip2 Size: 7696 bytes Desc: not available Url : http://rubyforge.org/pipermail/ferret-talk/attachments/20080404/b49d6af0/attachment-0001.bz2 From kraemer at webit.de Fri Apr 4 10:46:40 2008 From: kraemer at webit.de (Jens Kraemer) Date: Fri, 4 Apr 2008 16:46:40 +0200 Subject: [Ferret-talk] Bug Report: Segmentation Fault when indexing with a specific set of FieldInfos. In-Reply-To: References: Message-ID: <20080404144640.GE27735@cordoba.webit.de> I can confirm that: *** stack smashing detected ***: ruby terminated Environment: Ubuntu 7.10, 2.6.22-14-generic #1 SMP Tue Feb 12 07:42:25 UTC 2008 i686 GNU/Linux ruby 1.8.6 (2007-06-07 patchlevel 36) [i486-linux] Jens On Fri, Apr 04, 2008 at 10:27:10AM -0300, Bira wrote: > I'm submitting this through the mailing list because Trac won't let me > use its bug report form... Is there some more appropriate way of > submitting bugs if Trac doesn't work? > > This is the Trac error message: > > 500 Internal Server Error (Submission rejected as potential spam (IP > 127.0.0.1 blacklisted by bsb.empty.us, sc.surbl.org, Maximum number of > posts per hour for this IP exceeded)) > > And this is the bug description: > > I'm indexing e-mail messages, and using a specific FieldInfos > configuration for this. Unfortunately, when given certain (spammy) > messages using this configuration, Ferret segfaults. 
> > I've tested this in several places. In my local development > environment, it works just fine. The segfaults happen in the remote > EC2 servers used by the project. I managed to isolate a test case, > that both makes the defect easier to see and proves this is a problem > with Ferret as opposed to all the code that was layered on top of it. > > Here's the information on each environment I ran this with: > > '''My local environment''': > > * Linux 2.6.23-gentoo-r6 x86_64 AMD Athlon(tm) 64 Processor 3500+ > AuthenticAMD GNU/Linux > > * ruby 1.8.6 (2008-03-03 patchlevel 114) [x86_64-linux] > > * ferret (0.11.6) > > * Results: Test code runs without error. > > '''Remote Server 1''': > > * Linux 2.6.16-xenU SMP i686 GNU/Linux > > * ruby 1.8.6 (2007-09-23 patchlevel 110) [i686-linux] (compiled from source) > > * ferret (0.11.6) > > Results: > > /home/sonian/lib/ruby/gems/1.8/gems/ferret-0.11.6/lib/ferret/index.rb:298: > [BUG] Segmentation fault > ruby 1.8.6 (2007-09-23) [i686-linux] > > Aborted > > > '''Remote Server 2''': > > * Linux 2.6.18-xenU-ec2-v1.0 SMP i686 GNU/Linux > > * ruby 1.8.6 (2008-03-03 patchlevel 114) [i486-linux] (installed > through apt-get) > > * ferret (0.11.6) > > Results: > > *** stack smashing detected ***: ruby terminated > ======= Backtrace: ========= > /lib/libc.so.6(__fortify_fail+0x4b)[0xb7d8f81b] > /lib/libc.so.6(__fortify_fail+0x0)[0xb7d8f7d0] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b6bb74] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b13a61] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(mb_lcf_next+0x23)[0xb7b11d13] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b11659] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b11e9e] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(dw_invert_field+0x134)[0xb7b40ab4] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(dw_add_doc+0xa8)[0xb7b40ff8] > 
/var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so(iw_add_doc+0x3a)[0xb7b4116a] > /var/lib/gems/1.8/gems/ferret-0.11.6/lib/ferret_ext.so[0xb7b617ce] > /usr/lib/libruby1.8.so.1.8[0xb7e88592] > /usr/lib/libruby1.8.so.1.8[0xb7e90bbf] > /usr/lib/libruby1.8.so.1.8[0xb7e90e78] > /usr/lib/libruby1.8.so.1.8[0xb7e96dcf] > /usr/lib/libruby1.8.so.1.8[0xb7e9b9b6] > /usr/lib/libruby1.8.so.1.8[0xb7e971d5] > /usr/lib/libruby1.8.so.1.8[0xb7e9b9b6] > /usr/lib/libruby1.8.so.1.8[0xb7e971d5] > /usr/lib/libruby1.8.so.1.8[0xb7e99d73] > /usr/lib/libruby1.8.so.1.8[0xb7e90b0e] > /usr/lib/libruby1.8.so.1.8[0xb7e90e78] > /usr/lib/libruby1.8.so.1.8[0xb7e96f0b] > /usr/lib/libruby1.8.so.1.8[0xb7e9a181] > /usr/lib/libruby1.8.so.1.8[0xb7e99b38] > /usr/lib/libruby1.8.so.1.8[0xb7e90b0e] > /usr/lib/libruby1.8.so.1.8[0xb7e90e78] > /usr/lib/libruby1.8.so.1.8[0xb7e96dcf] > /usr/lib/libruby1.8.so.1.8[0xb7e9a181] > /usr/lib/libruby1.8.so.1.8[0xb7e90b0e] > /usr/lib/libruby1.8.so.1.8[0xb7e90e78] > /usr/lib/libruby1.8.so.1.8[0xb7e96dcf] > /usr/lib/libruby1.8.so.1.8[0xb7e9e857] > /usr/lib/libruby1.8.so.1.8(ruby_exec+0x22)[0xb7e9e8a2] > /usr/lib/libruby1.8.so.1.8(ruby_run+0x2f)[0xb7e9e8df] > ruby[0x80486bd] > /lib/libc.so.6(__libc_start_main+0xe0)[0xb7ccc450] > ruby[0x8048601] > > > -- > Bira > http://compexplicita.wordpress.com > http://compexplicita.tumblr.com > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk -- Jens Kr?mer webit! 
Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold

From davidj503 at gmail.com Mon Apr 7 12:26:18 2008
From: davidj503 at gmail.com (David James)
Date: Mon, 7 Apr 2008 12:26:18 -0400
Subject: [Ferret-talk] patch to warn that "ferret_server appears to be already running"
Message-ID: <59b5d4330804070926r2234e840rf3e1c4e2c6c7b7a8@mail.gmail.com>

Hi,

I added the following two lines to the top of the start method in
ferret_server.rb (acts_as_ferret)

    pid = read_pid_file
    raise "ferret_server appears to be already running" if pid

Without this, I found that:

1. I could do 'script/ferret_start -e production start' multiple times
   without getting a warning message.
2. Running script/ferret_start... a 2nd time managed to kill an already
   existing pid file, but left the old process running.
3. My capistrano deployments were sometimes confusing (based on the
   above two items)

-David

From davidj503 at gmail.com Mon Apr 7 14:10:53 2008
From: davidj503 at gmail.com (David James)
Date: Mon, 7 Apr 2008 14:10:53 -0400
Subject: [Ferret-talk] patch to warn that "ferret_server appears to be already running"
In-Reply-To: <59b5d4330804070926r2234e840rf3e1c4e2c6c7b7a8@mail.gmail.com>
References: <59b5d4330804070926r2234e840rf3e1c4e2c6c7b7a8@mail.gmail.com>
Message-ID: <59b5d4330804071110h4203b288i7084593a2b710e95@mail.gmail.com>

Update: I now prefer to use $stdout.puts instead of 'raise' -- because
I don't consider this situation to be exceptional -- and I don't want
Capistrano to treat this situation as an error and halt.
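
A minimal sketch of the non-raising variant described in the update, assuming the pid file holds a single integer. The method names and the pid_file argument here are hypothetical and are not acts_as_ferret's actual code. The liveness check via Process.kill(0, pid) is an addition beyond the two-line patch: it would let a stale pid file left by a crashed server be ignored.

```ruby
# Hypothetical helpers for illustration only; not acts_as_ferret's API.

# Returns the pid stored in pid_file, or nil if the file is missing or empty.
def read_pid_file(pid_file)
  return nil unless File.exist?(pid_file)
  content = File.read(pid_file).strip
  content.empty? ? nil : content.to_i
end

# True if a process with this pid currently exists.
# Signal 0 performs the existence/permission check without sending a signal.
def process_alive?(pid)
  Process.kill(0, pid)
  true
rescue Errno::ESRCH
  false # no such process: the pid file is stale
rescue Errno::EPERM
  true  # the process exists but belongs to another user
end

# Prints a warning (instead of raising) when a server seems to be running.
# Returns :already_running or :ok_to_start so the caller can decide what to do.
def check_before_start(pid_file, out = $stdout)
  pid = read_pid_file(pid_file)
  if pid && process_alive?(pid)
    out.puts "ferret_server appears to be already running (pid #{pid})"
    :already_running
  else
    :ok_to_start
  end
end
```

Because nothing is raised, a deployment task calling such a start script would see a normal exit and carry on, which seems to be the behaviour wanted for Capistrano.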
On Mon, Apr 7, 2008 at 12:26 PM, David James wrote: > Hi, > > I added the following two lines to the top of the start method in > ferret_server.rb (acts_as_ferret) > > pid = read_pid_file > raise "ferret_server appears to be already running" if pid From mrj at bigpond.net.au Thu Apr 10 11:33:42 2008 From: mrj at bigpond.net.au (Mark Reginald James) Date: Fri, 11 Apr 2008 01:33:42 +1000 Subject: [Ferret-talk] Which field(s) matched? In-Reply-To: <47E8C6E2.5000202@ebi.ac.uk> References: <47E8C6E2.5000202@ebi.ac.uk> Message-ID: Robert Hulme wrote: > Is there a way to discover which fields in a document matched the search > that was performed? > > I've just started with Ferret, but so far I can only get back the ids of > the documents that were matched (along with the score). > > I'm aware there is a highlighting method, but I was hoping for something > computer readable (unless I'm misunderstanding what data can be gained > from that method). I've implemented a field_scores hash that can be returned along with each document and aggregate score. It's only currently working in Ferret 0.11.4. From jeff at boowebb.com Thu Apr 10 12:36:32 2008 From: jeff at boowebb.com (Jeff Webb) Date: Thu, 10 Apr 2008 09:36:32 -0700 Subject: [Ferret-talk] more_like_this Message-ID: <67f7f6410804100936o553f2d6aoc3c56d96ed1fb061@mail.gmail.com> Does anyone have an example of how more_like_this works in acts_as_ferret? I've tried the following..... @results = Post.find_by_contents params[:q] @mlt = @results.more_like_this :field_names => [:subject,:body] but get errors saying "more_like_this" is not a method of @results. -- Jeff Webb jeff at boowebb.com http://boowebb.com/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080410/689ce68e/attachment.html

From jk at jkraemer.net Thu Apr 10 18:08:26 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Fri, 11 Apr 2008 00:08:26 +0200
Subject: [Ferret-talk] more_like_this
In-Reply-To: <67f7f6410804100936o553f2d6aoc3c56d96ed1fb061@mail.gmail.com>
References: <67f7f6410804100936o553f2d6aoc3c56d96ed1fb061@mail.gmail.com>
Message-ID: <20080410220826.GG11666@thunder.jkraemer.net>

Hi!

more_like_this works on a single record, not on the whole results list.

Cheers,
Jens

On Thu, Apr 10, 2008 at 09:36:32AM -0700, Jeff Webb wrote:
> Does anyone have an example of how more_like_this works in acts_as_ferret?
> I've tried the following.....
>
> @results = Post.find_by_contents params[:q]
> @mlt = @results.more_like_this :field_names => [:subject,:body]
>
> but get errors saying "more_like_this" is not a method of @results.
>
> --
>
> Jeff Webb
> jeff at boowebb.com
> http://boowebb.com/
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From jk at jkraemer.net Thu Apr 10 18:10:19 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Fri, 11 Apr 2008 00:10:19 +0200
Subject: [Ferret-talk] patch to warn that "ferret_server appears to be already running"
In-Reply-To: <59b5d4330804071110h4203b288i7084593a2b710e95@mail.gmail.com>
References: <59b5d4330804070926r2234e840rf3e1c4e2c6c7b7a8@mail.gmail.com> <59b5d4330804071110h4203b288i7084593a2b710e95@mail.gmail.com>
Message-ID: <20080410221019.GH11666@thunder.jkraemer.net>

Hi,

acts_as_ferret's current trunk already has something similar in
lib/unix_daemon.rb

Cheers,
Jens

On Mon, Apr 07, 2008 at 02:10:53PM -0400, David James wrote:
> Update: I now prefer to use $stdout.puts instead of 'raise'
-- because
> I don't consider this situation to be exceptional -- and I don't want
> Capistrano to treat this situation as an error and halt.
>
> On Mon, Apr 7, 2008 at 12:26 PM, David James wrote:
> > Hi,
> >
> > I added the following two lines to the top of the start method in
> > ferret_server.rb (acts_as_ferret)
> >
> > pid = read_pid_file
> > raise "ferret_server appears to be already running" if pid
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From jeff at boowebb.com Thu Apr 10 20:34:10 2008
From: jeff at boowebb.com (Jeff Webb)
Date: Thu, 10 Apr 2008 17:34:10 -0700
Subject: [Ferret-talk] more_like_this
In-Reply-To: <20080410220826.GG11666@thunder.jkraemer.net>
References: <67f7f6410804100936o553f2d6aoc3c56d96ed1fb061@mail.gmail.com> <20080410220826.GG11666@thunder.jkraemer.net>
Message-ID: <67f7f6410804101734r3771cd3fy3c0239e50304bc9e@mail.gmail.com>

big fat "duh" on my part - thanks for the quick response. :-)

On Thu, Apr 10, 2008 at 3:08 PM, Jens Kraemer wrote:
> Hi!
>
> more_like_this works on a single record, not on the whole results list.
>
>
> Cheers,
> Jens
>
> On Thu, Apr 10, 2008 at 09:36:32AM -0700, Jeff Webb wrote:
> > Does anyone have an example of how more_like_this works in
> acts_as_ferret?
> > I've tried the following.....
> >
> > @results = Post.find_by_contents params[:q]
> > @mlt = @results.more_like_this :field_names => [:subject,:body]
> >
> > but get errors saying "more_like_this" is not a method of @results.
> >
> > --
> >
> > Jeff Webb
> > jeff at boowebb.com
> > http://boowebb.com/
> >
> _______________________________________________
> > Ferret-talk mailing list
> > Ferret-talk at rubyforge.org
> > http://rubyforge.org/mailman/listinfo/ferret-talk
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/ - The new free film database
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>

--
Jeff Webb
jeff at boowebb.com
http://boowebb.com/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080410/91c983cd/attachment.html

From john at digitalpulp.com Fri Apr 11 16:09:04 2008
From: john at digitalpulp.com (John Bachir)
Date: Fri, 11 Apr 2008 16:09:04 -0400
Subject: [Ferret-talk] Trunk and single table inheritance
References: <8FB04381-C1C2-4712-BCE7-727F191AA921@digitalpulp.com>
Message-ID:

My colleague Josh was having a hard time getting this email through
either the web or email interfaces, so I'm forwarding it along.
-------------
Hi all,

I'm getting discrepancies between stable and trunk when doing a
search involving single-table inheritance and I can't locate the
problem.

I'm running Rails 2.0.2 with the following (simplified) schema:

    class Model < ActiveRecord::Base
      acts_as_ferret
    end

    class ModelSubClass < Model
    end

Under AAF/tags/stable, Model.find_by_contents('*') returns all Models
and all ModelSubClasses. Under AAF/trunk, the same query only returns
Models. Has AAF changed its STI model?

The API also indicates that find_by_contents no longer takes
the :models param, and including it does not change the output in
either case -- is there a new way to search across multiple classes?
If I could manually force the query to look at Model and all
subclasses, I could live with the workaround.
Thanks,
Josh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080411/1d6e4d9e/attachment-0001.html

From sd.codewarrior at gmail.com Sat Apr 12 00:45:55 2008
From: sd.codewarrior at gmail.com (S D)
Date: Sat, 12 Apr 2008 00:45:55 -0400
Subject: [Ferret-talk] Indexing an XML/HTML File
Message-ID: <4f10e2890804112145o175a5860jb2b4e67e3d5aa9ee@mail.gmail.com>

I'm planning on indexing XML/HTML files. I only want to index the text
contained in the files and not any of the elements or tags. I just
finished reading Chapter 6 of "Ferret" (Balmain/O'Reilly) that presented
a solution for this issue. The essence of the solution was to parse the
XML/HTML and extract the text content using a parser such as Hpricot. My
concern is that this approach will not support highlighting of the
results [correct me if I'm wrong here] since the corresponding indexed
field will only contain text without the elements and tags that are
necessary to indicate the position of the text.

Question: wouldn't a better approach be to implement a tokenizer that
ignores XML/HTML tags and preserves the positions of the appropriately
indexed items? If this is indeed an ideal approach does such a solution
exist or, alternatively, how can I contribute when I implement it?

Regards,
John aka sd.codewarrior
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080412/4884f333/attachment.html

From jk at jkraemer.net Sat Apr 12 03:48:26 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Sat, 12 Apr 2008 09:48:26 +0200
Subject: [Ferret-talk] Trunk and single table inheritance
In-Reply-To:
References: <8FB04381-C1C2-4712-BCE7-727F191AA921@digitalpulp.com>
Message-ID: <20080412074826.GT11666@thunder.jkraemer.net>

Hi,

a lot has happened in aaf trunk since 0.4.3 was released, so this
might well be a bug I introduced recently.
I'll look into this. Could you please file a ticket on trac?

Cheers,
Jens

On Fri, Apr 11, 2008 at 04:09:04PM -0400, John Bachir wrote:
> My colleague Josh was having a hard time getting this email through
> either the web or email interfaces, so I'm forwarding it along.
> -------------
> Hi all,
>
> I'm getting discrepancies between stable and trunk when doing a
> search involving single-table inheritance and I can't locate the
> problem.
>
> I'm running Rails 2.0.2 with the following (simplified) schema:
>
> class Model < ActiveRecord::Base
> acts_as_ferret
> end
>
> class ModelSubClass < Model
> end
>
> Under AAF/tags/stable, Model.find_by_contents('*') returns all Models
> and all ModelSubClasses. Under AAF/trunk, the same query only returns
> Models. Has AAF changed its STI model?
>
> The API also indicates that find_by_contents no longer takes
> the :models param, and including it does not change the output in
> either case -- is there a new way to search across multiple classes?
> If I could manually force the query to look at Model and all
> subclasses, I could live with the workaround.
>
> Thanks,
> Josh
>
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From hulme at ebi.ac.uk Mon Apr 14 05:43:17 2008
From: hulme at ebi.ac.uk (Robert Hulme)
Date: Mon, 14 Apr 2008 10:43:17 +0100
Subject: [Ferret-talk] Which field(s) matched?
In-Reply-To:
References: <47E8C6E2.5000202@ebi.ac.uk>
Message-ID: <48032735.5050202@ebi.ac.uk>

> I've implemented a field_scores hash that can be returned along with
> each document and aggregate score. It's only currently working in
> Ferret 0.11.4.

Could you please give a code example of how to do a query and get the
field_scores hash?

I have Ferret 0.11.6.
-Rob

From toastkid.williams at gmail.com Mon Apr 14 11:58:34 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Mon, 14 Apr 2008 16:58:34 +0100
Subject: [Ferret-talk] find_with_ferret :multi not working for me (latest version) (apologies if this was already posted)
Message-ID:

First of all, sorry if this appears in the list twice - i was awaiting
moderation all day and then finally joined the list properly and sent
this direct.

Previously, for all my ferret searches, i was using find_by_contents,
like this:

    @stuff = LearningObject.find_by_contents("trumpet",
      #ferret options
      {:multi => [TeachingObject, LearningObject, Lesson, Course],
       :page => 1,
       :per_page => 15 },
      #find options
      {} )

That all worked fine. We've upgraded to the latest ferret/a_a_f, though,
which no longer uses find_by_contents: i believe we're supposed to use
find_with_ferret instead. However, if i do the equivalent, with
find_by_contents replaced with find_with_ferret, then the 'multi' part
doesn't work: i get the same results as if i didn't pass multi at all.

Looking in the API ( http://projects.jkraemer.net/rdoc/acts_as_ferret/ ),
it looks like find_with_ferret doesn't take a :multi option, so i'm in
the dark over how to do multi-model searches now. My ferret-indexed
classes all have ":store_class_name => true", which i read was necessary
for multi searches.

Can anyone help please? (To make life a bit more complicated, two of the
classes extend one of the others, and i'd like a ferret search on the
superclass to return results from both subclasses. However, i'd settle
right now for just having a multi-model search that works)

thanks
Max Williams
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080414/9c7ffd3f/attachment.html

From jk at jkraemer.net Mon Apr 14 16:24:25 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Mon, 14 Apr 2008 22:24:25 +0200
Subject: [Ferret-talk] find_with_ferret :multi not working for me (latest version) (apologies if this was already posted)
In-Reply-To:
References:
Message-ID: <20080414202424.GB11666@thunder.jkraemer.net>

Hi!

what you're experiencing is a result of the recent refactorings I did to
the aaf code base. Sorry for the inconvenience this has caused you - I
didn't find the time to document this properly yet.

I moved the multi search functionality from the class level methods
(like find_with_ferret) into the ActsAsFerret namespace. It has always
been a bit inconsistent calling find_with_ferret on one class, passing
in any other classes to search in via the :multi option.

To get multi search back, you should use ActsAsFerret::find like this:

    @stuff = ActsAsFerret::find(
      "trumpet",
      [TeachingObject, LearningObject, Lesson, Course],
      { :page => 1, :per_page => 15 },
      {} # find options
    )

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From mrj at bigpond.net.au Mon Apr 14 22:26:42 2008
From: mrj at bigpond.net.au (Mark Reginald James)
Date: Tue, 15 Apr 2008 12:26:42 +1000
Subject: [Ferret-talk] Which field(s) matched?
In-Reply-To: <48032735.5050202@ebi.ac.uk>
References: <47E8C6E2.5000202@ebi.ac.uk> <48032735.5050202@ebi.ac.uk>
Message-ID:

Robert Hulme wrote:
>> I've implemented a field_scores hash that can be returned along with
>> each document and aggregate score. It's only currently working in
>> Ferret 0.11.4.
> Could you please give a code example of how to do a query and get the
> field_scores hash?
>
> I have Ferret 0.11.6.
Rob,

FerretResult objects are given a ferret_field_scores attribute, which is
an array (not a hash, as I said above) of field_score objects, in order
of decreasing score. Field_score objects have "field" and "score"
attributes.

You specify which fields you want to see the field scores for by adding
the field_info option :field_score => :yes.

You can use field_scores to display an extract of the field that best
matches the user's query:

    @result.highlight @query, :field => @result.ferret_field_scores.first.field

It'd take some work to port it to Ferret 0.11.6, though I will do this
soon.

From shiraskar.pravin at gmail.com Tue Apr 15 04:09:49 2008
From: shiraskar.pravin at gmail.com (pravin shiraskar)
Date: Tue, 15 Apr 2008 13:39:49 +0530
Subject: [Ferret-talk] ferret (search application)
Message-ID: <6a58c8cf0804150109p5e6a6458u6383e8390e3f04a4@mail.gmail.com>

hi,

This is pravin. i am trying to make search application in ROR for that i
have installed ferret & acts_as_ferret plugin but , i got the following
error

Processing SearchController#search (for 127.0.0.1 at 2008-04-15 12:49:54) [GET]
  Session ID: ae2547277cd00cf930f79afdbc8c8b61
  Parameters: {"action"=>"search", "q"=>"Tata Consultancy Services Pune", "controller"=>"search"}
Asked for a remote server ? true, ENV["FERRET_USE_LOCAL_INDEX"] is nil, looks like we are not the server
Will use local index.
using index in ./script/../config/../index/development/detail
default field list: [:name]
Query: Tata Consultancy Services Pune
total hits: 0, results delivered: 0
Rendering search/search
Rendered search/_search_form (0.00167)
Completed in 0.00878 (113 reqs/sec) | Rendering: 0.00243 (27%) | 200 OK [ http://localhost/search/search?q=Tata+Consultancy+Services+Pune]

pravin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080415/9ff7a396/attachment.html

From toastkid.williams at gmail.com Tue Apr 15 04:50:53 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Tue, 15 Apr 2008 09:50:53 +0100
Subject: [Ferret-talk] find_with_ferret :multi not working for me (latest version) (apologies if this was already posted)
In-Reply-To: <20080414202424.GB11666@thunder.jkraemer.net>
References: <20080414202424.GB11666@thunder.jkraemer.net>
Message-ID:

Hi Jens

That works great, thanks a lot! And thanks generally for a_a_f, it's
invaluable for us.

max

On 14/04/2008, Jens Kraemer wrote:
>
> Hi!
>
> what you're experiencing is a result of the recent refactorings I did to
> the aaf code base. Sorry for the inconvenience this has caused you - I
> didn't find the time to document this properly yet.
>
> I moved the multi search functionality from the class level methods
> (like find_with_ferret) into the ActsAsFerret namespace. It has always
> been a bit inconsistent calling find_with_ferret on one class, passing
> in any other classes to search in via the :multi option.
>
> To get multi search back, you should use ActsAsFerret::find like this:
>
> @stuff = ActsAsFerret::find(
> "trumpet",
>
> [TeachingObject, LearningObject, Lesson, Course],
> { :page => 1, :per_page => 15 },
> {} # find options
>
> )
>
>
> Cheers,
> Jens
>
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/ - The new free film database
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080415/cfd38ccf/attachment.html From toastkid.williams at gmail.com Tue Apr 15 06:12:48 2008 From: toastkid.williams at gmail.com (Max Williams) Date: Tue, 15 Apr 2008 11:12:48 +0100 Subject: [Ferret-talk] find_with_ferret :multi not working for me (latest version) (apologies if this was already posted) In-Reply-To: References: <20080414202424.GB11666@thunder.jkraemer.net> Message-ID: I just remembered the second part of my last question... I have a superclass, Resource, which has two subclasses, TeachingObject and LearningObject. All the records that are saved are one of the subclasses. However, i'd like to be able to do a ferret search on Resource and get both kinds of subclass, like when i do Resource.find(), which returns both kinds. I'm having problems with the index though: TeachingObject and LearningObject both have :store_class_name => true, and they have their own indexes. When i try to build the index for Resource, it crashes with the following trace, whether i have :store_class_name => true set for Resource or not. Any ideas, anyone? thanks, max NoMethodError: You have a nil object when you didn't expect it! 
The error occurred while evaluating nil.each_pair from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/instance_methods.rb:130:in `to_doc' from /home/jars/rails/lesson_planner/branches/bundles/vendor/rails/activerecord/lib/../../activesupport/lib/active_support/core_ext/object/misc.rb:28:in `returning' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/instance_methods.rb:124:in `to_doc' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/bulk_indexer.rb:19:in `index_records' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/bulk_indexer.rb:19:in `each' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/bulk_indexer.rb:19:in `index_records' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/bulk_indexer.rb:29:in `measure_time' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/bulk_indexer.rb:18:in `index_records' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/ferret_extensions.rb:52:in `index_model' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/class_methods.rb:79:in `records_for_rebuild' from /home/jars/rails/lesson_planner/branches/bundles/vendor/rails/activerecord/lib/active_record/connection_adapters/abstract/database_statements.rb:66:in `transaction' from /home/jars/rails/lesson_planner/branches/bundles/vendor/rails/activerecord/lib/active_record/transactions.rb:80:in `transaction' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/class_methods.rb:74:in `records_for_rebuild' from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/ferret_extensions.rb:51:in `index_model' from 
/home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/ferret_extensions.rb:39:in `index_models'
from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/ferret_extensions.rb:39:in `each'
from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/ferret_extensions.rb:39:in `index_models'
from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/local_index.rb:54:in `rebuild_index'
from /home/jars/rails/lesson_planner/branches/bundles/vendor/plugins/acts_as_ferret/lib/class_methods.rb:28:in `rebuild_index'

From toastkid.williams at gmail.com Thu Apr 17 06:26:24 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Thu, 17 Apr 2008 11:26:24 +0100
Subject: [Ferret-talk] Multi select with conditions
Message-ID:

Hi

I'm doing a multi-model search like this, which works absolutely fine:

    @search_results = ActsAsFerret::find(
      "trumpet",
      [Resource, Lesson, Course],
      #(ferret) options
      { :page => 1, :per_page => 20,
        :sort => Ferret::Search::SortField.new(:name, :reverse => false) },
      #find options
      {}
    )

My new problem is that the results need to be restricted: basically any user has only limited access to certain lessons, resources and courses, depending on their shared privileges. I have methods to produce arrays of ids of each that the user is allowed to see, eg

    @user.allowed_lessons => array of ids of all lessons the user has access to

Can anyone show me how to incorporate this into my search? I have the feeling that i can do something with the :conditions option, which can be passed through in the find options hash. But i can't work it out. I tried this:

    :conditions => ["lesson.id in (?) or course.id in (?)
    or resource.id in (?)", @user.allowed_lessons, @user.allowed_courses, @user.allowed_resources]

But it doesn't work: it says 'no such field as lesson.id in resources'. I feel like this should be possible - can anyone help?

thanks
max

From toastkid.williams at gmail.com Mon Apr 21 12:25:43 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Mon, 21 Apr 2008 17:25:43 +0100
Subject: [Ferret-talk] ignoring 'accents' when i search
Message-ID:

Does anyone know of a way of being 'accent-insensitive' when i do a search?

For example, if i have a resource with the name "La Bohème", and someone searches for 'boheme' i want them to find that resource, even though the 'e' doesn't have the accent. At the moment, it will only find it if they search for the properly accented version.

I guess soundex support for ferret is what I mean, but maybe there's another way?

thanks, max

From toastkid.williams at gmail.com Mon Apr 21 12:49:43 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Mon, 21 Apr 2008 17:49:43 +0100
Subject: [Ferret-talk] ignoring 'accents' when i search
In-Reply-To:
References:
Message-ID:

I just discovered the rather handy fuzzy searches, which i can do by adding (eg) "~0.6" to the end of my search term. So, this does the job (yay), but i'd still be interested in hearing if anyone else has solved this problem in a different way. :)

On 21/04/2008, Max Williams wrote:
>
> Does anyone know of a way of being 'accent-insensitive' when i do a
> search?
>
> For example, if i have a resource with the name "La Bohème", and someone
> searches for 'boheme' i want them to find that resource, even though the 'e'
> doesn't have the accent. At the moment, it will only find it if they search
> for the properly accented version.
>
> I guess soundex support for ferret is what I mean, but maybe there's
> another way?
>
> thanks, max
>

From jk at jkraemer.net Mon Apr 21 15:41:54 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Mon, 21 Apr 2008 21:41:54 +0200
Subject: [Ferret-talk] ignoring 'accents' when i search
In-Reply-To:
References:
Message-ID: <20080421194154.GC3723@thunder.jkraemer.net>

Hi!

You might create a custom Analyzer that does the job of replacing accentuated characters with their non-accentuated counterparts. If you apply this kind of analysis to both indexed content and queries, you'll find "La Bohème" with both 'boheme' and 'bohème' as the query string.

there's a sample method that does the replacement part of the job up on the aaf wiki: http://projects.jkraemer.net/acts_as_ferret/#UTF-8support

Have a look at the analyzer used in the omdb project for a more complete example:
https://svn.omdb-beta.org/trunk/lib/omdb/ferret/omdb_analyzer.rb

Cheers,
Jens

On Mon, Apr 21, 2008 at 05:49:43PM +0100, Max Williams wrote:
> I just discovered the rather handy fuzzy searches, which i can do by adding
> (eg) "~0.6" to the end of my search term. So, this does the job (yay), but
> i'd still be interested in hearing if anyone else has solved this problem in
> a different way. :)
>
> On 21/04/2008, Max Williams wrote:
> >
> > Does anyone know of a way of being 'accent-insensitive' when i do a
> > search?
> >
> > For example, if i have a resource with the name "La Bohème", and someone
> > searches for 'boheme' i want them to find that resource, even though the 'e'
> > doesn't have the accent. At the moment, it will only find it if they search
> > for the properly accented version.
> >
> > I guess soundex support for ferret is what I mean, but maybe there's
> > another way?
> >
> > thanks, max
> >
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From toastkid.williams at gmail.com Tue Apr 22 05:12:50 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Tue, 22 Apr 2008 10:12:50 +0100
Subject: [Ferret-talk] ignoring 'accents' when i search
In-Reply-To: <20080421194154.GC3723@thunder.jkraemer.net>
References: <20080421194154.GC3723@thunder.jkraemer.net>
Message-ID:

That's very useful, thanks! I'm just using the fuzzy search for now, but if it proves too vague (too many false positive results) then i'll look at this. I'd actually never seen that tr() method before, that combined with the ready-made accent substitutions in your link is itself very handy!

cheers,
max

On 21/04/2008, Jens Kraemer wrote:
>
> Hi!
>
> You might create a custom Analyzer that does the job of replacing
> accentuated characters with their non-accentuated counterparts. If you
> apply this kind of analysis to both indexed content and queries, you'll
> find "La Bohème" with both 'boheme' and 'bohème' as the query string.
>
> there's a sample method that does the replacement part of the job up on
> the aaf wiki: http://projects.jkraemer.net/acts_as_ferret/#UTF-8support
>
> Have a look at the analyzer used in the omdb project for a more complete
> example:
> https://svn.omdb-beta.org/trunk/lib/omdb/ferret/omdb_analyzer.rb
>
> Cheers,
> Jens
>
>
> On Mon, Apr 21, 2008 at 05:49:43PM +0100, Max Williams wrote:
> > I just discovered the rather handy fuzzy searches, which i can do by adding
> > (eg) "~0.6" to the end of my search term. So, this does the job (yay), but
> > i'd still be interested in hearing if anyone else has solved this problem in
> > a different way. :)
> >
> > On 21/04/2008, Max Williams wrote:
> > >
> > > Does anyone know of a way of being 'accent-insensitive' when i do a
> > > search?
> > >
> > > For example, if i have a resource with the name "La Bohème", and someone
> > > searches for 'boheme' i want them to find that resource, even though the 'e'
> > > doesn't have the accent. At the moment, it will only find it if they search
> > > for the properly accented version.
> > >
> > > I guess soundex support for ferret is what I mean, but maybe there's
> > > another way?
> > >
> > > thanks, max
> > >
> >
> > _______________________________________________
> > Ferret-talk mailing list
> > Ferret-talk at rubyforge.org
> > http://rubyforge.org/mailman/listinfo/ferret-talk
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/ - The new free film database
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080422/de07f300/attachment.html From sd.codewarrior at gmail.com Wed Apr 23 00:50:25 2008 From: sd.codewarrior at gmail.com (S D) Date: Wed, 23 Apr 2008 00:50:25 -0400 Subject: [Ferret-talk] Problem if method is called during Analyzer.token_stream operation Message-ID: <4f10e2890804222150y5a5aefcp653af9f097a812f6@mail.gmail.com> I've written a tokenizer/analyzer that parses a file extracting tokens and operate this analyzer/tokenizer on ASCII data consisting of XML files (the tokenizer skips over XML elements but maintains relative positioning). I've written many units tests to check the produced token stream and was confident that the tokenizer was working properly. Then I noticed two problems: 1. StopFilter (using English stop words) does not properly filter the token stream output from my tokenizer. If I explicitly pass an array of stop words to the stop filter it still doesn't work. If I simply switch my tokenizer to a StandardTokenizer the stop words are appropriately filtered (of course the XML tags are treated differently). 2. When I try a simple search no results come up. I can see that my tokenizer is adding files to the index but a simple search (using Ferret::Index::Index.search_each) produces no results. I'm now trying to track down the above problem which seems to have led me to another (though possibly related) problem for which I am seeking an answer. Below is the token_stream() method of my analyzer (XMLAnalyzer). Note that I've commented out my custom tokenizer (XMLTokenizer) so that the StandardTokenizer is being used within my custom analyzer. 
    def token_stream(field, str)
      # ts = XMLTokenizer.new(str)
      ts = StandardTokenizer.new(str)
      # test_token_stream(ts)
      ts
    end

In the above I've commented out the test_token_stream() method taken from Balmain's Ferret book (O'Reilly, pg 68) that simply prints out the tokens contained within a stream; i.e.,:

    def test_token_stream(token_stream)
      puts "\033[32mStart | End | PosInc | Text\033[m"
      while tkn = token_stream.next
        puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
      end
    end

If I keep test_token_stream() commented out then the indexing and search work fine (using StandardTokenizer). However, if I do not comment out test_token_stream() then creating the index appears to work fine but a search produces no results. I haven't been able to track this down but thought it might be related to the problems I was having with XMLTokenizer. Note that I create my index with the Ferret::Index::Index

    index = Index::Index.new(:analyzer => XMLAnalyzer.new(),
                             :path => options.indexLocation,
                             :create_if_missing => true)

and I perform searches using Ferret::Search::Searcher

Any thoughts would be appreciated.

Regards,
John
aka sd.codewarrior

From jk at jkraemer.net Wed Apr 23 04:13:56 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Wed, 23 Apr 2008 10:13:56 +0200
Subject: [Ferret-talk] Problem if method is called during Analyzer.token_stream operation
In-Reply-To: <4f10e2890804222150y5a5aefcp653af9f097a812f6@mail.gmail.com>
References: <4f10e2890804222150y5a5aefcp653af9f097a812f6@mail.gmail.com>
Message-ID: <20080423081356.GO3723@thunder.jkraemer.net>

Hi!

First guess - the test_token_stream method removes items from the stream by calling next(), so the stream is empty when you return it, and Ferret has nothing left to index.
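
The effect described here can be reproduced without Ferret. The sketch below is only an illustration under stated assumptions: `ArrayTokenStream`, `drain`, `token_stream_broken` and `token_stream_fixed` are hypothetical names, not Ferret classes or methods. The stand-in stream hands out each token once through `next`, as a tokenizer would, so a debugging loop that walks it to the end leaves nothing for whoever receives the stream afterwards. One possible workaround, assuming the input string is still available, is to print from a throwaway stream and return a fresh one.

```ruby
# Hypothetical stand-in for a token stream: `next` returns each token once,
# then nil. This is NOT a Ferret class, only an illustration.
class ArrayTokenStream
  def initialize(text)
    @tokens = text.split
    @pos = 0
  end

  def next
    tok = @tokens[@pos]
    @pos += 1 if tok
    tok
  end
end

# Debug helper in the spirit of test_token_stream: it walks the stream
# to the end and returns everything it saw.
def drain(stream)
  seen = []
  while (tok = stream.next)
    seen << tok
  end
  seen
end

# Problem pattern: inspect the stream, then hand the same object on.
def token_stream_broken(text)
  ts = ArrayTokenStream.new(text)
  drain(ts) # consumes every token
  ts        # already exhausted when the caller gets it
end

# Safer pattern: inspect a throwaway stream, return an untouched one.
def token_stream_fixed(text)
  drain(ArrayTokenStream.new(text)) # debugging only
  ArrayTokenStream.new(text)        # fresh stream for the consumer
end

broken_tokens = drain(token_stream_broken("quick brown fox"))
fixed_tokens  = drain(token_stream_fixed("quick brown fox"))
puts broken_tokens.inspect
puts fixed_tokens.inspect
```

The first call yields no tokens at all, the second yields all three, which matches the symptom of an index that builds without error but returns no hits.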
Cheers, Jens On Wed, Apr 23, 2008 at 12:50:25AM -0400, S D wrote: > I've written a tokenizer/analyzer that parses a file extracting tokens and > operate this analyzer/tokenizer on ASCII data consisting of XML files (the > tokenizer skips over XML elements but maintains relative positioning). I've > written many units tests to check the produced token stream and was > confident that the tokenizer was working properly. Then I noticed two > problems: > > 1. StopFilter (using English stop words) does not properly filter the > token stream output from my tokenizer. If I explicitly pass an array of stop > words to the stop filter it still doesn't work. If I simply switch my > tokenizer to a StandardTokenizer the stop words are appropriately filtered > (of course the XML tags are treated differently). > 2. When I try a simple search no results come up. I can see that my > tokenizer is adding files to the index but a simple search (using > Ferret::Index::Index.search_each) produces no results. > > I'm now trying to track down the above problem which seems to have led me to > another (though possibly related) problem for which I am seeking an answer. > Below is the token_stream() method of my analyzer (XMLAnalyzer). Note that > I've commented out my custom tokenizer (XMLTokenizer) so that the > StandardTokenizer is being used within my custom analyzer. 
> def token_stream(field, str) > # ts = XMLTokenizer.new(str) > ts = StandardTokenizer.new(str) > # test_token_stream(ts) > ts > end > In the above I've commented out the test_token_stream() method taken from > Balmain's Ferret book (O'Reilly, pg 68) that simply prints out the tokens > contained within a stream; i.e.,: > def test_token_stream(token_stream) > puts "\033[32mStart | End | PosInc | Text\033[m" > while tkn = token_stream.next > puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, > tkn.pos_inc, tkn.text] > end > end > > If I keep test_token_stream() commented out then the indexing and search > work fine (using StandardTokenizer). However, if I do not comment out > test_token_stream() then creating the index appears to work fine but a > search produces no results. I haven't been able to track this down but > thought it might be related to the problems I was having with XMLTokenizer. > Note that I create my index with the Ferret::Index::Index > > index = Index::Index.new(:analyzer => XMLAnalyzer.new(), > :path => options.indexLocation, > :create_if_missing => true) > > and I perform searches using Ferret::Search::Searcher > > Any thoughts would be appreciated. > > Regards, > John > aka sd.codewarrior > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk -- Jens Kr?mer Finkenlust 14, 06449 Aschersleben, Germany VAT Id DE251962952 http://www.jkraemer.net/ - Blog http://www.omdb.org/ - The new free film database From sd.codewarrior at gmail.com Wed Apr 23 12:18:12 2008 From: sd.codewarrior at gmail.com (S D) Date: Wed, 23 Apr 2008 12:18:12 -0400 Subject: [Ferret-talk] Custom Tokenizer not working Message-ID: <4f10e2890804230918g30eed819x442dc86eda6de8ec@mail.gmail.com> First, thanks to Jens K. for pointing a stupid error on my part regarding the use of test_token_stream(). 
My current problem, a custom tokenizer I've written in Ruby does not properly create an index (or at least searches on the index don't work). Using test_token_stream() I have verified that my tokenizer properly creates the token_stream; certainly each Token's attributes are set properly. Nevertheless, simple searches return zero results. The essence of my tokenizer is to skip beyond XML tags in a file and break up and return text components as tokens. I use this approach as opposed to an Hpricot approach because I need to keep track of the location of the text with respect to XML tags since after a search for a phrase I'll want to extract the nearby XML tags as they contain important context. My tokenizer (XMLTokenizer) contains a the obligatory initialize, next and text methods (shown below) as well as a lot of parsing methods that are called at the top level by the method XMLTokenizer.get_next_token which is the primary action within next. I didn't add the details of get_next_token as I'm assuming that if each token produced by get_next_token has the proper attributes then it shouldn't be the cause of the problem. What more should I be looking for? I've been looking for a custom tokenizer written in Ruby to model after; any suggestions? 
    def initialize(xmlText)
      @xmlText = xmlText.gsub(/[;,!]/, ' ')
      @currPtr = 0
      @currWordStart = nil
      @currTextStart = 0
      @nextTagStart = 0
      @startOfTextRegion = 0

      @currTextStart = \
        XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText)
      @nextTagStart = \
        XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText)
      @currPtr = @currTextStart
      @startOfTextRegion = 1
    end

    def next
      tkn = get_next_token
      if tkn != nil
        puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, tkn.text]
      end
      return tkn
    end

    def text=(text)
      initialize(text)
      @xmlText
    end

Below is text from a previous, related message that shows that StopFiltering is not working:

> I've written a tokenizer/analyzer that parses a file extracting tokens and
> operate this analyzer/tokenizer on ASCII data consisting of XML files (the
> tokenizer skips over XML elements but maintains relative positioning). I've
> written many units tests to check the produced token stream and was
> confident that the tokenizer was working properly. Then I noticed two
> problems:
>
> 1. StopFilter (using English stop words) does not properly filter the
> token stream output from my tokenizer. If I explicitly pass an array of stop
> words to the stop filter it still doesn't work. If I simply switch my
> tokenizer to a StandardTokenizer the stop words are appropriately filtered
> (of course the XML tags are treated differently).
>
> 2. When I try a simple search no results come up. I can see that my
> tokenizer is adding files to the index but a simple search (using
> Ferret::Index::Index.search_each) produces no results.

Any suggestions are appreciated.

John
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080423/588284c8/attachment.html From jk at jkraemer.net Wed Apr 23 13:03:24 2008 From: jk at jkraemer.net (Jens Kraemer) Date: Wed, 23 Apr 2008 19:03:24 +0200 Subject: [Ferret-talk] Custom Tokenizer not working In-Reply-To: <4f10e2890804230918g30eed819x442dc86eda6de8ec@mail.gmail.com> References: <4f10e2890804230918g30eed819x442dc86eda6de8ec@mail.gmail.com> Message-ID: <20080423170323.GQ3723@thunder.jkraemer.net> Hi! On Wed, Apr 23, 2008 at 12:18:12PM -0400, S D wrote: [..] > My current problem, a custom tokenizer I've written in Ruby does not > properly create an index (or at least searches on the index don't work). > Using test_token_stream() I have verified that my tokenizer properly creates > the token_stream; certainly each Token's attributes are set properly. > Nevertheless, simple searches return zero results. Could you have a look at your index with the ferret_browser utility? It allows you to check what exactly has been indexed and that maybe leads to the root of your problem. What does your analyzer, where you use the Tokenizer, look like? Is your next() method below being called and working correctly when test driving your analyzer i.e. in irb? Cheers, Jens > The essence of my tokenizer is to skip beyond XML tags in a file and break > up and return text components as tokens. I use this approach as opposed to > an Hpricot approach because I need to keep track of the location of the text > with respect to XML tags since after a search for a phrase I'll want to > extract the nearby XML tags as they contain important context. My tokenizer > (XMLTokenizer) contains a the obligatory initialize, next and text methods > (shown below) as well as a lot of parsing methods that are called at the top > level by the method XMLTokenizer.get_next_token which is the primary action > within next. 
I didn't add the details of get_next_token as I'm assuming that > if each token produced by get_next_token has the proper attributes then it > shouldn't be the cause of the problem. What more should I be looking for? > I've been looking for a custom tokenizer written in Ruby to model after; any > suggestions? > > def initialize(xmlText) > @xmlText = xmlText.gsub(/[;,!]/, ' ') > @currPtr = 0 > @currWordStart = nil > @currTextStart = 0 > @nextTagStart = 0 > @startOfTextRegion = 0 > > @currTextStart = \ > XMLTokenizer.skip_beyond_current_tag(@currPtr, @xmlText) > @nextTagStart = \ > XMLTokenizer.skip_beyond_current_text(@currTextStart, @xmlText) > @currPtr = @currTextStart > @startOfTextRegion = 1 > end > > def next > tkn = get_next_token > if tkn != nil > puts "%5d |%4d |%5d | %s" % [tkn.start, tkn.end, tkn.pos_inc, > tkn.text] > end > return tkn > end > > def text=(text) > initialize(text) > @xmlText > end > > Below is text from a previous, related message that shows that StopFiltering > is not working: > > >* I've written a tokenizer/analyzer that parses a file extracting tokens and > *>* operate this analyzer/tokenizer on ASCII data consisting of XML files (the > *>* tokenizer skips over XML elements but maintains relative positioning). I've > *>* written many units tests to check the produced token stream and was > *>* confident that the tokenizer was working properly. Then I noticed two > *>* problems: > *>* > *>* 1. StopFilter (using English stop words) does not properly filter the > *>* token stream output from my tokenizer. If I explicitly pass an > array of stop > *>* words to the stop filter it still doesn't work. If I simply switch my > *>* tokenizer to a StandardTokenizer the stop words are > appropriately filtered > *>* (of course the XML tags are treated differently). > *> > >* 2. When I try a simple search no results come up. 
I can see that my
> > tokenizer is adding files to the index but a simple search (using
> > Ferret::Index::Index.search_each) produces no results.
>
> Any suggestions are appreciated.
>
> John
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From sd.codewarrior at gmail.com Wed Apr 23 13:59:32 2008
From: sd.codewarrior at gmail.com (S D)
Date: Wed, 23 Apr 2008 13:59:32 -0400
Subject: [Ferret-talk] Custom Tokenizer not working
Message-ID: <4f10e2890804231059t5788c42elf18d2040d354891@mail.gmail.com>

[unfortunately I received my messages as a batched digest...hence, I'm forced to respond in a new thread. I've requested the administrator to change my config to receive each message on this list. Sorry for any inconvenience]

Thanks for the response below. Here is XMLAnalyzer (currently I'm not using the stop or lower case filter):

    class XMLAnalyzer < Ferret::Analysis::Analyzer
      def initialize(synonym_engine = nil, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
        @synonym_engine = synonym_engine
        @lower = lower
        @stop_words = stop_words
      end

      def token_stream(field, str)
        # ts = XMLTokenizer.new(str)
        ts = StandardTokenizer.new(str)
        # test_token_stream(ts)
        return ts
      end
    end

I just tried running ferret-browser by pointing to an index created with StandardTokenizer and got the error below in Firefox. Is there any configuration that is necessary? Presumably the defaults should work.

John

    Internal Server Error
    No such file or directory - /usr/local/lib/site_ruby/1.8/ferret/browser/views/error/index.rhtml
    ------------------------------
    WEBrick/1.3.1 (Ruby/1.8.6/2007-06-07) at 127.0.0.1:3301

Hi!

On Wed, Apr 23, 2008 at 12:18:12PM -0400, S D wrote:
[..]
> > John
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From kraemer at webit.de Wed Apr 23 14:23:42 2008
From: kraemer at webit.de (Jens Kraemer)
Date: Wed, 23 Apr 2008 20:23:42 +0200
Subject: [Ferret-talk] Custom Tokenizer not working
In-Reply-To: <4f10e2890804231059t5788c42elf18d2040d354891@mail.gmail.com>
References: <4f10e2890804231059t5788c42elf18d2040d354891@mail.gmail.com>
Message-ID: <20080423182342.GA22831@cordoba.webit.de>

Hi!

On Wed, Apr 23, 2008 at 01:59:32PM -0400, S D wrote:
> [unfortunately I received my messages as a batched digest...hence, I'm
> forced to respond in a new thread. I've requested the administrator to
> change my config to receive each message on this list. Sorry for any
> inconvenience]
>
> Thanks for the response below. Here is XMLAnalyzer (currently I'm not using
> the stop or lower case filter):
>
> class XMLAnalyzer < Ferret::Analysis::Analyzer

Could you check whether not inheriting from Ferret's Analyzer changes anything? At least I usually don't do that in my analyzers.

[..]
> I just tried running ferret-browser by pointing to an index created with
> StandardTokenizer and got the error below in Firefox. Is there any
> configuration that is necessary? Presumably the defaults should work.
[..]
> Internal Server Error No such file or directory -
> /usr/local/lib/site_ruby/1.8/ferret/browser/views/error/index.rhtml
> ------------------------------
> WEBrick/1.3.1 (Ruby/1.8.6/2007-06-07) at 127.0.0.1:3301

works just fine here (Ferret 0.11.6 / Ubuntu), just tried it out.
The location from the error message looks a bit strange to me, how did you install ferret?

Cheers,
Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold

From eric.duminil at gmail.com Sun Apr 20 16:35:03 2008
From: eric.duminil at gmail.com (Eric Duminil)
Date: Sun, 20 Apr 2008 22:35:03 +0200
Subject: [Ferret-talk] Picolena, a ferret+rails documents search engine
Message-ID: <02b7d68df3805e85e95bcb90b5076801@ruby-forum.com>

Hi everybody!

I am proud to present a small project I have been working on for a while: Picolena, a documents search engine written in Rails. ( http://picolena.devjavu.com/ ).

It obviously uses Ferret for indexing and searching, and adds some plain text extractors in order to index OpenOffice.org, pdf and MS Office documents (and some others as well). Everything is packed in a gem (gem install picolena), with a few rake tasks, a multi-threaded indexer, a language guesser, a rails frontend and some specs to be sure everything works fine.

I would love to hear some feedback from acts_as_ferret developers or users! My project is in no way supposed to be a competitor of AAF: we have different goals (ActiveRecord indexing plugin vs. stand-alone rails-app for documents indexing), but still a lot in common. I dare say Picolena would be useful in a lot of companies (as a google-mini alternative), and has already been working in production for a few months without a hitch. This has been made possible thanks to Ferret's incredible speed. Kudos to the devs!

Best regards,
Eric

demo website : http://citynet.hft-stuttgart.de:4000/
trac : http://picolena.devjavu.com
rubyforge : http://rubyforge.org/projects/picolena/
svn repo : http://svn.devjavu.com/picolena/trunk/

-- 
Posted via http://www.ruby-forum.com/.

From tom.castle at flextrade.com Fri Apr 18 06:36:53 2008
From: tom.castle at flextrade.com (Tom Castle)
Date: Fri, 18 Apr 2008 12:36:53 +0200
Subject: [Ferret-talk] Default maximum result size
Message-ID: <9978b03465c8a647cf3569395514b9f3@ruby-forum.com>

Hi, I've got a Ferret index which I'm using through ActsAsFerret on a
database with about 300,000 records. Searches on the index (for example
wildcard - "A*") seem to be truncated at 512 results. This is a problem
as I also pass conditions to ActiveRecord which filter this resultset
down, often to zero results when in fact the database contains many
records which matched the conditions but just happened to be outside the
512 Ferret return.

I thought the solution would be setting max terms to a higher value,
with something like - Ferret::Search::MultiTermQuery.default_max_terms =
5000, however this seems to have no effect. Is there somewhere this
value is overridden or could there be some other setting which is
restricting the number of results returned by Ferret?

Any help or ideas would be great! Thanks!
--
Posted via http://www.ruby-forum.com/.

From comics at bigpond.net.au Fri Apr 18 07:25:49 2008
From: comics at bigpond.net.au (Katsuo Isono)
Date: Fri, 18 Apr 2008 13:25:49 +0200
Subject: [Ferret-talk] search two related tables
Message-ID: <1878df3937e6a5a05c1cfb59f3436484@ruby-forum.com>

Hi, It would be truly appreciated if anyone can give me some advice on
this.
I have two related tables, titles and authors. Using acts_as_ferret, I
can search by title name. However, when I search by author name, or
ausname in this case, I only get "sorry, found 0 item".

class Title < ActiveRecord::Base
  acts_as_ferret :additional_fields => [:author_name]
  belongs_to :author

  def author_name
    return author.ausname
  end
end
-------------------------
class Author < ActiveRecord::Base
  has_many :titles
end
-------------------------
def search
  if params[:q]
    query = params[:q]
    @titles = Title.find_by_contents(query)
    unless @titles.size > 0
      flash.now[:notice_not_found] = "Sorry, found 0 item"
    end
    render :action => 'search_result'
  end
end
--
Posted via http://www.ruby-forum.com/.

From sseung at nettheory.com Wed Apr 23 17:27:15 2008
From: sseung at nettheory.com (Spencer Seung)
Date: Wed, 23 Apr 2008 23:27:15 +0200
Subject: [Ferret-talk] filter_proc under acts_as_ferret
Message-ID: <506adc769626876cd0e45731287996d6@ruby-forum.com>

Hi, I'm having a problem with acts_as_ferret and the :filter_proc option
that gets passed to ferret::search. Let me preface this with being a
ferret/AaF noob.

The situation:
I am trying to extend the search functionality of a website that has
already been built. I understand that ferret isn't (or doesn't seem to
be) designed for numerical value searches, but at the moment it seems
that it would be easier to try to get it to work rather than completely
rebuild the search.

We have a number of products, and they have numerical values associated
to certain attributes. For example, a pair of shoes might have a
trendiness value of 50. If someone searched for trendiness:50 then those
shoes would show up. But we want it to show up even if trendiness:55 is
the query. I figured doing a range search would take care of that (Can
aaf do ranged queries?)

The problem is that if a user searches for trendiness:55 and there are
two hits, one a pair of shoes with trendiness:50 and another with
trendiness:45, the first shoes should have a higher score.
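
The ranking described in the paragraph above can be sketched in plain Ruby, entirely outside Ferret and acts_as_ferret. This is only a rough illustration of "closer value ranks higher": the `Product` struct, the `rank_by_closeness` helper and the sample data are all hypothetical and do not appear in the original message, and the sketch says nothing about how :filter_proc itself behaves.

```ruby
# Hypothetical stand-in for a product record; not an ActiveRecord model.
Product = Struct.new(:name, :trendiness)

# Keep products whose value lies within +tolerance+ of +target+,
# ordered so that the closest value comes first.
def rank_by_closeness(products, target, tolerance)
  products
    .select { |product| (product.trendiness - target).abs <= tolerance }
    .sort_by { |product| (product.trendiness - target).abs }
end

products = [
  Product.new("loafers", 45),
  Product.new("sneakers", 50),
  Product.new("clogs", 90)
]

# A query for trendiness 55 with a tolerance of 10 keeps 50 and 45,
# and lists the product with 50 first.
rank_by_closeness(products, 55, 10).each do |product|
  puts "#{product.name} (trendiness #{product.trendiness})"
end
```

Sorting after the search like this only works if the whole result set is fetched first, so it is a fallback rather than a replacement for scoring inside the index.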

I figure that passing a search a :filter_proc that looks at the searcher
object, calculates the distance between search query and value and then
returns a float to weight the score would do the trick. I just can't seem
to get that to work. Ignoring the rangedQuery part, the following lines
output the same results with the same scores:

Product.find_id_with_contents("shoes")
Product.find_id_with_contents("shoes", :filter_proc => lambda
{|doc_id,score, searcher| return 0.5})

Shouldn't the 2nd version have halved the resulting ferret_scores of
each result?

thanks for the time,

-Spencer
--
Posted via http://www.ruby-forum.com/.

From sd.codewarrior at gmail.com Wed Apr 23 22:58:38 2008
From: sd.codewarrior at gmail.com (S D)
Date: Wed, 23 Apr 2008 22:58:38 -0400
Subject: [Ferret-talk] Custom Tokenizer not working
Message-ID: <4f10e2890804231958u68b3aa05j46ee29e710b39c7b@mail.gmail.com>

I changed XMLAnalyzer so that it does not inherit from
Ferret::Analysis::Analyzer. That seemed to have no effect.

I successfully ran ferret-browser. As shown below, I am using two fields -
:file and :content. When I browse through the "file" term everything appears
fine; all of the filenames are found. The "content" term on the other hand
is empty. Apparently I'm not stuffing the tokens in the index at all. One
question I have is exactly what should happen in the Tokenizer#text method
and when will this method be called?

Thanks,
John

=====
index = Index::Index.new(:analyzer => XMLAnalyzer.new(),
                         :path => options.indexLocation,
                         :create => true)

Find.find(options.searchPath) do |path|
  if FileTest.file? path
    File.open(path) do |file|
      puts "Adding file to index: " + path
      index.add_document :file => path,
                         :content => file.readlines
    end
  end
end
=====

Hi!

On Wed, Apr 23, 2008 at 01:59:32PM -0400, S D wrote:
> [unfortunately I received my messages as a batched digest...hence, I'm
> forced to respond in a new thread.
> I've requested the administrator to
> change my config to receive each message on this list. Sorry for any
> inconvenience]
>
> Thanks for the response below. Here is XMLAnalyzer (currently I'm not using
> the stop or lower case filter):
>
> class XMLAnalyzer < Ferret::Analysis::Analyzer

could you try if not inheriting from Ferret's Analyzer changes anything?
At least I usually don't do that in my analyzers.

[..]
> I just tried running ferret-browser by pointing to an index created with
> StandardTokenizer and got the error below in Firefox. Is there any
> configuration that is necessary? Presumably the defaults should work.
[..]
> Internal Server Error No such file or directory -
> /usr/local/lib/site_ruby/1.8/ferret/browser/views/error/index.rhtml
> ------------------------------
> WEBrick/1.3.1 (Ruby/1.8.6/2007-06-07) at 127.0.0.1:3301

works just fine here (Ferret 0.11.6 / Ubuntu), just tried it out.
The location from the error message looks a bit strange to me, how did you
install ferret?

Cheers,
Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080423/d7d224c4/attachment.html

From sd.codewarrior at gmail.com Wed Apr 23 23:07:13 2008
From: sd.codewarrior at gmail.com (S D)
Date: Wed, 23 Apr 2008 23:07:13 -0400
Subject: [Ferret-talk] Uninstalling ferret from source
Message-ID: <4f10e2890804232007j295b166eoeda902430cf81fb3@mail.gmail.com>

I installed ferret from a tarball rather than gem install.
Per the README I installed from the extracted directory via:

$ rake ext
$ ruby setup.rb config
$ ruby setup.rb setup
$ ruby setup.rb install

I'm now trying to uninstall but there are no instructions on how (either in
the README or via "ruby setup.rb --help"). How can I do this?

Much Thanks,
John
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20080423/9970d295/attachment.html

From jk at jkraemer.net Thu Apr 24 02:10:48 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Thu, 24 Apr 2008 08:10:48 +0200
Subject: [Ferret-talk] Default maximum result size
In-Reply-To: <9978b03465c8a647cf3569395514b9f3@ruby-forum.com>
References: <9978b03465c8a647cf3569395514b9f3@ruby-forum.com>
Message-ID: <20080424061047.GS3723@thunder.jkraemer.net>

On Fri, Apr 18, 2008 at 12:36:53PM +0200, Tom Castle wrote:
> Hi, I've got a Ferret index which I'm using through ActsAsFerret on a
> database with about 300,000 records. Searches on the index (for example
> wildcard - "A*") seem to be truncated at 512 results. This is a problem
> as I also pass conditions to ActiveRecord which filter this resultset
> down, often to zero results when in fact the database contains many
> records which matched the conditions but just happened to be outside the
> 512 Ferret return.
>
> I thought the solution would be setting max terms to a higher value,
> with something like - Ferret::Search::MultiTermQuery.default_max_terms =
> 5000, however this seems to have no effect. Is there somewhere this
> value is overridden or could there be some other setting which is
> restricting the number of results returned by Ferret?

Yes, most probably it's the boolean query's max_clauses option which
defaults to 512.

If you're using the QueryParser, you may pass it the :max_clauses => 5000
option to globally raise this for all kinds of queries. There should also
be a max_clauses class method in BooleanQuery.
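
To make the suspected mechanism concrete, here is a toy model in plain Ruby. It does not call Ferret at all. It assumes, in line with the guess above, that a wildcard query is rewritten into one clause per matching term and that clauses beyond the limit are silently dropped; the `wildcard_search` helper, the hash-based index and the sample data are hypothetical and exist only for this illustration.

```ruby
# Toy model only: this does not use Ferret. It assumes one clause per
# matching term and that clauses past the limit are silently dropped.
DEFAULT_MAX_CLAUSES = 512

# index: Hash mapping term => array of document ids (hypothetical structure).
def wildcard_search(index, prefix, max_clauses = DEFAULT_MAX_CLAUSES)
  matching_terms = index.keys.select { |term| term.start_with?(prefix) }
  kept_terms = matching_terms.first(max_clauses)
  kept_terms.flat_map { |term| index[term] }.uniq
end

# 2,000 records, each indexed under its own term starting with "a".
index = {}
2000.times { |i| index[format("a%04d", i)] = [i] }

puts wildcard_search(index, "a").size        # prints 512
puts wildcard_search(index, "a", 5000).size  # prints 2000
```

In this model the second call corresponds to raising the limit as suggested; any filtering applied after the search would otherwise only ever see the first 512 matches.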

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From jk at jkraemer.net Thu Apr 24 02:14:05 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Thu, 24 Apr 2008 08:14:05 +0200
Subject: [Ferret-talk] search two related tables
In-Reply-To: <1878df3937e6a5a05c1cfb59f3436484@ruby-forum.com>
References: <1878df3937e6a5a05c1cfb59f3436484@ruby-forum.com>
Message-ID: <20080424061405.GT3723@thunder.jkraemer.net>

On Fri, Apr 18, 2008 at 01:25:49PM +0200, Katsuo Isono wrote:
> Hi, It would be truly appreciated if anyone can give me some advice on
> this.
> I have two related tables, titles and authors. Using acts_as_ferret, I
> can search by title name. However, when I search by author name, or
> ausname in this case, I only get "sorry, found 0 item".

Did you have a look at your log files? Maybe author is nil when
acts_as_ferret calls the author_name method. Aaf logs every field and
value it adds to the index so you should pretty easily find out what's
going on.

Cheers,
Jens

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From toastkid.williams at gmail.com Fri Apr 25 06:45:30 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Fri, 25 Apr 2008 11:45:30 +0100
Subject: [Ferret-talk] Pagination, sorting and conditions: the combination is breaking my search results...
Message-ID:

Hi

I have a problem with a search where i want to get some results according to
some conditions, sort the results, and then paginate over the sorted
collection.

My search looks like this:

@results = TeachingObject.find_with_ferret(search_term,
             #(ferret) options
             {:page => options[:page],
              :per_page => options[:per_page],
              :sort => Ferret::Search::SortField.new(:asset_count,
                         :type => :integer, :reverse => true )},
             #find options
             { :conditions => ["id in (?)", @ids] } )

where @ids is an array of ids from which the results must come (ie a
collection of 'allowed' results of which @results will be a subset): often
the search term is set to * to get all of this collection in @results.
':asset_count' is an untokenized ferret field that stores integers.

Through debugging and experimenting, i've observed the following:

- The overall results set, without sorting/pagination, is correct
  (therefore :conditions is being taken into account).
- If :per_page is set to be so large that no pagination is required, then
  the sorting occurs properly (therefore sorting is being taken into account)
- If :per_page is reduced so that pagination is required, then the
  sorting of the overall set breaks: it seems as if the results are ordered by
  id, then paginated.
- However, on every individual page, the results are sorted properly for
  that page, ie each page-size subset is internally sorted.
- If i sort on a different untokenized field, the problem persists.

It seems as if the pagination is happening and THEN the sorting is
happening, which obviously doesn't give the expected results. This is just
a theory on my part though.

Can anyone tell me how to fix this problem? I've been gnashing my teeth
over it for over a day now and can't find any solutions...

thanks
max

From kraemer at webit.de Fri Apr 25 07:59:43 2008
From: kraemer at webit.de (Jens Kraemer)
Date: Fri, 25 Apr 2008 13:59:43 +0200
Subject: [Ferret-talk] Pagination, sorting and conditions: the combination is breaking my search results...
In-Reply-To:
References:
Message-ID: <20080425115943.GA308@cordoba.webit.de>

Hi Max,

thanks for your detailed report. Might well be that I broke one or more
of the various combinations of pagination / sorting / active record
conditions (where you might specify :order, too, btw) in trunk.

I'll look into it asap.

Cheers,
Jens

On Fri, Apr 25, 2008 at 11:45:30AM +0100, Max Williams wrote:
> Hi
>
> I have a problem with a search where i want to get some results according to
> some conditions, sort the results, and then paginate over the sorted
> collection.
>
> My search looks like this:
>
> @results = TeachingObject.find_with_ferret(search_term,
>              #(ferret) options
>              {:page => options[:page],
>               :per_page => options[:per_page],
>               :sort => Ferret::Search::SortField.new(:asset_count,
>                          :type => :integer, :reverse => true )},
>              #find options
>              { :conditions => ["id in (?)", @ids] } )
>
> where @ids is an array of ids from which the results must come (ie a
> collection of 'allowed' results of which @results will be a subset): often
> the search term is set to * to get all of this collection in @results.
> ':asset_count' is an untokenized ferret field that stores integers.
>
> Through debugging and experimenting, i've observed the following:
>
> - The overall results set, without sorting/pagination, is correct
>   (therefore :conditions is being taken into account).
> - If :per_page is set to be so large that no pagination is required, then
>   the sorting occurs properly (therefore sorting is being taken into account)
> - If :per_page is reduced so that pagination is required, then the
>   sorting of the overall set breaks: it seems as if the results are ordered by
>   id, then paginated.
> - However, on every individual page, the results are sorted properly for
>   that page, ie each page-size subset is internally sorted.
> - If i sort on a different untokenized field, the problem persists.
>
> It seems as if the pagination is happening and THEN the sorting is
> happening, which obviously doesn't give the expected results. This is just
> a theory on my part though.
>
> Can anyone tell me how to fix this problem? I've been gnashing my teeth
> over it for over a day now and can't find any solutions...
>
> thanks
> max

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold

From toastkid.williams at gmail.com Fri Apr 25 08:35:48 2008
From: toastkid.williams at gmail.com (Max Williams)
Date: Fri, 25 Apr 2008 13:35:48 +0100
Subject: [Ferret-talk] Pagination, sorting and conditions: the combination is breaking my search results...
In-Reply-To: <20080425115943.GA308@cordoba.webit.de>
References: <20080425115943.GA308@cordoba.webit.de>
Message-ID:

Fantastic, thanks Jens. BTW, I can't :order by :asset_count as it's a
method, rather than an instance variable. I wish AR::find would let me
order by method returns.

While we're talking, i just tried another way of searching and found another
bit of weirdness: instead of find_with_ferret, i tried using
ActsAsFerret::find(term, class_array) instead, like so:

#sorting/pagination is broken here
@results = ActsAsFerret::find(search_term, [TeachingObject],
             #(ferret) options
             { :page => options[:page],
               :per_page => options[:per_page],
               :sort => Ferret::Search::SortField.new(:asset_count,
                          :type => :integer, :reverse => true ) },
             #find options - need to specify conditions for each searched
             #class individually
             {:conditions => { :teaching_object => ["id in (?)", @ids] } } )

This gave exactly the same results as the previous search. However, when i
added another class to the search, it works!

#this works!
@results = ActsAsFerret::find(search_term, [TeachingObject, LearningObject],
             #(ferret) options
             { :page => options[:page],
               :per_page => options[:per_page],
               :sort => Ferret::Search::SortField.new(:asset_count,
                          :type => :integer, :reverse => true ) },
             #find options - need to specify conditions for each searched
             #class individually
             {:conditions => { :teaching_object => ["id in (?)", @ids],
                               :learning_object => ["id in (?)", @ids] } } )

So, this works while the previous doesn't. It so happens in this case that
LearningObject is a 'sister' class of TeachingObject (they both extend a
class called Resource), where both are saved in a table called resources
using STI, and at the moment i don't actually have any LearningObject
records, so adding LearningObject doesn't harm my results. Obviously though
this isn't a nice workaround.

Sorry to pile bug reports on you, i just mention it in case it's relevant.
:/

Thanks a lot
max

2008/4/25 Jens Kraemer :
> Hi Max,
>
> thanks for your detailed report. Might well be that I broke one or more
> of the various combinations of pagination / sorting / active record
> conditions (where you might specify :order, too, btw) in trunk.
>
> I'll look into it asap.
>
> Cheers,
> Jens
>
> On Fri, Apr 25, 2008 at 11:45:30AM +0100, Max Williams wrote:
> > Hi
> >
> > I have a problem with a search where i want to get some results according to
> > some conditions, sort the results, and then paginate over the sorted
> > collection.

From u.alberton at gmail.com Fri Apr 25 14:17:08 2008
From: u.alberton at gmail.com (Bira)
Date: Fri, 25 Apr 2008 15:17:08 -0300
Subject: [Ferret-talk] Public Ferret git repository?
Message-ID:

Are there plans for a public git repository for Ferret? I read its
development had switched to git, and since git is my preferred version
control system, it would be nice to have a public repository to clone
from :).

--
Bira
http://compexplicita.wordpress.com
http://compexplicita.tumblr.com

From sd.codewarrior at gmail.com Mon Apr 28 03:04:36 2008
From: sd.codewarrior at gmail.com (S D)
Date: Mon, 28 Apr 2008 03:04:36 -0400
Subject: [Ferret-talk] Handling Carriage Returns
Message-ID: <4f10e2890804280004n57bed4cnfbc6591b3c4e443b@mail.gmail.com>

It's my understanding that the tokens in a token_stream consist of text
along with start/stop positions that represent the byte positions of the
text within the corresponding document field. The documentation I've been
reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte
positions represent positions within the entire field but based on my
testing it appears that the byte positions are with respect to the line that
contains the corresponding text within the field. I read my fields following
Brian McCallister:

index.add_document :file => path,
                   :content => file.readlines

Hence, if I have a file that contains carriage returns, the token positions
will be reset with each new line.
For example, the following file contents
(File A)
  this is a sentence
will result in a token for the text "sentence" with start position equal to
10 (assume "this" starts in position 0) while a file with a carriage return
  this is a
  sentence
will result in a token for the text "sentence" with start position equal to
0. I get the same results for my custom tokenizer as well as
StandardTokenizer. The above does not seem consistent with the documentation
but more importantly, it seems that global positions are more useful than
line-based positions (e.g., for highlighting).

Digging a little deeper it seems that the tokenizer's initialize method is
called each time the token_stream method of the containing analyzer is
called:

class CustomAnalyzer
  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
  end
end

Am I missing something here? Are the start/stop byte positions intended to
be with respect to the line? Is there a way for token_stream to only be
called once for an entire string sequence (even if carriage returns are
contained)?

Thanks,
John

From hulme at ebi.ac.uk Mon Apr 28 05:48:13 2008
From: hulme at ebi.ac.uk (Robert Hulme)
Date: Mon, 28 Apr 2008 10:48:13 +0100
Subject: [Ferret-talk] Which field(s) matched?
Message-ID:

Mark - Have you made any progress with porting ferret_field_scores to
0.11.6? I'm at the stage now where I desperately need it, so if you
haven't I'd really appreciate it if you could just post the diff for
the earlier version.

Thanks
-Rob

From jk at jkraemer.net Mon Apr 28 06:34:17 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Mon, 28 Apr 2008 12:34:17 +0200
Subject: [Ferret-talk] Public Ferret git repository?
In-Reply-To:
References:
Message-ID: <20080428103416.GK3723@thunder.jkraemer.net>

Hi,

now with rubyforge having git support this might be a good idea. I just
opened the support ticket to switch the project over to git.
This won't hurt any svn users since I never used rubyforge's SVN anyway.

I'll mail the list once I have the code up there.

Cheers,
Jens

On Fri, Apr 25, 2008 at 03:17:08PM -0300, Bira wrote:
> Are there plans for a public git repository for Ferret? I read its
> development had switched to git, and since git is my preferred version
> control system, it would be nice to have a public repository to clone
> from :).
>
> --
> Bira
> http://compexplicita.wordpress.com
> http://compexplicita.tumblr.com

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From jk at jkraemer.net Mon Apr 28 06:37:21 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Mon, 28 Apr 2008 12:37:21 +0200
Subject: [Ferret-talk] Handling Carriage Returns
In-Reply-To: <4f10e2890804280004n57bed4cnfbc6591b3c4e443b@mail.gmail.com>
References: <4f10e2890804280004n57bed4cnfbc6591b3c4e443b@mail.gmail.com>
Message-ID: <20080428103721.GL3723@thunder.jkraemer.net>

Hi,

File.readlines returns an array which I think is the root cause of the
problem.
Just using File.read instead should solve your problem.

Cheers,
Jens

On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> It's my understanding that the tokens in a token_stream consist of text
> along with start/stop positions that represent the byte positions of the
> text within the corresponding document field. The documentation I've been
> reading (i.e., O'Reilly - Ferret - page 67) suggests that these byte
> positions represent positions within the entire field but based on my
> testing it appears that the byte positions are with respect to the line that
> contains the corresponding text within the field.
>
> Thanks,
> John

--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From samuelgiffney at gmail.com Mon Apr 28 21:25:50 2008
From: samuelgiffney at gmail.com (Sam Giffney)
Date: Tue, 29 Apr 2008 03:25:50 +0200
Subject: [Ferret-talk] Looped concurrent rebuild
Message-ID: <583c0a06e345d2c6f4111dd67364dbe3@ruby-forum.com>

I've used the code at http://pastie.textmate.org/66602 for an index
rebuild which catches edits during the rebuild.
I've tweaked this for my own use and am sharing it here for others and
for feedback.
It's not the prettiest loop and, in theory at least, it could run
forever... but this tweak also allows you to catch edits during the
entire rebuild process.
http://pastie.textmate.org/188436

Sam
--
Posted via http://www.ruby-forum.com/.

From sd.codewarrior at gmail.com Wed Apr 30 01:47:24 2008
From: sd.codewarrior at gmail.com (S D)
Date: Wed, 30 Apr 2008 01:47:24 -0400
Subject: [Ferret-talk] Handling Carriage Returns
In-Reply-To: <20080428103721.GL3723@thunder.jkraemer.net>
References: <4f10e2890804280004n57bed4cnfbc6591b3c4e443b@mail.gmail.com> <20080428103721.GL3723@thunder.jkraemer.net>
Message-ID: <4f10e2890804292247j3da867afh843efc1e99e39ba6@mail.gmail.com>

That was it. Stupid mistake on my part.

Thanks!
John

On Mon, Apr 28, 2008 at 6:37 AM, Jens Kraemer wrote:
> Hi,
>
> File.readlines returns an array which I think is the root cause of the
> problem.
> Just using File.read instead should solve your problem.
>
> Cheers,
> Jens
>
> On Mon, Apr 28, 2008 at 03:04:36AM -0400, S D wrote:
> > It's my understanding that the tokens in a token_stream consist of text
> > along with start/stop positions that represent the byte positions of the
> > text within the corresponding document field.
> >
> > Thanks,
> > John
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/ - The new free film database
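
The difference between File.read and File.readlines discussed in this thread can be illustrated without Ferret. The sketch below uses a hypothetical whitespace tokenizer (`tokens_with_offsets`, not Ferret's StandardTokenizer) and assumes, as reported above, that each element of an array-valued field is tokenized on its own, so offsets restart at every line.

```ruby
# Hypothetical whitespace tokenizer, used only to show offsets; it is not
# Ferret's StandardTokenizer. Returns [text, start_byte, end_byte] triples.
def tokens_with_offsets(str)
  tokens = []
  str.scan(/\S+/) do |word|
    start = Regexp.last_match.pre_match.bytesize
    tokens << [word, start, start + word.bytesize]
  end
  tokens
end

content = "this is a\nsentence\n"

# Like File.read: one string, so offsets are relative to the whole field.
whole = tokens_with_offsets(content)

# Like File.readlines: an array of lines, each tokenized separately.
per_line = content.lines.map { |line| tokens_with_offsets(line) }

p whole.last          # => ["sentence", 10, 18]
p per_line.last.last  # => ["sentence", 0, 8]
```

The two results match the start positions 10 and 0 reported in the original question, which is consistent with the array returned by File.readlines being the cause.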