From anatol.pomozov at gmail.com  Thu Dec  1 04:13:55 2005
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Thu, 1 Dec 2005 10:13:55 +0100
Subject: [Ferret-talk] Compilation of ferret C-extension under Windows.
In-Reply-To:
References: <3665a1a00511301104o40a16c2ta30a92141a12cbb9@mail.gmail.com>
Message-ID: <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com>

Hi, David.

On 12/1/05, David Balmain wrote:
>
> Hi Anatol,
>
> On 12/1/05, Anatol Pomozov wrote:
> > Hi, David.
> >
> > I have recently fixed the ferret C sources and successfully compiled the
> > extension with MSVC.Net. The problem was that the MS compiler is stricter
> > than GCC and requires that all variables be declared before use. There
> > were ~30 such declarations. I have fixed them all.
> >
> > But I am not sure that it works, because the tests failed with the
> > following error on both the clean and the patched versions. So it seems
> > to be a ferret internal error.
> >
> > test_persist_index(IndexTest):
> > RuntimeError: could not obtain lock:
> > C:/work/opensource/1/111/test/temp/fsdir/ferret-e0bcfc4d8e4ef5b2678a85120e4b572ccommit.lock
> >
>
> This isn't a bug but rather caused by the fact that you have a lock
> still open in your index. I have put finalizers on the lock in the
> version of Ferret in trunk to stop this from happening but it is
> better if you make sure that you close the index before you shut down
> the process. I think a lot of people are getting this error when
> they're running ferret in a webapp and they kill the server process.
> To get it working again, just delete the lock file.

But it happens during the Ferret tests.
Now I have another error in the Ferret tests.

  3) Error:
test_persist_index(IndexTest):
Errno::EACCES: Permission denied - C:/work/opensource/ferret/test/temp/fsdir/_8.cfs
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:105:in `delete'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:105:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:104:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:123:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:123:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:104:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:101:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:101:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:74:in `new'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:68:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:68:in `new'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:539:in `persist'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:535:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:535:in `persist'
    C:/work/opensource/ferret/test/unit/../unit/analysis/../../unit/document/../../unit/index/tc_index.rb:260:in `test_persist_index'

The directory C:\work\opensource\ferret\test\temp\fsdir is empty after the
tests have finished.

> > Anyway I could share or send a patch for C sources if you like.
>
> A patch would be great if its not too much trouble. Otherwise, I'd
> love to see an example of what exactly is causing the error. Do you
> mean it doesn't accept;
>
> int x = 3;

That is accepted, but this is not:

    GET_STE;
    TermInfo *ti; Data_Get_Struct(rti, TermInfo, ti);

because GET_STE contains a function call. After that we declare the variable
TermInfo *ti; and that is not accepted by the MSVC compiler.

OK, here is part of the .diff file.
Index: index_io.c
===================================================================
--- index_io.c (revision 152)
+++ index_io.c (working copy)
@@ -32,9 +32,9 @@
 static VALUE
 frt_indexin_init_copy(VALUE self, VALUE orig)
 {
-  GET_MY_BUF;
   IndexBuffer *orig_buf;
   int len;
+  GET_MY_BUF;
   if (self == orig)
     return self;
 
@@ -53,10 +53,11 @@
 static VALUE
 frt_indexin_refill(VALUE self)
 {
-  GET_MY_BUF;
   long start;
+  VALUE rStr;
   int stop, len_to_read;
   int input_len = FIX2INT(rb_funcall(self, frt_length, 0, NULL));
+  GET_MY_BUF;
 
   start = my_buf->start + my_buf->pos;
   stop = start + BUFFER_SIZE;
@@ -69,7 +70,7 @@
     rb_raise(rb_eEOFError, "IndexInput: Read past End of File");
   }
 
-  VALUE rStr = rb_str_new((char *)my_buf->buffer, BUFFER_SIZE);
+  rStr = rb_str_new((char *)my_buf->buffer, BUFFER_SIZE);
   rb_funcall(self, frt_read_internal, 3,
              rStr, INT2FIX(0), INT2FIX(len_to_read));
 
-- 
anatol (http://pomozov.info)

From anatol.pomozov at gmail.com  Thu Dec  1 04:16:26 2005
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Thu, 1 Dec 2005 10:16:26 +0100
Subject: [Ferret-talk] Compilation of ferret C-extension under Windows.
In-Reply-To: <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com>
References: <3665a1a00511301104o40a16c2ta30a92141a12cbb9@mail.gmail.com>
 <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com>
Message-ID: <3665a1a00512010116g6f1c8742n1d786331a179d1a3@mail.gmail.com>

OK, it is better if you look at the whole patch for Ferret.

http://pomozov.info/downloads/ferret-msvc-patch.zip

-- 
anatol (http://pomozov.info)

From aslak.hellesoy at gmail.com  Thu Dec  1 10:33:44 2005
From: aslak.hellesoy at gmail.com (aslak hellesoy)
Date: Thu, 1 Dec 2005 10:33:44 -0500
Subject: [Ferret-talk] Compilation of ferret C-extension under Windows.
In-Reply-To: <3665a1a00512010116g6f1c8742n1d786331a179d1a3@mail.gmail.com>
References: <3665a1a00511301104o40a16c2ta30a92141a12cbb9@mail.gmail.com>
 <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com>
 <3665a1a00512010116g6f1c8742n1d786331a179d1a3@mail.gmail.com>
Message-ID: <8d961d900512010733g6854b1bcsab8270eeedda71e2@mail.gmail.com>

On 12/1/05, Anatol Pomozov wrote:
> Ok.
> Better if you look at whole patch for Ferret.
>
> http://pomozov.info/downloads/ferret-msvc-patch.zip
>

404

Could you upload it to Ferret's Trac?
http://ferret.davebalmain.com/trac/newticket
(you can attach files after you create a ticket)

Looking forward to better win32 support!

Aslak

From anatol.pomozov at gmail.com  Thu Dec  1 10:35:17 2005
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Thu, 1 Dec 2005 16:35:17 +0100
Subject: [Ferret-talk] Compilation of ferret C-extension under Windows.
In-Reply-To:
References: <3665a1a00511301104o40a16c2ta30a92141a12cbb9@mail.gmail.com>
 <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com>
 <3665a1a00512010116g6f1c8742n1d786331a179d1a3@mail.gmail.com>
 <3665a1a00512010639r956c072p8a5e55776a27919d@mail.gmail.com>
Message-ID: <3665a1a00512010735g12d6caa2w1631d9ae68e98c5a@mail.gmail.com>

>> Thanks again for the patch. I've applied it. No problems.

I don't see any changes in ferret/ext.

>> Let me know if you have any problems outside of the unit tests.

I still have failing tests. I have attached the log of the test run.

On 12/1/05, David Balmain wrote:
>
> On 12/1/05, Anatol Pomozov wrote:
> > I have attached patch.
> >
> > Seems that some changes are just a TAB->SPACE conversion. My editor does
> > not like tabs.
> > BTW is there any source code formatter for C?
>
> Well, I use vim. It does a pretty good job of reformatting C. And it can
> convert between tabs and spaces. It works on Windows too. But you are
> probably happy with the editor you have. I don't know any other
> formatters for C in Windows. Maybe one of these will work for you;
>
> http://sourceforge.net/projects/astyle/
> http://sourceforge.net/projects/codeshine/
>
> Thanks again for the patch. I've applied it. No problems. I'll try and
> fix those other windows bugs for you. Let me know if you have any
> problems outside of the unit tests.
>

-- 
anatol (http://pomozov.info)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20051201/4f8774ba/attachment.htm
-------------- next part --------------
Loaded suite C:/work/opensource/ferret/test/test_all
Started
.............................................................................EE...E...................................................................................................
Finished in 18.437 seconds.

  1) Error:
test_fs_index(IndexTest):
Errno::EACCES: Permission denied - C:/work/opensource/ferret/test/temp/fsdir/_1.cfs
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:105:in `delete'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:105:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:104:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:123:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:123:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:104:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:101:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:101:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:74:in `new'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:68:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:68:in `new'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:126:in `initialize'
    C:/work/opensource/ferret/test/unit/../unit/analysis/../../unit/document/../../unit/index/tc_index.rb:145:in `new'
    C:/work/opensource/ferret/test/unit/../unit/analysis/../../unit/document/../../unit/index/tc_index.rb:145:in `test_fs_index'

  2) Error:
test_fs_index_is_persistant(IndexTest):
Errno::EACCES: Permission denied - C:/work/opensource/ferret/test/temp/fsdir/_8.tmp or C:/work/opensource/ferret/test/temp/fsdir/_8.cfs
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:162:in `rename'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:162:in `rename'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:161:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:161:in `rename'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:428:in `merge_segments'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:426:in `while_locked'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:426:in `merge_segments'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:425:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:425:in `merge_segments'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:351:in `flush_ram_segments'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:127:in `close'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:126:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/index/index_writer.rb:126:in `close'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:586:in `ensure_reader_open'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:495:in `size'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:494:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:494:in `size'
    C:/work/opensource/ferret/test/unit/../unit/analysis/../../unit/document/../../unit/index/tc_index.rb:166:in `test_fs_index_is_persistant'

  3) Error:
test_persist_index(IndexTest):
Errno::EACCES: Permission denied - C:/work/opensource/ferret/test/temp/fsdir/_8.cfs
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:105:in `delete'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:105:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:104:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:123:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:123:in `each'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:104:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:101:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:101:in `refresh'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:74:in `new'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:68:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/store/fs_store.rb:68:in `new'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:554:in `persist'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:550:in `synchronize'
    C:/work/opensource/ferret/test/../lib/ferret/index/index.rb:550:in `persist'
    C:/work/opensource/ferret/test/unit/../unit/analysis/../../unit/document/../../unit/index/tc_index.rb:260:in `test_persist_index'

182 tests, 3921 assertions, 0 failures, 3 errors

From carl at youngbloods.org  Thu Dec  1 20:32:00 2005
From: carl at youngbloods.org (Carl Youngblood)
Date: Thu, 1 Dec 2005 17:32:00 -0800
Subject: [Ferret-talk] How to get the count of matching documents
Message-ID:

I'm trying to generate a rails pagination helper for some ferret
search results, and I need to know how many total matches there are to
my search query. I don't see an obvious way of finding this. Any
help would be appreciated.

Thanks,
Carl Youngblood

From dbalmain.ml at gmail.com  Thu Dec  1 21:07:44 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Fri, 2 Dec 2005 11:07:44 +0900
Subject: [Ferret-talk] How to get the count of matching documents
In-Reply-To:
References:
Message-ID:

Hi Carl,

This is easy. If you use the search method (not search_each) a TopDocs
object is returned which has the attribute total hits.
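
For the pagination case, the arithmetic on top of that count can be sketched roughly as below. This is only an illustration under assumptions: the attribute is assumed to be spelled total_hits, and TopDocsStub, PAGE_SIZE, page_count and first_doc_for are made-up names for this sketch, not part of Ferret.

```ruby
# Hypothetical stand-in for the object returned by the search method; only
# the hit count is modelled, and the name total_hits is an assumption.
TopDocsStub = Struct.new(:total_hits)

PAGE_SIZE = 10 # assumed number of results per page

# Number of pages needed to show total_hits results (integer ceiling).
def page_count(total_hits, page_size = PAGE_SIZE)
  (total_hits + page_size - 1) / page_size
end

# Zero-based offset of the first result on a 1-based page number.
def first_doc_for(page_num, page_size = PAGE_SIZE)
  page_size * (page_num - 1)
end

top_docs = TopDocsStub.new(42)
puts "pages: #{page_count(top_docs.total_hits)}"
puts "offset of page 3: #{first_doc_for(3)}"
```

The offset is the kind of value one would pass along when fetching a single page of results, and the page count is what a pagination helper needs to render its links.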
Or you can continue to use the search_each method like this;

    total_hits = index.search_each("query") {|doc, score| puts doc}

Hope this helps,
Dave

From carl at youngbloods.org  Thu Dec  1 22:16:33 2005
From: carl at youngbloods.org (Carl Youngblood)
Date: Thu, 1 Dec 2005 19:16:33 -0800
Subject: [Ferret-talk] How to get the count of matching documents
In-Reply-To:
References:
Message-ID:

No doubt I just needed to scour the documentation more thoroughly, but
thanks for the help.

Carl

From carl at youngbloods.org  Fri Dec  2 00:16:27 2005
From: carl at youngbloods.org (Carl Youngblood)
Date: Thu, 1 Dec 2005 21:16:27 -0800
Subject: [Ferret-talk] cFerret ETA?
Message-ID:

I'm noticing some long delays when optimizing my index.
I know this is terribly inefficient, but in order to make sure that my
ActiveRecord model is in sync with my index, I'm optimizing after
every new record that I store, like so:

class Resume < ActiveRecord::Base
  include Ferret
  has_and_belongs_to_many :users
  SEARCH_INDEX = File.dirname(__FILE__) + '/../../searchindex'

  # synchronization with ferret index
  def after_save
    @@index ||= Index::Index.new(:path => SEARCH_INDEX,
                                 :create_if_missing => true)
    @@index << {:id => id, :email => email, :contents => contents,
                :date => found_on}
    @@index.flush
    @@index.optimize
  end

  def after_destroy
    @@index ||= Index::Index.new(:path => SEARCH_INDEX,
                                 :create_if_missing => true)
    @@index.delete(id)
    @@index.flush
    @@index.optimize
  end

  ...
end

I'm noticing about a 2-3 second delay after every new record that I
store. I'm thinking that this will be bearable when cFerret comes
out. Do you have any estimate on when that might be?

Thanks,
Carl

From dbalmain.ml at gmail.com  Fri Dec  2 01:22:57 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Fri, 2 Dec 2005 15:22:57 +0900
Subject: [Ferret-talk] cFerret ETA?
In-Reply-To:
References:
Message-ID:

Hi Carl,

I actually finished integrating my cFerret indexer but I'm not going
to release it. The performance is great but the code is really messy
and it would be a nightmare to maintain. Instead I'm porting the
search part of Lucene to C and I'll write a Ruby interface to this
when I'm finished. I wish I could give you an accurate estimate but
the search module is the largest part of Lucene so it's going to take
time. As an added bonus, search will be a lot faster as well as
indexing so it will be worth the wait. Hopefully I'll have something
finished by Christmas.

Solution: instead of optimizing your index every time you change it,
just flush it. This will keep your ActiveRecord model in sync with
your index without the large delay.

Cheers,
Dave

From carl at youngbloods.org  Fri Dec  2 02:23:58 2005
From: carl at youngbloods.org (Carl Youngblood)
Date: Thu, 1 Dec 2005 23:23:58 -0800
Subject: [Ferret-talk] Compile error on FreeBSD 4.10 gcc 2.95.4
Message-ID:

FYI, I tried installing ferret on my freebsd virtual server and got this:

retango# gem install ferret --include-dependencies
Attempting local installation of 'ferret'
Local gem file not found: ferret*.gem
Attempting remote installation of 'ferret'
Updating Gem source index for: http://gems.rubyforge.org
Building native extensions. This could take a while...
index_io.c: In function `frt_indexin_refill':
index_io.c:80: syntax error before `rStr'
index_io.c:82: `rStr' undeclared (first use in this function)
index_io.c:82: (Each undeclared identifier is reported only once
index_io.c:82: for each function it appears in.)
index_io.c: In function `frt_read_byte':
index_io.c:103: syntax error before `res'
index_io.c:104: `res' undeclared (first use in this function)
index_io.c: In function `frt_indexin_refill':
index_io.c:80: syntax error before `rStr'
index_io.c:82: `rStr' undeclared (first use in this function)
index_io.c:82: (Each undeclared identifier is reported only once
index_io.c:82: for each function it appears in.)
index_io.c: In function `frt_read_byte':
index_io.c:103: syntax error before `res'
index_io.c:104: `res' undeclared (first use in this function)
ruby extconf.rb install ferret --include-dependencies
creating Makefile

make
gcc -fPIC -g -O2 -I. -I/usr/local/lib/ruby/1.8/i386-freebsd4.10 -I/usr/local/lib/ruby/1.8/i386-freebsd4.10 -I. -c index_io.c
*** Error code 1

Stop in /usr/local/lib/ruby/gems/1.8/gems/ferret-0.2.2/ext.

make install
gcc -fPIC -g -O2 -I. -I/usr/local/lib/ruby/1.8/i386-freebsd4.10 -I/usr/local/lib/ruby/1.8/i386-freebsd4.10 -I. -c index_io.c
*** Error code 1

Stop in /usr/local/lib/ruby/gems/1.8/gems/ferret-0.2.2/ext.
Successfully installed ferret-0.2.2

Not sure why the native extensions don't compile. I'm assuming it's a
problem with gcc.

Thanks,
Carl

From anatol.pomozov at gmail.com  Fri Dec  2 03:12:42 2005
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Fri, 2 Dec 2005 09:12:42 +0100
Subject: [Ferret-talk] How to get the count of matching documents
In-Reply-To:
References:
Message-ID: <3665a1a00512020012t35de9577t4f31a9192dfffca9@mail.gmail.com>

What do you mean by "more thoroughly"?

This is code from my application and it works great.

page_num = @params['page'] ? @params['page'].to_i : 1
total_hits = index.search_each(query, :num_docs => PAGE_SIZE,
                               :first_doc => PAGE_SIZE*(page_num-1)) do |doc_num, score|
  @documents << index[doc_num]
end
@document_pages = Paginator.new self, total_hits, PAGE_SIZE, page_num

Does it make sense to put it on the "HowTos" page?

-- 
anatol (http://pomozov.info)

From anatol.pomozov at gmail.com  Fri Dec  2 03:31:23 2005
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Fri, 2 Dec 2005 09:31:23 +0100
Subject: [Ferret-talk] Compile error on FreeBSD 4.10 gcc 2.95.4
In-Reply-To:
References:
Message-ID: <3665a1a00512020031h33dd695blcd12bc425908021@mail.gmail.com>

I am not sure what it means, so I cannot solve your problem right now.

But I could probably say more if you grab the latest Ferret from SVN and
build it manually.

svn co svn://www.davebalmain.com/ferret/trunk ferret
cd ferret/ext
ruby extconf.rb
make

-- 
anatol (http://pomozov.info)

From anatol.pomozov at gmail.com  Fri Dec  2 03:38:56 2005
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Fri, 2 Dec 2005 09:38:56 +0100
Subject: [Ferret-talk] Compilation of ferret C-extension under Windows.
In-Reply-To: <3665a1a00512010735g12d6caa2w1631d9ae68e98c5a@mail.gmail.com>
References: <3665a1a00511301104o40a16c2ta30a92141a12cbb9@mail.gmail.com>
 <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com>
 <3665a1a00512010116g6f1c8742n1d786331a179d1a3@mail.gmail.com>
 <3665a1a00512010639r956c072p8a5e55776a27919d@mail.gmail.com>
 <3665a1a00512010735g12d6caa2w1631d9ae68e98c5a@mail.gmail.com>
Message-ID: <3665a1a00512020038i250e778aic162f24b56e9822d@mail.gmail.com>

Hi David.

The C extension for Ferret compiles and all tests pass. Does that mean the
native extension is already supported on Windows?

BTW, I know that many Windows developers have no C compiler or make
installed on their machines. Does it make sense to share the
ferret\ext\ferret_ext.so file compiled by MSVC.Net?

Does Gem have something like "platform specific" gems, for example one
*.gem for *nix and another for Windows, so that when running 'gem install'
Gem selects the *.gem depending on the user's platform?

-- 
anatol (http://pomozov.info)

From dbalmain.ml at gmail.com  Fri Dec  2 09:50:52 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Fri, 2 Dec 2005 23:50:52 +0900
Subject: [Ferret-talk] Compilation of ferret C-extension under Windows.
In-Reply-To: <3665a1a00512020038i250e778aic162f24b56e9822d@mail.gmail.com> References: <3665a1a00511301104o40a16c2ta30a92141a12cbb9@mail.gmail.com> <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com> <3665a1a00512010116g6f1c8742n1d786331a179d1a3@mail.gmail.com> <3665a1a00512010639r956c072p8a5e55776a27919d@mail.gmail.com> <3665a1a00512010735g12d6caa2w1631d9ae68e98c5a@mail.gmail.com> <3665a1a00512020038i250e778aic162f24b56e9822d@mail.gmail.com> Message-ID: On 12/2/05, Anatol Pomozov wrote: > Hi David. > > C extension for Ferret is compiled, all tests are passed. Does it mean that > native extension already supported on Windows?? I guess so. :) > BTW I know that many of Win developers have no installed C compiler and make > on their machines. Is it makes sense to share ferret\ext\ferret_ext.so file > that compiled by MSVC.Net?? I think that would be a great idea. > Does Gem have something like "Platform specific" gems, for example for *nix > one *.gem for Windows another?? So running 'gem install' Gem selects *.gem > depending on user platform. I'm not sure that gems has support for this but if it doesn't I'm sure I can work out some way to do it. Could you send me the ferret_ext.so file and I'll see what I can work out. At least it would be good to have it available from the Ferret wiki. Thanks, Dave > > On 12/1/05, Anatol Pomozov wrote: > > >> Thanks again for the patch. I've applied it. No problems. > > I dont see any changes in ferret/ext > > > > >>Let me know if you have any problems outside of the unit tests. > > I still have tests failed. I have attached log of test running. > > > > > > On 12/1/05, David Balmain < dbalmain.ml at gmail.com> wrote: > > > > > On 12/1/05, Anatol Pomozov wrote: > > > > I have attached patch. > > > > > > > > Seems that some changes just a TAB->SPACE conversation. My editor does > not > > > > like tabs. > > > > BTW is any source code formatter for C? > > > > > > Well, I use vim. It does a pretty good of reformatting C. 
And it can > > > convert between tabs and spaces. It works on Windows too. But you are > > > probably happy with the editor you have. I don't know any other > > > formatters for C in Windows. Maybe one of these will work for you; > > > > > > http://sourceforge.net/projects/astyle/ > > > http://sourceforge.net/projects/codeshine/ > > > > > > Thanks again for the patch. I've applied it. No problems. I'll try and > > > fix those other windows bugs for you. Let me know if you have any > > > problems outside of the unit tests. > > > > > > > > > > > -- > > anatol (http://pomozov.info) > > > > > > -- > anatol (http://pomozov.info) > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > > > From anatol.pomozov at gmail.com Fri Dec 2 10:13:53 2005 From: anatol.pomozov at gmail.com (Anatol Pomozov) Date: Fri, 2 Dec 2005 16:13:53 +0100 Subject: [Ferret-talk] Compilation of ferret C-extension under Windows. In-Reply-To: References: <3665a1a00511301104o40a16c2ta30a92141a12cbb9@mail.gmail.com> <3665a1a00512010113i437c2f75v9b6429b3d5698873@mail.gmail.com> <3665a1a00512010116g6f1c8742n1d786331a179d1a3@mail.gmail.com> <3665a1a00512010639r956c072p8a5e55776a27919d@mail.gmail.com> <3665a1a00512010735g12d6caa2w1631d9ae68e98c5a@mail.gmail.com> <3665a1a00512020038i250e778aic162f24b56e9822d@mail.gmail.com> Message-ID: <3665a1a00512020713n6c712d2fsd1248d8de0971450@mail.gmail.com> Hi, David. Just made svnup (Revision 159), clean and then build extension, run tests 182 tests, 4035 assertions, 0 failures, 0 errors Seems that on Win everything OK. So I am put builded extension here http://pomozov.info/downloads/ferret_ext.so Yestrday I had problems with my hoster but now site is alive. On 12/2/05, David Balmain wrote: > > On 12/2/05, Anatol Pomozov wrote: > > Hi David. > > > > C extension for Ferret is compiled, all tests are passed. 
Does it mean > that > > native extension already supported on Windows?? > > I guess so. :) > > > BTW I know that many of Win developers have no installed C compiler and > make > > on their machines. Is it makes sense to share ferret\ext\ferret_ext.so > file > > that compiled by MSVC.Net?? > > I think that would be a great idea. > > > Does Gem have something like "Platform specific" gems, for example for > *nix > > one *.gem for Windows another?? So running 'gem install' Gem selects > *.gem > > depending on user platform. > > I'm not sure that gems has support for this but if it doesn't I'm sure > I can work out some way to do it. Could you send me the ferret_ext.so > file and I'll see what I can work out. At least it would be good to > have it available from the Ferret wiki. > > Thanks, > Dave > > > > > On 12/1/05, Anatol Pomozov wrote: > > > >> Thanks again for the patch. I've applied it. No problems. > > > I dont see any changes in ferret/ext > > > > > > >>Let me know if you have any problems outside of the unit tests. > > > I still have tests failed. I have attached log of test running. > > > > > > > > > On 12/1/05, David Balmain < dbalmain.ml at gmail.com> wrote: > > > > > > > On 12/1/05, Anatol Pomozov wrote: > > > > > I have attached patch. > > > > > > > > > > Seems that some changes just a TAB->SPACE conversation. My editor > does > > not > > > > > like tabs. > > > > > BTW is any source code formatter for C? > > > > > > > > Well, I use vim. It does a pretty good of reformatting C. And it can > > > > convert between tabs and spaces. It works on Windows too. But you > are > > > > probably happy with the editor you have. I don't know any other > > > > formatters for C in Windows. Maybe one of these will work for you; > > > > > > > > http://sourceforge.net/projects/astyle/ > > > > http://sourceforge.net/projects/codeshine/ > > > > > > > > Thanks again for the patch. I've applied it. No problems. I'll try > and > > > > fix those other windows bugs for you. 
Let me know if you have any > > > > problems outside of the unit tests. > -- anatol (http://pomozov.info) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20051202/a34783d6/attachment.htm From dbalmain.ml at gmail.com Fri Dec 2 10:15:10 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 3 Dec 2005 00:15:10 +0900 Subject: [Ferret-talk] How to get the count of matching documents In-Reply-To: <3665a1a00512020012t35de9577t4f31a9192dfffca9@mail.gmail.com> References: <3665a1a00512020012t35de9577t4f31a9192dfffca9@mail.gmail.com> Message-ID: On 12/2/05, Anatol Pomozov wrote: > What do you mean saying "more thoroughly". This is code from my application > and it works great. > > page_num = @params['page'] ? @params['page'].to_i : 1 > > total_hits = index.search_each(query, :num_docs => PAGE_SIZE, :first_doc > => PAGE_SIZE*(page_num-1)) do |doc_num, score| > @documents << index[doc_num] > end > > @document_pages = Paginator.new self, total_hits, PAGE_SIZE, page_num > > Is it makes sense to put it to "HowTos" page?? Very good idea. I've had a few people ask me about this. I wish I had time. I'm leaving the Wiki up to the users at this stage. When I finish cFerret I'll have more time. Dave From dbalmain.ml at gmail.com Fri Dec 2 10:23:41 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 3 Dec 2005 00:23:41 +0900 Subject: [Ferret-talk] Ferret 0.3.0 released Message-ID: Hi folks, This latest release of Ferret has a lot of improvements. There have been substantial improvements to performance. Try it for yourself to see. I won't be publishing any numbers just yet. I will say though that it's still about 2-4 times slower than Lucene with the extension installed. There is also some performance improvements in the pure Ruby version if you haven't been able to install the C extension. 
As well as working on the performance, the locking system has had a
few changes to prevent some of the problems it has been causing
people. Now, if you cancel a process while it has a lock open, the
lock should still be released. Also, you can set the index to auto
flush;

    index = Index::Index.new(:auto_flush => true)

This will make sure no locks are kept open after updating the index.
You won't need to call flush anymore. This is very useful if you have
multiple processes modifying the index as you might have in a Rails
application.

Also, it should now compile with the MSVC compiler on windows thanks
to a patch from Anatol Pomozov. You can download it from here;

http://ferret.davebalmain.com/trac/attachment/wiki/Windows/ferret_ext.so?format=raw

And place it in the lib directory of your Ferret distribution.

Cheers,
Dave

Changes:

* Added lock finalizer.
* Added :auto_flush option to Index::Index
* Many speed optimizations
* Fixed extension to compile on Windows.

From fcsmith at gmail.com Fri Dec 2 10:28:18 2005
From: fcsmith at gmail.com (Finn Smith)
Date: Fri, 2 Dec 2005 10:28:18 -0500
Subject: [Ferret-talk] Ferret 0.3.0 released
In-Reply-To:
References:
Message-ID: <6e72bbd70512020728ya3fc516h5178d2ccc83112a3@mail.gmail.com>

On 12/2/05, David Balmain wrote:
>
> As well as working on the performance, the locking system has had a
> few changes to prevent some of the problems it has been causing
> people. Now, if you cancel a process while it has a lock open, the
> lock should still be released. Also, you can set the index to auto
> flush;
>
> index = Index::Index.new(:auto_flush => true)
>
> This will make sure no locks are kept open after updating the index.
> You won't need to call flush anymore. This is very useful if you have
> multiple processes modifying the index as you might have in a Rails
> application.

Thanks, David. These features will be very useful.
-F From dbalmain.ml at gmail.com Fri Dec 2 11:01:14 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 3 Dec 2005 01:01:14 +0900 Subject: [Ferret-talk] Ferret 0.3.0 released In-Reply-To: References: <6e72bbd70512020728ya3fc516h5178d2ccc83112a3@mail.gmail.com> Message-ID: One thing I forgot to mention. I'm currently investigating ceasing support for the pure ruby version of Ferret. This will enable me (I think) to build a much faster version of Ferret in the long term. If anyone has any strong opinions on this topic, please let me know. Cheers, Dave From aslak.hellesoy at gmail.com Fri Dec 2 11:35:57 2005 From: aslak.hellesoy at gmail.com (aslak hellesoy) Date: Fri, 2 Dec 2005 11:35:57 -0500 Subject: [Ferret-talk] Ferret 0.3.0 released In-Reply-To: References: <6e72bbd70512020728ya3fc516h5178d2ccc83112a3@mail.gmail.com> Message-ID: <8d961d900512020835w354505dbkace628601b66c847@mail.gmail.com> On 12/2/05, David Balmain wrote: > One thing I forgot to mention. I'm currently investigating ceasing > support for the pure ruby version of Ferret. This will enable me (I > think) to build a much faster version of Ferret in the long term. If > anyone has any strong opinions on this topic, please let me know. > I think that would probably be a good idea, allowing you to stay more focused and worry less about feature compatibility. However, I think it would be important to ensure that binary gems for win32 get released at the same pace as the standard (POSIX) gems. See how the sqlite gem handles it => http://rubyforge.org/cgi-bin/viewcvs.cgi/sqlite3-ruby/?cvsroot=sqlite-ruby If you don't have easy access to a pure win32 build environment (visual studio) maybe you'd consider inviting someone else who does as an additional committer? 
Cheers, Aslak > Cheers, > Dave > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From carl at youngbloods.org Fri Dec 2 13:03:07 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Fri, 2 Dec 2005 10:03:07 -0800 Subject: [Ferret-talk] Ferret 0.3.0 released In-Reply-To: References: Message-ID: I'm a little confused by this. The lib directory of Ferret won't exist until you have already installed it, and by that time the compiling attempt has already taken place. On 12/2/05, David Balmain wrote: > Also, it should now compile with the MSVC compiler on windows thanks > to a patch from Anatol Pomozov. You can download it from here; > > http://ferret.davebalmain.com/trac/attachment/wiki/Windows/ferret_ext.so?format=raw > > And place it in the lib directory of your Ferret distribution. From carl at youngbloods.org Fri Dec 2 13:05:30 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Fri, 2 Dec 2005 10:05:30 -0800 Subject: [Ferret-talk] Ferret 0.3.0 released In-Reply-To: References: Message-ID: Would it be possible to add this .so to the gem so that those who have MSVC installed on their machine can compile? On 12/2/05, David Balmain wrote: > Also, it should now compile with the MSVC compiler on windows thanks > to a patch from Anatol Pomozov. You can download it from here; > > http://ferret.davebalmain.com/trac/attachment/wiki/Windows/ferret_ext.so?format=raw > > And place it in the lib directory of your Ferret distribution. From anatol.pomozov at gmail.com Fri Dec 2 13:20:55 2005 From: anatol.pomozov at gmail.com (Anatol Pomozov) Date: Fri, 2 Dec 2005 19:20:55 +0100 Subject: [Ferret-talk] Ferret 0.3.0 released In-Reply-To: References: Message-ID: <3665a1a00512021020j73d73593s1886d99deafacf87@mail.gmail.com> If you try install Ferret on Win and if you dont have MS compiler && nmake then native extension compilation will fail. 
But Ferret will still be installed and will work as the pure Ruby version.

If you want Ferret to use the native extension, which is much faster than
pure Ruby, you need to download the file from the wiki and put it in the
"lib" dir where Ferret is installed. For me this dir is
"C:\Program Files\ruby\lib\ruby\gems\1.8\gems\ferret-0.3.0\lib"

On 12/2/05, Carl Youngblood wrote:
>
> I'm a little confused by this. The lib directory of Ferret won't
> exist until you have already installed it, and by that time the
> compiling attempt has already taken place.
>
> On 12/2/05, David Balmain wrote:
> > Also, it should now compile with the MSVC compiler on windows thanks
> > to a patch from Anatol Pomozov. You can download it from here;
> >
> > http://ferret.davebalmain.com/trac/attachment/wiki/Windows/ferret_ext.so?format=raw
> >
> > And place it in the lib directory of your Ferret distribution.

--
anatol (http://pomozov.info)

From carl at youngbloods.org Fri Dec 2 13:32:47 2005
From: carl at youngbloods.org (Carl Youngblood)
Date: Fri, 2 Dec 2005 10:32:47 -0800
Subject: [Ferret-talk] Ferret 0.3.0 released
In-Reply-To: <3665a1a00512021020j73d73593s1886d99deafacf87@mail.gmail.com>
References: <3665a1a00512021020j73d73593s1886d99deafacf87@mail.gmail.com>
Message-ID:

Okay, so you're saying that the .so file you submitted has already
been compiled and if I simply copy it to the lib directory, then
ferret will use the native extensions? That makes more sense.

Thanks,
Carl

On 12/2/05, Anatol Pomozov wrote:
> If you try install Ferret on Win and if you dont have MS compiler && nmake
> then native extension compilation will fail.
> But Ferret will be installed anyway and work as pure Ruby version. > > If you want that Fetter use native extension that much faster than pure Ruby > you need to download file from wiki and put it to "lib" dir where Ferret > installed. For me this dir is "C:\Program > Files\ruby\lib\ruby\gems\1.8\gems\ferret- 0.3.0\lib" > > > On 12/2/05, Carl Youngblood wrote: > > > > I'm a little confused by this. The lib directory of Ferret won't > > exist until you have already installed it, and by that time the > > compiling attempt has already taken place. > > > > On 12/2/05, David Balmain < dbalmain.ml at gmail.com> wrote: > > > Also, it should now compile with the MSVC compiler on windows thanks > > > to a patch from Anatol Pomozov. You can download it from here; > > > > > > > http://ferret.davebalmain.com/trac/attachment/wiki/Windows/ferret_ext.so?format=raw > > > > > > And place it in the lib directory of your Ferret distribution. > > > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > > > > -- > anatol (http://pomozov.info) > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > > > From anatol.pomozov at gmail.com Fri Dec 2 13:54:09 2005 From: anatol.pomozov at gmail.com (Anatol Pomozov) Date: Fri, 2 Dec 2005 19:54:09 +0100 Subject: [Ferret-talk] Ferret 0.3.0 released In-Reply-To: <8d961d900512020835w354505dbkace628601b66c847@mail.gmail.com> References: <6e72bbd70512020728ya3fc516h5178d2ccc83112a3@mail.gmail.com> <8d961d900512020835w354505dbkace628601b66c847@mail.gmail.com> Message-ID: <3665a1a00512021054q3fddc360kd5ea6abae661b82a@mail.gmail.com> I think it is good idea to have win32 version of gem for Ferret. 
I could take a look at it and create *-win32 gemspec On 12/2/05, aslak hellesoy wrote: > > On 12/2/05, David Balmain wrote: > > One thing I forgot to mention. I'm currently investigating ceasing > > support for the pure ruby version of Ferret. This will enable me (I > > think) to build a much faster version of Ferret in the long term. If > > anyone has any strong opinions on this topic, please let me know. > > > > I think that would probably be a good idea, allowing you to stay more > focused and worry less about feature compatibility. However, I think > it would be important to ensure that binary gems for win32 get > released at the same pace as the standard (POSIX) gems. See how the > sqlite gem handles it => > http://rubyforge.org/cgi-bin/viewcvs.cgi/sqlite3-ruby/?cvsroot=sqlite-ruby > > If you don't have easy access to a pure win32 build environment > (visual studio) maybe you'd consider inviting someone else who does as > an additional committer? > > Cheers, > Aslak > > > Cheers, > > Dave > > > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > -- anatol (http://pomozov.info) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20051202/28e96ac5/attachment.htm From carl at youngbloods.org Fri Dec 2 21:01:47 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Fri, 2 Dec 2005 18:01:47 -0800 Subject: [Ferret-talk] How to avoid duplicate search results Message-ID: I seem to be getting the same document multiple times in my search results. I'm wondering if this is because by default a document is placed in the search results every time the word you're looking for shows up. Is that the way it works? 
Thanks,
Carl

From dbalmain.ml at gmail.com Fri Dec 2 21:07:51 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Sat, 3 Dec 2005 11:07:51 +0900
Subject: [Ferret-talk] How to avoid duplicate search results
In-Reply-To:
References:
Message-ID:

On 12/3/05, Carl Youngblood wrote:
> I seem to be getting the same document multiple times in my search
> results. I'm wondering if this is because by default a document is
> placed in the search results every time the word you're looking for
> shows up. Is that the way it works?

Hi Carl,

This means the document has been placed in the index more than once.
Sounds to me like you are adding an object to the index every time
it is updated. You could try setting :key to :id. This will make sure
that :id is unique in the index. That is, every time you add a
document whose :id is already in the index, the existing document is
replaced.

    index = Index::Index.new(:key => :id)

Alternatively you could handle the deletes yourself. Hope this helps.

Dave

> Thanks,
> Carl

From carl at youngbloods.org Fri Dec 2 21:18:01 2005
From: carl at youngbloods.org (Carl Youngblood)
Date: Fri, 2 Dec 2005 18:18:01 -0800
Subject: [Ferret-talk] How to avoid duplicate search results
In-Reply-To:
References:
Message-ID:

Thanks Dave. I'll add the :key => :id.

On 12/2/05, David Balmain wrote:
> This means the document has been placed in the index more than once.
> Sounds to me like you are adding an object to the index every time
> it is updated. You could try setting :key to :id. This will make sure
> that :id is unique in the index.
That is, every time you add an
> existing document, the document is replaced

From anatol.pomozov at gmail.com Sat Dec 3 02:58:02 2005
From: anatol.pomozov at gmail.com (Anatol Pomozov)
Date: Sat, 3 Dec 2005 08:58:02 +0100
Subject: [Ferret-talk] How to avoid duplicate search results
In-Reply-To:
References:
Message-ID: <3665a1a00512022358p175e915apb0a29a157ad7248d@mail.gmail.com>

Hi, Carl.

Some info about this can be found in the HowTos
http://ferret.davebalmain.com/trac/wiki/HowTos
(see "How to use keys for document").

Feel free to add or fix any tips'n'tricks about Ferret on the wiki.

On 12/3/05, Carl Youngblood wrote:
>
> Thanks Dave. I'll add the :key => :id.
>
> On 12/2/05, David Balmain wrote:
> > This means the document has been placed in the index more than once.
> > Sounds to me like you are adding an object to the index every time
> > it is updated. You could try setting :key to :id. This will make sure
> > that :id is unique in the index. That is, every time you add an
> > existing document, the document is replaced

--
anatol (http://pomozov.info)

From erik at ehatchersolutions.com Sat Dec 3 06:14:48 2005
From: erik at ehatchersolutions.com (Erik Hatcher)
Date: Sat, 3 Dec 2005 06:14:48 -0500
Subject: [Ferret-talk] [Rails] ANN: acts_as_ferret
In-Reply-To:
References:
Message-ID: <2B3DB48D-3667-42FA-A994-6717C93170E2@ehatchersolutions.com>

CC'ing ferret-talk also.

Nice work, Kasper! You've beaten me to it - this was something I was
planning on tackling in the near future. I've got some additional
feedback for you inlined below.
Keep in mind that I'm being highly detailed in my feedback, in order to help this extension become the best it can be given Lucene best practices. Your work is a great start, and I want to see this evolve. All comments below are constructive, not even 'criticism'. Thanks for getting this started! On Dec 2, 2005, at 1:22 PM, Kasper Weibel wrote: > The result is the acts_as_ferret Mixin for ActivcRecord. > > Use it as follows: > In any model.rb add acts_as_ferret > > class Foo < ActiveRecord::Base > acts_as_ferret > end Ideally there will be many options desired besides just enabling a table to be indexed fully. More on that in a moment. > All CRUD operations will be performed on both ActiveRecord (as > usual) and a > ferret index for further searching. The toughest issue to deal with here is transactions. Suppose a database operation rolls back - then what happens to the index? It's out of sync. I don't have any easy solutions though, and it is an issue that pops up regularly in the Java Lucene community as well. There is quite a mismatch between a relational database and a full- text index when it comes to how updates and additions are handled. At the very least, a warning should be included mentioning the transactional issue. Another facility that is desirable with Lucene is the ability to rebuild the entire index from scratch. Why? Perhaps you change the analyzer, you will need to re-index all documents to have them re- analyzed. > The following method is available in your controllers: > > ActiveRecord::find_by_contents(query) # Query is a string > representing you query Dave mentioned this, but you're currently only indexing "id", but not the table name. Thus you could get documents that matching the query from other tables, and get an id that doesn't exist for the current table or one from a different table. Table name needs to be considered somehow, either by building a separate index for each table, or adding the table name as an indexed, untokenized field. 
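A minimal sketch of the table-name point, assuming a single index shared by several models. TinyIndex and index_key are hypothetical names made up for illustration; they are not part of Ferret's API. The sketch only shows why a bare "id" is ambiguous and how a table field plus a composite key removes the ambiguity.

```ruby
# Hypothetical stand-in for a shared index: maps a unique key to a document.
# This is NOT Ferret's API; it only illustrates the keying problem.
class TinyIndex
  def initialize
    @docs = {}
  end

  # Adding a document under an existing key replaces the old one.
  def add(key, doc)
    @docs[key] = doc
  end

  # Return ids of documents from +table+ whose text contains +word+.
  def search(table, word)
    @docs.values
         .select { |d| d[:table] == table && d[:text].include?(word) }
         .map { |d| d[:id] }
  end
end

# Build a key that stays unique across tables, e.g. "resumes:7".
def index_key(table, id)
  "#{table}:#{id}"
end

index = TinyIndex.new
index.add(index_key("resumes", 7), {:table => "resumes", :id => 7, :text => "ruby lucene"})
index.add(index_key("users", 7),   {:table => "users",   :id => 7, :text => "ruby rails"})

# Both rows have id 7, but each search only sees its own table.
p index.search("resumes", "lucene")
p index.search("users", "lucene")
```

With only the id as key, the second add would have silently replaced the first document.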
> The Ferret DB is stored in: > > {RAILS_ROOT}/db/index.db Please consider NOT calling it a "DB". Ferret is Lucene. What it builds is an "index", not a "database" in the traditional sense. I think it would be best to avoid "db" terminology to prevent confusion. > module ClassMethods > include Ferret > > INDEX_DIR = "#{RAILS_ROOT}/db/index.db" I'm not sure how to parameterize "acts_as" extensions, but making the index location more configurable would be good. > # Finds instances by file contents. > def find_by_contents(query, options = {}) > index_searcher ||= Search::IndexSearcher.new(INDEX_DIR) > query_parser ||= > QueryParser.new(index_searcher.reader.get_field_names.to_a) > query = query_parser.parse(query) QueryParser is only one (and often crude) way to formulate a Query. Ideally there would be a couple of methods to search with, one that takes a QueryParser-friendly expression like "foo AND bar NOT baz" and another that takes a Query instance allowing a developer to formulate sophisticated queries via the Ferret query API rather than parsing an expression. There are many good reasons for this, most importantly from a user interface perspective where the application makes more sense to have separate fields that build up a query rather than the one totally free-form Google-esque text box. Many applications need full-text search, but not in a way that users need to know query expression operators like +/-/AND/OR. Back to the table name issue, here you'll want to wrap the query with a BooleanQuery AND'd with a TermQuery for table: so that you're sure the only hits returned will be for the current table. > result = [] > index_searcher.search_each(query) do |doc, score| > id = index_searcher.reader.get_document(doc)["id"] > res = self.find(id) > result << res if res > end Some handling of paging needs to be added here. It is unlikely that all hits are needed, and accessing the Document for every hit will be an enormous performance bottle-neck with lots of data. 
It is very important to choose the hits enumeration carefully. Doing a database query for every hit is also likely to be a huge bottleneck. Perhaps doing a SQL "IN" query for all id's after the narrowing the set of hits (by page) is feasible, though I'm not sure what limits exist on how many items you can have with an "IN" clause. I've not delved into Ferret in much depth yet, but in Java Lucene a HitCollector would possibly be a good way to handle this. > index_searcher.close() > result > end It is definitely unwise to close the IndexSearcher instance for every search - leaving it open allows for field caches to warm up and speeds up successive searches. > # private > > def ferret_create > index ||= Index::Index.new(:key => :id, > :path => INDEX_DIR, > :create_if_missing => true, > :default_field => "*") Dave mentioned the key thing, and I'll reiterate the need to add the table name to it. > index << self.to_doc > index.optimize() > index.close() > end Reiterating Dave, but just to be thorough, optimizing and closing an index is not a good thing to do on every document operation as it can be slow. And definitely heed his advice about using flush. There does need to be a facility to optimize the index on demand, which developers may choose to do as a nightly batch process, or periodically as the index becomes segmented. > def ferret_update > #code to update index > index ||= Index::Index.new(:key => :id, > :path => INDEX_DIR, > :create_if_missing => true, > :default_field => "*") I recommend centralizing the Index constructor, so as to not duplicate all of those parameters and allowing them to be changed in one spot. 
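One way to centralize the constructor might look like the sketch below. StubIndex and FerretIndexConfig are hypothetical names used so the example runs without Ferret installed; in the mixin, Index::Index would take the stub's place. The option values are copied from the quoted code, except the path, which is a made-up relative path instead of RAILS_ROOT.

```ruby
# Hypothetical stand-in for Index::Index so the sketch is self-contained.
class StubIndex
  attr_reader :options

  def initialize(options)
    @options = options
  end
end

module FerretIndexConfig
  # The one place where index settings live.
  INDEX_OPTIONS = {
    :key               => :id,
    :path              => "db/index.db",   # assumption: illustrative path
    :create_if_missing => true,
    :default_field     => "*"
  }.freeze

  # Build the index once and reuse it; callers never repeat the options.
  def self.index(index_class = StubIndex)
    @index ||= index_class.new(INDEX_OPTIONS)
  end
end

first  = FerretIndexConfig.index
second = FerretIndexConfig.index
puts first.equal?(second)     # same object both times
puts first.options[:path]
```

The create, update and destroy callbacks could then all call FerretIndexConfig.index, so a change to the key or path is made in a single spot.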
> index.delete(self.id.to_s) > index << self.to_doc > index.optimize > index.close() > end > > def ferret_destroy > # code to delete from index > index ||= Index::Index.new(:key => :id, > :path => INDEX_DIR, > :create_if_missing => true, > :default_field => "*") > index_writer.delete(self.id.to_s) > index_writer.optimize() > index_writer.close() > end Again, the table name should be part of the key for all operations above. > def to_doc > # Churn through the complete Active Record and add it to the Ferret > document > doc = Ferret::Document::Document.new > self.attributes.each_pair do |key,val| > doc << Ferret::Document::Field.new(key, val.to_s, > Ferret::Document::Field::Store::YES, > Ferret::Document::Field::Index::TOKENIZED) > end > doc > end This to_doc is where a lot of fun can be had. There are many options that need to be parameterized by the developer at the model level. For example, how a field is indexed is crucial. You're storing and tokenizing every field, including the "id" field. You definitely do not want to tokenize the "id" field. Adding the table name is needed also, untokenized. Each field should allow flexibility on how it is (or is not) indexed, including whether to store/tokenize the field or not. Storing fields is unnecessary in the ActiveRecord sense, since what you're returning from the search method are records from the database, not documents from the index. Making the analyzer controllable is necessary at a global level for the index, and overridable on a per-field level too. A common technique with Lucene when field-level searching granularity is not relevant is to create an aggregate field, say "contents" where all text is indexed. With Ferret, you could do this by iterating over all fields that should be indexed/tokenized using the "contents" as the field name for all fields of the record. Then searches would occur only against "contents". 
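A minimal sketch of the aggregate "contents" idea, in plain Ruby with no Ferret calls. It simplifies the technique by joining the text of all indexable attributes into one string, which is an assumption made for illustration rather than the per-field iteration described above; the method name `aggregate_contents` and the `skip` list are hypothetical.

```ruby
# Sketch only: collect the text that would go into an aggregate
# "contents" field. aggregate_contents and skip are hypothetical names.
def aggregate_contents(attributes, skip = ["id"])
  attributes.reject { |key, _| skip.include?(key.to_s) } # leave out keys such as "id"
            .map { |_, val| val.to_s }                   # same to_s conversion as to_doc
            .reject { |text| text.empty? }               # drop blank values
            .join(" ")
end

record = { "id" => 7, "title" => "Ferret on Rails", "body" => "Full-text search" }
contents = aggregate_contents(record)
```

The resulting string would be indexed tokenized under "contents", while "id" (and the table name) would be added separately as untokenized fields.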
While Dave likes the default field to be "*", I personally find distributing a query expression across all fields tricky and error-prone, especially given that different fields may be analyzed differently. Consider a query for "foo bar". With two fields "title" and "body", how do you expand that query across all fields? Not trivial. This is why I like the aggregate "contents" field technique, which can work in conjunction with fields indexed individually also, so a query for "foo bar" would search the "contents" field by default, but someone could do "title:foo body:bar" to refine things. I think this is enough, and perhaps too much(!), feedback for now :) Sorry if it seems overly picky, but I think this is a very important addition to Rails and ActiveRecord. The magic that is Lucene is very special, and I'm thrilled that it has now entered the Ruby world. I want to help make sure that Ferret's integration into places like ActiveRecord goes as smoothly as possible and keeps the outstanding reputation that Lucene has in the Java (and C# and Python, etc) world. There are many ways to use Lucene inefficiently - I'll be here doing what I can to help oversee that things are done in the best possible way. Erik From dbalmain.ml at gmail.com Sat Dec 3 07:00:47 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 3 Dec 2005 21:00:47 +0900 Subject: [Ferret-talk] [Rails] ANN: acts_as_ferret In-Reply-To: <2B3DB48D-3667-42FA-A994-6717C93170E2@ehatchersolutions.com> References: <2B3DB48D-3667-42FA-A994-6717C93170E2@ehatchersolutions.com> Message-ID: Thanks for the feedback Erik. I've actually posted the acts_as_ferret code on the Ferret wiki with a few improvements. But it's far from optimal. Please add improvements or post your ideas here; http://ferret.davebalmain.com/trac/wiki/FerretOnRails Hopefully with Erik's feedback and a few Rails gurus looking over it we'll soon have a really nice solution to Rails Ferret integration.
> While Dave likes the default field to > be "*", I personally find distributing a query expression across all > fields tricky and error-prone, especially given that different fields > may be analyzed differently. Just to defend my honour :-P I actually totally agree with Erik here. Think of the default field "*" as like Rails scaffolding. It's handy to get you started but you'll have to put a bit of work and thought into it yourself to get the most out of Ferret. Cheers, Dave From erik at ehatchersolutions.com Mon Dec 5 04:36:25 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Mon, 5 Dec 2005 04:36:25 -0500 Subject: [Ferret-talk] [Rails] Re: ANN: acts_as_ferret In-Reply-To: References: <2B3DB48D-3667-42FA-A994-6717C93170E2@ehatchersolutions.com> Message-ID: <435B0C4C-2BD2-42BD-B5D7-CD0B7107F422@ehatchersolutions.com> On Dec 4, 2005, at 11:39 PM, Thomas Lockney wrote: > (The Portland Ruby Brigade has their monthly meeting on Tuesday, so > that's one > nights work missed. > ;~) You Portland Rubyists really know how to party! I went to the event during OSCON in August - what a blast. > 1. Adding configuration > > The notation I'm working on is something like this: > > acts_as_ferret :index_dir => "#{RAILS_ROOT}/index/", fields > => {...} So you're thinking that each model may have its own index? I wasn't sure if one index per model made sense or whether a single index, globally configured through environment.rb and friends, made the most sense. Using one index would allow some future clever things such as querying without the table name allowing results to come back with objects spanning multiple models. I'm leaning towards preferring a single index, such that the :index_dir configuration would be done via environments.rb globally, not per model. > 2. Adding the ability to pass Query objects to the find_by_contents > method. Cool. Maybe this should be renamed to find_by_ferret? 
If a String is passed in, it gets parsed (with the options hash allowing control over the parsing), and if a Query is passed in then it is used as-is. > I've been doing some refactoring along the way, too, and hope to > add some unit > tests eventually. One final suggestion, perhaps the name should be > changed to > acts_as_indexed? I like it being acts_as_ferret personally. "indexed" is overloaded within the relational database domain, so it could be construed as having to do with DB indexes. > Anyway, this is great work. I hope I can make worthwhile > contributions to this. Thanks for your efforts! I'm glad to see this all coming together. Erik From weibel at gmail.com Mon Dec 5 07:23:46 2005 From: weibel at gmail.com (Kasper Weibel Nielsen-Refs) Date: Mon, 5 Dec 2005 12:23:46 +0000 Subject: [Ferret-talk] [Rails] Re: ANN: acts_as_ferret Message-ID: Hi all First of all I'd like to take the opportunity to thank you all for the great response. Personally I feel that this approach to Ferret/Rails integration will be a good thing to investigate further. People need quality search. I think that we should agree on where to put the input for this project. The page on David Balmain's wiki is a good start - thanks for that David. http://ferret.davebalmain.com/trac/wiki/FerretOnRails I needed this code for a specific task at my job and there are still many things to do to make it generally usable. I will comment on different people's input below. Thanks to David for giving direct input for enhancing the quality of the code and explaining index.flush() to me. It's good to have the author of Ferret giving direct input as I'm not really sure where the pitfalls in the implementation are speed/quality wise. As both David and Erik Hatcher have pointed out, the current implementation will only index one model per application. My view on this issue is that I would like to have one index for all models as opposed to multiple index files; that is ONE Ferret index per application.
I will also need to implement a method for rebuilding the index. This will come in handy both when in development mode and probably also in production. Erik pointed out that there will be problems with transactions and I must admit that I don't have any viable ideas of how to approach this issue. I have thought of turning transactions off for the SQL tables in question - if that's possible at all. Erik also had problems with the name index.db. Instead I suggest index.frt The current search method should be worked on. At the moment it fires quite a few SQL select statements. There is also a need for the implementation of pagination. The to_doc method is one way to approach things when building the index. I actually thought of Erik's suggestion about an aggregate field, which sounds practical. There should be a way of configuring which fields go where. I have had many ideas of what other things to implement. One of them is that hard core Lucene folks will probably not put up with the limitations of a specific implementation if it makes things difficult. One of the things I like about Active Record in Rails is the find_by_sql() method which lets you do whatever you want on the SQL side. A similar approach could be implemented with Ferret. find_by_fql() - if there is such a term as Ferret Query Language. Also the many possibilities for fine tuning should not be forgotten in favour of simplicity. There should always be a way to make the configuration exactly as you would like it. I favour the configuration approach Thomas Lockney has suggested. Lastly: I really appreciate your contributions and I feel that with our combined efforts it will be possible to build a quality solution. In time acts_as_ferret could become the preferred choice for Ferret/Rails integration.
Kasper From carl at youngbloods.org Thu Dec 8 00:54:36 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Wed, 7 Dec 2005 21:54:36 -0800 Subject: [Ferret-talk] Confusing lock problem in rails Message-ID: I have a model class in rails that has a class variable that is a ferret index. For some reason, the methods in my class that refer to the class variable are getting lock conflicts. Can anybody see any obvious reason why? I notice that it keeps leaving a lock file in the index directory. I thought auto_flush was supposed to remove the lock automatically after every operation. Is there something I'm doing wrong? FYI, I'm doing my development on Windows. Thanks, Carl class Resume < ActiveRecord::Base include Ferret has_and_belongs_to_many :users @@index = Index::Index.new(:path => RAILS_ROOT + '/searchindex', :key => :email, :create_if_missing => true, :auto_flush => true, :close_dir => true) # syncronization with ferret index def after_save @@index << {:id => id, :email => email, :contents => contents, :date => found_on} end def after_destroy @@index.delete(id) end def self.optimize_index @@index.optimize end def self.search(query, options) docs = [] count = @@index.search_each(query, options) do |id, score| doc = {} doc[:id] = id doc[:email] = @@index[id]['email'] doc[:contents] = @@index[id]['contents'] doc[:date] = @@index[id]['date'] ind = doc[:contents].downcase.index(query) ind = (ind > 20) ? (ind - 20) : 0 doc[:teaser] = doc[:contents][ind..(ind + 220)] docs << doc end [count, docs] end def mark_as_viewed(user_id) # user = self.users.find(user_id) end def self.delete_old @p = Pref.find_by_setting('autodelete') if @p and @p.value.to_i > 0 val = @p.value.to_i destroy_all(["found_on <= DATE_SUB(CURDATE(), INTERVAL ? 
DAY)", val]) end end def self.delete_before(params) date = sprintf("%04d-%02d-%02d", params[:foundon][:year], params[:foundon][:month], params[:foundon][:day]) deleted = Resume.destroy_all(["found_on <= ?", date]) optimize_index deleted.length end validates_uniqueness_of :email end From dbalmain.ml at gmail.com Thu Dec 8 01:32:56 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 8 Dec 2005 15:32:56 +0900 Subject: [Ferret-talk] Confusing lock problem in rails In-Reply-To: References: Message-ID: Hi Carl, Sorry, there is a bug. I'll make up another release now. Please let me know if it doesn't fix the problem. Cheers, Dave On 12/8/05, Carl Youngblood wrote: > I have a model class in rails that has a class variable that is a > ferret index. For some reason, the methods in my class that refer to > the class variable are getting lock conflicts. Can anybody see any > obvious reason why? I notice that it keeps leaving a lock file in the > index directory. I thought auto_flush was supposed to remove the lock > automatically after every operation. Is there something I'm doing > wrong? FYI, I'm doing my development on Windows. > > Thanks, > > Carl > > class Resume < ActiveRecord::Base > include Ferret > has_and_belongs_to_many :users > @@index = Index::Index.new(:path => RAILS_ROOT + '/searchindex', > :key => :email, > :create_if_missing => true, > :auto_flush => true, > :close_dir => true) > > # syncronization with ferret index > def after_save > @@index << {:id => id, :email => email, :contents => contents, > :date => found_on} > end > > def after_destroy > @@index.delete(id) > end > > def self.optimize_index > @@index.optimize > end > > def self.search(query, options) > docs = [] > count = @@index.search_each(query, options) do |id, score| > doc = {} > doc[:id] = id > doc[:email] = @@index[id]['email'] > doc[:contents] = @@index[id]['contents'] > doc[:date] = @@index[id]['date'] > ind = doc[:contents].downcase.index(query) > ind = (ind > 20) ? 
(ind - 20) : 0 > doc[:teaser] = doc[:contents][ind..(ind + 220)] > docs << doc > end > [count, docs] > end > > def mark_as_viewed(user_id) > # user = self.users.find(user_id) > end > > def self.delete_old > @p = Pref.find_by_setting('autodelete') > if @p and @p.value.to_i > 0 > val = @p.value.to_i > destroy_all(["found_on <= DATE_SUB(CURDATE(), INTERVAL ? DAY)", val]) > end > end > > def self.delete_before(params) > date = sprintf("%04d-%02d-%02d", params[:foundon][:year], > params[:foundon][:month], > params[:foundon][:day]) > deleted = Resume.destroy_all(["found_on <= ?", date]) > optimize_index > deleted.length > end > > validates_uniqueness_of :email > end > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From carl at youngbloods.org Thu Dec 8 20:34:56 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Thu, 8 Dec 2005 17:34:56 -0800 Subject: [Ferret-talk] what exactly does close_dir option do? Message-ID: I'm trying to figure out if I should be setting close_dir to true or false when I access my index. It seems like this has something to do with the state that the index is left in after one process is finished using it, but it's not clear exactly what this does. Can anybody explain further? Thanks, Carl From carl at youngbloods.org Thu Dec 8 20:57:42 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Thu, 8 Dec 2005 17:57:42 -0800 Subject: [Ferret-talk] Index returning ids that are one less than they should be Message-ID: I'm saving records to an index like so: index << {:id => id, :email => email, :contents => contents, :date => found_on} In debugging my code, it appears that whatever I set a record's id to, when I find that record in a search, it returns the id minus 1, so if the first record I store in my database has an id of one and I store its counterpart in my index, then when I retrieve it from the index it has an id of zero. 
Has anybody seen this before? Thanks, Carl From carl at youngbloods.org Thu Dec 8 21:16:54 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Thu, 8 Dec 2005 18:16:54 -0800 Subject: [Ferret-talk] Confusing lock problem in rails In-Reply-To: References: Message-ID: FYI, it appears to be fixed. Thanks for the help. On 12/7/05, David Balmain wrote: > Hi Carl, > > Sorry, there is a bug. I'll make up another release now. Please let me > know if it doesn't fix the problem. > > Cheers, > Dave From dbalmain.ml at gmail.com Thu Dec 8 22:16:59 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Fri, 9 Dec 2005 12:16:59 +0900 Subject: [Ferret-talk] what exactly does close_dir option do? In-Reply-To: References: Message-ID: On 12/9/05, Carl Youngblood wrote: > I'm trying to figure out if I should be setting close_dir to true or > false when I access my index. It seems like this has something to do > with the state that the index is left in after one process is finished > using it, but it's not clear exactly what this does. Can anybody > explain further? Hi Carl, When you create an Index you can pass a path or you can pass an actual Directory object like this; dir = Store::FSDirectory.new("/path/to/index") index = Index::Index.new(:dir => dir, :close_dir =>true) Setting close_dir to true will automatically close the Directory object when you close the Index. If you are not creating the Directory object yourself there is no need to worry about the :close_dir option. 
Hope that helps, Dave From dbalmain.ml at gmail.com Thu Dec 8 22:23:39 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Fri, 9 Dec 2005 12:23:39 +0900 Subject: [Ferret-talk] Index returning ids that are one less than they should be In-Reply-To: References: Message-ID: On 12/9/05, Carl Youngblood wrote: > I'm saving records to an index like so: > > index << {:id => id, :email => email, :contents => contents, :date => found_on} > > In debugging my code, it appears that whatever I set a record's id to, > when I find that record in a search, it returns the id minus 1, so if > the first record I store in my database has an id of one and I store > its counterpart in my index, then when I retrieve it from the index it > has an id of zero. Has anybody seen this before? Hi Carl, I think you are talking about the internal index of the document in the Index. The internal indexing of the documents starts from zero and is independent of the id field in your document. The id field in your document is just like any of the other fields as far as the index is concerned. It's only the Index::Index class that allows you to do special operations on that field. So to see the id field you need to retrieve it like this; index.search_each("your query") do |doc, score| id = index[doc][:id] # do something with the id here. end Hope that helps. You might also want to check out the code here for some more ideas; http://ferret.davebalmain.com/trac/wiki/FerretOnRails Cheers, Dave > Thanks, > Carl > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From carl at youngbloods.org Tue Dec 13 20:55:22 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Tue, 13 Dec 2005 17:55:22 -0800 Subject: [Ferret-talk] undefined method `add' for Ferret::Search::BooleanQuery Message-ID: Up to now in my ferret development I have been using simple single-word strings as my search queries. 
I just now am trying to increase the complexity of my queries. When I was passing a single word with no spaces in my index searches, like so: count = index.search_each('testing') do |d, s| ... end everything worked fine. But now when I do something like this: count = index.search_each('contents:"testing|trucks"') do |d, s| ... end I get the following error: undefined method `add' for # Trace is: c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/search/multi_phrase_query.rb:170:in `rewrite' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/search/multi_phrase_query.rb:169:in `each' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/search/multi_phrase_query.rb:169:in `rewrite' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/search/index_searcher.rb:165:in `rewrite' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/search/query.rb:50:in `weight' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/search/index_searcher.rb:104:in `search' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/index/index.rb:606:in `do_search' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/index/index.rb:303:in `search_each' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/index/index.rb:302:in `synchronize' c:/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.1/lib/ferret/index/index.rb:302:in `search_each' #{RAILS_ROOT}/app/models/resume.rb:40:in `search' #{RAILS_ROOT}/app/controllers/search_controller.rb:14:in `index' Is this a known bug? Thanks, Carl From dbalmain.ml at gmail.com Tue Dec 13 22:00:03 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 14 Dec 2005 12:00:03 +0900 Subject: [Ferret-talk] undefined method `add' for Ferret::Search::BooleanQuery In-Reply-To: References: Message-ID: On 12/14/05, Carl Youngblood wrote: > Up to now in my ferret development I have been using simple > single-word strings as my search queries. I just now am trying to > increase the complexity of my queries. 
When I was passing a single > word with no spaces in my index searches, like so: > > count = index.search_each('testing') do |d, s| > ... > end > > everything worked fine. But now when I do something like this: > > count = index.search_each('contents:"testing|trucks"') do |d, s| > ... > end > > I get the following error: > > undefined method `add' for # > > Trace is: > > Is this a known bug? Hi Carl, There are currently no known bugs in Ferret. If I know about it I'll fix it as soon as possible. This was an unknown bug which has now been swatted. You can get the latest from svn. I might put out another release soon. Find me a couple more bugs and I definitely will. Dave > Thanks, > > Carl > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From carl at youngbloods.org Tue Dec 13 23:14:54 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Tue, 13 Dec 2005 20:14:54 -0800 Subject: [Ferret-talk] undefined method `add' for Ferret::Search::BooleanQuery In-Reply-To: References: Message-ID: On 12/13/05, David Balmain wrote: > fix it as soon as possible. This was an unknown bug which has now been > swatted. You can get the latest from svn. I might put out another > release soon. Find me a couple more bugs and I definitely will. Thanks Dave. I'll keep an eye out. In the mean time I've manually patched my code. Carl From carl at youngbloods.org Tue Dec 13 23:25:42 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Tue, 13 Dec 2005 20:25:42 -0800 Subject: [Ferret-talk] Is it possible to highlight search keywords in results? Message-ID: I'm wondering if ferret has any built-in search/replace mechanism that I might be able to use to highlight the query data in each search result. 
The reason I think this would be a good idea is that I could end up having to practically duplicate the ferret query parser just to interpret the query so that I can figure out how to highlight the keywords in the search results. Just in case I'm not making sense, here is an example of what I want: query = 'contents:"testing|trucks"' prepend = '' append = '' count = index.search_each(query) do |doc, score| highlighted_contents = index.highlight_for_query(doc, query, prepend, append) puts highlighted_contents end This would make all instances of "testing" and "trucks" appear in bold for html formatted text. Thoughts? Thanks, Carl From dbalmain.ml at gmail.com Tue Dec 13 23:51:13 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 14 Dec 2005 13:51:13 +0900 Subject: [Ferret-talk] Is it possible to highlight search keywords in results? In-Reply-To: References: Message-ID: There is a highlighter for Lucene in the Lucene sandbox. You can have a look at it and try porting it to Ruby if you like. If you can wait a month or two I'll do it myself. http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/ On 12/14/05, Carl Youngblood wrote: > I'm wondering if ferret has any built-in search/replace mechanism that > I might be able to use to highlight the query data in each search > result. The reason I think this would be a good idea is that I could > end up having to practically duplicate the ferret query parser just to > interpret the query so that I can figure out how to highlight the > keywords in the search results. Just in case I'm not making sense, > here is an example of what I want: > > query = 'contents:"testing|trucks"' > prepend = '' > append = '' > count = index.search_each(query) do |doc, score| > highlighted_contents = > index.highlight_for_query(doc, query, prepend, append) > puts highlighted_contents > end > > This would make all instances of "testing" and "trucks" appear in bold > for html formatted text. Thoughts? 
> > Thanks, > Carl > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From carl at youngbloods.org Wed Dec 14 00:35:22 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Tue, 13 Dec 2005 21:35:22 -0800 Subject: [Ferret-talk] Is it possible to highlight search keywords in results? In-Reply-To: References: Message-ID: Sounds like a good idea. I have bigger fish to fry right now but I may get around to it if my search queries get hairier. Right now a quick and dirty solution is working okay for me. Thanks, Carl On 12/13/05, David Balmain wrote: > There is a highlighter for Lucene in the Lucene sandbox. You can have > a look at it and try porting it to Ruby if you like. If you can wait a > month or two I'll do it myself. From carl at youngbloods.org Wed Dec 14 00:54:39 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Tue, 13 Dec 2005 21:54:39 -0800 Subject: [Ferret-talk] Query question Message-ID: I have an index in which I want different records to be accessible to different users. I think I can do this by adding a "users" field to each record in the index and narrow down my queries to only those records matching the current user's userid. I have the userids separated by commas. What would be the right way to query for a certain user? I have to make sure that I don't find records belonging to the wrong user because a shorter number matches a larger one. For example, if a users field contains: 3,45,66,7779 I don't want a query for 77 to match this. How can I make sure my query matches whole words only? 
Thanks, Carl From dbalmain.ml at gmail.com Wed Dec 14 01:14:04 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 14 Dec 2005 15:14:04 +0900 Subject: [Ferret-talk] Query question In-Reply-To: References: Message-ID: On 12/14/05, Carl Youngblood wrote: > I have an index in which I want different records to be accessible to > different users. I think I can do this by adding a "users" field to > each record in the index and narrow down my queries to only those > records matching the current user's userid. I have the userids > separated by commas. What would be the right way to query for a > certain user? I have to make sure that I don't find records belonging > to the wrong user because a shorter number matches a larger one. For > example, if a users field contains: > > 3,45,66,7779 > > I don't want a query for 77 to match this. How can I make sure my > query matches whole words only? You have two choices. To match whole words only, i.e. separated by spaces, use the WhitespaceAnalyzer. You can use the PerFieldAnalyzer if you only want to use the WhitespaceAnalyzer on one field and the StandardAnalyzer on all the others. The second choice, which I'd recommend in this instance, is to store the field untokenized. For example doc = Document::Document.new() # Note the UNTOKENIZED here. That means the whole field is indexed as # a single term. You don't have to store the field if you don't want to. doc << Document::Field.new(:user_id, "3,45,66,7779", Document::Field::Store::YES, Document::Field::Index::UNTOKENIZED) index << doc query = TermQuery.new(Term.new(:user_id, "3,45,66,7779")) index.search_each(query)...etc. Hope this makes sense. Let me know if you need more clarification.
Cheers, Dave > Thanks, > > Carl > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From carl at youngbloods.org Wed Dec 14 01:50:47 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Tue, 13 Dec 2005 22:50:47 -0800 Subject: [Ferret-talk] Query question In-Reply-To: References: Message-ID: On 12/13/05, David Balmain wrote: > doc << Document::Field.new(:user_id, "3,45,66,7779", > Document::Field::Store::YES, > Document::Field::Index::UNTOKENIZED) > > index << doc > > query = TermQuery.new(Term.new(:user_id, "3,45,66,7779")) > index.search_each(query)...etc. I don't think that's going to work for me, because I'm never going to be querying the full value of :user_id. I'm always going to be querying only one of the numbers between the commas. In this case, an untokenized field won't work for me, right? I think maybe the better thing to do is to separate the ids with spaces and use the WhitespaceAnalyzer. So just to make sure I have this straight, if I separate my ids with spaces, like so: index << { :id => 1, :users => '1 2 3', :contents => 'string number one', } index << { :id => 2, :users => '33 45', :contents => 'string number two', } And then I do a query like this: count = index.search_each('users:("3") contents:"string"') do |d, s| puts index[d][:contents] end Will I get only the first record or will I get both? 
Thanks, Carl From dbalmain.ml at gmail.com Wed Dec 14 02:13:38 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 14 Dec 2005 16:13:38 +0900 Subject: [Ferret-talk] Query question In-Reply-To: References: Message-ID: On 12/14/05, Carl Youngblood wrote: > On 12/13/05, David Balmain wrote: > > doc << Document::Field.new(:user_id, "3,45,66,7779", > > Document::Field::Store::YES, > > Document::Field::Index::UNTOKENIZED) > > > > index << doc > > > > query = TermQuery.new(Term.new(:user_id, "3,45,66,7779")) > > index.search_each(query)...etc. > > I don't think that's going to work for me, because I'm never going to > be querying the full value of :user_id. I'm always going to be > querying only one of the numbers between the commas. In this case, an > untokenized field won't work for me, right? > > I think maybe the better thing to do is to separate the ids with > spaces and use the WhitespaceAnalyzer. So just to make sure I have > this straight, if I separate my ids with spaces, like so: > > index << { > :id => 1, > :users => '1 2 3', > :contents => 'string number one', > } > index << { > :id => 2, > :users => '33 45', > :contents => 'string number two', > } > > And then I do a query like this: > > count = index.search_each('users:("3") contents:"string"') do |d, s| > puts index[d][:contents] > end > > Will I get only the first record or will I get both? Just the first one. Sorry, I didn't understand what you wanted before. Any query will only match the whole word unless you use a wildcard. For example index.search_each('users:3* contents:string') Will match both of the documents above. index.search_each('users:3 contents:"string number"') Will only match the first one. Also note that '"' are used to wrap phrases but are not necessary for single word queries. 
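The whole-word versus wildcard behaviour described here can be modelled in a few lines of plain Ruby. This is a toy model for illustration only, not Ferret code: it assumes the "users" field is split on whitespace into terms, and the method name `field_matches?` is hypothetical.

```ruby
# Toy model of term matching, not Ferret code. A field's text is split
# on whitespace into terms; a query term must equal a term exactly,
# unless it ends in "*", in which case it only has to be a prefix.
def field_matches?(field_text, query_term)
  terms = field_text.split
  if query_term.end_with?("*")
    prefix = query_term.chomp("*")
    terms.any? { |term| term.start_with?(prefix) }
  else
    terms.include?(query_term)
  end
end

# The two "users" values from the example documents above.
docs = { 1 => "1 2 3", 2 => "33 45" }

exact  = docs.select { |_, users| field_matches?(users, "3") }.keys
prefix = docs.select { |_, users| field_matches?(users, "3*") }.keys
```

Under this model a search for 3 selects only document 1, while 3* selects both documents, because "33" starts with "3".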
> Thanks, > > Carl > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From erik at ehatchersolutions.com Wed Dec 14 06:56:17 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Wed, 14 Dec 2005 06:56:17 -0500 Subject: [Ferret-talk] Is it possible to highlight search keywords in results? In-Reply-To: References: Message-ID: <658328D1-8EF6-49EC-A29E-50681E188F39@ehatchersolutions.com> The Highlighter is cool, but does have some drawbacks. First, its advantages.... you can easily get the terms for a Query and then do some substitutions in the original document, but Highlighter actually picks out the best fragments to show. Its disadvantage is that it only highlights the terms of the query. For example, if you search for "some phrase", any occurrence of the word "some" is highlighted, even if it is not followed by "phrase". More visually, here's how it comes out: http://www.lucenebook.com/search?query=%22term+positions%22 It would be sweet to have a Ruby highlighter. I've done work in Java for a consulting client to convert a Query into a SpanQuery and have created a non-generalizable highlighter (does not fragment, currently) that highlights very precisely, so it is possible, just not trivial. Erik On Dec 13, 2005, at 11:51 PM, David Balmain wrote: > There is a highlighter for Lucene in the Lucene sandbox. You can have > a look at it and try porting it to Ruby if you like. If you can wait a > month or two I'll do it myself. > > http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/ > > On 12/14/05, Carl Youngblood wrote: >> I'm wondering if ferret has any built-in search/replace mechanism >> that >> I might be able to use to highlight the query data in each search >> result.
The reason I think this would be a good idea is that I could >> end up having to practically duplicate the ferret query parser >> just to >> interpret the query so that I can figure out how to highlight the >> keywords in the search results. Just in case I'm not making sense, >> here is an example of what I want: >> >> query = 'contents:"testing|trucks"' >> prepend = '' >> append = '' >> count = index.search_each(query) do |doc, score| >> highlighted_contents = >> index.highlight_for_query(doc, query, prepend, append) >> puts highlighted_contents >> end >> >> This would make all instances of "testing" and "trucks" appear in >> bold >> for html formatted text. Thoughts? >> >> Thanks, >> Carl >> >> _______________________________________________ >> Ferret-talk mailing list >> Ferret-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/ferret-talk >> > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk From erik at ehatchersolutions.com Wed Dec 14 07:08:39 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Wed, 14 Dec 2005 07:08:39 -0500 Subject: [Ferret-talk] Query question In-Reply-To: References: Message-ID: <98E451E5-8628-4135-ABB1-09F17963915B@ehatchersolutions.com> Dave - isn't there a slick way to use a regex for an analyzer to split tokens with Ferret? If so, that would be an ideal solution for splitting at a comma. Or you could split the string prior to indexing, iterate over the array from the split, and then index each user id as a unique untokenized but indexed field. Erik On Dec 14, 2005, at 12:54 AM, Carl Youngblood wrote: > I have an index in which I want different records to be accessible to > different users. I think I can do this by adding a "users" field to > each record in the index and narrow down my queries to only those > records matching the current user's userid. I have the userids > separated by commas. 
What would be the right way to query for a > certain user? I have to make sure that I don't find records belonging > to the wrong user because a shorter number matches a larger one. For > example, if a users field contains: > > 3,45,66,7779 > > I don't want a query for 77 to match this. How can I make sure my > query matches whole words only? > > Thanks, > > Carl > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk From dbalmain.ml at gmail.com Wed Dec 14 07:24:42 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 14 Dec 2005 21:24:42 +0900 Subject: [Ferret-talk] Query question In-Reply-To: <98E451E5-8628-4135-ABB1-09F17963915B@ehatchersolutions.com> References: <98E451E5-8628-4135-ABB1-09F17963915B@ehatchersolutions.com> Message-ID: On 12/14/05, Erik Hatcher wrote: > Dave - isn't there a slick way to use a regex for an analyzer to > split tokens with Ferret? If so, that would be an ideal solution > for splitting at a comma. Or you could split the string prior to > indexing, iterate over the array from the split, and then index each > user id as a unique untokenized but indexed field. Sure. Here is an analyzer that splits the field on commas. class CommaAnalyzer < Ferret::Analysis::Analyzer class CommaTokenizer < Ferret::Analysis::RegExpTokenizer def token_re /[^,]+/ end end def token_stream(field, string) return CommaTokenizer.new(string) end end This makes me think, it might be cool to have a RegExpAnalyzer like this; analyzer = RegExpAnalyzer.new(:default => STANDARD_RE, :user_id => /[^,]+/, :phone_num => /[-()0-9]+/) Any thoughts, criticisms? Dave > Erik > > > On Dec 14, 2005, at 12:54 AM, Carl Youngblood wrote: > > > I have an index in which I want different records to be accessible to > > different users. 
I think I can do this by adding a "users" field to > > each record in the index and narrow down my queries to only those > > records matching the current user's userid. I have the userids > > separated by commas. What would be the right way to query for a > > certain user? I have to make sure that I don't find records belonging > > to the wrong user because a shorter number matches a larger one. For > > example, if a users field contains: > > > > 3,45,66,7779 > > > > I don't want a query for 77 to match this. How can I make sure my > > query matches whole words only? > > > > Thanks, > > > > Carl > > > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From dbalmain.ml at gmail.com Wed Dec 14 07:26:39 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Wed, 14 Dec 2005 21:26:39 +0900 Subject: [Ferret-talk] Is it possible to highlight search keywords in results? In-Reply-To: <658328D1-8EF6-49EC-A29E-50681E188F39@ehatchersolutions.com> References: <658328D1-8EF6-49EC-A29E-50681E188F39@ehatchersolutions.com> Message-ID: On 12/14/05, Erik Hatcher wrote: > The Highlighter is cool, but does have some drawbacks. First, it's > advantages.... you can easily get the terms for a Query and then do > some substitutions in the original document, but Highlighter actually > picks out the bet fragments to show. It's disadvantage is that it > only highlights the terms of the query. For example, if you search > for "some phrase", any occurrence of the word "some" is highlighted, > even if it is not followed by "phrase". More visually, here's how > it comes out: > > http://www.lucenebook.com/search?query=%22term+positions%22 > > It would be sweet to have a Ruby highlighter. 
I've done work in Java > for a consulting client to convert a Query into a SpanQuery and have > created a non-generalizable highlighter (does not fragment, > currently) that highlights very precisely, so it is possible, just > not trivial. Not trivial. Sounds like a challenge. I like the sound of that. :P > > Erik From jennyw at dangerousideas.com Wed Dec 14 18:52:51 2005 From: jennyw at dangerousideas.com (jennyw) Date: Wed, 14 Dec 2005 15:52:51 -0800 Subject: [Ferret-talk] Fuzzy search on a phrase Message-ID: <43A0B053.40205@dangerousideas.com> I'm trying to use Ferret to do fuzzy searches. If I use fuzzy search for just one word, it works fine: index.search('name:gogle~0.4') However, if I try to use a phrase, it doesn't work: index.search('name:"gogle search engine"~0.4') On the other hand, I could do: index.search('name:gogle~0.4 AND name:search~0.4 AND name:engine~0.4') This isn't exactly the same as fuzzy search on a phrase, though ... is there a better way? Thanks! Jen From erik at ehatchersolutions.com Wed Dec 14 20:05:58 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Wed, 14 Dec 2005 20:05:58 -0500 Subject: [Ferret-talk] Fuzzy search on a phrase In-Reply-To: <43A0B053.40205@dangerousideas.com> References: <43A0B053.40205@dangerousideas.com> Message-ID: On Dec 14, 2005, at 6:52 PM, jennyw wrote: > I'm trying to use Ferret to do fuzzy searches. If I use fuzzy > search for > just one word, it works fine: > > index.search('name:gogle~0.4') > > However, if I try to use a phrase, it doesn't work: > > index.search('name:"gogle search engine"~0.4') Lucene doesn't support this type of query. I'm still not deep enough into Ferret to know the query parser behavior, but in Java Lucene a "phrase with quotes"~10 makes a sloppy phrase query. The words still have to be spelled correctly but they can be separated in the original text by other words (think of it as proximity, google must be near search and near engine). 
> On the other hand, I could do: > > index.search('name:gogle~0.4 AND name:search~0.4 AND > name:engine~0.4') > > This isn't exactly the same as fuzzy search on a phrase, though ... is > there a better way? There really isn't an easy better way unless you get into using the SpanQuery infrastructure and create (or use if Ferret implements it) the SpanRegexQuery, and nest it within a SpanNearQuery. Query parser support for such a beast would be tricky at best and Java Lucene currently does not implement this sort of thing. Though for a consulting client I created (Span)RegexQuery for just these sorts of queries. So it is possible, but not without going deeper into creating a query using the API or making your own query parser that could do this. Dave can fill us in on any details I've missed and probably demonstrate just how easy it is and I don't even know it :) Erik From carl at youngbloods.org Wed Dec 14 20:50:11 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Wed, 14 Dec 2005 17:50:11 -0800 Subject: [Ferret-talk] non-character search Message-ID: I need to be able to search for certain symbol characters. For example, I want the search for c++ to work. I try enclosing it in quotes, but it seems to treat C and the pluses as two separate characters to look for. Search for ++ alone seems to find documents that don't have any pluses in them. Why would that happen? Thanks, Carl From jennyw at dangerousideas.com Wed Dec 14 22:08:23 2005 From: jennyw at dangerousideas.com (jennyw) Date: Wed, 14 Dec 2005 19:08:23 -0800 Subject: [Ferret-talk] Bug in fuzzy search Message-ID: <43A0DE27.9050707@dangerousideas.com> I just submitted this to Trac, including a proposed fix. I've never really used Trac before, so hopefully I did the right thing. Anyway, in case anyone else runs into this, the problem is that fuzzy search fails if some of your saved fields are larger than Ferret::Search::TYPICAL_LONGEST_WORD_IN_INDEX. 
This won't happen in all cases where your fields are larger (for example, if your longest field is one character larger there won't be a problem). Here's the writeup: http://ferret.davebalmain.com/trac/ticket/17 Jen From jennyw at dangerousideas.com Wed Dec 14 22:13:41 2005 From: jennyw at dangerousideas.com (jennyw) Date: Wed, 14 Dec 2005 19:13:41 -0800 Subject: [Ferret-talk] Fuzzy search on a phrase In-Reply-To: References: <43A0B053.40205@dangerousideas.com> Message-ID: <43A0DF65.20901@dangerousideas.com> Erik Hatcher wrote: >There really isn't an easy better way unless you get into using the >SpanQuery infrastructure and create (or use if Ferret implements it) >the SpanRegexQuery, and nest it within a SpanNearQuery. Query parser >support for such a beast would be tricky at best and Java Lucene >currently does not implement this sort of thing. Though for a >consulting client I created (Span)RegexQuery for just these sorts of >queries. So it is possible, but not without going deeper into >creating a query using the API or making your own query parser that >could do this. > > I was afraid the answer would be something like that! I think that'd be a bit beyond the scope of my current project (I haven't yet gotten deep into Ferret enough to know what a SpanNearQuery is). >Dave can fill us in on any details I've missed and probably >demonstrate just how easy it is and I don't even know it :) > > That'd be great if so. Here's to optimism! 
;-) Jen From erik at ehatchersolutions.com Wed Dec 14 22:31:32 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Wed, 14 Dec 2005 22:31:32 -0500 Subject: [Ferret-talk] Fuzzy search on a phrase In-Reply-To: <43A0DF65.20901@dangerousideas.com> References: <43A0B053.40205@dangerousideas.com> <43A0DF65.20901@dangerousideas.com> Message-ID: <0FFE35E1-310E-4883-968D-9A5CC05513F3@ehatchersolutions.com> On Dec 14, 2005, at 10:13 PM, jennyw wrote: > Erik Hatcher wrote: > >> There really isn't an easy better way unless you get into using the >> SpanQuery infrastructure and create (or use if Ferret implements it) >> the SpanRegexQuery, and nest it within a SpanNearQuery. Query parser >> support for such a beast would be tricky at best and Java Lucene >> currently does not implement this sort of thing. Though for a >> consulting client I created (Span)RegexQuery for just these sorts of >> queries. So it is possible, but not without going deeper into >> creating a query using the API or making your own query parser that >> could do this. >> >> > I was afraid the answer would be something like that! I think > that'd be > a bit beyond the scope of my current project (I haven't yet gotten > deep > into Ferret enough to know what a SpanNearQuery is). You can get some nuggets of info here: http://www.lucenebook.com/search?query=SpanQuery Ferret *is* Lucene, so by knowing what Java Lucene can do, you've got a good handle on what Ferret can do. Dave has made some handy conveniences on top of the API, so for general uses you can get by without knowing the guts, but it always helps to know the capabilities under the covers when a more sophisticated requirement is encountered. In brief, a SpanQuery is a family of subclasses (SpanNearQuery, SpanFirstQuery, SpanOrQuery, and now SpanRegexQuery). These allow for matching on positional ranges of the indexed terms, like a window of words. 
A PhraseQuery (what you created using "words in quotes") matches on words together as well, but it doesn't have the capability to do any refinement beyond just a proximity setting. By using SpanQuery objects in a nested fashion, matches could be made on a query like this, for example: "some phrase" within 10 positions of "another phrase" but "excluded phrase" cannot be in the middle I was actually a bit wrong about the SpanRegexQuery being useful in your example, unless the query was something like "go*gle" where it was a regular expression that could match various terms in the index. What you'd need is a SpanFuzzyQuery, which would not be hard to create, but is certainly an advanced thing to do with Lucene. >> Dave can fill us in on any details I've missed and probably >> demonstrate just how easy it is and I don't even know it :) >> >> > That'd be great if so. Here's to optimism! ;-) Dave's pretty darn good, but I'll be amazed if he pulls this one out of his hat without coding up a SpanFuzzyQuery and adding support in his query parser for it. Erik From erik at ehatchersolutions.com Wed Dec 14 22:50:08 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Wed, 14 Dec 2005 22:50:08 -0500 Subject: [Ferret-talk] non-character search In-Reply-To: References: Message-ID: <6893B3AF-609E-4A4C-9295-9FD3606F93EA@ehatchersolutions.com> On Dec 14, 2005, at 8:50 PM, Carl Youngblood wrote: > I need to be able to search for certain symbol characters. For > example, I want the search for c++ to work. I try enclosing it in > quotes, but it seems to treat C and the pluses as two separate > characters to look for. Search for ++ alone seems to find documents > that don't have any pluses in them. Why would that happen? What analyzer are you using? If you can get a little lower-level and get the Query object, what does its to_s provide? Analysis and query parsing are tricky business that requires attention to the details of how text is tokenized and how the parser interacts with it.
My article "QueryParser Rules" is an oldie, but a goodie that might help troubleshoot this situation: http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html Erik From dbalmain.ml at gmail.com Thu Dec 15 00:17:16 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Thu, 15 Dec 2005 14:17:16 +0900 Subject: [Ferret-talk] Fuzzy search on a phrase In-Reply-To: <0FFE35E1-310E-4883-968D-9A5CC05513F3@ehatchersolutions.com> References: <43A0B053.40205@dangerousideas.com> <43A0DF65.20901@dangerousideas.com> <0FFE35E1-310E-4883-968D-9A5CC05513F3@ehatchersolutions.com> Message-ID: On 12/15/05, Erik Hatcher wrote: > > On Dec 14, 2005, at 10:13 PM, jennyw wrote: > > > Erik Hatcher wrote: > > > >> There really isn't an easy better way unless you get into using the > >> SpanQuery infrastructure and create (or use if Ferret implements it) > >> the SpanRegexQuery, and nest it within a SpanNearQuery. Query parser > >> support for such a beast would be tricky at best and Java Lucene > >> currently does not implement this sort of thing. Though for a > >> consulting client I created (Span)RegexQuery for just these sorts of > >> queries. So it is possible, but not without going deeper into > >> creating a query using the API or making your own query parser that > >> could do this. > >> > >> > > I was afraid the answer would be something like that! I think > > that'd be > > a bit beyond the scope of my current project (I haven't yet gotten > > deep > > into Ferret enough to know what a SpanNearQuery is). > > You can get some nuggets of info here: > > http://www.lucenebook.com/search?query=SpanQuery > > Ferret *is* Lucene, so by knowing what Java Lucene can do, you've got > a good handle on what Ferret can do. Dave has made some handy > conveniences on top of the API, so for general uses you can get by > without knowing the guts, but it always helps to know the > capabilities under the covers when a more sophisticated requirement > is encountered. 
> > In brief, a SpanQuery is a family of subclasses (SpanNearQuery, > SpanFirstQuery, SpanOrQuery, and now SpanRegexQuery). These allow > for matching on positional ranges of the indexed terms, like a window > of words. A PhaseQuery (what you created using "words in quotes") > matches on words together as well, but it doesn't have the capability > to do any refinement beyond just a proximity setting. By using > SpanQuery's in a nested fashion, matches could be made on a query > like this, for example: > > "some phrase" within 10 positions of "another phrase" but "excluded > phrase" cannot be in the middle > > I was actually a bit wrong about the SpanRegexQuery being useful in > your example, unless the query was something like "go*gle" where it > was a regular expression that could match various terms in the > index. What you'd need is a SpanFuzzyQuery, which would not be hard > to create, but is certainly an advanced thing to do with Lucene. > > >> Dave can fill us in on any details I've missed and probably > >> demonstrate just how easy it is and I don't even know it :) > >> > >> > > That'd be great if so. Here's to optimism! ;-) > > Dave's pretty darn good, but I'll be amazed if he pulls this one out > of his hat without coding up a SpanFuzzyQuery and adding support in > his query parser for it. Thanks for the complement. :-) SpanFuzzyQuery definitely sounds like the solution to this problem. I'll be implementing span queries in C in the next couple of days so I'll keep it in mind. Dave From carl at youngbloods.org Thu Dec 15 20:05:53 2005 From: carl at youngbloods.org (Carl Youngblood) Date: Thu, 15 Dec 2005 17:05:53 -0800 Subject: [Ferret-talk] Ordering results by something other than relevance Message-ID: Along with the contents of the documents in my index, I have stored the date they were added. I want to search for keywords in the index but have the results be sorted by their date rather than their relevance to the keywords. How would I do this in ferret? 
Thanks, Carl From dbalmain.ml at gmail.com Thu Dec 15 22:15:27 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Fri, 16 Dec 2005 12:15:27 +0900 Subject: [Ferret-talk] Ordering results by something other than relevance In-Reply-To: References: Message-ID: On 12/16/05, Carl Youngblood wrote: > Along with the contents of the documents in my index, I have stored > the date they were added. I want to search for keywords in the index > but have the results be sorted by their date rather than their > relevance to the keywords. How would I do this in ferret? Hi Carl, Good question. The easiest way to do this is to index the date as a string, year first; include Ferret::Search include Ferret::Index data = [ {:content => "one", :date => "20051023"}, {:content => "two", :date => "19530315"}, {:content => "three", :date => "19390912"} ] index = Index.new(:analyzer => WhiteSpaceAnalyzer.new) data.each { |doc| index << doc } sf_date = SortField.new("date", {:sort_type => SortField::SortType::STRING}) top_docs = index.search("one", :sort => [sf_date, SortField::FIELD_SCORE]) SortField is from the Search module. Here we are sorting by string and then by score if two dates are the same. If we want to reverse the sort; sf_date = SortField.new("date", {:sort_type => SortField::SortType::STRING, :reverse => true}) There is also a module Ferret::Utils::DateTools which you can use to serialize your dates more efficiently but they won't be human readable. Cheers, Dave From erik at ehatchersolutions.com Fri Dec 16 03:57:27 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Fri, 16 Dec 2005 03:57:27 -0500 Subject: [Ferret-talk] Ordering results by something other than relevance In-Reply-To: References: Message-ID: Dave, Wouldn't sorting YYYYMMDD dates as an integer rather than a string use fewer resources in the cache?
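A small plain-Ruby sketch (no Ferret calls) of why the year-first format matters: fixed-width YYYYMMDD strings sort into the same order whether they are compared as strings or as integers, so either sort type gives chronological results. The sample dates are the ones from the message above; the constant names are only for illustration.

```ruby
# Year-first, zero-padded dates: string order and integer order agree.
DATES = %w[20051023 19530315 19390912]

BY_STRING  = DATES.sort                     # lexicographic comparison
BY_INTEGER = DATES.sort_by { |d| d.to_i }   # numeric comparison

p BY_STRING                 # oldest first
p BY_STRING == BY_INTEGER   # both orderings are identical
p BY_STRING.reverse         # newest first, i.e. a reversed sort
```

The difference between the two sort types is therefore about cost (comparing and caching integers versus strings), not about the order of the results.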
Erik On Dec 15, 2005, at 10:15 PM, David Balmain wrote: > On 12/16/05, Carl Youngblood wrote: >> Along with the contents of the documents in my index, I have stored >> the date they were added. I want to search for keywords in the index >> but have the results be sorted by their date rather than their >> relevance to the keywords. How would I do this in ferret? > > Hi Carl, > > Good question. The easiest way to do this is to index the date a > string, year first; > > include Ferret::Search > include Ferret::Index > > data = [ > {:content => "one", :date => "20051023"}, > {:content => "two", :date => "19530315"}, > {:content => "three", :date => "19390912"} > ] > index = Index.new(:analyzer => WhiteSpaceAnalyzer.new) > data.each { |doc| > index << doc > } > > sf_date = SortField.new("date", {:sort_type => > SortField::SortType::STRING}) > top_docs = index.search("one", :sort => [sf_date, > SortField::FIELD_SCORE]) > > SortField is from the Search module. Here we are sorting by string and > then score if two dates are the same. If we want to reverse the sort; > > sf_date = SortField.new("date", {:sort_type => > SortField::SortType::STRING > :reverse > => true}) > > There is also a module Ferret::Utils::DateTools which you can use to > serialize your dates more efficiently but they won't be human > readable. > > Cheers, > Dave > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk From dbalmain.ml at gmail.com Fri Dec 16 06:51:52 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Fri, 16 Dec 2005 20:51:52 +0900 Subject: [Ferret-talk] Ordering results by something other than relevance In-Reply-To: References: Message-ID: On 12/16/05, Erik Hatcher wrote: > Dave, > > Wouldn't sorting YYYYMMDD dates as an integer rather than a string > use less resources in the cache? Yes Erik, you are quite correct. Silly me. I think I need to get more sleep. 
Good thing you mentioned it because I found a bug in the integer and float sorts. They sort the opposite way to strings by default (largest first). I don't know why I did this. I assumed it in all my unit tests too but now it doesn't make any sense and it doesn't seem to be that way in Lucene so I've decided to fix it and make another release. So now that I've thought about it a bit more, the easiest way to sort by date is like this; top_docs = index.search("one", :sort => Sort.new("date")) This is telling Ferret to sort by whatever it finds in the date column. Since it parses as an integer, it will sort by integer. Explicitly like this; sf_date = SortField.new("date", {:sort_type => SortField::SortType::INT}) top_docs = index.search("one", :sort => [sf_date, SortField::FIELD_SCORE]) That probably should be INTEGER. I'll change that as well. Cheers, Dave From dbalmain.ml at gmail.com Fri Dec 16 07:50:53 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Fri, 16 Dec 2005 21:50:53 +0900 Subject: [Ferret-talk] [ANN] Ferret 0.3.2 released Message-ID: Hi folks, I've just released Ferret 0.3.2. Nothing much new here. Mostly bug fixes. The one thing to watch out for though is I've changed the order of sorts on Integers and Floats. The order will now be reversed. The default is to sort smallest numbers first. You can always reverse the search if you like. Also, I changed the integer sort type from Search::SortField::SortType::INT to Search::SortField::SortType::INTEGER. Cheers, Dave From jennyw at dangerousideas.com Sun Dec 18 20:12:04 2005 From: jennyw at dangerousideas.com (jennyw) Date: Sun, 18 Dec 2005 17:12:04 -0800 Subject: [Ferret-talk] Parentheses for precedence? Message-ID: <43A608E4.4010809@dangerousideas.com> I'm not sure whether this is a bug or whether I'm simply expecting Ferret queries to work in a way other than they're intended. 
I notice that if I use a query like: (other_text:"Collaborative tools") AND NOT other_text:podcasts I'll get correct search results. However, if I put parentheses around the second part, like: (other_text:"Collaborative tools") AND (NOT other_text:podcasts) I won't get any search results. This seems to only be an issue when NOT (or a preceding -) is used. For example, both of these work: (other_text:"Collaborative tools") AND other_text:podcasts (other_text:"Collaborative tools") AND (other_text:podcasts) My use of parentheses is for precedence (I assume they work that way). I quickly looked at query_parser.y, but as I've never used racc (or yacc), it's not apparent to me what should be happening. Thanks! Jen From leto.kauler at education.tas.gov.au Sun Dec 18 20:46:47 2005 From: leto.kauler at education.tas.gov.au (Kauler, Leto S) Date: Mon, 19 Dec 2005 12:46:47 +1100 Subject: [Ferret-talk] Parentheses for precedence? Message-ID: Note: I'm familiar with Lucene but just keeping an eye on Ferret progress. You need to be careful with the construction of your queries. I don't know if you can do it in Ferret but you really should check the QueryParser's output to see how it interprets your queries. Parentheses are used to create groups of expressions, so wrapping them around a single expression actually does nothing. That's why both your last queries work - they're the same: other_text:"Collaborative tools" AND other_text:podcasts Your first query is basic and needs no () other_text:"Collaborative tools" AND NOT other_text:podcasts or +other_text:"Collaborative tools" -other_text:podcasts Examples of parentheses use: (other_text:"Collaborative tools" OR other_text:Feeds) AND NOT other_text:podcasts (other_text:Feeds AND NOT other_text:RSS) OR (other_text:"Collaborative tools" AND NOT other_text:podcasts) Regards, --Leto (please excuse the email footer...)
-----Original Message----- From: jennyw I'm not sure whether this is a bug or whether I'm simply expecting Ferret queries to work in a way other than they're intended. I notice that if use a query like: (other_text:"Collaborative tools") AND NOT other_text:podcasts I'll get correct search results. However, if I put parentheses around the second part, like: (other_text:"Collaborative tools") AND (NOT other_text:podcasts) I won't get any search resuts. This seems to only be an issue when NOT (or a preceding -) are used. For example, both of these work: (other_text:"Collaborative tools") AND other_text:podcasts (other_text:"Collaborative tools") AND (other_text:podcasts) My use of parentheses is for precedence (I assume they work that way). I quickly looked at query_parser.y, but as I've never used racc (or yacc), it's not apparent to me what should be happening. Thanks! Jen Tasmania Together 5 Year Review: Have your say : www.tasmaniatogether.tas.gov.au CONFIDENTIALITY NOTICE AND DISCLAIMER Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added. From dbalmain.ml at gmail.com Sun Dec 18 21:59:47 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Mon, 19 Dec 2005 11:59:47 +0900 Subject: [Ferret-talk] Parentheses for precedence? In-Reply-To: <43A608E4.4010809@dangerousideas.com> References: <43A608E4.4010809@dangerousideas.com> Message-ID: Hi Jenny, The following two queries are different. 
(other_text:"Collaborative tools") AND NOT other_text:podcasts (other_text:"Collaborative tools") AND (NOT other_text:podcasts) You can check how the query parser parses these queries like this; $ ruby lib/ferret/query_parser/query_parser.tab.rb (other_text:"Collaborative tools") AND NOT other_text:podcasts Ferret::Search::BooleanQuery +other_text:"collaborative tools" -other_text:podcasts (other_text:"Collaborative tools") AND (NOT other_text:podcasts) Ferret::Search::BooleanQuery +other_text:"collaborative tools" +(-other_text:podcasts) Let me explain. The first I think you understand already. In the second query you are taking the conjunction ("AND") of two queries. The first one, other_text:"collaborative tools", is obviously returning results. The second one, -other_text:podcasts, contains no positive clauses, i.e. nothing to look for, so it returns nothing. The conjunction of something and nothing is obviously nothing. Now I guess this could be interpreted differently. Boolean queries with only exclusive ("NOT") clauses could return everything except the documents matching those clauses. But every boolean query requires at least one optional or required clause. The reason being, if you did a search for every document that didn't contain a rare word and you have a large index, you could be getting hundreds of thousands of results back. Lucene handles this query the same way. While the brackets do set precedence, it's worthwhile to remember that every time you put brackets like that you are creating a boolean query. As you can see from the output above, single term boolean queries are optimized into term queries (i.e. the brackets are removed). Hope that helps, Dave On 12/19/05, jennyw wrote: > I'm not sure whether this is a bug or whether I'm simply expecting > Ferret queries to work in a way other than they're intended. > > I notice that if use a query like: > > (other_text:"Collaborative tools") AND NOT other_text:podcasts > > I'll get correct search results.
However, if I put parentheses around > the second part, like: > > (other_text:"Collaborative tools") AND (NOT other_text:podcasts) > > I won't get any search resuts. This seems to only be an issue when NOT > (or a preceding -) are used. For example, both of these work: > > (other_text:"Collaborative tools") AND other_text:podcasts > (other_text:"Collaborative tools") AND (other_text:podcasts) > > My use of parentheses is for precedence (I assume they work that way). I > quickly looked at query_parser.y, but as I've never used racc (or yacc), > it's not apparent to me what should be happening. > > Thanks! > > Jen > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk > From fortez at gmail.com Mon Dec 19 09:11:54 2005 From: fortez at gmail.com (hui) Date: Mon, 19 Dec 2005 22:11:54 +0800 Subject: [Ferret-talk] Indexing so slow...... Message-ID: <35ae50b10512190611l6761152bo@mail.gmail.com> I am indexing over 10,000 rows of data, it is very slow when it is indexing the 100,1000,10000 row, and now it is over 1 hour passed on the row 10,000. how to make it faster? here is my code: ================== doc = Document.new doc << Field.new("id", t.id, Field::Store::YES, Field::Index::UNTOKENIZED) doc << Field.new("title", t.title, Field::Store::NO, Field::Index::TOKENIZED) doc << Field.new("body", t.body, Field::Store::NO, Field::Index::TOKENIZED) doc << Field.new("album", t.album.name, Field::Store::NO, Field::Index::TOKENIZED) doc << Field.new("artist", t.album.artist.name, Field::Store::NO, Field::Index::TOKENIZED) doc << Field.new("release", t.album.release, Field::Store::NO, Field::Index::UNTOKENIZED) index << doc ============================================== I just store the id, other data saved in database, because if i store data in ferret, my PC looks like just dead. 
From erik at ehatchersolutions.com Mon Dec 19 09:27:23 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Mon, 19 Dec 2005 09:27:23 -0500 Subject: [Ferret-talk] Indexing so slow...... In-Reply-To: <35ae50b10512190611l6761152bo@mail.gmail.com> References: <35ae50b10512190611l6761152bo@mail.gmail.com> Message-ID: This is very likely due to the merge factors. Lucene (and thus Ferret) reorganizes the index periodically. These settings are controllable, at least with Java Lucene. The trade-off is how much memory you want the indexing process to use. Erik On Dec 19, 2005, at 9:11 AM, hui wrote: > I am indexing over 10,000 rows of data, it is very slow when it is > indexing the 100,1000,10000 row, and now it is over 1 hour passed on > the row 10,000. > > how to make it faster? > here is my code: > ================== > doc = Document.new > doc << Field.new("id", t.id, Field::Store::YES, > Field::Index::UNTOKENIZED) > doc << Field.new("title", t.title, Field::Store::NO, > Field::Index::TOKENIZED) > doc << Field.new("body", t.body, Field::Store::NO, > Field::Index::TOKENIZED) > doc << Field.new("album", t.album.name, Field::Store::NO, > Field::Index::TOKENIZED) > doc << Field.new("artist", t.album.artist.name, Field::Store::NO, > Field::Index::TOKENIZED) > doc << Field.new("release", t.album.release, Field::Store::NO, > Field::Index::UNTOKENIZED) > index << doc > ============================================== > I just store the id, other data saved in database, because if i store > data in ferret, my PC looks like just dead. > > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk From dbalmain.ml at gmail.com Mon Dec 19 09:38:06 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Mon, 19 Dec 2005 23:38:06 +0900 Subject: [Ferret-talk] Indexing so slow...... 
In-Reply-To: <35ae50b10512190611l6761152bo@mail.gmail.com>
References: <35ae50b10512190611l6761152bo@mail.gmail.com>
Message-ID:

On 12/19/05, hui wrote:
> I am indexing over 10,000 rows of data, it is very slow when it is
> indexing the 100,1000,10000 row, and now it is over 1 hour passed on
> the row 10,000.
>
> how to make it faster?
> here is my code:
> ==================
> doc = Document.new
> doc << Field.new("id", t.id, Field::Store::YES,
> Field::Index::UNTOKENIZED)
> doc << Field.new("title", t.title, Field::Store::NO,
> Field::Index::TOKENIZED)
> doc << Field.new("body", t.body, Field::Store::NO,
> Field::Index::TOKENIZED)
> doc << Field.new("album", t.album.name, Field::Store::NO,
> Field::Index::TOKENIZED)
> doc << Field.new("artist", t.album.artist.name, Field::Store::NO,
> Field::Index::TOKENIZED)
> doc << Field.new("release", t.album.release, Field::Store::NO,
> Field::Index::UNTOKENIZED)
> index << doc

Hi Hui,

This looks fine. Some suggestions;

* Make sure you are not using auto_flush. That will slow indexing down
considerably. In fact, it is probably better to use Index::IndexWriter
rather than Index::Index.
* You could index everything in memory and write it to disk. This will
depend on how much memory you have and how big the index becomes.
* You can play around with :merge_factor, :min_merge_docs and
:max_merge_docs in IndexWriter. They are currently set to 10 but you
might get more speed with different settings. Try anything between 2
and 100.
* You could switch :use_compound_file to false. This will speed things
up but you may get an error for having too many files open.

There are a few other things you can do like indexing in parallel but
I won't go into it yet. I'm currently working on some pretty big speed
ups by implementing everything in C. After I release that version I
think most performance problems will go away. It will certainly speed up
things more than any changes you might make to the current version.
Anyway, I thought I'd have it out by Christmas but it's turning into a
bigger task than I thought (doesn't this always seem to happen). I
will get finished early next year though so if you can put up with the
slow performance until then, relief is on its way. :-)

Cheers,
Dave

PS The reason it takes a long time at 100, 1000, 10,000 is that
indexing is done in segments and at 100, 1000, etc a number of the
segments are merged together into bigger segments. Here is a good
picture for you;

http://nutch.sourceforge.net/blog/2004/11/dynamization-and-lucene.html

The :merge_factor in the picture is 3. The current merge factor in
Ferret is 10. Hope that makes sense.

> ==============================================
> I just store the id, other data saved in database, because if i store
> data in ferret, my PC looks like just dead.

From dbalmain.ml at gmail.com Mon Dec 19 09:53:05 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Mon, 19 Dec 2005 23:53:05 +0900
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To:
References: <35ae50b10512190611l6761152bo@mail.gmail.com>
Message-ID:

Sorry, just to correct myself. :max_merge_docs is set to a very big
number, not 10. I'll try to quickly explain these numbers and the
trade-offs.

:min_merge_docs => the minimum number of documents a segment must have
before it is merged. Set this to a larger number if you want to use
more RAM and speed things up.

:merge_factor => this tells ferret when to merge for segments larger
than min_merge_docs. A high value means merges are done less often
which means faster indexing but slower searching. You can set this to
a high value when you do your batch index and then optimize the index
and lower the value afterwards when search speed is more important.
A higher value also requires more memory

:max_merge_docs => this sets the maximum number of documents in a
segment. Once this count is reached, that segment is no longer merged
with the other documents unless optimize is called. You might set this
to a lower value to stop the IndexWriter from holding the lock for too
long while doing a merge. For example if you set it to 1000, you
wouldn't get that really long hang time at 10000.

From dbalmain.ml at gmail.com Mon Dec 19 09:53:50 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Mon, 19 Dec 2005 23:53:50 +0900
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To:
References: <35ae50b10512190611l6761152bo@mail.gmail.com>
Message-ID:

Oh and one last thing. Erik's book "Lucene in Action" has a much better
explanation of all of this.

From fcsmith at gmail.com Mon Dec 19 11:21:59 2005
From: fcsmith at gmail.com (Finn Smith)
Date: Mon, 19 Dec 2005 11:21:59 -0500
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To:
References: <35ae50b10512190611l6761152bo@mail.gmail.com>
Message-ID: <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com>

On 12/19/05, David Balmain wrote:
> Oh and one last thing. Erik's book "Lucene in Action" has a much
> better explanation of all of this.

I'll second this. Even though I am not using the Java version of
Lucene, I have found this book very helpful in explaining the concepts
underlying the various Lucene frameworks.

-F

From tlockney at oddpost.com Mon Dec 19 12:23:48 2005
From: tlockney at oddpost.com (Thomas Lockney)
Date: Mon, 19 Dec 2005 09:23:48 -0800
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To: <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com>
References: <35ae50b10512190611l6761152bo@mail.gmail.com> <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com>
Message-ID: <1135013028.2981.5.camel@localhost.localdomain>

On Mon, 2005-12-19 at 11:21 -0500, Finn Smith wrote:
> On 12/19/05, David Balmain wrote:
> > Oh and one last thing. Erik's book "Lucene in Action" has a much
> > better explanation of all of this.
>
> I'll second this. Even though I am not using the Java version of
> Lucene, I have found this book very helpful in explaining the concepts
> underlying the various Lucene frameworks.

As long as we're voting... I'll third it! I was trying to figure things
out from the API docs and source code and was horribly confused until I
went out and picked up Lucene in Action last week. It helps that it's a
very well written book (good job Erik!).

Thomas

From erik at ehatchersolutions.com Mon Dec 19 13:47:18 2005
From: erik at ehatchersolutions.com (Erik Hatcher)
Date: Mon, 19 Dec 2005 13:47:18 -0500
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To: <1135013028.2981.5.camel@localhost.localdomain>
References: <35ae50b10512190611l6761152bo@mail.gmail.com> <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com> <1135013028.2981.5.camel@localhost.localdomain>
Message-ID:

On Dec 19, 2005, at 12:23 PM, Thomas Lockney wrote:
> On Mon, 2005-12-19 at 11:21 -0500, Finn Smith wrote:
>> On 12/19/05, David Balmain wrote: > Oh and
>> one last thing. Erik's book "Lucene in Action" has a much > better
>> explanation of all of this. I'll second this. Even though I am not
>> using the Java version of Lucene, I have found this book very
>> helpful in explaining the concepts underlying the various Lucene
>> frameworks.
>
> As long as we're voting... I'll third it!
I was trying to figure > things out from the API docs and source code and was horribly > confused until I went out and picked up Lucene in Action last week. > It helps that it's a very well written book (good job Erik!). Thanks everyone! I'll pass these kind words on to Otis as well, who is probably not tuned into the Ferret community (Python people... geez!). Erik From fortez at gmail.com Mon Dec 19 19:49:30 2005 From: fortez at gmail.com (hui) Date: Tue, 20 Dec 2005 08:49:30 +0800 Subject: [Ferret-talk] Indexing so slow...... In-Reply-To: References: <35ae50b10512190611l6761152bo@mail.gmail.com> <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com> <1135013028.2981.5.camel@localhost.localdomain> Message-ID: <35ae50b10512191649t4f86750fr@mail.gmail.com> Thank you very much everybody! I will try all the suggestions, re-indexing my data. And ferret is great! Cannot wait for the new version, and unicode support ;-) hui From f at andreas-s.net Tue Dec 20 07:16:53 2005 From: f at andreas-s.net (Andreas S.) Date: Tue, 20 Dec 2005 13:16:53 +0100 Subject: [Ferret-talk] [ANN] Mailing list mirror on www.ruby-forum.com Message-ID: <88f8897e18aa00dcce96fc85f571ca3c@ruby-forum.com> Hi, with David's permission I have set up a forum mirror for this mailing list: http://www.ruby-forum.com/forum/5 Andreas -- Posted via http://www.ruby-forum.com/. From fortez at gmail.com Tue Dec 20 09:06:59 2005 From: fortez at gmail.com (hui) Date: Tue, 20 Dec 2005 22:06:59 +0800 Subject: [Ferret-talk] Indexing so slow...... 
In-Reply-To: <35ae50b10512191649t4f86750fr@mail.gmail.com>
References: <35ae50b10512190611l6761152bo@mail.gmail.com> <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com> <1135013028.2981.5.camel@localhost.localdomain> <35ae50b10512191649t4f86750fr@mail.gmail.com>
Message-ID: <35ae50b10512200606u72abb805h@mail.gmail.com>

the indexing is quite fast now, i use the following code,
===========================================
index = IndexWriter.new("db/index.db", :create_if_missing=>true,
                        :use_compound_file=>false)
index.max_merge_docs = 30000
index.min_merge_docs = 4000
=============================

but problem comes when optimizing,
the data files size grows from about 200m to 3G, and broke finally with:
=============================================
D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/buffered_index_io.rb:178:in `refill': EOFError (EOFError)
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/buffered_index_io.rb:94:in `read_byte'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/store/index_io.rb:61:in `read_vint'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/term_doc_enum.rb:131:in `next?'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/term_doc_enum.rb:273:in `next?'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:269:in `append_postings'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:262:in `times'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:262:in `append_postings'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:240:in `merge_term_info'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:215:in `merge_term_infos'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:176:in `merge_terms'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:48:in `merge'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:403:in `merge_segments'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:183:in `optimize'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:173:in `synchronize'
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:173:in `optimize'
        from script/indexdb.rb:55
=======================================================
are there some tips about optimizing?

Thanks again.

hui

From dbalmain.ml at gmail.com Tue Dec 20 11:39:09 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Wed, 21 Dec 2005 01:39:09 +0900
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To: <35ae50b10512200606u72abb805h@mail.gmail.com>
References: <35ae50b10512190611l6761152bo@mail.gmail.com> <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com> <1135013028.2981.5.camel@localhost.localdomain> <35ae50b10512191649t4f86750fr@mail.gmail.com> <35ae50b10512200606u72abb805h@mail.gmail.com>
Message-ID:

Hi Hui,

Can you email me your index directory listing? 3Gb sounds very large.
Also, how much data are you indexing? How many files/records and what
total size? This will help me work out what is wrong. I have a few
ideas.

Cheers,
Dave

From nick.snels at gmail.com Wed Dec 21 15:01:58 2005
From: nick.snels at gmail.com (Nick Snels)
Date: Wed, 21 Dec 2005 21:01:58 +0100
Subject: [Ferret-talk] Ferret and Rails transaction
Message-ID: <857ffc970512211201s273e4c58r2615b88e2d873a4d@mail.gmail.com>

Hi,

following the discussion about acts_as_ferret on the Rails mailing list,
there was an issue about transactions, which could result in the
database and Ferret being out of sync. I have taken a different approach
from acts_as_ferret in trying to resolve the transaction problem.
Instead of adding things to the Ferret index in the model, I have added
it in the controller. I have only the create part for now and it's very
rough, but it works.

def create
  if request.get?
    redirect_to :action => 'new'
  else
    @user = User.find(session[:user_id])
    @ad = Ad.new(params[:ad])
    @listing = Listing.new(params[:listing])
    @listing.listing_type = "ads"
    index ||= Index::Index.new(:key => [:id, :table],
                               :path => "#{RAILS_ROOT}/db/index.test",
                               :auto_flush => true)
    begin
      Listing.transaction(@ad) do
        @ad.listings << @listing
        if @ad.save
          session[:last_addition] = @ad.id
          # ferret: create a new entry in the index
          doc = Ferret::Document::Document.new
          doc << Ferret::Document::Field.new("id", @ad.id,
                   Document::Field::Store::YES,
                   Document::Field::Index::UNTOKENIZED)
          doc << Ferret::Document::Field.new("table", @listing.listing_type,
                   Document::Field::Store::YES,
                   Document::Field::Index::UNTOKENIZED)
          doc << Ferret::Document::Field.new("content", @ad.title + " " + @ad.text,
                   Document::Field::Store::NO,
                   Document::Field::Index::TOKENIZED)
          index << doc
          index.flush
          flash[:notice] = 'Ad was successfully created.'
          redirect_to :action => 'list'
        else
          index.close
          render :action => 'new'
        end
      end
    rescue
      if session[:last_addition]
        logger.error "Deleting ad with id: #{session[:last_addition]}"
        index.query_delete("+id:#{session[:last_addition]} +table:ads")
        index.flush
        session[:last_addition] = nil
      end
      flash[:notice] = 'An error occurred.'
      redirect_to :action => 'new'
    end
  end
end

How does it work, or at least how do I think it works? If everything
goes well, the entry gets into the Ferret index. I had to use
index.flush, because without it I got a lock. I don't know why this
happens, because I set :auto_flush to true. When I have a validation
error in my model (a form field is empty, or ...), the else part of
@ad.save is taken, and there I close the index. I don't know if this is
necessary, but just to be sure. When there is another error (the Ferret
index is not available, trying to add a wrong field, or something), the
transaction's rescue kicks in. If there are no validation errors, the ad
should get saved, so I set a session variable last_addition to hold the
id of the ad that is being saved; when an error occurs I have the id of
the ad just added. In the rescue block I check if the session variable
last_addition is set. If it is, I search Ferret and delete the addition,
after which I flush the index (I had to add that, otherwise I would get
a lock).

Is my solution watertight or am I missing something? All feedback is
welcome.

Kind regards,

Nick

From carl at youngbloods.org Fri Dec 23 00:16:55 2005
From: carl at youngbloods.org (Carl Youngblood)
Date: Thu, 22 Dec 2005 22:16:55 -0700
Subject: [Ferret-talk] Ordering results by something other than relevance
In-Reply-To:
References:
Message-ID:

On 12/16/05, David Balmain wrote:
> This is telling Ferret to sort by whatever it finds in the date
> column.
Since it parses as an integer, it will sort by integer. > Explicitly like this; > > sf_date = SortField.new("date", {:sort_type => SortField::SortType::INT}) > top_docs = index.search("one", :sort => [sf_date, SortField::FIELD_SCORE]) > > That probably should be INTEGER. I'll change that as well. Another question: is it possible to sort by date first but if two documents have the same date, then sort them by relevance? Thanks, Carl From erik at ehatchersolutions.com Fri Dec 23 04:19:24 2005 From: erik at ehatchersolutions.com (Erik Hatcher) Date: Fri, 23 Dec 2005 04:19:24 -0500 Subject: [Ferret-talk] Ordering results by something other than relevance In-Reply-To: References: Message-ID: <68E304CC-B2A0-4322-BE4F-67ACF4B6981B@ehatchersolutions.com> On Dec 23, 2005, at 12:16 AM, Carl Youngblood wrote: > On 12/16/05, David Balmain wrote: >> This is telling Ferret to sort by whatever it finds in the date >> column. Since it parses as an integer, it will sort by integer. >> Explicitly like this; >> >> sf_date = SortField.new("date", {:sort_type => >> SortField::SortType::INT}) >> top_docs = index.search("one", :sort => [sf_date, >> SortField::FIELD_SCORE]) >> >> That probably should be INTEGER. I'll change that as well. > > Another question: is it possible to sort by date first but if two > documents have the same date, then sort them by relevance? That is exactly what Dave's example will do. Providing a SortField array does multi-level sorting such that if the first criteria is equal the next criteria is used, and so on. It is also possible to specify ascending or descending for each SortField. Erik From fortez at gmail.com Fri Dec 23 04:20:08 2005 From: fortez at gmail.com (hui) Date: Fri, 23 Dec 2005 17:20:08 +0800 Subject: [Ferret-talk] Indexing so slow...... 
In-Reply-To: References: <35ae50b10512190611l6761152bo@mail.gmail.com> <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com> <1135013028.2981.5.camel@localhost.localdomain> <35ae50b10512191649t4f86750fr@mail.gmail.com> <35ae50b10512200606u72abb805h@mail.gmail.com> Message-ID: <35ae50b10512230120n51b5f9efn@mail.gmail.com> have you got my email, David? i find searching is slow without optimizing, It takes 5s when querying one word from 130,000 records, and 30s when querying a 4 words phrase. Hui 2005/12/21, David Balmain : > Hi Hui, > > Can you email me your index directory listing? 3Gb sounds very large. > Also, how much data are you indexing? How many files/records and what > total size? This will help me work out what is wrong. I have a few > ideas. > > Cheers, > Dave From dbalmain.ml at gmail.com Fri Dec 23 21:07:41 2005 From: dbalmain.ml at gmail.com (David Balmain) Date: Sat, 24 Dec 2005 13:07:41 +1100 Subject: [Ferret-talk] Indexing so slow...... In-Reply-To: <35ae50b10512230120n51b5f9efn@mail.gmail.com> References: <35ae50b10512190611l6761152bo@mail.gmail.com> <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com> <1135013028.2981.5.camel@localhost.localdomain> <35ae50b10512191649t4f86750fr@mail.gmail.com> <35ae50b10512200606u72abb805h@mail.gmail.com> <35ae50b10512230120n51b5f9efn@mail.gmail.com> Message-ID: Hi hui, Sorry, I'm taking a couple of weeks of for Christmas. I'll be back to work on Ferret on the 11th Jan. Hope you can wait till then. I suggest with the number of records you're working on you wait until I finish cFerret. Merry Christmas. Dave On 12/23/05, hui wrote: > have you got my email, David? > > i find searching is slow without optimizing, > It takes 5s when querying one word from 130,000 records, > and 30s when querying a 4 words phrase. > > Hui > > 2005/12/21, David Balmain : > > Hi Hui, > > > > Can you email me your index directory listing? 3Gb sounds very large. > > Also, how much data are you indexing? 
How many files/records and what
> > total size? This will help me work out what is wrong. I have a few
> > ideas.
> >
> > Cheers,
> > Dave
>
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>

From dbalmain.ml at gmail.com Fri Dec 23 21:10:53 2005
From: dbalmain.ml at gmail.com (David Balmain)
Date: Sat, 24 Dec 2005 13:10:53 +1100
Subject: [Ferret-talk] Merry Christmas to everyone. Signing off until the 11th.
Message-ID:

Hi Everyone,

Merry Christmas. I hope everyone has an enjoyable and relaxing holiday
season. I myself will be going bushwalking for a couple of weeks so I
won't have any access to the internet. I'll be back to work on the
11th of January so please don't think I'm ignoring your emails. I'll
get to them all then.

Merry Christmas and a happy new year.
Dave

From fortez at gmail.com Fri Dec 23 21:28:06 2005
From: fortez at gmail.com (hui)
Date: Sat, 24 Dec 2005 10:28:06 +0800
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To: <35ae50b10512230120n51b5f9efn@mail.gmail.com>
References: <35ae50b10512190611l6761152bo@mail.gmail.com>
 <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com>
 <1135013028.2981.5.camel@localhost.localdomain>
 <35ae50b10512191649t4f86750fr@mail.gmail.com>
 <35ae50b10512200606u72abb805h@mail.gmail.com>
 <35ae50b10512230120n51b5f9efn@mail.gmail.com>
Message-ID: <35ae50b10512231828o754b4512q@mail.gmail.com>

I tried Lucene last night, and it is so fast. It indexed 130,000
records in just about one hour (stored all data), which took 10 hours
using Ferret (only id and without optimizing).

And it seems Ferret cannot use the Lucene index data. I got an error:

=====================================================
D:\InstantRails\rails_apps\muvava>ruby script\console
Loading development environment.
>> index = Ferret::Index::Index.new("db/index.db")
IndexError: index 11667591 out of string
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:122:in `[]='
        from D:/InstantRails/ruby/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:122:in `initialize'
        from (irb):1:in `new'
        from (irb):1
>>
=====================================================

2005/12/23, hui :
> Have you got my email, David?
>
> I find searching is slow without optimizing.
> It takes 5s when querying one word from 130,000 records,
> and 30s when querying a 4-word phrase.
>
> Hui
>
> 2005/12/21, David Balmain :
> > Hi Hui,
> >
> > Can you email me your index directory listing? 3Gb sounds very large.
> > Also, how much data are you indexing? How many files/records and what
> > total size? This will help me work out what is wrong. I have a few
> > ideas.
> >
> > Cheers,
> > Dave
>

From fortez at gmail.com Sun Dec 25 02:37:19 2005
From: fortez at gmail.com (hui)
Date: Sun, 25 Dec 2005 15:37:19 +0800
Subject: [Ferret-talk] Indexing so slow......
In-Reply-To:
References: <35ae50b10512190611l6761152bo@mail.gmail.com>
 <6e72bbd70512190821h1aa276a7uad4e57faaeaa51d2@mail.gmail.com>
 <1135013028.2981.5.camel@localhost.localdomain>
 <35ae50b10512191649t4f86750fr@mail.gmail.com>
 <35ae50b10512200606u72abb805h@mail.gmail.com>
 <35ae50b10512230120n51b5f9efn@mail.gmail.com>
Message-ID: <35ae50b10512242337u5f2c00fep@mail.gmail.com>

Have nice holidays :)

2005/12/24, David Balmain :
> Hi hui,
>
> Sorry, I'm taking a couple of weeks off for Christmas. I'll be back to
> work on Ferret on the 11th Jan. Hope you can wait till then. I suggest
> with the number of records you're working on you wait until I finish
> cFerret.
>
> Merry Christmas.
> Dave

From jennyw at dangerousideas.com Wed Dec 28 19:13:17 2005
From: jennyw at dangerousideas.com (jennyw)
Date: Wed, 28 Dec 2005 16:13:17 -0800
Subject: [Ferret-talk] Short words not indexed?
Message-ID: <43B32A1D.3000205@dangerousideas.com>

I noticed that if I have a field that contains something like
"Institute for medicine" and I search using any of these queries:

for
*for*
for~

nothing shows up. If I search for either of the other two words,
though, that term does show up in the result set. Does this indicate
that short words like "for" are not indexed?

Thanks!

Jen

From erik at ehatchersolutions.com Wed Dec 28 19:28:17 2005
From: erik at ehatchersolutions.com (Erik Hatcher)
Date: Wed, 28 Dec 2005 19:28:17 -0500
Subject: [Ferret-talk] Short words not indexed?
In-Reply-To: <43B32A1D.3000205@dangerousideas.com>
References: <43B32A1D.3000205@dangerousideas.com>
Message-ID:

What analyzer are you using?

On Dec 28, 2005, at 7:13 PM, jennyw wrote:

> I noticed that if I have a field that contains something like
> "Institute for medicine", that if I search using any of these
> queries:
>
> for
> *for*
> for~
>
> Nothing shows up. If I search for either of the other two words,
> though, that term would show up in the result set. Does this
> indicate that short words like "for" are not indexed?

Jen - what analyzer are you using? If you're using the default, it is
the StandardAnalyzer, which removes these stop words during
tokenization:

ENGLISH_STOP_WORDS = [
  "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
  "in", "into", "is", "it", "no", "not", "of", "on", "or", "s", "such",
  "t", "that", "the", "their", "then", "there", "these", "they", "this",
  "to", "was", "will", "with"
]

Off the cuff, you should be able to adjust this to not remove any stop
words by using:

:analyzer => StandardAnalyzer.new([])

if you're using the Index class Ferret provides.
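
To make the effect concrete, here is a minimal plain-Ruby sketch of
stop-word filtering. It does not use Ferret at all: ToyAnalyzer and
TOY_STOP_WORDS are hypothetical names made up for illustration, and the
tokenizing rule (lowercase, split on non-alphanumerics) is an
assumption, not how StandardAnalyzer is actually implemented. It only
shows why a term like "for" never reaches the index when a stop list
is in place, and why passing an empty list keeps it.

```ruby
# Illustration only: NOT Ferret's analyzer code.
# TOY_STOP_WORDS copies the stop list quoted in the message above.
TOY_STOP_WORDS = %w[
  a an and are as at be but by for if in into is it no not of on or s
  such t that the their then there these they this to was will with
].freeze

# Hypothetical stand-in for an analyzer that drops stop words.
class ToyAnalyzer
  def initialize(stop_words = TOY_STOP_WORDS)
    @stop_words = stop_words
  end

  # Lowercase the text, split it into alphanumeric runs (an assumed,
  # simplified tokenizing rule), and drop every stop word.
  def tokens(text)
    text.downcase.scan(/[a-z0-9]+/).reject { |word| @stop_words.include?(word) }
  end
end

# With the default stop list, "for" is discarded before indexing.
default_tokens = ToyAnalyzer.new.tokens("Institute for medicine")

# With an empty stop list, every word is kept.
all_tokens = ToyAnalyzer.new([]).tokens("Institute for medicine")

puts default_tokens.inspect
puts all_tokens.inspect
```

With the default list the first result contains only "institute" and
"medicine", so a query for "for" has nothing to match; with the empty
list all three words survive.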

Erik

From vamlists at gmx.net Sat Dec 31 03:23:55 2005
From: vamlists at gmx.net (Vamsee Kanakala)
Date: Sat, 31 Dec 2005 13:53:55 +0530
Subject: [Ferret-talk] Some newbie questions
Message-ID: <43B6401B.9070408@gmx.net>

Hi all,

I'm newly discovering Ferret, and just did a crash course on Lucene too
(I get the concepts behind it). But I have a few doubts:

1. What is the 'recommended' way of integrating Ferret into my app now?
The gem version or the plugin?

2. Where do I checkout the plugin from? Do I just copy the version on
the wiki?

Those apart, Ferret is really lacking a simple guide. The wiki page
"HowToIntegrateFerretWithRails" is sometimes not very clear. I would
love to clean it up or create a separate page for ferret on the wiki, if
you guys help me figure out Ferret. It will be very useful to me and
others.

The plugin version looks much simpler. Where do I start?

Thanks much,
Vamsee.

From JanPrill at blauton.de Sat Dec 31 06:47:58 2005
From: JanPrill at blauton.de (Jan Prill)
Date: Sat, 31 Dec 2005 12:47:58 +0100
Subject: [Ferret-talk] Some newbie questions
In-Reply-To: <43B6401B.9070408@gmx.net>
References: <43B6401B.9070408@gmx.net>
Message-ID: <43B66FEE.2060604@blauton.de>

Hi, Vamsee,

some answers inline...

Vamsee Kanakala wrote:

>Hi all,
>
>I'm newly discovering Ferret, and just did a crash course on Lucene too
>(I get the concepts behind it). But I have a few doubts:
>
>1. What is the 'recommended' way of integrating Ferret into my app now?
>The gem version or the plugin?
>
>
It's not a question of 'OR'. You'll definitely need the ferret gem,
because it provides the ferret functionality to build and query
indexes (and so on). The plugin just tries to offer a convenient way to
integrate the functionality provided by the gem. So the first step
__has to be__:

gem install ferret

>2. Where do I checkout the plugin from? Do I just copy the version on
>the wiki?
>
>
As you have realized integration questions are not very advanced yet.
I'm not aware of a repository that provides an svn checkout of a plugin.
Maybe these are in the works but if you've got the resources and get the
experience while you are going on I'm pretty sure that the rails crowd
would love a "best way of ferret-integration" plugin.

>Those apart, Ferret is really lacking a simple guide. The wiki page
>"HowToIntegrateFerretWithRails" is sometimes not very clear.
>
Indeed. This was my first approach with rails and ferret and it
definitely needs a clean up. I think the Rails wiki would be a great
place for putting together the different integration efforts that you
might find on the ferret wiki, the rails mailing list, this list and the
rails wiki.

>I would
>love to clean it up or create a separate page for ferret on the wiki, if
>you guys help me figure out Ferret. It will be very useful to me and
>others.
>
>
Regarding your offer of help: GREAT. I would suggest you try to put
things together for your app and share with us if you've got something.
I think it's great that Dave is putting his power into advancing ferret
performance-wise. Ferret is a ruby gem and should work not only with
rails. So it's only fair that we (the rails community) contribute a best
practice for integration - that shouldn't be too much 'heavy lifting'.

Regards
Jan

>The plugin version looks much simpler. Where do I start?
>
>Thanks much,
>Vamsee.
>

From tlockney at oddpost.com Sat Dec 31 12:48:11 2005
From: tlockney at oddpost.com (Thomas Lockney)
Date: Sat, 31 Dec 2005 09:48:11 -0800
Subject: [Ferret-talk] Some newbie questions
In-Reply-To: <43B66FEE.2060604@blauton.de>
References: <43B6401B.9070408@gmx.net> <43B66FEE.2060604@blauton.de>
Message-ID: <43B6C45B.1050608@oddpost.com>

Jan Prill wrote:

>As you have realized integration questions are not very advanced yet.
>I'm not aware of a repository that provides an svn checkout of a plugin.
>Maybe these are in the works but if you've got the resources and get the
>experience while you are going on I'm pretty sure that the rails crowd
>would love a "best way of ferret-integration" plugin.
>
>
If you are planning on using the acts_as_ferret plugin, it's still very
much a work in progress. However, if you decide to take the plunge, be
sure to notice that there are actually two versions available on the
wiki. I added some functionality to the initial implementation and have
listed my version lower down on the page. It still needs a lot of
testing and may contain some serious bugs.

As for a repository, that's been something I've been thinking about
recently and would like to set up soon, but I don't currently have a
good hosting solution for it. As soon as I can figure that aspect out
and make sure that Kasper Weibel (who developed the original version) is
not opposed, I will be setting one up.

>Indeed. This was my first approach with rails and ferret and it
>definitely needs a clean up. I think the Rails wiki would be a great
>place for putting together the different integration efforts that you
>might find on the ferret wiki, the rails mailing list, this list and the
>rails wiki.
>
>
I think a few alternatives have popped up since that wiki page was put
together. I think collecting these approaches in one place would be an
excellent idea.

Regards,
Thomas Lockney