From smaloff at veer.com Thu May 1 22:24:46 2008
From: smaloff at veer.com (Sheldon Maloff)
Date: Fri, 2 May 2008 04:24:46 +0200
Subject: [Ferret-talk] Pagination, sorting and conditions: the combination is b
In-Reply-To: <20080425115943.GA308@cordoba.webit.de>
References: <20080425115943.GA308@cordoba.webit.de>
Message-ID: <66d3a747bc767cb7b72be166d2773179@ruby-forum.com>

Hello Jens,

I think I know what's going on here, because our descending sort
searches are broken too, and I have started to investigate the cause
and to work on a fix. I have a January '08 version of the trunk; I
believe it has changed quite a lot since then.

Jens, I don't think it's anything you "broke" but rather an artifact of
how MySQL works. At least, I'm using MySQL and this is the behaviour I
see.

I created 6 records, whose ids are 1 to 6 in my database. I am
paginating 5 records per page. In my reverse sort I would expect to see
records 6, 5, 4, 3, 2 on page 1 of the results, and id 1 on page 2.

What I see is a method called ar_find_by_contents. It calls
find_id_by_content, which in turn calls ferret and returns an array.
The array that comes back from ferret is actually correctly sorted:

  6  0.928179502487183
  5  0.928179502487183
  4  0.928179502487183
  3  0.928179502487183
  2  0.928179502487183
  1  0.928179502487183

The first number is the id, the second is the rank.

Now what happens is that ar_find_by_contents calls retrieve_records,
and retrieve_records produces a SELECT statement like so:

  SELECT * FROM model WHERE id IN (6, 1, 2, 3, 4, 5) LIMIT 0, 5

It took me a while to figure out that things are being passed around as
a hash, hence the wacky order of the ids in the IN clause.

Now the problem with this statement is that MySQL doesn't return
records in the order in which the ids appear in the IN clause. With no
ORDER BY clause, MySQL here returns records in the order of the primary
key on the table, which happens to be the id column.
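The ordering effect described so far can be reproduced without a
database. What follows is a minimal sketch in plain Ruby, not
acts_as_ferret code: fetch_in_pk_order is a hypothetical stand-in for
the SELECT ... WHERE id IN (...) LIMIT query, on the assumption that
rows come back in primary-key order, and RANKED_IDS mirrors the id list
Ferret returned above. The second helper shows one possible way around
the effect, namely cutting the page out of the ranked id list before
querying.

```ruby
# Ids in the order Ferret ranked them (descending sort), as in the table above.
RANKED_IDS = [6, 5, 4, 3, 2, 1]

# Hypothetical stand-in for the SQL query: rows come back in primary-key
# order regardless of the order of ids in the IN clause, then LIMIT applies.
def fetch_in_pk_order(ids, offset, count)
  ids.sort.slice(offset, count) || []
end

# LIMIT is applied first, the Ferret ranking is re-applied afterwards.
def limit_then_rank(ranked_ids, page, per_page)
  rows = fetch_in_pk_order(ranked_ids, (page - 1) * per_page, per_page)
  rows.sort_by { |id| ranked_ids.index(id) }
end

# Alternative: take the page from the ranked id list first, so the IN
# clause only ever contains the ids that belong on that page.
def slice_then_fetch(ranked_ids, page, per_page)
  page_ids = ranked_ids.slice((page - 1) * per_page, per_page) || []
  rows = fetch_in_pk_order(page_ids, 0, per_page)
  rows.sort_by { |id| page_ids.index(id) }
end

p limit_then_rank(RANKED_IDS, 1, 5)   # => [5, 4, 3, 2, 1]
p limit_then_rank(RANKED_IDS, 2, 5)   # => [6]
p slice_then_fetch(RANKED_IDS, 1, 5)  # => [6, 5, 4, 3, 2]
p slice_then_fetch(RANKED_IDS, 2, 5)  # => [1]
```

Whether the real retrieve_records can be changed this way depends on
the acts_as_ferret version in use; the sketch only shows the ordering
logic.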
So MySQL is returning records 1, 2, 3, 4, 5, 6, in that order. Then the
LIMIT clause kicks in and truncates the results to 1 through 5.

Now the rest of ar_find_by_contents valiantly tries to order the AR
results by the rank returned by ferret (my first table above). The
problem is that record 6, the newest, is no longer in the results
because LIMIT took it out. So AAF sorts records 1 through 5 descending.

Following along, we can see why page two returns only record 6. On page
two, the query changes to

  SELECT * FROM model WHERE id IN (6, 1, 2, 3, 4, 5) LIMIT 5, 5

Once again, MySQL returns records 1, 2, 3, 4, 5, 6, but this time the
LIMIT leaves only the last record, id 6. And then AAF sorts that
descending.

I am working on a patch for the version I have that makes MySQL return
only the correct set of records in the first place. In other words, it
ensures that the only ids present in the IN clause are the ones that
should appear on page 1 of the results, or page 2, or page N. So my AR
query for page 1 looks like

  SELECT * FROM model WHERE id IN (6, 5, 4, 3, 2) LIMIT 0, 5

and the AR query for page 2 looks like

  SELECT * FROM model WHERE id IN (1) LIMIT 0, 5

I got it working, but in the process I have broken every other search.
Funny. I'm sure I'll figure it out.

Anyway, Jens, that's the gist of the problem, at least as it relates to
MySQL. Other databases may vary.

Regards
Sheldon Maloff
veer.com

Jens Kraemer wrote:
> Hi Max,
>
> thanks for your detailed report. Might well be that I broke one or more
> of the various combinations of pagination / sorting / active record
> conditions (where you might specify :order, too, btw) in trunk.
>
> I'll look into it asap.
>
> Cheers,
> Jens

-- 
Posted via http://www.ruby-forum.com/.

From cgansen at gmail.com Tue May 6 01:50:40 2008
From: cgansen at gmail.com (Chris G.)
Date: Tue, 6 May 2008 07:50:40 +0200
Subject: [Ferret-talk] Porblem with custom analyzer
In-Reply-To: <6f4f702874c2d6ac1d8d734e1f6f727f@ruby-forum.com>
References: <6f4f702874c2d6ac1d8d734e1f6f727f@ruby-forum.com>
Message-ID: <8486669c52b6828465f09cc6e1f87e10@ruby-forum.com>

Guillaume Guillaume wrote:
> Hi,
>
> I m trying to set French stop Words.
> So i created a file called "FrenchStemmingAnalyzer.rb" and i put it in
> /lib of my rails App.
>

You might want to follow the standard naming convention and rename the
file to 'french_stemming_analyzer.rb' instead. Then drop a line like

  require 'lib/french_stemming_analyzer'

inside config/environment.rb.

-c-

-- 
Posted via http://www.ruby-forum.com/.

From ij.rubylist at gmail.com Fri May 9 05:05:33 2008
From: ij.rubylist at gmail.com (Izidor Jerebic)
Date: Fri, 9 May 2008 11:05:33 +0200
Subject: [Ferret-talk] doing a join between two ferret indexes?
Message-ID:

Hello, everybody,

we have a situation where there are two sets of information about
documents: slow-changing and fast-changing properties.

We index documents by slow-changing properties (content) using Ferret
directly, and it works rather well.

But now we would like to filter the searches by a fast-changing
property that is calculated separately, e.g. "isBookAvailable". The
idea is to put this property in another Ferret index (which gets
rebuilt very frequently and very quickly), together with the unique
document id (we have an external unique id for each document), and then
somehow "join" the original index with this index to provide the final
results (there are even more nuances, but this description is enough
for a start).

The problem is that we need correct paging behaviour and a correct
search result count. If we could somehow "join" the two indexes, the
paging and result count would be provided by Ferret. If not, we must go
through *all* ferret result documents from the query on the first index
and apply the filter to them (i.e.
remove from the result set if not available), to get the correct result
count and paging behaviour.

At first glance this "joining" is not possible with Ferret. Or is it?

Can somebody please enlighten me?

izidor

From jamesaharvey at gmail.com Fri May 9 10:49:47 2008
From: jamesaharvey at gmail.com (James Harvey)
Date: Fri, 9 May 2008 10:49:47 -0400
Subject: [Ferret-talk] Searcher Explain
Message-ID: <724e6cbf0805090749s45b0638coa5a0fa2cb701c646@mail.gmail.com>

Hi,

I am unable to use the Searcher's explain method. Any time I call it, I
get segmentation faults and it kills the process running my Rails site.
Has anyone else had this problem? Here is the code I am trying to use
it in:

  search = Search.create(:query => query)

  @quotations = []

  searcher = Ferret::Search::Searcher.new("index") # FerretConfig::INDEX

  bq = self.build_query(query) # Builds a Boolean Query

  searcher.search_each(bq) {|doc, score|
    @quotations << SearchResult.new(searcher[doc][:id],
      searcher[doc][:quotation], searcher[doc][:author], score)
  }

  p searcher.explain(bq, @quotations[0].id).to_s

  searcher.close

Thanks in advance for any help!

-James

From sd.codewarrior at gmail.com Mon May 12 14:33:23 2008
From: sd.codewarrior at gmail.com (S D)
Date: Mon, 12 May 2008 14:33:23 -0400
Subject: [Ferret-talk] Using StemFilter with PhraseQuery
Message-ID: <4f10e2890805121133u605ea1e8o995f44d4aff8cd67@mail.gmail.com>

Hi,

I'm having difficulty getting the StemFilter and PhraseQuery to work
properly together. When I use a StemFilter with a PhraseQuery, searches
only work if the phrase consists of stems. For example, the search
phrase "reduces health care" will not work but the phrase "reduce
health care" will work, even though the exact text "reduces health
care" is contained in the original document.
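The behaviour described above is what one would see if the index holds
stemmed terms while the query holds raw ones. A toy illustration in
plain Ruby follows. It does not use Ferret at all: toy_stem, analyze
and phrase_match? are hypothetical helpers, the stemmer is deliberately
naive (it only strips a trailing "s"), and the sample sentence is made
up. They stand in for a real analysis chain only to show the mismatch.

```ruby
# Deliberately naive stemmer: strips one trailing "s" from longer words.
# A real stemming filter does far more; this only illustrates the mismatch.
def toy_stem(word)
  word.length > 3 ? word.sub(/s\z/, "") : word
end

# Toy analysis chain: lowercase, split on whitespace, stem each term.
def analyze(text)
  text.downcase.split.map { |w| toy_stem(w) }
end

# Exact phrase match (no slop) of query terms against indexed terms.
def phrase_match?(indexed_terms, query_terms)
  return false if query_terms.empty?
  indexed_terms.each_cons(query_terms.size).any? { |window| window == query_terms }
end

# Made-up document text, analyzed the way an index would store it.
indexed = analyze("The bill reduces health care costs")

raw_query      = "reduces health care".downcase.split  # terms used unanalyzed
analyzed_query = analyze("reduces health care")        # same chain as the index

p indexed                                 # => ["the", "bill", "reduce", "health", "care", "cost"]
p phrase_match?(indexed, raw_query)       # => false
p phrase_match?(indexed, analyzed_query)  # => true
```

The point of the sketch is only that query terms have to go through the
same transformation as the indexed text before they can match.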
I'd like to use StemFilter in conjunction with PhraseQuery because I
need the stemming and I also need to be able to use the slop feature of
PhraseQuery. Below is my use of StemFilter and PhraseQuery. Is there
anything I'm doing wrong, or is the above description what I should
expect? To get the response that I'm expecting I could parse the phrase
and build up a query to be used by QueryParser, but I'd like a more
succinct solution for now.

I use a StemFilter in my analyzer as follows:

  def token_stream(field, str)
    ...
    ts = LowerCaseFilter.new(ts) if @lower
    ts = StopFilter.new(ts, @stop_words)
    ts = StemFilter.new(ts)
    ...
  end

My use of PhraseQuery is as follows:

  def generate_query(phrase)
    phrase = phrase.downcase
    phrase_parts = phrase.split(' ')
    query = Ferret::Search::PhraseQuery.new(:content, 2)
    phrase_parts.each do |part|
      # puts "part: \"" + part + "\""
      query.add_term(part, 1)
    end
    query
  end

From kraemer at webit.de Tue May 13 03:09:33 2008
From: kraemer at webit.de (Jens Kraemer)
Date: Tue, 13 May 2008 09:09:33 +0200
Subject: [Ferret-talk] Using StemFilter with PhraseQuery
In-Reply-To: <4f10e2890805121133u605ea1e8o995f44d4aff8cd67@mail.gmail.com>
References: <4f10e2890805121133u605ea1e8o995f44d4aff8cd67@mail.gmail.com>
Message-ID: <20080513070933.GA9008@cordoba.webit.de>

Hi!

I think what you get is the expected behaviour. Since you don't use
Ferret's QueryParser but build your queries on your own, you're also
responsible for proper tokenization / analysis of your query content.

So running your phrase through your analyzer before constructing the
phrase query should work as expected.

Cheers,
Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold

From kraemer at webit.de Tue May 13 03:14:11 2008
From: kraemer at webit.de (Jens Kraemer)
Date: Tue, 13 May 2008 09:14:11 +0200
Subject: [Ferret-talk] Searcher Explain
In-Reply-To: <724e6cbf0805090749s45b0638coa5a0fa2cb701c646@mail.gmail.com>
References: <724e6cbf0805090749s45b0638coa5a0fa2cb701c646@mail.gmail.com>
Message-ID: <20080513071411.GB9008@cordoba.webit.de>

Hi!
On Fri, May 09, 2008 at 10:49:47AM -0400, James Harvey wrote:
> I am unable to use the Searcher's explain method. Anytime I call it, I get
> Segmentation Faults and it kills the process I have running my Rails site.
> Has anyone else had this problem? Here is some code I am trying to use it
> in...
>
> search = Search.create(:query => query)
>
> @quotations = []
>
> searcher = Ferret::Search::Searcher.new("index") # FerretConfig::INDEX
>
> bq = self.build_query(query) # Builds a Boolean Query
>
> searcher.search_each(bq) {|doc, score|
>   @quotations << SearchResult.new(searcher[doc][:id],
>     searcher[doc][:quotation], searcher[doc][:author], score)
> }
>
> p searcher.explain(bq, @quotations[0].id).to_s

If I get this right, @quotations[0].id will give the value of the id
field of that result, which is not a valid argument to explain. What
the explain method expects instead is the ferret-internal document id
(the doc value in your search_each block).

cheers,
Jens

-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold

From jan.prill at gmail.com Thu May 15 12:07:14 2008
From: jan.prill at gmail.com (Jan Prill)
Date: Thu, 15 May 2008 18:07:14 +0200
Subject: [Ferret-talk] Picolena, a ferret+rails documents search engine
In-Reply-To: <02b7d68df3805e85e95bcb90b5076801@ruby-forum.com>
References: <02b7d68df3805e85e95bcb90b5076801@ruby-forum.com>
Message-ID: <562a35c10805150907j211eb3c1m4ccf245737ee8daa@mail.gmail.com>

Hey Eric,

I've looked at the demo and am quite impressed. Thanks for making your
code available!!

Cheers,
Jan

2008/4/20 Eric Duminil :

> Hi everybody!
>
> I am proud to present you a small project I have been working on for a
> while:
> Picolena, a documents search engine written in Rails.
> ( http://picolena.devjavu.com/ ).
>
> It obviously uses Ferret for indexing and searching, and adds some plain
> text extractors in order to index OOffice.org, pdf and MS Office
> documents (and some others as well).
>
> Everything is packed in a gem (gem install picolena), with a few rake
> tasks, a multi-threaded indexer, a language guesser, a rails frontend
> and some specs to be sure everything works fine.
>
> I would love to hear some feedback from acts_as_ferret developers or
> users!
> My project is in no way supposed to be a competitor of AAF: we have
> different goals (ActiveRecord indexing plugin vs. stand-alone rails-app
> for documents indexing), but still a lot in common.
>
> I dare say Picolena would be useful in a lot of companies (as a
> google-mini alternative), and has already been working in production for
> a few months without a hitch. This has been made possible thanks to
> Ferret's incredible speed. Kudos to the devs!
>
> Best regards,
> Eric
>
>
> demo website : http://citynet.hft-stuttgart.de:4000/
> trac : http://picolena.devjavu.com
> rubyforge : http://rubyforge.org/projects/picolena/
> svn repo : http://svn.devjavu.com/picolena/trunk/
> --
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>

-- 
Jan Prill
Rechtsanwalt
Grünebergstraße 38
22763 Hamburg
Tel +49 (0)40 41265809
Fax +49 (0)40 380178-73022
Mobil +49 (0)171 3516667

From schulte.eric at gmail.com Mon May 19 12:00:59 2008
From: schulte.eric at gmail.com (Eric Schulte)
Date: Mon, 19 May 2008 09:00:59 -0700
Subject: [Ferret-talk] Error decoding input string.
Message-ID: <4831a428.08b38c0a.1b25.7aad@mx.google.com>

Hi,

I am trying to index a number of Spanish language text files, but a
large fraction of the files are generating errors like the following...
  Error: exception 2 not handled: Error decoding input string. Check that you have the locale set correctly

However, it looks to me like my locale matches the file type. Running
the file command on the files returns

  $ file /media/.../raw/abc/20Jan2007_abc_001041_67.es
  /media/.../raw/abc/20Jan2007_abc_001041_67.es: UTF-8 Unicode text

and my locale is

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

After enough of these errors are generated, I begin to get errors for
having too many open files, and the indexing fails.

  Error: exception 2 not handled: Too many open files

Any suggestions would be greatly appreciated.

Thanks,
Eric

From radu at rdconcept.ro Mon May 19 16:46:59 2008
From: radu at rdconcept.ro (Radu Spineanu)
Date: Mon, 19 May 2008 23:46:59 +0300
Subject: [Ferret-talk] feature question
Message-ID: <4831E743.9090500@rdconcept.ro>

Hi,

Can ferret search for a combination of words and return the distance
between them in a text?

If that exists, is there a way to improve on it by checking whether
they are separated by a certain character (like . for different
sentences)?

Thanks,
Radu

From jk at jkraemer.net Mon May 19 17:15:42 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Mon, 19 May 2008 23:15:42 +0200
Subject: [Ferret-talk] Error decoding input string.
In-Reply-To: <4831a428.08b38c0a.1b25.7aad@mx.google.com>
References: <4831a428.08b38c0a.1b25.7aad@mx.google.com>
Message-ID:

Hi!

Are you *sure* this is all valid UTF8? I don't know how the file
command determines this, or whether it is always right. Maybe try to
play around with iconv to ensure that whatever you send to Ferret
really is UTF8.
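One way to double-check the input independently of the file command is
to test the raw bytes in Ruby itself. The following is a minimal
sketch, assuming a Ruby recent enough to have String#valid_encoding?
and String#b (2.0 or later; neither exists on 1.8). The helper names
invalid_utf8? and first_invalid_offset are hypothetical and not part of
Ferret.

```ruby
# True if the raw bytes are not well-formed UTF-8.
def invalid_utf8?(bytes)
  !bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Byte offset of the first malformed sequence, or nil if the data is clean.
def first_invalid_offset(bytes)
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  offset = 0
  str.each_char do |ch|
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

good = "canci\u00f3n espa\u00f1ola".b  # well-formed UTF-8 bytes
bad  = "canci\xF3n".b                  # Latin-1 byte for o-acute, not UTF-8

p invalid_utf8?(good)        # => false
p invalid_utf8?(bad)         # => true
p first_invalid_offset(bad)  # => 5
```

For files, File.binread(path) returns the raw bytes that can be passed
to either helper.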
Cheers,
Jens

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From jk at jkraemer.net Mon May 19 17:34:02 2008
From: jk at jkraemer.net (Jens Kraemer)
Date: Mon, 19 May 2008 23:34:02 +0200
Subject: [Ferret-talk] feature question
In-Reply-To: <4831E743.9090500@rdconcept.ro>
References: <4831E743.9090500@rdconcept.ro>
Message-ID: <316D7FB3-C178-4D84-9510-6102C39F4DBD@jkraemer.net>

Hi!
On 19.05.2008, at 22:46, Radu Spineanu wrote:

> Hi,
>
> Can ferret search for a combination of words and return the distance
> between them in a text?

It won't directly return the distance, but since Ferret stores term
positions it should be possible to determine the distance between
different terms manually.

You may also issue phrase queries that only return hits for terms that
are separated by at most n other terms. The QueryParser API docs and
the Ferret book have examples of this.

> If it exists is there a way you can improve on this by looking if
> they are separated by a certain character(like . for different
> sentences)?

Usually you don't index characters like '.' at all (they are removed
during analysis, when the text is split up into tokens), but if you
changed that so that sentence endings end up in the index as a kind of
special term, this might be possible, too.

I don't know your use case, but keep in mind that you can get the
effect of ranking terms that are closer together higher by chaining
phrase queries with different slop values and assigning them different
boosts:

  ("red fox")^15 OR ("red fox"~4)^10 OR ("red fox"~10)^5 OR ("red fox"~100)

This will boost the exact match the most, and assign lower boosts to
matches where the terms are further apart. Maybe something like this
will already be a 'good enough' solution to your problem?

cheers,
Jens

-- 
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

From schulte.eric at gmail.com Tue May 20 14:00:33 2008
From: schulte.eric at gmail.com (Eric Schulte)
Date: Tue, 20 May 2008 11:00:33 -0700
Subject: [Ferret-talk] Error decoding input string.
In-Reply-To:
References: <4831a428.08b38c0a.1b25.7aad@mx.google.com>
Message-ID: <483311a7.3bf5220a.180d.58d7@mx.google.com>

Hi Jens,

Thanks for the reply!
I used iconv (thanks for the pointer, I had no idea this tool existed)
and was able to convert all of the articles to and from utf8 without
any errors being generated, so I am pretty sure that the input sources
are valid utf8.

I should mention that I am using an old version of ferret, v0.9.6,
which is the last version to have a pure-Ruby implementation. I'm using
this version because I have added some changes which allow me to
specify the scoring algorithm used on a per-search basis. I haven't,
however, made any changes to the indexing portion of the application.

I currently have an iconv script creating transliterated ASCII copies
of all my articles, so I am going to try to index those. Also, I am
thinking of trying to index using Lucene, since there is a chance that
the older version of ferret is compatible with Lucene indexes.

If you have any other suggestions I'd love to hear them, but I
understand that I can't expect much help with such an old version. Do
you know of a way to specify custom scoring algorithms in the current
versions of ferret?

Best,
Eric

-- 
schulte

From jamesaharvey at gmail.com Fri May 23 11:59:13 2008
From: jamesaharvey at gmail.com (James Harvey)
Date: Fri, 23 May 2008 11:59:13 -0400
Subject: [Ferret-talk] Index size
Message-ID: <724e6cbf0805230859i212a19g5be4d7275ce7cc39@mail.gmail.com>

Hi,

Is it possible that when building indexes on my Windows dev machine I
end up with 3 .cfs files, and then when deploying to my prod Linux box
I only end up with 1 .cfs file? It seems a little odd, but everything
seems to be working correctly. Just curious whether anyone else has
experienced this as well.

Thanks!

-James
From julioody at gmail.com Sun May 25 23:34:17 2008
From: julioody at gmail.com (Julio Cesar Ody)
Date: Mon, 26 May 2008 13:34:17 +1000
Subject: [Ferret-talk] project directions
Message-ID:

Hey all,

just recently I stumbled upon this

  http://ferret.davebalmain.com/trac/timeline

which seemed like good news. I thought Ferret had been put on hold or
was perhaps dying, and having participated recently in a few
discussions with people who also thought that was the case, I didn't
have a good answer for it.

So, does anyone know whether there will be ongoing, consistent
development on Ferret?

Thanks.

PS: I'm not bitching.

From schulte.eric at gmail.com Thu May 29 12:44:00 2008
From: schulte.eric at gmail.com (Eric Schulte)
Date: Thu, 29 May 2008 09:44:00 -0700
Subject: [Ferret-talk] Error decoding input string.
In-Reply-To: <18483.4545.306776.825606@eschulte-work.hsd1.wa.comcast.net.>
References: <4831a428.08b38c0a.1b25.7aad@mx.google.com> <18483.4545.306776.825606@eschulte-work.hsd1.wa.comcast.net.>
Message-ID: <483edd06.14b48c0a.627e.7963@mx.google.com>

Hi,

So I've tried switching to the latest version of Ferret (0.11.06), but
I am still getting the following errors.

,----
| Error: exception 2 not handled: Error decoding input string. Check that you have the locale set correctly
| from spanish_indexer.rb:45
| from spanish_indexer.rb:38:in `each'
| from spanish_indexer.rb:38
`----

The articles are recognized as valid utf8 using iconv, and I believe my
locale is set properly

,----
| LANG=en_US.UTF-8
| LC_CTYPE="en_US.UTF-8"
| LC_NUMERIC="en_US.UTF-8"
| LC_TIME="en_US.UTF-8"
| LC_COLLATE="en_US.UTF-8"
| LC_MONETARY="en_US.UTF-8"
| LC_MESSAGES="en_US.UTF-8"
| LC_PAPER="en_US.UTF-8"
| LC_NAME="en_US.UTF-8"
| LC_ADDRESS="en_US.UTF-8"
| LC_TELEPHONE="en_US.UTF-8"
| LC_MEASUREMENT="en_US.UTF-8"
| LC_IDENTIFICATION="en_US.UTF-8"
| LC_ALL=
`----

What's weird here is that the errors don't always happen on the same
articles: if I run indexing three times, printing out the articles that
throw this error, I get a different list of articles each time.

In fact I just changed my indexing script so that it keeps trying to
index failed articles

,----
| # ind is my index
| #
| # add_arts is a method which takes a list of articles, tries to
| # index them, and returns a list of the articles that
| # threw errors during indexing
| #
| puts art_paths.size.to_s + " articles"
| missed = add_arts(art_paths, ind)
| while missed.size > 0
|   missed = add_arts(missed, ind)
|   puts missed.size
| end
`----

and I was able to index all of the articles with the following output

,----
| 5843 articles
| 34
| 16
| 10
| 9
| 7
| 7
| 6
| 1
| 0
`----

Any ideas what could be causing this non-deterministic behavior?

Thanks,
Eric

-- 
schulte
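The retry loop above runs until nothing is missed, so it would never
finish if some article failed on every attempt. A more defensive
variant is sketched below in plain Ruby. It is only an illustration:
index_with_retries is a hypothetical helper that takes the indexing
step as a block (standing in for add_arts), and max_passes is an
assumed parameter, not something Ferret provides. It does not explain
the non-deterministic failures; it only bounds the workaround.

```ruby
# Repeatedly hands the still-failing items to the block, which must return
# the items that failed again. Stops when nothing fails or after max_passes.
# Returns [remaining_failures, number_of_misses_after_each_pass].
def index_with_retries(items, max_passes = 10)
  missed = items
  history = []
  max_passes.times do
    break if missed.empty?
    missed = yield(missed)
    history << missed.size
  end
  [missed, history]
end

# Stand-in for add_arts: "a" fails on its first two attempts, "b" always fails.
attempts = Hash.new(0)
flaky = lambda do |batch|
  batch.select do |item|
    attempts[item] += 1
    item == "b" || (item == "a" && attempts[item] <= 2)
  end
end

missed, history = index_with_retries(%w[a b c], 4) { |batch| flaky.call(batch) }
p missed   # => ["b"]
p history  # => [2, 2, 1, 1]
```

Anything still in the returned list after the last pass can then be
logged and inspected separately instead of being retried forever.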