From bas at luon.net Sun Jul 2 06:13:43 2006
From: bas at luon.net (Bas Kloet)
Date: Sun, 2 Jul 2006 12:13:43 +0200
Subject: 2 bugs when parsing emphasized or bold text
Message-ID: <20060702101343.GA4541@edison.luon.net>

I've found 2 bugs that produce (imho) incorrect rendering results:

1) The regexp for strong (*) and bold (**) is greedy, which produces
very strange results.

The simplest way to show the problem is to give an example.

This is the original code:

=====
Strong:
Lets do a little test *t*
this should not be strong *u*.

Bold:
Lets do another test **t**
this should not be bold **u**.
=====

And this is the (relevant part of) the html that is produced:

=====

Strong: Lets do a little test t* this should not be strong *u.

Bold: Lets do another test t* this should not be bold *u.

=====

As you can see, the html produced is not exactly what you would expect.

2) Using _TEXT_ to emphasize a string doesn't work if TEXT spans
multiple lines.

If you want to emphasize a piece of text that spans multiple lines,
then _TEXT_ does not work: the underscores are simply shown in the
generated text, even if there are no hard linebreaks.

I filed these bugs both with the Debian bugtracker and the tracker on
RubyForge a couple of weeks ago, but there was no response. Though I'm a
pretty decent Ruby coder, the RedCloth code is way over my head, so I
was wondering if anyone here has a solution to one or both of the above
problems?

Thanks,
Bas

From christoffer.sawicki at gmail.com Sun Jul 2 18:20:53 2006
From: christoffer.sawicki at gmail.com (Christoffer Sawicki)
Date: Mon, 3 Jul 2006 00:20:53 +0200
Subject: 2 bugs when parsing emphasized or bold text
In-Reply-To: <20060702101343.GA4541@edison.luon.net>
References: <20060702101343.GA4541@edison.luon.net>
Message-ID: <1a991fa30607021520v2c671119ta715c08b198cfc6d@mail.gmail.com>

> I've found 2 bugs that produce (imho) incorrect rendering results:
>
> 1) The regexp for strong (*) and bold (**) is greedy, which produces
> very strange results.

*snip*

I haven't looked at the relevant RedCloth code, but the non-greedy
modifier in Ruby is "?". In other words: .* is greedy while .*? isn't.
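
To make the greedy versus non-greedy distinction concrete, here is a minimal sketch in plain Ruby. It does not use RedCloth at all; the constant and method names are hypothetical and exist only for this illustration, and the sample string is taken from the bug report above.

```ruby
# Hypothetical names for illustration only; this is not RedCloth code.
SAMPLE = "*t* this should not be strong *u*."

# Greedy: .* takes as much as it can, then backtracks to the last "*".
def greedy_capture(text)
  text[/\*(.*)\*/, 1]
end

# Non-greedy: .*? stops at the first "*" that lets the match succeed.
def lazy_capture(text)
  text[/\*(.*?)\*/, 1]
end

puts greedy_capture(SAMPLE)  # prints: t* this should not be strong *u
puts lazy_capture(SAMPLE)    # prints: t
```

The greedy result happens to resemble the broken output in the report, but that alone does not show that RedCloth's own pattern is greedy.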
--
Christoffer Sawicki

From bas at luon.net Mon Jul 3 03:05:20 2006
From: bas at luon.net (Bas Kloet)
Date: Mon, 3 Jul 2006 09:05:20 +0200
Subject: 2 bugs when parsing emphasized or bold text
In-Reply-To: <1a991fa30607021520v2c671119ta715c08b198cfc6d@mail.gmail.com>
References: <20060702101343.GA4541@edison.luon.net>
 <1a991fa30607021520v2c671119ta715c08b198cfc6d@mail.gmail.com>
Message-ID: <20060703070520.GA4561@edison.luon.net>

On Mon, Jul 03, 2006 at 12:20:53AM +0200, Christoffer Sawicki wrote:
> > I've found 2 bugs that produce (imho) incorrect rendering results:
> >
> > 1) The regexp for strong (*) and bold (**) is greedy, which produces
> > very strange results.
>
> *snip*
>
> I haven't looked at the relevant RedCloth code, but the non-greedy
> modifier in Ruby is "?". In other words: .* is greedy while .*? isn't.
>

Thanks, but that's not my real problem. I have a pretty good idea of
where in the code this is happening, and I know basic regular expression
syntax, but the following regexp code is just a bit too complicated for
me:

======
QTAGS = [
    ['**', 'b'],
    ['*', 'strong'],
    ['??', 'cite', :limit],
    ['-', 'del', :limit],
    ['__', 'i'],
    ['_', 'em', :limit],
    ['%', 'span', :limit],
    ['+', 'ins', :limit],
    ['^', 'sup'],
    ['~', 'sub']
]
QTAGS.collect! do |rc, ht, rtype|
    rcq = Regexp::quote rc
    re =
        case rtype
        when :limit
            /(\W)
            (#{rcq})
            (#{C})
            (?::(\S+?))?
            (\S.*?\S|\S)
            #{rcq}
            (?=\W)/x
        else
            /(#{rcq})
            (#{C})
            (?::(\S+))?
            (\S.*?\S|\S)
            #{rcq}/xm
        end
    [rc, ht, re, rtype]
end
======

My main problem is that any trial-and-error modification of the code to
fix one problem spawns 2 new ones.

I hope this makes my problem a bit clearer.
Thanks,
Bas

From mark at markjuh.net Wed Jul 5 05:56:57 2006
From: mark at markjuh.net (Mark van Eijk)
Date: Wed, 5 Jul 2006 11:56:57 +0200
Subject: 2 bugs when parsing emphasized or bold text
In-Reply-To: <20060702101343.GA4541@edison.luon.net>
References: <20060702101343.GA4541@edison.luon.net>
Message-ID: <20060705095657.GA19207@markjuh.net>

On Sun, Jul 02, 2006 at 12:13:43PM +0200, Bas Kloet wrote:
> I've found 2 bugs that produce (imho) incorrect rendering results:
>
> 1) The regexp for strong (*) and bold (**) is greedy, which produces
> very strange results.
>
> The simplest way to show the problem is to give an example.
>
> This is the original code:
>
> =====
> Strong:
> Lets do a little test *t*
> this should not be strong *u*.
>
> Bold:
> Lets do another test **t**
> this should not be bold **u**.
> =====
>
> And this is the (relevant part of) the html that is produced:
>
> =====
> Strong:
> Lets do a little test t*
> this should not be strong *u.
>
> Bold:
> Lets do another test t*
> this should not be bold *u.
> =====
>
> As you can see, the html produced is not exactly what you would expect.

I've taken a quick look at it and minimized your example a bit:

=====
*t* not strong *u*.
=====

This produces:

=====

t* not *u

=====

But the funny thing is that the following:

=====
*tt* not strong *u*.
=====

produces:

=====

tt not u

=====

So I don't think the matching is really greedy. It just doesn't handle
1-character cases very well.

Mark

From bas at luon.net Wed Jul 5 17:07:43 2006
From: bas at luon.net (Bas Kloet)
Date: Wed, 5 Jul 2006 23:07:43 +0200
Subject: 2 bugs when parsing emphasized or bold text
In-Reply-To: <20060705095657.GA19207@markjuh.net>
References: <20060702101343.GA4541@edison.luon.net>
 <20060705095657.GA19207@markjuh.net>
Message-ID: <20060705210743.GB4625@edison.luon.net>

On Wed, Jul 05, 2006 at 11:56:57AM +0200, Mark van Eijk wrote:
>
> So I don't think the matching is really greedy. It just doesn't handle
> 1-character cases very well.
>

Thanks, that makes the problem a lot clearer.

I found a fix for the _ problem when text spans multiple lines myself.
I removed the :limit from the following line:

---
['_', 'em', :limit]
---

There's probably a good reason why the :limit was there, but all the
tests I've run produced correct results, so for the moment I'm happy
about that.

Thanks for looking into the problem further.

Greetings,
Bas
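
Both effects discussed in this thread can be reproduced outside RedCloth with the two regexp shapes quoted from QTAGS. The following is only a sketch under assumptions: RedCloth's attribute pattern C is replaced by an empty stand-in (C_STUB), and the helper names qtag_regexp and qtag_content are hypothetical, not part of RedCloth.

```ruby
# Assumption: RedCloth's attribute pattern C is stubbed out as an empty
# string, so only the delimiter handling of the quoted regexps is exercised.
C_STUB = ''

# Hypothetical helper: builds the :limit or the default regexp variant
# in the same shape as the QTAGS code quoted in this thread.
def qtag_regexp(rc, rtype = nil)
  rcq = Regexp.quote(rc)
  if rtype == :limit
    /(\W)(#{rcq})(#{C_STUB})(?::(\S+?))?(\S.*?\S|\S)#{rcq}(?=\W)/x
  else
    /(#{rcq})(#{C_STUB})(?::(\S+))?(\S.*?\S|\S)#{rcq}/xm
  end
end

# Hypothetical helper: returns the text captured between the delimiters
# (the last capture group in both variants), or nil if nothing matches.
def qtag_content(text, rc, rtype = nil)
  m = qtag_regexp(rc, rtype).match(text)
  m && m.captures.last
end

# 1-character case: "\S.*?\S" is tried before the bare "\S", and the
# closing "*" itself counts as \S, so the match runs on to the last "*".
p qtag_content("*t* not strong *u*.", '*')    # prints "t* not strong *u"
p qtag_content("*tt* not strong *u*.", '*')   # prints "tt"

# Multi-line case: the :limit variant has no /m flag, so "." cannot
# cross the newline; the default variant has /m and can.
multi = "an _emphasized\nphrase_ here"
p qtag_content(multi, '_', :limit)            # prints nil
p qtag_content(multi, '_')                    # prints "emphasized\nphrase"

# Underscores inside a word: only the default variant matches here.
p qtag_content("foo_bar_baz", '_', :limit)    # prints nil
p qtag_content("foo_bar_baz", '_')            # prints "bar"
```

The last two lines hint at one possible reason for the :limit: without the \W boundaries, underscores inside identifiers such as foo_bar_baz would also be treated as emphasis. That is only an observation about these regexps, not a statement of the RedCloth author's intent.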