Mailinglist Archive: opensuse-bugs (9719 mails)

[Bug 329992] Invalid behavior of grep with regexp ^[A-Z]
  • From: bugzilla_noreply@xxxxxxxxxx
  • Date: Tue, 27 Nov 2007 13:11:21 -0700 (MST)
  • Message-id: <20071127201121.C07C0CC6B5@xxxxxxxxxxxxxxxxxxxxxx>
https://bugzilla.novell.com/show_bug.cgi?id=329992#c23

--- Comment #23 from Michael Matz <matz@xxxxxxxxxx> 2007-11-27 13:11:21 MST ---
The code from comment #21 isn't reached in multibyte locales. When parsing
the initial '[', grep does this relatively early:

#ifdef MBS_SUPPORT
  if (MB_CUR_MAX > 1)
    {
      /* In multibyte environment a bracket expression may contain
         multibyte characters, which must be treated as characters
         (not bytes). So we parse it by parse_bracket_exp_mb(). */
      parse_bracket_exp_mb();
      return lasttok = MBCSET;
    }
#endif

which will then set dfa->mbcsets[dfa->nmbcsets++] to the range in question.
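
To see which path a given locale takes, the MB_CUR_MAX test can be tried in isolation. This is only a sketch: takes_multibyte_path() is a hypothetical helper, not something in grep, and it changes the process-wide LC_CTYPE setting as a side effect.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper, not part of grep: switch LC_CTYPE to the named
   locale and report whether the "MB_CUR_MAX > 1" branch would be taken.
   Returns 1 for a multibyte locale, 0 for a single-byte one, and -1 if
   the locale could not be set. */
int takes_multibyte_path(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return MB_CUR_MAX > 1;
}
```

For the "C" locale this gives 0, so the multibyte branch is skipped there. A UTF-8 locale, if installed, should give 1.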

The actual matching is done in dfaexec, which for this regexp eventually
calls match_mb_charset(). That function uses wcscoll(3) to determine whether
the character is in the requested ranges. For 'p' this returns true, as
expected from the collating sequence. That is why 'pippo' is matched at all,
which is correct.

When it then comes to printing this line, grep uses print_line_middle().
That function, for every line it is supposed to print, runs a regexp search
for matches in order to colorize them. _That_ matching is done with the
normal libc regexp machinery, like so:

print_line_middle:
  while ( lim > cur
         && ((match_offset = execute(beg, lim - beg, &match_size,
                                     beg + (cur - beg))) != (size_t) -1))
    {
      .... do something from match_offset ...

execute is EGexecute for us, hence:

EGexecute:
  for (beg = end = buf; end < buflim; beg = end)
    {
      if (!start_ptr) // it's non-zero for us
      else
        for (i = 0; i < pcount; i++)
          {
            patterns[i].regexbuf.not_eol = 0;
            if (0 <= (start = re_search (&(patterns[i].regexbuf),
                                         buf, end - buf - 1,
                                         beg - buf, end - beg - 1,
                                         &(patterns[i].regs))))
              {

So this goes through all defined patterns (in our case only one) and uses
re_search on the line with each pattern. The regexbuf member is filled by
calls to re_compile_pattern in GEAcompile. re_search and re_compile_pattern
come from libc; grep is not using the versions from lib/regex.c.
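
The "search the line again for every match" loop can be sketched with the POSIX regcomp()/regexec() interface. This is an assumption for illustration: grep itself uses the GNU re_search() call shown above, and count_matches() is a hypothetical helper.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count the non-overlapping matches of a basic
   regular expression in line, walking the line the way a highlighter
   would. Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t m;
    const char *cur = line;
    int count = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    while (*cur != '\0') {
        /* After the first search we are no longer at the start of the
           line, so '^' must not match again. */
        int flags = (cur == line) ? 0 : REG_NOTBOL;

        if (regexec(&re, cur, 1, &m, flags) != 0)
            break;                      /* no further match */

        if (m.rm_eo == m.rm_so) {       /* empty match: step past it */
            if (cur[m.rm_so] == '\0')
                break;
            cur += m.rm_so + 1;
            continue;
        }

        count++;
        cur += m.rm_eo;                 /* resume after this match */
    }

    regfree(&re);
    return count;
}
```

If this search finds nothing on a line that the DFA matcher already selected, the line is still printed but nothing in it gets colorized.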

I've verified that re_search does match the capital 'P' but not the small 'p'.
So this definitely disagrees with what wcscoll() returns for the range
endpoints and 'p'.

