[Bug 308698] New: grep slow in utf8 locale
https://bugzilla.novell.com/show_bug.cgi?id=308698

           Summary: grep slow in utf8 locale
           Product: openSUSE 10.3
           Version: Beta 2
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Basesystem
        AssignedTo: schwab@novell.com
        ReportedBy: dmueller@novell.com
         QAContact: qa@suse.de
          Found By: ---

grep from 10.2:

time distpin foobla >/dev/null

real    0m6.311s
user    0m5.964s
sys     0m0.448s

grep from 10.3:

time distpin foobla >/dev/null

real    1m5.049s
user    1m3.012s
sys     0m1.224s

export LC_ALL=C
time distpin foobla >/dev/null

real    0m6.024s
user    0m5.692s
sys     0m0.400s

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

https://bugzilla.novell.com/show_bug.cgi?id=308698#c1

Ludwig Nussel <lnussel@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lnussel@novell.com
           Severity|Major                       |Critical

--- Comment #1 from Ludwig Nussel <lnussel@novell.com> 2007-09-12 03:49:20 MST ---
grep really is horribly slow

https://bugzilla.novell.com/show_bug.cgi?id=308698#c2

Stephan Kulow <coolo@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |coolo@novell.com
           Severity|Critical                    |Major

--- Comment #2 from Stephan Kulow <coolo@novell.com> 2007-09-14 02:25:19 MST ---
Ludwig, you should know:

Critical    Crash, loss of data, corruption of data, severe memory leak

https://bugzilla.novell.com/show_bug.cgi?id=308698#c3

--- Comment #3 from Dirk Mueller <dmueller@novell.com> 2007-09-14 06:10:31 MST ---
the 2.5.3 release does not help either, but using at least a released version
would probably still be better.

https://bugzilla.novell.com/show_bug.cgi?id=308698#c4

Bernd Strieder <strieder@informatik.uni-kl.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |strieder@informatik.uni-kl.de

--- Comment #4 from Bernd Strieder <strieder@informatik.uni-kl.de> 2007-09-27 07:47:59 MST ---
just found that I can confirm the regression.

openSuSE 10.3 RC1 without utf8

time nm *.o | LANG=C time grep -v " W " > link.nm
0.12user 0.13system 0:02.16elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+189minor)pagefaults 0swaps

real    0m2.179s
user    0m1.840s
sys     0m0.564s

openSuSE 10.3 RC1 with utf8

time nm *.o | time grep -v " W " > link.nm
335.10user 0.08system 5:36.78elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+252minor)pagefaults 0swaps

real    5m36.819s
user    5m36.817s
sys     0m0.528s

For comparison SuSE 9.0 with the same hardware:

time nm *.o | LANG=C time grep -v " W " > link.nm
0.04user 0.12system 0:08.58elapsed 1%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (134major+22minor)pagefaults 0swaps

real    0m8.611s
user    0m8.240s
sys     0m0.530s

Looking at the inner time output (the user time of grep alone) we are talking
here about hundredths of seconds vs. hundreds of seconds, a factor of over 1000
at the given input size.

grep is used by so many scripts within about every distro that this problem
could become a show-stopper.

https://bugzilla.novell.com/show_bug.cgi?id=308698#c5

--- Comment #5 from Dirk Mueller <dmueller@novell.com> 2007-11-12 07:14:23 MST ---
according to micha:

http://cvs.fedora.redhat.com/viewcvs/devel/grep/grep-2.5.1-egf-speedup.patch?rev=1.16&view=log

could help here.

https://bugzilla.novell.com/show_bug.cgi?id=308698#c6

--- Comment #6 from Michael Matz <matz@novell.com> 2007-11-12 07:30:46 MST ---
This patch doesn't apply as is on our grep, but debian also uses this patch,
maybe theirs works. It's part of

http://ftp.debian.org/debian/pool/main/g/grep/grep_2.5.3~dfsg-3.diff.gz

The redhat dfa-optional patch might also be interesting, but needs thorough
testing. It might or might not actually be more correct, see
https://bugzilla.redhat.com/show_bug.cgi?id=121313#10 .

Also see https://bugzilla.redhat.com/show_bug.cgi?id=69900 for the original
discussion leading to the egf-speedup patch.

https://bugzilla.novell.com/show_bug.cgi?id=308698

User bj@sernet.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c7

Bjoern Jacke <bj@sernet.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |koenig@linux.de

--- Comment #7 from Bjoern Jacke <bj@sernet.de> 2007-12-21 07:05:03 MST ---
*** Bug 338995 has been marked as a duplicate of this bug. ***
https://bugzilla.novell.com/show_bug.cgi?id=338995

https://bugzilla.novell.com/show_bug.cgi?id=308698

User bj@sernet.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c8

Bjoern Jacke <bj@sernet.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bj@sernet.de

--- Comment #8 from Bjoern Jacke <bj@sernet.de> 2007-12-21 07:30:07 MST ---
the more data the slower grep gets - exponentially:

yes | head -10000 | LC_ALL=de_DE.UTF-8 time grep . > /dev/null
.. 10s

yes | head -30000 | LC_ALL=de_DE.UTF-8 time grep . > /dev/null
.. 45s

a quite simple grep on a 70MB log file is a *real* problem in a UTF-8 locale.
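
The two timings in comment #8 (10s for 10000 lines, 45s for 30000) point to growth that is worse than linear: three times the input costs about 4.5 times the runtime. A minimal sketch for repeating that measurement at several sizes follows. It assumes a POSIX shell, a grep that supports -c, and that the locale under test is installed; the helper names count_matches and scale_test are made up for this illustration and are not part of grep or of the report.

```shell
# Hypothetical helpers for repeating the measurement from comment #8.

# Feed N lines of "y" to grep under the given locale; print the match count.
count_matches() {
    lines=$1
    loc=$2
    yes | head -n "$lines" | env LC_ALL="$loc" grep -c .
}

# Run the same search at increasing input sizes in one locale and report
# the elapsed wall-clock time in whole seconds (coarse, but the timings
# reported in this bug are in the range of tens of seconds).
scale_test() {
    loc=$1
    for n in 10000 20000 30000; do
        start=$(date +%s)
        count_matches "$n" "$loc" >/dev/null
        end=$(date +%s)
        echo "$n lines, LC_ALL=$loc: $((end - start))s"
    done
}

# Example invocations (the UTF-8 locale is assumed to be installed):
# scale_test C
# scale_test de_DE.UTF-8
```

If the elapsed time grows faster than the line count in the UTF-8 run but not in the C run, the slowdown is tied to the locale rather than to the input itself.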

https://bugzilla.novell.com/show_bug.cgi?id=308698

User dmueller@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c9

--- Comment #9 from Dirk Mueller <dmueller@novell.com> 2008-01-03 07:10:39 MST ---
ping..

https://bugzilla.novell.com/show_bug.cgi?id=308698

User mmarek@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c10

Michal Marek <mmarek@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |root@proximasociale.cz

--- Comment #10 from Michal Marek <mmarek@novell.com> 2008-01-15 02:10:05 MST ---
*** Bug 353718 has been marked as a duplicate of this bug. ***
https://bugzilla.novell.com/show_bug.cgi?id=353718

https://bugzilla.novell.com/show_bug.cgi?id=308698

User roland.kletzing@materna.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c11

roland kletzing <roland.kletzing@materna.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roland.kletzing@materna.de

--- Comment #11 from roland kletzing <roland.kletzing@materna.de> 2008-02-03 07:03:43 MST ---
> a quite simple grep on a 70MB log file is a *real* problem in a UTF-8 locale.

yes. i came across this today (opensuse 11 alpha1) after i wondered why
grepping through a larger amount of data did nothing but burn cpu cycles.

i needed a while to find out that this is a locale problem.

this is not obvious to the end-user and people will have major hassle because
of this. major annoyance, imho.

please fix !

https://bugzilla.novell.com/show_bug.cgi?id=308698

User nordhaus@informatik.hu-berlin.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c12

--- Comment #12 from Stefan Nordhausen <nordhaus@informatik.hu-berlin.de> 2008-02-03 09:44:28 MST ---
As a workaround, I installed grep from Suse 10.2 on my 10.3 system and it
worked without problems, performance was back to normal.

Why not just _down_grade grep to the version in Suse 10.2? It is better to have
an 'old' version that works than a new version that has been broken for 5
months now.

https://bugzilla.novell.com/show_bug.cgi?id=308698

User roland.kletzing@materna.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c13

--- Comment #13 from roland kletzing <roland.kletzing@materna.de> 2008-02-03 14:24:37 MST ---
well, _my_ workaround is unsetting the LC_CTYPE envvar - but i don't like
workarounds ;)

i personally don't have a real problem if i know the workaround, but i'm sure
other people will have a problem with this if they don't know what the issue
is.

and this is why they will say: nahh. suse is crap. even such basic tools like
grep have issues on that. i'll go back to ubuntu. or use cygwin on windoze. or
whatever. (ok, apparently, other distros may have that issue, too)

imho it's a quality characteristic if things "just work" without bells and
whistles, and that's why i complain.
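
The workarounds mentioned so far (export LC_ALL=C in the original report, unsetting LC_CTYPE in comment #13) change the locale for the whole shell session. A narrower variant is to override the locale for a single grep call only. The following is a minimal sketch of that idea; the function name cgrep is made up for illustration. Note that in the C locale grep works on bytes, so patterns containing non-ASCII characters, "." and case-insensitive matching may give different results than in a UTF-8 locale.

```shell
# Hypothetical wrapper: run grep in the C locale for this one call only,
# leaving the locale of the surrounding shell untouched.
cgrep() {
    env LC_ALL=C grep "$@"
}

# Usage, mirroring the pipeline from comment #4:
# nm *.o | cgrep -v " W " > link.nm
```

Using env rather than a plain variable assignment keeps the override out of the calling shell and bypasses any alias or function that is also named grep.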

https://bugzilla.novell.com/show_bug.cgi?id=308698

User dmueller@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c14

--- Comment #14 from Dirk Mueller <dmueller@novell.com> 2008-02-04 02:53:31 MST ---
vote for it! ;)

https://bugzilla.novell.com/show_bug.cgi?id=308698

User roland.kletzing@materna.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c15

--- Comment #15 from roland kletzing <roland.kletzing@materna.de> 2008-02-04 16:45:40 MST ---
apparently, it's not only grep which is suffering from this problem:

linux:/tmp # echo $LC_CTYPE
de_DE.UTF-8

linux:/tmp # time cat largefile |cut -d ":" -f 2- >/dev/null

real    0m43.832s
user    0m6.008s
sys     0m0.160s

linux:/tmp # unset LC_CTYPE

linux:/tmp # time cat largefile |cut -d ":" -f 2- >/dev/null

real    0m2.620s
user    0m0.208s
sys     0m0.132s

https://bugzilla.novell.com/show_bug.cgi?id=308698

User lyeoh@inter-touch.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c16

Lincoln Yeoh <lyeoh@inter-touch.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lyeoh@inter-touch.com

--- Comment #16 from Lincoln Yeoh <lyeoh@inter-touch.com> 2008-02-18 04:43:03 MST ---
Maybe it doesn't cause a crash, but it's still very bad if you had been relying
on grep not to suddenly be slower by an order of magnitude (20x), especially on
a new machine that's 2x faster, but overall 10X slower with grep "Vista
Edition" installed. 20 times slower is terrible.

e.g.
grep = grep-2.5.2-28
grep251 = grep-2.5.1a-40 from a 10.2 box, which isn't fast either - in fact
it's quite slow too in some cases.

time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m2.145s
user    0m1.920s
sys     0m0.224s

time grep -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m44.081s
user    0m43.819s
sys     0m0.256s

time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m2.132s
user    0m1.868s
sys     0m0.264s

time grep251 -E "asdadd|trerwer" messages-20080218

real    0m49.606s
user    0m48.527s
sys     0m0.356s

Note that for the following dual pattern query grep251 is also very slow.

time grep -E "asdadd|trerwer" messages-20080218

real    0m50.177s
user    0m49.939s
sys     0m0.232s

time grep251 -E "asdadd" messages-20080218

real    0m0.759s
user    0m0.492s
sys     0m0.224s

But 251 is not too bad one pattern at a time, while grep-2.5.2-28 is bad.

time grep251 -E "trerwer" messages-20080218

real    0m0.686s
user    0m0.460s
sys     0m0.228s

time grep -E "trerwer" messages-20080218

real    0m43.013s
user    0m42.719s
sys     0m0.204s

time grep -E "asdadd" messages-20080218

real    0m42.810s
user    0m42.515s
sys     0m0.272s

export LC_ALL="C"

time grep251 -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m1.535s
user    0m1.280s
sys     0m0.256s

time grep -E "(duosdwrwr|asdsadd)" messages-20080218

real    0m1.568s
user    0m1.312s
sys     0m0.248s

time grep -E "asdadd|trerwer" messages-20080218

real    0m2.152s
user    0m1.968s
sys     0m0.180s

time grep251 -E "asdadd|trerwer" messages-20080218

real    0m2.154s
user    0m1.840s
sys     0m0.312s

Must we start turning locale off by default? Maybe getting locale right is not
easy, but 40 seconds for a foo|bar search? Why not just 2 x longer than a
single pattern search?

https://bugzilla.novell.com/show_bug.cgi?id=308698

User dmueller@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c17

--- Comment #17 from Dirk Mueller <dmueller@novell.com> 2008-02-18 07:05:06 MST ---
the reason for that is that grep added mbrtowc in the hottest code path, and
that is probably one of the slowest glibc functions to call. I've rewritten
that code, and it is now ~ factor 10 faster (which makes it still a factor 2
regression).

the debian patches remove that code path altogether, but they disable the CW
algorithm, which gives an even bigger speed regression. so while I like getting
rid of the code path, disabling the CW algorithm is a no-go.

the correct solution would be to adapt the cw matcher to use wide chars, which
would be roughly on par with the old speed and allow it to be used.

the new suse patches that were added are just entirely broken ;(

https://bugzilla.novell.com/show_bug.cgi?id=308698

User roland.kletzing@materna.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c18

--- Comment #18 from roland kletzing <roland.kletzing@materna.de> 2008-03-29 12:46:49 MST ---
so no fix for this in factory yet?

i updated to latest today, but....

linux:/tmp # time cat rpm-no-files.txt |grep -v "^/proc" |grep -v "^/sys" |grep -v "^/dev" >rpm-no-files.txt2

real    2m9.914s
user    1m46.443s
sys     0m1.348s

linux-trkh:~ # echo $LC_CTYPE
de_DE.UTF-8

linux:/tmp # unset LC_CTYPE

linux:/tmp # time cat rpm-no-files.txt |grep -v "^/proc" |grep -v "^/sys" |grep -v "^/dev" >rpm-no-files.txt2

real    0m1.938s
user    0m0.860s
sys     0m1.052s

and this is just a 20mb file....

https://bugzilla.novell.com/show_bug.cgi?id=308698

User coolo@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c19

--- Comment #19 from Stephan Kulow <coolo@novell.com> 2008-05-20 07:29:49 MST ---
*** Bug 381873 has been marked as a duplicate of this bug. ***
https://bugzilla.novell.com/show_bug.cgi?id=381873

https://bugzilla.novell.com/show_bug.cgi?id=308698

User dmueller@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c22

--- Comment #22 from Dirk Mueller <dmueller@novell.com> 2008-05-28 05:47:01 MDT ---
the new submission from Andreas Schwab labeled "Some speadups" does improve the
situation:

11.0 package:

real    0m17.966s
user    0m16.301s
sys     0m0.308s

new submission:

real    0m6.727s
user    0m5.276s
sys     0m0.304s

10.2 package:

real    0m3.114s
user    0m2.708s
sys     0m0.224s

so there is room for improvement, but a factor of 3 is already quite good
progress.

https://bugzilla.novell.com/show_bug.cgi?id=308698

User dmueller@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c23

--- Comment #23 from Dirk Mueller <dmueller@novell.com> 2008-05-28 05:47:52 MDT ---
the numbers above do not match those in comment #1 anymore as I meanwhile have
a different machine with a different architecture.

https://bugzilla.novell.com/show_bug.cgi?id=308698

User koenig@linux.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c25

--- Comment #25 from Harald Koenig <koenig@linux.de> 2008-05-28 10:51:54 MDT ---
(In reply to comment #22 from Dirk Mueller)
> the new submission from Andreas Schwab labeled "Some speadups" does improve the
> situation:

is this new submission available for testing ?

https://bugzilla.novell.com/show_bug.cgi?id=308698

User jengelh@gmx.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c26

Jan Engelhardt <jengelh@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jengelh@gmx.de

--- Comment #26 from Jan Engelhardt <jengelh@gmx.de> 2008-05-29 10:16:57 MDT ---
Forwarded to upstream some time ago.
http://www.nabble.com/horrible-utf-8-performace-in-wc-td17094488.html

https://bugzilla.novell.com/show_bug.cgi?id=308698

User schwab@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c27

Andreas Schwab <schwab@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmeeks@novell.com

--- Comment #27 from Andreas Schwab <schwab@novell.com> 2008-06-10 07:53:45 MDT ---
*** Bug 394665 has been marked as a duplicate of this bug. ***
https://bugzilla.novell.com/show_bug.cgi?id=394665

https://bugzilla.novell.com/show_bug.cgi?id=308698

Dirk Mueller <dmueller@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Found By|---                         |Development

https://bugzilla.novell.com/show_bug.cgi?id=308698

User schwab@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c28

Andreas Schwab <schwab@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #28 from Andreas Schwab <schwab@novell.com> 2008-11-05 03:29:02 MST ---
Fixed.

https://bugzilla.novell.com/show_bug.cgi?id=308698

User matz@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c29

--- Comment #29 from Michael Matz <matz@novell.com> 2008-11-05 06:37:20 MST ---
For some expressions this is indeed much better. But for other, very simple
regexps (that match many lines) it's still very slow:

% time LANG=C grep . /var/log/messages > /dev/null

real    0m0.010s
user    0m0.004s
sys     0m0.004s

% time LANG=de_DE.UTF-8 grep . /var/log/messages > /dev/null

real    0m2.099s
user    0m2.072s
sys     0m0.028s

Interestingly en_EN.UTF-8 is faster again:

% time LANG=en_EN.UTF-8 grep . /var/log/messages > /dev/null

real    0m0.010s
user    0m0.008s
sys     0m0.000s

So, something is still fishy. Should I create a new PR or just reopen this
one?

https://bugzilla.novell.com/show_bug.cgi?id=308698

User jengelh@medozas.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c30

--- Comment #30 from Jan Engelhardt <jengelh@medozas.de> 2008-11-05 07:57:39 MST ---
That is because en_EN is not a known locale in the first place. Pick one that
is, perhaps, valid, before you test?

https://bugzilla.novell.com/show_bug.cgi?id=308698

User mfabian@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c31

--- Comment #31 from Mike Fabian <mfabian@novell.com> 2008-11-05 08:14:43 MST ---
In case of unknown locales, C/POSIX is used as a fallback and that
seems to be the case in Michael’s benchmark.
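
One way to avoid benchmarking against a silently substituted C/POSIX locale is to check the locale name against the output of locale -a before timing anything. The following is a minimal sketch of such a check. It assumes a system where the locale command exists and lists installed locales, typically with the codeset printed in a normalised form such as "utf8"; the function name locale_known is made up for illustration.

```shell
# Hypothetical helper: succeed if the given locale name appears in the
# output of `locale -a`. Both sides are lower-cased and stripped of
# hyphens so that "de_DE.UTF-8" matches an entry printed as "de_DE.utf8".
locale_known() {
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    locale -a 2>/dev/null \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/-//g' \
        | grep -xF -- "$want" >/dev/null
}

# Check the two locale names used in comment #29.
for loc in de_DE.UTF-8 en_EN.UTF-8; do
    if locale_known "$loc"; then
        echo "$loc: listed by locale -a"
    else
        echo "$loc: not listed, expect a fallback to C/POSIX"
    fi
done
```

A benchmark script could refuse to run when locale_known fails, so that a typo in the locale name does not produce a misleadingly fast result.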

https://bugzilla.novell.com/show_bug.cgi?id=308698

User matz@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c32

--- Comment #32 from Michael Matz <matz@novell.com> 2008-11-05 08:20:43 MST ---
Bah, indeed :) Okay, that makes my point just more valid, UTF-8 locales are
still much slower.

https://bugzilla.novell.com/show_bug.cgi?id=308698

User strieder@informatik.uni-kl.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c33

--- Comment #33 from Bernd Strieder <strieder@informatik.uni-kl.de> 2008-11-06 11:32:48 MST ---
Whoever needs grep searches with utf-8 support should possibly look at
pcregrep. Maybe there is enough pain to replace grep with pcregrep, given it is
compatible enough. I haven't checked yet.

Anything else will involve reimplementing grep from scratch to get proper UTF-8
support. Most occurrences of char will have to be replaced by wchar_t and
conversion has to be done on input of both the pattern and the search data,
involving regular expression matching of its own.

https://bugzilla.novell.com/show_bug.cgi?id=308698

User jengelh@medozas.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c34

--- Comment #34 from Jan Engelhardt <jengelh@medozas.de> 2008-11-06 16:19:06 MST ---
> Whoever needs grep searches with utf-8 support should possibly look at
> pcregrep.

What would it offer over grep -P?

https://bugzilla.novell.com/show_bug.cgi?id=308698

User jengelh@medozas.de added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c35

--- Comment #35 from Jan Engelhardt <jengelh@medozas.de> 2008-11-06 16:22:46 MST ---
We got a winner...

$ yes | head -n10000 | LC_ALL=de_DE.UTF-8 time grep . >/dev/null
11.94user 0.00system 0:12.20elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+272minor)pagefaults 0swaps

$ yes | head -n10000 | LC_ALL=de_DE.UTF-8 time grep -P . >/dev/null
0.00user 0.00system 0:00.01elapsed 57%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+246minor)pagefaults 0swaps

https://bugzilla.novell.com/show_bug.cgi?id=308698

User matz@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=308698#c36

--- Comment #36 from Michael Matz <matz@novell.com> 2008-11-07 05:29:27 MST ---
No, pcregrep (or grep -P) is no option, as the regular expression syntax is
that of perl, not of grep. Yes, pcre is faster with UTF-8, it always was, but
that's irrelevant.

Re comment #33: we have this conversation because grep of course _does_ support
UTF-8 inputs just fine (as some other multi-byte locales). It is only slow.
And yes, grep does implement some specialized versions of the matcher for
these locales. No need for any reimplementation from scratch.
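
To illustrate the syntax point from comment #36: the same pattern string can mean different things to plain grep and to grep -P, so existing scripts cannot simply be switched over. The following is a small sketch; it assumes a grep built with PCRE support for the second function (the -P option is not available in every build), and the sample input and function names are made up for illustration.

```shell
# In a POSIX basic regular expression "+" is an ordinary character,
# so this matches only the line that literally contains "a+".
bre_count() {
    printf 'aa\na+\n' | grep -c 'a+'
}

# With -P the same pattern is read as a Perl regex: "+" is a quantifier,
# so every line containing at least one "a" matches.
# Requires a grep built with PCRE support.
pcre_count() {
    printf 'aa\na+\n' | grep -P -c 'a+'
}
```

Calling bre_count and pcre_count on the same two-line input gives different counts, which is the kind of silent behaviour change a drop-in replacement would introduce.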