Mailinglist Archive: opensuse-web (14 mails)
Re: [opensuse-web] Wiki search left en.opensuse.org
- From: Christian Boltz <opensuse@xxxxxxxxx>
- Date: Sun, 15 Apr 2012 19:02:10 +0200
- Message-id: <10269640.6GJp3YoHRl@tux.boltz.de.vu>
Hello,
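On Thursday, 5 April 2012, Henne Vogelsang wrote:
> On 04/05/2012 02:05 AM, Rajko M. wrote:
>> How about changing default to search "Everything" (all namespaces
>> without few that are in essence tools)
>
> Please don't. Especially openSUSE: is one big chaotic info dump again,
> because nobody gives a shit about the wiki rules. I don't want our
> distro users to suffer from this just because...

One more reason to make the search better. (No smiley, I'm serious about
that.)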
IMHO this includes searching all namespaces - and rating the main and
portal namespace up so that they are always on top of the search
results.
I had an interesting IRC discussion with yaloki on this some months ago.
The TL;DR summary is something like "lucene itself is a very good
search engine, but the implementation in MediaWiki / the MWSearch
extension is just broken".
Pascal was quite shocked when reading the MWSearch code. It more or less
does the equivalent of "SELECT ... WHERE content LIKE '%searchword%'".
In other words: lucene has lots of features to get better search results
and to influence the search result order - but we don't use them.
The people involved in the discussion allowed me to send the IRC log. It's
attached and contains lots of useful knowledge, so please read it ;-)
(If you are in a hurry, you can start reading at 01:17)
Regards,
Christian Boltz
--
TikiWiki is a very comprehensive collection of security holes,
conceptual problems and performance killers that can do everything and
nothing properly. [Kristian Köhntopp at
http://blog.koehntopp.de/archives/2051-5-Jahre-Blogging.html]

2011-12-23 #opensuse-project
[01:06] <warlordfff> I lost you when you were saying that openfate sucks, right?
[01:07] <warlordfff> what happened you all left?
[01:07] <Ilmehtar> warlordfff: I don't think openfate sucks..I just think that
it's surplus to requirements - why have bugzilla + openfate especially when we
don't actually rely on either to *define* our future features..just help guide
things along
[01:07] <warlordfff> I think openFate is a great idea
[01:07] <warlordfff> but it might need work
[01:08] <warlordfff> but sadly I don't have any ideas about it
[01:08] <warlordfff> so I stand in my corner on that
[01:09] <yaloki> searchability sucks monumentally, unfortunately
[01:09] <yaloki> try searching for "board 2012 elections platform pages" on the
wiki
[01:09] <yaloki> zilch
[01:09] <Ilmehtar> it's a good idea but end of the day from a contributor's
perspective, when someone thinks "I want to do something new for
openSUSE"..where do they look? bugzilla? openfate? it would be much nicer to
have a single spot where a contributor can look ..oh that needs doing, great
[01:09] <yaloki> you can't even find any page about the 2012 elections with
that using google
[01:10] <yaloki> Ilmehtar: right
[01:10] <yaloki> openfate is nice in its own right, but it's pointless to
duplicate, the same can be achieved with bugzilla
[01:10] <Ilmehtar> also from a user's perspective..they want a new
feature..where do they file it? bugzilla as an enhancement? openfate as a new
feature?
[01:10] <warlordfff> does bugzilla have RSS or something?
[01:11] <yaloki> some
[01:11] * Ilmehtar grumbles about that gnome3 ML troll
[01:11] <yaloki> Ilmehtar: don't feed :)
[01:11] <Ilmehtar> yaloki: not any more..I tried to be nice, now I'm going to
STFU and hope he does too
[01:12] <simon321> Ilmehtar: openFATE for features; although it is almost as
complicated as bugzilla
[01:12] <Ilmehtar> bugzilla also lets you follow others, so like the gnome team
all follow the default assignee for gnome bugs so we all get emails about the
changes to gnome bugs that aren't assigned to anyone specifically
[01:12] <warlordfff> oh you talk about the guy who said that Gnome3 sucks on
openSUSE?
[01:12] <Ilmehtar> simon321: where do features end, and enhancements begin? :)
[01:12] <simon321> Ilmehtar: it is the same backet
[01:12] <simon321> basket
[01:14] <Ilmehtar> warlordfff: and yes, I was, just saw his latest and decided
to give up trying to use logic to have a meaningful discussion with the fella
[01:14] <simon321> in general there is only one thing bad with the openSUSE
web infrastructure - it is not connected
[01:14] <yaloki> simon321: and search
[01:14] <warlordfff> Ilmehtar: "the full name of "LINUX" is "Linux is not
Unix" neither "Windows" or "iOS"" =EPIC
[01:15] <yaloki> wiki search is the most ridiculous I've ever seen
[01:15] <yaloki> there is lots of content you can't even find by searching
[01:15] <simon321> yaloki: if you have no idea how to connect things then
nothing works as expected, including search
[01:15] <Ilmehtar> warlordfff: lol, yes, I have to admit his antics have given
me quite a few laughs, but still, would be nice to actually have people
constructively critique g3 rather than just bitch about it
[01:15] <yaloki> simon321: search is broken because of a bad decision
[01:15] <yaloki> simon321: it is restricted intentionally
[01:16] <Ilmehtar> yaloki: why and how?
[01:16] <yaloki> simon321: having a proper search engine across all parts of
the infrastructure would be awesome, yes, but also pretty tough to implement
[01:16] <warlordfff> why don't we take baby steps and try to make Search work
properly and then move to other stuff?
[01:16] <yaloki> Ilmehtar: only the portal pages are searched by default
[01:16] <yaloki> Ilmehtar: henne thought, still thinks, and insists that that
is the right way
[01:16] <Ilmehtar> yaloki: I did not know that..glad I've started tuning the
gnome portal page..
[01:17] <yaloki> Ilmehtar: he had the illusion that people would then be forced
to put everything in portal pages
[01:17] <simon321> yaloki: apropos wiki search, we can let everyone see
everything - if henne lets that happen :) , but we still need presentation space
separated from support
[01:17] <Ilmehtar> yaloki: it would have a hope of working, if people knew
about it..
[01:17] <yaloki> simon321: please define "presentation space" and "support"
[01:17] <yaloki> simon321: for search, you don't separate: show everything you
can, with good relevance ranking :)
[01:18] <yaloki> Ilmehtar: even then it's stupid imho: if content is there, let
people find it and use it, nevermind how it's structured
[01:18] <simon321> yaloki: how to do that - where is relevance written in the
wiki
[01:18] <warlordfff> guys, at oSC11 Henne admitted that he needs hands to
organize the wiki
[01:18] <yaloki> Ilmehtar: it's not like there are 10 people working full time
on maintaining content and structure there
[01:18] <yaloki> simon321: in the search engine implementation
[01:19] <yaloki> simon321: mehle moved from the default search+indexing engine
to lucene
[01:19] <yaloki> simon321: I don't know whether the relevance scoring is good
or not, that depends on how you tune lucene
[01:19] <yaloki> you can influence which fields get higher scores for matches,
etc...
[01:20] * yaloki has done quite a lot with Solr, which is a layer on top of
Lucene
[01:20] <yaloki> including on a mediawiki
[01:20] <simon321> yaloki: on the other side, what is relevant for one is not
for the other see http://en.opensuse.org/Portal:Wiki#Structure
[01:20] <cboltz> the perfect solution[tm] would probably be to add some "bonus
points" to search results in the main and portal namespace so that they always
appear as top search result, and the other (now usually hidden) results could
follow below
[01:21] <yaloki> simon321: partly true, but it can usually be done pretty well,
when you know how the implementation works, and when you spend some time tuning
it
[01:21] <yaloki> cboltz: yes, that can be done easily
[01:21] <yaloki> lucene (and solr) has "boosting"
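The "bonus points" idea from cboltz can be sketched as a per-namespace
multiplier applied to a base relevance score. This is only an illustration in
Python, not MWSearch or Lucene code: the namespace names, boost factors and
base scores below are assumptions made up for the example.

```python
# Hypothetical boost factors: main and portal pages get "bonus points".
NAMESPACE_BOOST = {"Main": 3.0, "Portal": 3.0, "SDB": 1.5}


def rank(results):
    """Sort (title, namespace, base_score) tuples by boosted score, best first.

    Namespaces without an entry keep their base score (factor 1.0), so they
    still show up, just further down the list.
    """
    def boosted(result):
        _title, namespace, base_score = result
        return base_score * NAMESPACE_BOOST.get(namespace, 1.0)

    return sorted(results, key=boosted, reverse=True)


# Made-up base scores, as a text-matching engine might return them.
results = [
    ("openSUSE:Board election 2012", "openSUSE", 0.9),
    ("Portal:Board", "Portal", 0.5),
    ("Board", "Main", 0.4),
]
ranked_titles = [title for title, _ns, _score in rank(results)]
print(ranked_titles)
```

A multiplier only makes main and portal pages likely to come first. If they
must always be on top, sorting by a (preferred namespace, score) pair would
be the stricter variant.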
[01:21] <yaloki> and a big solr instance could be deployed to index all the
parts of the infrastructure too
[01:22] <simon321> yaloki: I can't know who is the visitor, it is up to him/her
to select what part he wants to see
[01:22] <yaloki> lists, forums, wiki, openfate, bugzilla, ?
[01:22] <yaloki> but it would require implementation efforts for each part
[01:22] <yaloki> simon321: no, why?
[01:22] <yaloki> simon321: if you search for "nvidia"
[01:23] <yaloki> simon321: just give all the pages that mention nvidia, with
higher scoring for pages that have nvidia in their title, and pages that
mention it more frequently, etc...
[01:23] <yaloki> (that's what such proper search engines do, unlike just using
mysql for search)
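A minimal Python sketch of the difference yaloki describes: a substring filter
(roughly what a SQL LIKE does) only says whether a page matches, while a
scoring function can rank pages by title matches and term frequency. The page
data, the title weight of 5 and the function names are assumptions for
illustration; Lucene's real scoring is more elaborate.

```python
import re

TITLE_WEIGHT = 5.0  # assumed value: a title hit counts as much as 5 body hits


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def like_search(pages, word):
    """Substring filter: returns matching titles in input order, unranked."""
    return [p["title"] for p in pages
            if word.lower() in (p["title"] + " " + p["body"]).lower()]


def scored_search(pages, word):
    """Rank pages by the weighted count of the word in title and body."""
    word = word.lower()
    scored = []
    for page in pages:
        score = (TITLE_WEIGHT * tokenize(page["title"]).count(word)
                 + tokenize(page["body"]).count(word))
        if score > 0:
            scored.append((score, page["title"]))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _score, title in scored]


pages = [
    {"title": "Hardware notes", "body": "One NVIDIA card was tested."},
    {"title": "SDB:NVIDIA drivers", "body": "Install the nvidia driver."},
    {"title": "Graphics", "body": "nvidia, nvidia and more nvidia options."},
    {"title": "KDE", "body": "Desktop environment."},
]
print(like_search(pages, "nvidia"))
print(scored_search(pages, "nvidia"))
```

Both functions find the same three pages, but only the second one puts the
page with "nvidia" in its title first and the passing mention last.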
[01:23] <simon321> yaloki: 1) is nvidia supported, 2) do I have problem with
nvidia, which one is the visitor looking for?
[01:24] <yaloki> simon321: no, that's not how search works
[01:24] <yaloki> simon321: it's search, not support on irc :)
[01:24] <yaloki> simon321: search relevance obviously cannot be subjective
[01:24] <yaloki> simon321: it's search relevance through scoring and boosting
in the search engine implementation
[01:25] <cboltz> yaloki: to start with small steps - how exactly can we
implement boosting in the wiki search?
[01:26] <yaloki> cboltz: through the configuration
[01:26] <cboltz> (I don't say your "big" solution is wrong, however it will
need time. And until then, small steps are better than nothing ;-)
[01:26] <yaloki> cboltz: no it doesn't need time
[01:26] <simon321> yaloki: I have no idea what guys want to see, and pretending
that one word reveals that, is being a bit too confident - some words will
actually tell, as "nvidia", other as may be too ambitious
[01:26] <yaloki> cboltz: can be done in an hour
[01:26] <yaloki> simon321: sure, but there are techniques for that too
[01:26] <yaloki> simon321: we're not the first ones to implement search :)
[01:26] <cboltz> "configuration" - yes of course ;-)
[01:27] <yaloki> stopwords, etc..
[01:27] <cboltz> any pointers about details? ;-)
[01:27] <yaloki> cboltz: I don't know how the lucene integration in mediawiki
works
[01:27] <yaloki> cboltz: I've implemented it from scratch with Solr in half a
day
[01:27] <yaloki> cboltz: and there you can configure the boosting in the Solr
configuration file
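For illustration, query-time boosting in Solr can be expressed through the
parameters of the extended DisMax query parser: `qf` weights fields and `bq`
adds a boost query. The Python sketch below only builds such a request URL.
The host, the core name and the field names (`title`, `categories`, `text`,
`namespace`) are hypothetical and would have to match the actual schema; the
same parameters could instead be set as request handler defaults in
solrconfig.xml.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/wiki/select"


def build_search_url(user_query):
    """Build a Solr select URL that boosts title and category matches."""
    params = {
        "q": user_query,
        "defType": "edismax",
        # Field weights: a title match counts most, then categories, then text.
        "qf": "title^5 categories^3 text",
        # Boost query: push main and portal namespace pages up.
        "bq": "namespace:(Main OR Portal)^2",
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url("nvidia driver")
print(url)
```

The weights are starting points only. As discussed further down in the log,
they would be tuned by trying a few searches and comparing the result order
with what is expected.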
[01:27] <simon321> yaloki: I know only one that works well, disambiguation
pages (but they are not listed as official navigation tool)
[01:27] <yaloki> cboltz: and, of course, stop only giving results from portal
pages
[01:28] <cboltz> when creating the search index or when the search query runs?
[01:28] <yaloki> simon321: no, really, it works well
[01:28] <yaloki> cboltz: you can do both
[01:28] <yaloki> cboltz: and then it also depends on what is actually being
indexed
[01:28] <yaloki> cboltz: the implementation I did also indexes the wiki
categories the page is on
[01:28] <yaloki> cboltz: and boosts higher on category word matches
[01:29] <warlordfff> type KDE, it is a bit funny since it does not give you
the pages found but gets you straight to the KDE page
[01:29] <yaloki> you can also tune on nearest matches
[01:29] <cboltz> the index probably contains all wiki pages - at least
searching with "all: $searchword" works
[01:29] <yaloki> warlordfff: yeah, that's wrong too imho
[01:29] <warlordfff> yeap
[01:29] <warlordfff> also
[01:30] <yaloki> actually
[01:30] <warlordfff> if you are searching e.g from Greece it should get you to
the Greek page if available
[01:30] <yaloki> when you think of how to implement search
[01:30] <yaloki> it's *very* simple
[01:30] <yaloki> it must be like google
[01:30] <yaloki> period
[01:30] <yaloki> that's what people expect
[01:30] <cboltz> it should be possible to change that behaviour for "KDE" -
probably an easy fix in the search form
[01:30] <yaloki> give a list of results, with relevance
[01:30] <simon321> yaloki: going straight to the page is how wikipedia works -
but for terms like KDE they offer listing of topics - not presentation
[01:31] <yaloki> simon321: yes, but that's not what people expect
[01:31] <yaloki> as said, it must behave like google search
[01:31] <yaloki> that's what 99% of people use 99% of the time ^^
[01:31] <simon321> well, I do, just because it works fine on wikipedia :)
[01:31] <yaloki> simon321: it works fine on wikipedia because they have a very
stringent design of the page names and disambiguation
[01:32] <yaloki> simon321: which we don't, and never will have, simply because
it would require a lot more people to do some caretaking of the wiki
[01:33] <warlordfff> the care taking is a problem
[01:33] <warlordfff> Henne told us in that BoF that there are only 5-6 people
doing that
[01:34] <warlordfff> so he is the last I would blame on that
[01:34] <yaloki> cboltz: I actually proposed to matthew that I give him my code
for solr search and explain to him how it works etc..
[01:34] <yaloki> cboltz: but then there was no followup and suddenly he did
something, no one knows what, with lucene
[01:34] <yaloki> you can't expect people to contribute when things are done
like that
[01:35] <cboltz> indeed, that's understandable :-/
[01:36] <yaloki> cboltz: so I can't tell you what needs to be configured where,
never used the mediawiki lucene plugin (as I guess there is such a thing and
that's what has been used)
[01:36] <yaloki> cboltz: that being said, lucene also has boosting and such, as
Solr builds upon that
[01:36] <simon321> well, last time there was discussion, he got working lucene,
and not that much time to spend on alternatives
[01:36] <yaloki> cboltz: and well it just takes a bit of tuning, trial and error
[01:36] <yaloki> cboltz: think of a few searches and what you think would make
sense as results
[01:36] <yaloki> cboltz: then try, analyze and tune until you get that result
[01:37] <yaloki> (solr has a good analyzer too)
[01:37] <cboltz> yes, there is a MW extension (and I'm also using it in a wiki
I maintain)
[01:37] <yaloki> simon321: I offered him my help
[01:37] <cboltz> but IIRC I didn't see anything about boosting in its
documentation
[01:37] <yaloki> simon321: I know exactly how to do it, I've done that for a
mediawiki already, including the implementation of live indexing, and tuning
the search results, etc...
[01:37] <yaloki> simon321: :\
[01:38] <yaloki> cboltz: it's probably hard-coded in the extension then
[01:38] <yaloki> cboltz: and probably with pretty sane defaults
[01:38] <yaloki> cboltz: but we might want to tune it too
[01:38] <simon321> yaloki: you know that he wasn't there for some time, and I
don't think that was his decision
[01:38] <yaloki> cboltz: e.g. boost results with portal or SDB higher than
other ones
[01:39] <yaloki> simon321: I have no idea, and that's what I'm criticizing, it
was totally opaque :\
[01:39] <simon321> yaloki: to me it appeared like someone else's decision
[01:39] <yaloki> simon321: no idea, no one knows
[01:40] <yaloki> simon321: but if people there feel like they should just take
the decision and do it on their own without discussing, then it's an issue
[01:40] <yaloki> simon321: it's even ridiculous because people with relevant
skills are in our community and ready to help
[01:40] <yaloki> simon321: (not just talking about this case)
[01:41] <yaloki> simon321: don't get me wrong, I'm not hitting on matthew or
anyone else
[01:41] <simon321> yaloki: I know
[01:41] <warlordfff> ok, but explaining would be nice, right?
[01:41] <yaloki> and we all make mistakes, obviously, and it's human to just go
to the next office and discuss it there rather than going the full loop through
mailing-lists
[01:41] <yaloki> but still
[01:42] <yaloki> if we want contributors, if we want a project and a community
[01:42] <yaloki> then it's not acceptable
[01:42] <cboltz> yaloki: in case you are interested, it's this extension:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/MWSearch/
[01:42] <simon321> yaloki: matthew's appearance was the best thing that happened
to openSUSE web - just as his disappearance was bad
[01:43] <simon321> when that guy is around things are going on, otherwise
stagnate
[01:44] <cboltz> yes, he really does a good job - even if he forgets to
re-apply my patch for MultiBoilerplate on each update ;-)
[01:44] <simon321> and taking decision and pushing too strong in one direction
is not going to work; I have to agree with that
[01:45] <cboltz> but fortunately I can fix that myself now ;-) and just need to
ask for deployment
[01:45] <yaloki> cboltz: oh wow they don't seem to do any boosting at all in
the search.. couldn't see anything in the indexing stage either, at first
[01:45] <simon321> taking decision without any consultation (I meant)
[01:45] <cboltz> yaloki: I'm not sure how the indexing is done
[01:45] <yaloki> simon321: yes sure, it's a balance
[01:45] <yaloki> cboltz: well I'm a bit surprised because I don't see that
they're using the mediawiki hooks to do live indexing ?
[01:46] <simon321> yaloki: btw, search indexer was broken for a few months
[01:46] <cboltz> there is a PHP file for it in the extension, but in my wiki
I'm just running a daily indexing cronjob
[01:46] <cboltz> (yes, we already discussed that some weeks ago...)
[01:46] <yaloki> cboltz: I think it works offline in batch
[01:46] <yaloki> ouch
[01:47] <yaloki> ok, that's prolly needed for something of the size of wikipedia
[01:47] <yaloki> but on opensuse.org we could really do live indexing
[01:47] <yaloki> (as soon as a page is created, modified or updated, the search
index is updated)
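The live indexing described here can be sketched as a small inverted index
that is updated whenever a page is saved or deleted, instead of being rebuilt
by a daily batch job. This is a toy Python illustration. The class and method
names are assumptions, and in a real wiki the `on_page_saved` call would be
triggered from a MediaWiki save hook and forwarded to Lucene or Solr.

```python
import re
from collections import defaultdict


class LiveIndex:
    """Toy inverted index kept in sync with page saves (hypothetical names)."""

    def __init__(self):
        self._postings = defaultdict(set)   # word -> titles containing it
        self._words_by_page = {}            # title -> words currently indexed

    @staticmethod
    def _words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def on_page_saved(self, title, text):
        """Call after every create or edit: re-index just this one page."""
        self.on_page_deleted(title)          # drop stale entries first
        words = self._words(text)
        self._words_by_page[title] = words
        for word in words:
            self._postings[word].add(title)

    def on_page_deleted(self, title):
        """Remove a page from the index; a no-op for unknown titles."""
        for word in self._words_by_page.pop(title, set()):
            self._postings[word].discard(title)

    def search(self, word):
        """Return the sorted titles of pages that contain the word."""
        return sorted(self._postings.get(word.lower(), set()))


index = LiveIndex()
index.on_page_saved("SDB:NVIDIA", "Install the nvidia driver")
index.on_page_saved("KDE", "Desktop with nvidia tips")
index.on_page_saved("KDE", "Desktop environment")   # the edit removes "nvidia"
print(index.search("nvidia"))
```

Each save touches only the words of the one page that changed, which is why
this approach stays cheap on a wiki much smaller than Wikipedia.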
[01:47] <simon321> yes, if they give enough cycles to server :)
[01:48] <cboltz> that would be good, yes - but OTOH getting good search results
(even if the result is slightly outdated) is the more important thing ;-)
[01:48] <cboltz> when we have that, we can start to think about live indexing
[01:48] <yaloki> cboltz: no, there is no boosting at all
[01:48] <yaloki> that's ridiculous :(
[01:49] <yaloki> simon321: lucene and solr are extremely fast
[01:49] <yaloki> cboltz: well, live indexing is actually a lot easier to
implement
[01:49] <yaloki> cboltz: my extension that does that is pretty small,
definitely a lot smaller than MWSearch
[01:50] <yaloki> I mean, you should at the very least boost the title field
[01:50] <simon321> yaloki: do you have something that works on current MW used
on openSUSE wikis
[01:51] <cboltz> obviously ;-) - but not only in the openSUSE wiki, it would be
good for all wikis using MWSearch (in other words: upstream)
[01:51] <yaloki> cboltz: https://wiki.apache.org/solr/SolrRelevancyCookbook
[01:51] <yaloki> simon321: it would need specific tuning but yes
[01:51] <yaloki> simon321: that's what I told matthew ages ago
[01:52] <simon321> so you or cboltz can upload that to git and have ready for
matt to deploy?
[01:52] <yaloki> no
[01:52] <yaloki> it needs more work than that
[01:52] <yaloki> it needs a Solr instance, to start with
[01:52] <yaloki> and that won't work because the admins won't install it if
there is no RPM of it ¬¬
[01:53] <simon321> MW has no rpm
[01:53] <yaloki> just like we don't have our own etherpad instance for the same
reasons
[01:54] <cboltz> simon321: you are wrong - AFAIK there is a MW rpm in openSUSE
;-)
[01:54] <cboltz> (but without extensions etc.)
[01:54] <simon321> cboltz: and what is deployed is what?
[01:54] <yaloki> simon321: but it prolly wouldn't work for other reasons, like
me needing access to some stuff on the mediawiki server (or a staging instance)
[01:55] <cboltz> simon321: not a RPM - everything is "collected" in a git repo
[01:55] <cboltz> and based on tarballs and svn checkouts
[01:55] <yaloki> cboltz: hmmm
[01:55] <yaloki> cboltz: on github?
[01:55] <simon321> cboltz: well, that is what I meant - rpm :)
[01:55] <simon321> - is a minus
[01:56] <cboltz> yaloki: https://github.com/openSUSE/wiki
[01:56] <yaloki> cboltz: ok thanks
[01:57] <cboltz> simon321: a RPM won't really work - you still have to maintain
extensions (well, could be another RPM for each extension), the config file etc.
[01:57] <cboltz> and to make things worse, we need a small modification in a MW
core file...
[01:58] <simon321> cboltz: you know that rpm for single installation is just
the way around without purpose :)
[01:58] <yaloki> well, it's not needed
[01:58] <yaloki> you already have git for versioning etc..
[01:58] <yaloki> oh it uses geshi
[01:59] * yaloki also wrote a plugin + a php module to use a shlib as syntax
highlighter
[01:59] <yaloki> faster :)
[01:59] <yaloki> it's upstream btw
[02:01] <yaloki> anywayz
[02:01] <yaloki> time for me to collect a few bits of sleep
[02:01] <yaloki> n8 folks
[02:01] <warlordfff> Going to sleep, goodnight guys
[02:01] <warlordfff> BB
[02:01] <yaloki> let's revive that discussion later
[02:02] <warlordfff> maybe in a project meeting
[02:02] <suseROCKs> you people are still yapping???
[02:02] <yaloki> :)
[02:02] <cboltz> that makes two good ideas (going to bed and continuing the
discussion) ;-)
[02:02] <warlordfff> oh suseROCKs is here, we're late :D
[02:02] <suseROCKs> somehow it is more reassuring when guy says he's late than
when a gal says she's late...
[02:03] <warlordfff> goodnight people, although from some point on it was
impossible for me to follow, I learned a few things :D
[02:03] <suseROCKs> warlordfff, did you learn how to chew gum and walk at the
same time?
[02:03] <warlordfff> Niarfff
[02:04] <warlordfff> Goodnight
Am Donnerstag, 5. April 2012 schrieb Henne Vogelsang:
On 04/05/2012 02:05 AM, Rajko M. wrote:
How about changing default to search "Everything" (all namespaces
without few that are in essence tools)
Please don't. Especially openSUSE: is one big chaotic info dump again,
because nobody gives a shit about the wiki rules. I don't want our
distro users to suffer from this just because...
One more reason to make the search better. (No smiley, I'm serious about
that.)
IMHO this includes searching all namespaces - and rating the main and
portal namespace up so that they are always on top of the search
results.
I had an interesting IRC discussion with yaloki on this some months ago.
The TL;DR summary is something like "lucene itsself is a very good
search engine, but the implementation in MediaWiki / the MWSearch
extension is just broken".
Pascal was quite shocked when reading the MWSearch code. It more or less
does the equivalent of "SELECT ... WHERE content LIKE '%searchword%'.
In other words: lucene has lots of features to get better search results
and to influence the search result order - but we don't use them.
The people involved in the discussion allowed to send the IRC log. It's
attached and contains lots of useful knownledge, so please read it ;-)
(If you are in a hurry, you can start reading at 01:17)
Regards,
Christian Boltz
--
TikiWiki ist eine sehr umfassende Sammlung von Sicherheitslücken,
konzeptuellen Problemen und Performancekillern, die alles kann und
nichts richtig. [Kristian Köhntopp auf
http://blog.koehntopp.de/archives/2051-5-Jahre-Blogging.html]2011-12-23 #opensuse-project
[01:06] <warlordfff> I lost you when you were saying that openfate sucks, right?
[01:07] <warlordfff> what happened you all left?
[01:07] <Ilmehtar> warlordfff: I dont think openfate sucks..I just think that
its surplus to requirements - why have bugzilla + openfate especially when we
dont actualyl rely on either to *define* our future features..just help guide
things along
[01:07] <warlordfff> I think openFate is a great idea
[01:07] <warlordfff> but it might needs work
[01:08] <warlordfff> but sadly I don't have any ideas about it
[01:08] <warlordfff> so I stand in my corner on that
[01:09] <yaloki> searchability sucks monumentally, unfortunately
[01:09] <yaloki> try searching for "board 2012 elections platform pages" on the
wiki
[01:09] <yaloki> zilch
[01:09] <Ilmehtar> it's a good idea but end of the day from a contributors
perspective, when someone things "I want to do something new for
openSUSE"..where do they look? bugzilla? openfate? it would be much nicer to
have a single spot where a contributor can look ..oh that needs doing, great
[01:09] <yaloki> you can't even find any page about the 2012 elections with
that using google
[01:10] <yaloki> Ilmehtar: right
[01:10] <yaloki> openfate is nice in its own right, but it's pointless to
duplicate, the same can be achieved with bugzilla
[01:10] <Ilmehtar> also from a users perspective..they want a new
feature..where do they file it? bugzilla as an enhancement? openfate as a new
feature?
[01:10] <warlordfff> does bugzilla has RSS or something?
[01:11] <yaloki> some
[01:11] * Ilmehtar grumbles about that gnome3 ML troll
[01:11] <yaloki> Ilmehtar: don't feed :)
[01:11] <Ilmehtar> yaloki: not any more..I tried to be nice, now I'm going to
STFU and hope he does too
[01:12] <simon321> Ilmehtar: openFATE for features; although it is almost as
complicated as bugilla
[01:12] <Ilmehtar> bugzilla also lets you follow others, so like the gnome team
all follow the default assignee for gnome bugs so we all get emails about the
changes to gnome bugs that aren't assigned to anyone specifically
[01:12] <warlordfff> oh you taslk about the guy who told that Gnome3 sucks on
openSUSE?
[01:12] <Ilmehtar> simon321: where do features end, and enhancements begin? :)
[01:12] <simon321> Ilmehtar: it is the same backet
[01:12] <simon321> basket
[01:14] <Ilmehtar> warlordfff: and yes, I was, just saw his latest and decided
to give up trying to use logic to have a meaningful discussion with the fella
[01:14] <simon321> in general with openSUSE web infrastructure is only one
thing bad - it is not connected
[01:14] <yaloki> simon321: and search
[01:14] <warlordfff> Ilmehtar: "the full name of "LINUX" is "Linux is not
Unix" neither "Windows" or "iOS"" =EPIC
[01:15] <yaloki> wiki search is the most ridiculous I've ever seen
[01:15] <yaloki> there is lots of content you can't even find by searching
[01:15] <simon321> yaloki: if you have no idea how to connect things then
nothing works as expected, including search
[01:15] <Ilmehtar> warlordfff: lol, yes, I have to admit his antics have given
me quite a few laughs, but still, would be nice to actually have people
constructivly critique g3 rather than just bitch about it
[01:15] <yaloki> simon321: search is broken because of a bad decision
[01:15] <yaloki> simon321: it is restricted intentionally
[01:16] <Ilmehtar> yaloki: why and how?
[01:16] <yaloki> simon321: having a proper search engine across all parts of
the infrastructure would be awesome, yes, but also pretty tough to implement
[01:16] <warlordfff> why don't we take baby steps and try to make Search work
properly and then move to other stuff?
[01:16] <yaloki> Ilmehtar: only the portal pages are searched by default
[01:16] <yaloki> Ilmehtar: henne thought, still thinks, and insists that that
is the right way
[01:16] <Ilmehtar> yaloki: I did not know that..glad I've started tuning the
gnome portal page..
[01:17] <yaloki> Ilmehtar: he had the illusion that people would then be forced
to put everything in portal pages
[01:17] <simon321> yaloki: apropos wiki search, we can let everyone to see
everything - if henne let that happen :) , but we still need presentation space
separated from support
[01:17] <Ilmehtar> yaloki: it would have a hope of working, if people knew
about it..
[01:17] <yaloki> simon321: please define "presentation space" and "support"
[01:17] <yaloki> simon321: for search, you don't separate: show everything you
can, with good relevance ranking :)
[01:18] <yaloki> Ilmehtar: even then it's stupid imho: if content is there, let
people find it and use it, nevermind how it's structured
[01:18] <simon321> yaloki: how to do that - where is relevance written in the
wiki
[01:18] <warlordfff> guys at the oSC11 Henne admited that he needs hands to
organize the wiki
[01:18] <yaloki> Ilmehtar: it's not like there are 10 people working full time
on maintaining content and structure there
[01:18] <yaloki> simon321: in the search engine implementation
[01:19] <yaloki> simon321: mehle moved from the default search+indexing engine
to lucene
[01:19] <yaloki> simon321: I don't know whether the relevance scoring is good
or not, that depends on how you tune lucene
[01:19] <yaloki> you can influence which fields get higher scores for matches,
etc...
[01:20] * yaloki has done quite a lot with Solr, which is a layer on top of
Lucene
[01:20] <yaloki> including on a mediawiki
[01:20] <simon321> yaloki: on the other side, what is relevant for one is not
for the other see http://en.opensuse.org/Portal:Wiki#Structure
[01:20] <cboltz> the perfect solution[tm] would probably be to add some "bonus
points" to search results in the main and portal namespace so that they always
appear as top search result, and the other (now usually hidden) results could
follow below
[01:21] <yaloki> simon321: partly true, but it can usually be done pretty well,
when you know how the implementation works, and when you spend some time tuning
it
[01:21] <yaloki> cboltz: yes, that can be done easily
[01:21] <yaloki> lucene (and solr) has "boosting"
[01:21] <yaloki> and a big solr instance could be deployed to index all the
parts of the infrastructure too
[01:22] <simon321> yaloki: I can't know who is the visitor, it is up to him/her
to select what part he wants to see
[01:22] <yaloki> lists, forums, wiki, openfate, bugzilla, ?
[01:22] <yaloki> but it would require implementation efforts for each part
[01:22] <yaloki> simon321: no, why?
[01:22] <yaloki> simon321: if you search for "nvidia"
[01:23] <yaloki> simon321: just give all the pages that mention nvidia, with
higher scoring for pages that have nvidia in their title, and pages that
mention it more frequently, etc...
[01:23] <yaloki> (that's what such proper search engines do, unlike just using
mysql for search)
[01:23] <simon321> yaloki: 1) is nvidia supported, 2) do I have problem with
nvidia, wich one is visitor looking for?
[01:24] <yaloki> simon321: no, that's not how search works
[01:24] <yaloki> simon321: it's search, not support on irc :)
[01:24] <yaloki> simon321: search relevance obviously cannot be subjective
[01:24] <yaloki> simon321: it's search relevance through scoring and boosting
in the search engine implementation
[01:25] <cboltz> yaloki: to start with small steps - how exactly can we
implement boosting in the wiki search?
[01:26] <yaloki> cboltz: through the configuration
[01:26] <cboltz> (I don't say your "big" solution is wrong, however it will
need time. And until then, small steps are better than nothing ;-)
[01:26] <yaloki> cboltz: no it doesn't need time
[01:26] <simon321> yaloki: I have no idea what guys want to see, and pretending
that one word reveals that, is being a bit too confident - some words will
actually tell, as "nvidia", other as may be too ambitious
[01:26] <yaloki> cboltz: can be done in an hour
[01:26] <yaloki> simon321: sure, but there are techniques for that too
[01:26] <yaloki> simon321: we're not the first ones to implement search :)
[01:26] <cboltz> "configuration" - yes of course ;-)
[01:27] <yaloki> stopwords, etc..
[01:27] <cboltz> any pointers about details? ;-)
[01:27] <yaloki> cboltz: I don't know how the lucene integration in mediawiki
works
[01:27] <yaloki> cboltz: I've implemented it from scratch with Solr in half a
day
[01:27] <yaloki> cboltz: and there you can configure the boosting in the Solr
configuration file
[01:27] <simon321> yaloki: I know only one that works well, disambiguation
pages (but they are not listed as offical navigation tool)
[01:27] <yaloki> cboltz: and, of course, stop only giving results from portal
pages
[01:28] <cboltz> when creating the search index or when the search query runs?
[01:28] <yaloki> simon321: no, really, it works well
[01:28] <yaloki> cboltz: you can do both
[01:28] <yaloki> cboltz: and then it also depends on what is actually being
indexed
[01:28] <yaloki> cboltz: the implementation I did also indexes the wiki
categories the page is on
[01:28] <yaloki> cboltz: and boosts higher on category word matches
[01:29] <warlordfff> type KDE, it is a bit funny since it does not gives you
the pages found but get's you straight to the KDE page
[01:29] <yaloki> you can also tune on nearest matches
[01:29] <cboltz> the index probably contains all wiki pages - at least
searching with "all: $searchword" works
[01:29] <yaloki> warlordfff: yeah, that's wrong too imho
[01:29] <warlordfff> yeap
[01:29] <warlordfff> also
[01:30] <yaloki> actually
[01:30] <warlordfff> if you are searching e.g. from Greece it should get you to
the Greek page if available
[01:30] <yaloki> when you think of how to implement search
[01:30] <yaloki> it's *very* simple
[01:30] <yaloki> it must be like google
[01:30] <yaloki> period
[01:30] <yaloki> that's what people expect
[01:30] <cboltz> it should be possible to change that behaviour for "KDE" -
probably an easy fix in the search form
[01:30] <yaloki> give a list of results, with relevance
[01:30] <simon321> yaloki: going straight to the page is how wikipedia works -
but for terms like KDE they offer listing of topics - not presentation
[01:31] <yaloki> simon321: yes, but that's not what people expect
[01:31] <yaloki> as said, it must behave like google search
[01:31] <yaloki> that's what 99% of people use 99% of the time ^^
[01:31] <simon321> well, I do, just because it works fine on wikipedia :)
[01:31] <yaloki> simon321: it works fine on wikipedia because they have a very
stringent design of the page names and disambiguation
[01:32] <yaloki> simon321: which we don't, and never will have, simply because
it would require a lot more people to do some caretaking of the wiki
[01:33] <warlordfff> the care taking is a problem
[01:33] <warlordfff> Henne told us in that BoF that there are only 5-6 people
doing that
[01:34] <warlordfff> so he is the last one I would blame for that
[01:34] <yaloki> cboltz: I actually proposed to matthew that I give him my code
for solr search and explain to him how it works etc..
[01:34] <yaloki> cboltz: but then there was no followup and suddenly he did
something, no one knows what, with lucene
[01:34] <yaloki> you can't expect people to contribute when things are done
like that
[01:35] <cboltz> indeed, that's understandable :-/
[01:36] <yaloki> cboltz: so I can't tell you what needs to be configured where,
never used the mediawiki lucene plugin (as I guess there is such a thing and
that's what has been used)
[01:36] <yaloki> cboltz: that being said, lucene also has boosting and such, as
Solr builds upon that
[01:36] <simon321> well, the last time there was a discussion, he had lucene
working, and not that much time to spend on alternatives
[01:36] <yaloki> cboltz: and well it just takes a bit of tuning, trial and error
[01:36] <yaloki> cboltz: think of a few searches and what you think would make
sense as results
[01:36] <yaloki> cboltz: then try, analyze and tune until you get that result
[01:37] <yaloki> (solr has a good analyzer too)
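[Note: the "think of a few searches, then try, analyze and tune" loop can be
written down as a small check that is re-run after every tuning change. This is
only a sketch: the expected pages and the canned rankings are hypothetical, and
fake_search stands in for whatever search backend is actually used.]

```python
# Hypothetical expectations: for each query, the page that should rank first.
EXPECTED_TOP_HIT = {
    "nvidia": "SDB:NVIDIA drivers",
    "kde": "Portal:KDE",
}


def failing_queries(search, expectations):
    """Return (query, expected, actual) for every query whose top hit is wrong."""
    failures = []
    for query, expected in expectations.items():
        results = search(query)
        actual = results[0] if results else None
        if actual != expected:
            failures.append((query, expected, actual))
    return failures


def fake_search(query):
    """Stand-in for the real search backend, returning canned rankings."""
    canned = {
        "nvidia": ["SDB:NVIDIA drivers", "Release notes"],
        "kde": ["KDE", "Portal:KDE"],
    }
    return canned.get(query, [])


for query, expected, actual in failing_queries(fake_search, EXPECTED_TOP_HIT):
    print(f"{query!r}: expected {expected!r} first, got {actual!r}")
```

[Each reported failure is a hint for where the boosts still need tuning. An
empty report means the current settings meet all listed expectations.]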
[01:37] <cboltz> yes, there is a MW extension (and I'm also using it in a wiki
I maintain)
[01:37] <yaloki> simon321: I offered him my help
[01:37] <cboltz> but IIRC I didn't see anything about boosting in its
documentation
[01:37] <yaloki> simon321: I know exactly how to do it, I've done that for a
mediawiki already, including the implementation of live indexing, and tuning
the search results, etc...
[01:37] <yaloki> simon321: :\
[01:38] <yaloki> cboltz: it's probably hard-coded in the extension then
[01:38] <yaloki> cboltz: and probably with pretty sane defaults
[01:38] <yaloki> cboltz: but we might want to tune it too
[01:38] <simon321> yaloki: you know that he wasn't there for some time, and I
don't think that was his decision
[01:38] <yaloki> cboltz: e.g. boost results with portal or SDB higher than
other ones
[01:39] <yaloki> simon321: I have no idea, and that's what I'm criticizing, it
was totally opaque :\
[01:39] <simon321> yaloki: to me it appeared like someone else's decision
[01:39] <yaloki> simon321: no idea, no one knows
[01:40] <yaloki> simon321: but if people there feel like they should just take
the decision and do it on their own without discussing, then it's an issue
[01:40] <yaloki> simon321: it's even ridiculous because people with relevant
skills are in our community and ready to help
[01:40] <yaloki> simon321: (not just talking about this case)
[01:41] <yaloki> simon321: don't get me wrong, I'm not hitting on matthew or
anyone else
[01:41] <simon321> yaloki: I know
[01:41] <warlordfff> ok, but explaining would be nice, right?
[01:41] <yaloki> and we all make mistakes, obviously, and it's human to just go
to the next office and discuss it there rather than going the full loop through
mailing-lists
[01:41] <yaloki> but still
[01:42] <yaloki> if we want contributors, if we want a project and a community
[01:42] <yaloki> then it's not acceptable
[01:42] <cboltz> yaloki: in case you are interested, it's this extension:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/MWSearch/
[01:42] <simon321> yaloki: matthew's appearance was the best thing that
happened to openSUSE web - just as his disappearance was bad
[01:43] <simon321> when that guy is around, things get going; otherwise they
stagnate
[01:44] <cboltz> yes, he really does a good job - even if he forgets to
re-apply my patch for MultiBoilerplate on each update ;-)
[01:44] <simon321> and taking decisions and pushing too strongly in one
direction is not going to work; I have to agree with that
[01:45] <cboltz> but fortunately I can fix that myself now ;-) and just need to
ask for deployment
[01:45] <yaloki> cboltz: oh wow they don't seem to do any boosting at all in
the search.. couldn't see anything in the indexing stage either, at first
[01:45] <simon321> taking decisions without any consultation (I meant)
[01:45] <cboltz> yaloki: I'm not sure how the indexing is done
[01:45] <yaloki> simon321: yes sure, it's a balance
[01:45] <yaloki> cboltz: well I'm a bit surprised because I don't see that
they're using the mediawiki hooks to do live indexing ?
[01:46] <simon321> yaloki: btw, the search indexer was broken for a few months
[01:46] <cboltz> there is a PHP file for it in the extension, but in my wiki
I'm just running a daily indexing cronjob
[01:46] <cboltz> (yes, we already discussed that some weeks ago...)
[01:46] <yaloki> cboltz: I think it works offline in batch
[01:46] <yaloki> ouch
[01:47] <yaloki> ok, that's prolly needed for something of the size of wikipedia
[01:47] <yaloki> but on opensuse.org we could really do live indexing
[01:47] <yaloki> (as soon as a page is created, modified or updated, the search
index is updated)
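[Note: live indexing as described here can be sketched with a tiny in-memory
inverted index. This is an illustration only: a real setup would update a
Lucene or Solr index from the wiki's save hooks, and the class and method names
below are hypothetical, not MediaWiki or MWSearch APIs.]

```python
class LiveIndex:
    """Tiny in-memory inverted index updated on every page save or delete."""

    def __init__(self):
        self.postings = {}    # word -> set of page titles containing it
        self.page_words = {}  # page title -> set of words currently indexed

    def on_page_saved(self, title, text):
        """Hypothetical save hook: re-index one page right after it changes."""
        self.on_page_deleted(title)  # drop stale entries first
        words = set(text.lower().split())
        self.page_words[title] = words
        for word in words:
            self.postings.setdefault(word, set()).add(title)

    def on_page_deleted(self, title):
        """Hypothetical delete hook: remove a page from the index."""
        for word in self.page_words.pop(title, set()):
            titles = self.postings[word]
            titles.discard(title)
            if not titles:
                del self.postings[word]

    def lookup(self, word):
        """Return the sorted titles of pages containing the word."""
        return sorted(self.postings.get(word.lower(), set()))


index = LiveIndex()
index.on_page_saved("KDE", "desktop environment")
index.on_page_saved("GNOME", "desktop environment too")
index.on_page_saved("KDE", "plasma workspace")  # an edit replaces the old words
print(index.lookup("desktop"))
print(index.lookup("plasma"))
```

[Only the changed page is touched on each save, so the index is current
immediately and no batch re-indexing job is needed for edits.]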
[01:47] <simon321> yes, if they give enough cycles to the server :)
[01:48] <cboltz> that would be good, yes - but OTOH getting good search results
(even if the result is slightly outdated) is the more important thing ;-)
[01:48] <cboltz> when we have that, we can start to think about live indexing
[01:48] <yaloki> cboltz: no, there is no boosting at all
[01:48] <yaloki> that's ridiculous :(
[01:49] <yaloki> simon321: lucene and solr are extremely fast
[01:49] <yaloki> cboltz: well, live indexing is actually a lot easier to
implement
[01:49] <yaloki> cboltz: my extension that does that is pretty small,
definitely a lot smaller than MWSearch
[01:50] <yaloki> I mean, you should at the very least boost the title field
[01:50] <simon321> yaloki: do you have something that works on the current MW
used on the openSUSE wikis?
[01:51] <cboltz> obviously ;-) - but not only in the openSUSE wiki, it would be
good for all wikis using MWSearch (in other words: upstream)
[01:51] <yaloki> cboltz: https://wiki.apache.org/solr/SolrRelevancyCookbook
[01:51] <yaloki> simon321: it would need specific tuning but yes
[01:51] <yaloki> simon321: that's what I told matthew ages ago
[01:52] <simon321> so you or cboltz can upload that to git and have ready for
matt to deploy?
[01:52] <yaloki> no
[01:52] <yaloki> it needs more work than that
[01:52] <yaloki> it needs a Solr instance, to start with
[01:52] <yaloki> and that won't work because the admins won't install it if
there is no RPM of it ¬¬
[01:53] <simon321> MW has no rpm
[01:53] <yaloki> just like we don't have our own etherpad instance for the same
reasons
[01:54] <cboltz> simon321: you are wrong - AFAIK there is a MW rpm in openSUSE
;-)
[01:54] <cboltz> (but without extensions etc.)
[01:54] <simon321> cboltz: and what is deployed is what?
[01:54] <yaloki> simon321: but it prolly wouldn't work for other reasons, like
me needing access to some stuff on the mediawiki server (or a staging instance)
[01:55] <cboltz> simon321: not a RPM - everything is "collected" in a git repo
[01:55] <cboltz> and based on tarballs and svn checkouts
[01:55] <yaloki> cboltz: hmmm
[01:55] <yaloki> cboltz: on github?
[01:55] <simon321> cboltz: well, that is what I meant - rpm :)
[01:55] <simon321> - is a minus
[01:56] <cboltz> yaloki: https://github.com/openSUSE/wiki
[01:56] <yaloki> cboltz: ok thanks
[01:57] <cboltz> simon321: a RPM won't really work - you still have to maintain
extensions (well, could be another RPM for each extension), the config file etc.
[01:57] <cboltz> and to make things worse, we need a small modification in a MW
core file...
[01:58] <simon321> cboltz: you know that an rpm for a single installation is
just a detour without purpose :)
[01:58] <yaloki> well, it's not needed
[01:58] <yaloki> you already have git for versioning etc..
[01:58] <yaloki> oh it uses geshi
[01:59] * yaloki also wrote a plugin + a php module to use a shlib as syntax
highlighter
[01:59] <yaloki> faster :)
[01:59] <yaloki> it's upstream btw
[02:01] <yaloki> anywayz
[02:01] <yaloki> time for me to collect a few bits of sleep
[02:01] <yaloki> n8 folks
[02:01] <warlordfff> Going to sleep, goodnight guys
[02:01] <warlordfff> BB
[02:01] <yaloki> let's revive that discussion later
[02:02] <warlordfff> maybe in a project meeting
[02:02] <suseROCKs> you people are still yapping???
[02:02] <yaloki> :)
[02:02] <cboltz> that makes two good ideas (going to bed and continuing the
discussion) ;-)
[02:02] <warlordfff> oh suseROCKs is here, we're late :D
[02:02] <suseROCKs> somehow it is more reassuring when guy says he's late than
when a gal says she's late...
[02:03] <warlordfff> goodnight people, although from some point on it was
impossible for me to follow, I learned a few things :D
[02:03] <suseROCKs> warlordfff, did you learn how to chew gum and walk at the
same time?
[02:03] <warlordfff> Niarfff
[02:04] <warlordfff> Goodnight