[opensuse-web] Wiki Upgrade Problems
Hi Everyone, So it turns out we are in a pretty difficult situation with the latest upgrade. The problem with the UTF8 page titles being corrupted is much more widespread than I thought. I have spent a long time looking for a way to fix the issue in the page titles that the upgrade had caused, and although it appears to be a well known issue, no one has a real fix for it yet. We have a few options to take: 1) Roll back all the wikis to 1.16 and restore the database to Sunday night 2) Roll back only the most affected wikis (cn, cs, ru, etc.) and restore the database 3) Leave all the wikis as they are, and keep trying to fix the page titles I'm inclined to do step number two, but I want to hear about any other options or opinions from the individual admins before we try to roll things back. If we can somehow get a list together of the garbled characters and the real characters that are associated, I can even script a fix of some type (making option number 3 more viable). If someone wants to take that on for their particular wiki, they can see the affected pages by going to /Special:AllPages and trying to compare.
Am Donnerstag, den 01.12.2011, 17:11 -0700 schrieb Matthew Ehle:
Hi Everyone,
So it turns out we are in a pretty difficult situation with the latest upgrade. The problem with the UTF8 page titles being corrupted is much more widespread than I thought. I have spent a long time looking for a way to fix the issue in the page titles that the upgrade had caused, and although it appears to be a well known issue, no one has a real fix for it yet.
We have a few options to take:
1) Roll back all the wikis to 1.16 and restore the database to Sunday night 2) Roll back only the most affected wikis (cn, cs, ru, etc.) and restore the database 3) Leave all the wikis as they are, and keep trying to fix the page titles
I'm inclined to do step number two,
+ 1 (please not forgot the de wiki when Roll back ;-) )
but I want to hear about any other options or opinions from the individual admins before we try to roll things back. If we can somehow get a list together of the garbled characters and the real characters that are associated, I can even script a fix of some type (making option number 3 more viable). If someone wants to take that on for their particular wiki, they can see the affected pages by going to /Special:AllPages and trying to compare.
Good luck! -- Grüße aus' m Schwabenland ↓ → Lisufa, der Linuxsusefan ↓ ################################## ********************************** ....::: openSUSE Member :::..... ************************************************************* Die 'SuS(i)E' sei mit euch, wo immer ihr auch seid .... *************************************************************
On 12/02/2011 01:11 AM, Matthew Ehle wrote:
Hi Everyone,
So it turns out we are in a pretty difficult situation with the latest upgrade. The problem with the UTF8 page titles being corrupted is much more widespread than I thought. I have spent a long time looking for a way to fix the issue in the page titles that the upgrade had caused, and although it appears to be a well known issue, no one has a real fix for it yet.
We have a few options to take:
1) Roll back all the wikis to 1.16 and restore the database to Sunday night 2) Roll back only the most affected wikis (cn, cs, ru, etc.) and restore the database 3) Leave all the wikis as they are, and keep trying to fix the page titles
I'm inclined to do step number two, but I want to hear about any other options or opinions from the individual admins before we try to roll things back. If we can somehow get a list together of the garbled characters and the real characters that are associated, I can even script a fix of some type (making option number 3 more viable). If someone wants to take that on for their particular wiki, they can see the affected pages by going to /Special:AllPages and trying to compare.
Hi Matthew, please don't roll back the content of the english wiki, as it doesn't have utf-8 characters in the titles, and there have been quite some changes this week. We are starting the opensuse board elections today, and use the wiki as platform for the candidates for example. Also for the other wikis I hope there comes up a patch so we can fix the page titles without losing the new edits. I think there also have been some changes in the german wiki at least. Greetings -- Thomas Schmidt (tom [at] opensuse.org) openSUSE Boosters Team "Don't Panic", Douglas Adams (1952 - 11.05.2001) -- To unsubscribe, e-mail: opensuse-web+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-web+owner@opensuse.org
Hi Matthew, please don't roll back the content of the english wiki, as it doesn't have utf-8 characters in the titles, and there have been quite some changes this week. We are starting the opensuse board elections today, and use the wiki as platform for the candidates for example.
I have verified that it doesn't contain UTF8 in the titles and is safe to continue. It will NOT be rolled back, and I have removed the lock on it. An interesting thing I noticed is that the English wiki has a different collation on the page table than the other wikis. It is a new wiki, unlike the others, which have been upgraded from much earlier versions. That may be a clue as to the root cause and how we can fix it. The other wiki that is new, of course, is the German wiki. It has the same collation on the page table as the English wiki, and it is different than the old German wiki and all the other wikis. Guess which UTF8 wiki isn't broken?
Also for the other wikis I hope there comes up a patch so we can fix the page titles without losing the new edits. I think there also have been some changes in the german wiki at least. It looks like the German wiki is actually fine, for the reasons mentioned above. I'll remove the lock on it soon. I'm working hard on saving the others. I think the Russian wiki may be a lost cause, but I haven't lost hope yet.
-Matt
Hello, Am Freitag, 2. Dezember 2011 schrieb Matthew Ehle:
Hi Matthew, please don't roll back the content of the english wiki, as it doesn't have utf-8 characters in the titles, and there have been quite some changes this week. We are starting the opensuse board elections today, and use the wiki as platform for the candidates for example.
I have verified that it doesn't contain UTF8 in the titles and is safe to continue. It will NOT be rolled back, and I have removed the lock on it.
An interesting thing I noticed is that the English wiki has a different collation on the page table than the other wikis. It is a new wiki, unlike the others, which have been upgraded from much earlier versions. That may be a clue as to the root cause and how we can fix it.
Which collation is used for a) the english (and new german) wiki b) for all other wikis? Which collation was used before doing the update on the now broken wikis? BTW: The MySQL default charset might also be involved - I seem to remember that old mediawiki versions just used whatever was the default. At least I have a wiki with similar problems where the column uses the default MySQL charset - but fortunately it only affects Special:Listfiles (https://bugzilla.wikimedia.org/show_bug.cgi?id=32207 if you are interested). In my case, the database contains utf-8, but the column is marked as iso-8859-15. I had a short look at the ru wiki - I don't understand anything there ;-) but the page titles look like double-encoded utf-8 to me. Write some of them to a text file and try recode utf-8..$previous_charset $file <scary idea> If I understood you right, the problem only affects the page _titles_. It looks like the page title is stored in the "page" table - and not in too many other tables (I found it in some logging and cache tables, which aren't too relevant IMHO). Can you try to just roll back the page titles in the page table? Run the following query on the _old_ database to get a list of the correct page titles as UPDATE statements: select concat('UPDATE page SET page_title="', page_title, '" WHERE page_id=' , page_id) from page; Check that the result is valid utf-8 (or use recode to fix it), make sure your MySQL connection uses utf-8 and then apply the resulting UPDATE queries to the new database. WARNING: this is completely untested and wrapped in a "<scary idea>" tag for a reason. It might work, but I can't promise anything... </scary idea>
The other wiki that is new, of course, is the German wiki. It has the same collation on the page table as the English wiki, and it is different than the old German wiki and all the other wikis.
Guess which UTF8 wiki isn't broken?
Hmmm... ;-)
Also for the other wikis I hope there comes up a patch so we can fix the page titles without losing the new edits. I think there also have been some changes in the german wiki at least. It looks like the German wiki is actually fine, for the reasons mentioned above. I'll remove the lock on it soon. I'm working hard on saving the others. I think the Russian wiki may be a lost cause, but I haven't lost hope yet.
See above - and I don't see a reason why the ru wiki should be "more lost" than other language wikis ;-) Regards, Christian Boltz --
Glaub mir, die Schrott-Quote bei den ATA/Billig-SATA ist enorm, die meisten merken's halt nur nicht. ;) PS. Wir handeln u.a. mit sowas und die Rücklaufrate ist (sehr) hoch. Du bist Schrotthaendler? ;-) [> Mirko Richter und Thomas Hertweck in suse-linux]
-- To unsubscribe, e-mail: opensuse-web+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-web+owner@opensuse.org
Which collation is used for a) the english (and new german) wiki b) for all other wikis? Which collation was used before doing the update on the now broken wikis?
They use a binary collation, as opposed to UTF-8 for the rest of the wikis.
BTW: The MySQL default charset might also be involved - I seem to remember that old mediawiki versions just used whatever was the default. At least I have a wiki with similar problems where the column uses the default MySQL charset - but fortunately it only affects Special:Listfiles (https://bugzilla.wikimedia.org/show_bug.cgi?id=32207 if you are interested). In my case, the database contains utf-8, but the column is marked as iso-8859-15.
I had a short look at the ru wiki - I don't understand anything there ;-) but the page titles look like double-encoded utf-8 to me. Write some of them to a text file and try recode utf-8..$previous_charset $file
I thought so too at first. That's not quite the case. I have actually made a lot of progress on the Russian wiki in stage. From what I have found, it appears that the update just dumped Latin1 encoded text into a UTF-8 table without properly encoding the text itself.
<scary idea> If I understood you right, the problem only affects the page _titles_. It looks like the page title is stored in the "page" table - and not in too many other tables (I found it in some logging and cache tables, which aren't too relevant IMHO). Also category tables, and several others.
Can you try to just roll back the page titles in the page table? If we did it immediately after upgrading, it probably would have worked. However, that table will no longer be consistent with all of the other tables.
Run the following query on the _old_ database to get a list of the correct page titles as UPDATE statements:
select concat('UPDATE page SET page_title="', page_title, '" WHERE page_id=' , page_id) from page;
Check that the result is valid utf-8 (or use recode to fix it), make sure your MySQL connection uses utf-8 and then apply the resulting UPDATE queries to the new database.
WARNING: this is completely untested and wrapped in a "<scary idea>" tag for a reason. It might work, but I can't promise anything... </scary idea> Unfortunately, we have too many keys and indexes for me to think that will work very well.
Also for the other wikis I hope there comes up a patch so we can fix the page titles without losing the new edits. I think there also have been some changes in the german wiki at least. It looks like the German wiki is actually fine, for the reasons mentioned above. I'll remove the lock on it soon. I'm working hard on saving the others. I think the Russian wiki may be a lost cause, but I haven't lost hope yet.
See above - and I don't see a reason why the ru wiki should be "more lost" than other language wikis ;-)
There are a LOT of inconsistencies in the Russian wiki, apparently not just with the UTF-8 page titles. There are a lot of duplicate keys and indexes. I managed to get the stage database properly encoded, but I lost about two dozen pages in the process due to duplicate key errors. Matthew Ehle Web Engineer IS&T Mobile Phone: (801) 358-1655 mehle@novell.com
Dne Pá 2. prosince 2011 01:11:34, Matthew Ehle napsal(a):
2) Roll back only the most affected wikis (cn, cs, ru, etc.) and restore the database 3) Leave all the wikis as they are, and keep trying to fix the page titles
I'm inclined to do step number two, but I want to hear about any other options or opinions from the individual admins before we try to roll things back. If we can somehow get a list together of the garbled characters and the real characters that are associated, I can even script a fix of some type (making option number 3 more viable). If someone wants to take that on for their particular wiki, they can see the affected pages by going to /Special:AllPages and trying to compare.
Hi, at this moment (in cs.opensuse.org) it looks that majority or all titles are ok, but i noticed that some category titles are broken: http://cs.opensuse.org/index.php?title=Speci%C3%A1ln%C3%AD:Kategorie&limit=500 and eg. Navigation toolbar http://cs.opensuse.org/index.php?title=Kategorie:Naviga%C4%8Dn%C3%AD_li%C5%A1ty&action=edit&redlink=1 thows MediaWiki error. It looks, that MediaWiki errors are displayed, when page doesn't exist. It seems that existing pages with diacritic work. Eg. http://cs.opensuse.org/Struktura_soubor%C5%AF_u_opera%C4%8Dn%C3%ADho_syst%C3... ... Huh, I found some error at Special pages - Category tree (Strom kategorií), when I tried to found Help category tree (Nápověda) http://cs.opensuse.org/index.php?title=Speci%C3%A1ln%C3%AD%3AStrom_kategori%C3%AD&target=N%C3%A1pov%C4%9Bda&mode=categories&dotree=Zobrazit It said: Database returned error „1267: Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' (137.65.246.148:3307)“. ... I think, that Czech wiki can stay at new version, but we have to do something to repair broken links and categories... Thanks. Jan -- To unsubscribe, e-mail: opensuse-web+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-web+owner@opensuse.org
>>>> Jan Papež<honyczek-mailing@email.cz> 12/9/2011 6:12 AM >>> >Dne Pá 2. prosince 2011 01:11:34, Matthew Ehle napsal(a): >> 2) Roll back only the most affected wikis (cn, cs, ru, etc.) and restore the >> database >> 3) Leave all the wikis as they are, and keep trying to fix the >> page titles >> >> I'm inclined to do step number two, but I want to hear about any other >> options or opinions from the individual admins before we try to roll things >> back. If we can somehow get a list together of the garbled characters and >> the real characters that are associated, I can even script a fix of some >> type (making option number 3 more viable). If someone wants to take that >> on for their particular wiki, they can see the affected pages by going to >> /Special:AllPages and trying to compare. > >Hi, > >at this moment (in cs.opensuse.org) it looks that majority or all titles are >ok, but i noticed that some category titles are broken: >http://cs.opensuse.org/index.php?title=Speci%C3%A1ln%C3%AD:Kategorie&limit=500 > >and eg. Navigation toolbar >http://cs.opensuse.org/index.php?title=Kategorie:Naviga%C4%8Dn%C3%AD_li%C5%A1ty&action=edit&redlink=1 >thows MediaWiki error. > >It looks, that MediaWiki errors are displayed, when page doesn't exist. > >It seems that existing pages with diacritic work. Eg. >http://cs.opensuse.org/Struktura_soubor%C5%AF_u_opera%C4%8Dn%C3%ADho_syst%C3%A9mu_Linux > >... > >Huh, I found some error at Special pages - Category tree (Strom kategorií), >when I tried to found Help category tree (Nápověda) >http://cs.opensuse.org/index.php?title=Speci%C3%A1ln%C3%AD%3AStrom_kategori%C3%AD&target=N%C3%A1pov%C4%9Bda&mode=categories&dotree=Zobrazit >It said: >Database returned error „1267: Illegal mix of collations >(latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' >(137.65.246.148:3307)“. >... > >I think, that Czech wiki can stay at new version, but we have to do something >to repair broken links and categories... It looks like the Czech wiki had another collation issue that the other wikis didn't have. I had to revert a few days back, but it's fixed now. -Matt
participants (5)
-
Christian Boltz
-
Jan Papež
-
LisufasGnome3
-
Matthew Ehle
-
Thomas Schmidt