>Which collation is used for a) the english (and new german) wiki b) for
>all other wikis? Which collation was used before doing the update on the
>now broken wikis?
The English and new German wikis use a binary collation, as opposed to UTF-8 for the rest of the wikis.

>BTW: The MySQL default charset might also be involved - I seem to
>remember that old mediawiki versions just used whatever was the default.
>At least I have a wiki with similar problems where the column uses the
>default MySQL charset - but fortunately it only affects
>Special:Listfiles (https://bugzilla.wikimedia.org/show_bug.cgi?id=32207
>if you are interested). In my case, the database contains utf-8, but the
>column is marked as iso-8859-15.
>
>I had a short look at the ru wiki - I don't understand anything there
>;-) but the page titles look like double-encoded utf-8 to me. Write some
>of them to a text file and try   recode utf-8..$previous_charset $file
I thought so too at first, but that's not quite the case. I have actually made a lot of progress on the Russian wiki in stage. From what I have found, it appears that the update just dumped Latin1-encoded text into a UTF-8 table without properly converting the text itself.
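
For anyone who wants to tell the cases apart on their own data, here is a minimal Python sketch. It is only an illustration, not the procedure used on stage. The function names are hypothetical, and it assumes the "wrong" charset behaves like Python's latin-1 codec (MySQL's latin1 is closer to cp1252, so real data may need adjusting). The check is a heuristic and can misjudge short titles.

```python
def classify_title(raw: bytes) -> str:
    """Guess how the bytes of a stored title are encoded (heuristic)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that are not valid UTF-8 at all, e.g. raw Latin-1 text
        # sitting in a column that is declared as UTF-8.
        return "not-utf8"
    try:
        # Double-encoded text survives a Latin-1 round trip, and the
        # result is itself valid UTF-8.
        inner = text.encode("latin-1")
        inner.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"
    # Plain ASCII round-trips to itself and needs no repair.
    return "utf8" if inner == raw else "double-encoded"


def repair_double_encoded(raw: bytes) -> bytes:
    """Undo one layer of UTF-8 -> Latin-1 -> UTF-8 double encoding."""
    return raw.decode("utf-8").encode("latin-1")


if __name__ == "__main__":
    good = "Заглавная".encode("utf-8")
    double = good.decode("latin-1").encode("utf-8")
    for sample in (good, double, "Müller".encode("latin-1")):
        print(classify_title(sample))
```

Only titles classified as "double-encoded" should go through repair_double_encoded; the "not-utf8" ones need a plain Latin-1 to UTF-8 conversion instead.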

><scary idea>
>If I understood you right, the problem only affects the page _titles_.
>It looks like the page title is stored in the "page" table - and not in
>too many other tables (I found it in some logging and cache tables,
>which aren't too relevant IMHO).
It is also in the category tables, and several others.

>Can you try to just roll back the page titles in the page table?
If we had done it immediately after upgrading, it probably would have worked. At this point, however, that table would no longer be consistent with all of the other tables.

>Run the following query on the _old_ database to get a list of the
>correct page titles as UPDATE statements:
>
>select concat('UPDATE page SET page_title="', page_title, '" WHERE
>page_id=' , page_id) from page;
>
>Check that the result is valid utf-8 (or use recode to fix it), make
>sure your MySQL connection uses utf-8 and then apply the resulting
>UPDATE queries to the new database.
>
>WARNING: this is completely untested and wrapped in a "<scary idea>" tag
>for a reason. It might work, but I can't promise anything...
></scary idea>
Unfortunately, we have too many keys and indexes for me to think that will work very well.

>> >Also for the other wikis I hope there comes up a patch so we can fix
>> >the page titles without losing the new edits. I think there also
>> >have been some changes in the german wiki at least.
>> It looks like the German wiki is actually fine, for the reasons
>> mentioned above.  I'll remove the lock on it soon.  I'm working hard
>> on saving the others.  I think the Russian wiki may be a lost cause,
>> but I haven't lost hope yet.
>
>See above - and I don't see a reason why the ru wiki should be "more
>lost" than other language wikis ;-)
There are a LOT of inconsistencies in the Russian wiki, apparently not just with the UTF-8 page titles.  There are a lot of duplicate keys and indexes.  I managed to get the stage database properly encoded, but I lost about two dozen pages in the process due to duplicate key errors.
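
One way to see such collisions before running a conversion is sketched below in Python. This is a hedged illustration, not what was run on stage. It assumes rows have been exported as (page_id, page_namespace, page_title) tuples with the title as raw bytes, and that uniqueness is enforced on the (page_namespace, page_title) pair. The function names are hypothetical.

```python
from collections import defaultdict


def repair_title(raw: bytes) -> bytes:
    """Undo one layer of double encoding if present, else return raw unchanged."""
    try:
        candidate = raw.decode("utf-8").encode("latin-1")
        candidate.decode("utf-8")  # the repaired bytes must be valid UTF-8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return raw
    return candidate


def find_title_collisions(rows):
    """Return {(namespace, repaired_title): [page_id, ...]} for keys shared by several rows."""
    by_key = defaultdict(list)
    for page_id, namespace, title in rows:
        by_key[(namespace, repair_title(title))].append(page_id)
    return {key: ids for key, ids in by_key.items() if len(ids) > 1}


if __name__ == "__main__":
    good = "Заглавная".encode("utf-8")
    broken = good.decode("latin-1").encode("utf-8")
    rows = [(1, 0, good), (2, 0, broken), (3, 0, b"Other")]
    for (namespace, title), ids in find_title_collisions(rows).items():
        print(namespace, title.decode("utf-8"), ids)
```

Any page ids reported together would end up with the same key after conversion, so they can be merged or renamed by hand instead of being dropped by a duplicate key error.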
 
 

Matthew Ehle
Web Engineer
IS&T
Mobile Phone: (801) 358-1655
mehle@novell.com