
On Mi, Jun 15 2022 at 22:19:50 +0200, Sasi Olin <hellcp@opensuse.org> wrote:
A quick note on that, it's not entirely true, Russian part of the forums was using utf8 since it started, so we will have to exclude all of the posts that belong to that part of the forums and export them together with the rest of utf8 posts after the date. This far I found 1304945809 to be the most recent broken post, but I will have to check it again after I come up with a query that excludes russian forum posts.
After some time, I came up with the following queries: ``` # Posts excluding the russian forum (909) select * from vb_post where dateline < '1304944363' and threadid in (select threadid from vb_thread where forumid not in (with recursive cte as (select forumid from vb_forum where forumid='909' union all select c.forumid from vb_forum c join cte ON cte.forumid=c.parentid) select forumid from cte)); # Threads excluding the russian forum (909) select * from vb_thread where dateline < '1304944363' and threadid in (select threadid from vb_thread where forumid not in (with recursive cte as (select forumid from vb_forum where forumid='909' union all select c.forumid from vb_forum c join cte ON cte.forumid=c.parentid) select forumid from cte)); # All of pms select * from vb_pmtext; ``` I am not touching the rest of the forums, since we will not need it for discourse migration anyway. Should we schedule maintenance for the forums and fix them using those? What I would essentially do would be: 1. Dump the entire database into utf8 charset (essentially intentionally breaking it with mysqldump --set-charset=utf8 > ...) 2. Dump the bits of the database with latin1 encoding using the queries above (with mysqldump --set-charset=latin1 --single-transaction=TRUE \ --where="$query_after_where" $database $table > ...) 3. Import the big dump into a new database 4. Override the broken database bits with the small dumps It seems like the least potential for breakage this way, since all of those seem properly indexed but I am open to suggestions. There are some older and newer posts with broken encoding, because a lot of people were using utf8 with nntp before the switch and some people kept using signatures with broken encoding past the utf8 switch. We can keep around the old database and fix the posts one by one, but people did reference broken posts in the forums themselves as well, so it would be breaking the actual flow of the conversations at that point. LCP [Sasi] https://lcp.world/