What | Removed | Added
---|---|---
Status | NEW | IN_PROGRESS
OK, I was looking more into why allocations get more spread among groups with mb_optimize_scan=1. Let me summarize my current findings here so that I don't forget while I'm on vacation next week :).

Things get somewhat obscured by group preallocations, because small allocations (below 64k, which is our case) get allocated from those. Group preallocations are per-CPU and they get initialized from the following groups:

mb_optimize_scan=0:
49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113 33 33 97 81 49 81 17 49

mb_optimize_scan=1:
127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117 116 115 116 114 113 111 110 109 108 107 106 105 104 104

So we can see that the groups from which group preallocations get allocated drift with mb_optimize_scan=1, while they keep jumping among the same groups with mb_optimize_scan=0. This is likely because with mb_optimize_scan=0 we always start searching for free space in the goal group, which is determined by the inode, and the inode's group is determined by the parent directory. So in that case we always start the search in the same group. With mb_optimize_scan=1 we always call ext4_mb_choose_next_group_cr0() to determine the first group to search. The drifting seems to be caused by the fact that each free space update (e.g. from mb_mark_used()) calls mb_set_largest_free_order(), which deletes the group from the bb_largest_free_order_node list, recomputes the order, and inserts the group at the tail of the list again. A sketch of this effect is below.

Anyway, this seems a bit suboptimal, because locality between an inode and its data blocks is desirable. Also, given the group selection algorithm, all allocations of the same size will be contending for the same group. After I return from vacation, I'll take this upstream to discuss whether the current behavior needs tweaking.
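To make the drift mechanism concrete, here is a minimal user-space sketch (not the actual kernel code). choose_next_group() and update_group() are hypothetical stand-ins for ext4_mb_choose_next_group_cr0() and the mb_set_largest_free_order() behaviour described above, with one simplifying assumption: the group's largest free order never changes, so every update just rotates the group from the head to the tail of a single list. It only illustrates the rotate-to-tail effect, not the exact list ordering the kernel ends up with.

```c
/* Simplified illustration of the starting-group drift; not kernel code. */
#include <stdio.h>

#define NGROUPS 8

/* Simplified per-group info: just a slot in a singly linked FIFO. */
struct group {
	int no;
	struct group *next;
};

static struct group groups[NGROUPS];
static struct group *head, *tail;	/* one largest-free-order list */

static void list_append(struct group *g)
{
	g->next = NULL;
	if (!head)
		head = tail = g;
	else {
		tail->next = g;
		tail = g;
	}
}

static void list_remove(struct group *g)
{
	struct group **pp = &head;

	while (*pp && *pp != g)
		pp = &(*pp)->next;
	if (*pp) {
		*pp = g->next;
		/* recompute tail after unlinking */
		tail = head;
		while (tail && tail->next)
			tail = tail->next;
	}
}

/* Stand-in for ext4_mb_choose_next_group_cr0(): take the list head. */
static struct group *choose_next_group(void)
{
	return head;
}

/*
 * Stand-in for the update path: after the allocation changes the
 * group's free space, the group is unlinked and re-inserted at the
 * tail of the list (the order itself is assumed unchanged here).
 */
static void update_group(struct group *g)
{
	list_remove(g);
	list_append(g);
}

int main(void)
{
	for (int i = 0; i < NGROUPS; i++) {
		groups[i].no = i;
		list_append(&groups[i]);
	}

	/*
	 * Each allocation picks the list head and then rotates it to
	 * the tail, so consecutive allocations keep moving to a new
	 * group instead of staying near one goal group.
	 */
	for (int i = 0; i < 12; i++) {
		struct group *g = choose_next_group();

		printf("allocation %2d -> group %d\n", i, g->no);
		update_group(g);
	}
	return 0;
}
```

With mb_optimize_scan=0 there is no such rotation: the search always starts from the inode's goal group, which is why the same few group numbers keep repeating in the first trace above.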