What | Removed | Added
---|---|---
Status | NEW | IN_PROGRESS
OK, I was looking more into why allocations get more spread among groups with mb_optimize_scan=1. Let me summarize my current findings here so that I don't forget while I'm on vacation next week :).

Things get somewhat obscured by group preallocations, because small allocations (below 64k, which is our case) get allocated from those. Group preallocations are per-CPU and they get initialized from the following groups:

mb_optimize_scan=0:
49 81 113 97 17 33 113 49 81 33 97 113 81 1 17 33 33 81 1 113 97 17 113 113 33 33 97 81 49 81 17 49

mb_optimize_scan=1:
127 126 126 125 126 127 125 126 127 124 123 124 122 122 121 120 119 118 117 116 115 116 114 113 111 110 109 108 107 106 105 104 104

So we can see that the groups from which group preallocations get allocated drift with mb_optimize_scan=1, while they keep jumping among the same groups with mb_optimize_scan=0. This is likely because with mb_optimize_scan=0 we always start searching for free space in the goal group, which is determined by the inode, and the inode's group is determined by the parent directory. So in that case we always start the search in the same group. With mb_optimize_scan=1 we always call ext4_mb_choose_next_group_cr0() to determine the first group to search. The drifting seems to be caused by the fact that each free space update (e.g. from mb_mark_used()) calls mb_set_largest_free_order(), which deletes the group from the bb_largest_free_order_node list, recomputes the order, and inserts the group at the tail of the list again. A sketch of this effect is below.

Anyway, this seems a bit suboptimal, because locality between an inode and its data blocks is desirable. Also, given the group selection algorithm, all allocations of the same size will be contending for the same group. After I return from vacation, I'll take this upstream to discuss whether the current behavior needs tweaking.
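To make the drift mechanism concrete, here is a minimal user-space sketch (not the actual kernel code). choose_next_group() and update_group() are hypothetical stand-ins for ext4_mb_choose_next_group_cr0() and the mb_set_largest_free_order() behaviour described above, with one simplifying assumption: the group's largest free order never changes, so every update just rotates the group from the head to the tail of a single list. It only illustrates the rotate-to-tail effect, not the exact list ordering the kernel ends up with.

```c
/* Simplified illustration of the starting-group drift; not kernel code. */
#include <stdio.h>

#define NGROUPS 8

/* Simplified per-group info: just a slot in a singly linked FIFO. */
struct group {
	int no;
	struct group *next;
};

static struct group groups[NGROUPS];
static struct group *head, *tail;	/* one largest-free-order list */

static void list_append(struct group *g)
{
	g->next = NULL;
	if (!head)
		head = tail = g;
	else {
		tail->next = g;
		tail = g;
	}
}

static void list_remove(struct group *g)
{
	struct group **pp = &head;

	while (*pp && *pp != g)
		pp = &(*pp)->next;
	if (*pp) {
		*pp = g->next;
		/* recompute tail after unlinking */
		tail = head;
		while (tail && tail->next)
			tail = tail->next;
	}
}

/* Stand-in for ext4_mb_choose_next_group_cr0(): take the list head. */
static struct group *choose_next_group(void)
{
	return head;
}

/*
 * Stand-in for the update path: after the allocation changes the
 * group's free space, the group is unlinked and re-inserted at the
 * tail of the list (the order itself is assumed unchanged here).
 */
static void update_group(struct group *g)
{
	list_remove(g);
	list_append(g);
}

int main(void)
{
	for (int i = 0; i < NGROUPS; i++) {
		groups[i].no = i;
		list_append(&groups[i]);
	}

	/*
	 * Each allocation picks the list head and then rotates it to
	 * the tail, so consecutive allocations keep moving to a new
	 * group instead of staying near one goal group.
	 */
	for (int i = 0; i < 12; i++) {
		struct group *g = choose_next_group();

		printf("allocation %2d -> group %d\n", i, g->no);
		update_group(g);
	}
	return 0;
}
```

With mb_optimize_scan=0 there is no such rotation: the search always starts from the inode's goal group, which is why the same few group numbers keep repeating in the first trace above.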