Andrew Morton
Peter Osterlund
wrote: If I create an ext2 filesystem with 2kb blocksize, I hit a bug when I try to write some large files to the filesystem. The problem is that the code in mpage_writepage() fails if a page is mapped to disk across a packet boundary. In that case, the bio_add_page() call at line 543 in mpage.c can fail even if the bio was previously empty. The code then passes an empty bio to submit_bio(), which triggers a bug at line 2303 in ll_rw_blk.c. This patch seems to fix the problem.
--- linux/fs/mpage.c.old 2004-01-02 00:26:19.000000000 +0100 +++ linux/fs/mpage.c 2004-01-02 00:26:50.000000000 +0100 @@ -541,6 +541,11 @@
length = first_unmapped << blkbits; if (bio_add_page(bio, page, length, 0) < length) { + if (!bio->bi_size) { + bio_put(bio); + bio = NULL; + goto confused; + } bio = mpage_bio_submit(WRITE, bio); goto alloc_new; }
Confused. We initially have an empty BIO, and we run bio_add_page() against it, adding one page.
How can that bio_add_page() fail to add the page?
Cold you describe the failure a little more please?
In bio_add_page(), there is this check: /* * if queue has other restrictions (eg varying max sector size * depending on offset), it can specify a merge_bvec_fn in the * queue to get further control */ if (q->merge_bvec_fn) { /* * merge_bvec_fn() returns number of bytes it can accept * at this offset */ if (q->merge_bvec_fn(q, bio, bvec) < len) { bvec->bv_page = NULL; bvec->bv_len = 0; bvec->bv_offset = 0; return 0; } } The packet writing code has the restriction that a bio must not span a packet boundary. (A packet is 32*2048 bytes.) If the page when mapped to disk starts 2kb before a packet boundary, merge_bvec_fn therefore returns 2048, which is less than len, which is 4096 if the whole page is mapped, so the bio_add_page() call fails. I don't think this case ever happens with the standard kernel, because it doesn't (afaik) contain any merge_bvec_fn implementations that are as restrictive as the one in the cdrw packet writing code. I still think the code in mpage_writepage() is wrong, even if it happens to work in the standard kernel.
I noted that performance is quite bad with 2kb blocksize. It is a lot faster with 4kb blocksize. (2kb blocksize with the udf filesystem is not slow, only ext2 seems to have this problem.) Maybe the "confused" case (which calls a_ops->writepage()) in mpage_writepage() isn't really meant to be fast. Is there a better way to fix this problem?
How much slower is it?
I didn't make any exact measurements, but it looked like the pageout code didn't throw enough parallel bios at the packet writing request queue, so a lot of incomplete packets had to be written to disk. This is very costly, because the code then has to do read/modify/write cycles instead of just writing the whole packet directly. A rough guess is that 2kb block size was 2-3 times slower than 4kb block size, but I'll make more tests later to see what's really going on here. -- Peter Osterlund - petero2@telia.com http://w1.894.telia.com/~u89404340