Hi Jeff,

Thanks for the details and for the time you spent analysing this. I think I will now point the Nacl list (who provided the test program) to this thread to see if they can find the right solution, as this all started with Nacl not working on Opensuse 11.4 (while it is claimed to work on, for example, Ubuntu). I realize both Nacl and the test code now work on Opensuse 12.1 RC2 (as you pointed out, and as I tried myself yesterday), but that hinges on the distro setting no "virtual memory" ulimit. So, if I understand correctly, if the Nacl code mmaps a huge space without checking the "virtual memory" ulimit, Nacl will always fail on systems where that ulimit is set to a finite value below the mapping size. I'd like Nacl to work on any Linux distro and installation, in particular Opensuse :), and hope this thread will help in that direction...

I added a few comments inline below (although I am definitely in over my head on the details). I am happy with the explanation and hope it will be useful to others.

Thanks again for your analysis and explanation,
milan

On Thu, Nov 10, 2011 at 3:10 PM, Jeff Mahoney <jeffm@suse.de> wrote:
On 11/10/2011 02:43 PM, milan zimmermann wrote:
On Thu, Nov 10, 2011 at 10:15 AM, Jeff Mahoney <jeffm@suse.de> wrote: On 11/10/2011 11:06 AM, milan zimmermann wrote:
On Wed, Nov 9, 2011 at 3:41 PM, Jeff Mahoney <jeffm@suse.de> wrote:
>>> virtual memory (kbytes, -v) 29536000
... but here is your problem.
I disagree. With the switches I used, and with
Then test it. On my system, I see:
jeffm@jetfire:~> ulimit -v unlimited
jeffm@jetfire:~> ./a.out
Hello world
Allocating 29Gb...
Success: 0x7fc7732d6000

jeffm@jetfire:~> ulimit -v 28000000
jeffm@jetfire:~> ./a.out
Hello world
Allocating 29Gb...
FAILED
You are right, that is how it works, I agree (my "disagree" was too harsh there :) ).
I should have said I am not sure that, with these particular flags, ulimit should matter, as there should be no memory allocated (reserved). But I am only basing that on half-understanding this:
http://www.mjmwired.net/kernel/Documentation/vm/overcommit-accounting
(I think you are describing the difference below, also a question there)
I am, but there's another key difference here. There's a "max memory size" ulimit that controls actual allocations. The "virtual memory" one controls how much of the address space you can use. They're two separate things.
yes, understand that, thanks
Here's a quick rundown.
Overcommit covers memory use that is backed by memory itself plus any swap space you have.
mmap can be used to assign portions of the address space to objects and not all of them are backed by memory+swap.
yes
For example, your program executable itself, any libraries you load, and any files you mmap() read-only are all backed by the file on-disk unless they're modified by private mappings. The kernel knows that it can drop those pages and read them back in from the file, similar to how it can read swapped pages back from the swap space.
getting a bit hard for me but yes, makes sense
Take a look at /proc/<pid>/maps to see what I mean.
For your test case, I see this:
jeffm@jetfire:~> ./a.out
Hello world
Allocating 29Gb...
Success: 0x7f0ad3de2000
pid=3485

jeffm@jetfire:~> cat /proc/3485/maps
00400000-00401000 r-xp 00000000 fd:03 131888 /home/jeffm/a.out
00600000-00601000 r--p 00000000 fd:03 131888 /home/jeffm/a.out
00601000-00602000 rw-p 00001000 fd:03 131888 /home/jeffm/a.out
7f0ad3de2000-7f1213de2000 ---p 00000000 00:00 0   # Here's the test mmap
7f1213de2000-7f1213f67000 r-xp 00000000 fd:01 262980 /lib64/libc-2.14.1.so
7f1213f67000-7f1214167000 ---p 00185000 fd:01 262980 /lib64/libc-2.14.1.so
7f1214167000-7f121416b000 r--p 00185000 fd:01 262980 /lib64/libc-2.14.1.so
7f121416b000-7f121416c000 rw-p 00189000 fd:01 262980 /lib64/libc-2.14.1.so
7f121416c000-7f1214171000 rw-p 00000000 00:00 0
7f1214171000-7f1214191000 r-xp 00000000 fd:01 262991 /lib64/ld-2.14.1.so
7f1214372000-7f1214375000 rw-p 00000000 00:00 0
7f121438f000-7f1214391000 rw-p 00000000 00:00 0
7f1214391000-7f1214392000 r--p 00020000 fd:01 262991 /lib64/ld-2.14.1.so
7f1214392000-7f1214393000 rw-p 00021000 fd:01 262991 /lib64/ld-2.14.1.so
7f1214393000-7f1214394000 rw-p 00000000 00:00 0
7fff0588a000-7fff058ab000 rw-p 00000000 00:00 0 [stack]
7fff0593a000-7fff0593c000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Address range perms offset dev ino usage
The perms are:
r - read
w - write
x - exec
p - private (its absence means shared)
So, we can see that the first three lines cover the test program itself.
ok
The next line is the mmap, which with PROT_NONE is actually backed by nothing since it's inaccessible.
Out of curiosity, what is PROT_NONE here? Thanks:
7f0ad3de2000-7f1213de2000 ---p 00000000 00:00 0   # Here's the test mmap
The next 4 lines are (obv) libc, etc.
ok
So that's a bunch of address space used when in fact only the chunks with "w" and "p" in the permissions need to actually be backed by swap. It's actually a bit more complicated than that, but for the sake of a simple example, it's enough.
yes, I appreciate the details, thanks
sysctl -w vm.overcommit_memory=0 # or 1
according to the link I posted
http://www.mjmwired.net/kernel/Documentation/vm/overcommit-accounting
the cost of allocation with these parameters should be 0, and
the call
mmap((void *) NULL, someSize * (((size_t) 1) << 30), PROT_NONE, MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0); # someSize is integer with number of Gigs
should succeed no matter what virtual memory, and no matter what size I am allocating.
As far as the kernel generally is concerned, yes. You're hitting user resource limits, not a kernel out-of-memory condition. If you had overcommit disabled, then you might run into that issue.
So it seems that the user resource limit is checked first. Let me ask a speculative question (but I would appreciate a comment, as it may help get Nacl working on Linux; it seems their code maps 84Gb without reserving it this way)
:)
So the question: Do you think that on any Linux, with ulimit virtual memory set to X Gb, this call will always fail:
mmap((void *) NULL, (X+1) * (((size_t) 1) << 30), PROT_NONE, MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
Thanks (one more comment below)
There's no reason to think it will _always_ fail but it'd be easy enough for them to check the failure case and compare it to the ulimit.
yes, this is what I hope can be done, I will point this out on the Nacl list
On my freshly installed 12.1-rc1 system, there's no limit for virtual memory set for my account.
yes, and nacl works there as well as the test code.
So far I have no other explanation than that this is a bug (does anyone agree?). On the chrome list, people indicated the same issue was fixed in Ubuntu lately (mmap of 84Gb succeeds with way less virtual memory). I only have 4 boxes with Opensuse ... will also try Opensuse 12.1 and report
I blame Windows for perpetuating this myth that "virtual memory" = "real memory + swap" since that's not at all what it means. Virtual memory is the address space. You're allocating part of the address space but you're not using it.
That is why I thought the user limit should be checked at the point I actually use it (not during MAP_NORESERVE), but certainly have little support for that :)
Well, that can get tricky. It's certainly possible but you won't like how it will enforce that limit. Rather than return an error code, your process will get a SIGBUS and be killed.
Yes, I would accept that part; the consequence would be that most Nacl programs run, and only those that actually go over the limit fail. Although if it would kill the browser, that is not a good solution either; anyway, I am getting speculative here. (I think this is related to the vm.overcommit_memory=0 described here: http://www.mjmwired.net/kernel/Documentation/vm/overcommit-accounting ? No need to comment, I need to stop at some point.) Thanks again, Milan
-Jeff
Thanks for your help and comments,
milan
-Jeff
>>> let me know if i should run anything else, thanks
>>
>> On my 11.4 64-bit system, I can't reproduce your failure. I see:
>>
>> jeffm@sled2:~> ./a.out
>> Hello world
>> Allocating 29Gb...
>> Success: 0x7f7f79e57000
>> jeffm@sled2:~> uname -a
>> Linux sled2 2.6.37.6-0.9-desktop #1 SMP PREEMPT 2011-10-19 22:33:27 +0200 x86_64 x86_64 x86_64 GNU/Linux
>>
>> -Jeff
>>
>>>>> I am trying to resolve a failure of Google Native Client in Opensuse 64 11.4, discussed in Google Native Client forum:
>>>>>
>>>>> [1] https://groups.google.com/forum/#!topic/native-client-discuss/7DUFfi_BxqM
>>>>>
>>>>> To repeat the issue in a nutshell::
>>>>> ---------------------------------------------
>>>>>
>>>>> Google native client (Chrome) works on recent versions of at least Ubuntu [1], but fails on Opensuse 11.4 (with all latest updates up to Nov 4). This failure can be reproduced in chrome 14, 15, 16 (from http://dl.google.com/linux/chrome/rpm/stable/x86_64)
>>>>>
>>>>> and verified by loading
>>>>>
>>>>> [2] http://www.gonacl.com/dev/demos/sdk_examples/load_progress/load_progress.htm...
>>>>>
>>>>> The problem / question for opensuse (kernel?) ::
>>>>> ------------------------------------------------------------------
>>>>>
>>>>> There is a long discussion in the above thread, to get to the point quickly: The Google guys identified an issue with mmap() with MAP_NORESERVE (see below). They believe it may be a bug or a kernel configuration issue(?)
>>>>>
>>>>> A Chrome Nacl person suggested the following code should print "Success" but it fails in my testing:
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <sys/mman.h>
>>>>>
>>>>> int main(void) {
>>>>>   void *addr;
>>>>>
>>>>>   printf("Hello world\nAllocating 29Gb...\n");
>>>>>   addr = mmap((void *) NULL, 29 * (((size_t) 1) << 30),
>>>>>               PROT_NONE,
>>>>>               MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE,
>>>>>               -1, 0); /* test 29 or other values */
>>>>>   if (MAP_FAILED == addr) {
>>>>>     printf("FAILED\n");
>>>>>   } else {
>>>>>     printf("Success: %p\n", addr);
>>>>>   }
>>>>>   return 0;
>>>>> }
>>>>>
>>>>> This prints FAILED on Opensuse 11.4 64 bit.
>>>>>
>>>>> I did some experiments. On my system:
>>>>>
>>>>> # cat /proc/meminfo
>>>>> MemTotal:     15948428 kB
>>>>> MemFree:      11270612 kB
>>>>> ....
>>>>> CommitLimit:  28945728 kB
>>>>> Committed_AS:  4918284 kB
>>>>> ...
>>>>>
>>>>> From running the test program above, it looks like *CommitLimit* is clearly used as upper limit of mmap(MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE), *no matter what vm.overcommit_memory* flag is used.
>>>>>
>>>>> In concrete terms:
>>>>>
>>>>> mmap((void *) NULL, 29 * (((size_t) 1) << 30), PROT_NONE, MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0); // always FAILS with value of 29 or higher
>>>>>
>>>>> mmap((void *) NULL, 28 * (((size_t) 1) << 30), PROT_NONE, MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0); // always SUCCEEDS with value of 28 or lower
>>>>>
>>>>> No matter what setting of
>>>>> sysctl -w vm.overcommit_memory=0 # or 1 or 2
>>>>>
>>>>> (This seems an Opensuse 11.4 bug in modes 0 and 1, as according to [2] anonymous private readonly should have 0 cost)
>>>>>
>>>>> Any comments or solutions or how to fix this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Milan
>>>>>
>>>>> [1] http://www.mjmwired.net/kernel/Documentation/filesystems/proc.txt
>>>>>
>>>>> [2] http://www.mjmwired.net/kernel/Documentation/vm/overcommit-accounting
>>>>>
>>>>> =======================
>>>>> PS: I am attaching a few pieces of info about my system that may be relevant:
>>>>>
>>>>> The hardware is AMD 4 core AMD Athlon II X4 610e and has 16Gb (sixteen) of memory, running very little (just KDE desktop at this point)
>>>>>
>>>>> # swapon -s
>>>>> Filename   Type       Size      Used  Priority
>>>>> /dev/sda1  partition  20971516  0     -1
>>>>>
>>>>> # cat /proc/sys/vm/overcommit_memory
>>>>> 0
>>>>>
>>>>> # ulimit -a
>>>>> core file size          (blocks, -c) 0
>>>>> data seg size           (kbytes, -d) unlimited
>>>>> scheduling priority             (-e) 0
>>>>> file size               (blocks, -f) unlimited
>>>>> pending signals                 (-i) 123980
>>>>> max locked memory       (kbytes, -l) 64
>>>>> max memory size         (kbytes, -m) 13556224
>>>>> open files                      (-n) 1024
>>>>> pipe size            (512 bytes, -p) 8
>>>>> POSIX message queues     (bytes, -q) 819200
>>>>> real-time priority              (-r) 0
>>>>> stack size              (kbytes, -s) 8192
>>>>> cpu time               (seconds, -t) unlimited
>>>>> max user processes              (-u) 123980
>>>>> virtual memory          (kbytes, -v) 29536000
>>>>> file locks                      (-x) unlimited
>>>>>
>>>>> # df
>>>>> Filesystem  1K-blocks  Used      Available  Use%  Mounted on
>>>>> rootfs      82567856   15558248  62815408   20%   /
>>>>> devtmpfs    7934744    244       7934500    1%    /dev
>>>>> tmpfs       7974212    1592      7972620    1%    /dev/shm
>>>>> /dev/sda2   82567856   15558248  62815408   20%   /
>>>>> /dev/sda3   377510440  90623252  267710692  26%   /home
-- 
Jeff Mahoney
SUSE Labs