[opensuse] valgrind - swapping with SSD locks machine during swap - bug or feature?
All,

I think it was James that had a similar question, but at the time I had no context to compare it to. But I had an occurrence last night where running valgrind on a simple bit of code that allocated 4.5G caused my box to swap.

The machine was in hardlock during swap -- no mouse pointer movement, no keyboard control -- no nothing. I was contemplating a hard reset while I sat waiting... and waiting... and waiting (probably no more than 3 minutes -- but that provides a lot of time to think about whether you need to nuke it or not).

I have had boxes swap hundreds and hundreds of times over the decades, but never had a box exhibit this "hard-lock" behavior before. I know valgrind sets up its own protected environment to do its work in, but I don't see why its swapping would have any worse or different impact on the system than anything else swapping (I just don't know -- but wouldn't expect it to).

Is this a bug, a feature, a valgrind special environment issue? If I recall the earlier thread (there have been a couple - Why swap?, etc.. recently), James or whoever it was was seeing similar unresponsiveness during the swap.

I ran into this answering a question on StackOverflow. With 8G of RAM total, allocating 4.5G while you have a full desktop with a dozen or so tabs in Firefox, etc. will cause your box to swap. Simple example code posted at:

https://paste.opensuse.org/3916734

Why would the system appear locked while valgrind was swapping?

-- David C. Rankin, J.D.,P.E.
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On 15/04/2020 08.48, David C. Rankin wrote:
> All,
> ...
404 page not found.
> Why would the system appear locked while valgrind was swapping?
Sorry, no idea.

-- Cheers / Saludos, Carlos E. R. (from 15.1 x86_64 at Telcontar)
On 04/15/2020 03:38 AM, Carlos E. R. wrote:
> On 15/04/2020 08.48, David C. Rankin wrote:
>> All,
>> ...
> 404 page not found.
>> Why would the system appear locked while valgrind was swapping?
> Sorry, no idea.

Huh?? (Oh, I missed the '3' at the end - my bad)

Apr 15 01:41:22 - "malloc 1500x1500x500" malloc_1500x1500x500_ann.c
Pasted as:
https://susepaste.org/39167343
https://paste.opensuse.org/39167343
expires: Wed May 13 01:41:27 CDT 2020

-- David C. Rankin, J.D.,P.E.
15.04.2020 09:48, David C. Rankin wrote:
> All,
>
> I think it was James that had a similar question, but at the time I had no context to compare it to. But I had an occurrence last night where running valgrind on a simple bit of code that allocated 4.5G caused my box to swap.
>
> The machine was in hardlock during swap -- no mouse pointer movement, no keyboard control -- no nothing. I was contemplating a hard reset while I sat waiting... and waiting... and waiting (probably no more than 3 minutes -- but that provides a lot of time to think about whether you need to nuke it or not)

Try increasing /proc/sys/vm/dirty_ratio to a very large value, e.g. 95. Do you still observe these stalls? Do not leave it like that in normal use.
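For anyone wanting to repeat the dirty_ratio experiment suggested above, here is a minimal sketch of reading the current value and swapping in a temporary one so the old value can be restored afterwards. The helper names (sysctl_path, read_sysctl, write_sysctl) are made up for this illustration; writing to the real /proc/sys needs root, and the demo at the bottom only reads.

```python
from pathlib import Path


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name such as 'vm.dirty_ratio' to its file under root."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> int:
    """Return the current integer value of a sysctl."""
    return int(sysctl_path(name, root).read_text().strip())


def write_sysctl(name: str, value: int, root: str = "/proc/sys") -> int:
    """Write a new value and return the previous one so it can be restored."""
    previous = read_sysctl(name, root)
    sysctl_path(name, root).write_text(f"{value}\n")
    return previous


if __name__ == "__main__" and sysctl_path("vm.dirty_ratio").exists():
    # Read-only demo; call write_sysctl("vm.dirty_ratio", 95) as root for the test
    # and write the returned value back when done.
    print("vm.dirty_ratio =", read_sysctl("vm.dirty_ratio"))
```

The root parameter exists only so the helpers can be pointed at a scratch directory instead of the live /proc/sys.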
On 04/15/2020 11:57 PM, Andrei Borzenkov wrote:
>> The machine was in hardlock during swap -- no mouse pointer movement, no keyboard control -- no nothing. I was contemplating a hard reset while I sat waiting... and waiting... and waiting (probably no more than 3 minutes -- but that provides a lot of time to think about whether you need to nuke it or not)
>
> Try increasing /proc/sys/vm/dirty_ratio to a very large value, e.g. 95. Do you still observe these stalls?
Set to 95 and testing again, same hardlock 2 min. 5 sec. no mouse, no keyboard no nothing when it reached line 122 of output, e.g.

$ valgrind ./bin/malloc_1500x1500x500_ann
==20384== Memcheck, a memory error detector
==20384== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==20384== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==20384== Command: ./bin/malloc_1500x1500x500_ann
==20384==
pointers allocated
all allocated
setting:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
<snip>
1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229
<hardlock swapping - 2:05>
...
freeing memory
==20384==
==20384== HEAP SUMMARY:
==20384== in use at exit: 0 bytes in 0 blocks
==20384== total heap usage: 1,502 allocs, 1,502 frees, 4,500,013,024 bytes allocated
==20384==
==20384== All heap blocks were freed -- no leaks are possible
==20384==
==20384== For counts of detected and suppressed errors, rerun with: -v
==20384== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
> Do not leave it like that in normal use.
Restored to 20.

Why do you think this swap appears to hardlock? Does valgrind (by virtue of the low-level environment it is managing) essentially take 100% of the resources from user-space when it starts swapping? I only used roughly 1/4 of the available swap -- and there is no reason it should take 2 minutes to write 500M to swap on SSD?

I have the following set in /etc/sysctl.conf if that makes a difference:

vm.swappiness = 10

Weird issue. Run without valgrind, there is no swap or slowdown, e.g.

$ time ./bin/malloc_1500x1500x500_ann
pointers allocated
all allocated
setting:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
<snip>
1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499
freeing memory

real 0m1.520s
user 0m0.605s
sys 0m0.910s

Any other ideas what is causing this?

-- David C. Rankin, J.D.,P.E.
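As a sanity check on the HEAP SUMMARY figures above, the arithmetic can be reproduced under a few assumptions that are NOT confirmed anywhere in the thread (the paste has expired): 1500 separately allocated rows of 1500 x 500 four-byte elements, one array of 1500 eight-byte pointers, and one 1024-byte stdio buffer allocated by the C library for terminal output. Under those assumptions the totals match what valgrind reported.

```python
# All sizes below are assumptions inferred from the program name and the
# valgrind HEAP SUMMARY; the original source is not available in the thread.
ROWS = 1500                  # assumed: one allocation per row ("setting: 0 ... 1499")
ROW_BYTES = 1500 * 500 * 4   # assumed: 1500 x 500 elements of 4 bytes each
POINTER_BYTES = 8            # assumed: 64-bit pointers (x86_64)
STDIO_BUFFER_BYTES = 1024    # assumed: stdout buffer allocated by the C library

# rows + pointer array + stdio buffer
alloc_count = ROWS + 1 + 1
total_bytes = ROWS * ROW_BYTES + ROWS * POINTER_BYTES + STDIO_BUFFER_BYTES

print(f"{alloc_count:,} allocs, {total_bytes:,} bytes")
```

If the layout really is as assumed, the payload alone is 4.5 GB before whatever extra memory valgrind needs for its own bookkeeping, which fits the observation that the program only swaps when run under valgrind.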
16.04.2020 09:35, David C. Rankin wrote:
> Why do you think this swap appears to hardlock?
When the amount of dirty memory exceeds the threshold, every process that allocates memory goes into direct reclaim mode, i.e. it synchronously tries to free enough used pages to satisfy its request. While a process is performing reclaim, it cannot really do anything else, which looks like a hard lock. Your program dirties 4.5GB, so any process needing more memory has to wait until dirty pages are written to swap. As was mentioned in another reply, valgrind seems to need a lot of memory for itself, which is likely why it provokes this effect.

I had similar effects with 5.3 and 5.4 kernels. As soon as even a minimal amount of swap was allocated (several megabytes), I got regular, frequent (every several minutes) stalls lasting several seconds. It was most annoying when watching video. Kernel 5.5 seems better; at least I can watch a film with at most one hiccup. I suppose some kernel algorithms dealing with memory reclaim changed and caused inefficient scanning with long delays.
> Does valgrind (by virtue of the low-level environment it is managing), essentially take 100% of the resources from user-space when it starts swapping?
>
> I only used roughly 1/4 of the available swap -- and there is no reason it should take 2 minutes to write 500M to swap on SSD?
Again, the problem may not be the writing itself but scanning the page lists to find suitable candidates to write out. You have a lot of dirty memory and therefore quite long page lists; if every process has to scan the whole list, and does so under mutual locking, that could easily become a bottleneck.
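One way to check whether direct reclaim is actually involved during such a stall is to compare the kernel's reclaim and swap counters before and after it. The sketch below parses text in the /proc/vmstat format and reports how much selected counters moved. The counter names in WATCHED are an assumption: they exist on many recent kernels but vary between versions, so missing names are silently skipped. The function names are made up for this illustration.

```python
import os

# Assumed counter names; adjust to whatever your kernel's /proc/vmstat provides.
WATCHED = ("pgscan_direct", "pgsteal_direct", "pgscan_kswapd", "pswpin", "pswpout")


def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines (the /proc/vmstat format) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def delta(before: dict, after: dict, keys=WATCHED) -> dict:
    """Return how much each watched counter grew between two snapshots."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}


def snapshot(path: str = "/proc/vmstat") -> dict:
    """Read and parse the current counters."""
    with open(path) as fh:
        return parse_vmstat(fh.read())


if __name__ == "__main__" and os.path.exists("/proc/vmstat"):
    current = snapshot()
    print({k: current[k] for k in WATCHED if k in current})
```

Taking one snapshot() before starting the valgrind run and another after the stall, then passing both to delta(), would show whether pgscan_direct grew sharply (pages scanned by the allocating processes themselves) compared with pgscan_kswapd (pages scanned by the background reclaim thread).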
David C. Rankin wrote:
> Is this a bug, a feature, a valgrind special environment issue? If I recall the earlier thread (there have been a couple - Why swap?, etc.. recently), James or whoever it was was seeing similar unresponsiveness during the swap.

I think it might have been me. I ended up replacing the drive, which seemed to improve things. But since then I have switched to a different computer and saw the same unresponsiveness yesterday. I had Google Chrome running.

One thing I didn't check and wished I had (on that old drive) was the SMART data to see if there were read errors.
Richmond wrote:
> One thing I didn't check and wished I had (on that old drive) was the SMART data to see if there were read errors.
Looking at this current machine I see these:

smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors
smartd[1093]: Device: /dev/sda [SAT], 36223 Offline uncorrectable sectors (changed +1)
smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors
smartd[1093]: Device: /dev/sda [SAT], 36223 Offline uncorrectable sectors
smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors
smartd[1093]: Device: /dev/sda [SAT], 36226 Offline uncorrectable sectors (changed +3)
smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors
smartd[1093]: Device: /dev/sda [SAT], 36226 Offline uncorrectable sectors
smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors
smartd[1093]: Device: /dev/sda [SAT], 36226 Offline uncorrectable sectors

I can't check yesterday though; journalctl says there is no persistent journal.
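To see at a glance whether the sector counts in log lines like these are still growing, they can be pulled apart with a small parser. This is only a sketch: the regular expression is written against the exact lines quoted above and may not match other smartd message variants, and the function names are made up for this illustration.

```python
import re

# Pattern modelled on the smartd lines quoted in this thread (an assumption,
# not a complete description of smartd's log format).
SMARTD_LINE = re.compile(
    r"Device: (?P<device>\S+) \[\w+\], (?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable) sectors"
    r"(?: \(changed (?P<change>[+-]\d+)\))?"
)


def parse_smartd(lines):
    """Return (device, kind, count, change) tuples for every matching line."""
    events = []
    for line in lines:
        m = SMARTD_LINE.search(line)
        if m:
            change = int(m["change"]) if m["change"] else 0
            events.append((m["device"], m["kind"], int(m["count"]), change))
    return events


def latest_counts(events):
    """Keep only the most recent count seen for each (device, kind) pair."""
    latest = {}
    for device, kind, count, _change in events:
        latest[(device, kind)] = count
    return latest


if __name__ == "__main__":
    sample = [
        "smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors",
        "smartd[1093]: Device: /dev/sda [SAT], 36226 Offline uncorrectable sectors (changed +3)",
    ]
    for event in parse_smartd(sample):
        print(event)
```

Fed with the output of journalctl or a syslog file, summing the change field per device gives the growth over the logged period.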
On 16/04/2020 11.25, Richmond wrote:
> Looking at this current machine I see these:
>
> smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors
> smartd[1093]: Device: /dev/sda [SAT], 36223 Offline uncorrectable sectors (changed +1)

That's VERY bad. Replace ASAP. Like this minute.

-- Cheers / Saludos, Carlos E. R. (from 15.1 x86_64 at Telcontar)
On 16/04/2020 at 11:58, Carlos E. R. wrote:
> On 16/04/2020 11.25, Richmond wrote:
>> Looking at this current machine I see these:
>>
>> smartd[1093]: Device: /dev/sda [SAT], 16075 Currently unreadable (pending) sectors
>> smartd[1093]: Device: /dev/sda [SAT], 36223 Offline uncorrectable sectors (changed +1)
>
> That's VERY bad. Replace ASAP. Like this minute.

On an SSD? You are lucky if you can still back up your data :-(

jdd
-- http://dodin.org
jdd@dodin.org wrote:
> On an SSD? You are lucky if you can still back up your data :-(
>
> jdd

It's a rusty spinning disk. I ran SpinRite surface checks on it and there were no problems. It's been like that for years.
On 16/04/2020 at 15:51, Richmond wrote:
> It's a rusty spinning disk. I ran SpinRite surface checks on it and there were no problems. It's been like that for years.

Oh, yes, SMART for spinning disks is sometimes curious. I was referring to the subject ("with SSD").

jdd
-- http://dodin.org
On 4/15/20 8:48 AM, David C. Rankin wrote:
> I ran into this answering a question on StackOverflow. With 8G of RAM total allocating 4.5 while you have a full desktop with a dozen or so tabs in Firefox, etc. will cause your box to swap. Simple example code posted at:
>
> https://paste.opensuse.org/39167343
>
> Why would the system appear locked while valgrind was swapping?
You have very little RAM to spare 4 or 5G for your program, never mind valgrind, which will allocate a lot more address space on top. This is not a feature of valgrind; it is more the way you use the system.

I don't know exactly what the answer is, and I've run into your problem a few times as well. FS cache pressure can also make the system unresponsive while the FS cache is written out to the actual disk. Imagine 25G of FS cache that starts to flush to disk. In the "very old days", the recommendation would have been to check that your disk is using DMA transfers instead of interrupts, but today interactivity can at times be worse than on old Solaris with 64MB and 2G swap.

I've had everything freeze on my laptop because of memory constraints. No keyboard or mouse, but the webcam and the audio device kept on humming along. So, what's going on?

The answer could be to use cgroups to prioritize desktop components and keep free pages available while other (memory-hungry) tasks get reduced priority and take the impact of swap. But someone has to test this.

- Adam

PS. I'm not an expert here, just my thoughts on the subject as an interested party.
On 04/16/2020 05:02 AM, Adam Majer wrote:
> I've had everything freeze on my laptop because of memory constraints. No keyboard, or mouse, but the webcam and the audio device kept on humming along. So, what's going on?
>
> The answer could be to use cgroups to prioritize desktop components and keep free pages available while other (memory hungry) tasks get reduced priority and get impacted by swap. But someone has to test this.
>
> - Adam
>
> PS. I'm not an expert here, just my thoughts on the subject as an interested party.
Nor I, but thank you; that adds to the body of knowledge and areas to investigate. It's more a curiosity about why. I think Andrei has put his finger on it: programs have to start scanning through the page lists while reclaiming memory. I guess this time, since I was writing to more than 1/2 the available RAM with a full desktop running, when it hit the swap condition it just lasted much, much longer than I have ever seen before (ever -- going back to 1999? 2000?)

-- David C. Rankin, J.D.,P.E.
David C. Rankin wrote:
> Is this a bug, a feature, a valgrind special environment issue? If I recall the earlier thread (there have been a couple - Why swap?, etc.. recently), James or whoever it was was seeing similar unresponsiveness during the swap.
It's not valgrind-specific. I had similar issues doing image processing, when the application would request more memory than the computer physically had and then work on all that data. That swapped out the whole GUI etc., making the computer 'dead' until the processing was finished (some 30-40 minutes!); then it ran as before....

My solution (posted here, too) was to use cgroups to limit the max physical memory a single application can use, so enough is left for the GUI (I used physical - 2GB).
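The "physical - 2GB" rule above can be sketched as follows. This is not the setup that was posted to the list earlier (it is not reproduced in this thread); it is one possible approach, assuming a systemd system where the memory controller is usable through systemd-run and the MemoryMax property (cgroup v2). The function names are made up for this illustration, and the script only prints the command instead of running it.

```python
import os
import shlex

GIB = 1024 ** 3


def mem_total_bytes(meminfo_text: str) -> int:
    """Extract MemTotal from /proc/meminfo-style text (the value is given in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def memory_cap(total_bytes: int, reserve_bytes: int = 2 * GIB) -> int:
    """Physical memory minus a reserve kept free for the desktop."""
    cap = total_bytes - reserve_bytes
    if cap <= 0:
        raise ValueError("reserve is larger than physical memory")
    return cap


def capped_command(argv, cap_bytes: int):
    """Wrap a command so it runs in a scope with a hard memory limit (assumed setup)."""
    return ["systemd-run", "--user", "--scope", "-p", f"MemoryMax={cap_bytes}", *argv]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        cap = memory_cap(mem_total_bytes(fh.read()))
    print(shlex.join(capped_command(["valgrind", "./bin/malloc_1500x1500x500_ann"], cap)))
```

With such a limit the capped process is pushed to swap once it reaches the cap while the reserve remains available to the GUI, so the heavy job gets slower instead of the whole desktop. Whether an unprivileged user scope may set a memory limit depends on how cgroups are configured on the machine.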
participants (7)
- Adam Majer
- Andrei Borzenkov
- Carlos E. R.
- David C. Rankin
- jdd@dodin.org
- Peter Suetterlin
- Richmond