diagnosing a stack corruption


Per Jessen

24 Apr 2006

09:13

All, I've got a suspected stack corruption, and as it involves an external library, I'm trying to determine who's to blame - my code or the library code. Without exception, the resulting segfault always happens in the external library (libclamav). Equally without exception, if I omit the single call to the library (cl_scanfile()), I see no problems. I'm not particularly familiar with the libclamav code, nor do I have a desire to become so, so I was hoping to come up with something that says the problem is in libclamav - without diagnosing the problem all the way. It's a little difficult to reproduce - I have had stress tests run for 24+ hours before it happened, but it has also happened within 30 minutes of starting a test. /Per Jessen, Zürich


Stefan Hundhammer

24 Apr

12:49

New subject: [suse-programming-e] diagnosing a stack corruption

On Monday 24 April 2006 09:13, Per Jessen wrote:

...

I've got a suspected stack corruption, and as it involves an external library, I'm trying to determine who's to blame - my code or the library code. Without exception, the resulting segfault always happens in the external library (libclamav). Equally without exception, if I omit the single call to the library (cl_scanfile()), I see no problems.

That might easily be a usage problem (who owns what resource, who has to free allocated memory, ...). You might want to run your application with "valgrind" and double-check all problems it reports. CU -- Stefan Hundhammer Penguin by conviction. YaST2 Development SUSE Linux Products GmbH Nuernberg, Germany

Per Jessen

14:01


Stefan Hundhammer wrote:

...

On Monday 24 April 2006 09:13, Per Jessen wrote:

...
I've got a suspected stack corruption, and as it involves an external library, I'm trying to determine who's to blame - my code or the library code. Without exception, the resulting segfault always happens in the external library (libclamav). Equally without exception, if I omit the single call to the library (cl_scanfile()), I see no problems.

That might easily be a usage problem (who owns what resource, who has to free allocated memory, ...).

Yeah, I think so too - but wouldn't that show on the heap, rather than in the stack? This has so far always hit the same function, called as part of cl_scanfile(). And always with a very clear stack corruption.

...

You might want to run your application with "valgrind" and double-check all problems it reports.

Yep, that's the next step - I've already valgrinded a couple of times and fixed a couple of problems, but another round won't hurt :-) Thanks. /Per Jessen, Zürich

Stefan Hundhammer

14:59


On Monday 24 April 2006 14:01, Per Jessen wrote:

...

...
That might easily be a usage problem (who owns what resource, who has to free allocated memory, ...).

Yeah, I think so too - but wouldn't that show on the heap, rather than in the stack?

Not necessarily. A dangling pointer can overwrite memory anywhere. A pointer to a local object that is returned from a function, then dereferenced and written to, will corrupt your stack for sure. A common mistake is to release objects too soon. I don't know any more details about your project, but that one happened to me lately, too:

    for ( SomeList::iterator it = myList().begin(); it != myList().end(); ++it ) { ... }

In this example, myList() returns a temporary list. Assuming that both calls to myList() would refer to the same object, I made the mistake of creating two temporary lists, taking an iterator from each, and comparing them. But the iterators refer to different objects, so the comparison would only match by coincidence... and writing through that iterator would wreck the stack for sure. Just to give you an example of how the application can easily mess up its own stack with that kind of thing. ;-)
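A rough C analogue of the "pointer to a local object" case, as a minimal sketch rather than anything taken from the application in question. The function name make_label() and the "msg-%d" format are hypothetical. The broken variant is shown only in a comment; the compiled code shows one common way around it, where the caller owns the storage and passes its size.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/*
 * BROKEN pattern (shown as a comment only, do not use):
 *
 *     char *make_label(int id)
 *     {
 *         char buf[32];                 // lives in this stack frame
 *         sprintf(buf, "msg-%d", id);
 *         return buf;                   // frame is gone after return
 *     }
 *
 * Writing through the returned pointer later scribbles over whatever
 * stack frame happens to occupy that memory at the time.
 */

/* Safer sketch: the caller supplies the buffer and its size.
 * Returns 0 on success, -1 if the label did not fit or formatting failed. */
static int make_label(char *out, size_t out_size, int id)
{
    assert(out != NULL && out_size > 0);

    int n = snprintf(out, out_size, "msg-%d", id);
    if (n < 0 || (size_t)n >= out_size)
        return -1;   /* error or truncated: tell the caller */
    return 0;
}
```

With this shape the lifetime of the buffer is decided by the caller, so there is no pointer left behind into a frame that no longer exists.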

...

This has so far always hit the same function, called as part of cl_scanfile(). And always with a very clear stack corruption.

I have no clue about that lib. Maybe there is a missing initialization or something like that. CU -- Stefan Hundhammer Penguin by conviction. YaST2 Development SUSE Linux Products GmbH Nuernberg, Germany

Per Jessen

21:59


Stefan Hundhammer wrote:

...

A common mistake is to release objects too soon. I don't know any more details about your project, but that one happened to me lately, too:

My application is a multi-threaded SMTP proxy written in C. One thing it does is call clamav to scan emails for viruses.

...

Just to give you an example of how the application can easily mess up its own stack with that kind of thing. ;-)

Yep, got it. Does anyone have suggestions as to how to trap a stack corruption? I'm almost certainly asking a naive question - I'm not quite familiar with debugging C on Intel (my normal hunting grounds are assembler on S390). I'm thinking of maintaining a couple of flag/checksum(ish) variables at each start/end of a stack, and then checking those at entry/exit? Does anything exist that will assist me in doing this? (This method has been very successful in catching corruptions in both CICS and TPF.) The error I'm chasing seems highly intermittent - I can process 100,000s of emails (sample of 1014 emails repeated in multiple threads) over 24+ hours with no problem, but also trigger it within 30 minutes of starting a test. /Per Jessen, Zürich
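The flags-at-start-and-end idea can be sketched by hand in C roughly as follows. This is only an illustration under assumptions: the struct name guarded_buf, the magic value and the 64-byte payload are all hypothetical, and a guard like this only catches overruns that run linearly out of the buffer into a guard word. Compilers can also insert similar canaries automatically (GCC's -fstack-protector option, for instance).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GUARD_MAGIC 0xDEADC0DEu   /* arbitrary, hypothetical pattern */

/* A working buffer fenced by a guard word on each side. */
struct guarded_buf {
    uint32_t head;       /* guard before the payload */
    char     data[64];   /* the buffer actually handed out */
    uint32_t tail;       /* guard after the payload */
};

/* Call at function entry, before the buffer is used. */
static void guard_init(struct guarded_buf *g)
{
    assert(g != NULL);
    g->head = GUARD_MAGIC;
    memset(g->data, 0, sizeof g->data);
    g->tail = GUARD_MAGIC;
}

/* Call at function exit (and after suspect calls).
 * Returns 1 if both guard words are intact, 0 if either was overwritten. */
static int guard_ok(const struct guarded_buf *g)
{
    assert(g != NULL);
    return g->head == GUARD_MAGIC && g->tail == GUARD_MAGIC;
}
```

A caller would declare a struct guarded_buf as a local, call guard_init() on entry, pass g.data to the code under suspicion, and abort or log as soon as guard_ok() returns 0, which narrows down when the overwrite happened.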

Jerry Feldman

22:22


On Monday 24 April 2006 3:59 pm, Per Jessen wrote:

...

Yep, got it.

Does anyone have suggestions as to how to trap a stack corruption? I'm almost certainly asking a naive question - I'm not quite familiar with debugging C on Intel (my normal hunting grounds are assembler on S390).

I'm thinking of maintaining a couple of flag/checksum(ish) variables at each start/end of a stack, and then checking those at entry/exit? Does anything exist that will assist me in doing this? (This method has been very successful in catching corruptions in both CICS and TPF.)

The error I'm chasing seems highly intermittent - I can process 100,000s of emails (sample of 1014 emails repeated in multiple threads) over 24+ hours with no problem, but also trigger it within 30 minutes of starting a test.

For one, make sure that a core file is produced. Then run GDB on the core file. That will point you at the segfault; then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.
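One way to make sure a core file can be written is to raise the core size limit from inside the program itself. The sketch below uses the POSIX getrlimit()/setrlimit() calls; the helper name enable_core_dumps() is hypothetical, and whether a core file actually appears still depends on system settings outside the program's control.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling this once early in main() means a long-running stress test does not depend on the shell's limits being set correctly when the process was started.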

Additionally, one of the most common causes of stack corruption is not so much malloc-free issues, but indexing beyond the bounds of an array. Lots of strange things can happen when the stack is corrupted. Note that CICS and TPF are ancient and have some very good debuggers. Additionally, Intel has its own compilers and debuggers that you can use freely for non-commercial use. Their ICC debugger is much more in the line of the old Unix dbx (actually it is DEC's ladebug). I've found that in trying to debug C++ inline functions, Intel's IDB is better than gdb. -- Jerry Feldman Boston Linux and Unix user group http://www.blu.org PGP key id:C5061EA9 PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9
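To illustrate the out-of-bounds point: a local array sits in the stack frame, so an unchecked copy into it can overwrite neighbouring locals or the return address. A minimal sketch of a copy that refuses to write past the end of the destination follows; the helper name copy_checked() is hypothetical and not part of any library mentioned in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the NUL-terminated string src into dst only if it fits,
 * terminator included. Returns 0 on success, -1 if it would overflow
 * (in which case dst is left untouched). */
static int copy_checked(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t need = strlen(src) + 1;   /* bytes required, with the NUL */
    if (need > dst_size)
        return -1;                   /* would run past the array */

    memcpy(dst, src, need);
    return 0;
}
```

The design choice here is to fail loudly instead of truncating, so that an oversized input shows up as an error return at the call site, not as a corrupted frame much later.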

Anders Johansson

22:28


On Monday 24 April 2006 22:22, Jerry Feldman wrote:

...

For one, make sure that a core file is produced. Then run GDB on the core file. That will point you at the segfault, then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.

You might get some idea just by displaying the contents of the stack. Seeing what has been written there could be a big clue towards tracking down what is writing it.

Per Jessen

22:37


Anders Johansson wrote:

...

On Monday 24 April 2006 22:22, Jerry Feldman wrote:

...
For one, make sure that a core file is produced. Then run GDB on the core file. That will point you at the segfault, then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.

You might get some idea just by displaying the contents of the stack. Seeing what it is that has been written there could be a big clue towards tracking down what is writing it

Yes, I understand that - my problem is that I suspect the corruption is being done by an external library (libclamav), and I'm not sufficiently familiar with that to immediately spot something odd in the stack. /Per Jessen, Zürich

Jerry Feldman

22:51


On Monday 24 April 2006 4:37 pm, Per Jessen wrote:

...

Yes, I understand that - my problem is that I suspect the corruption is being done by an external library (libclamav), and I'm not sufficiently familiar with that to immediately spot something odd in the stack.

That is a very difficult situation when you do not have source code.

BTW: WindRiver (http://www.windriver.com/portal/server.pt) has a set of Open Source tools that could also help. The problem is that there are many factors that can corrupt a stack, and they tend to be hard to find. You will also see different results with different compilers. -- Jerry Feldman Boston Linux and Unix user group http://www.blu.org PGP key id:C5061EA9 PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9

Per Jessen

22:32


Jerry Feldman wrote:

...

For one, make sure that a core file is produced.

Yep, got it.

...

Then run GDB on the core file. That will point you at the segfault, then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.

Yep, got it - all the back trace shows me is pretty much that I have a stack corruption.

...

Additionally, one of the most common causes of stack corruption is not so much malloc-free issues, but indexing beyond the bounds of an array. Lots of strange things can happen when the stack is corrupted. Note that CICS and TPF are ancient and have some very good debuggers.

Too true. Actually you just run your TPF machine in SST mode - I'm slowly beginning to dream of doing that in Xen or vmware or one of those. But of course you're right - they're both ancient, although also very much up-to-date. Rumour has it that TPF ran the Olympics webserver during the 2000 Olympics.

...

Additionally, Intel has its own compilers and debuggers that you can use freely for non-commercial use. Their ICC debugger is much more in the line of the old Unix dbx (actually it is DEC's ladebug). I've found that in trying to debug C++ inline functions, Intel's IDB is better than gdb.

I guess I'll have to get acquainted - I'm much more at home with IPCS, a system trace and such. And my age is showing. /Per Jessen, Zürich

Jerry Feldman

22:44


On Monday 24 April 2006 4:32 pm, Per Jessen wrote:

...

Too true. Actually you just run your TPF machine in SST mode - I'm slowly beginning to dream of doing that in Xen or vmware or one of those. But of course you're right - they're both ancient, although also very much up-to-date. Rumour has it that TPF ran the Olympics webserver during the 2000 Olympics.

Actually, rather than ancient, I should have used the word mature.

There are a lot of tools available on Linux. The debuggers have some options such as testing on an event. In general something like this:

    when at <somewhere> if foo != bar
        break or trace

-- Jerry Feldman Boston Linux and Unix user group http://www.blu.org PGP key id:C5061EA9 PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9

Anders Johansson

22:25


On Monday 24 April 2006 21:59, Per Jessen wrote:

...

The error I'm chasing seems highly intermittent - I can process 100,000s of emails (sample of 1014 emails repeated in multiple threads) over 24+ hours with no problem, but also trigger it within 30 minutes of starting a test.

Intuitively, I'd say something isn't checking return values from memory allocation functions.
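As an illustration of that suspicion: under memory pressure in a long stress test, malloc() can return NULL, and code that uses the result unchecked will fail only occasionally. A minimal sketch of an allocation whose result is checked before use follows; the helper name dup_buffer() is hypothetical and is not a libclamav function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Duplicate len bytes from src into a fresh NUL-terminated buffer.
 * Returns NULL if the size would overflow or the allocation fails;
 * the caller must check for NULL and free() the result when done. */
static char *dup_buffer(const char *src, size_t len)
{
    assert(src != NULL);

    if (len == SIZE_MAX)          /* len + 1 would wrap around to 0 */
        return NULL;

    char *copy = malloc(len + 1);
    if (copy == NULL)             /* allocation failed: report, do not write */
        return NULL;

    memcpy(copy, src, len);
    copy[len] = '\0';
    return copy;
}
```

The point of the sketch is only that every path which can fail returns before any write happens, so a failed allocation becomes a visible error instead of a stray store.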
