All,
I've got a suspected stack corruption, and as it involves an external library, I'm trying to determine who's to blame - my code or the library code. Without exception, the resulting segfault always happens in the external library (libclamav). Equally without exception, if I omit the single call to the library (cl_scanfile()), I see no problems.
I'm not particularly familiar with the libclamav code, nor do I have a desire to become so, so I was hoping to come up with something that says the problem is in libclamav - without diagnosing the problem all the way.
It's a little difficult to reproduce - I have had stress tests run for 24+ hours before it happened, but it has also happened within 30 minutes of starting a test.
/Per Jessen, Zürich
On Monday 24 April 2006 09:13, Per Jessen wrote:
I've got a suspected stack corruption, and as it involves an external library, I'm trying to determine who's to blame - my code or the library code. Without exception, the resulting segfault always happens in the external library (libclamav). Equally without exception, if I omit the single call to the library (cl_scanfile()), I see no problems.
That might easily be a usage problem (who owns what resource, who has to free
allocated memory, ...).
You might want to run your application with "valgrind" and double-check all
problems it reports.
CU
--
Stefan Hundhammer
Stefan Hundhammer wrote:
On Monday 24 April 2006 09:13, Per Jessen wrote:
I've got a suspected stack corruption, and as it involves an external library, I'm trying to determine who's to blame - my code or the library code. Without exception, the resulting segfault always happens in the external library (libclamav). Equally without exception, if I omit the single call to the library (cl_scanfile()), I see no problems.
That might easily be a usage problem (who owns what resource, who has to free allocated memory, ...).
Yeah, I think so too - but wouldn't that show on the heap, rather than in the stack? This has so far always hit the same function, called as part of cl_scanfile(). And always with a very clear stack corruption.
You might want to run your application with "valgrind" and double-check all problems it reports.
Yep, that's the next step - I've already valgrinded a couple of times and fixed a couple of problems, but another round won't hurt :-) Thanks. /Per Jessen, Zürich
On Monday 24 April 2006 14:01, Per Jessen wrote:
That might easily be a usage problem (who owns what resource, who has to free allocated memory, ...).
Yeah, I think so too - but wouldn't that show on the heap, rather than in the stack?
Not necessarily. A dangling pointer can overwrite memory anywhere. A pointer to a local object that is returned from a function, then dereferenced and written to will corrupt your stack for sure.
A common mistake is to release objects too soon. I don't know any more details about your project, but that one happened to me lately, too:
    for ( SomeList::iterator it = myList().begin(); it != myList().end(); ++it ) { ... }
In this example, myList() would return a temporary list. Being sure that both calls to myList() would refer to the same object, I made the mistake of creating two temporary lists, taking an iterator from each and comparing them. But the iterators referred to different objects, so the comparison would only match by coincidence... and writing through that iterator would wreck the stack for sure.
Just to give you an example of how the application can easily mess up its own stack with that kind of thing. ;-)
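Since the application in question is plain C, here is a minimal sketch of the "pointer to a local object" case in C terms. The function names (make_greeting_broken, make_greeting) are made up for illustration and have nothing to do with libclamav. The broken variant is left commented out because using its result is undefined behaviour; the second variant shows one common way around it, where the caller owns the storage.

```c
#include <stdio.h>
#include <string.h>

/*
 * BROKEN, shown only as a comment: buf lives in this function's stack
 * frame, so the returned pointer dangles as soon as the function returns.
 * Writing through it later scribbles over whatever frame is there by then.
 *
 * char *make_greeting_broken(const char *name)
 * {
 *     char buf[64];
 *     snprintf(buf, sizeof buf, "Hello, %s", name);
 *     return buf;
 * }
 */

/*
 * Hypothetical safer variant: the caller supplies the buffer and its size.
 * Returns 0 on success, -1 if the text did not fit (output is truncated).
 */
int make_greeting(char *out, size_t out_len, const char *name)
{
    int n = snprintf(out, out_len, "Hello, %s", name);
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}
```

With this shape the lifetime question ("who owns what resource") is answered by the signature: the buffer lives exactly as long as the caller keeps it.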
This has so far always hit the same function, called as part of cl_scanfile(). And always with a very clear stack corruption.
I have no clue about that lib. Maybe there is a missing initialization or
something like that.
CU
--
Stefan Hundhammer
Stefan Hundhammer wrote:
A common mistake is to release objects too soon. I don't know any more details about your project, but that one happened to me lately, too:
My application is a multi-threaded SMTP-proxy written in C. One thing it does is call clamav to scan emails for viruses.
Just to give you an example of how the application can easily mess up its own stack with that kind of thing. ;-)
Yep, got it.
Does anyone have suggestions as to how to trap a stack corruption? I'm almost certainly asking a naive question - I'm not quite familiar with debugging C on Intel (my normal hunting grounds are assembler on S390).
I'm thinking of maintaining a couple of flag/checksum(ish) variables at each start/end of a stack, and then checking those at entry/exit? Does anything exist that will assist me doing this? (This method has been very successful in catching corruptions in both CICS and TPF.)
The error I'm chasing seems highly intermittent - I can process 100.000s of emails (sample of 1014 emails repeated in multiple threads) over 24+ hours with no problem, but also trigger it within 30mins of starting a test.
/Per Jessen, Zürich
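A hand-rolled version of the flags-at-start-and-end idea could look roughly like the sketch below. All names here (guarded_area, guard_arm, guard_intact, GUARD_MAGIC) are hypothetical, and the layout is an assumption: it only brackets one particular working area with guard words, so it detects overruns of that area and nothing else. GCC's -fstack-protector option is a compiler-inserted variant of the same idea, where the installed compiler supports it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_MAGIC 0xC0DEFACEu   /* arbitrary, easy-to-spot bit pattern */

/* A working area bracketed by two guard words. */
struct guarded_area {
    uint32_t head;       /* guard word before the payload */
    char     data[64];   /* the storage actually used by the code */
    uint32_t tail;       /* guard word after the payload */
};

/* Call at function entry: set both guards and clear the payload. */
void guard_arm(struct guarded_area *g)
{
    g->head = GUARD_MAGIC;
    memset(g->data, 0, sizeof g->data);
    g->tail = GUARD_MAGIC;
}

/* Call at function exit (or around a suspect call): 1 if both guards survive. */
int guard_intact(const struct guarded_area *g)
{
    return g->head == GUARD_MAGIC && g->tail == GUARD_MAGIC;
}
```

Declaring such an area as a local, arming it before the call into the library and checking it right after would at least narrow down when the damage happens, though a miss proves nothing, since the stray write may land elsewhere in the frame.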
On Monday 24 April 2006 3:59 pm, Per Jessen wrote:
Yep, got it.
Does anyone have suggestions as to how to trap a stack corruption? I'm almost certainly asking a naive question - I'm not quite familiar with debugging C on Intel (my normal hunting grounds are assembler on S390).
I'm thinking of maintaining a couple of flag/checksum(ish) variables at each start/end of a stack, and then checking those at entry/exit? Does anything exist that will assist me doing this? (This method has been very successful in catching corruptions in both CICS and TPF.)
The error I'm chasing seems highly intermittent - I can process 100.000s of emails (sample of 1014 emails repeated in multiple threads) over 24+ hours with no problem, but also trigger it within 30mins of starting a test.
For one, make sure that a core file is produced. Then run GDB on the core file. That will point you at the segfault, then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.
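For a long-running daemon, one way to make sure a core file can be produced is to raise the core size limit from inside the program at startup. The following is a minimal sketch using the POSIX getrlimit()/setrlimit() calls; the function name enable_core_dumps is made up. It assumes the hard limit set by the environment is non-zero, otherwise no core file will be written regardless.

```c
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure.
 */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;   /* soft limit may always be raised to the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Called once early in main(), before the worker threads are started, this applies to the whole process. The resulting core file can then be loaded into GDB together with the executable and inspected with bt as described.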
Additionally, one of the most common causes of stack corruption is not so
much malloc-free issues, but indexing beyond the bounds of an array. Lots
of strange things can happen when the stack is corrupted. Note that CICS
and TPF are ancient and have some very good debuggers.
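To illustrate the out-of-bounds case: an unchecked strcpy() into a fixed-size local array is a classic way to write past the end of it and into the rest of the stack frame. A minimal sketch of a checked copy follows; copy_checked is a hypothetical helper, not something taken from the application or from libclamav.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst only if it fits, terminating NUL included.
 * Returns 0 on success, -1 if src is too long; dst is untouched on failure.
 */
int copy_checked(char *dst, size_t dst_len, const char *src)
{
    size_t need = strlen(src) + 1;   /* bytes required, NUL included */

    if (need > dst_len)
        return -1;

    memcpy(dst, src, need);
    return 0;
}
```

In an SMTP proxy, header lines and addresses of attacker-chosen length are the obvious candidates for this kind of check.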
Additionally, Intel has its own compilers and debuggers that you can use
freely for non-commercial use. Their ICC debugger is much more in the line
of the old Unix dbx (actually it is DEC's ladebug). I've found that in
trying to debug C++ inline functions, Intel's IDB is better than gdb.
--
Jerry Feldman
On Monday 24 April 2006 22:22, Jerry Feldman wrote:
For one, make sure that a core file is produced. Then run GDB on the core file. That will point you at the segfault, then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.
You might get some idea just by displaying the contents of the stack. Seeing what it is that has been written there could be a big clue towards tracking down what is writing it.
Anders Johansson wrote:
On Monday 24 April 2006 22:22, Jerry Feldman wrote:
For one, make sure that a core file is produced. Then run GDB on the core file. That will point you at the segfault, then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.
You might get some idea just by displaying the contents of the stack. Seeing what it is that has been written there could be a big clue towards tracking down what is writing it
Yes, I understand that - my problem is that I suspect the corruption is being done by an external library (libclamav), and I'm not sufficiently familiar with that to immediately spot something odd in the stack. /Per Jessen, Zürich
On Monday 24 April 2006 4:37 pm, Per Jessen wrote:
Yes, I understand that - my problem is that I suspect the corruption is being done by an external library (libclamav), and I'm not sufficiently familiar with that to immediately spot something odd in the stack.
That is a very difficult situation when you do not have source code.
BTW: WindRiver (http://www.windriver.com/portal/server.pt) has a set of Open
Source tools that could also help.
The problem is that there are many factors that can corrupt a stack, and
they tend to be hard to find. You will also see different results with
different compilers.
--
Jerry Feldman
Jerry Feldman wrote:
For one, make sure that a core file is produced.
Yep, got it.
Then run GDB on the core file. That will point you at the segfault, then try to do a backtrace (the bt command) from there. Stack corruptions do tend to be messy because you lose context.
Yep, got it - all the back trace shows me is pretty much that I have a stack corruption.
Additionally, one of the most common causes of stack corruption is not so much malloc-free issues, but indexing beyond the bounds of an array. Lots of strange things can happen when the stack is corrupted. Note that CICS and TPF are ancient and have some very good debuggers.
Too true. Actually you just run your TPF machine in SST mode - I'm slowly beginning to dream of doing that in Xen or vmware or one of those. But of course you're right - they're both ancient, although also very much up-to-date. Rumour has it that TPF ran the Olympics webserver during the 2000 Olympics.
Additionally, Intel has its own compilers and debuggers that you can use freely for non-commercial use. Their ICC debugger is much more in the line of the old Unix dbx (actually it is DEC's ladebug). I've found that in trying to debug C++ inline functions, Intel's IDB is better than gdb.
I guess I'll have to get acquainted - I'm much more at home with IPCS, a system trace and such. And my age is showing. /Per Jessen, Zürich
On Monday 24 April 2006 4:32 pm, Per Jessen wrote:
Too true. Actually you just run your TPF machine in SST mode - I'm slowly beginning to dream of doing that in Xen or vmware or one of those. But of course you're right - they're both ancient, although also very much up-to-date. Rumour has it that TPF ran the Olympics webserver during the 2000 Olympics.
Actually, rather than ancient, I should have used the word mature.
There are a lot of tools available on Linux.
The debuggers have some options such as testing on an event.
In general something like this:
when at <somewhere> if foo != bar
break or trace
--
Jerry Feldman
On Monday 24 April 2006 21:59, Per Jessen wrote:
The error I'm chasing seems highly intermittent - I can process 100.000s of emails (sample of 1014 emails repeated in multiple threads) over 24+ hours with no problem, but also trigger it within 30mins of starting a test.
Intuitively, I'd say something isn't checking return values from memory allocation functions.
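An intermittent failure under load would fit that: an allocation that only occasionally fails, followed by a write through the unchecked NULL (or a length computed from it). A minimal sketch of the checked pattern, with a made-up helper name dup_checked:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate a string, checking the allocation.
 * Returns NULL if malloc() fails; the caller must test for that
 * and must free() the result when done.
 */
char *dup_checked(const char *s)
{
    size_t n = strlen(s) + 1;   /* NUL included */
    char *p = malloc(n);

    if (p == NULL)
        return NULL;            /* report the failure instead of writing through NULL */

    memcpy(p, s, n);
    return p;
}
```

The same check applies to calloc(), realloc() and strdup(), and to any library call that is documented to return NULL on allocation failure.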
participants (4)
- Anders Johansson
- Jerry Feldman
- Per Jessen
- Stefan Hundhammer