A C++ multi-threaded application crashed: how to find the problem with GDB and fix it

Mutexes, condition variables, semaphores, and the other facilities of the pthread library on Linux are not hard to understand, but they are easy to forget. A real failure in a multi-threaded application leaves a much deeper impression of these concepts and of how to apply them.

We have an application:

1. It has 48 threads, each of which establishes an SSL session to a remote host through an SDK built on OpenSSL, so every thread has a SessionManager object to manage its session. Session creation returns a pointer, which can then be used to suspend, resume, stop, or delete the session.

2. Other threads share the session and perform the same operations on the session pointer variable.

The code for resuming a session, SessionManager::resume():

void SessionManager::resume() {
    if ( session != NULL ) {
        if ( !isEndSession() ) {
            log->notice("session is not NULL and session status is not End");
            lea_session_resume(session);
        } else {
            log->notice("session is not NULL, but session status is End");
        }
    } else {
        log->notice("session is NULL, not call lea_session_resume");
    }
}

After the program had been running 24x7 for a year at a customer site, it crashed. To find the root cause, we turned to the core dump to see where in the source code the crash happened.
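A core file is only written if the process's core size limit allows it. As a minimal sketch, assuming a Linux/POSIX system, a program can raise its own soft limit at startup. The helper name enableCoreDumps is hypothetical and is not part of the original application. Whether and where the file is finally written also depends on system settings that the original post does not describe.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE

// Hypothetical helper (not from the original application): raise the soft
// core-file size limit to the hard limit so that a crash can leave a core file.
// Returns true on success.
bool enableCoreDumps() {
    struct rlimit limit;
    if (getrlimit(RLIMIT_CORE, &limit) != 0) {
        return false;
    }
    // An unprivileged process may raise its soft limit up to the hard limit.
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

Calling such a helper once at the start of main() would be enough; if the hard limit itself is 0, no core file can be produced without changing the limit from outside the process.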

The following steps were used:

1. Create a directory and copy the core file and the executable into it: core.20554, multilea

2. Start gdb: gdb multilea core.20554

3. In gdb, get the stack trace: (gdb) bt

4. The stack trace shows that the last call into our own code is in frame 4, so select that frame: (gdb) f 4

5. Now print the current object's member variables: (gdb) p *this

6. Inspect the member variable values. To our surprise: sessionEnd = true

But the resume() code shows that lea_session_resume(session) can be reached only when sessionEnd is false. So sessionEnd must have changed between the if() check and the call to lea_session_resume(session): there is a race condition in which another thread sets sessionEnd to true. We therefore have to check the whole source code for every place that sets sessionEnd. Because the session had already ended, the call into the SDK's lea_session_resume() crashed the application.
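The bad interleaving can be replayed deterministically in a single thread by splitting the unlocked resume into its check step and its act step and placing the other thread's work in between. This is only an illustration, not the application's code: FakeSession, resumeCheck, endFromOtherThread, resumeAct, and replayRace are hypothetical names, and the SDK call is replaced by a recorded event.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the SessionManager state; names are illustrative.
struct FakeSession {
    bool sessionEnd = false;          // mirrors the sessionEnd member seen in gdb
    std::vector<std::string> events;  // records what happened, in order
};

// Step 1 of the unlocked resume(): the check.
bool resumeCheck(const FakeSession& s) {
    return !s.sessionEnd;
}

// What another thread does in between: it ends the session.
void endFromOtherThread(FakeSession& s) {
    s.sessionEnd = true;
    s.events.push_back("end");
}

// Step 2 of the unlocked resume(): act on the (now stale) check result.
void resumeAct(FakeSession& s, bool checkPassed) {
    if (checkPassed) {
        // In the real program this is where the SDK resume call would run.
        s.events.push_back(s.sessionEnd ? "resume-on-ended-session" : "resume");
    }
}

// Replays the bad interleaving: check, then end, then act.
FakeSession replayRace() {
    FakeSession s;
    bool ok = resumeCheck(s);  // thread A: sessionEnd is false, the check passes
    endFromOtherThread(s);     // thread B: the session ends here
    resumeAct(s, ok);          // thread A: resumes a session that has ended
    return s;
}
```

The final state matches what gdb showed: the resume path is executing while sessionEnd is already true. Nothing in the unlocked code prevents the middle step from running between the other two.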

So we have to change the code and add a mutex, held both while ending the session and while checking and resuming it, to avoid the race condition.

void SessionManager::endSession() {
    int status = pthread_mutex_lock(&sessionEndLock);
    if (status != 0) {
        log->notice("cannot get the sessionEndLock to call opsec_end_session()");
    } else {
        // While sessionEndLock is held, resume() cannot run its check-and-resume.
        if (dummySession) {
            opsec_end_session(dummy);
            dummySession = false;
            dummy = NULL;
        } else {
            if (session) {
                opsec_end_session(session);
                session = NULL;
            }
        }
        pthread_mutex_unlock(&sessionEndLock);
    }
    log->notice("session_end_session() had been executed");
}
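Every pthread_mutex_lock() here has to be paired by hand with a pthread_mutex_unlock() on each exit path. In C++ a small RAII guard can do the unlocking automatically. The following is a sketch only: PthreadLockGuard, demoLock, and runGuarded are hypothetical names, and since the post does not show how sessionEndLock is initialized, the static initializer is an assumption.

```cpp
#include <pthread.h>

// Hypothetical RAII guard around a pthread mutex (not part of the original code).
class PthreadLockGuard {
public:
    explicit PthreadLockGuard(pthread_mutex_t* m)
        : mutex_(m), locked_(pthread_mutex_lock(m) == 0) {}

    ~PthreadLockGuard() {
        if (locked_) {
            pthread_mutex_unlock(mutex_);
        }
    }

    // False if pthread_mutex_lock() reported an error.
    bool locked() const { return locked_; }

    PthreadLockGuard(const PthreadLockGuard&) = delete;
    PthreadLockGuard& operator=(const PthreadLockGuard&) = delete;

private:
    pthread_mutex_t* mutex_;
    bool locked_;
};

// Assumption: a statically initialized mutex standing in for sessionEndLock.
static pthread_mutex_t demoLock = PTHREAD_MUTEX_INITIALIZER;

// Returns true if the critical section ran with the lock held.
bool runGuarded(int& counter) {
    PthreadLockGuard guard(&demoLock);
    if (!guard.locked()) {
        return false;
    }
    ++counter;    // critical section
    return true;  // the guard's destructor unlocks here
}
```

With a guard like this, an early return or an exception inside the critical section cannot leave the mutex locked.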

void SessionManager::resume() {
    int status = pthread_mutex_lock(&sessionEndLock);
    if (status != 0) {
        log->notice("cannot get the sessionEndLock to detect if session is End");
        return;
    }
    // Check and use session only while holding the lock, because
    // endSession() sets session to NULL under the same lock.
    if ( session != NULL ) {
        if ( !isEndSession() ) {
            log->notice("session is not NULL and session status is not End");
            lea_session_resume(session);
        } else {
            log->notice("session is not NULL, but session status is End, not to call");
        }
    } else {
        log->notice("session is NULL, not call lea_session_resume");
    }
    pthread_mutex_unlock(&sessionEndLock);
}
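With C++11 or later, the same idea can be written with std::mutex and std::lock_guard, so that the NULL check, the end check, and the resume call all sit in one critical section. This is a sketch under stated assumptions, not the application's code: SafeSessionManager and Session are hypothetical, and the SDK calls are replaced by a counter and a flag so the example compiles without the SDK.

```cpp
#include <mutex>

// Hypothetical stand-in for the SDK's session handle.
struct Session {
    int resumeCalls = 0;
};

// Illustrative only: not the application's SessionManager.
class SafeSessionManager {
public:
    explicit SafeSessionManager(Session* s) : session_(s) {}

    // Returns true only if the resume was actually issued.
    bool resume() {
        std::lock_guard<std::mutex> guard(lock_);
        if (session_ == nullptr || sessionEnd_) {
            return false;         // nothing to resume
        }
        ++session_->resumeCalls;  // placeholder for the SDK resume call
        return true;
    }

    void endSession() {
        std::lock_guard<std::mutex> guard(lock_);
        sessionEnd_ = true;       // placeholder for the SDK end call
        session_ = nullptr;
    }

private:
    std::mutex lock_;
    Session* session_;
    bool sessionEnd_ = false;
};
```

Because both functions take the same lock, endSession() can run either entirely before or entirely after a resume(), never between its check and its call.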
