A piece of supposedly well-written multi-threaded code was giving me a headache the other day:

struct GpsdWrapper
{
    GpsdWrapper(int gpsFd);
    ~GpsdWrapper();

    void poll_AS();
    void parse_AS();

private:

    static const size_t READ_BUFFER_SIZE = 100*1024;

    char   read_buffer[READ_BUFFER_SIZE];
    size_t buffer_pos;

    int gpsFd;

    boost::mutex  buffer_lock;
    boost::thread poll_thread;
    boost::thread parse_thread;
};

GpsdWrapper::GpsdWrapper(int gpsFd)
    : gpsFd(gpsFd)
    , poll_thread (boost::bind(&GpsdWrapper::poll_AS, this))
    , parse_thread(boost::bind(&GpsdWrapper::parse_AS, this))
{}

GpsdWrapper::~GpsdWrapper()
{
    poll_thread.interrupt();
    parse_thread.interrupt();

    poll_thread.join();
    parse_thread.join();
}

void GpsdWrapper::poll_AS()
{
    // {block signals}

    fd_set rfds;

    while (true) {
        boost::this_thread::interruption_point();

        timeval tv = get_timeout(); // {defined elsewhere}
        FD_ZERO(&rfds);
        FD_SET(gpsFd, &rfds);

        int result = select(gpsFd+1, &rfds, NULL, NULL, &tv);
        if (result > 0) {
            boost::mutex::scoped_lock(buffer_lock);

            if (read_buffer_pos == READ_BUFFER_SIZE) {
                syslog(LOG_WARNING,
                   "GPS read buffer reached maximum size; flushing it");
                read_buffer_pos = 0;
            }

            int r = read(gpsFd, read_buffer + read_buffer_pos,
                 READ_BUFFER_SIZE - read_buffer_pos);
            if (r > 0) {
                read_buffer_pos += r;
            }
            else if (r == -1) {
                syslog(LOG_NOTICE, "Read failed");
                // {disconnect}
                continue;
            }
            else if (r == 0) {
                syslog(LOG_NOTICE, "Remote client closed connection");
                // {disconnect}
                continue;
            }
        }
    }
}

void GpsdWrapper::parse_AS()
{
    // {block signals}

    while (true) {
        boost::this_thread::interruption_point();

        bool buffer_is_full;
        std::string buffer_copy;

        {
            boost::mutex::scoped_lock(buffer_lock);

            buffer_is_full = (read_buffer_pos == READ_BUFFER_SIZE);
            buffer_copy.assign(read_buffer, read_buffer_pos);
            read_buffer_pos = 0;
        }

        // {pass buffer_copy to a parser implementation}

        if (!buffer_is_full)
            usleep(300000);
    }
}

The idea here is that an instance of GpsdWrapper sits around in the calling scope, spawning two threads of its own to tick over eating/buffering data from a serial device and dispatching it to some parser, respectively. The parser has its own thread as we certainly don't want serial reads being delayed by any bottlenecks in the parsing process, and we can flush the shared buffer if such a bottleneck begins to cause a backlog of unparsed data, without causing problems on the source end.

Thread safety is allegedly guaranteed by constructing the scope-bound boost::mutex::scoped_lock objects, and all the buffer maths check out. But I was still seeing data dropping in the serial read at seemingly random intervals, implying a problem with the cursor read_buffer_pos.

It took a few hours of paid yet wasted office time to spot that my scoped_locks were in fact not scope-bound at all, because I forgot to give them names:

boost::mutex::scoped_lock(buffer_lock);
// ^ temporary object of type `scoped_lock`, lives only for the line

boost::mutex::scoped_lock l(buffer_lock);
// ^ named object with automatic storage and scope-bound lifetime

This silly little typo that I'd made identically in each case, went completely undetected and essentially meant that there would be no thread-safety in the entire mechanism.

Problem randomly found after a revitalising coffee break, I then set about trying to conjure up a way for the build process to catch mistakes like this in the future. But what would they be? GCC doesn't (and cannot be configured to) emit any warnings about this code because, in the general case, there's nothing wrong with a temporary object that is not assigned to anything and that thus dies straight away.

Indeed, even in the threading world you may use a lock as a waiting "gate" that can be safely released as soon as the gate has been passed. As it turns out, using block scope purely for the purpose of using block scope is more rare than common. Since we can't expect a toolchain to read our minds, it doesn't seem likely that we'll ever get an automated mechanism for identifying this possible error.

I guess the moral of the story is to remember this story when using scoped_locks.