Flink And Blink: The Case For C+-

No, that's not a typo. I really do mean C+-.

When using C++, there's a tendency for C programmers to think they have to use all the facilities of the language at once. Particularly user-defined classes, since C++ takes C into object-oriented programming. "If I'm using C++, I have to define some classes."

The danger in that is over-engineering: over-complicating things by forcing classes into places where there may not be a need for them.

In a large, complex piece of software, there are many places that benefit from user-defined classes. In a bigger beast like that, the user-defined object-oriented approach helps abstract the problem.

But in small, quick tools, that's not necessarily the case, so plain C with a few simple structs is probably sufficient. However, there's still a lot of value to be found in the C++ standard library.

Specifically, the managed string and container classes. One of the big complaints about C is the need for explicit memory management. Because of that need, the C language and runtime library don't offer any native containers other than the statically-sized array, with the simple character array as an implementation of strings. If you need any dynamic structures, you have to implement your own, with explicit memory management on top of malloc() and free().

There are no native or standard library lists, hash tables, trees, or other dynamic structures. There are no dynamically-sized arrays or strings. There's no automatic deallocation of heap when you're finished using it.

As a result, C usage has been plagued by decades of buffer overflows and memory leaks. It also means a lot of time required to roll your own basic dynamic structures (and iron out the buffer overflows and memory leaks in their implementation).

But the C++ standard library provides all of those things. It also provides building blocks that can be used to layer more complex structure on them. That provides a lot of opportunities to build things without having to define any classes of your own.

I'm not saying there's anything wrong with classes. I'm just saying there's a whole class of programs that don't need any extra classes. The C++ standard library already provides a rich set of resources to choose from, often sufficient on their own to build useful programs that are faster to implement and debug than if you did everything in C.

You probably already treat templates that way. Even though C++ offers the ability to define templates, you may write tons of code without ever defining your own templates. Just because the language offers a feature doesn't mean you have to define any of your own things with it. Yet you still probably make extensive use of templates through the library.

And so it is with classes. You can write tons of code without ever defining your own classes. Doing lots of string processing, common in software tools? The C++ standard library provides a whole host of classes that will help, starting with std::string.

Need to keep lists of those strings, in the order you got them? How about a std::list<std::string>? Need fast associative storage keyed off the string, or a portion of it? How about a std::unordered_map<std::string, yourThingHere>? Need to keep strings sorted by some other key? How about a std::map<yourThingHere, std::string>? Or something sorted by the strings? How about a std::map<std::string, yourThingHere>?
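As a rough illustration of that mix-and-match style, here's a minimal sketch that counts words using nothing but library containers. The helper names countWords and sortedByWord are made up for this example; they aren't part of any library or of the tool later in this post.

```cpp
#include <list>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical helper: count how often each word occurs in a list.
std::unordered_map<std::string, int>
countWords(const std::list<std::string>& words)
{
    std::unordered_map<std::string, int> counts;
    for (const std::string& word : words) {
        ++counts[word];  // operator[] value-initializes a new count to 0
    }
    return counts;
}

// Hypothetical helper: copy the counts into a std::map, which keeps
// its entries ordered by key (here, alphabetically by word).
std::map<std::string, int>
sortedByWord(const std::unordered_map<std::string, int>& counts)
{
    return std::map<std::string, int>(counts.begin(), counts.end());
}
```

No user-defined classes, no malloc() or free(), and the containers clean up after themselves when they go out of scope.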

Then if you need something a little more complex than simple strings and structs, you can compose things with std::pair<thingA, thingB> or std::bind(callableThing, args).
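A small sketch of both, assuming a hypothetical startsWith() predicate (the function names here are illustrative only): std::pair glues two values together, and std::bind fixes one argument of a function to produce a new callable.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical predicate: does the line start with the given prefix?
bool startsWith(const std::string& prefix, const std::string& line)
{
    return line.compare(0, prefix.size(), prefix) == 0;
}

// Compose a line number and its text without defining a struct.
std::pair<int, std::string> numbered(int lineNo, const std::string& text)
{
    return std::make_pair(lineNo, text);
}

// Bind the prefix argument, leaving a one-argument test on the line.
std::function<bool(const std::string&)>
makePrefixTest(const std::string& prefix)
{
    return std::bind(startsWith, prefix, std::placeholders::_1);
}
```

In newer code a lambda often reads more clearly than std::bind, but both get you a composed callable without writing a class.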

The other benefit to this is that at some point you may realize that some user-defined classes would make sense in your program after all; it's not just strings and structs and pairs and binds. The infrastructure you've already built into the program is OO-ready. And you have std::shared_ptr<yourClassHere> to automate memory management and support RAII, avoiding memory leaks.
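For instance, here's a minimal sketch of a plain struct shared between two containers through std::shared_ptr. LogEntry and makeEntry are hypothetical names invented for this illustration.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical plain data struct, C style: no methods, no hierarchy.
struct LogEntry {
    std::string direction;
    std::string text;
};

typedef std::shared_ptr<LogEntry> LogEntryPtr;

// Allocate an entry on the heap. There is no matching delete anywhere:
// the entry is freed when the last shared_ptr referring to it goes away.
LogEntryPtr makeEntry(const std::string& direction, const std::string& text)
{
    LogEntryPtr entry = std::make_shared<LogEntry>();
    entry->direction = direction;
    entry->text = text;
    return entry;
}
```

The same entry can sit in several containers at once (say, a list of everything and a list of just the transmits), and it stays alive as long as any of them still holds it.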

So making the switch to a more heavily object-oriented program is a small step, a refinement, rather than throwing it all out and starting over again.

Meanwhile, you're already in the mindset of using just the minimum of appropriate user-defined classes, and not going overboard trying to beat everything into the shape of an OO nail just because you have an OO hammer.

That's adding just the amount of design and implementation complexity necessary to help abstract the appropriate parts of the problem, while maintaining a simple, pared-down elegance. Make things only as complex as you need to, and no more (as well as following Einstein's advice to make things as simple as possible, but no simpler).

Meanwhile, you're relying on a large body of fully-implemented and debugged composable, modular elements to speed the job to completion. In many ways, that right there is going a long way to meeting the promise of the "software IC".

So that's what I'm calling C+-. It's C++ minus the user-defined classes. Which is more than just writing plain C that you compile with the C++ compiler. It's simply object-oriented code that relies entirely on someone else's classes.

You can argue about whether that's a good thing or a bad thing in the grand scheme of things, but I see it as just another practical tool in your toolbox.

There are three situations where this approach is useful:

Quick tools where you need to get it done as fast as possible so you can use it to help you get on with your main work.
Competitive programming, where you're working under the gun.
Coding interviews, which are essentially competitive programming under a time limit, whether on a whiteboard, in a shared editing session, or in an automated coding assessment system.

As an example of this, here's a tool I've been wanting to have for a while. I work on IOT projects, distributed systems where small embedded system client devices communicate with large backend servers.

Debugging these can be challenging as you try to sift through the logs each side produces. Because many IOT systems lack real-time clocks, they may not know what actual time it is, so it's hard to match up activity in the client log with the activity in the server, especially when there are communication errors and voluminous logs.

The tool below, msgresolve, resolves the messages logged by a client IOT device and its server. The client tracks time since boot, in msec, and logs that timestamp on each line. The server tracks real time GMT to msec resolution, logging the date and time on each line.

The example logs here contain a very small amount of data, but it's not unreasonable for a log to have hundreds or thousands of messages.

In order for this to work, the message logging must have a way of identifying each message uniquely. This is known as the message signature, a short string that summarizes the message contents. The signature may be a cryptographic hash or message digest such as MD5, or a checksum or polynomial such as Fletcher or CRC.

The messages must have some degree of randomization in the contents so that no two messages in the same direction ever produce the same signature (at least for the duration of the logging). This randomization might be due to encryption, some incrementing field such as a timestamp or counter, or a randomized nonce.

Users of git will be familiar with this concept. The commit hash acts as the identifier for changes to file content, and is affected by even a single-byte change in the file contents.

Here the signature is formed from the message hash and the message length. Appending the length adds a little insurance against a hash collision, where messages with different contents hash to the same value: if the colliding messages have different lengths, their signatures still differ. Two messages of the same length can in principle still collide, but with a reasonable hash that's unlikely enough over the span of a log that the hash conditioned by the length serves as a unique signature in practice.
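A minimal sketch of forming such a signature from the text fields of a log line. The makeSignature name is hypothetical; the tool below does the equivalent inline with pop_back() and append().

```cpp
#include <string>

// Hypothetical helper: build a signature from the hash field and the
// length field of a log line, e.g. "0x47e21fdd," and "185".
std::string makeSignature(std::string hashField,
                          const std::string& lengthField)
{
    // The hash field is followed by a comma in the logs; drop it.
    if (!hashField.empty() && hashField.back() == ',') {
        hashField.pop_back();
    }
    // Fixed-width hash text followed by the length text.
    return hashField + lengthField;
}
```

With this, a hash of 0x11111111 on a 28-byte message and the same hash on a 68-byte message produce different signatures, so they won't be mistaken for the same message.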

I had a couple thoughts on how to approach the algorithm. One was to treat it as a difference-matching problem, such as the Unix diff utility. The other was a kind of match-and-merge approach. However that seemed like it might head toward an O(N^2) algorithm (for each client message, run down the list of server messages to find a match), which would rapidly get too slow for large logs.

But that made me think about an indexed lookup method, where a faster lookup method would make that approach manageable.

Part of what made it tricky is the fact that even though the two logs have parallel, time-ordered sets of messages, there might be lost or corrupted messages, and the two logs might not cover the exact same range of time. So just because a message appeared in one log, there was no guarantee that its counterpart appeared exactly as is in the other log.

The other thing that helped crystallize it was the realization that matching up a set of parallel ordered log entries could be viewed as three parts from the perspective of the client messages:

1) Handle any messages in the server log that preceded the messages matching the client messages.
2) Handle all the messages in the client log, which may or may not have matching server messages (along with intervening server messages that didn't have any matching client messages).
3) Handle any messages in the server log that followed the messages matching the client messages.

So this algorithm uses a hash table (std::unordered_map) to index a list (std::list) of log entries. The hash table (which I call a dict, as in a Python dict) is indexed by message signature. Ideally, for every transmitted message, there is a received message with matching signature. That's the basis of the lookup. Iterating linearly through the time-ordered list deals with the unmatched server messages. Reordered messages can produce some interesting results.

For every message, look up the signature in the other side's dict to find its matching message. That makes it an O(N) algorithm (the hash lookup done for each message is O(1) on average).

I did have to separate transmit from receive messages for each side, since it's possible for a received message to have the same signature as a transmitted message if all the randomizing factors are the same in both directions. Thus the signature on a client TX message would be used to look up the corresponding message in the server RX message dict.
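The lookup described in the last two paragraphs can be sketched roughly like this. It's a simplified illustration, not the code from the tool: the Dict here stores copies of the log lines, and findOpposite is a hypothetical name.

```cpp
#include <string>
#include <unordered_map>

typedef std::unordered_map<std::string, std::string> Dict;  // signature -> log line

// Hypothetical helper: find the line on the other side that matches a
// message with the given direction ("TX" or "RX") and signature.
// Returns an empty string when there is no match.
std::string findOpposite(const std::string& direction,
                         const std::string& signature,
                         const Dict& otherReceives,
                         const Dict& otherTransmits)
{
    // A message transmitted on this side should show up as a receive
    // on the other side, and vice versa.
    const Dict& dict = (direction == "TX") ? otherReceives : otherTransmits;

    // Average-case O(1) hash lookup, so N messages cost O(N) overall.
    Dict::const_iterator match = dict.find(signature);
    return (match == dict.end()) ? std::string() : match->second;
}
```

Keeping the two directions in separate dicts means a signature that happens to appear in both directions still resolves to the right line.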

The actual string storage for the log lines for each side is in the list, which is a time-ordered list. The dict entries contain references to those strings, so a dict is simply the index, by signature, of the list of strings.
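Here's a minimal sketch of that list-plus-index arrangement. Note one deliberate difference from the tool's source: this version stores pointers in the index rather than references, and the names KeyedLine, LineList, LineIndex and addLine are made up for the example.

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

typedef std::pair<std::string, std::string> KeyedLine;  // (signature, log line)
typedef std::list<KeyedLine> LineList;                  // owns the strings, in time order
typedef std::unordered_map<std::string, const std::string*> LineIndex;

// Hypothetical helper: append a line to the time-ordered list and
// index it by signature. Elements of a std::list never move once
// inserted, so the stored pointer stays valid as the list grows.
void addLine(LineList& lines, LineIndex& index,
             const std::string& signature, const std::string& line)
{
    lines.push_back(KeyedLine(signature, line));
    index[signature] = &lines.back().second;
}
```

This only works because of the list's stability guarantee; doing the same with a std::vector would leave dangling pointers the first time the vector reallocated.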

All of this can be managed with standard library objects, using std::pairs to bind cross reference information with the log strings. For simple composition, this works well. As you need to compose more complex objects, navigating pairs of pairs rapidly gets out of hand, so that's when to define some structs, or maybe some simple data classes.

The other thing that was very useful was to define a split() function, equivalent to the split() function in Python. I use split() and join() quite a bit in Python for similar text processing tools. They really speed up string processing, allowing you to tear apart and reassemble strings easily. That also crosses the C/C++ string boundary: split() takes a C-style character array and splits it into a vector of strings (std::vector<std::string>).
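For comparison, here's a sketch of a split() and join() pair that works purely on std::string and leaves its input untouched (strtok() modifies the buffer it scans). The names splitString and joinStrings are hypothetical. Like the strtok()-based version, this split skips runs of consecutive delimiters, which differs from Python's split(" ").

```cpp
#include <string>
#include <vector>

typedef std::vector<std::string> StringVec;

// Split str on any of the characters in delims, skipping empty tokens.
StringVec splitString(const std::string& str, const std::string& delims)
{
    StringVec tokens;
    std::string::size_type start = str.find_first_not_of(delims);
    while (start != std::string::npos) {
        std::string::size_type end = str.find_first_of(delims, start);
        // If end is npos, substr() simply runs to the end of the string.
        tokens.push_back(str.substr(start, end - start));
        start = str.find_first_not_of(delims, end);
    }
    return tokens;
}

// Join the parts back together with sep between each pair.
std::string joinStrings(const StringVec& parts, const std::string& sep)
{
    std::string result;
    for (StringVec::size_type i = 0; i < parts.size(); ++i) {
        if (i > 0) {
            result += sep;
        }
        result += parts[i];
    }
    return result;
}
```

Splitting one of the sample client log lines on spaces this way yields the fields in the same positions the tool's CLIENT enum expects, with the direction at index 4 and the hash at index 6.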

I used a number of typedefs of the standard objects as syntactic sugar. That's a big help when declaring an iterator for an unordered map of composed pairs.
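To show how much the typedefs save, here's a small sketch with a map of composed pairs. The PairDict type and countDirection function are hypothetical, chosen just to illustrate the iterator declaration.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

typedef std::pair<std::string, std::string> StringPair;        // (direction, log line)
typedef std::unordered_map<std::string, StringPair> PairDict;  // signature -> pair

// Count the entries with the given direction. Without the typedefs,
// the iterator below would have to be declared as
//   std::unordered_map<std::string,
//       std::pair<std::string, std::string> >::const_iterator
int countDirection(const PairDict& dict, const std::string& direction)
{
    int count = 0;
    for (PairDict::const_iterator it = dict.begin();
         it != dict.end(); ++it) {
        if (it->second.first == direction) {
            ++count;
        }
    }
    return count;
}
```

In C++11 and later, auto or a range-based for loop would shorten the loop further, but the typedefs still pay off wherever the type has to be spelled out, such as in function signatures.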

With the split() function and the typedefs of the standard objects acting as power tools, the code was straightforward.

The resulting output of the tool makes it easy to navigate the logs and correlate activity. One useful modification would be to have it group all the other non-message log lines with the nearest message (though that brings up the problem of deciding whether the lines should be grouped with the nearest subsequent message, or the nearest previous message). That would be especially useful behind a GUI like tkdiff (see, there's that diff thinking again...).

For another example of code like this, see More C+-.

The source, msgresolve.cpp (I had to do a little odd line-folding to make it fit in the width below):

// Usage: msgresolve <clientLog> <serverLog>
//
// Resolve client/server logs from the client perspective. That
// treats the total sequence of messages as 3 sections:
//   1) Initial unmatched server messages.
//   2) Client messages that may be matched or unmatched,
//      interspersed with unmatched server messages.
//   3) Remaining unmatched server messages.
//
// This is an example of a C++ program that is written mostly
// in plain C style, but that makes use of the container and
// composition classes in the C++ standard library. It is a
// lightweight use of C++ with no user-defined classes.
//
// 2018 Steve Branam <sdbranam@gmail.com> learntocode

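#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>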

#define SERVER_PREFIX "    "

enum ARGS
{
    ARGS_PROGNAME,
    ARGS_CLIENT_LOG,
    ARGS_SERVER_LOG,
    ARGS_REQUIRED
};

enum CLIENT
{
    CLIENT_TIMESTAMP,
    CLIENT_FILE,
    CLIENT_LINE,
    CLIENT_SEVERITY,
    CLIENT_DIRECTION,
    CLIENT_HASH_KEYWORD,
    CLIENT_HASH,
    CLIENT_LEN,
    CLIENT_BYTES_KEYWORD,
    CLIENT_TIMESTAMP_LEN = 10
};

enum SERVER
{
    SERVER_DATE,
    SERVER_TIME,
    SERVER_THREAD,
    SERVER_SEVERITY,
    SERVER_FUNC,
    SERVER_CLIENT,
    SERVER_DIRECTION,
    SERVER_HASH_KEYWORD,
    SERVER_HASH,
    SERVER_LEN,
    SERVER_BYTES_KEYWORD,
    SERVER_TIME_LEN = 16
};

typedef std::string String;
typedef std::vector<String> StringVec;
typedef std::pair<String, String> StringPair;
typedef std::list<StringPair> MsgList;
typedef std::unordered_map<String, String&> MsgDict;
typedef std::pair<String, String&> MsgDictEntry;

MsgList clientTimestamps;
MsgDict clientReceives;
MsgDict clientTransmits;

MsgList serverTimestamps;
MsgDict serverReceives;
MsgDict serverTransmits;

StringVec split(char* str, const char* delim)
{
    StringVec strings;

    char *token = std::strtok(str, delim);
    while (token != NULL) {
        strings.push_back(token);
        token = std::strtok(NULL, delim);
    }
    
    return strings;
}

bool isClientTimestamp(const String& str)
{
    if (str.size() == CLIENT_TIMESTAMP_LEN) {
        for (String::size_type x = 0; x < str.size(); ++x)
        {
            if (!isdigit(static_cast<unsigned char>(str[x]))) {
                return false;
            }
        }
        return true;
    }
    return false;
}

bool isServerTime(const String& str)
{
    if (str.size() == SERVER_TIME_LEN) {
        for (String::size_type x = 0; x < str.size(); ++x)
        {
            if (!isdigit(static_cast<unsigned char>(str[x])) &&
                (str[x] != ':') &&
                (str[x] != '.') &&
                (str[x] != ']')) {
                return false;
            }
        }
        return true;
    }
    return false;
}

bool isClientRxTx(const StringVec& fields)
{
    return ((fields.size() > CLIENT_BYTES_KEYWORD) &&
            isClientTimestamp(fields[CLIENT_TIMESTAMP]) &&
            (fields[CLIENT_DIRECTION] == "RX" ||
             fields[CLIENT_DIRECTION] == "TX") &&
            (fields[CLIENT_HASH_KEYWORD] == "hash") &&
            (fields[CLIENT_BYTES_KEYWORD] == "bytes\n" ||
             fields[CLIENT_BYTES_KEYWORD] == "bytes,"));
}

bool isServerRxTx(const StringVec& fields)
{
    return ((fields.size() > SERVER_BYTES_KEYWORD) &&
            isServerTime(fields[SERVER_TIME]) &&
            (fields[SERVER_DIRECTION] == "RX" ||
             fields[SERVER_DIRECTION] == "TX") &&
            (fields[SERVER_HASH_KEYWORD] == "hash") &&
            (fields[SERVER_BYTES_KEYWORD] == "bytes\n" ||
             fields[SERVER_BYTES_KEYWORD] == "bytes,"));
}

bool loadClient(const char* fileName)
{
    FILE* file = std::fopen(fileName, "r");
    
    if (file) {
        char buffer[1000];
        while (std::fgets(buffer, sizeof(buffer), file) != NULL) {
            String line(buffer);
            StringVec fields = split(buffer, " ");

            if (isClientRxTx(fields)) {
                // Remove trailing comma.
                fields[CLIENT_HASH].pop_back();

                String key(fields[CLIENT_HASH]);
                key.append(fields[CLIENT_LEN]);

                String xref(fields[CLIENT_DIRECTION]);
                xref.append(key);

                clientTimestamps.push_back(StringPair(xref, line));
                if (fields[CLIENT_DIRECTION] == "RX") {
                    clientReceives.insert(MsgDictEntry(key,
                                   clientTimestamps.back().second));
                } else {
                    clientTransmits.insert(MsgDictEntry(key,
                                   clientTimestamps.back().second));
                }
            }
        }
        std::fclose(file);
        return true;
    }
    std::cout << "Failed to open client file "
              << fileName << std::endl;
    return false;
}

bool loadServer(const char* fileName)
{
    FILE* file = std::fopen(fileName, "r");
    
    if (file) {
        char buffer[1000];
        while (std::fgets(buffer, sizeof(buffer), file) != NULL) {
            String line(buffer);
            StringVec fields = split(buffer, " ");

            if (isServerRxTx(fields)) {
                // Remove trailing comma.
                fields[SERVER_HASH].pop_back();

                String key(fields[SERVER_HASH]);
                key.append(fields[SERVER_LEN]);

                String xref(fields[SERVER_DIRECTION]);
                xref.append(key);

                serverTimestamps.push_back(StringPair(xref, line));
                if (fields[SERVER_DIRECTION] == "RX") {
                    serverReceives.insert(MsgDictEntry(key,
                                   serverTimestamps.back().second));
                } else {
                    serverTransmits.insert(MsgDictEntry(key,
                                   serverTimestamps.back().second));
                }
            }
        }
        std::fclose(file);
        return true;
    }
    std::cout << "Failed to open server file "
              << fileName << std::endl;
    return false;
}

void printRxSeparator()
{
    std::cout << "   /" << std::endl
              << "  <" << std::endl;
}

void printTxSeparator()
{
    std::cout << "  \\" << std::endl
              << "   >" << std::endl;
}

void printTransactionSeparator()
{
    std::cout << std::endl
              << "---------" << std::endl
              << std::endl;
}

// Find next server match for client processing, processing any
// unmatched server messages along the way.
void findNextServerMatch(MsgList::iterator& curServer)
{
    for (bool found = false;
         !found && curServer != serverTimestamps.end();) {
        std::string& xref(curServer->first);
        std::string key(xref.substr(2));
        std::string& server(curServer->second);
        
        if (xref[0] == 'R') {
            found = (clientTransmits.find(key) !=
                     clientTransmits.end());
            if (!found) {
                std::cout << "Client transmit not found"
                          << std::endl;
                printTxSeparator();
                std::cout << SERVER_PREFIX << server;
            }
        } else {
            found = (clientReceives.find(key) !=
                     clientReceives.end());
            if (!found) {
                std::cout << SERVER_PREFIX << server;
                printRxSeparator();
                std::cout << "Client receive not found"
                          << std::endl;
            }
        }
        
        if (!found) {
            printTransactionSeparator();
            curServer++;
        }
    }
}

// Process all client messages, checking for unmatched server
// messages along the way.
void processClient(MsgList::iterator& curServer)
{
    for (MsgList::iterator curClient = clientTimestamps.begin();
         curClient != clientTimestamps.end();
         curClient++) {
        std::string& xref(curClient->first);
        std::string key(xref.substr(2));
        std::string& client(curClient->second);
        MsgDict::iterator match;
        bool matched = false;

        if (xref[0] == 'R') {
            match = serverTransmits.find(key);
            matched = (match != serverTransmits.end());
            if (!matched) {
                std::cout << SERVER_PREFIX
                          << "Server transmit not found" << std::endl;
            } else {
                std::cout << SERVER_PREFIX << match->second;
            }

            printRxSeparator();
            std::cout << client;
        } else {
            std::cout << client;
            printTxSeparator();
            
            match = serverReceives.find(key);
            matched = (match != serverReceives.end());
            if (!matched) {
                std::cout << SERVER_PREFIX
                          << "Server receive not found" << std::endl;
            } else {
                std::cout << SERVER_PREFIX << match->second;
            }       
        }
        printTransactionSeparator();
        
        // 'matched' was set against the end() of the dict actually
        // searched; comparing iterators from different containers
        // is undefined. Also never advance past the end of the list.
        if (matched && curServer != serverTimestamps.end()) {
            // Matched, advance server iterator and find next
            // matching server msg.
            findNextServerMatch(++curServer);
        }
    }
}

void resolve()
{
    MsgList::iterator curServer = serverTimestamps.begin();

    // Handle any initial unmatched server messages.
    findNextServerMatch(curServer);
    
    // Handle client messages interspersed with any unmatched
    // server messages.
    processClient(curServer);

    // Handle any remaining unmatched server messages.
    findNextServerMatch(curServer);
}

int main(int argc, char* argv[])
{
    if (argc < ARGS_REQUIRED ||
        String(argv[1]) == "-h") {
        std::cout << "Usage: " << argv[ARGS_PROGNAME]
                  << " <clientLog> <serverLog>" << std::endl;
        return EXIT_FAILURE;
    }
    else {
        if (loadClient(argv[ARGS_CLIENT_LOG]) &&
            loadServer(argv[ARGS_SERVER_LOG])) {
            resolve();
        } else {
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}

Sample client log (the 0x11111111 hashes are ones I deliberately changed to break the match):

0345604820          comm.c, 1529, D: TX hash 0x47e21fdd, 185 bytes, msg type 3
0345605799          comm.c, 1426, D: RX hash 0xd331bb95, 35 bytes
0345605916          comm.c, 1529, D: TX hash 0x2f66bbd6, 180 bytes, msg type 15
0345606875          comm.c, 1426, D: RX hash 0x11111111, 28 bytes
0345607011          comm.c, 1529, D: TX hash 0x6924ebfd, 69 bytes, msg type 16
0345607146          comm.c, 1426, D: RX hash 0x183d710c, 33 bytes
0345607215          comm.c, 1529, D: TX hash 0x5c4b78f4, 504 bytes, msg type 18

Sample server log:

[2018-02-05 20:50:04.093798] [0x00007f3412dc8700] [debug]   send_msg()  00000062 TX hash 0xfcf3f009, 33 bytes, msg type 19
[2018-02-05 20:50:04.101101] [0x00007f3412dc8700] [debug]   send_msg()  00000062 TX hash 0xca5c8aea, 53 bytes, msg type 15
[2018-02-05 20:51:45.796547] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x47e21fdd, 185 bytes
[2018-02-05 20:51:45.812284] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0xd331bb95, 35 bytes, msg type 3
[2018-02-05 20:51:46.894310] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x2f66bbd6, 180 bytes
[2018-02-05 20:51:46.894661] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0x7495ff13, 29 bytes, msg type 17
[2018-02-05 20:51:46.894829] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0x183d710c, 33 bytes, msg type 19
[2018-02-05 20:51:46.903009] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0xc1575ef6, 53 bytes, msg type 15
[2018-02-05 20:51:47.894246] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x11111111, 68 bytes
[2018-02-05 20:51:48.732482] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x5c4b78f4, 504 bytes
[2018-02-05 20:52:39.990683] [0x00007f34125c7700] [debug]   handle_msg()  :00000062 RX hash 0x15667979, 185 bytes
[2018-02-05 20:52:39.999387] [0x00007f34125c7700] [debug]   send_msg()  :00000062 TX hash 0x3b1bf5ec, 35 bytes, msg type 3

Sample output:

$ ./msgresolve client.log server.log
    [2018-02-05 20:50:04.093798] [0x00007f3412dc8700] [debug]   send_msg()  00000062 TX hash 0xfcf3f009, 33 bytes, msg type 19
   /
  <
Client receive not found

---------

    [2018-02-05 20:50:04.101101] [0x00007f3412dc8700] [debug]   send_msg()  00000062 TX hash 0xca5c8aea, 53 bytes, msg type 15
   /
  <
Client receive not found

---------

0345604820          comm.c, 1529, D: TX hash 0x47e21fdd, 185 bytes, msg type 3
  \
   >
    [2018-02-05 20:51:45.796547] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x47e21fdd, 185 bytes

---------

    [2018-02-05 20:51:45.812284] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0xd331bb95, 35 bytes, msg type 3
   /
  <
0345605799          comm.c, 1426, D: RX hash 0xd331bb95, 35 bytes

---------

0345605916          comm.c, 1529, D: TX hash 0x2f66bbd6, 180 bytes, msg type 15
  \
   >
    [2018-02-05 20:51:46.894310] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x2f66bbd6, 180 bytes

---------

    [2018-02-05 20:51:46.894661] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0x7495ff13, 29 bytes, msg type 17
   /
  <
Client receive not found

---------

    Server transmit not found
   /
  <
0345606875          comm.c, 1426, D: RX hash 0x11111111, 28 bytes

---------

0345607011          comm.c, 1529, D: TX hash 0x6924ebfd, 69 bytes, msg type 16
  \
   >
    Server receive not found

---------

    [2018-02-05 20:51:46.894829] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0x183d710c, 33 bytes, msg type 19
   /
  <
0345607146          comm.c, 1426, D: RX hash 0x183d710c, 33 bytes

---------

    [2018-02-05 20:51:46.903009] [0x00007f34135c9700] [debug]   send_msg()  :00000062 TX hash 0xc1575ef6, 53 bytes, msg type 15
   /
  <
Client receive not found

---------

Client transmit not found
  \
   >
    [2018-02-05 20:51:47.894246] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x11111111, 68 bytes

---------

0345607215          comm.c, 1529, D: TX hash 0x5c4b78f4, 504 bytes, msg type 18
  \
   >
    [2018-02-05 20:51:48.732482] [0x00007f34135c9700] [debug]   handle_msg()  :00000062 RX hash 0x5c4b78f4, 504 bytes

---------

Client transmit not found
  \
   >
    [2018-02-05 20:52:39.990683] [0x00007f34125c7700] [debug]   handle_msg()  :00000062 RX hash 0x15667979, 185 bytes

---------

    [2018-02-05 20:52:39.999387] [0x00007f34125c7700] [debug]   send_msg()  :00000062 TX hash 0x3b1bf5ec, 35 bytes, msg type 3
   /
  <
Client receive not found

---------

Wednesday, March 28, 2018
