Developer experiences from the trenches

A Convention For Fragment Parsers in C

Fri 09 August 2024 by Michael Labbe
tags code

Sometimes you want to parse a fragment from a string and all you have is C. Parsers for things like RFC 3339 timestamps are handy, reusable pieces of code. This post suggests a convention for writing stack-based fragment parsers that can be easily reused or composed into a larger parser.

It’s opinionated, but it tends to work for most things, so adopt it or adapt it to your needs.

The Interface

The idea is pretty simple.

// can be any type
typedef struct {
  // fields go here
} type_t;

int parse_type(char **stream, size_t len, type_t *out);

Pass in stream, a pointer to a pointer into a null-terminated string. On return, *stream points to the location of an error, or just past the end of the parse on success. This means that it can point to the null terminator.

Pass in the length of the string to parse to avoid needing to call strlen, or to indicate if the end of a successful parse occurs before the null terminator.

The return value can be an int as depicted, or an enum of parse failure reasons. The key thing is that zero is success. This allows multiple parses to OR their results together and test for an error once, which keeps the calling code trivial.

That’s the whole interface. You can compose a larger parser out of smaller versions of these. So, if you want to parse a float (a deceptively hard thing to do) in a document, or key value pairs with quotes or something, you can build, test and reuse them by following this convention.
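As a rough sketch of that composition, assuming an "HH:MM" fragment: the names hhmm_t, parse_two_digits, parse_colon and parse_hhmm are hypothetical and only illustrate the convention. Each sub-parser receives the remaining length, and the results are ORed together.

```c
#include <stddef.h>

// Hypothetical output type for an "HH:MM" fragment.
typedef struct {
    int hour;
    int minute;
} hhmm_t;

// Hypothetical fragment parser: exactly two decimal digits.
static int parse_two_digits(char **stream, size_t len, int *out) {
    char *s = *stream;

    if (len < 2)
        return 1;
    if (s[0] < '0' || s[0] > '9')
        return 1;                 // *stream stays on the bad byte
    if (s[1] < '0' || s[1] > '9') {
        *stream = s + 1;          // point at the bad byte
        return 1;
    }

    *out = (s[0] - '0') * 10 + (s[1] - '0');
    *stream = s + 2;
    return 0;
}

// Hypothetical fragment parser: a single ':' separator.
static int parse_colon(char **stream, size_t len) {
    if (len < 1 || **stream != ':')
        return 1;
    (*stream)++;
    return 0;
}

// Composite parser: OR the results and test for failure once.
int parse_hhmm(char **stream, size_t len, hhmm_t *out) {
    char *start = *stream;
    int err = 0;

    err |= parse_two_digits(stream, len, &out->hour);
    err |= parse_colon(stream, len - (size_t)(*stream - start));
    err |= parse_two_digits(stream, len - (size_t)(*stream - start), &out->minute);

    return err;
}
```

One trade-off to be aware of: with ORing, later fragments still run after an earlier one fails, so *stream can move past the first error (for example on "1:30"). If the exact error location matters, chaining the calls with || stops at the first failure instead.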

Helping with Implementation

When you implement a fragment parser you end up needing the same few support functions. This suggests a convention.

Testing whether the stream was fully parsed works well with a macro containing a single expression:

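// Relies on the locals `stream`, `start` and `len` being in scope.
#define did_fully_parse_stream \
    (*stream - start == (ptrdiff_t)len)

int parse_type(char **stream, size_t len, type_t *out) {
    char *start = *stream;

    // ... parse into *out, advancing *stream ...

    if (!did_fully_parse_stream)
        return 1;

    return 0;
}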
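To show the macro in a complete function, here is a minimal sketch of a parser that requires every one of the len bytes to be a decimal digit. The name parse_all_digits is hypothetical, and overflow handling is left out to keep it short.

```c
#include <stddef.h>

// Relies on the locals `stream`, `start` and `len` being in scope.
#define did_fully_parse_stream \
    (*stream - start == (ptrdiff_t)len)

// Hypothetical parser: succeeds only if all `len` bytes are digits.
// Overflow is not checked in this sketch.
int parse_all_digits(char **stream, size_t len, unsigned *out) {
    char *start = *stream;
    unsigned value = 0;

    // Consume digits, never reading past `len` bytes.
    while ((size_t)(*stream - start) < len &&
           **stream >= '0' && **stream <= '9') {
        value = value * 10 + (unsigned)(**stream - '0');
        (*stream)++;
    }

    // Empty input, or a non-digit before `len` bytes were consumed.
    if (len == 0 || !did_fully_parse_stream)
        return 1;   // *stream is left on the first offending byte

    *out = value;
    return 0;
}

#undef did_fully_parse_stream
```

Because len bounds the parse, calling this on "123abc" with a len of 3 succeeds and leaves *stream on the 'a', which is how a fragment embedded in a longer string can be handled.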

Token Walking

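Test the next token for a match. Note that these helpers take const char **, while parse_type above takes char **. C does not convert between the two implicitly, so pick one signature and use it throughout: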

static int is_token(const char **stream, char ch) {
    return **stream == ch;
}

Test the next token and bypass it if it matches. By convention, use this if a token failing to match is not an error.

static int was_token(const char **stream, char ch) {

    if (is_token(stream, ch)) {
        (*stream)++;
        return 1;
    }

    return 0;
}

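Require the next token to be ‘ch’, bypassing it if it matches. The return value is inverted relative to was_token: zero on a match and non-zero on a mismatch, in line with the zero-is-success convention. While this functionally does the same thing as was_token, it is semantically useful to use it to mean an error has occurred if it does not match.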

static int expect_token(const char **stream, char ch) {
    return !was_token(stream, ch);
}
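As a sketch of how the three helpers might fit together, the following parses a pair such as "-3,14". The names point_t, parse_int and parse_point are hypothetical. The parser here takes const char ** so the helpers can be used exactly as written, and it relies on the null terminator rather than len to stop.

```c
#include <stddef.h>

static int is_token(const char **stream, char ch) {
    return **stream == ch;
}

static int was_token(const char **stream, char ch) {
    if (is_token(stream, ch)) {
        (*stream)++;
        return 1;
    }
    return 0;
}

static int expect_token(const char **stream, char ch) {
    return !was_token(stream, ch);
}

// Hypothetical output type.
typedef struct {
    int x;
    int y;
} point_t;

// Hypothetical helper: optional '-' followed by one or more digits.
static int parse_int(const char **stream, int *out) {
    // A missing sign is not an error, so use was_token.
    int negative = was_token(stream, '-');
    int value = 0;

    if (**stream < '0' || **stream > '9')
        return 1;

    while (**stream >= '0' && **stream <= '9') {
        value = value * 10 + (**stream - '0');
        (*stream)++;
    }

    *out = negative ? -value : value;
    return 0;
}

// Hypothetical parser for "x,y".
int parse_point(const char **stream, size_t len, point_t *out) {
    (void)len;  // this sketch stops at the null terminator instead

    if (parse_int(stream, &out->x))
        return 1;
    // A missing comma is an error, so use expect_token.
    if (expect_token(stream, ','))
        return 1;
    return parse_int(stream, &out->y);
}
```

Returning early on each failure leaves *stream on the byte that caused it, at the cost of a few more lines than ORing the results.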

Token Classification

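Token classification is very easy to implement using C99’s designated initializers. A zero-filled lookup table can be used to test token class and to convert tokens to values. Because zero means the token is not in the class, each digit is stored as its value plus one; storing the raw value would make the character 0 look like a non-digit. The table is indexed with an unsigned char, since plain char may be signed.

// Zero means "not a digit"; digits store value + 1.
static const char digits[256] = {
    ['0'] = 1,  ['1'] = 2,  ['2'] = 3,  ['3'] = 4,  ['4'] = 5,  ['5'] = 6,
    ['6'] = 7,  ['7'] = 8,  ['8'] = 9,  ['9'] = 10,
};

void func(const char **stream)
{
    unsigned char ch = (unsigned char)**stream;

    // is it a digit?
    if (digits[ch]) {
       // yes, convert token to stored integral value
       int value = digits[ch] - 1;
       (void)value;
    }

    // skip token stream ahead to first non-digit
    while (digits[(unsigned char)**stream]) (*stream)++;
}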
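The same lookup-table idea carries over to other token classes. As a sketch, here is a hex-digit table that stores each digit's value plus one, so that a zero entry can only mean "not a hex digit". The names hex_digits and parse_hex are hypothetical, and overflow is not checked.

```c
#include <stddef.h>

// Zero means "not a hex digit"; every other entry is value + 1.
static const unsigned char hex_digits[256] = {
    ['0'] = 1,  ['1'] = 2,  ['2'] = 3,  ['3'] = 4,  ['4'] = 5,
    ['5'] = 6,  ['6'] = 7,  ['7'] = 8,  ['8'] = 9,  ['9'] = 10,
    ['a'] = 11, ['b'] = 12, ['c'] = 13, ['d'] = 14, ['e'] = 15, ['f'] = 16,
    ['A'] = 11, ['B'] = 12, ['C'] = 13, ['D'] = 14, ['E'] = 15, ['F'] = 16,
};

// Hypothetical parser: one or more hex digits, at most `len` bytes.
int parse_hex(char **stream, size_t len, unsigned *out) {
    char *start = *stream;
    unsigned value = 0;

    while ((size_t)(*stream - start) < len) {
        // Cast to unsigned char: plain char may be signed.
        unsigned char entry = hex_digits[(unsigned char)**stream];
        if (!entry)
            break;                       // first non-hex byte
        value = value * 16 + (unsigned)(entry - 1);
        (*stream)++;
    }

    if (*stream == start)
        return 1;                        // no hex digit at all

    *out = value;
    return 0;
}
```

One table lookup both classifies the byte and yields its value, which avoids a chain of range comparisons for the three character ranges.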