A better way to process binary data in Python

One of my side projects is CHIRP, an application for programming the memory contents of various radios.  Like other projects where I want to cater to (or at least not exclude) people running Windows, it’s written in Python.  Not only is Python my newfangled language of choice for high-level geekiness, it also gives me a substantial amount of platform isolation, and I’m comfortable deploying applications based on it to Windows.  I really have no desire to learn how to write “real” Windows applications, so if it weren’t as easy as it is, I wouldn’t do it.

Python is really great for a lot of things, but processing binary data is not what I would consider one of them.  CHIRP needs to download a binary image of a radio over a serial line and twiddle bits before uploading it again.  Since the radios are embedded microprocessors with limited storage, lots of bit packing is performed to be as efficient as possible, which results in a lot of bit whacking on my part to pack and unpack the information I need.  So far, this has been done by combining things like the struct module with the bitwise operators available in Python.  That means that the code looks like this:

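import struct

# Tuning step: upper nibble of one byte
tsval = (ord(mmap[POS_TSTEP]) >> 4) & 0x0F
# Frequency: pad with a leading zero byte so struct can read a 4-byte big-endian int
fval = '\x00' + mmap[POS_FREQ_START:POS_FREQ_END]
freq = ((struct.unpack(">i", fval)[0] * mult) / 1000.0)
# Duplex: top two bits of one byte
val = struct.unpack("B", mmap[POS_DUPX])[0] & 0xC0
if val == 0xC0:
    duplex = "+"
elif val == 0x80:
    duplex = "-"
else:
    duplex = ""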

It’s really quite ugly, and the bigger issue is that all the code above requires similar but opposite code to poke new values back into the same locations.  This often resulted in code that would extract the values properly but not insert them back correctly.  Further, common tasks were often done differently in different radio drivers that were written at different times.  One such task is extracting an index into an array of values that is stored as, say, three bits in the middle of a byte (or worse, six bits across two different bytes).  Another is packing and unpacking BCD frequency values of various widths and byte orders.  In summary, the major problems with this sort of approach are:

  1. Really ugly, hard-to-read, and hard-to-maintain code
  2. Different code paths for getting and setting a particular bitfield
  3. Multiple different styles and algorithms for getting at commonly-formatted, but differently-arranged bitfields
  4. Stability and correctness issues stemming from all of the above
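To make the second and third problems concrete, here is a minimal sketch (in Python 3, and not taken from CHIRP) of what a shared pair of helpers for "three bits in the middle of a byte" could look like.  The names get_bits and set_bits, and the sample buffer, are made up for illustration.  The point is that both directions derive the mask from the same shift and width, so they cannot drift apart:

```python
def get_bits(buf, offset, shift, width):
    """Return `width` bits of buf[offset], starting `shift` bits above the LSB."""
    mask = (1 << width) - 1
    return (buf[offset] >> shift) & mask


def set_bits(buf, offset, shift, width, value):
    """Store `value` into exactly the bits that get_bits() reads."""
    mask = (1 << width) - 1
    if not 0 <= value <= mask:
        raise ValueError("value %d does not fit in %d bits" % (value, width))
    # Clear the field, then OR in the new value; neighbouring bits are untouched.
    cleared = buf[offset] & ~(mask << shift) & 0xFF
    buf[offset] = cleared | (value << shift)


# Hypothetical two-byte image; write 0b010 into bits 3..5 of byte 0.
image = bytearray(b"\xFF\x00")
set_bits(image, 0, 3, 3, 0b010)
print(get_bits(image, 0, 3, 3))
print(hex(image[0]))
```

Writing every driver against one such pair would already remove a class of "reads fine, writes wrong" bugs, though it still leaves the offsets and shifts scattered through the code.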

Recently, I was reading some code by Dean AE7Q that was able to read and write ICOM ICF files.  His code is written in C++, so it has the benefit of using structs to map fields in a given buffer.  However, what really shook me was the use of bitfield definitions.  It’s not like I don’t write C all day long and interact with bitfield definitions on a regular basis; I just hadn’t considered how much easier they would make the process.  I decided that there had to be a way I could do something similar in Python to make my life easier.

What I decided to do was write a parser for a simple meta language that looks like C.  It needed to support the following elements:

  1. Structures
  2. Arrays
  3. Bitfield definitions

Additionally, it would be very helpful if it had the following properties:

  1. A native data type for BCD-encoded bytes (many radios store integer values in BCD)
  2. 24-bit integers (several radios use a three-byte integer to represent a frequency)
  3. A few compiler-directives to help seek within a large data stream to a position before mapping
  4. A pythonic mechanism to read and write the defined fields in the data stream after parsing
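For readers unfamiliar with the first two items, here is a rough sketch in Python 3 of what decoding and encoding those types by hand involves.  This is not the bitwise module's code; the function names are invented, and big-endian order is an assumption (as noted above, real radios vary in width and byte order):

```python
def bcd_to_int(data):
    """Decode big-endian packed BCD: two decimal digits per byte."""
    value = 0
    for byte in bytearray(data):
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("0x%02X is not a valid BCD byte" % byte)
        value = value * 100 + high * 10 + low
    return value


def int_to_bcd(value, length):
    """Encode a non-negative integer as `length` bytes of big-endian packed BCD."""
    if not 0 <= value < 100 ** length:
        raise ValueError("%d does not fit in %d BCD bytes" % (value, length))
    out = bytearray(length)
    # Fill from the last byte backwards, two decimal digits at a time.
    for i in range(length - 1, -1, -1):
        value, pair = divmod(value, 100)
        out[i] = ((pair // 10) << 4) | (pair % 10)
    return bytes(out)


def u24_to_int(data):
    """Decode a three-byte big-endian unsigned integer."""
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    return int.from_bytes(data, "big")


def int_to_u24(value):
    """Encode an integer as three big-endian bytes."""
    return value.to_bytes(3, "big")


print(bcd_to_int(b"\x12\x34"))
print(int_to_bcd(1234, 2))
print(u24_to_int(b"\x01\x00\x00"))
```

Having these as native types in the definition language means each driver only has to name the field, not repeat this arithmetic.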

What I came up with was a module called bitwise.  It uses PyPEG, a parsing library that fills roughly the role lex does for compiler writers working in C.  The grammar is extremely simple and easy to understand, and the parser/compiler/whatever isn’t too bad either (although it could use quite a bit more cleanup).  The result is a simple, single-call interface that turns a binary data stream into a very usable object tree.  See the following example code:

# Defines a format for parsing some binary data
defn = """
  struct {
    u8 foo;
    u8 highbit:1,
       sixbits:6,
       lowbit:1;
    char string[3];
    bbcd fourdigits[2];
  } mystruct[1];"""
# Some binary data for us to parse
data = "\x7F\x81abc\x12\x34"
tree = parse(defn, data)
print "Foo: %i" % tree.mystruct.foo
print "Highbit: %i" % tree.mystruct.highbit
print "Sixbits: %i" % tree.mystruct.sixbits
print "Lowbit: %i" % tree.mystruct.lowbit
print "String: %s" % tree.mystruct.string
print "Fourdigits: %i" % tree.mystruct.fourdigits

Which prints the following:

Foo: 127
Highbit: 1
Sixbits: 0
Lowbit: 1
String: abc
Fourdigits: 1234

If I wanted to change the sixbits field from all zeros to the value 13 and the string to “xyz”, all I have to do is:

tree.mystruct.sixbits = 13
tree.mystruct.string = "xyz"
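One way attribute assignment like this can write straight through to the underlying buffer is Python's descriptor protocol.  The sketch below (Python 3) is only an illustration of that idea, not how bitwise is actually implemented; the BitField and MyStruct classes are hypothetical, and it assumes the first-declared bitfield occupies the most significant bit of the byte:

```python
class BitField(object):
    """Descriptor exposing a few bits of one byte as an integer attribute."""

    def __init__(self, offset, shift, width):
        self.offset = offset
        self.shift = shift
        self.mask = (1 << width) - 1

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return (obj.buf[self.offset] >> self.shift) & self.mask

    def __set__(self, obj, value):
        if not 0 <= value <= self.mask:
            raise ValueError("value out of range for this field")
        # Clear only this field's bits, then merge in the new value.
        cleared = obj.buf[self.offset] & ~(self.mask << self.shift) & 0xFF
        obj.buf[self.offset] = cleared | (value << self.shift)


class MyStruct(object):
    # Second byte of the example data: highbit:1, sixbits:6, lowbit:1
    # (assumed to be laid out from the most significant bit down).
    highbit = BitField(offset=1, shift=7, width=1)
    sixbits = BitField(offset=1, shift=1, width=6)
    lowbit = BitField(offset=1, shift=0, width=1)

    def __init__(self, buf):
        self.buf = buf


image = bytearray(b"\x7F\x81abc\x12\x34")
s = MyStruct(image)
s.sixbits = 13
print(s.highbit, s.sixbits, s.lowbit)
print(hex(image[1]))
```

Because the getter and setter share one offset, shift, and mask, reading and writing a field are symmetric by construction.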

I think that’s a clear win in terms of improved syntax and maintainability.  For images where the memory regions I care about are not contiguous from the beginning (which is all of them), the #seek and #seekto directives allow me to apply the definitions that follow to a specific location in the data.
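The directive syntax itself isn’t shown here, but the underlying idea of mapping a definition at an offset can be illustrated with the standard library alone.  In this Python 3 sketch the image contents, the offset 0x10, and the record layout are all made up:

```python
import struct

# Hypothetical image: 16 bytes of padding, then a 3-byte record at offset 0x10.
image = bytearray(16) + bytearray(b"\x7F\x12\x34")

RECORD_OFFSET = 0x10      # assumed location of the region of interest
RECORD_FORMAT = ">B2s"    # one unsigned byte, then two raw bytes

# Read the record in place, without slicing the image first.
foo, raw = struct.unpack_from(RECORD_FORMAT, image, RECORD_OFFSET)
print(foo)
print(raw)

# Write a new record back to the same location.
struct.pack_into(RECORD_FORMAT, image, RECORD_OFFSET, 42, b"\x56\x78")
print(image[RECORD_OFFSET:])
```

A seek directive in the definition language plays the part of RECORD_OFFSET here, so the offset lives next to the layout it applies to rather than in separate constants.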

Aside from the obvious syntactic and mental health gains, here are a few “lines of code” metrics from before and after converting several drivers to use my “bitwise” module:

Driver    Lines before   Lines after   Change
IC-2820   704            336           53% smaller
ID-880    600            360           40% smaller
VX-8R     375            139           63% smaller

To me, that’s fairly significant, especially given that the code to enable this change is only 721 lines on its own.  By the way, all of these numbers include comments and a block of (unchanged) license text at the top, so the actual change in functional lines is even more significant.  In all, before the change CHIRP had almost 6,000 lines of bit-twiddling driver code.  I’ve only converted a few drivers so far, but if these initial gains continue to apply, that could cut the total driver codebase by 50%, which is a really good thing.

The other aspect of it, which is harder to convey with numbers, is that I know the driver code is more robust now.  Since I don’t own all the radios that CHIRP supports, being able to make small maintenance tweaks to the code without having to brute-force test the driver against the real device is extremely helpful.  Now that the code is much more symmetric, I think such maintenance tasks will be much smoother in the future.
