One of my side-projects is CHIRP, which is an application for programming the memory contents of various radios. Like other projects I’ve done where I want to cater to (or at least, not exclude) people running Windows, it’s written in Python. Not only is Python my newfangled language of choice for high-level geekiness, but it’s also a language that provides me a substantial amount of platform isolation and I’m comfortable deploying applications based on it to Windows. I really have no desire to learn how to write “real” Windows applications, so if it weren’t as easy as it is, I wouldn’t do it.
Python is really great for a lot of things, but processing binary data is not what I would consider one of them. CHIRP needs to download a binary image of a radio over a serial line and twiddle bits before uploading it again. Since the radios are embedded microprocessors with limited storage, lots of bit packing is performed to be as efficient as possible, which results in a lot of bit whacking on my part to pack and unpack the information I need. So far, this has been done by combining things like the struct module with the bitwise operators available in Python. That means that the code looks like this:
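    # Tuning step index lives in the high nibble of one byte
    tsval = (ord(mmap[POS_TSTEP]) >> 4) & 0x0F

    # Pad with a leading zero byte so the field unpacks as a big-endian 32-bit int
    fval = '\x00' + mmap[POS_FREQ_START:POS_FREQ_END]
    freq = ((struct.unpack(">i", fval)[0] * mult) / 1000.0)

    # Duplex direction is encoded in the top two bits
    val = struct.unpack("B", mmap[POS_DUPX])[0] & 0xC0
    if val == 0xC0:
        duplex = "+"
    elif val == 0x80:
        duplex = "-"
    else:
        duplex = ""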
It’s really quite ugly, and the bigger issue is that all the code above requires similar but opposite code to poke new values back into the locations. This often resulted in code that would extract the values properly, but not insert them back correctly. Further, common tasks were often done differently in different radio drivers that were written at different times. An example of such a task is extracting an index into an array of values that was, say, three bits in the middle of a byte (or worse, six bits across two different bytes). Another one is packing and unpacking BCD frequency values of various widths and endian orientations. In summary, the major problems with this sort of approach are:
- Really ugly, hard-to-read, and hard-to-maintain code
- Different code paths for getting and setting a particular bitfield
- Multiple different styles and algorithms for getting at commonly-formatted, but differently-arranged bitfields
- Stability and correctness issues stemming from all of the above
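To make the get/set asymmetry concrete, here is a minimal Python 3 sketch (not CHIRP code) of the "three bits in the middle of a byte" case. The helper names `get_bits` and `set_bits` and the sample byte are made up for illustration. The point is that the reader and the writer have to agree on the same shift and mask, and hand-written drivers tend to get one of the two subtly wrong.

```python
def get_bits(buf, offset, shift, width):
    """Read a `width`-bit field whose lowest bit sits `shift` bits above the LSB."""
    mask = (1 << width) - 1
    return (buf[offset] >> shift) & mask


def set_bits(buf, offset, shift, width, value):
    """Write the same field in place, leaving the neighbouring bits untouched."""
    mask = (1 << width) - 1
    if not 0 <= value <= mask:
        raise ValueError("value %d does not fit in %d bits" % (value, width))
    cleared = buf[offset] & ~(mask << shift) & 0xFF
    buf[offset] = cleared | (value << shift)


# Hypothetical one-byte "image": 0b10100101
image = bytearray(b"\xA5")
index = get_bits(image, 0, 2, 3)   # bits 4..2
set_bits(image, 0, 2, 3, 6)        # overwrite only those three bits
```

Even with helpers like these, every driver still has to carry the offset, shift, and width around by hand for each field, which is where the readability and correctness problems come from.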
Recently, I was reading some code by Dean AE7Q that was able to read and write ICOM ICF files. His code is written in C++, so it has the benefit of using structs to map fields in a given buffer. However, what really shook me was the use of bitfield definitions. It’s not like I don’t write C all day long and interact with bitfield definitions on a regular basis; I just hadn’t even considered how much easier they would make the process. I decided that there had to be a way I could do something similar in Python to make my life easier.
What I decided to do was write a parser for a simple meta language that looks like C. It needed to support the following elements:
- Structures
- Arrays
- Bitfield definitions
Additionally, it would be very helpful if it had the following properties:
- A native data type for BCD-encoded bytes (many radios store integer values in BCD)
- 24-bit integers (Several radios use a three-byte integer to represent a frequency)
- A few compiler-directives to help seek within a large data stream to a position before mapping
- A pythonic mechanism to read and write the defined fields in the data stream after parsing
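For readers who have not met these two encodings, here is a rough Python 3 sketch of what a BCD type and a 24-bit integer type have to do under the hood. This is not the code from the module; the function names and the `little_endian` flag (meaning reversed byte order) are assumptions made for the example.

```python
def bcd_to_int(data, little_endian=False):
    """Decode packed BCD (two decimal digits per byte) into an int."""
    if little_endian:
        data = data[::-1]
    value = 0
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("not a BCD byte: 0x%02X" % byte)
        value = value * 100 + high * 10 + low
    return value


def int_to_bcd(value, length, little_endian=False):
    """Encode an int as `length` packed BCD bytes (the inverse of bcd_to_int)."""
    out = bytearray(length)
    for i in range(length - 1, -1, -1):
        value, pair = divmod(value, 100)
        out[i] = ((pair // 10) << 4) | (pair % 10)
    if value:
        raise ValueError("value does not fit in %d BCD bytes" % length)
    return bytes(out[::-1] if little_endian else out)


def u24_to_int(data):
    """Decode a three-byte big-endian unsigned integer."""
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    return int.from_bytes(data, "big")


digits = bcd_to_int(b"\x12\x34")
packed = int_to_bcd(1234, 2)
freq_raw = u24_to_int(b"\x01\x00\x00")
```

Having these as native types in the definition language means the encode and decode halves are written once, instead of once per driver.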
What I came up with was a module called bitwise. It uses PyPEG, which is somewhat like lex for C-based compiler writers. The grammar is extremely simple and easy to understand, and the parser/compiler/whatever isn’t too bad either (although it could use quite a bit more cleanup). The result is a simple, single-call interface that turns a binary data stream into a very usable object tree. See the following example code:
    # Defines a format for parsing some binary data
    defn = """
    struct {
      u8 foo;
      u8 highbit:1,
         sixbits:6,
         lowbit:1;
      char string[3];
      bbcd fourdigits[2];
    } mystruct[1];"""

    # Some binary data for us to parse
    data = "\x7F\x81abc\x12\x34"

    tree = parse(defn, data)

    print "Foo: %i" % tree.mystruct.foo
    print "Highbit: %i" % tree.mystruct.highbit
    print "Sixbits: %i" % tree.mystruct.sixbits
    print "Lowbit: %i" % tree.mystruct.lowbit
    print "String: %s" % tree.mystruct.string
    print "Fourdigits: %i" % tree.mystruct.fourdigits
Which prints the following:
    Foo: 127
    Highbit: 1
    Sixbits: 0
    Lowbit: 1
    String: abc
    Fourdigits: 1234
If I wanted to change the sixbits field from all zeros to the value 13 and the string to “xyz”, all I have to do is:
    tree.mystruct.sixbits = 13
    tree.mystruct.string = "xyz"
I think that’s a clear win in terms of improved syntax and maintainability. For images where the memory regions I care about are not contiguous from the beginning (which is all of them), the #seek and #seekto directives allow me to apply the definitions that follow to a specific location in the data.
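To give a feel for how attribute-style access like this can stay symmetric, here is a small Python 3 sketch using a descriptor. It is only one possible approach and not how the bitwise module is actually implemented. The `BitField` and `MyStruct` classes are hypothetical, and the field positions are written out by hand (assuming the first declared bitfield is the most significant bit) rather than parsed from a definition.

```python
class BitField:
    """Descriptor: one (offset, shift, width) triple drives both get and set."""

    def __init__(self, offset, shift, width):
        self.offset = offset
        self.shift = shift
        self.mask = (1 << width) - 1

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return (obj.data[self.offset] >> self.shift) & self.mask

    def __set__(self, obj, value):
        if not 0 <= value <= self.mask:
            raise ValueError("value out of range for field")
        cleared = obj.data[self.offset] & ~(self.mask << self.shift) & 0xFF
        obj.data[self.offset] = cleared | (value << self.shift)


class MyStruct:
    # Hypothetical hand-written mapping of the second byte of the sample data
    highbit = BitField(1, 7, 1)
    sixbits = BitField(1, 1, 6)
    lowbit = BitField(1, 0, 1)

    def __init__(self, data):
        self.data = data  # a mutable bytearray shared with the caller


image = bytearray(b"\x7F\x81abc\x12\x34")
s = MyStruct(image)
before = (s.highbit, s.sixbits, s.lowbit)
s.sixbits = 13  # the write lands directly in the underlying image
```

Because reading and writing both go through the same field description, there is no separate "poke" code path to get out of sync with the "peek" one.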
Aside from the obvious syntactic and mental health gains, here are a few “lines of code” metrics from before and after converting several drivers to use my “bitwise” module:
Driver | Lines before | Lines after | Change |
IC-2820 | 704 | 336 | 52% smaller |
ID-880 | 600 | 360 | 40% smaller |
VX-8R | 375 | 139 | 63% smaller |
To me, that’s fairly significant, especially given that the code to enable this change is only 721 lines on its own. By the way, all of these numbers include comments and a block of (unchanged) license text at the top, so the actual change of functional lines is even more significant. In all, before the change CHIRP had almost 6,000 lines of bit-twiddling driver code. I’ve only converted a few of the drivers so far, but if these initial gains continue to apply, that could cut the total driver codebase by 50%, which is a really good thing. The other aspect of it, which is harder to convey with numbers, is the fact that I know that the driver code is more robust now. Since I don’t own all the radios that CHIRP supports, being able to make small maintenance tweaks to the code without having to brute-force test the driver against the real device is extremely helpful. Now that the code is much more symmetric, I think such maintenance tasks will be much smoother in the future.