Writing a Parser Is Boring?

When interacting with different services or programs, there often comes a point where some sort of file has to be parsed into a data structure that can be processed later. For standard file formats like XML or JSON, parsers are available for almost any language out there. (BTW, I love the graphical representation of the JSON syntax on their website.)

There are, however, situations in which one has to deal with a proprietary file format or with vendor-specific extensions to a standardized format. At this point a parser has to be written. It seems like a boring task, but there are several important aspects of programming which can be learned here.

Organizing Data in Data Structures

A file to be parsed contains a data structure, which has to be represented in the programming language one is parsing in. This can be as simple as a file containing email addresses for a mailing list, one per line, as shown here:

peter@gorillapatch.com
paul@gorillapatch.com
mary@gorillapatch.com

This easily translates into an array or list of strings in which each element is one address. The example becomes more complex if one introduces mailing lists. In this case there is still one email address per line, but there are sections in the file which specify the name of a mailing list:

[Class A]
peter@gorillapatch.com
paul@gorillapatch.com
[Class B]
mary@gorillapatch.com

Now the parsing problem gets a little more complicated. Before writing any code, however, one has to think about how the parsed data will be used later on. In this case a hash or dictionary keyed by the list names (Class A and Class B) could be the best solution, because the lists can then be accessed separately and all information from the input file is preserved in the parsed data structure. Another way to handle this information would be to save all email addresses in one array, with each element being itself an array of two elements: email address and list name. This could be beneficial if all email addresses are accessed regularly and single lists are needed only seldom. Another possibility would be to store the email addresses as objects, which would be especially useful if there are also operations to be applied to the data, since these could be implemented as member functions. The take-home message is Perl's old saying: TIMTOWTDI (There Is More Than One Way To Do It).
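To make the first two options concrete, here is a minimal sketch in Python. The function names parse_mailing_lists and as_pairs are made up for this illustration, and the sketch assumes that blank lines can be ignored and that an address appearing before any list header is an error. Neither rule comes from a real format specification.

```python
from collections import defaultdict


def parse_mailing_lists(text):
    """Parse sectioned address text into {list name: [addresses]}."""
    lists = defaultdict(list)
    current = None  # name of the section we are currently inside
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            lists[current]  # register the list even if it stays empty
        elif current is None:
            # assumption: every address must belong to a named list
            raise ValueError(f"address before any list header: {line!r}")
        else:
            lists[current].append(line)
    return dict(lists)


def as_pairs(lists):
    """Flatten the dictionary into (address, list name) pairs."""
    return [
        (address, name)
        for name, addresses in lists.items()
        for address in addresses
    ]


SAMPLE = """\
[Class A]
peter@gorillapatch.com
paul@gorillapatch.com
[Class B]
mary@gorillapatch.com
"""

parsed = parse_mailing_lists(SAMPLE)
print(parsed["Class A"])  # access a single list by name
print(as_pairs(parsed))   # or walk over all addresses at once
```

Parsing into the dictionary first and deriving the flat list from it keeps the parser itself small. If the flat form is the one used most, the parser could just as well build it directly.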

Test-Driven Development

Test-driven development belongs to a collection of best practices in programming called extreme programming, which affects both the code and project management. The idea is to have a suite of automated tests which make sure that the program is doing what it is expected to do.

In the case of parsing a file, one first must know what the file content means. So there is a defined input and a defined output which a unit of the program generates from it. Having an automated test is useful in two respects:

  1. it makes sure that the unit is doing what it is supposed to do
  2. it is a clear benchmark for when the work is finished

The first point may seem strange, because a programmer obviously makes sure that the code works as expected. Usually one can make it work without unit tests, but it is easier to test everything automatically. This leads to the second point: it is obvious when the work is done. The indicator is simply that all tests which have been written in advance pass. Now stop working.

The definition of a unit depends on the program. It can be as small as a parser method which converts one line of the input file into the corresponding data structure in the output. It can be as complex as reading in a whole file and making sure its entire content is parsed correctly. There is no harm if these two unit tests overlap. If the unit under test is too large, it is hard to locate the bug causing the test to fail. If the unit is too small, writing so many tests gets cumbersome. It takes some experience to judge how large a unit should be. However, having some kind of test is much better than not having any test at all.
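As a rough sketch of a small and a large unit side by side, the following uses Python's standard unittest module. The functions parse_line and parse_text are hypothetical stand-ins for a real parser. In a test-driven workflow the test class would be written first and the two functions filled in until it passes.

```python
import unittest


def parse_line(line):
    """Small unit (hypothetical): classify one line as header or address."""
    line = line.strip()
    if line.startswith("[") and line.endswith("]"):
        return ("header", line[1:-1])
    return ("address", line)


def parse_text(text):
    """Large unit (hypothetical): parse a whole file into {list: [addresses]}."""
    lists = {}
    current = None
    for line in text.splitlines():
        kind, value = parse_line(line)
        if kind == "header":
            current = value
            lists[current] = []
        else:
            lists[current].append(value)
    return lists


class ParserTest(unittest.TestCase):
    def test_header_line(self):
        self.assertEqual(parse_line("[Class A]"), ("header", "Class A"))

    def test_address_line(self):
        self.assertEqual(
            parse_line("peter@gorillapatch.com\n"),
            ("address", "peter@gorillapatch.com"),
        )

    def test_whole_file(self):
        # Overlaps with the two tests above, which does no harm.
        text = "[Class A]\npeter@gorillapatch.com\n[Class B]\nmary@gorillapatch.com\n"
        expected = {
            "Class A": ["peter@gorillapatch.com"],
            "Class B": ["mary@gorillapatch.com"],
        }
        self.assertEqual(parse_text(text), expected)


# Run the suite explicitly so the result object can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParserTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If test_whole_file fails while the two line tests pass, the bug is most likely in the loop of parse_text rather than in the line handling. This is the practical benefit of having tests at both sizes.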

Iterating and Bug-Fixing

Now a solution has been programmed and it is working, because the unit tests pass. Then, when parsing a special file, something does not work as expected. This file, together with the behavior expected for it, is now translated into a test. Of course this test will fail at first, because the file is not yet parsed correctly. Now go back to the previous section and implement the feature until the test passes. The tests make sure not only that the new file is processed correctly, but also that the old files still work. Having a unit test suite with good coverage guards against the problem that changing code has unrecognized side effects which may break other parts of the program.
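A sketch of one such iteration could look like this. The "special file" is assumed here to contain blank lines and stray whitespace around the addresses. This is an invented scenario, and parse_addresses is a hypothetical name. The second test was written from the problem file, and the parser is shown after the fix.

```python
import unittest


def parse_addresses(text):
    """Return one address per non-empty line.

    The strip-and-skip step is the fix for the problem file. The assumed
    earlier version simply returned text.splitlines().
    """
    addresses = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            addresses.append(line)
    return addresses


class ParseAddressesTest(unittest.TestCase):
    def test_plain_file(self):
        # Existing test: guards the behavior that already worked.
        text = "peter@gorillapatch.com\npaul@gorillapatch.com\n"
        self.assertEqual(
            parse_addresses(text),
            ["peter@gorillapatch.com", "paul@gorillapatch.com"],
        )

    def test_blank_lines_and_padding(self):
        # New test: derived from the file that was parsed incorrectly.
        text = "peter@gorillapatch.com  \n\n  mary@gorillapatch.com\n"
        self.assertEqual(
            parse_addresses(text),
            ["peter@gorillapatch.com", "mary@gorillapatch.com"],
        )


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseAddressesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping test_plain_file in the suite is what catches a fix that repairs the new file but breaks the old ones.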

Using Source Code Management Systems

First of all, a serious programmer must use a source code management system! This can be anything from SVN, CVS, Perforce, Bazaar, Mercurial or Git. There are pros and cons for each of them, but a distributed version control system like Git offers both a lot of flexibility and a lot of power.

There is a branching model by Vincent Driessen called Git Flow, which is a good way of representing the history of a programming project cleanly in the repository structure. To help with managing the feature branches this model uses, there is also an extension for Git available, which can be found on GitHub. These high-level functions help with structuring the repository. The effort pays off even more when there are things like code reviews or IT audits and one can show why which code was written.

Wrapping it all up

The four aspects described here apply to any good programming workflow. Writing a parser is usually such a small project that it is perfect for establishing best practices. Any comments are appreciated.

Thomas

Chemist, Programmer, Mac and iPhone enthusiast. Likes coding in Python, Objective-C and other languages.
