Writing a Parser Is Boring?

When interacting with different services or programs, there often comes a point where some sort of file has to be parsed into a data structure that can be processed later. For standard file formats like XML or JSON, parsers are available for almost any language out there. (BTW, I love the graphical representation of the JSON syntax on their website.)

There are, however, situations in which one has to deal with a proprietary file format or with vendor-specific extensions to a standardized format. At this point a parser has to be written. It seems like a boring task, but several important aspects of programming can be learned from it.

Organizing Data in Data Structures

When parsing a file, the data structure it contains has to be represented in the programming language doing the parsing. This can be as simple as a file containing email addresses for a mailing list, one per line, as shown here:

peter@gorillapatch.com
paul@gorillapatch.com
mary@gorillapatch.com

This easily translates into an array or list of strings in which each element is one address. The example can be made more complex by introducing mailing lists. In this case there is still one email address per line, but section headers in the file specify the name of each mailing list:

[Class A]
peter@gorillapatch.com
paul@gorillapatch.com
[Class B]
mary@gorillapatch.com

Now the parsing problem gets a little more complicated. Before writing any code, however, one has to think about how the parsed data will be used later on. In this case a hash or dictionary keyed by the list names (Class A and Class B) could be the best solution, because the lists can then be accessed separately and all information from the input file is preserved in the parsed data structure. Another way to handle this information would be to store all email addresses in one array, with each element being itself an array of two elements: email address and list name. This could be beneficial if all email addresses are accessed regularly and single lists are needed only rarely. A third possibility would be to store the email addresses as objects, which would be especially useful if there are operations to be applied to the data that could be implemented as member functions. The take-home message is Perl’s old saying: TIMTOWTDI (There Is More Than One Way To Do It).
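The three options can be put side by side in a short Python sketch. The names parse_lists and Subscriber are made up for illustration, and two behaviors are assumptions that the file format above does not specify: blank lines are skipped, and an address that appears before any list header is treated as an error.

```python
from dataclasses import dataclass

SAMPLE = """\
[Class A]
peter@gorillapatch.com
paul@gorillapatch.com
[Class B]
mary@gorillapatch.com
"""


def parse_lists(text):
    """Parse sectioned text into a dict: {list name: [addresses]}."""
    lists = {}
    current = None  # name of the list we are currently inside
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            lists.setdefault(current, [])
        elif current is None:
            # assumption: an address before any header is an error
            raise ValueError(f"address outside of a list: {line!r}")
        else:
            lists[current].append(line)
    return lists


# Option 1: dict keyed by list name; each list can be accessed separately.
by_list = parse_lists(SAMPLE)

# Option 2: one flat list of (address, list name) pairs.
pairs = [(addr, name) for name, addrs in by_list.items() for addr in addrs]


# Option 3: objects, so operations can live next to the data.
@dataclass
class Subscriber:
    address: str
    list_name: str

    def domain(self):
        """Return the part of the address after the '@'."""
        return self.address.split("@", 1)[1]


subscribers = [Subscriber(addr, name) for addr, name in pairs]

print(by_list["Class B"])       # ['mary@gorillapatch.com']
print(pairs[0])                 # ('peter@gorillapatch.com', 'Class A')
print(subscribers[0].domain())  # gorillapatch.com
```

All three structures hold the same information; which one fits best depends on how the data is accessed afterwards.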

Test-Driven Development

Test-driven development is one of the practices collected under the name extreme programming, a set of best practices that affects both the code and project management. The idea is to have a suite of automated tests which makes sure that the program is doing what it is expected to do.

In the case of parsing a file, one first must know what the file content means. Then there is a defined input and a defined output for each unit of the program. Having an automated test is useful in two respects:

it makes sure that the unit is doing what it is supposed to do
it is a clear benchmark when the work is finished

The first point may seem strange, because a programmer obviously makes sure that the code works as expected. One can usually get it working without unit tests, but it is easier to test everything automatically. This leads to the second point: it is obvious when the work is done. The indicator is simply that all tests, which were written in advance, pass. Now stop working.

The definition of a unit depends on the program. It can be as small as a parser method which converts one line of the input file into the corresponding data structure in the output. It can be as large as reading in a whole file and making sure its entire content is parsed correctly. There is no harm if these two unit tests overlap. If the unit under test is too large, it is hard to locate the bug causing the test to fail. If the unit is too small, writing so many tests gets cumbersome. It takes some experience to judge how large a unit should be. However, having some kind of test is much better than not having any test at all.
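As an illustration of a small and a large unit, here is a minimal sketch using Python's unittest module. The functions parse_line and parse_text are hypothetical names for a per-line parser and a whole-file parser of the mailing-list format. In strict test-driven style the test class would be written before the two functions; they appear first here only so that the block runs as a single file.

```python
import unittest


def parse_line(line):
    """Classify one line as ("header", name), ("address", addr) or None."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("[") and line.endswith("]"):
        return ("header", line[1:-1])
    return ("address", line)


def parse_text(text):
    """Parse a whole file's text into {list name: [addresses]}."""
    lists, current = {}, None
    for raw in text.splitlines():
        parsed = parse_line(raw)
        if parsed is None:
            continue
        kind, value = parsed
        if kind == "header":
            current = value
            lists.setdefault(current, [])
        elif current is None:
            # assumption: an address before any header is an error
            raise ValueError(f"address outside of a list: {value!r}")
        else:
            lists[current].append(value)
    return lists


class ParserTests(unittest.TestCase):
    def test_single_line(self):
        # Small unit: one line in, one parsed item out.
        self.assertEqual(parse_line("[Class A]\n"), ("header", "Class A"))
        self.assertEqual(
            parse_line("mary@gorillapatch.com\n"),
            ("address", "mary@gorillapatch.com"),
        )
        self.assertIsNone(parse_line("   \n"))

    def test_whole_file(self):
        # Large unit: overlaps with the test above, which is fine.
        text = "[Class A]\npeter@gorillapatch.com\n[Class B]\nmary@gorillapatch.com\n"
        self.assertEqual(
            parse_text(text),
            {
                "Class A": ["peter@gorillapatch.com"],
                "Class B": ["mary@gorillapatch.com"],
            },
        )


# Run the suite explicitly; in a real project a test runner would do this.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParserTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

If test_whole_file fails while test_single_line passes, the bug is most likely in the bookkeeping of parse_text rather than in the line handling, which is the benefit of having both unit sizes covered.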

Iterating and Bug-Fixing

Now a solution has been programmed and it is working, because the unit tests pass. Then, when parsing some special file, something does not work as expected. This file, together with the behavior that is expected for it, is now translated into a test. Of course this test will fail, because the file was not parsed correctly. Now go back to the previous section and implement this feature until the test passes. The tests make sure not only that the new file is processed correctly, but also that the old files still work. A unit test suite with good coverage addresses the problem that code changes can have unnoticed side effects which break other parts of the program.
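A sketch of this loop in Python, under an invented scenario: suppose the special file contained comment lines starting with '#', which the first version of the parser treated as addresses. The comment syntax and the name parse_lists are assumptions made for illustration only. The problem file becomes a new test, the parser gets a one-line fix, and the old test stays in the suite to catch side effects.

```python
import unittest


def parse_lists(text):
    """Parse sectioned text into {list name: [addresses]}."""
    lists, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        # Fix added for the problem file: also skip '#' comment lines.
        # The first version only skipped blank lines here.
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            lists.setdefault(current, [])
        elif current is None:
            raise ValueError(f"address outside of a list: {line!r}")
        else:
            lists[current].append(line)
    return lists


class RegressionTests(unittest.TestCase):
    def test_original_file(self):
        # Old test: must keep passing after every change.
        text = "[Class A]\npeter@gorillapatch.com\npaul@gorillapatch.com\n"
        self.assertEqual(
            parse_lists(text),
            {"Class A": ["peter@gorillapatch.com", "paul@gorillapatch.com"]},
        )

    def test_file_with_comments(self):
        # New test, written from the file that exposed the bug.
        text = "# exported on Monday\n[Class B]\n# teacher\nmary@gorillapatch.com\n"
        self.assertEqual(
            parse_lists(text),
            {"Class B": ["mary@gorillapatch.com"]},
        )


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RegressionTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Without the comment check, test_file_with_comments fails with a ValueError for the first line, which is exactly the red-then-green cycle described above.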

Using Source Code Management Systems

First of all, a serious programmer must use a source code management system! This can be anything from SVN, CVS, or Perforce to Bazaar, Mercurial, or Git. There are pros and cons for each of them, but a distributed version control system like Git offers a lot of flexibility as well as a lot of power.

There is a branching model by Vincent Driessen called Git Flow, which is a good way of representing the history of a programming project cleanly in the repository structure. To help with the management of its feature branches there is also an extension for Git, which can be found on GitHub. These high-level commands help to structure the repository. The effort pays off even more when there are code reviews or IT audits and one can show why each piece of code was written.

Wrapping It All Up

The four aspects covered here apply to any good programming workflow. Writing a parser is usually such a small project that it is perfect for establishing these best practices. Any comments are appreciated.