File Format Design
The laws of bioinformatics file format design
I'm not bitter
1. Make up a new format. Clearly, you are so brilliant that no-one has dealt with this kind of data before. The only time you should stick to an existing format is when it obeys the guidelines below. Alternatively, take an existing format and bend it to your needs. The more subtle the difference is, the better. Some people won't have to change their software at all! And other people will just silently get erroneous results. Obviously, silent erroneous results are better than up-front failure.
2. Ignore escaping! You're trying to use an alphabet with N symbols to represent data potentially containing all N symbols, plus structure. But go ahead and steal some symbols to represent the structure, and don't bother trying to give them back through escaping. No one will ever want to put data containing a tab or newline character into your file format. Or a #. Or a semicolon or an equals sign or a comma. Take all the symbols you want! &lt; looks totally crazy--no-one wants to type four characters just to get a less-than sign.
3. Use quotes. If you must represent data containing structural symbols like tabs or newlines or commas, just put it in quotes. Quotes are magic! No one ever wants to represent data containing quotes. And most software is very good about dealing with quoted data containing embedded newlines. Adding quotes is like pushing around a bubble under wallpaper--moving the problem from one place to another feels very productive.
4. Underspecify. No one wants to be nailed down to some set of instantly-outdated rules. Plus, anyone wanting to write software for your format will enjoy all the little surprises that people will cram into the gray areas.
5. Make it ad-hoc. People will want to extend your format in a hundred different ways, but don't worry about extensibility. That's someone else's problem, even if that someone else is you a year from now. A year is a long time, and you need to get stuff done now! In particular, avoid formats where the attributes are named. No one wants to deal with all that verbosity. Better to use some kind of delimited format; if someone wants a new attribute they can just add a new column! Figuring out what the column means is an entertaining puzzle for everyone else. This is especially fun when different groups each add a column that means something different. Alternatively, jam a bunch of stuff into one of the existing columns.
6. Namespaces are for wusses. The more people that have to coordinate to make a change, the better.
7. Hand editing is important. Very few people in bioinformatics have to deal with large volumes of data, so most people will be hand-creating these files rather than using software to read and write them. If you have to choose between making the format more convenient for a human and more convenient for software, always choose the human.
8. If you're going to be dealing with people that are not too computer-oriented, use Excel. This is completely sane. And if there's some grouchy programmer that doesn't like Excel, you can just export to CSV. Excel CSV files follow almost all of these guidelines.
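For anyone tempted to disobey laws 2 and 3, here is a minimal sketch of giving the stolen symbols back through escaping. It assumes a tab-delimited, one-record-per-line format with backslash escapes. The helper names (escape_field, unescape_field, write_record, read_record) are hypothetical, made up for illustration, and not taken from any existing format or library.

```python
# Hypothetical helpers for an assumed tab-delimited, one-record-per-line format.
# Structural symbols (tab, newline) and the escape character itself are escaped,
# so any string can be stored in a field and read back unchanged.

_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n"}


def escape_field(value: str) -> str:
    """Replace structural characters with two-character escape sequences."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def unescape_field(text: str) -> str:
    """Invert escape_field; fail loudly on an unknown or dangling escape."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)  # None means the backslash was the last character
        if nxt not in _UNESCAPES:
            raise ValueError(f"invalid escape sequence in field: {text!r}")
        out.append(_UNESCAPES[nxt])
    return "".join(out)


def write_record(fields: list[str]) -> str:
    """Join escaped fields with tabs; the result contains no raw newline."""
    return "\t".join(escape_field(f) for f in fields)


def read_record(line: str) -> list[str]:
    """Split on raw tabs first, then unescape each field."""
    return [unescape_field(f) for f in line.split("\t")]


record = ["gene\t1", "note with \\ and\nnewline", "a=b;c,d#e", ""]
line = write_record(record)
restored = read_record(line)
```

Because a raw tab or newline can never appear inside an escaped field, splitting on tabs is unambiguous and restored equals record. A malformed escape raises ValueError rather than producing silently wrong data, which is the opposite of what law 1 recommends.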
-- Mitch Skinner - 01 Apr 2007