Monday, February 19, 2007

One File Is Not Enough

Every once in a while I run into a situation where I need to attach different types of data to one another. It could be a PDF to which I want to add some notes, or a picture that I want to describe, or a song to which I want to add lyrics. In certain cases I want to bundle together several files, some of which could be pretty complex themselves, such as adding a Visio diagram of a database schema to an Access database file.

What I basically need is to put several different files together while preserving the ability to access each and every one of them individually.

Sure, I could zip them, but that would not really solve the problem, since I would have to unzip them before I could use them. I could put them in a folder, and that would help a little, but I would still be able to delete one and leave another, and I would have to manage their names separately, and so on. Definitely not a solution I would prefer.
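To make the zip workaround concrete, here is a minimal sketch in Python using the standard `zipfile` module. The helper names (`bundle`, `read_member`, `list_members`) and the file names are hypothetical, chosen only for illustration. It suggests that a program can pull a single member out of the bundle without unpacking everything, but the bundle is still an opaque archive to an ordinary editor, which is the limitation described above.

```python
import io
import zipfile


def bundle(files: dict[str, bytes]) -> bytes:
    """Pack several named files into one in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def list_members(archive: bytes) -> list[str]:
    """Return the names of the files stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.namelist()


def read_member(archive: bytes, name: str) -> bytes:
    """Read one member's bytes without extracting the others."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


# Hypothetical example: a database file bundled with its schema diagram.
archive = bundle({"schema.vsd": b"diagram", "data.mdb": b"tables"})
print(list_members(archive))
print(read_member(archive, "data.mdb"))
```

Even in this sketch the members keep their own names and extensions inside the archive, so the bookkeeping problem does not go away; it just moves inside the zip.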

It seems that in the current state of affairs, there is no way to do this in any of the existing file systems. Why? Well, I guess there are several reasons.

First, let's look at the definition of a file. The way I would describe it, a "file" is a collection of data of the same type. Of course, this is no longer entirely true for modern files, which may include different types of data (think of a Word document with an embedded picture), but it still works most of the time (you could still call it a document, only with a picture). Wikipedia defines a file a little differently (take a look). I liked the phrasing "available to a computer program". It means that a certain file is related to a certain computer program, as in: it can be opened by a certain editor. Another editor, another file. This makes sense and matches everyday experience. Indeed, how often have you seen files that could be opened in two entirely different editors? (I don't count similar editors from different vendors, or text editors that can open everything.)

Which brings us to the second point, the file extension. OK, names are no longer limited to eight characters (though I still avoid putting spaces in file names), nor are extensions limited to three, but extensions are still very important and dominant in the way we manipulate files today. In most cases, the extension determines which editor will be opened (at least by default) to edit the file. This can be changed, of course, but if a .pdf you download turns out to be an ArcView Package Definition File - you will be very mad. After all, the extension defines the format and implies the content.

And finally, let's take a look at our file systems, ladies and gentlemen. They are, well, old. I use Windows XP, which came preinstalled (see: Windows Tax), and I think I have NTFS, but I don't really care, since I don't use its "advanced" features anyway. For those who were sad when WinFS was dropped from Vista, relax - I am not sure it would have improved the situation. It was more about being a database and having metadata. We are very obsessed with search these days.

We have come a long way since 1977 (that was a good year), but we still can't treat several different files as one. Sad, and frankly, a little bit strange. Hasn't anyone encountered this problem before?

I think, in a way, the media guys did. They had the problem of distinguishing between two different types of file content (audio and video) and the file container that bundles them together. The way they solved it is, in my opinion, far from optimal, but it makes a good case in point. They distinguish between a codec, which is used to read and understand specific video or audio data, and the file format used to hold that data. Every media player in the world supports this scheme, as does every other tool that needs to access media files. First it reads the file, just as a regular editor would, in a way determined by the file extension, and then it uses the codec specified in the file to read the audio and video data. This way your media streams always stay together: you can't accidentally copy the video and forget the voice track, or vice versa.
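The container-versus-codec split can be sketched roughly as follows. This is a toy model in Python, not any real media format: the `Stream` and `Container` classes, the `CODECS` registry, and the codec names are all assumptions made up for illustration. The point is only that the container keeps the streams together and names a codec for each, while decoding is delegated to whatever codec is installed under that name.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Stream:
    kind: str       # what the stream is, e.g. "video" or "audio"
    codec: str      # name of the codec needed to decode the payload
    payload: bytes  # the encoded data itself


@dataclass
class Container:
    # One file, several streams that always travel together.
    streams: list[Stream] = field(default_factory=list)


# Hypothetical codec registry: codec name -> decoder function.
CODECS: dict[str, Callable[[bytes], str]] = {
    "toy-video": lambda data: f"{len(data)} bytes of video",
    "toy-audio": lambda data: f"{len(data)} bytes of audio",
}


def play(container: Container) -> list[str]:
    """Decode every stream using the codec the container names for it."""
    decoded = []
    for stream in container.streams:
        decoder = CODECS.get(stream.codec)
        if decoder is None:
            raise LookupError(f"no codec installed for {stream.codec!r}")
        decoded.append(f"{stream.kind}: {decoder(stream.payload)}")
    return decoded


movie = Container([
    Stream("video", "toy-video", b"\x00" * 4),
    Stream("audio", "toy-audio", b"\x00" * 2),
])
print(play(movie))
```

Note that the reader only knows how to handle the stream kinds it was written for; nothing in this sketch lets an arbitrary extra file ride along.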

Does this solve the problem? Eh. Well, partially, for this specific case. You still can't add anything else to your file - those who have ever used subtitles will see what I mean. This solution is not scalable enough, and besides, do you know how many video file formats there are?

I believe this is a real problem, and that solving it will be the next big step in file systems, no less than the search and database stuff. I think this problem is encountered and solved over and over, each time with one compromise or another, mostly as a workaround. I don't know what you think (I will if you comment), but for me: One File Is Not Enough!