Wednesday, March 3 2010
In the process of importing a Microsoft-generated Excel spreadsheet into my homemade data-manipulating universe, I learned some things I hadn't known about the data format known as CSV (comma-separated values). In this common export format, database column values are separated by commas and rows are separated by carrigae returns. Unlike the native formats of database tables from nearly all database vendors, this format is human-readable when viewed in a text editor. My assumption was that the values in any particular CSV file are all either enclosed by quotes or they are not, depending on whether or not the column values themselves could potentially contain commas and carriage returns (which would otherwise be interpreted as delimiters). Of course this doesn't really solve anything, since it introduces another character (the double quote) that has to be escaped should it appear in columnar data. But it's a hacked-together data format from the 1960s, a time when tabular data wasn't often used for text information (and when text itself was often all-uppercase and bore only rudimentary punctuation).
It turns out that my assumption about CSV was over-simplistic, because (at least as exported by Microsoft Excel), any particular field in any particular row in any particular CSV file can either have quotes around it or it can not have quotes around it, depending on whether they are necessary. (In the CSV export code I've written, the fields are always quoted and any quotes in the tabular data are always escaped, and Excel never had any problem importing this more predictable format.)
It turns out that parsing a CSV file where the data can go back and forth between being quoted and not being quoted is a challenging programming problem. Today I decided to tackle it after discovering that the PHP function that is supposed to do this (str_getcsv) doesn't seem to work (and even if it did, it's only present in very recent versions of PHP). The function I use for parsing delimited strings is called BetterExplode, and it's been in my arsenal since November, 2007. It basically does what the built-in PHP function explode does, but it pays enough attention to the context of characters used as delimiters to avoid treating them as such when they are obviously part of data (quoted, escaped, or appearing in a PHP comment). Today I added a small amount of code to BetterExplode so that it could successfully parse lines of a CSV file exported by Microsoft Excel. The function works by scanning through a string character by character, going into and out of modes depending on what it finds along the way. My new code is built around a "mop" (a variable called $strMop) that catches cases that had earlier been ignored, outputting them in the result array. For those coming here from Google searches, I've provided the content you are seeking: the latest copy of BetterExplode, complete with all the special functions it relies upon:
For linking purposes this article's URL is:feedback
previous | next