studying a cache
Friday, January 28 2011
After much spelunking through object-oriented PHP code organized in disparate deeply-nested folders, and after paying close attention to the data I'd been logging, I figured out the problem that has been driving me crazy for the past couple weeks (in the project for which I have recently been projectmicromanaged). It turned out that a couple of obscure fields holding external userids (userids on a remote system) had not been changed from bigints to varchars. And they needed to be varchars on this remote system, because userids there are alphanumeric. Well, technically they're hexadecimal (and so big that they cannot be converted to bigints). Mind you, all of this was happening in a data caching table, meaning that the problem went away when I simply disabled the caching system. And while this whole problem is probably insufferably dull to read about, the result of diagnosing and fixing it was something close to elation. There's an endorphin rush associated with solving a puzzle, particularly if your professional reputation is on the line. I might not get paid anything near what I should have been paid for this project, but the rush of solving its biggest crisis would be expensive to duplicate.
While I was pumped up on that high, I turned my attention to another web challenge: database exhaustion on the content syndication site I built for David (of Penny and David). That site is based on MySQL and PHP but runs under shared hosting on a Godaddy Windows server (because it once required Cold Fusion). Now, though, without a Cold Fusion requirement, this platform makes little sense, but it's such a big unwieldy database that I've been slow to migrate it to a proper Linux hosting environment. Recently the site and David (and, by extension, me) have been paying the price for the inadequate platform. Godaddy shared hosting is cut-rate and kind of ghetto (the kind where Danica Patrick lives in a trailer and is actually named Amber Shifflett), and either David's article database has recently crossed a scary size threshold or Godaddy has ramped back on the computational firepower (Moore's Law in reverse). For whatever reason, the site has been crashing a lot lately: throwing memory exhaustion errors and unhelpfully undefined 500 errors. If one calls Godaddy, they can usually bring the site back up, but communicating with them is always a pain. Every interaction is with a tech support guy who knows none of the history of the site's problems, so there's always a prolonged initial period spent convincing them that the problem is their hosting environment, not our code.
Today I decided to solve the problem with a database recordset caching system (vaguely similar to the one mentioned in the first paragraph, but based entirely in the file system). David's site doesn't require fresh data from the database for every pageview; most of the time these queries produce the same results as they did the time before. So why not find a way to store the results of queries in the file system and, if these results aren't too old, use them as the source of data instead of a hit to the database? Implementing such a system was helped by the fact that all database queries on the site pass through a single wrapper function.
The first problem to overcome was relating query SQL to the contents of files holding their results. I should mention here that I always retrieve data from the database in complete recordsets (that is, as arrays of associative arrays). This means that I never process data in a record-by-record manner while the connection stays open. I get all the data, close the connection (or try to; PHP usually leaves it open), and then deal with it all as a recordset. Because I do this, the recordset result of a SQL query could simply be serialized and put in a text file. But what name should I give that file? I decided the name should be a MD5 of the SQL. MD5 produces a concise and sufficiently-unique file name that allows for easy storage and retrieval in the file system of a recordset for any query.
Other features of the system allowed me to specify different cache lives (based on cache file modified dates) for queries containing different strings (in other words, hitting different tables). I also added some code to prevent caching of recordsets from Updates and Inserts.
The system I just described might sound a bit complicated, but it only took me about a half hour to implement. And once I had it running, browsing the cache directory proved extremely helpful in further streamlining the code. I found evidence (in terms of data being exchanged) of very expensive queries being done with every page load. One of these produced a half-megabyte list of nearly 20,000 article IDs that my PHP then counted — all just to figure out how many pages were in a paginated list. This is the kind of sloppy programming that is easy to leave in place when you're working quickly. It "works," so you don't go back and fix it (in this case, that would have meant making the SQL do the count and return just that one number instead of a mess of IDs for the PHP count). So not only was my caching system good at eliminating database hits; it also drew my attention to the shoddiest of my database calls, all of which were easy to find by simply ordering the cache files by size. (I was creating files parallel with the cache files having the cache files' names but ending in .sql and containing both the query that produced the recordset and a debug_backtrace() to help me find where, in my PHP, the queries were coming from.)
Tomorrow Gretchen and I had plans to attend a Bard graduation at Eastern Correctional Facility, so I went down to the basement to take a bath. While I was still soaking in the delicious hot waters, Gretchen got a phone call from someone else in the program telling her that the graduation had been post-poned. Why? A semi-famous speaker had found time in his schedule to be there, family members had taken time off work to be there, and I'd canceled a visit to the city to visit my childhood friend Nathan V. (he was attending a conference). Well, it turned out that there was no prison riot and the place hadn't been subsumed in a mudslide. What had happened was that a guard who works at Eastern had died in an automobile accident and, well, prison guards are a brotherhood, and now was no time to be celebrating anything. Or something like that.
This evening Gretchen and I watched a movie put together by our friend Marissa called Vegucated. It chronicled the experimental transition of three New Yorkers from carnivo-omnivore to veganism, and along the way we're all given the case for not eating animals. I don't like watching clips of suffering animals, and there was some of that in Vegucated, so I mostly just averted my eyes during those sequences. It was an engaging film, which was especially surprising given that it had been made for something like $20,000. It had a slight low-budget "grain" to it, but that might be the sort of thing that can be cleaned up in post production.
For linking purposes this article's URL is:feedback
previous | next