Archive for the ‘bugs’ category

Do You Debug Your Website?

November 17, 2008

I can’t say that the bug this post is about is my worst bug ever.  Or even my most embarassing bug ever.  But it is certainly my most costly bug ever.  But first, a picture.  You can click it to see the full-sized version.

The above graph shows two years and change of monthly visitor counts to my business.  There are a bunch of milestones I could have put up there — the new versions of my software, the key events in my market’s annual cycle, the redesign of my website, the time I quadrupled my advertising budget.  And they all pale in comparison to squashing one stupid little bug.

What Could Possibly Have Done That?

Well, like many bugs, the impact of this one rose in direct proportion to how clever you are.  My business website runs on a custom-designed CSS which cranks out free printable bingo cards.  The home page, shopping cart, main level navigation, all of these things are valuable towers which sit on a gigantic foundation of free content.  That content attracts me links, traffic, and searchers.  The bright idea to make the production of this content scale is probably the single biggest thing I’ve ever done to grow my business.

And scale it does.  I pay a terrifically talented freelancer to write the actual word lists that become the bingo cards, and provide a little bit of color commentary.  I feed her input into the CMS, and *bam* web page, downloadable PDF, and GIF are created.  Now multiply this times a couple hundred, for very little additional work on my part.

The Best Laid Plans of Mice and Men

One design decision I made when building this automated marketing machine was to give it a time-based element: one bingo card got featured a day.  Originally this was largely because I had the site separate from my main site and wanted to call it Daily Bingo Cards.  I figured, hey, if a new card pops up to the “front” page every day then it will always be fresh, and that makes it stickier for folks.  (And, as it turns out, I do have a handful of folks who a year later are still loading the page several times a week to see what is new.)

However, I had (totally unfounded) performance worries about Rails when I wrote this application.  I thought hitting the database for every pageview was needlessly taxing on my server (bzz, I only get a few thousand a day, why the heck would the server complain?)  So I decided to turn on caching.

Caching is a wonderful tool.  You can use it to solve just about any performance problem.  It solves it by replacing the problem with a cache expiration problem.  (No performance problem?  Oh, you get to deal with cache expiration anyway, don’t worry.)

How Hard Can It Be?

Now, if you want to refresh your content on your website once a day, and you’re not really particular when you do it, caching should be dead easy: purge the cache once a day.  Which I did.  Sort of.

You see, Rails stores cached pages in the same directory your static HTML is being served out of, as static HTML.  Your web server treats the cached pages as it would any other web page if they exist, and just slurps them right out of the directory and spits them out at whoever is requesting that page.  This is blazingly, bugs-in-your-teeth fast.

And to clear the cache?  Simple — delete the HTML file and it is like it never existed.  Your web server, seeing no file matching the user’s request, will fallback to calling Rails to see what to do with the request, and after Rails does its magic there might be a new HTML file deposited into that directory.

And how to delete a file once a day?  Stick task in chrontab… done.

Your Testing Protocol Leaves Something To Be Desired

Now normally a line which is, roughly speaking, the complexity of rm -rf /some/file/name/goes/here is pretty hard to screw up.  But knowing well my pechant for screwups I tested those lines before I wrote them into the crontab.  As root (do you see where this is going?)  And they worked swimmingly.

Of course, when I wrote them into the crontab, I did not set them to execute as Root.  No, I set them to execute as the user that I thought would be dropping the files — clearly the web server, right?  (WRONG.  The web server proxies requests to Rails, but Rails operates under another user to write the cached html.  I knew that on an intellectual level, but it did not occur to me one fateful evening when writing that cron file.)

Compounding Errors

So, as a result, large portions of my website were not getting refreshed when I thought they were.  As a result, instead of a dynamic bingo card publishing empire, with constantly refreshed content that was added to, content on the website was fixed at first generation and then went stale.

Had it stayed stale, I would have eventually figured things out.  But no — I was developing the site, occasionally, and every time I deployed a new version of it, one of the deployment steps had the side effect of nuking the cache directory.  And since the only reason I would be trolling on my own bingo cards pages was to see that my changes were taking effect, I never saw the bug.  

Meanwhile, between bursts of development activity, the page would look like an unattended ghost town for weeks or months at a time.

Diagnosing The Error, or, Reality Bites Your Hindquarters

Periodically I check what bingo cards my users find most appealing, both for curiosity’s sake and to guide development of new content.  Typically this is largely seasonal in nature — in November, for example, I can tell you without looking that bingo cards for Thanksgiving are going to have a total lock on the most popular crown.  But sometimes I’ll notice other patterns — a church sends my site out in their weekly flier and Bible Bingo gets popular for a week, some school has a unit on Chaucer and suddenly I see a surge in searches about the Canterbury Tales, and the like.

And then sometimes you see things that can’t be explained.  Insects being on the top 10 list for two months.  But you ignore it — maybe some people like bugs.  Cat Breeds for two weeks?  Maybe some people like cats.  But the one that finally clued me in was The Moon.  Because either NASA camp was playing Moon bingo every single day in July or something was up.

The Game Is Afoot

So I checked the database to see what card was scheduled for that day, because the card that gets top billing should typically be at or near the top of the most popular list.  And the database dutifully reported: none.  Which is impossible — had I run out of cards?  I thought I would still have about 100 in reserver.  I checked the database and saw, hmm, closer to 200.  Meaning either I was off on my mental count or about 100 had not been assigned.

One quick Ruby script later, iterating through all the days in summer, and I realized that only 6 days in summer had seen a new card assigned.  The error log showed nothing out of the ordinary for the other days.  As a matter of fact, it showed… nothing for them.  It was like the website wasn’t getting generated at all… or if it was, it was being served 100% out of cache.

Shoot.

Five minutes later I found and fixed the bug.  I was so mad I nearly posted a note about it here, but figured no one would care.  Had I known at the time how bad the impact was, I would have had apoplexy.

Fast Forward Three Months

If you know what happened, this seems like a pretty teeny bug conceptually.  Google sure didn’t see it that way — this bug made a live, growing website look largely dead to the world, and Google accordingly sent most of their searchers to get their bingo cards elsewhere.  (Ever wondered if Google values “fresh” web content?  This is my most convincing experience that says the answer is YES, although there are a couple of reasons why the pages are better with the daily updates happening which go a bit deeper than “they smell fresh that way”, which are out of the scope of this post.)  After the website was restored to vibrant life, Google opened the long-tail floodgates.

And the difference?  I’ve more than doubled most types of traffic, both search related and otherwise.  (Funny, while no user thought this was important enough to email about — and would you email if a favorite website went dead for a while during the summer?  — many sure thought it important enough when it was fixed to take notice.)  There are a few confounding factors in there (increased advertising budget, seasonal fluctuations, etc) but nothing accomodates for anything like the huge run-up you see in the graph.  My sales graph also shows a bit of a bump, to put it mildly.

Word to the wise

1)  Test your website with the same diligence you test your application.  Or, in my case, improved diligence.

2)  Pay attention to your website.  Even the boring bits.  Especially the boring bits that make you money.

3)  Never send a commited Windows programmer to do a Unix sysadmin’s job.

Advertisements

Quick Request Part #2

February 23, 2008

If someone with Java 1.3 or 1.4 installed on their computer could please download the Windows free trial from the Bingo Card Creator site, install it, and tell me whether you get to the main screen or not, I would be obliged. 

I believe the problem was that earlier I had assumed, having set the compiler to use version 1.3 in Eclipse, that I was getting Java 1.3 class files.  That worked for a long, long time.  Then I made an ant script and, boom, it stopped working the first time I ran the clear target and Ant used the (defaults to 1.6) javac target to make the new build.  Of course, having 1.6 on this machine, I never noticed the change.

All I can say is doh!

“Thats Funny, No One Has Bought a CD In Weeks”

March 27, 2007

I’ve had my best month of sales ever, but only 1 CD in that time.  Typically about half of my customers get the CD.  I had a vague feeling that there were fewer CD orders than usual this month but it didn’t raise any flags with me.  Then I got a fairly typical email saying “How do I purchase your software?”  Thanks for your interest, click the big red button which says Purchase Now.  “That doesn’t say it comes with a CD.”  Thanks for your continued interest, you need to click the one next to the text Purchase a CD.  “That doesn’t work.”  Thanks for your continued interest, you need to… oh, wait.  It actually doesn’t work. 

It seems that when I switched the item numbers in e-junkie (to accomodate SwiftCD integration) I forgot to also switch the item numbers in the e-junkie links on my page.  For some reason this actually didn’t cause a problem for at least the first two weeks.  It had to be working for one of my customers to get an order for a CD through on March 3rd.  At some point after that the e-junkie system began saying “Oh, wait, the item number that link references doesn’t exist anymore” and bailing when you tried to use it to add things to the cart.  I assume that most of my customers who saw this error either shrugged and said “OK, I’ll take the download!” and some, more worrisomely, probably left.

This is one of those bugs that just makes me want to die inside as a programmer.  The systems involved have well-understood interfaces but the inner workings are complex and totally opaque to me.  As a result, bugs are hard to predict except by seeing them, and if their visibility is obscured by whatever system interaction happens its likely that I’ll be the last to know.  I guess the only solution to that is regular monitoring and applying enough concentration to know when the process is out of control. 

As long as I’m on the subject of CDs: if you’ll excuse my own HTML coding errors, the integration of SwiftCD and e-junkie has been flawless in every respect.  Its also cut the amount of customer support I had to do literally in half — back when half of my orders were CDs I spent as much time retyping addresses and invoice numbers into cd-fulfillment.com as I did answering customer emails.  Now delivering a CD takes as much marginal work as delivering a registration key: nothing.  Granted, at my level of sales thats probably 5 minutes saved 3-4 times a week, but for some reason minor repetitive nuisances like that grate on me far more than their absolute time required would suggest.

Note to those using Inno Setup…

March 6, 2007

… don’t forget to set the working directory for the shortcuts you create.  I had assumed Windows would automatically default to the program directory.  This is apparently not the case on some systems, and its been causing some extraordinarily quirky behavior for some of my users.  (v1.05 and 1.051 use Java’s facility to locate the working directory rather than using the directory the .exe is in, because that logic was causing problems for the Mac port.  Unfortunately, on at least some systems they’ll default to somewhere else instead.  I hadn’t noticed this problem because Inno Setup actually does set a sensible default working directory when you execute the program directly from the installer.)

*sigh* Time for a 1.052.

Laugh To Keep From Crying…

February 23, 2007

I just found this morning, through ironically the same customer that was having difficulties yesterday (this must be karma), that a key feature of my software has been disabled in the build on my website for the last 2 weeks.  I had been seeing traffic higher than ever, double the number of confirmed downloads I had in January, and was wondering “Why am I seeing 20-30% of my usual sales?  Well, it turns out the reason was that I was performing an involuntary A-B test: A with the feature,  and B without, based on what day you had first downloaded the software.  A wins by a longshot.

Here’s what happened: I ship Bingo Card Creator with about a dozen preprogrammed word lists.  Since I’m skeptical of my customers’ ability to do the Open File -> Navigate To Folder -> Select File routine, I make this very easy through a Wizards menu.  Click on the Wizards Menu, click your subject, click the menu item which describes what grade level/skill you are working on.  I had known this was a crucial feature when I included it in 1.02 because it alleviates the “empty screen problem” (I essentially can’t sell to someone who hasn’t seen a card printed out, and if you have to type in 25 words before you can print a card then you’re much less likely to invest the time).  I hadn’t known it was quite so crucial though.

Here’s what happened.  The Wizard menu is autogenerated, and unlike the vast majority of my code the logic is pretty smart.  It spiders a particular subdirectory in the installation, doing a breadth-first search of the file system tree, and making every directory a submenu and every properly formatted file an entry in the submenu it appears in.  This lets me add new items to the Wizard menu without tweaking anything in the Java code.  If a directory is empty, it doesn’t get shown.  If no files are found at all, the Wizard menu doesn’t show, because nothing ticks people off like non-functional programs, right?  (Grumble grumble.)

Anyhow, this feature has worked since 1.02 and I test to make sure it works every time I make a build, because its just so critical and because without the feature I’d actually have to type in word lists to type printing and that is slow.

I test in Eclipse and it works fine.

I test after exporting my JAR and it works fine.

I test after wrapping the JAR in the EXE and it works fine.

I test after building my installer and installing and it works fine.

My customer downloads the installer and sees no Wizards menu.

Can you spot what edge case my testing missed?  Oh, its fun — I accidentally introduced a single extra character into my Inno Setup script.  Since my setup script is not in my Subversion I never even noticed that I had made the change.  This resulted in the setup program copying the SampleLists (where the wizards reside) folder into the application directory, not into the application/SampleLists directory.  On my machine, since I was installing without uninstalling (clobbers identically named files but DOES NOT DELETE files not present in the new installer), I still had the old C:\Program Files\Bingo Card Creator\SampleLists directory with the proper structure, and things worked peachy.  Then my old, loyal customers who were doing updates installed and everything worked peachy.  Then new people downloaded the installer and, boom, no Wizards for two weeks.

After work today I’ll release a .01 “upgrade” to bugfix Download.com and all the other places that cache installers.

Lessons learned:

  • Setup script goes into the Subversion repository so that at any time it is either known-good or marked as changed since the last known-good.
  • Uninstall before installing to do final testing.  If I had another machine lying around I’d even do a virtual machine or something so that I was guaranteed of a fresh test environment.
  • Keep the lines of communication open with customers.  They can save you from yourself.