On more than one occasion I've had to deal with "where does the data live" in an n-tiered web application. Text Files? Databases? Relational Databases? Database-like things? Our collective imagination?
The non-relational database route usually starts out as something quite benign - possibly by design, but oftentimes by circumstance or even accident.
- Design: Everything needs to go super fast and databases can't keep up with what's happening at the front end. Or we don't want to spend that much on hardware and/or licensing costs to make that happen. Mr. Ellison wants his new yacht and you don't want to fund it.
- Circumstance: We bought a really big commercial database as the master repository, but don't want to spring for the money to forward deploy runtime copies of this database. And mixing something with a price tag of $0.00 into the equation such as PostgreSQL or MySQL just doesn't make sense at this point (lack of expertise, additional distraction/resource requirements, etc.)
- Accident: Oops. We didn't think we'd last six months, let alone six years. But lucky us, we have this Perl script that punches data out of a database, scp's it to some server where it gets munged by a big Tcl script that... and there you go.
Before I forget, there is one more variation on the theme:
- Reinventing the Square Wheel: I don't need a stinkin' database and I alone can improve on the pathetic shortcomings of twenty-plus-year-old, mature, version 11 commercial applications. I'm a genius, even on my most modest of days.
Now, the primordial step - 100 records in a text file that look something like this:
123\tThe Big Lebowski
456\tOffice Space
789\tFight Club
...
999\tDriving Miss Daisy
Your mod_perl module opens this text file every time a web page is rendered. Heck, we're not even using CGI so this isn't too bad. But we need to do it this way since someone is hand editing the file and who knows when it will change.
Some months later we graduate to 500 or 1000 records. Now no one wants to manage this by hand anymore, even if it were humanly feasible. So someone stands up a database and back ports this data into it. Of course we now have working, useful code up front that needs a text file and we don't have time to change. Some magic incantations ensue and next thing you know a table is being exported in the special tab delimited format and being scp'd to a half dozen web servers where your mod_perl module is loading it on every page view.
Then one day something bad happens. The database export breaks and a zero-length file is pushed to all your web servers. Crap. The website is down. Our little scp script now checks for that and, when no one is looking, egg is quietly wiped from someone's face.
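For the curious, the kind of check that gets bolted onto the push script might look roughly like this. It is a sketch in Python rather than whatever the original script was written in, and the function name `safe_to_push` and the `min_records` threshold are made up for illustration. It goes one step past "is the file empty" and also counts lines that actually look like `id<TAB>title`.

```python
import os

def safe_to_push(path, min_records=1):
    """Return True only if the export exists and holds enough well-formed lines.

    Hypothetical helper: 'well-formed' here means 'numeric id, a tab, a title',
    matching the sample records shown earlier.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    good = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            movie_id, sep, title = line.rstrip("\n").partition("\t")
            if sep and movie_id.isdigit() and title:
                good += 1
    return good >= min_records
```

The push script would call this before the scp step and bail out (loudly) when it returns False.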
Then one day something good happens. Your website just hit the jackpot and the teeming masses can't get enough. Now opening this text file on every page view doesn't seem like such a good idea. In fact, this was all built so long ago it takes a frantic all-nighter to dig around and try to figure out what is causing the website to tip over.
The good news is that at 6:00 AM the next morning you rediscover dbm and tie. Some quick hacks to your export script and mod_perl code. We're back in business.
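The 6:00 AM hack, translated into a Python sketch for illustration (the real thing would have been Perl's `tie` over a dbm file). Python's standard `dbm` module plays the same role; the function names `build_dbm` and `lookup_title` are invented here. The point is that a page view now fetches one key instead of re-reading the whole text file.

```python
import dbm

def build_dbm(text_path, dbm_path):
    """Convert the tab-delimited export into a dbm file, one key per movie id."""
    with open(text_path, encoding="utf-8") as src, dbm.open(dbm_path, "n") as db:
        for line in src:
            movie_id, sep, title = line.rstrip("\n").partition("\t")
            if sep:  # skip lines without a tab
                db[movie_id.encode("utf-8")] = title.encode("utf-8")

def lookup_title(dbm_path, movie_id):
    """Fetch a single title by key; returns None when the id is unknown."""
    with dbm.open(dbm_path, "r") as db:
        key = str(movie_id).encode("utf-8")
        return db[key].decode("utf-8") if key in db else None
```

`build_dbm` belongs in the export script, `lookup_title` in the code that renders the page.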
Disaster averted, but only momentarily. Not only does this data file normally update on a daily basis, but sometimes it needs to be updated _now_ (perhaps someone goofed something badly or the CEO is having a bad hair day). Some quick mod_perl code to check for a new data file arriving - the usual suspect with a ".new" file suffix. Oh yeah, now we're cooking with fire.
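A rough Python sketch of the ".new" trick, assuming the export lands next to the live file as `<path>.new`. The class name `MovieTable` and its methods are hypothetical. The rename uses `os.replace`, which is atomic when both names are on the same filesystem, so a reader never sees a half-written live file.

```python
import os

class MovieTable:
    """In-memory copy of the export that reloads when a '.new' file shows up."""

    def __init__(self, path):
        self.path = path
        self.titles = {}
        self.reload()

    def reload(self):
        titles = {}
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                movie_id, sep, title = line.rstrip("\n").partition("\t")
                if sep:
                    titles[movie_id] = title
        self.titles = titles  # swap only after a complete, successful read

    def refresh_if_new(self):
        """Swap in '<path>.new' if present. Returns True when a reload happened."""
        incoming = self.path + ".new"
        if not os.path.exists(incoming) or os.path.getsize(incoming) == 0:
            return False
        os.replace(incoming, self.path)  # atomic rename on the same filesystem
        self.reload()
        return True
```

With several server processes sharing one directory, only one of them wins the rename, so the others would also need to watch the live file's modification time. That wrinkle is left out here.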
Just when things are looking stable, some intrepid person in marketing convinces senior management that if we were to have the movie rating right up there with its title, we'd increase revenue by 10x. No problem, we'll tack that onto the end of the dbm file records:
key=123 value=The Big Lebowski,4 stars
key=456 value=Office Space,5 stars
key=789 value=Fight Club,3.5 stars
...
key=999 value=Driving Miss Daisy,1.5 stars
Not too shabby - tweak the export, parse the values on commas and done. Then a minimum wage data entry clerk enters the title "10,000 B.C." into the system. Ugh.
Ever resilient, never giving up, there is a way out:
key=123 value=The Big Lebowski^A4 stars
key=456 value=Office Space^A5 stars
key=789 value=Fight Club^A3.5 stars
...
key=999 value=Driving Miss Daisy^A1.5 stars
key=1000 value=10,000 B.C.^A0.5 stars
How clever, use a delimiter character that never appears in any legitimate text. Take that.
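To see the comma problem and the ^A patch side by side, here is a small Python sketch. `^A` is Ctrl-A, the byte `\x01`; the `pack` and `unpack` helpers are invented for illustration. The guard in `pack` is the bit that usually gets skipped, and without it "never appears in any legitimate text" is a hope, not a guarantee.

```python
SOH = "\x01"  # Ctrl-A, written ^A in the records above

def pack(title, rating):
    """Join two fields with ^A, refusing input that contains the delimiter."""
    if SOH in title or SOH in rating:
        raise ValueError("field contains the delimiter")
    return title + SOH + rating

def unpack(value):
    """Split a packed record back into (title, rating)."""
    title, rating = value.split(SOH)
    return title, rating

# The comma version falls over on exactly the title from the story:
print("10,000 B.C.,0.5 stars".split(","))  # ['10', '000 B.C.', '0.5 stars']
```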
The whole 10x revenue thing didn't really pan out, but never mind, since if these 17 other metadata fields are added, 20x revenue is ripe for the taking. Oh and they're nested attributes, with all different cardinalities. Some movies have two supervising key grips you know.
It is at this point, Grasshopper, that things get really interesting. A couple of choice solutions out of this jam:
- Nest the additional attributes in the dbm file values, but even more cleverly use something like ^B as the delimiter.
- Take a small step back, slightly regroup. Ahh yes, XML can fit the bill here. Stuff XML blobs into the dbm file values. Whip out a trusty XML parser - life's never been easier.
- Someone suggests Storable.pm and is promptly ignored.
- Wait a second. What is with all the hassle of building this dbm file? Why not just export a single big XML file and be done with it? (I sense an advanced technique coming on here).
Lots of grumbling ensues, but finally the export script pushes out a big XML file. And your mod_perl code now reads the whole thing into a nice hash and refreshes it whenever a new file shows up.
The XML starts to get pretty big now, and at over 100MB folks start to gzip it to save on disk space (you keep a few older snapshots of this file around for safety purposes - brownie points there) and to more quickly move it across the network. No one suspects that when gzip is added, the zero-length file checking code is accidentally undone and a month or two later someone has a really bad day trying to explain why the web site is down again.
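One plausible way that check gets undone (my guess, not something recorded at the time): gzipping an empty export still produces a small non-empty file, because the gzip header and trailer take up space. A size test on the `.gz` file then passes happily. A Python sketch of a check that looks at the decompressed content instead, with the helper name `gz_has_payload` made up for illustration:

```python
import gzip
import zlib

def gz_has_payload(path):
    """True if the gzip file decompresses to at least one byte."""
    try:
        with gzip.open(path, "rb") as fh:
            return len(fh.read(1)) == 1
    except (OSError, EOFError, zlib.error):
        return False  # missing, truncated, corrupt, or not gzip at all
```

Reading a single byte is enough to tell "empty export" from "real export" without decompressing 100MB just to be sure.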
And then one day it really hits you. The default parser is a DOM parser and the XML file is finally big enough that it can't be loaded without clobbering the web server or simply failing to load. Unfortunately this fact is discovered via pager, not preemptively.
Frantically someone whips up a SAX implementation and saves the day.
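For anyone who has not had the pleasure, a SAX-style loader looks something like the Python sketch below. The XML layout (`<movies>` containing `<movie id="...">` with a `<title>` child) is an assumption made for illustration, as is the `load_titles` name. Unlike a DOM parser, it never builds the whole tree; it reacts to elements as they stream past and keeps only what it needs.

```python
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collect {movie id: title} while the document streams through."""

    def __init__(self):
        super().__init__()
        self.titles = {}
        self._id = None    # id of the <movie> currently open, if any
        self._buf = None   # text chunks of the <title> currently open, if any

    def startElement(self, name, attrs):
        if name == "movie":
            self._id = attrs.get("id")
        elif name == "title" and self._id is not None:
            self._buf = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles[self._id] = "".join(self._buf)
            self._buf = None
        elif name == "movie":
            self._id = None

def load_titles(stream):
    """Parse a file name or file-like object and return the title hash."""
    handler = TitleHandler()
    xml.sax.parse(stream, handler)
    return handler.titles
```

The resulting hash still lives in memory, of course. SAX only removes the cost of the tree, which buys time rather than solving the underlying problem.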
Finally someone suggests this whole thing needs to be re-architected. Thank goodness. The big XML is obviously the problem, so let's split it into thousands of smaller XML files - to be precise, one for each movie. How brilliant! There will never be a big XML file problem again! Better yet, a big NetApp is ordered and all the XML gets sent to this one magic centralized location. Take that, no more scp. Pure genius.
Over time, these "smaller" XML files become bigger and bigger and performance once again becomes a significant problem. Not really a problem though - simply chop these files up into even smaller pieces - wash, rinse, repeat. XLink and XPointer confuse the hell out of everyone, so referential integrity between all these tiny pieces of XML is assumed implicitly. After all, exceptions are, well, just that - exceptions!
It's at this point that Susan Powter parachutes into a conference room and screams "Stop the insanity!".
There is a point to all of this.
It's not that exporting and pushing forward text or other types of data files is pure evil. Specialized applications and cross-company/organizational unit boundaries often require text-based feeds. But these are the exceptions, not the rule. By default (and when in doubt), build from the ground up using tools that enable more efficient management of your data. It turns out 999/1000 times a database is this more efficient place. Pay attention to and get familiar with the free database solutions out there. They're actually quite good.
Some folks like to play the performance trump card and quickly end up down the "reinvent the square wheel" path. First show me all of your caching strategy front to back with things like squid, memcached and Hibernate - then let's talk. At this point if it's still not fast enough and you still want to build some custom data file export compressed shared memory gizmo, I may listen. Perhaps for entertainment value - but that you'll never know.
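To make that concrete, here is the movie saga as two tables. This is a minimal sketch using Python's built-in `sqlite3` purely because it needs no setup; PostgreSQL or MySQL would do the same job. The schema, column names and crew names are all invented for illustration. Commas in titles, multiple key grips per movie, and references between records stop being your problem.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical movie/crew schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript("""
        CREATE TABLE movie (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            stars REAL
        );
        CREATE TABLE crew (
            movie_id INTEGER NOT NULL REFERENCES movie(id),
            role     TEXT NOT NULL,
            name     TEXT NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO movie (id, title, stars) VALUES (?, ?, ?)",
        [(123, "The Big Lebowski", 4.0), (1000, "10,000 B.C.", 0.5)],
    )
    conn.executemany(
        "INSERT INTO crew (movie_id, role, name) VALUES (?, ?, ?)",
        [(1000, "key grip", "Hypothetical Grip A"),
         (1000, "key grip", "Hypothetical Grip B")],
    )
    conn.commit()
    return conn

def key_grips(conn, movie_id):
    """Return the names of all key grips for one movie, however many there are."""
    rows = conn.execute(
        "SELECT name FROM crew WHERE movie_id = ? AND role = ? ORDER BY name",
        (movie_id, "key grip"),
    )
    return [name for (name,) in rows]
```

No delimiters to pick, no files to split, and a crew row pointing at a nonexistent movie is rejected by the database instead of being "assumed implicitly".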
Pardon me while I step out and don my asbestos long johns.