The Bug That Got Away
One thing I've always loved hearing about from fellow engineers, or reading about on technical blogs, is bugs. Nasty ones. The ones that keep you up at night and wake you from a dead sleep. These are the ones that great stories are built upon, because like any great story, they have all of the pieces:
- Exposition - Ah crap! There's a bug in here somewhere.
- Rising Action - Let's dig into this and see how widespread it is and how we'll mitigate it.
- Climax - The "Eureka!" moment when you've narrowed down the exact cause of the bug.
- Falling Action - Implementing a fix, verifying it fixes the issue.
- Resolution - Merging the fix into source control, knowing the bug will be gone (forever)!
There's a deep satisfaction to be found in a good bug: the exploration, the thrill of the chase, and finally catching it red-handed and putting an end to it with extreme prejudice.
Unfortunately, not all tales have happy endings; sometimes the bug gets away.
The Exposition
This particular tale begins as most bug stories do: with a legacy software system. There isn't really anything special here, just an older, cobbled-together front-end, an enterprise-grade database, and so on. If you've seen one, you've seen them all.
At any rate, just prior to an upcoming major release, I get a ping from a colleague to look at something. One of the records in the database is corrupted with some really bizarre encoding patterns. There doesn't appear to be any rhyme or reason behind them; it's just screwy and inconsistent with just about every other area of this application:
Record A: Look everything is nice &amp;amp; shiny!
Record B: Look everything is nice &amp; shiny!
Record C: Look everything is nice & shiny!
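To make the inconsistency concrete, here's a minimal sketch of how a value that gets HTML-encoded more than once on its way into storage can end up looking like the records above. It's purely illustrative (and in Python rather than anything from the actual application); the idea that double-encoding was the culprit is an assumption on my part, not something the investigation ever confirmed:

```python
# Minimal sketch: how repeated HTML-encoding mangles an ampersand.
# This illustrates the symptom; it is not the application's code.
from html import escape

original = "Look everything is nice & shiny!"

encoded_once = escape(original)       # "Look everything is nice &amp; shiny!"
encoded_twice = escape(encoded_once)  # "Look everything is nice &amp;amp; shiny!"

print(original)       # what Record C looked like
print(encoded_once)   # what Record B looked like
print(encoded_twice)  # what Record A looked like
```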
So, upon seeing this, I said what any good developer would: "Oh, this should be a pretty simple fix."
The Rising Action
Software engineering is full of bugs.
There are countless systems, big and small, that are just riddled with the things. As an engineer I know this very well, as I've contributed to my fair share of them. I've been a software engineer for ten years or so, and I've always considered myself to be thorough, especially when it comes to tracking down a bug: the research, the deep diving, and finally, the fix.
As with any bug, one of the first steps toward fixing it is being able to reproduce it. I spoke with our QA team; they weren't immediately able to reproduce it, but mentioned they would look into it further. Hours pass and I receive another message, something to the effect of:
QA Person: Rion, I just spun up a fresh new environment and I can reproduce the issue!
At this point, I'm excited. I had been fighting with this for over a day, and I'm about to dive down the bug-fixing rabbit hole to take care of this guy. I log into the new environment, and sure enough, QA was right! I can reproduce it! I should have this thing knocked out in a matter of minutes, and my day is saved!
Or so I thought. Roughly two hours, almost to the minute, after being able to reproduce the issue, it stops occurring. I was literally in the middle of demonstrating the issue to a colleague, and minutes later it had completely vanished. How could this be? Nothing in the environment changed: no machine or web server restarts, no configuration changes, nothing. The bug, after just a matter of hours, seems to have resolved itself.
Skipping to the Last Page
Normally, as part of the rising action in a story, things build and build until they reach a peak. By this point in my story, I should have figured out the root cause. The bug was apparently reproducible for a short while, but not long enough to determine the exact cause (there are lots of moving parts in this machine). So, I start adventuring, trying to find a path that much higher up debugging mountain. I was pulling everything out of my bag of tricks, including:
- Examining IIS Logs - I checked through the IIS logs in the production environments where the issue had occurred, in the QA environment where it was briefly reproducible, and in my local environment.
- Examining Event Viewer Logs - Maybe there was some type of exception causing the web server to restart, and that magically fixed the issue. Surely there would be something there.
- Profiling Environments - While the issue was reproducible, I used SQL Server Profiler to capture logs of the exact calls being executed against the database (a rough sketch of the kind of comparison this enabled follows this list).
- Decompiling Production Code - As a Hail Mary, I decompiled code from the production environment to ensure that no unexpected code changes had slipped in and that no calls outside of expectations were being made.
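To give a sense of what having "the exact calls" bought me: with traces captured from both a corrupted run and a clean run, you can diff them statement by statement and confirm they really are identical. The sketch below is hypothetical (Python, assuming the traces were exported to CSV with a TextData column); it shows the kind of comparison, not the actual tooling I used:

```python
# Hypothetical sketch: compare two SQL Server Profiler trace exports (CSV)
# statement by statement. File names and export format are assumptions.
import csv

def load_statements(path):
    """Return the ordered TextData values (the SQL text) from a trace export."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        return [row["TextData"].strip() for row in csv.DictReader(f) if row.get("TextData")]

corrupted = load_statements("trace_corrupted_run.csv")
clean = load_statements("trace_clean_run.csv")

for i, (a, b) in enumerate(zip(corrupted, clean), start=1):
    if a != b:
        print(f"Statement {i} differs:\n  corrupted: {a}\n  clean:     {b}")
        break
else:
    print("Both runs executed identical statements; the divergence is elsewhere.")
```

In my case, the calls matched, which is exactly what made the next part so maddening.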
Nothing helped. Every single new avenue I ventured down only furthered my confusion and left me wondering what the heck could be causing the issue. After putting all of the pieces together, you could basically describe the issue as follows:
How could two sets of calls, traveling through the same endpoints, passing along the same data, and executing the same queries against the exact same stored procedures, result in different data (one corrupted and the other not)?
For the first time in years, I felt defeated by a bug. I started grasping at straws: race conditions, outside forces that might be affecting the code, network throttling issues. Nothing.
The Bug Won
Many days and nights had passed. This bug was waking me up at night; I was dreaming about potential causes, only to run to my computer, try them out, and eventually realize they didn't work. Like every good engineer, I had a workaround in mind for this issue just minutes after encountering it, but I was determined not to end up there.
I had seen the issue locally, if only for a fleeting moment, in several QA environments (again, fleetingly), and within several production environments. I had tried everything I could think of, consulting countless peers to brainstorm the cause, but all that resulted in was spreading the bewilderment throughout the team.
This seemingly trivial bug had eluded every form of capture and resolution that I could think of. It left nothing in its wake but confusion, not only for myself, but for seemingly everyone I tried demonstrating the issue to. Eventually, much like a doctor, I had to call it.
After over a week of my life, days and nights, spent pursuing this bug: it won. There wouldn't be a climax, there wouldn't be a happy ending, there wouldn't be a nice, warm, fuzzy feeling of accomplishment; there'd be a few lines of hacky code to fix it.
I felt just like our friend Charlie Brown; this bug had ripped the football away just before I ever got a chance to kick it.
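For the curious: I'm not going to pretend the sketch below is those few lines of hacky code. But assuming the corruption really was stray HTML-entity encoding like the records shown earlier (an assumption, not a finding), a quick workaround could look as mundane as decoding the value until it stops changing:

```python
# Hypothetical workaround sketch, NOT the actual fix from this story:
# keep HTML-unescaping the stored value until it stabilizes, so both
# single- and double-encoded records normalize to the same clean text.
from html import unescape

def normalize(value: str, max_passes: int = 5) -> str:
    """Decode repeated HTML-entity encoding until the value stops changing."""
    for _ in range(max_passes):
        decoded = unescape(value)
        if decoded == value:
            break
        value = decoded
    return value

print(normalize("Look everything is nice &amp;amp; shiny!"))
# -> Look everything is nice & shiny!
```

A workaround like this treats the symptom rather than the cause, which is precisely why it stings to ship it.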
It Happens
The reason I wrote this, or that it's worth writing about at all, really has nothing to do with the bug itself. It has to do with me, and maybe even you. I've always considered myself great at solving problems, and thorough. I'll dig deep, keep digging, keep exploring, and won't stop until I can crack the problem. Until, in this case, I couldn't.
Being an engineer is typically about solving problems, but more importantly, it's about being practical. I could have easily spent several more days (and nights) trying to figure out exactly why this was happening, but honestly, the fix took no longer than five minutes to implement. This was about being able to admit defeat. Much like there's nothing wrong with admitting "I don't know", there's nothing wrong with knowing when to swallow your pride and move on.
If you ask me today, I still don't know what caused this issue. I'll probably never know, and that's alright. I'll let this one get away and tell its friends about me. I know I'll certainly make sure to tell mine about it.