2011-10-16 - The Art of Debugging



This is an experiment. I will write this blog entry in a completely non-technical style, just for fun. Hope you enjoy it, if not: stop reading.

The little sounds the heater always makes when it automatically turns off, because the desired temperature has been reached, wake me up. Moving pictures play on the wall, painted by the blinds as a painter, painting with the bright rays of the early Sunday morning sun as paint.

My eyes slowly adjust to the beginnings of my day as the thoughts wander in from the far land of dreams.

If this were a movie you would now hear a soft and soothing music, that soft and soothing music the trained movie watcher always takes for a sign that there comes a build-up to a shock moment, a moment when all of a sudden somebody or something evil enters the scene in the blink of an eye and the music turns loud in an instant so that untrained movie watchers will grip their armrests, or their partner, whichever is more stable.

And just like in the movies, but without music or watchers, it comes back to me. The past week, which started out nicely, only to culminate in a disastrous Friday. One of those dreaded Fridays when everything fails, when you cannot stop working because you do not want to leave too much on your desk, which would greet you on Monday morning like your worst enemy would, with a nasty smirk on the face that needs no verbal accompaniment. One of those dreaded Fridays when I believe I can just put in a few more hours and at least one problem will be solved, only to give up after a long, fruitless stretch, long past midnight, going home, defeated.

So there I am, staring at my wall, and all the joy of watching the little lights dance is gone. It is gone since I know that I will need to debug this program, to find the root of the problem, to squash it once and for all, or I will not enjoy my Sunday.

Flashback. I am nine years old, my brother is so happy he finally found something we can share. With extraordinary patience, he explains to me how to write a program. It's a list of instructions, he says, and the computer will follow them to the letter. Like directions, when you explain how to get somewhere. He gets excited and wants me to share the excitement. Can you explain to me how to get to your school, he asks me. That's easy, I say, just go to the big church, and you're right there. No, no, he says, I am not local, in fact, I am someone who only understands the words left, right and forward. Halfway through my explanations, which he comments on with praise and corrections, I lose my patience.

Now I switch on the computer. It greets me with the same error message it showed when I left it. The error message that does not make sense at all.

It is a good thing that I was forced to leave the computer off yesterday, since I would have been distressed again. Long ago I gave up cursing at the machine. After all, it does exactly - not faithfully, it cannot have faith or any other feeling, which is good, otherwise I would believe that it tries to make me angry - what it has been told. Therefore I am actually cursing at myself when I curse at the computer. So I stopped doing it.

In those fruitless hours Friday night, I made many mistakes, and one mistake was to stay so late. Another was that I inadvertently switched on a compatibility mode, supporting an older version of the programming language. And all of a sudden the error message was gone. But I was too tired to realize that it was that mode which made the error go away. Well, in the end I did realize it, so that time was spent not so fruitlessly after all.

Flashback. For half a year, I have been programming on and off now. It is interesting, but also sometimes boring, because things go too slow for my liking. Some of it is me learning slower than I would like, but most of the time it is the program executing so slowly. Of course, my brother says, that's because you don't use machine language. Machine language, I ask, what's that.

The first thing I do is to verify that the mode change was responsible for the lack of the error message. Yes, if I switch it back on and off, the message is toggled off and on, too. A large part of debugging is made up of pure luck: stumbling across something odd that turns out to be the reason for the bug.

Okay, first step done, but now what? And besides, that test whether the error appears or not takes way too long. Since I can build and start the program from the command line, and since the command line allows me to write a simple program of the form "do A and if that works, also do B", I do that. Build it. If that worked, start it. Once it has started without an error, use the part of the program that I think triggers the error message. Yes, that does the job: I still get the error message, and the test whether the problem occurs is much faster.

Flashback. My family including me is traveling. We are at some friends' house and I just found out that their son is a professional programmer. I ask him whether he knows machine language. He laughs and says, assembler, yes I know that. I write assembler code for a living. Oh, I almost shout out, can you teach me, my brother wants me to become more proficient in BASIC first, but I'd like to know this assembler language. The sooner the better. My father is a little embarrassed at my pestering and apologizes. The guy tries to be nice and says, well, maybe tomorrow, I cannot teach you over the dinner table anyway.

Next step. I need to reduce the problematic code to investigate. Since the error goes away when that compatibility mode flag is set, it must be something with the code generation. That is my hunch, and a large part of debugging is following hunches. It seems a bit counterintuitive to ignore logical reasoning when coping with something as rigidly logical as a computer program. But programs are written by humans.

One theory about the history of the word "debugging" goes like this: Once upon a time, when electronics were made up of tubes, not transistors or chips, when computers still occupied whole rooms filled with electronics, heat and noise, there was a problem with a program. The programs back then were wiring schemes, the distinction between hardware and software not yet invented. And even after double, triple and quadruple checking, the wiring was sound, but the simple test still produced the wrong result. Then an engineer followed his hunch and went back into the computer. After a while he came back saying, I found the bug, presenting the remnants of a big insect which sacrificed its life in the mission to short-circuit the computer.

Usually I would use a program called "debugger" that allows me to execute my program one instruction after another, and inspect the intermediate results. But it does not allow me to step back, and something tells me that I cannot tackle this particular problem that way. Something tells me that things go wrong long before the error message is displayed, and that the message is just a consequence. Besides, the code responsible for the error message is not mine. I just use it, and I do not yet understand in depth what it does. And I need to understand the code better before I have a chance to understand the problem. A large part of debugging is to learn enough to be able to understand what is happening.

But first I need to reduce the amount of code that could be responsible for the bug. Reducing a problem means stripping it down to essentials. A large part of debugging is just reducing the problem to fewer possible causes. For my particular problem, I think that two files are involved. Both contain source code, interacting with one another through the code that is not mine. I copy both files into a new location and remove the references to my other source code files which I did not copy, one by one, until the result builds. Anxiously, I test whether the error message, so far my only indicator of a problem, is still displayed. If it is not, I also got rid of the code containing the bug. Yes, the error message is still displayed. Good, otherwise that hour spent extracting the two files and making them build would have been in vain. Well, almost in vain. It would have been elimination in the finest imitation of Sherlock Holmes: once you have eliminated every other option, the one which remains, however improbable, must be the correct one.

Flashback. He probably thought I would forget. But I really want to learn machine language. His defenses crumble after lunch and he says, but it will be on paper, with a pencil. Yes, I say and smile. And when it becomes too complicated, we stop, he says, still wanting to discourage me and wanting to spend his time with something more exciting than introducing a not-even-ten-year-old pest to assembler. It won't, I say. That following evening, I know that everything in the computer is bits and bytes and that it is even stupider than I thought: the instructions it really follows work on such a low level, not even being able to multiply two numbers greater than 15, that I am really grateful to my brother for teaching me a higher-level programming language first.

Now I start to delete more and more lines of code from the two files. Instead of doing that carefully, one by one, I throw out half of them in one go, since I now have this very quick test whether I got rid of the offending code, too. Whenever the error message goes away, I click "undo" in my text editor and delete the other half. More and more lines go, and the shorter the code gets, the faster my test runs. It starts to get fun.

Flashback. I am fourteen years old now and have written a lot of programs already, and more importantly, learnt from reading other programmers' code. One such lesson was when I read code to draw circles and was puzzled about the formula it used. Thus I found out about Pythagoras. But more interesting was that concept of functions. To wrap code doing a specific thing in a "function", with input parameters that I can change whenever calling the function, without needing to change that function's code. For example, I could draw a circle in red, and then in green, calling the same function by simply providing a different input parameter for the color.

I am down to fewer than 20 lines for each of the two files, and whatever I tried in the past ten minutes, I cannot reduce the files any further without "fixing" the error. Now it is time to look at which function displays the error, and how we arrived there, since the real error likely occurred before that point. The programming language I use here - Java - makes that relatively easy. For each function call, it records which line in the source code corresponds to that call. And since functions frequently call functions which call yet other functions, which in turn call yet other functions, and so on, we often end up with lists of nested calls. I insert the code to display that list where I suspect the error to happen and run my test again. So now I know which functions are involved and inspect the corresponding source code to get a bigger picture.

Flashback. At the same time I learn about functions, my brother introduces me to the programming language C. It has the same high-level conveniences I am used to from my initiation to programming all those years back, but the best part is: there is a program called "compiler" which translates the source code into machine language. That is different from my first programming language, which reads the instructions one after another and tries to interpret them. The output of the compiler is raw machine language that runs at full speed.

Now I know that the error happens in a place where some sanity check is performed on so-called "bytecode", a sanity check that only makes sense with newer versions of the programming language. And all of a sudden it also makes sense that the error message went away when I built the code in compatibility mode. Unfortunately, that means that the mistake could be in my code, since the sanity check is just not performed in the compatibility mode. And if the sanity check flags an issue with my code correctly, that issue is also present when not being checked. Time to read some documentation about that new sanity check.

Flashback. My brother asks me to have a look at some weird issue. The sample source code he tested with his new C compiler (a spanking new 1.0 version, students' edition) produces a wrong result, he believes. It is simply converting Fahrenheit to Celsius, but it looks like it multiplies by the factor 9/5 instead of 5/9 after subtracting 32. I have a look and cannot find a mistake in the source code, so I suggest having a look at the assembler code. We fire up a disassembler, a program that takes the raw bytes of a machine language program and displays them in a human readable format. For some definition of "human readable": it replaces the command bytes with three-letter "mnemonics", a kind of common nickname for each operation, used by all assembler programmers. The result is hard to read, but it shows clearly that the compiled code disagrees with the source code. We found a bug.

While Java does not have a machine language, it has a so-called "bytecode" which is similar. It is just another layer between the source code and the machine language because Java is portable to many machine languages. For example, my computers speak the Intel machine language while the phone I am typing this story on speaks the ARM language. Having studied the documentation of the bytecode - especially the part about the new version - for more than half an hour, I feel the urge to have a look at the code which upsets the sanity check so much that it displays an error message.

There I stare into the computer, looking at disassembled bytecode, comparing with notes I wrote into a text editor window, going back and forth and slowly starting to understand how the sanity check works. I try to apply the check manually to parts of the bytecode and compare to the intermediate results of the code, which I told the program to display for my pleasure, for this debugging session's sake. More and more I get convinced that my code is correct and some odd issue is going on with the sanity check. For a few hours I have been debugging now, but now I am on the bug's trail, I can sense its scent, I start to run after it now that I feel I am close to squashing the issue once and for all. And then, in the blink of an eye, I know it. I know what the issue is. I have almost solved the problem. A large part of debugging, the largest indeed, is the search for the cause of the problem. In most cases, the solution is straightforward once the reason is identified.

Turns out that the check does not expect input like the one I provide. That check wants to verify that the type of a certain function's input parameter is correct. And it just does not include a check for an "unset value", which would normally be an error of its own, but in my particular test case is not. And all of a sudden that error message actually makes sense: it is just a delayed check, testing lazily for a valid value of the input parameter. Except that in my particular scenario, the code is not reached before the formerly unset value is set to something valid.

A little fiddling here and there and I end up adding that check for an unset value in the correct place. The modified code now says that an unset value is not of an incorrect type. My code runs, it was correct after all.

Three lines of code added, problem solved.

The heater makes that noise revealing that it just turned itself off because it reached the desired temperature. The sun paints itself on the wall, through rustling leaves and almost closed blinds. It is half past two in the afternoon and my Sunday just begins. It is a beautiful late autumn day without a trace of clouds in the sky. Time to leave bed and go outside and enjoy the colors.

Disclaimer: in the interest of the narrative, a few facts have been bent or shortened. Similarities with persons, living or dead, would not be purely coincidental.