The Story of Bug Hunting in Production

BUGS COME AND GO. DEVELOPERS STAY!

Production bugs mean you cannot debug line by line; even worse that cannot reproduce them. The other days, I got one of them. We know when it happened by a screenshot reported by the client.

After a day hunting the bug, I finally found out the root cause. Once found out, the solution is a piece of cake. I thought the process I had been through is interesting and might be useful for others.

Here we go – the process of hunting a production bug.

Step 1 – Treasure in the Log

The first thing is to get the log. Find the entries at that specific time. Most of the system will display a user-friendly error message to end users. The real error, usually in the form of Exception, is buried in the log.

Our system leverages NHibernate for Data Access layer. And the exception tells us that there is a piece of string that does not fit into its column max length. It happens when the system commits a transaction.

The question is which table/column? There are many tables involved in a single transaction. This is a complex system.

In the bug’s context, there is very few information involved. I looked at the code that is responsible for the function. Nothing’s special.

Step 2 – Look for Variants

From the customer, I know that some users had the problem. Some have not. Therefore, logged in user plays an important role in the bug.

When looking at any problem, there are always a number of participants, there are input and output. I took a pencil and paper. I jotted down all the participants (actually the C# object/class, such as User, Workflow, …). Just jot them down. Then I drew relationships between them.

Thinking about both cases success and fail, what are the variant factors? Luckily in my case, I can quickly figure out that was the User. And in our system, we care most about 2 pieces of information: Username and Full Name.

This step is important because it will give us a good picture of what might cause the problem.

Step 3 – Look at Data

Luckily that we can access to the production database. Ran some queries. The data emerged an interesting observation:

If the username is 41 characters, it causes the issue. If the username is shorter (at that time I did not notice how many characters the shorter means). I knew that our system keeps the created/edited information. But those columns have more than 50 characters max. It should not be an issue anyway.

Let’s find all the columns having the limit of 41 characters

Hurray, the bad boy could not hide anymore. There I had a very small set of columns which might cause the issue. It did not take much time to check each of them. All I had to do is to check in the code which column accepts Username value.

The fix? Well easy. Just need to extend the column max length.

 

Dealing with production bugs is hard. There is very limited information. And for most developers, if they cannot reproduce, debug the bug, they give up. The first and most important thing is to have a right mindset:

There must be something wrong somewhere in the system.

Fix a production bug? Not a problem!

  1. Have the right mindset. Do not assume.
  2. Gather whatever we can. Use whatever we got. Do not complain.
  3. Take a deep breath.
  4. Apply basic logical thinking. The process I share here is one of them.
  5. Hey, need a bit of luck.

Either you like it or not, bugs happen in production. I suggest you embrace them, have fun with them.

BUGS COME AND GO. DEVELOPERS STAY!

Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: