The Story of Bug Hunting in Production

BUGS COME AND GO. DEVELOPERS STAY!

Production bugs mean you cannot debug line by line; even worse that cannot reproduce them. The other days, I got one of them. We know when it happened by a screenshot reported by the client.

After a day hunting the bug, I finally found out the root cause. Once found out, the solution is a piece of cake. I thought the process I had been through is interesting and might be useful for others.

Here we go – the process of hunting a production bug.

Step 1 – Treasure in the Log

The first thing is to get the log. Find the entries at that specific time. Most of the system will display a user-friendly error message to end users. The real error, usually in the form of Exception, is buried in the log.

System.Transactions.TransactionAbortedException: The transaction has aborted. ---> NHibernate.Exceptions.GenericADOException: could not execute batch command.[SQL: SQL not available] ---> System.Data.SqlClient.SqlException: String or binary data would be truncated.
The statement has been terminated.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource`1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
   at System.Data.SqlClient.SqlCommandSet.ExecuteNonQuery()
   at NHibernate.AdoNet.SqlClientBatchingBatcher.DoExecuteBatch(IDbCommand ps)
   --- End of inner exception stack trace ---
   at NHibernate.AdoNet.SqlClientBatchingBatcher.DoExecuteBatch(IDbCommand ps)
   at NHibernate.AdoNet.AbstractBatcher.ExecuteBatchWithTiming(IDbCommand ps)
   at NHibernate.AdoNet.AbstractBatcher.ExecuteBatch()
   at NHibernate.AdoNet.AbstractBatcher.PrepareBatchCommand(CommandType type, SqlString sql, SqlType[] parameterTypes)
   at NHibernate.Persister.Entity.AbstractEntityPersister.Update(Object id, Object[] fields, Object[] oldFields, Object rowId, Boolean[] includeProperty, Int32 j, Object oldVersion, Object obj, SqlCommandInfo sql, ISessionImplementor session)
   at NHibernate.Persister.Entity.AbstractEntityPersister.UpdateOrInsert(Object id, Object[] fields, Object[] oldFields, Object rowId, Boolean[] includeProperty, Int32 j, Object oldVersion, Object obj, SqlCommandInfo sql, ISessionImplementor session)
   at NHibernate.Persister.Entity.AbstractEntityPersister.Update(Object id, Object[] fields, Int32[] dirtyFields, Boolean hasDirtyCollection, Object[] oldFields, Object oldVersion, Object obj, Object rowId, ISessionImplementor session)
   at NHibernate.Action.EntityUpdateAction.Execute()
   at NHibernate.Engine.ActionQueue.Execute(IExecutable executable)
   at NHibernate.Engine.ActionQueue.ExecuteActions(IList list)
   at NHibernate.Engine.ActionQueue.ExecuteActions()
   at NHibernate.Event.Default.DefaultFlushEventListener.OnFlush(FlushEvent event)
   at NHibernate.Impl.SessionImpl.Flush()
   at NHibernate.Transaction.AdoNetWithDistributedTransactionFactory.DistributedTransactionContext.System.Transactions.IEnlistmentNotification.Prepare(PreparingEnlistment preparingEnlistment)

Our system leverages NHibernate for Data Access layer. And the exception tells us that there is a piece of string that does not fit into its column max length. It happens when the system commits a transaction.

The question is which table/column? There are many tables involved in a single transaction. This is a complex system.

In the bug’s context, there is very few information involved. I looked at the code that is responsible for the function. Nothing’s special.

Step 2 – Look for Variants

From the customer, I know that some users had the problem. Some have not. Therefore, logged in user plays an important role in the bug.

When looking at any problem, there are always a number of participants, there are input and output. I took a pencil and paper. I jotted down all the participants (actually the C# object/class, such as User, Workflow, …). Just jot them down. Then I drew relationships between them.

Thinking about both cases success and fail, what are the variant factors? Luckily in my case, I can quickly figure out that was the User. And in our system, we care most about 2 pieces of information: Username and Full Name.

This step is important because it will give us a good picture of what might cause the problem.

Step 3 – Look at Data

Luckily that we can access to the production database. Ran some queries. The data emerged an interesting observation:

If the username is 41 characters, it causes the issue. If the username is shorter (at that time I did not notice how many characters the shorter means). I knew that our system keeps the created/edited information. But those columns have more than 50 characters max. It should not be an issue anyway.

Let’s find all the columns having the limit of 41 characters

SELECT TOP 1000 [TABLE_CATALOG]
      ,[TABLE_NAME]
      ,[COLUMN_NAME]
      ,[DATA_TYPE]
      ,[CHARACTER_MAXIMUM_LENGTH]
  FROM [INFORMATION_SCHEMA].[COLUMNS]
  WHERE CHARACTER_MAXIMUM_LENGTH < 41 AND CHARACTER_MAXIMUM_LENGTH > 0 AND DATA_TYPE='nvarchar'

Hurray, the bad boy could not hide anymore. There I had a very small set of columns which might cause the issue. It did not take much time to check each of them. All I had to do is to check in the code which column accepts Username value.

The fix? Well easy. Just need to extend the column max length.

 

Dealing with production bugs is hard. There is very limited information. And for most developers, if they cannot reproduce, debug the bug, they give up. The first and most important thing is to have a right mindset:

There must be something wrong somewhere in the system.

Fix a production bug? Not a problem!

  1. Have the right mindset. Do not assume.
  2. Gather whatever we can. Use whatever we got. Do not complain.
  3. Take a deep breath.
  4. Apply basic logical thinking. The process I share here is one of them.
  5. Hey, need a bit of luck.

Either you like it or not, bugs happen in production. I suggest you embrace them, have fun with them.

BUGS COME AND GO. DEVELOPERS STAY!

WPF: Mystery bug: Hunt, fix, but still not understand why

I got a bug report from client for a WPF application. Please look at the picture below:

mysterybug

The expected dropdown values are: I , unique II, III, IV, V

But it displayed “II” instead of “III”. I then check the value from data source. The data was correct. I then brought up my Visual Studio and ran the app. No, issue found.

After some discussion, I then tried with Windows 7, and Windows 8. And after some tries with changing the value for “III” item. The conclusions are:

  1. It works fine with Windows Server 2008 R2.
  2. It does not work on Windows 7 and Windows 8
  3. Only happen if “III” stands alone. If you add space (“I II”), it works

After hunting down the code, I ended up at the template to display the combo box item ( I need to use custom template for my own purpose)

<ComboBox.ItemTemplate>
<DataTemplate>
<TextBlock Text="{Binding Path=Name}" Height="35" Padding="5"></TextBlock>
</DataTemplate>
</ComboBox.ItemTemplate>

Accidentally, I changed Padding from “5” to “3” ( I love number “3”). To my all surprise, it worked Smile. I then tried with number “4”. It did not.

It will work with value less than or equal 3.

But I still do not understand the reason Sad smile Just post it here if someone can explain to me or if someone has the same problem might find the answer.