Production Bug – Unnoticeable Async Void

The operation team reported a bug in production. The server crashed a few times, which seems daily. Here is the detail. I modified some texts (replaced with XXX) to remove the confidential data.

Your app crashed because of System.Net.Http.HttpRequestException
Your app, XXX, crashed because of System.Net.Http.HttpRequestException and aborted the requests it was processing when the overflow occurred. As a result, your app’s users may have experienced HTTP 502 errors.

This call stack caused the exception:
System.Net.Http.HttpContent+d__54.MoveNext
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
System.Net.Http.HttpContent+d__70`2[[System.__Canon System.Private.CoreLib][System.__Canon System.Private.CoreLib]].MoveNext
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
XXX.DeleteSampleCommandHandler+<>c__DisplayClass6_0+<b__0>d.MoveNext
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Threading.Tasks.Task+<>c.b__139_1
System.Threading.QueueUserWorkItemCallback+<>c.<.cctor>b__6_0
System.Threading.ExecutionContext.RunForThreadPoolUnsafe[[System.__Canon System.Private.CoreLib]]
System.Threading.QueueUserWorkItemCallback.Execute
System.Threading.ThreadPoolWorkQueue.Dispatch
System.Threading._ThreadPoolWaitCallback.PerformWaitCallback
---> (Inner Exception #0) System.IO.IOException
System.Net.Security.SslStream+d__214`1[[System.Net.Security.SslStream+SslReadAsync System.Net.Security]].MoveNext
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
System.Net.Http.HttpConnection+d__87.MoveNext
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess
System.Net.Http.HttpConnection+ChunkedEncodingReadStream+d__10.MoveNext
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
System.Net.Http.HttpConnectionResponseContent+d__5.MoveNext
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
System.Net.Http.HttpContent+d__54.MoveNext
--- End of inner exception stack trace ---
---> (Inner Exception #1) System.ObjectDisposedException
System.Net.Security.SslStream.ThrowIfExceptional
System.Net.Security.SslStream.CheckThrow
System.Net.Security.SslStream+d__214`1[[System.Net.Security.SslStream+SslReadAsync System.Net.Security]].MoveNext
--- End of inner exception stack trace ---

To learn more about this exception, use the Kudu console to view the log file in d:\home\logfiles\crashdumps, which contains up to 10 log files for your app if space is available.

A long stack trace without only one line related to our application:

XXX.DeleteSampleCommandHandler+<>c__DisplayClass6_0+<b__0>d.MoveNext 

Let’s check out that command handler class.

public async Task Handle(DeleteSampleCommand command, CancellationToken cancellationToken)
{
    var client = await _clientFactory.GetSampleApiClient(command.Endpoint);

    client.RequestProcessed += async (sender, response) =>
    {
        var request = await response.RequestMessage.Content.ReadAsStringAsync();
        var content = await response.Content.ReadAsStringAsync();

        await _auditRepository.Add(command.Endpoint, command.SampleId, _context.User, "DeleteSample", request, (int)response.StatusCode, content);
    };

    await client.DeleteSampleAsync(command.SampleId, cancellationToken);

    // Other logic after deleting the sample from a remote endpoint

}

A client is auto-generated by Swagger/NSwag. This is the DeleteSampleAsync method.

public async System.Threading.Tasks.Task DeleteSampleAsync(System.Guid sampleId, System.Threading.CancellationToken cancellationToken)
{
    if (sampleId == null)
        throw new System.ArgumentNullException("sampleId");

    var urlBuilder_ = new System.Text.StringBuilder();
    urlBuilder_.Append(BaseUrl != null ? BaseUrl.TrimEnd('/') : "").Append("/samples/{sampleId}");
    urlBuilder_.Replace("{sampleId}", System.Uri.EscapeDataString(ConvertToString(sampleId, System.Globalization.CultureInfo.InvariantCulture)));

    var client_ = _httpClient;
    var disposeClient_ = false;
    try
    {
        using (var request_ = new System.Net.Http.HttpRequestMessage())
        {
            request_.Method = new System.Net.Http.HttpMethod("DELETE");

            PrepareRequest(client_, request_, urlBuilder_);
            var url_ = urlBuilder_.ToString();
            request_.RequestUri = new System.Uri(url_, System.UriKind.RelativeOrAbsolute);
            PrepareRequest(client_, request_, url_);

            var response_ = await client_.SendAsync(request_, System.Net.Http.HttpCompletionOption.ResponseHeadersRead, cancellationToken).ConfigureAwait(false);
            var disposeResponse_ = true;
            try
            {
                var headers_ = System.Linq.Enumerable.ToDictionary(response_.Headers, h_ => h_.Key, h_ => h_.Value);
                if (response_.Content != null && response_.Content.Headers != null)
                {
                    foreach (var item_ in response_.Content.Headers)
                        headers_[item_.Key] = item_.Value;
                }

                ProcessResponse(client_, response_);

                var status_ = (int)response_.StatusCode;
                if (status_ == 200)
                {
                    return;
                }
                else
                if (status_ == 400)
                {
                    string responseText_ = ( response_.Content == null ) ? string.Empty : await response_.Content.ReadAsStringAsync().ConfigureAwait(false);
                    throw new ApiException("Bad Request", status_, responseText_, headers_, null);
                }
                else
                if (status_ == 401)
                {
                    string responseText_ = ( response_.Content == null ) ? string.Empty : await response_.Content.ReadAsStringAsync().ConfigureAwait(false);
                    throw new ApiException("Unauthorized", status_, responseText_, headers_, null);
                }
                else
                if (status_ == 403)
                {
                    string responseText_ = ( response_.Content == null ) ? string.Empty : await response_.Content.ReadAsStringAsync().ConfigureAwait(false);
                    throw new ApiException("Forbidden", status_, responseText_, headers_, null);
                }
                else
                if (status_ == 500)
                {
                    string responseText_ = ( response_.Content == null ) ? string.Empty : await response_.Content.ReadAsStringAsync().ConfigureAwait(false);
                    throw new ApiException("Server Error", status_, responseText_, headers_, null);
                }
                else
                {
                    var responseData_ = response_.Content == null ? null : await response_.Content.ReadAsStringAsync().ConfigureAwait(false); 
                    throw new ApiException("The HTTP status code of the response was not expected (" + status_ + ").", status_, responseData_, headers_, null);
                }
            }
            finally
            {
                if (disposeResponse_)
                    response_.Dispose();
            }
        }
    }
    finally
    {
        if (disposeClient_)
            client_.Dispose();
    }
}

The client.RequestProcessed is our custom event. The implementation looks like

// Extend the generated SampleClient, thank to the partial class
public partial class SampleClient
{
    // It is designed in the SampleClient to allow consumers hook up additional functionalities
    partial void ProcessResponse(HttpClient client, HttpResponseMessage response)
    {
        RequestProcessed?.Invoke(client, response);
    }

    // Our custom event which is invoked whenever the ProcessResponse is executed
    public event EventHandler<HttpResponseMessage> RequestProcessed;
}

Investigation

The application is a Web API service. Command handlers are invoked from Controllers. Usually, an exception cannot crash the process.

Unless?

I recall I read the async await practices from David Fowler.

BAD Async void methods can’t be tracked and therefore unhandled exceptions can result in application crashes.

Async void usually appears in event handlers. Which is this code

client.RequestProcessed += async (sender, response) =>
{
    var request = await response.RequestMessage.Content.ReadAsStringAsync();
    var content = await response.Content.ReadAsStringAsync();

    await _auditRepository.Add(command.Endpoint, command.SampleId, _context.User, "DeleteSample", request, (int)response.StatusCode, content);
};

The compiler will generate an async void delegate for that event handler. Remember that the ProcessResponse is synchronous. Therefore, when there is an exception in the event handler block, it crashes the application.

But why? how to explain the exception in the stack trace? The response instance was disposed of. The event handler read request and response contents on a different thread for auditing purposes.

Solution

A quick and dirty solution introduces another method that returns a task.


private async Task AuditRequest(DeleteSampleCommand command, HttpResponseMessage response)
{
    var request = await response.RequestMessage.Content.ReadAsStringAsync();
    var content = await response.Content.ReadAsStringAsync();

    await _auditRepository_.Add(command.Endpoint, command.SampleId, _context.User, "DeleteSample", request, (int)response.StatusCode, content);
}

client.RequestProcessed += (sender, response) =>
{
    _ = AuditRequest(command, response);
};

However, it will break the audit requirements. The _auditRepository_.Add does not run because of an exception when reading the response.

To audit, the code needs to know the request and response. The execution flow can get the request and response without reading HttpResponseMessage directly again. The generated Swagger client has already read and wrapped the content in the ApiException.Response property.

And the final solution is as below.

public async Task Handle(DeleteSampleCommand command, CancellationToken cancellationToken)
{
    var payload = JsonConvert.SerializeObject(command);
    var response = "OK";
    var statusCode = 200;
    var client = await _clientFactory.GetSampleApiClient(command.Endpoint);

    try
    {
        await client.DeleteSampleAsync(command.SampleId, cancellationToken);
    }
    catch (ApiException apiException)
    {
        response = apiException.Response ?? apiException.Message;
        statusCode = apiException.StatusCode;
    }

    await _auditRepository.Add(
        command.Endpoint,
        command.SampleId,
        _context.User,
        "DeleteSample",
        payload, 
        statusCode,
        response);
    // Other logic after deleting the sample from a remote endpoint

}

Conclusion

Here are a few things I learned from this production issue.

  1. Be careful with async await. It looks cool but be sure you know what you are doing.
  2. Avoid accessing the Request/Response directly. Look for another solution. There is always another solution.
  3. When introducing an event, ask this question: "Do I need it? Can I write code without it?"

Production bugs are always interesting.

Production Bug – SQL Identity Jump

I got a bug report that some records were not synced between the two systems. There is a system to input data, and there is another system to expose data to external systems. In between, there is a system to convert data. Combining them is a data pipeline. It is typical architecture these days.

After checking data in related databases, I found that the error occurred in a single database where the data moved from one table to another. There are a source table, a destination table, and an intermediate table. We store the progress in a state which allows the system to continue from the last run.

Whenever there are changes in the source table, a SQL trigger executes and moves the "EntityId" column into the intermediate table.
Whenever the function (a timer Azure Function) starts, it reads data from the intermediate table and performs the required logic to convert data.

Everything worked as expected until the production bug was reported.

Schema and Code

The intermediate table schema Entity_Changes

CREATE TABLE [dbo].[Entity_Changes]
(
    [Ordering] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY, 
    [EntityId] UNIQUEIDENTIFIER NOT NULL, 
)

The execution flow is

  1. Read records from Entity_Changes table
  2. Proceed them
  3. Delete them from Entity_Changes table

The simplified version of the code

private async Task<long> ReplaceRows(IDataSource dataSource, ConversionState resultState, CancellationToken cancellationToken)
{
    var pagingFrom = resultState.NumberOfRowsConverted;

    long recordsCount = 0;
    const int batchSize = 1000;
    do
    {
        cancellationToken.ThrowIfCancellationRequested();

        pagingFrom = resultState.NumberOfRowsConverted;
        var pagingTo = pagingFrom + batchSize;

        var pagedResult = await dataSource.ReadData(pagingFrom, pagingTo);
        if (pagedResult.MaxOrdering == 0)
        {
            _logger.LogInformation("No records to proceed");
            break;
        }

        // Do the conversion logic here

        recordsCount += distinctResults.Count;
        _logger.LogInformation($"{recordsCount} records");

        // NumberOfRowsConverted is used to set the PagingFrom value. 
        // It is the max ordering from the last batch
        resultState.NumberOfRowsConverted = pagedResult.MaxOrdering;
        resultState.State = (int)ConversionStateStatus.Running;
        await _conversionStateService.SaveLastSyncedState(resultState);

        var deletedRows = await dataSource.DeleteProceededRecords(pagingFrom, pagingTo);
        _logger.LogInformation($"Delete {deletedRows} rows from temp table");

    } while (true);

    return recordsCount;
}

In one of the data sources, I found these interesting numbers. The resultState.NumberOfRowsConverted was 5.885.592. The lowest Ordering in the source table was 5.895.593. The difference was 10.001. It did not make sense. How could that happen?

With those numbers, the ReadData returned empty result because the pagingTo was 5.886.592 (=5.885.592 + 1.000). It is smaller than 5.895.593

From experience, I know that the SQL Identity column does not guarantee continuous values; for example, when a transaction fails, Identity values are lost. But 10K gap is too big.

Looking at the code for hours could not help me explain (btw, I found some other issues with the code); I asked Google this SQL Identity Jump. Bingo! I found the answer SQLHack Avoid Identity Jump.


SQL Server is using a different cache size for the various data type of identity columns. The cache size of the INT data type column is 1000, and the Cache size of the BIGINT or Numeric data typed column is 10000.

Bingo! In some circumstances, there are jumps. The Ordering column is a BIGINT, so the jump size is 10.000.

Solution

Once you know the root cause, the solution is quite simple. First off, the paging implementation was not accurate. The number of rows might not be the same between the reading and deleting data. How could it happen?

Let’s say, when reading data for paging from [0-1.000], there are 10 rows in the database. The system proceeds them. When it comes to deleting those 10 rows, there are 900 rows. The solution is simple.

// Delete the actual ordering range that were proceeded
var deletedRows = await dataSource.DeleteProceededRecords(pagingFrom, pagedResult.MaxOrdering);

To solve the first problem, query using SELECT TOP WHERE ORDERING > @pagingFrom instead of querying by range.

var pagedResult = await dataSource.ReadData(pagingFrom, batchSize);

Production bugs are always interesting.

Pull Requests

A pull request (PR) is a social contract between developers, architects, and technical leads. So what are the major parts of a pull request?

Code

In software development, most of the time, PR is about code. Keep the changes as small as possible. Do not try to include too many changes in a pull request.

Description

I have seen many pull requests with poor descriptions. When I read them, they do not give me any helpful information. What should be included in a PR’s descriptions?

Bugs: Root Cause – Solution – Impact

Root Cause: It is a critical part. Developers must explain why and how it occurred. A common understanding between developers and reviewers is established once they read and agree upon the root cause. When the root cause is wrong, the rest does not matter.

Solution: Explain how you have fixed the problem. Did you make any tradeoffs? And why did you choose that approach? Depending on the problem, the detail differs.

Impact: Some changes might have significant consequences, such as infrastructure changes, changes in frameworks, changes in common flows. Some changes are local. Developers must be aware of them, and they must write them down.

Features: Problem/Requirement – Solution – Impact

The steps are the same as bugs, except the first one is "Problem/Requirement," a sentence explaining the outcome of the PR.

Commit Logs

Many pull requests have only one poor description commit. And it includes every change in the PR. IMO, it is a waste of commit feature. Commit logs should tell the process of solving the problem, a series of small changes. It reflects the implementation flow.

  1. A commit should capture a single "logical" change if possible. It is the same as the Single Responsibility Principle in software development.
  2. A commit should have a "good" commit message. At least it should help you, the developer, understand what has been implemented in that commit.
  3. The commit logs (all commits in a PR) should form an implementation flow, which is very important. It is like a checklist for both developers and reviewers. One can figure out what was missing by reading the commit logs.

There is pushback. What is a single "logical" change? Well, there is no definition. Developers have to practice and improve their judgment over time. It is also the magic of this practice. It makes developers better.

Resolve Code Reviews

If possible, resolve a comment in a commit. It is much easier to track what has been resolved and how – by looking at the commit detail.

Small practices can make a big difference.