Production Bug – SQL Identity Jump

I got a bug report that some records were not synced between two systems. One system takes data in, another exposes data to external systems, and a system in between converts the data. Together they form a data pipeline, a typical architecture these days.

After checking the data in the related databases, I found that the error occurred in a single database, where data moves from one table to another. There is a source table, a destination table, and an intermediate table. We store the progress in a state record, which allows the system to continue from the last run.

Whenever there are changes in the source table, a SQL trigger executes and copies the changed "EntityId" values into the intermediate table.
Whenever the timer-triggered Azure Function runs, it reads data from the intermediate table and performs the required conversion logic.
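
The trigger itself is not shown here, but a minimal sketch of what it might look like, assuming a source table called Entity (the real table and trigger names are different), is:

-- Hypothetical trigger; the real source table and trigger names are not shown here
CREATE TRIGGER [dbo].[TR_Entity_TrackChanges]
ON [dbo].[Entity]
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Record each changed EntityId so the timer function can pick it up later
    INSERT INTO [dbo].[Entity_Changes] ([EntityId])
    SELECT [EntityId] FROM inserted;
END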

Everything worked as expected until the production bug was reported.

Schema and Code

The schema of the intermediate table, Entity_Changes:

CREATE TABLE [dbo].[Entity_Changes]
(
    [Ordering] [bigint] IDENTITY(1,1) NOT NULL PRIMARY KEY, 
    [EntityId] UNIQUEIDENTIFIER NOT NULL
)

The execution flow is

  1. Read records from Entity_Changes table
  2. Process them
  3. Delete them from Entity_Changes table

A simplified version of the code:

private async Task<long> ReplaceRows(IDataSource dataSource, ConversionState resultState, CancellationToken cancellationToken)
{
    var pagingFrom = resultState.NumberOfRowsConverted;

    long recordsCount = 0;
    const int batchSize = 1000;
    do
    {
        cancellationToken.ThrowIfCancellationRequested();

        pagingFrom = resultState.NumberOfRowsConverted;
        var pagingTo = pagingFrom + batchSize;

        var pagedResult = await dataSource.ReadData(pagingFrom, pagingTo);
        if (pagedResult.MaxOrdering == 0)
        {
            _logger.LogInformation("No records to proceed");
            break;
        }

        // Do the conversion logic here. In the real code this step produces
        // distinctResults, the set of converted records; it is omitted for brevity.

        recordsCount += distinctResults.Count;
        _logger.LogInformation($"{recordsCount} records");

        // NumberOfRowsConverted is used to set the PagingFrom value. 
        // It is the max ordering from the last batch
        resultState.NumberOfRowsConverted = pagedResult.MaxOrdering;
        resultState.State = (int)ConversionStateStatus.Running;
        await _conversionStateService.SaveLastSyncedState(resultState);

        var deletedRows = await dataSource.DeleteProceededRecords(pagingFrom, pagingTo);
        _logger.LogInformation($"Delete {deletedRows} rows from temp table");

    } while (true);

    return recordsCount;
}

In one of the data sources, I found these interesting numbers: resultState.NumberOfRowsConverted was 5.885.592, while the lowest Ordering left in the Entity_Changes table was 5.895.593. The difference was 10.001. It did not make sense. How could that happen?

With those numbers, ReadData returned an empty result: pagingTo was 5.886.592 (= 5.885.592 + 1.000), which is still smaller than the lowest remaining Ordering, 5.895.593.

From experience, I know that a SQL identity column does not guarantee continuous values; for example, when a transaction rolls back, the identity values it consumed are lost. But a 10K gap is too big for that.
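
A quick way to see that normal behaviour against the Entity_Changes table:

-- Identity values consumed inside a rolled-back transaction are not reused
BEGIN TRANSACTION;
INSERT INTO [dbo].[Entity_Changes] ([EntityId]) VALUES (NEWID());
ROLLBACK TRANSACTION;

INSERT INTO [dbo].[Entity_Changes] ([EntityId]) VALUES (NEWID());
-- The Ordering of the second row skips the value consumed by the rolled-back insert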

Staring at the code for hours did not explain it (by the way, I found some other issues with the code along the way), so I asked Google about this SQL identity jump. Bingo! I found the answer in the SQLShack article on avoiding identity jumps.


SQL Server uses a different identity cache size for different data types: the cache size for an INT identity column is 1.000, and for a BIGINT or NUMERIC identity column it is 10.000.

That explained it. In some circumstances, for example an unexpected restart or failover, the cached identity values are lost and the next value jumps. The Ordering column is a BIGINT, so the jump size is 10.000.
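
On SQL Server 2017 and later the identity cache can also be switched off at the database level. That is not the fix I went with (disabling the cache costs insert performance), but for reference:

-- Disable identity caching for the whole database (SQL Server 2017+).
-- Shown for reference only; the fix below takes a different route.
ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;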

Solution

Once you know the root cause, the solution is quite simple. But there was another problem to fix first: the paging implementation was not accurate. The set of rows is not necessarily the same between reading and deleting. How could that happen?

Let's say a batch reads the range [0-1.000] and finds 10 rows. The system processes them. By the time those 10 rows are deleted, the trigger has inserted more rows whose Ordering also falls inside [0-1.000], say 900 rows in total, so the range-based delete would remove rows that were never processed. The fix is to delete only up to the highest Ordering that was actually read.

// Delete only the ordering range that was actually processed
var deletedRows = await dataSource.DeleteProceededRecords(pagingFrom, pagedResult.MaxOrdering);
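
Under the hood, DeleteProceededRecords presumably runs something like the following sketch (assuming the range is exclusive at the lower bound and inclusive at the upper bound; the real query is not shown here):

-- Hypothetical query behind DeleteProceededRecords: remove only what was read
DELETE FROM [dbo].[Entity_Changes]
WHERE [Ordering] > @pagingFrom
  AND [Ordering] <= @maxOrdering;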

To solve the original identity-jump problem, query using SELECT TOP (@batchSize) ... WHERE Ordering > @pagingFrom instead of querying by a fixed range.

var pagedResult = await dataSource.ReadData(pagingFrom, batchSize);
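
Again as a sketch (the real query is not shown here), the updated ReadData boils down to taking the next batch strictly after the last converted Ordering, so any gap, no matter how large, is simply skipped over:

-- Take the next batch after the last converted Ordering, ignoring identity gaps
SELECT TOP (@batchSize) [Ordering], [EntityId]
FROM [dbo].[Entity_Changes]
WHERE [Ordering] > @pagingFrom
ORDER BY [Ordering];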

Production bugs are always interesting.
