February
21st 2008
How to Generate Test Data in Bulk

Posted under Code & SQL

Performance tuning in SQL Server is unendingly complex. At the basic end of the scale are indexes, query plans, disc-based I/O distribution, and the like; more complex issue demand a lot more investigation and care. Because the query optimizer’s job is most heavily centered in disc access - unlike the CLR optimizer, which is best at putting functions inline, and unwinding loops - performance and scalability code can only be realistically tested against data. Lots and lots of data.

Remote joins are a situation SQL Server could handle better. Generally, an entire table is copied from one server to the other, which could potentially flood tempdb, then the join is performed locally. You can use a query hint to force which table is sent and which server the join is delegated to. A better solution will come in a later post, although this is just one situation that requires gigabytes of data to properly test.

By custom, this test data often comes from Cartesian joins. An example is:

Select * From sysObjects, sysComments, sysColumns

My test server doesn’t have enough space to execute this code. If sysComments is removed and the query is executed in a new, empty database, 20,000 rows will be returned ( using SQL Server 2005, Enterprise Edition ). The trouble is, the rows look very much alike, can easily be mistaken for one another, and make poor test data. Let’s try a better approach.

What do we Want from Test Data?

For many optimization techniques, the data itself doesn’t matter. If you’d like to test the efficiency of a compound versus simple [single column] index, you only need a large table and data with cardinality similar to your production data.

Without knowing how unique or duplicated the data in any given system will be, the best approach here is to generate completely random data. While the values will be completely meaningless, they’ll be easily distinguishable. Given enough rows, with the correct data types for your application, it should be easy to simulate most real world scenarios.

Randomness in a Database

At first blush, asking a system designed to manage structured data for random behavior might seem odd. Digging further, ‘might’ turns to ’surely.’ The concept of determinism is a crucial part of the data engine’s job. If you’ve ever worked with indexed views, you may have encountered this: you can’t use the GetDate() function in an indexed view, because it gives a different return value for each call, and persisting the results of the view would be impossible.

Unfortunately, this is a problem: SQL doesn’t like user defined functions ( UDFs ) that aren’t deterministic. SQL Server 2000, when UDFs were introduced, wouldn’t allow any function, like GetDate() or @@cpu_busy, inside a function. Calculating the difference between today and an arbitrary date using a function meant passing both values - including today’s date - in as parameters. This has been relaxed somewhat in 2005. However, this code:

Create Function dbo.NonDeterministicTest() Returns Int As Begin Return Rand() * 100 End

Gives this error:

Msg 443, Level 16, State 1, Procedure NonDeterministicTest, Line 1
Invalid use of side-effecting or time-dependent operator in ‘rand’ within a function.

This means we can’t use the convenient semantics of a function to retrieve our random values. Some values can be column defaults, while text data will have to be generated by a stored procedure using an output parameter. Note: Supplying a seed to the Rand() function doesn’t change SQL Server 2005’s evaluation.

Why does SQL Server Care About Random Function Returns?

If a function returns the same value for every call, the data engine will call the function once, cache its value, and use this on all rows in the result set. If the function gives a different value based on its arguments, and these arguments are supplied by columns or expressions from the query, the engine will cache the return value until the data in the columns changes. On the other hand, if a function isn’t deterministic, it must be executed for every row that could potentially find its way into the result set.

How to Generate Random Values in SQL

We have a number of options for how to code our procedure:

  • The Rand() function is straightforward and well documented.
  • The NewID() function creates a globally unique identifier ( GUID ), which is stored internally as 16 bytes, but expressed in hex with four dashes.
  • Type conversion; because a GUID has far more data than most primitives, we can get a fairly random value by making a call like Cast(Cast(NewID() as binary(8)) as bigInt). This doesn’t work to create DateTime values, however, and is probably overkill in other situations.

The Code

For reasons we’ve gone over, the solution to our test data problem comes in sections. First we need a stored procedure to actually generate the random text data for us, we need a table to store them, and, finally, we’ll use a T-SQL loop to execute our sproc and insert the results.

Random Generator Stored Procedure

Alter Procedure RandomText
    @length int,
    @rval    varChar(max)    Output
As
    Declare @r varChar(max)
    Declare @i int, @v int
    Set @r = ''
    Set @i = 0
    While @i < @length Begin
		Set @v = Cast(Rand() * 20 As Int)
        If @v < 6
            Set @r = @r + Char(65 + Cast(Rand() * 26 As Int))
        Else If @v < 12
            Set @r = @r + Char(97 + Cast(Rand() * 26 As Int))
		Else If @v < 19
            Set @r = @r + Char(48 + Cast(Rand() * 10 As Int))
		Else Begin
			Set @v = Cast(Rand() * 5 As Int)
			Set @r = @r + Case When @v < 2 Then ' ' When @v = 3 Then '+' When @v = 4 Then '-' When @v = 5 Then '.' Else '!' End
		End
        Set @i = @i + 1
    End
    Select @rval = @r

Pretty straightforward; the procedure takes a string variable to populate, and the number of characters it should generate. This will look much like base 64 encoding. This is overkill unless you’ve set your database to perform case sensitive comparisons, provides more variety to the data.

Table to Hold Test Data

In my case, I needed two tables for a remote query. I called the table that would live on the foreign server BigTable, and the local one SmallTable, created from a subset of BigTable. Feel free to change the names to suit your taste.

Create Table BigTable (
    RecordKey      uniqueIdentifier  default (newID()),
    RecordID        bigInt               identity,
    F1                 float                default (rand() * 10000000),
    F2                 float                default (rand() * 10000000),
    F3                 float                default (rand() * 10000000),
    F4                 float                default (rand() * 10000000),
    F5                 float                default (rand() * 10000000),
    F6                 float                default (rand() * 10000000),
    F7                 float                default (rand() * 10000000),
    F8                 float                default (rand() * 10000000),
    F9                 float                default (rand() * 10000000),
    F10               float                default (rand() * 10000000),
    F11               float                default (rand() * 10000000),
    F12               float                default (rand() * 10000000),
    F13               float                default (rand() * 10000000),
    F14               float                default (rand() * 10000000),
    F15               float                default (rand() * 10000000),
    SmallText        varChar(900)        primary key,
    BigText           varChar(4000),
    CreatedOn       dateTime            default (getUtcDate())
)

For extra credit, this schema presents the opportunity to test compound indexes:

Create Index idxCoveringFloats On BigTable ( F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15 )

Rules for Tables and Indexes

The columns SmallText and BigText might seem odd. SQL Server allows up to 900 bytes per index value, so defining a larger column prevents it from being indexed. Likewise, a row must fit inside a data page, which is 8,192 bytes, or 8 KB. BigText could use the text data type instead of variable character, but for my purposes, this makes more sense.

If you create a primary key on a table that doesn’t already use a clustered index, the PK will be clustered. In our case, that means the SmallText column. This is a nightmare design; the data in the table is sorted in the same order as the index itself, so as rows are added with random values, data in the table will be constantly moving. For the sake of a speedy test, you may want to remove the primary key before populating the table.

Populate the Table

Set NoCount On
Declare @i int
Set @i = 0
Print GetDate()
While @i < 500000 Begin
    Declare @r1 varChar(900), @r2 varChar(4000)
    Exec RandomText 900, @r1 output
    Exec RandomText 4000, @r2 output
    Insert Into BigTable ( SmallText, BigText ) Values ( @r1, @r2 )
    Set @i = @i + 1
End
Print GetDate()
exec sp_spaceused BigTable

Finally, we use a while loop to stuff the results of our stored procedure into the table. All other columns populate themselves with random values from defaults. No count is used to suppress half a million “(1 row(s) affected)” messages.

Warning

The population script is very resource intensive: nothing is set based, so normal optimizations the query engine provides don’t help. We’re using nested loops to generate data and record it, slow-by-slow. If you leave the primary key while adding data, the server will split pages like it’s going out of style. Plan to leave this running for a while, and for resources ( especially RAM ) on the server to be used greedily.

Trackback URI | Comments RSS

Leave a Reply