Performance tuning in SQL Server is unendingly complex. At the basic end of the scale are indexes, query plans, disc-based I/O distribution, and the like; more complex issue demand a lot more investigation and care. Because the query optimizer’s job is most heavily centered in disc access - unlike the CLR optimizer, which is best at putting functions inline, and unwinding loops - performance and scalability code can only be realistically tested against data. Lots and lots of data.
Remote joins are a situation SQL Server could handle better. Generally, an entire table is copied from one server to the other, which could potentially flood tempdb, then the join is performed locally. You can use a query hint to force which table is sent and which server the join is delegated to. A better solution will come in a later post, although this is just one situation that requires gigabytes of data to properly test.
By custom, this test data often comes from Cartesian joins. An example is:
Select * From sysObjects, sysComments, sysColumns
My test server doesn’t have enough space to execute this code. If sysComments is removed and the query is executed in a new, empty database, 20,000 rows will be returned ( using SQL Server 2005, Enterprise Edition ). The trouble is, the rows look very much alike, can easily be mistaken for one another, and make poor test data. Let’s try a better approach.
What do we Want from Test Data?
For many optimization techniques, the data itself doesn’t matter. If you’d like to test the efficiency of a compound versus simple [single column] index, you only need a large table and data with cardinality similar to your production data.
Without knowing how unique or duplicated the data in any given system will be, the best approach here is to generate completely random data. While the values will be completely meaningless, they’ll be easily distinguishable. Given enough rows, with the correct data types for your application, it should be easy to simulate most real world scenarios.
Randomness in a Database
At first blush, asking a system designed to manage structured data for random behavior might seem odd. Digging further, ‘might’ turns to ’surely.’ The concept of determinism is a crucial part of the data engine’s job. If you’ve ever worked with indexed views, you may have encountered this: you can’t use the GetDate() function in an indexed view, because it gives a different return value for each call, and persisting the results of the view would be impossible.
Unfortunately, this is a problem: SQL doesn’t like user defined functions ( UDFs ) that aren’t deterministic. SQL Server 2000, when UDFs were introduced, wouldn’t allow any function, like GetDate() or @@cpu_busy, inside a function. Calculating the difference between today and an arbitrary date using a function meant passing both values - including today’s date - in as parameters. This has been relaxed somewhat in 2005. However, this code:
Create Function dbo.NonDeterministicTest() Returns Int As Begin Return Rand() * 100 End
Gives this error:
Msg 443, Level 16, State 1, Procedure NonDeterministicTest, Line 1
Invalid use of side-effecting or time-dependent operator in ‘rand’ within a function.
This means we can’t use the convenient semantics of a function to retrieve our random values. Some values can be column defaults, while text data will have to be generated by a stored procedure using an output parameter. Note: Supplying a seed to the Rand() function doesn’t change SQL Server 2005’s evaluation.
Why does SQL Server Care About Random Function Returns?
If a function returns the same value for every call, the data engine will call the function once, cache its value, and use this on all rows in the result set. If the function gives a different value based on its arguments, and these arguments are supplied by columns or expressions from the query, the engine will cache the return value until the data in the columns changes. On the other hand, if a function isn’t deterministic, it must be executed for every row that could potentially find its way into the result set.
How to Generate Random Values in SQL
We have a number of options for how to code our procedure:
- The Rand() function is straightforward and well documented.
- The NewID() function creates a globally unique identifier ( GUID ), which is stored internally as 16 bytes, but expressed in hex with four dashes.
- Type conversion; because a GUID has far more data than most primitives, we can get a fairly random value by making a call like Cast(Cast(NewID() as binary(8)) as bigInt). This doesn’t work to create DateTime values, however, and is probably overkill in other situations.
The Code
For reasons we’ve gone over, the solution to our test data problem comes in sections. First we need a stored procedure to actually generate the random text data for us, we need a table to store them, and, finally, we’ll use a T-SQL loop to execute our sproc and insert the results.
Random Generator Stored Procedure
Alter Procedure RandomText
@length int,
@rval varChar(max) Output
As
Declare @r varChar(max)
Declare @i int, @v int
Set @r = ''
Set @i = 0
While @i < @length Begin
Set @v = Cast(Rand() * 20 As Int)
If @v < 6
Set @r = @r + Char(65 + Cast(Rand() * 26 As Int))
Else If @v < 12
Set @r = @r + Char(97 + Cast(Rand() * 26 As Int))
Else If @v < 19
Set @r = @r + Char(48 + Cast(Rand() * 10 As Int))
Else Begin
Set @v = Cast(Rand() * 5 As Int)
Set @r = @r + Case When @v < 2 Then ' ' When @v = 3 Then '+' When @v = 4 Then '-' When @v = 5 Then '.' Else '!' End
End
Set @i = @i + 1
End
Select @rval = @r
Pretty straightforward; the procedure takes a string variable to populate, and the number of characters it should generate. This will look much like base 64 encoding. This is overkill unless you’ve set your database to perform case sensitive comparisons, provides more variety to the data.
Table to Hold Test Data
In my case, I needed two tables for a remote query. I called the table that would live on the foreign server BigTable, and the local one SmallTable, created from a subset of BigTable. Feel free to change the names to suit your taste.
Create Table BigTable (
RecordKey uniqueIdentifier default (newID()),
RecordID bigInt identity,
F1 float default (rand() * 10000000),
F2 float default (rand() * 10000000),
F3 float default (rand() * 10000000),
F4 float default (rand() * 10000000),
F5 float default (rand() * 10000000),
F6 float default (rand() * 10000000),
F7 float default (rand() * 10000000),
F8 float default (rand() * 10000000),
F9 float default (rand() * 10000000),
F10 float default (rand() * 10000000),
F11 float default (rand() * 10000000),
F12 float default (rand() * 10000000),
F13 float default (rand() * 10000000),
F14 float default (rand() * 10000000),
F15 float default (rand() * 10000000),
SmallText varChar(900) primary key,
BigText varChar(4000),
CreatedOn dateTime default (getUtcDate())
)
For extra credit, this schema presents the opportunity to test compound indexes:
Create Index idxCoveringFloats On BigTable ( F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15 )
Rules for Tables and Indexes
The columns SmallText and BigText might seem odd. SQL Server allows up to 900 bytes per index value, so defining a larger column prevents it from being indexed. Likewise, a row must fit inside a data page, which is 8,192 bytes, or 8 KB. BigText could use the text data type instead of variable character, but for my purposes, this makes more sense.
If you create a primary key on a table that doesn’t already use a clustered index, the PK will be clustered. In our case, that means the SmallText column. This is a nightmare design; the data in the table is sorted in the same order as the index itself, so as rows are added with random values, data in the table will be constantly moving. For the sake of a speedy test, you may want to remove the primary key before populating the table.
Populate the Table
Set NoCount On
Declare @i int
Set @i = 0
Print GetDate()
While @i < 500000 Begin
Declare @r1 varChar(900), @r2 varChar(4000)
Exec RandomText 900, @r1 output
Exec RandomText 4000, @r2 output
Insert Into BigTable ( SmallText, BigText ) Values ( @r1, @r2 )
Set @i = @i + 1
End
Print GetDate()
exec sp_spaceused BigTable
Finally, we use a while loop to stuff the results of our stored procedure into the table. All other columns populate themselves with random values from defaults. No count is used to suppress half a million “(1 row(s) affected)” messages.
Warning
The population script is very resource intensive: nothing is set based, so normal optimizations the query engine provides don’t help. We’re using nested loops to generate data and record it, slow-by-slow. If you leave the primary key while adding data, the server will split pages like it’s going out of style. Plan to leave this running for a while, and for resources ( especially RAM ) on the server to be used greedily.