Indexing in SQL Server
In this article we are going to dive deep into clustering index and non-clustered index. But before that we need to have an informal introduction of indexing in database and it’s usage for non-database guys.
So the question is What is index in database? Why do we need indexing in database? Let us learn with some scenario along with an analogy. The analogy is the index of a book.
Just think we have a database book (not necessarily a database book can be any book) having 2000 pages.
Scenario 1: We want to read a complete chapter that describes “indexing” in detail?
Action: We will search for content table at the beginning of the book. The content table has the list of chapters along with starting page of the chapter. From that table we come to know pages 1000 to 1150 describe indexing in details. We started reading from page 1000 and read next 150 pages in next 7 days.
In this example above, the content table and pages are together analogous to clustered index in the database.
Now imagine that We have given same 2000 isolated pages not even arranged in any order and We want to read all the pages that describe the indexing in details.We imagine the situation, We might need 10 days to find all 150 pages that describes indexing in details and another 10 days to orient the pages in proper order. Then We read it for next 7 days. Exactly the same thing happens when we try to find something in a table that does not have any kind index.
Scenario 2: We just want to read the section that describes the topic “covering index” in details.
Action: We will directly move to index pages given at the back side of the book. Then we search for sections starting with "c", then find "covering index" in that section, get the associated page number and move to that page, read the section. Done.
In the example above, the index pages of the back of the book is analogous to the non-clustered index in database.
Now let’s wrap it up . In general, indexing helps us in searching data in a Database Table. And this “help” comes with a some trade offs in term of “overhead time” while inserting, updating or deleting data in table.
So indexing in Database can be a “Boon or Bane” depending on the context and table size. “Boon” as far a good amount of data searching is required and less number of insertion or updation or deletion are required and “Bane” as far a little data searching is required but huge insertion or updation or deletion are required.
In database world we can use two type of indexing techniques.
-
Clustered index.
-
Non Clustered index.
In this article we will look at the clustered index only. Before diving into indexing in SQL Server, let us examine how data is stored in SQL server. SQL Server internally organizes the data in a data file in terms of pages. A page is an 8 KB (8192 Bytes) unit and can belongs to a single object like a table or index. So a page can be a data page or index page. Data pages holds the actual data and index pages hold the indexing data (like content table of the book or indexing of the book ). Pages are again organized in extents. One extent consists of 8 consecutive pages. An extent is called as mixed extent if it contains pages from multiple objects and called uniform extent if all pages belongs to a single object. SQL Server stores first 8 pages of a single object in a mixed extent and as soon as it crosses 8 pages, a uniform extent is created and all pages of that object is moved to the uniform extent.
Visual 1: Extent contains 8 consecutive pages.
Visual 2: Page 8KB = 8192 bytes
Pages are the physical structure in SQL Server but as far as indexing is concerned, we are more interested in logical structure of the data. Data organized in pages are logical structure.
When a table is created in SQL Server database, it (table) is organized as heap. Data in heap does not have any logical structure. Heap is just a bunch of pages and extents.
As soon as we create a primary key on that table, SQL Server automatically organizes that table as a clustered table with primary key as the clustering key though it is not mandatory for a primary key to be a clustering key. We can choose any column as clustering key, even we can choose more than one column as a clustering key. We can have max 16 column in a clustering key but the size of all column together should not exceed 900 bytes. SQL Server organizes a clustered table as balanced tree. Every balanced tree has a single root page and at least one or more leaf level pages.
Visual 3: Clustered Index (Balanced Tree)
The clustering key might be unique or non unique. If we choose a primary key to be a clustering key, then it is unique. If it is other than primary key, the clustering key might not be unique. However SQL Server always maintains uniqueness of clustering key by adding a uniquifier (sequential integer) to the repeating values.
SQL Server always returns data from the leaf level pages. The intermediate levels and root level is only for navigation to the leaf level pages. That means all data resides in the leaf level in case of clustered index. Thus all the leaf level pages of a clustered table are data pages. Intermediate and root level pages are index pages.
Now the big questions that comes in mind, for a given table, are
-
How many intermediate levels?
-
How many pages?
-
How many rows in a page?
The answer to those questions completely depends on the numbers of rows we have in a table and size of each row, in other words the size of the table.
Let us try to understand the number of intermediate levels, number of pages in each intermediate level, the number of pages in leaf level with an example.
Suppose we have a table as defined below and it contains 100,000 rows. This table is a clustered table with a clustering key StudentID.
CREATE TABLE STUDENT
(
StudentID int IDENTITY(1,1) NOT NULL,
FirstName char(50) NOT NULL,
LastName char(50) NOT NULL,
SSN char(11) NOT NULL
)
The row size will be (4+50+50+11) = 115 bytes. Recall, size of the page in SQL Server is 8KB and 96 bytes is reserved for page header. So ( 8 x 1024 – 96 ) = 8096 bytes of a page is usable.
So one page can accommodate 8096 / 115 = 70.4 = 70 rows assuming 100% allocation per page ( the fill factor of the clustered index is set to default 0 and it means 100% fill. Fill factor is a bit advance concept and is used to fight against the fragmentation issue).
Calculation says, we need (100,000 / 70 ) = 1428.57 = 1429 leaf level pages to accommodate all the rows.
Here is the structure of leaf level pages according to the clustering key. The important part is the physical and logical structure of leaf level pages are same for clustered index.
Visual 4: Leaf level pages.
Now let’s talk about intermediate level pages and root page. The intermediate level ( and root level ) pages does not contains actual data, the rows in intermediate level ( and root ) pages are called index row and each one row points to the one pages in next level. For example if root level page contains 10 rows and each row points to one page in next level, that means we have 10 pages in next level. If an intermediate level page contains 50 rows, then it points to the 50 pages in next level.
According to our previous calculation we have 1429 pages that needs to be pointed from intermediate level pages. In other words we need to have 1429 number of row spreads across the pages in one level up. To calculate the number of intermediate pages we need to calculate the row size in intermediate level pages and remember the rows in intermediate level pages are not data row, they are index rows. Let’s examine the index rows. If we look into Level 1 pages of Visual 5, we can see the index row like “71, 1, 2001, …..”. 71 represents index key value, 1 represent fileID ( In which file data is ?) and 2001 represent pageID ( In which page data is ?). Keeping all this in mind we can easily calculate index row size.
Index row size = index key value + pointer + row overhead
= index key value + (fileID + pageID) + row overhead
= 4 bytes + (2 bytes + 4 bytes) + 1 byte
= 11 bytes
The minimum row overhead is 1 byte. I will discuss this in another article related to data pages.
Number of rows in an intermediate level page is
= 8096 bytes per page / (11 bytes index row + 2 bytes slot array)
= 8096/13
= 622
Any page having more than one row will have slot array of 2 bytes for each row, so we have added 2 bytes per row in above calculation.
We need to store 1429 rows in intermediate level. So the number of pages required to store 1429 rows = 1429 / 622 = 2.3 = 3.
We need only 3 pages in intermediate level and first two pages will have 622 rows each and points to 622 pages in next level. The 3rd page will have 185 rows pointing to 185 pages in next level.
Visual 5: Cluster index structure
The rows in intermediate level pages shows index key value, file ID and page ID.
Every balanced tree must have a single root page. According to our calculation a page can have 622 index row. So to point to 3 intermediate level pages, we just need a root page having 3 rows as shown in figure below.
For 100,000 rows of STUDENT table we need only three level in balance three. With only 3 level we can accommodate a huge number of rows. See the root level has only 3 rows where it can accommodate 622 index rows and each row is actually pointing to (43541 - 1) = 43540 or
(87081 - 43541) = 43540 records. So this root page can hold (622 * 43540) = 27,081,880 records before adding another level in the tree.
One important observation is the key choice for clustering key. It’s an integer and takes only 4 bytes thus accommodating more rows in pages (any level). Had it been a 16 byte GUID, the number of pages would have been increased substantially in each level thus we could have one more level in the above balanced tree.
Suppose we want to find the name of the student having StudentID 73. We don’t need to scan all the data pages (leaf level pages) of the balanced tree to find StudentID 73. We just need to scan 3 pages. First look into root pages, it tells us StudentID 73 related information is available in page 4111 of file 1. Second scan page 4111, it tells us StudentID 73 related information is available in page 2001 of file 1. Third scan page 2001 to find the name of student having StudentID 73 and the name is “Kim”. This is how the SQL Server works internally while using clustered index.
- Summation IT
- Wednesday 24 January 2018 |
- Database Development |
Categories
Recent Posts
Common Vulnerabilities in Web Applications:
Security in a website is the most important factor needs to be taken care of if considered. Are your website and user data safe and secure?Android Marshmallow runtime permissions
Security in a website is the most important factor needs to be taken care of if considered. Are your website and user data safe and secure?SignalR – Why, What and How?
An increasing number of software out there namely websites and web applications today offer or need to offer real-time dataIndexing in SQL Server
An increasing number of software out there namely websites and web applications today offer or need to offer real-time data
