Understanding Database Indexes

What is a database Index?

A database index is a special data structure that helps to speed up the process of searching for specific rows in a database table. It does this by organizing the data in the table according to the values in one or more columns. An index can be thought of as a table of contents in a book, which helps the reader to quickly find the desired chapter or page.

How Does an Index Work?

If you have a database with more than a million entries and you need to find a specific record, the database engine will scan each row one by one until it finds a match for the record you are searching for, and then return the result. However, with so many records, this scanning process could take several minutes, causing the application to slow down and become almost unusable.

Now this is where the Database Index shines, with an index things work a little bit differently, when you create an index on a column the database sorts the values in that column and creates a data structure that associates each value in the indexed column with a pointer to the corresponding row in the actual table.

How To Create a Database Index

We will be using a running example to buttress further the concept of Database Indexing. It is assumed that you have MYSQL installed on your machine.

First, we create a new Database with the following command.

CREATE DATABASE index_demo;

Then we create a table inside our brand-new database with the following command.

CREATE TABLE customer (
    cust_name VARCHAR(20) NOT NULL, 
    age INT, 
    email VARCHAR(20), 
    phone_no VARCHAR(20) );

Now we will use this code to populate the table we just created inside our database, you can add multiple records to the table.

INSERT INTO customer (name, age, email, phone_no)
VALUES ('Samuel', 30, 'samuel@gmail.com', '2347067322873');

Cool stuff!

Now we have some dummy data inside our database. Since our database is not indexed yet if we run a SELECT query using the EXPLAIN command, we get this feedback:

EXPLAIN SELECT * FROM customer where cust_name = 'Samuel';

EXPLAIN shows how the query engine plans to execute the query. In the above screenshot, you can see that the rows column returns 5 & possible_keys returns null. possible_keys represents what all available indices are there which can be used in this query. The key column represents which index is actually going to be used out of all possible indices in this query.

PRIMARY KEYS

If you noticed when creating the customer table columns, we did not explicitly declare a Primary Key for our table, we will now make the EMAIL field our Primary Key, assuming that no 2 users in our database should have the same email address. Just before we do that, please bear in mind the following criteria when creating a Primary Key:

  • A primary key should be part of many vital queries in your application.

  • A primary key is a constraint that uniquely identifies each row in a table. If multiple columns are part of the primary key, that combination should be unique for each row.

  • The primary key should be Non-null. Never make null-able fields your primary key. By ANSI SQL standards, primary keys should be comparable to each other, and you should definitely be able to tell whether the primary key column value for a particular row is greater, smaller or equal to the same from other row. Since NULL means an undefined value in SQL standards, you can’t deterministically compare NULL with any other value, so logically NULL is not allowed.

  • The ideal primary key type should be a number like INT or BIGINT because integer comparisons are faster, so traversing through the index will be very fast.

Often we define an id field as AUTO INCREMENT in tables & use that as a primary key, but the choice of a primary key depends on developers.

What if you don’t create any primary key yourself?

For this tutorial, we used MYSQL with the default InnoDB database engine, if you did not explicitly declare a Primary Key, innoDB by default implicitly creates one for you because InnoDB by design must have a primary key in every table. So once you create a primary key later on for that table, InnoDB deletes the previously auto defined primary key.

Since we don’t have any primary key defined as of now, let’s see what InnoDB by default created for us, run the command:

SHOW EXTENDED INDEX FROM customer;

EXTENDED shows all the indices that are not usable by the user but managed completely by MySQL.

Here we see that MySQL has defined a composite index on DB_ROW_ID , DB_TRX_ID, DB_ROLL_PTR, and all columns defined in the table. In the absence of a user defined primary key, this index is used to find records uniquely.

Let’s now create the primary index on email and examine the created index:

ALTER TABLE customer ADD PRIMARY KEY (email)
SHOW INDEXES FROM customer;

In the above screenshot, we see that one primary index is created on the column email. The columns of the image are described as follows:

Table : The table on which the index is created.

Non_unique: If the value is 1, the index is not unique, if the value is 0, the index is unique. From our example, the email column is unique hence we are seeing a 1 there.

Key_name: The name of the index created. The name of the primary index is always PRIMARY in MySQL, irrespective of if you have provided any index name or not while creating the index.

Seq_in_index: The sequence number of the column in the index. If multiple columns are part of the index, the sequence number will be assigned based on how the columns were ordered during the index creation time. The sequence number starts from 1.

Collation: how the column is sorted in the index. A means ascending, D means descending, NULL means not sorted.

Cardinality: The estimated number of unique values in the index. More cardinality means higher chance that the query optimizer will pick the index for queries.

Sub_part : The index prefix. It is NULL if the entire column is indexed. Otherwise, it shows the number of indexed bytes in case the column is partially indexed.

Packed: Indicates how the key is packed; NULL if it is not.

Null: YES if the column may contain NULL values and blank if it does not.

Index_type: Indicates which indexing data structure is used for this index. Some possible candidates are — BTREE, HASH, RTREE, or FULLTEXT.

Comment: The information about the index not described in its own column.

Index_comment: The comment for the index specified when you created the index with the COMMENT attribute.

Now let’s see if this index reduces the number of rows which will be searched for a given email in the WHERE clause of a query. Remember before creating an index, the database had to query all the rows in the table to find a match for the resource we were looking for, let us see if there is any improvement on that.

EXPLAIN SELECT * FROM customer WHERE email = 'samuel@gmail.com';

In this snapshot, notice that the rows column has returned 1 only, the possible_keys & key both returns PRIMARY . So it essentially means that using the primary index named as PRIMARY (the name is auto assigned when you create the primary key), the query optimizer just goes directly to the record & fetches it. It’s very efficient. This is exactly what an index is for — to minimize the search scope at the cost of extra space.

Using the Create Index Command to Create Database Index

You will notice that we did not explicitly create an index on the email column, we just made the email column a Primary Key and mysql automatically created an index on that column. Often times when building applications you might need to explicitly create an index, this can be done using the create index command.

CREATE INDEX index_demo
ON customer (email);

This creates an index on the email column on the customer table.

When Should You Use a Database Index?

Indexes are most useful when a database is performing more read operations than write operations. Let’s pretend we are building a site like IMDB. Our database has many tables, including a movies table. Our movies table has the columns: title, genre, rating, and release_date. While IMDB adds (or writes) movies to their movies table quite frequently, I’m willing to bet the number of times a user performs a search on the website to query movies is far greater.

Since data retrieval operations on our movies table are the most common action, we will be well served to add indexes to the most commonly queried columns. Let’s add an index to the title column since this is the most common lookup in the movies table. We will sort the newly created index alphabetically. So our new index is essentially a copy of the title column in alphabetical order with pointers to the original rows of each unordered title in the table. Now, when a user searches for a movie by its title, it can be found much faster since the database has a better idea of where to search for it.

Downsides of database indexes

Let's go over some of the possible downsides of using too many database secondary indexes.

Additional storage

The first and perhaps most obvious drawback of adding indexes is that they take up additional storage space. The exact amount of space depends on the size of the table and the number of columns in the index, but it's usually a small percentage of the total size of the table. A basic index only needs to store the values of the indexed columns as well as a pointer to the row in the table. So for a column that contains integers, the index will only need to store the integer values. This space will increase if the column contains strings because the index will need to store the string values as well as the length of each string.

This is important to consider if you have large datasets, as adding multiple indexes to a table can quickly use a significant amount of additional storage space.

Slower writes

When you add an index, it has to be updated whenever a row is inserted, updated, or deleted. This means that writes will be slower. Before you add an index, you should consider whether you will be doing a lot of writes to the table and whether or not you can afford to slow down the writes.