DATABASE DESIGN CONCEPTS AND SQL STANDARDS: February 2010

MySQL 5 Storage Engines

By Ian Gilfillan

New Storage Engines in MySQL 5

MySQL 5 offers a number of new storage engines (previously called table types). In addition to the default MyISAM storage engine, and the InnoDB, BDB, HEAP and MERGE storage engines, there are four new types: CSV, ARCHIVE, FEDERATED and EXAMPLE, as well as a new name for the HEAP storage engine. It is now called the MEMORY storage engine. None of the new types are available by default - you can check for sure with the SHOW ENGINES statement. Here is what is on my default version of MySQL Max:

mysql> SHOW ENGINES;

+------------+---------+------------------------------------------------------------+
| Engine     | Support | Comment                                                    |
+------------+---------+------------------------------------------------------------+
| MyISAM     | DEFAULT | Default engine as of MySQL 3.23 with great performance     |
| HEAP       | YES     | Alias for MEMORY                                           |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables  |
| MERGE      | YES     | Collection of identical MyISAM tables                      |
| MRG_MYISAM | YES     | Alias for MERGE                                            |
| ISAM       | NO      | Obsolete storage engine, now replaced by MyISAM            |
| MRG_ISAM   | NO      | Obsolete storage engine, now replaced by MERGE             |
| InnoDB     | YES     | Supports transactions, row-level locking, and foreign keys |
| INNOBASE   | YES     | Alias for INNODB                                           |
| BDB        | YES     | Supports transactions and page-level locking               |
| BERKELEYDB | YES     | Alias for BDB                                              |
| NDBCLUSTER | NO      | Clustered, fault-tolerant, memory-based tables             |
| NDB        | NO      | Alias for NDBCLUSTER                                       |
| EXAMPLE    | NO      | Example storage engine                                     |
| ARCHIVE    | NO      | Archive storage engine                                     |
| CSV        | NO      | CSV storage engine                                         |
+------------+---------+------------------------------------------------------------+

To add support for the missing storage engines, you currently need to build MySQL with certain options. It is likely, though, that binary versions will include these storage engines by default at some point. Until then, there is no other way to enable them.

Changes in the MEMORY storage engine

You can read my 2003 article on the HEAP table type as a start, as most of the detail has not changed, and refer below for modifications in MySQL 5.
Previously, the HEAP storage engine only made use of hash indexes. These find specific matches extremely quickly, but cannot return any kind of range data: an index entry matches a record, but there is no ordering that would allow subsequent records to be returned. Only the complete index can be used; the concept of leftmost prefixing (using the left part of an index) does not apply. The MEMORY storage engine now permits BTREE indexes as well (the kind used by MyISAM tables by default).
To specify an index type, use the USING clause, as in the following examples:

CREATE TABLE memory_table (f1 INT, INDEX USING BTREE (f1)) ENGINE = MEMORY;
or
CREATE TABLE memory_table (f1 INT, INDEX USING HASH (f1)) ENGINE = MEMORY;

The HASH index is still the default, and will be the type of index created if you do not specify a particular kind.
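The difference between the two index types can be sketched outside MySQL. The Python snippet below is only an illustration, not how the MEMORY engine is implemented: a dictionary stands in for a HASH index and a sorted list for a BTREE index, and the sample values are made up.

```python
from bisect import bisect_left, bisect_right

# Made-up values of column f1, one per row (row id = position in the list).
rows = [42, 7, 19, 7, 88, 23]

# Stand-in for a HASH index: value -> list of row ids.
# Exact matches are one lookup, but the keys carry no ordering.
hash_index = {}
for row_id, value in enumerate(rows):
    hash_index.setdefault(value, []).append(row_id)

# Stand-in for a BTREE index: (value, row id) pairs kept in sorted order.
sorted_index = sorted((value, row_id) for row_id, value in enumerate(rows))

def hash_lookup(value):
    """Exact match using the hash-style index."""
    return hash_index.get(value, [])

def range_lookup(low, high):
    """Row ids with low <= value <= high, using the ordered index."""
    start = bisect_left(sorted_index, (low, -1))
    end = bisect_right(sorted_index, (high, len(rows)))
    return [row_id for _, row_id in sorted_index[start:end]]

print(hash_lookup(7))        # [1, 3]
print(range_lookup(10, 45))  # [2, 5, 0]
```

The ordered structure answers the range query with two binary searches, while the dictionary could only answer it by visiting every key. That is the practical reason to ask for USING BTREE when a MEMORY table is queried by range.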

In addition, MEMORY tables:

Now support AUTO_INCREMENT
Now support INSERT DELAYED
Support indexes on columns that can contain NULL values
Never get converted to disk tables. (Temporary internal tables are automatically converted to disk tables if they get too big; MEMORY tables never are.) The max_heap_table_size variable (it hasn't yet changed its name to reflect the new storage engine name) places a limit on the memory utilization of MEMORY tables, and you can always place a MAX_ROWS limit as well when creating the table.

The EXAMPLE storage engine

Added in MySQL 4.1.3 and only of interest to developers, the EXAMPLE storage engine does nothing, but is there to provide simple source code for developers to base new storage engines on. For those interested, the source code can be found in the sql/examples directory.

The FEDERATED storage engine

The FEDERATED storage engine was added in MySQL 5.0.3. To make use of it, you need to pass the --with-federated-storage-engine option to configure when building MySQL. The FEDERATED storage engine allows you to access data from a table on another database server. That table can make use of any storage engine. Let's see it in action. First, CREATE a table on a remote server (you can do this on the same server for testing purposes, but doing so is fairly pointless otherwise).

CREATE TABLE myisam_table (f1 INT, PRIMARY KEY(f1)) ENGINE=MYISAM;

Since this is a MyISAM table (FEDERATED tables can access tables of any type), the above statement creates a definition file (.frm), an index file (.MYI) and a data file (.MYD). If you had created an InnoDB table, MySQL would create a definition file (.frm), and store the data and indexes in the InnoDB tablespace (or in a per-table .ibd file if innodb_file_per_table is enabled). Now create the FEDERATED table on another server. The original table must always exist first:

CREATE TABLE federated_table (f1 INT, PRIMARY KEY(f1)) ENGINE=FEDERATED
COMMENT='mysql://username:password@hostname.co.za:3306/dbname/myisam_table';

This creates a definition file (.frm), but the data and index files are those of the table on the remote server. The only unusual syntax is the COMMENT, which supplies the username, password (optional), hostname, port (optional), database and table name. This method is not particularly elegant, or secure, as the password is stored in clear text, available to anyone who has access to the table data. However, in most cases it is likely that whoever has access to the FEDERATED table can also have access to the remote table, so this should not be too much of an issue. Bear in mind that this method of connecting will likely change in a future version (from MySQL 5.0.13, the CONNECTION clause replaces COMMENT for this purpose).
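The parts of the connection string are easier to see when it is pulled apart. The Python sketch below is only an illustration using the standard urllib.parse module. It assumes the string follows ordinary URL layout, with the database and table as the two path segments, and it is not the parser MySQL itself uses.

```python
from urllib.parse import urlsplit

# The connection string from the FEDERATED example above.
conn = "mysql://username:password@hostname.co.za:3306/dbname/myisam_table"

parts = urlsplit(conn)

# The path is "/dbname/myisam_table": the database followed by the table.
database, table = parts.path.strip("/").split("/")

components = {
    "scheme": parts.scheme,
    "username": parts.username,
    "password": parts.password,   # stored in clear text in the table definition
    "hostname": parts.hostname,
    "port": parts.port,           # an int, or None when the port is omitted
    "database": database,
    "table": table,
}

for name, value in components.items():
    print(name, "=", value)
```

Printing the components makes the security point concrete: the password is recoverable by anyone who can read the string.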
There are some limitations on the use of FEDERATED tables. They are useful for easily accessing data on a different server, but fall short in many areas:

No Data Definition Language statements are permitted (such as DROP TABLE, ALTER TABLE)
They do not support transactions (since the remote server is contacted once, and the results returned to the local server)
Similarly, there is no way of being sure that the integrity of the local data is intact.
No prepared statements.

The CSV storage engine

The CSV storage engine was added in MySQL 4.1.4. Tables of type CSV are actually just comma-delimited text files. This can be quite useful, as they allow MySQL to interact easily with other applications that make use of CSV files, such as spreadsheets. They make no use of any sort of indexing. To enable this storage engine, use the --with-csv-storage-engine configure option when building MySQL.
Let's see how this works. You can import an existing CSV file. Assume you have a file containing firstname, surname and age, as follows:

"Jacob","Mbezela","42"
"Schabir","Field","29"

First, create the .frm definition file, as follows:

mysql> CREATE TABLE csv_names(firstname CHAR(30), surname CHAR(40), age INT) ENGINE = CSV;

The blank data file is also created. Since the CSV file is just a plain text file, you can copy an existing CSV file into the same location, and it will be viewable from the MySQL client, as follows:

mysql> SELECT * FROM csv_names;
+-----------+------------+-----+
| firstname | surname    | age |
+-----------+------------+-----+
| Jacob     | Mbezela    |  42 |
| Schabir   | Field      |  29 |
+-----------+------------+-----+

With no indexes, the SELECT is not at all efficient, and performs a complete table scan. Conversely, you can INSERT a record from the MySQL client:

mysql> INSERT INTO csv_names VALUES('Quinton','Baxter','75');

and view the change in the CSV file:

"Jacob","Mbezela","42"
"Schabir","Field","29"
"Quinton","Baxter","75"

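Because the data file is plain text, other programs can read and write the same format. The Python sketch below is an illustration only: it works on an in-memory copy of the CSV example rather than on MySQL's real data file, and it assumes every field is double-quoted as in the listing above.

```python
import csv
import io

# In-memory stand-in for the existing data file.
existing = '"Jacob","Mbezela","42"\n"Schabir","Field","29"\n'

# Read the rows the way a spreadsheet or script would.
rows = list(csv.reader(io.StringIO(existing)))

# Write them back with every field quoted, then append one more record.
out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
writer.writerow(["Quinton", "Baxter", "75"])

print(out.getvalue(), end="")
```

Note that every value, including the age, comes back as a string. The column types live only in the .frm definition, not in the text file.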
The ARCHIVE storage engine

Added in MySQL 4.1.3, the ARCHIVE storage engine lives up to its name by storing large amounts of data without taking up too much space. It too makes no use of any sort of indexing, and there are no means to repair the table should it become corrupted during a crash. To enable this storage engine, use the --with-archive-storage-engine configure option when building MySQL.

mysql> CREATE TABLE archive_names(firstname CHAR(30), surname CHAR(40), age INT) ENGINE = ARCHIVE;

This, as always, creates a .frm definition file, as well as .ARZ and .ARM data and metadata files.

Because the table is an archive, you cannot DELETE, UPDATE or REPLACE records - you can only INSERT and SELECT. Again, with no indexes, the SELECT needs to perform a complete table scan. Although the records are compressed upon insertion, OPTIMIZE TABLE can compress the entire dataset even further. A .ARN file will temporarily appear when this occurs.

mysql> INSERT INTO archive_names VALUES('Quinton','Baxter','75');
mysql> SELECT * FROM archive_names;

+-----------+------------+-----+
| firstname | surname    | age |
+-----------+------------+-----+
| Quinton   | Baxter     |  75 |
+-----------+------------+-----+

Conclusion
The new storage engines, whilst tricky for most people to use since they require rebuilding MySQL, can be useful for those with special needs. After all, people have taken the time to write them! They are bound to be included in the binaries at some point, so if you are dying to try them, but rebuilding MySQL is not for you, you hopefully do not have to be patient for long. Even if you are able to use them now, keep an eye on the official documentation, as things are likely to change as they approach maturity. Good luck!

Optimizing MySQL: Queries and Indexes

By Ian Gilfillan

You know the scene. The database is just too slow. Queries are queuing up, backlogs growing, users being refused connection. Management is ready to spend millions on "upgrading" to some other system, when the problem is really that MySQL is simply not being used properly. Badly defined or non-existent indexes are one of the primary reasons for poor performance, and fixing these can often lead to phenomenal improvements. Consider an extreme example:


CREATE TABLE employee (
   employee_number char(10) NOT NULL,
   firstname varchar(40),
   surname varchar(40),
   address text,
   tel_no varchar(25),
   salary int(11),
   overtime_rate int(10) NOT NULL
);

To find employee Fred Jones's salary (employee number 101832), you run:
SELECT salary FROM employee WHERE employee_number = '101832';
MySQL has no clue where to find this record. It doesn't even know that, if it does find a matching record, there will not be another one, so it has to look through the entire table, potentially thousands of records, to find Fred's details.
An index is a separate file that is sorted, and contains only the field(s) you're interested in sorting on. If you create an index on employee_number, MySQL can find the corresponding record very quickly. Indexes work in a very similar way to the index in a book. Imagine paging through a technical book (or more often, a scrambled pile of notes!) looking for the topic "Optimizing MySQL". An index saves you an immense amount of time!
Before we repair the table structure above, let me tell you about a most important little secret for anyone serious about optimizing their queries: EXPLAIN. EXPLAIN shows (explains!) how your queries are being executed. By putting it before a SELECT, you can see whether indexes are being used properly, and what kind of join is being performed...
For example:


EXPLAIN SELECT employee_number,firstname,surname FROM employee WHERE employee_number= '10875';

+----------+------+---------------+------+---------+------+------+------------+
| table    | type | possible_keys | key  | key_len | ref  | rows | Extra      |
+----------+------+---------------+------+---------+------+------+------------+
| employee | ALL  | NULL          | NULL |    NULL | NULL |    2 | where used |
+----------+------+---------------+------+---------+------+------+------------+

So what are all these things?

table shows us which table the output is about (for when you join many tables in the query)
type is an important one - it tells us which type of join is being used. From best to worst the types are: system, const, eq_ref, ref, range, index, ALL
possible_keys shows which possible indexes apply to this table
key shows which one is actually used
key_len gives us the length of the key used. The shorter the better.
ref tells us which column, or a constant, is used
rows is the number of rows MySQL believes it must examine to get the data
Extra gives extra info - the bad ones to see here are "using temporary" and "using filesort"

Looks like our query is a shocker, the worst of the worst! There are no possible keys to use, so MySQL has to go through all the records (only 2 in this example, but imagine a really large table).
Now let's add the index we talked about earlier, as a primary key on employee_number:
ALTER TABLE employee ADD PRIMARY KEY(employee_number);
If we re-run the EXPLAIN, we get:

+----------+-------+---------------+---------+---------+-------+------+-------+
| table    | type  | possible_keys | key     | key_len | ref   | rows | Extra |
+----------+-------+---------------+---------+---------+-------+------+-------+
| employee | const | PRIMARY       | PRIMARY |      10 | const |    1 |       |
+----------+-------+---------------+---------+---------+-------+------+-------+

The query above is a good one (it almost falls into the category of "couldn't be better"). The type of "join" (not really a join in the case of this simple query) is "const", which means that the table has only one matching row. The primary key is being used to find this particular record, and the number of rows MySQL thinks it needs to examine to find this record is 1. All of which means MySQL could have run this query thousands of times in the time it took you to read this little explanation.

Some knowledge of how indexes work allows you to use them more efficiently. Firstly, note that when you update a table with an index, you have to update the index as well, so there is a performance price to pay. But unless your system runs many more inserts than selects and the inserts need to be quick, and not the selects, this is a price worth paying.
What if you want to select on more than one criterion? (As you can see, it only makes sense to index those fields you use in the WHERE clause.) The query:
SELECT firstname FROM employee;
makes no use of an index at all. An index on firstname is useless. But,
SELECT firstname FROM employee WHERE surname="Madida";
would benefit from an index on surname.
Let's look at some more complex examples where EXPLAIN can help us improve the query. We want to find all the employees where half their overtime rate is less than $20. Knowing what you do, you correctly decide to add an index on overtime_rate, seeing as that's the column in the where clause.
ALTER TABLE employee ADD INDEX(overtime_rate);
Now let's run the query.
EXPLAIN SELECT firstname FROM employee WHERE overtime_rate/2<20;

+----------+------+---------------+------+---------+------+------+------------+
| table    | type | possible_keys | key  | key_len | ref  | rows | Extra      |
+----------+------+---------------+------+---------+------+------+------------+
| employee | ALL  | NULL          | NULL |    NULL | NULL |    2 | where used |
+----------+------+---------------+------+---------+------+------+------------+

Not good at all! Every single employee record is being read. Why is this? The answer lies in the "overtime_rate/2" part of the query. Every overtime_rate (and hence every record) has to be read in order to divide it by 2. So we should try and leave the indexed field alone, and not perform any calculations on it. How is this possible? This is where your school algebra comes to the rescue! You know that 'x/2 < y' is the same as 'x < y*2'. We can rewrite this query, by seeing if the overtime_rate is less than 20*2. Let's see what happens.
EXPLAIN SELECT firstname FROM employee WHERE overtime_rate<20*2;

+--------+-------+---------------+---------------+---------+------+------+----------+
|table   | type  | possible_keys | key           | key_len | ref  | rows |Extra     |
+--------+-------+---------------+---------------+---------+------+------+----------+
|employee| range | overtime_rate | overtime_rate |       4 | NULL |    1 |where used|
+--------+-------+---------------+---------------+---------+------+------+----------+

Much better! MySQL can perform the 20*2 calculation once, and then search the index for this constant. The principle here is to keep your indexed field standing alone in the comparison, so that MySQL can use it to search, and not have to perform calculations on it.
You may say that I was being unfair, and should have phrased the request as "where the overtime rate is less than 40", but users seem to have a knack of making a request in the worst way possible!
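The same principle can be shown in a few lines of Python. This is only an illustration with made-up overtime rates, where a sorted list stands in for the index on overtime_rate. It is not a model of MySQL's internals.

```python
from bisect import bisect_left

# Made-up overtime rates; the sorted list stands in for the index.
overtime_index = sorted([15, 22, 38, 40, 55, 61, 75, 90])

def half_rate_below(limit):
    """Like WHERE overtime_rate/2 < limit: the expression is computed per entry."""
    examined = 0
    matches = []
    for rate in overtime_index:
        examined += 1
        if rate / 2 < limit:
            matches.append(rate)
    return matches, examined

def rate_below(limit):
    """Like WHERE overtime_rate < limit: one binary search finds the cut-off."""
    cutoff = bisect_left(overtime_index, limit)
    return overtime_index[:cutoff], cutoff

print(half_rate_below(20))  # ([15, 22, 38], 8): every entry examined
print(rate_below(20 * 2))   # ([15, 22, 38], 3): only the matching entries touched
```

Both functions return the same rows, but only the second can take advantage of the ordering, because the column is compared directly with a constant.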

Ordering by surname is a common requirement, so it would make sense to create an index on surname. But in this example our employee table consists of thousands of people from Swaziland with the surname "Dlamini". So we need to index on firstname as well. The good news is that MySQL uses leftmost prefixing, which means that a multi-field index A,B,C will be used to search not only for A,B,C combinations, but also for A,B as well as just A.
In our example, this means that an index of the type
ALTER TABLE employee ADD INDEX(surname,firstname);
is used for queries such as
EXPLAIN SELECT overtime_rate FROM employee WHERE surname='Madida';
as well as
EXPLAIN SELECT overtime_rate FROM employee WHERE surname='Madida' and firstname="Mpho";
which both result in

+--------+------+-------------------+---------+---------+-------+------+-----------+
|table   | type | possible_keys     | key     | key_len | ref   | rows |Extra      |
+--------+------+-------------------+---------+---------+-------+------+-----------+
|employee| ref  | surname,surname_2 | surname |      41 | const |    1 |where used |
+--------+------+-------------------+---------+---------+-------+------+-----------+

However, the query
EXPLAIN SELECT overtime_rate FROM employee WHERE firstname='Mpho';
does not use an index, as firstname is not available from the left of the index, as shown below.

+----------+------+---------------+------+---------+------+------+------------+
| table    | type | possible_keys | key  | key_len | ref  | rows | Extra      |
+----------+------+---------------+------+---------+------+------+------------+
| employee | ALL  | NULL          | NULL |    NULL | NULL |    3 | where used |
+----------+------+---------------+------+---------+------+------+------------+

If you needed this kind of query, you would have to add a separate index on firstname.
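Leftmost prefixing follows directly from the way a multi-field index is sorted. The Python sketch below is an illustration with made-up names, where a sorted list of (surname, firstname) tuples stands in for the index. It is not MySQL's implementation.

```python
from bisect import bisect_left, bisect_right

# Made-up entries; the sorted list stands in for INDEX(surname, firstname).
index = sorted([
    ("Dlamini", "Sipho"), ("Dlamini", "Thandi"), ("Madida", "Mpho"),
    ("Madida", "Zanele"), ("Nkosi", "Mpho"),
])

def by_surname(surname):
    """Leftmost field only: matching entries are adjacent, so binary search works."""
    start = bisect_left(index, (surname,))
    end = bisect_right(index, (surname, "\uffff"))
    return index[start:end]

def by_surname_and_firstname(surname, firstname):
    """Both fields: again a binary search on the sorted entries."""
    start = bisect_left(index, (surname, firstname))
    end = bisect_right(index, (surname, firstname))
    return index[start:end]

def by_firstname(firstname):
    """Not a leftmost prefix: matches are scattered, so every entry is checked."""
    return [entry for entry in index if entry[1] == firstname]

print(by_surname("Madida"))
print(by_surname_and_firstname("Madida", "Mpho"))
print(by_firstname("Mpho"))
```

The two "Mpho" entries sit under different surnames, far apart in the sorted order, which is why a search on firstname alone gains nothing from this index.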

The Query Optimizer, OPTIMIZE and ANALYZE

The magic inside MySQL that decides which keys, if any, to use in the query is called the query optimizer. It takes a quick glance at the indexes to see which are the best to use. Compare it to searching for a CD by the artist "Savuka", called "Third World Child", where there are 2 indexes, one alphabetical by artist name, and the other by album name. At a glance, you see that there are 20000 unique artists, and 400000 unique albums, so you decide to search by artist. But if you knew that there were 50 Savuka albums, and that Third World Child is the only album starting with "T", your search criteria would change. You can provide similar information for the optimizer by running
ANALYZE TABLE tablename;
This stores the key distribution for the table (running ANALYZE is equivalent to running myisamchk -a or myisamchk --analyze).
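To see why knowing the distribution changes the choice, here is a toy version of the CD search in Python. The collection is made up, and the per-value counts are a simplification for illustration. This is not the statistics format MySQL stores, nor the optimizer's actual algorithm.

```python
from collections import Counter

# Made-up CD collection: (artist, album) pairs.
cds = [
    ("Savuka", "Third World Child"), ("Savuka", "Shadow Man"),
    ("Savuka", "Cruel, Crazy, Beautiful World"), ("Juluka", "Scatterlings"),
    ("Juluka", "Universal Men"), ("Mango Groove", "Hometalk"),
]

# Toy "key distribution": how many rows share each value.
artist_counts = Counter(artist for artist, _ in cds)
album_counts = Counter(album for _, album in cds)

def pick_index(artist, album):
    """Choose the index whose lookup is expected to return fewer rows."""
    estimates = {"artist": artist_counts[artist], "album": album_counts[album]}
    return min(estimates, key=estimates.get), estimates

choice, estimates = pick_index("Savuka", "Third World Child")
print(choice, estimates)  # album {'artist': 3, 'album': 1}
```

With the counts available, the album index wins because it narrows the search to a single row. Without them, there would be no basis for preferring one index over the other.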
Many deletes and updates leave gaps in the table (especially when you're using varchar, or in particular text/blob fields). This means more unnecessary disk I/O, as the head needs to skip over these gaps when reading. Running
OPTIMIZE TABLE tablename;
solves this problem. Both of these statements should be run fairly frequently in any well looked after system.
Another technique that most people overlook when indexing is the short index. You don't have to index on the entire field. Our surname and firstname fields are 40 characters each. That means the index we created above is 80 characters. Inserts to this table then also have to write an additional 80 characters, and selects have 80 character blocks to manoeuvre around (disk I/O is the primary hardware bottleneck, but that's for another day!). Try reducing the size of your index - in the example above, rather use:
ALTER TABLE employee ADD INDEX(surname(20),firstname(20));
Now our updates write to an index half the size, and selects have a smaller index to search. Both will be faster (unless you make the indexes too short - imagine a book index that, instead of giving the full word, only contained the first letter of the word! You'd spend a lot of time looking up "semaphore" and "saxophone" when you actually wanted "SQL". Don't do the same to MySQL!)
The same applies to the original field definitions. In these days of ample disk space, we don't often worry about space. But smaller usually means faster, so defining our surname and firstname fields as CHAR(255) would be a mistake if the biggest firstname is never more than 20 characters! You don't want to cut names off, but remember that you can ALTER the field later if conditions change, and you need to allow for more characters. I also suggest using VARCHAR rather than CHAR (variable length characters rather than fixed length characters), even though many don't recommend this as they are more subject to fragmentation. I overcome this by using OPTIMIZE often.
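One rough way to judge how short an index prefix can be is to check how many values stay distinct once they are cut to that length. The Python sketch below is an illustration with a made-up list of surnames; on real data you would run the same kind of check against the actual column values.

```python
# Made-up surnames for illustration.
surnames = ["Dlamini", "Dlomo", "Dube", "Madida", "Madikizela",
            "Mbezela", "Baxter", "Field"]

def prefix_selectivity(values, length):
    """Fraction of distinct values that remain distinct when cut to `length` characters."""
    distinct_full = len(set(values))
    distinct_prefix = len({value[:length] for value in values})
    return distinct_prefix / distinct_full

for length in (1, 2, 4, 20):
    print(length, prefix_selectivity(surnames, length))
# 1 0.5
# 2 0.75
# 4 0.875
# 20 1.0
```

A one-character prefix is the "first letter only" book index: half the names collapse into each other. A prefix that keeps the ratio close to 1.0 gives most of the benefit of the full index at a fraction of the size.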

Most systems need to be highly optimized for selects - take a news site which performs millions of queries per day, but where the data arrives in large batches of text files. So for parts of the day, inserts need to be optimal, without noticeably affecting the millions trying to access the data...
Assuming a nicely formatted '|' delimited text file that we want to insert into the table above, take this piece of PHP code:

if (!($fp = fopen("datafile.txt", "r"))) {              // open the file for reading
  print "\nUnable to open datafile.txt for reading";    // display error
  exit();                                               // end the running of the program
}

while (($sline = fgets($fp, 4096)) !== false) {         // loop through the file line by line
  $sline = rtrim($sline, "\r\n");                       // remove the newline
  if ($sline == "") continue;                           // skip blank lines
  // split the line on "|", populating the individual variables
  list($eno, $fname, $sname, $telno, $salary) = explode("|", $sline);
  // $db is assumed to be an existing database wrapper object; the values are
  // not escaped here, so this is only safe for trusted input
  $db->query("insert into employee(employee_number,firstname,surname,tel_no,salary)
    values('$eno','$fname','$sname','$telno',$salary)");
}                                                       // end while loop
fclose($fp);                                            // close the file

This would work, but would be very slow. The index buffer would be flushed after every insert. Until recently, MyISAM tables (The MySQL default) did not allow data to be inserted at the same time as being read. The new format does, but only if there are no deleted records present (highly unlikely in a heavily used system). So the entire table is locked for the duration of each insert. Fine on a low volume site, but when you're getting hundreds or thousands of queries per second, you'll soon notice the backlog!
There's a solution however - the best way to insert the data is to use MySQL's "LOAD DATA INFILE". This is much faster (20 times according to MySQL), and the only way some systems I've seen are still hanging in there!
The syntax is simple, and the code becomes a lot simpler too:

$db->query("LOAD DATA INFILE 'datafile.txt' INTO TABLE employee FIELDS TERMINATED BY '|' (employee_number,firstname,surname,tel_no,salary)");

LOAD DATA INFILE has defaults of:
FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\'
if you don't specify any of these clauses. And, just as with an ordinary insert, you need to specify a field list if the order of the fields is different, or, as in the example above, you're not inserting data for every field. Always specifying a field list is good practice for all queries anyway - if someone adds a field to the table at a later stage, you don't want to go back and have to fix all your previous INSERT and SELECT * statements.
If you can't get this to work properly, have a look at the format of your text file - every problem I've seen with LOAD DATA has been because of a corrupted text file. Every field in every row must be delimited correctly!
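A quick pre-flight check of the text file can save a lot of head-scratching. The Python sketch below is an illustration only: it assumes the '|' delimiter and the five-field layout used in the example above, uses made-up sample rows, and simply reports lines whose field count is wrong.

```python
import io

EXPECTED_FIELDS = 5  # employee_number, firstname, surname, tel_no, salary

def find_bad_lines(lines, delimiter="|", expected=EXPECTED_FIELDS):
    """Return (line number, field count) for every line with the wrong field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        count = len(line.rstrip("\r\n").split(delimiter))
        if count != expected:
            bad.append((number, count))
    return bad

# Made-up sample standing in for datafile.txt; with a real file you would
# pass the open file object instead.
sample = io.StringIO(
    "101832|Fred|Jones|555-0101|50000\n"
    "101833|Mpho|Madida|555-0102\n"
    "101834|Thandi|Dlamini|555|0103|42000\n"
)
print(find_bad_lines(sample))  # [(2, 4), (3, 6)]
```

Line 2 is missing a field and line 3 has a stray delimiter inside the telephone number, the two most common ways such a file gets corrupted.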
You may not always be inserting from a text file - perhaps your application needs to do many unrelated inserts continually. There are ways to make sure the mass of users selecting are not badly affected... The first is to use INSERT LOW_PRIORITY. This waits until there are no more reads waiting to happen, waiting for the gap, and not pushing in as it were. Of course, if your database is a rush hour special, there may never be a gap, and the client performing the INSERT LOW_PRIORITY may start to grow cobwebs! An alternative here is INSERT DELAYED. The client is immediately freed, and the insert put into a queue (with all the other INSERT DELAYED statements still waiting for the queue to end). This means that no meaningful information can be passed back to the client (such as the auto_increment value), as the INSERT has not been processed when the client is freed. Also, be aware that a catastrophe such as an unexpected power failure will result in the queued INSERTs being lost. With neither of these methods do you have any idea when the data will be inserted, if at all, so I suggest you use them with caution.

Conclusion

It's not only getting the data in that needs to be quick - sometimes you need to get it out quickly too. (Say you've accidentally loaded yesterday's classified ads, for example). Don't do a:
DELETE FROM classifieds;
Rather, use:
TRUNCATE TABLE classifieds;
The difference here is that DELETE removes records one by one, and with a million records that is a million operations too many! Note that this does not apply before version 4.0 of MySQL. At time of writing, most of you will still be using 3.x versions (if you do a DELETE FROM tablename on a non-empty table, and get 0 records back as a result, you're running an earlier version). To fix this problem, MySQL made DELETE remove records one by one so as to return the number of records deleted, but TRUNCATE still did the quick delete. Also, versions earlier than 3.23.33 used TRUNCATE tablename, not TRUNCATE TABLE tablename.
This has only been a brief introduction to optimizing queries and indexes. Once you've mastered these tips, you'll want to look at your more complex joins that just never seem to perform quickly. The MySQL site has lots of useful information, and there are a few good books out there as well. But don't forget EXPLAIN! Often the best way is to try and rewrite the query in as many different ways as possible, and see which one runs more efficiently. You can learn a lot by trying to figure out why one alternative ran faster than the other. Good luck!
Resource list:
www.mysql.com - complete documentation for MySQL
