Is there a measurable performance difference between using INT vs. VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US States, Country Codes) and a coworker won't budge on the INT AUTO_INCREMENT as a primary key for all tables.
My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible, since every INT foreign key reference will require a JOIN to make sense of the reference, a VARCHAR key will directly present the information.
So, does anyone have experience with this particular use-case and the performance concerns associated with it?
You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess if the benefit of this is significant in your application.
That is, you can measure the queries in your application that are the most important to be speedy, because they work with large volumes of data or they are executed very frequently. If these queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.
Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.
Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.
I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.
Note though that I don't do it on a regular basic, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in comments.
The setup was as follows:
Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
15.6 GiB RAM, of which I ensured around 8 GB was free during the test.
148.6 GB SSD drive, with plenty of free space.
Ubuntu 16.04 64-bit
MySQL Ver 14.14 Distrib 5.7.20, for Linux (x86_64)
The tables:
create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;
Then, I filled 10 million rows in each table with a PHP script whose essence is like this:
$pdo = get_pdo();
$keys = [ 'alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__' ];
for ($k = 0; $k < 10; $k++) {
for ($j = 0; $j < 1000; $j++) {
$val = '';
for ($i = 0; $i < 1000; $i++) {
$val .= '("' . generate_random_string() . '", ' . rand (0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
}
$val = rtrim($val, ',');
$pdo->query('INSERT INTO jan_char VALUES ' . $val);
}
echo "\n" . ($k + 1) . ' millon(s) rows inserted.';
}
For int
tables, the bit ($keys[rand(0, 9)])
was replaced with just rand(0, 9)
, and for varchar
tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string()
generates a 10-character random string.
Then I ran in MySQL:
SET SESSION query_cache_type=0;
For jan_int table: SELECT count(*) FROM jan_int WHERE myindex = 5; SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
SELECT count(*) FROM jan_int WHERE myindex = 5;
SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
For other tables, same as above, with myindex = 'califo' for char tables and myindex = 'california' for varchar tables.
Times of the BENCHMARK
query on each table:
jan_int: 21.30 sec
jan_int_index: 18.79 sec
jan_char: 21.70 sec
jan_char_index: 18.85 sec
jan_varchar: 21.76 sec
jan_varchar_index: 18.86 sec
Regarding table & index sizes, here's the output of show table status from janperformancetest;
(w/ a few columns not shown):
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int | InnoDB | 10 | Dynamic | 9739094 | 43 | 422510592 | 0 | 0 | 4194304 | NULL | utf8mb4_unicode_520_ci |
| jan_int_index | InnoDB | 10 | Dynamic | 9740329 | 43 | 420413440 | 0 | 132857856 | 7340032 | NULL | utf8mb4_unicode_520_ci |
| jan_char | InnoDB | 10 | Dynamic | 9726613 | 51 | 500170752 | 0 | 0 | 5242880 | NULL | utf8mb4_unicode_520_ci |
| jan_char_index | InnoDB | 10 | Dynamic | 9719059 | 52 | 513802240 | 0 | 202342400 | 5242880 | NULL | utf8mb4_unicode_520_ci |
| jan_varchar | InnoDB | 10 | Dynamic | 9722049 | 53 | 521142272 | 0 | 0 | 7340032 | NULL | utf8mb4_unicode_520_ci |
| jan_varchar_index | InnoDB | 10 | Dynamic | 9738381 | 49 | 486539264 | 0 | 202375168 | 7340032 | NULL | utf8mb4_unicode_520_ci |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
My conclusion is that there's no performance difference for this particular use case.
INDEX
instead of PRIMARY KEY
. I don't remember my reasoning - I probably assumed PRIMARY KEY
is just an INDEX
with uniqueness constraint. However, reading the section on how things are stored in InnoDB in federico-razzoli.com/primary-key-in-innodb, I think that my results still apply to primary keys, and answer the question on value lookup performance difference. Also, your comment suggests looking at performance of sorting algorithms, which do not apply to the use case I investigate, which is looking up values in a set.
It's not about performance. It's about what makes a good primary key. Unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience is that is seldom so.
INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.
Depends on the length.. If the varchar will be 20 characters, and the int is 4, then if you use an int, your index will have FIVE times as many nodes per page of index space on disk... That means that traversing the index will require one fifth as many physical and/or logical reads..
So, if performance is an issue, given the opportunity, always use an integral non-meaningful key (called a surrogate) for your tables, and for Foreign Keys that reference the rows in these tables...
At the same time, to guarantee data consistency, every table where it matters should also have a meaningful non-numeric alternate key, (or unique Index) to ensure that duplicate rows cannot be inserted (duplicate based on meaningful table attributes) .
For the specific use you are talking about (like state lookups ) it really doesn't matter because the size of the table is so small.. In general there is no impact on performance from indices on tables with less than a few thousand rows...
Absolutely not.
I have done several... several... performance checks between INT, VARCHAR, and CHAR.
10 million record table with a PRIMARY KEY (unique and clustered) had the exact same speed and performance (and subtree cost) no matter which of the three I used.
That being said... use whatever is best for your application. Don't worry about the performance.
For short codes, there's probably no difference. This is especially true as the table holding these codes are likely to be very small (a couple thousand rows at most) and not change often (when is the last time we added a new US State).
For larger tables with a wider variation among the key, this can be dangerous. Think about using e-mail address/user name from a User table, for example. What happens when you have a few million users and some of those users have long names or e-mail addresses. Now any time you need to join this table using that key it becomes much more expensive.
As for Primary Key, whatever physically makes a row unique should be determined as the primary key.
For a reference as a foreign key, using an auto incrementing integer as a surrogate is a nice idea for two main reasons. - First, there's less overhead incurred in the join usually. - Second, if you need to update the table that contains the unique varchar then the update has to cascade down to all the child tables and update all of them as well as the indexes, whereas with the int surrogate, it only has to update the master table and it's indexes.
The drawaback to using the surrogate is that you could possibly allow changing of the meaning of the surrogate:
ex.
id value
1 A
2 B
3 C
Update 3 to D
id value
1 A
2 B
3 D
Update 2 to C
id value
1 A
2 C
3 D
Update 3 to B
id value
1 A
2 C
3 B
It all depends on what you really need to worry about in your structure and what means most.
Common cases where a surrogate AUTO_INCREMENT
hurts:
A common schema pattern is a many-to-many mapping:
CREATE TABLE map (
id ... AUTO_INCREMENT,
foo_id ...,
bar_id ...,
PRIMARY KEY(id),
UNIQUE(foo_id, bar_id),
INDEX(bar_id) );
Performance of this pattern is much better, especially when using InnoDB:
CREATE TABLE map (
# No surrogate
foo_id ...,
bar_id ...,
PRIMARY KEY(foo_id, bar_id),
INDEX (bar_id, foo_id) );
Why?
InnoDB secondary keys need an extra lookup; by moving the pair into the PK, that is avoided for one direction.
The secondary index is "covering", so it does not need the extra lookup.
This table is smaller because of getting rid of id and one index.
Another case (country):
country_id INT ...
-- versus
country_code CHAR(2) CHARACTER SET ascii
All too often the novice normalizes country_code into a 4-byte INT
instead of using a 'natural' 2-byte, nearly-unchanging 2-byte string. Faster, smaller, fewer JOINs, more readable.
At HauteLook, we changed many of our tables to use natural keys. We did experience a real-world increase in performance. As you mention, many of our queries now use less joins which makes the queries more performant. We will even use a composite primary key if it makes sense. That being said, some tables are just easier to work with if they have a surrogate key.
Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The 3rd party can rely on the fact that the surrogate key will change only in very rare circumstances.
I faced the same dilemma. I made a DW (Constellation schema) with 3 fact tables, Road Accidents, Vehicles in Accidents and Casualties in Accidents. Data includes all accidents recorded in UK from 1979 to 2012, and 60 dimension tables. All together, about 20 million records.
Fact tables relationships:
+----------+ +---------+
| Accident |>--------<| Vehicle |
+-----v----+ 1 * +----v----+
1| |1
| +----------+ |
+---<| Casualty |>---+
* +----------+ *
RDMS: MySQL 5.6
Natively the Accident index is a varchar(numbers and letters), with 15 digits. I tried not to have surrogate keys, once the accident indexes would never change. In a i7(8 cores) computer, the DW became too slow to query after 12 million records of load depending of the dimensions. After a lot of re-work and adding bigint surrogate keys I got a average 20% speed performance boost. Yet to low performance gain, but valid try. Im working in MySQL tuning and clustering.
The question is about MySQL so I say there is a significant difference. If it was about Oracle (which stores numbers as string - yes, I couldn't believe it at first) then not much difference.
Storage in the table is not the issue but updating and referring to the index is. Queries involving looking up a record based on its primary key are frequent - you want them to occur as fast as possible because they happen so often.
The thing is a CPU deals with 4 byte and 8 byte integers naturally, in silicon. It's REALLY fast for it to compare two integers - it happens in one or two clock cycles.
Now look at a string - it's made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead the strings' characters must be iterated until a difference is found. I'm sure there are tricks to make it faster in some databases but that's irrelevant here because an int comparison is done naturally and lightning fast in silicon by the CPU.
My general rule - every primary key should be an autoincrementing INT especially in OO apps using an ORM (Hibernate, Datanucleus, whatever) where there's lots of relationships between objects - they'll usually always be implemented as a simple FK and the ability for the DB to resolve those fast is important to your app' s responsiveness.
Allow me to say yes there is definitely a difference, taking into consideration the scope of performance (Out of the box definition):
1- Using surrogate int is faster in application because you do not need to use ToUpper(), ToLower(), ToUpperInvarient(), or ToLowerInvarient() in your code or in your query and these 4 functions have different performance benchmarks. See Microsoft performance rules on this. (performance of application)
2- Using surrogate int guarantees not changing the key over time. Even country codes may change, see Wikipedia how ISO codes changed over time. That would take lots of time to change the primary key for subtrees. (performance of data maintenance)
3- It seems there are issues with ORM solutions, such as NHibernate when PK/FK is not int. (developer performance)
Not sure about the performance implications, but it seems a possible compromise, at least during development, would be to include both the auto-incremented, integer "surrogate" key, as well as your intended, unique, "natural" key. This would give you the opportunity to evaluate performance, as well as other possible issues, including the changeability of natural keys.
As usual, there are no blanket answers. 'It depends!' and I am not being facetious. My understanding of the original question was for keys on small tables - like Country (integer id or char/varchar code) being a foreign key to a potentially huge table like address/contact table.
There are two scenarios here when you want data back from the DB. First is a list/search kind of query where you want to list all the contacts with state and country codes or names (ids will not help and hence will need a lookup). The other is a get scenario on primary key which shows a single contact record where the name of the state, country needs to be shown.
For the latter get, it probably does not matter what the FK is based on since we are bringing together tables for a single record or a few records and on key reads. The former (search or list) scenario may be impacted by our choice. Since it is required to show country (at least a recognizable code and perhaps even the search itself includes a country code), not having to join another table through a surrogate key can potentially (I am just being cautious here because I have not actually tested this, but seems highly probable) improve performance; notwithstanding the fact that it certainly helps with the search.
As codes are small in size - not more than 3 chars usually for country and state, it may be okay to use the natural keys as foreign keys in this scenario.
The other scenario where keys are dependent on longer varchar values and perhaps on larger tables; the surrogate key probably has the advantage.
Success story sharing