Best Practices for Datatypes in a MySQL Schema
Brought to you by Rick James
This blog mostly assumes you are using InnoDB, which is the preferred Engine.
When initially creating a schema, especially one for a large table, be sure to minimize the size of each column. Changing a BIGINT to INT will save 4 gigabytes in a billion-row table. (Or more, if it is used in an index.)
The goal of this blog is to help you pick the best datatypes, and to give you tips related to the choices.
Use the smallest type practical. Use UNSIGNED where appropriate, which is most of the time.
Datatype bytes SIGNED UNSIGNED
-------- ----- ----------- ----------
BIGINT 8 ~+/-9e18 0..~18e18
INT 4 ~+/-2 billion 0..~4 billion
MEDIUMINT 3 ~+/-8 million 0..~16 million
SMALLINT 2 -32768..+32767 0..65536
TINYINT 1 -128..+127 0..255
⚈ MEDIUMINT(5) is 3 bytes; the '5' means nothing except...
⚈ zipcode MEDIUMINT(5) ZEROFILL would give you leading zeros for US zipcodes. (Ditto for FIPS)
⚈ Note: MySQL 8.0.19 has removed the spurious length field except when using ZEROFILL.
FLOAT and DOUBLE Datatypes
⚈ FLOAT has about 7 significant digits and a range of about 10^-38 to 10^+38.
⚈ DOUBLE has about 16 significant digits and a range of about 10^-308 to 10^+308.
⚈ Do not use (m,n) on FLOAT or DOUBLE; it only leads to an extra rounding between binary and decimal. If you are tempted to use such, then switch to the DECIMAL datatype.
⚈ When displaying, consider using ROUND() or FORMAT() for limiting the number of decimal places. FORMAT also adds "thousands separators" and optionally understands 'locale' differences such as . versus ,.
⚈ Use DECIMAL for monitary values.
⚈ DECIMAL(m,n) occupies approximately m/2 bytes.
⚈ DECIMAL is always SIGNED.
⚈ There is a limit of (m-n) digits to the left of the decimal point. Hence DECIMAL(8,2) allows up to a million dollars or Euros, to the cent.
⚈ Use CHAR only when the string really is a fixed length.
⚈ VARCHAR is preferred over TINYTEXT/TEXT. There is no practical use of TINYTEXT instead of an equivalent VARCHAR.
⚈ LONGTEXT is likely to run into various non-string limits (packet_size, etc) long before you can get 4 billion bytes.
⚈ The number in CHAR and VARCHAR is 'characters', not 'bytes'. The number in variations of TEXT is 'bytes'.
⚈ CHARACTER SET utf8 takes up to 3 bytes per character for VARCHAR and TEXT.
⚈ For some ROW_FORMATs in InnoDB, CHAR(10) CHARACTER SET utf8 will occupy only 10 bytes. But don't depend on such.
⚈ Don't blindly say VARCHAR(255)`; use a reasonable limit. (Else tmp tables, index limits, etc may have issues.)
⚈ See "Other Datatypes" below for country_code, IP addresses, etc.
⚈ Do not put strictly numeric values into VARCHAR; WHERE clauses may suffer. In particular WHERE varchar_col = 1234 will convert each row's varchar_col to numeric to perform the comparison.
⚈ Use CHARACTER SET ascii (or latin1) for hex, various codes, etc, where no accents are ever possible. (Collate _general_ci in most cases)
⚈ Use Use CHARACTER SET ascii COLLATE ascii_bin for Base64 since different cases mean different things. (BINARY` would also work.)
⚈ Use CHARACTER SET utf8mb4 for general text; this provides full international support. In particular, utf8 is incomplete for Emoji and Chinese.
Suggested character set for non utf8 columns in mysql
VARCHAR vs CHAR vs TEXT
⚈ In almost all cases, VARCHAR is preferrable to CHAR.
⚈ TEXT (of various sizes) is necessary when you get beyond VARCHAR's limit.
⚈ Use CHAR for truly fixed-length columns.
⚈ Almost always fixed-length columns are ascii, so use CHARACTER SET ascii (or latin1).
⚈ If you will be storing text in multiple languages, be sure to have CHARACTER SET utf8mb4. (utf8 is a compromise in older versions.)
VARCHAR vs TEXT
⚈ country_code CHAR(2)
⚈ postal_code CHAR(6) -- this works for Britain, but probably not for the world in general
⚈ uuid, guid, md5, sha256, etc -- because they are hex, plus -
⚈ ipv4, ipv6 -- decimal or hex, plus limited puncutuation.
⚈ phone numbers -- In US, you can use CHAR, but for international, use VARCHAR.
DATE and TIME
⚈ Date_format(Now(), '%Y-%m-%d') is the same as CURRENT_DATE().
⚈ Don't split a DATETIME into two columns (a DATE and a TIME) -- it is usually easier to pick apart a DATETIME than to manipulate two columns
⚈ DATETIME is like a picture of the clock on your wall (in your timezone); TIMESTAMP is an instant in time.
⚈ TIMESTAMP is stored in UTC, and converted from local time when storing, and converted to localtime when fetching.
⚈ If you need more control over timezones, you are on your own.
⚈ (n) can be added to DATETIME/TIME/TIMESTAMP to get "fractional seconds" as low as microseconds (6).
⚈ Do not use any time-oriented column as a UNIQUE (or PRIMARY) key unless you have business logic to confirm that two rows with identical values cannot coexist.
⚈ DATE and DATETIME literals should be strings. You do not need to cast a string to a DATE (etc). That is, do not do CAST('2016-07-14' AS DATE) or DATE('2016-07-14'). More discussion:
⚈ When expressing a date range, don't worry about leap years, number of days in a month, etc; instead express it thus:
AND `Date` >= '2016-02-24'
AND `Date` < '2016-02-24' + INTERVAL 1 WEEK
There are religious battles waging over whether to ever use ENUM. Here are some tips if you choose to use EHUM:
⚈ ENUMs are stored internally as a 1- or 2-byte number.
⚈ Advantage: You use strings instead of numbers; this makes code more readable.
⚈ Up to 255 options: an ENUM is only one byte.
⚈ Typically it is unwise to have an ENUM "nullable". Instead, make the first option something like 'unknown' or 'N/A'.
⚈ Don't have more than a very few options in an ENUM.
⚈ If you do need to add an option to an ENUM, the ALTER is essentially free if you add the new option on the end, and don't exceed 255 options.
⚈ Due to low cardinality, it is rarely useful to index an ENUM by itself. However, it may be quite useful to use it in a 'composite' (multi-column) index.
⚈ If I insert noon in a DATETIME, it will say noon for you wherever you are. But inserting noon into a TIMESTAMP will show up differently on the other side of the world. Which effect do you want?
⚈ Example of an ENUM with a lot of optiona: rating ENUM('unrated', 'G', 'PG', 'PG-13', 'R', 'NC-17') DEFAULT 'unrated'
ENUM vs TINYINT vs VARCHAR
Suppose you have
CREATE TABLE ( ...
col_enum ENUM('yes', 'no', 'maybe'),
col_tiny TINYINT, -- 0='yes', 1='no', 2='maybe'
col_char VARCHAR(5), -- 'yes', 'no', 'maybe'
What are the pros/cons of those 2 choices?
⚈ col_char is the bulkiest;
⚈ col_tiny gives no clue of "meaning";
⚈ col_enum cannot be extended to now values without doing an ALTER
You pick what you like. There is no solid winner.
⚈ Rarely used.
⚈ Think of it as a multi-valued ENUM.
⚈ A SET cannot have more than 64 options.
⚈ It occupies CEIL(n/8) bytes, where n is the number of options.
⚈ It may be easier to use some size of INT instead of a SET, especially if you do boolean arithmetic.
⚈ It is not useful to index a SET column.
⚈ Example (refering to movies): special_features SET('Trailers', 'Commentaries', 'Deleted Scenes', 'Behind the Scenes')
⚈ Latitude/Longitude: DOUBLE is grossly excessive. See this for other options:
⚈ IPv4 is no bigger than 22.214.171.124 -- VARCHAR(15) CHARACTER SET ascii, which can be converted to BINARY(4) via inet_aton(). The former can be compared only for [in]equality; the latter can be used for ranges.
⚈ IPv6 is no bigger than ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff -- VARCHAR(39) CHARACTER SET ascii, which can be converted to BINARY(16) via inet6_aton(). The former can be compared only for [in]equality; the latter can be used for ranges.
⚈ Money: See DECIMAL, but be aware that different countries need different number of decimal places; 4 might be the current max. As for leading digits -- that is rather large, and growing daily.
⚈ Zipcode -- US zip-5 could be stored in CHAR(5) CHARACTER SET ascii or MEDIUMINT(5) UNSIGNED ZEROFILL. (Note the very rare usage of ZEROFILL.) Zip-9, Postal Codes, etc probably deserve VARCHAR(n) CHARACTER SET ascii with a suitable n.
⚈ GUID/UUID -- These need CHAR(36) CHARACTER SET ascii. Better yet, convert to BINARY(16) with functions available in MySQL 8.0 or found in
⚈ MD5/SHA1/etc -- These are like GUIDs/UUIDs, but need different sizes and can be converted via HEX() and UNHEX().
⚈ JSON -- MySQL and MariaDB vary on this.
⚈ JSON -- This is probably desirable: json_encode($str, JSON_UNESCAPED_UNICODE); it avoids \U+1234 codes instead of UTF-8 characters.
NULL vs. NOT NULL
⚈ Unfortunately, NULL is the default.
⚈ If NULLable takes extra space, it is probably minuscule compared to the rest of the space taken. So, I don't make the argument on space.
⚈ A PRIMARY KEY cannot contain any nullable columns.
⚈ A UNIQUE key can contain a nullable column.
⚈ Use NOT NULL unless there you need an 'out of band' NULL value.
Some uses for NULL:
⚈ Don't have the value yet.
⚈ "Not applicable"
⚈ Value (or reference to other table) has been removed
Empty vs. NULL
(except AUTO_INCREMENT, TIMESTAMP...)
If PERSIST, then they take space.
⚈ Always explicitly provide one. While not an absolute requirement, it signals that you have thought about what the PK should be.
⚈ Each secondary key (in InnoDB) contains a copy of the PK; hence,...
⚈ A bulky PK leads to bulky secondary indexes, thereby leading to more disk space.
⚈ The implied rule above is "keep the PK small". In reality, the that rule only applies if you have two or more secondary keys. At one secondary key, the space is pretty much a wash.
⚈ A PK can either be "natural" or "surrogate"...
Natural versus Surrogate PRIMARY KEY
⚈ A "natural" PK is a column (or set of columns) that are 'naturally' unique.
⚈ A "surrogate" PK is (usually) INT UNSIGNED AUTO_INCREMENT. (The INT can be any of its sizes.) (UNSIGNED gives you double the range for free.)
⚈ I find that more than half of tables do have column(s) that could be the "natural" PK.
⚈ Some platforms automatically provide a surrogate PK for all tables. This can hurt performance in some cases. And it can waste disk space.
⚈ An important pattern where "natural" is better than "surrogate" is in a many-to-many mapping table, which has 2 main columns, an those columns together are unique. Hence, that makes a good PK. The opposite ordering makes a good composite index for going the other way. See
⚈ Each secondary key is a separate B+Tree containing the column(s) of the secondary key, plus the column(s) of the PK.
⚈ I believe dup columns are removed.
⚈ Fetching a record using a secondary key involves first drilling down its BTree, then drilling down the BTree that contains the data (and is ordered by the PK).
⚈ If all the columns in a SELECT are found in the secondary key (as augmented by the PK), then the key is said to be "covering". EXLAIN show it by saying Using index (not 'Using index condition'). This gives a slight (sometimes 2x) performance boost.
⚈ When taking advantage of 'covering', it is best to spell out the columns you need. This lets the reader know that you planned for such.
⚈ When you have INDEX(a), INDEX(a,b), toss the former as redundant. (And, in at least one case, it trips up the Optimizer.)
⚈ FKs are good for maintaining the integrity of links between tables.
⚈ FKs cost a small extra effort to do the integrity check.
⚈ FKs need an index.
If you have trouble establishing FKs (error 150, 23000, 23001, etc), do one of these:
⚈ Disable FKs, CREATE TABLEs, re-enable FKs
⚈ Apply the FKs via ALTER after doing the relevant CREATE TABLEs
⚈ (I forget the 3rd.)
One-to-one Mapping / Hierarchy
⚈ A "1:many" mapping is easily implemented by having the id for the "one" in the row of the "many".
⚈ A table representing a hierarchical tree has columns id and parent_id, plus PRIMARY KEY(id), INDEX(parent_id). The "root" of the tree usually has parent_id=0. See Recursive CTEs for traversing such trees.
⚈ Partitioning is rarely useful.
⚈ When switching between partitioned and not partitioned, all the indexes of the table should be reviewed are likely to be changed
⚈ More discussion:
SHOW CREATE TABLE is more descriptive than DESCRIBE
⚈ "MUL" does not tell you whether indexes are single-column or composite.
⚈ FOREIGN KEYs are ignored.
⚈ ENGINE is unavailable
⚈ etc, etc.
⚈ So, never use DESCRIBE TABLE, only use SHOW CREATE TABLE !
What to INDEX
⚈ "But I indexed every column" == Novice who does not understand indexing.
⚈ With rare exceptions, you need to develop the SELECTs (and UPDATEs and DELETEs) to see what indexes are needed for a table.
⚈ In many cases, "composite" indexes are the key to performance.
⚈ For a lot more discussion see:
⚈ Because the textbook says to -- Novice.
⚈ To centralize a value that could change -- The goal of the textbook.
⚈ To save space, hence improve performance -- The goal of a DBA.
⚈ But, do not normalize 'continuous' values (numbers, floats, dates, etc)
Computing row size
This is tricky. There are many hidden fields in a record in an InnoDB table and block and record. This will give you an approximation.
1. For each numeric column, add the number of bytes given above.
2. For each string column, add an estimate of the average length of the string. (This will go awry for big text/blob columns that end up "off-record")
3. Multiple by 3. This very crudely accounts for the overhead, padding, waste, indexes, etc.
Another way is to
1. CREATE TABLE ...
2. Populate it will many rows. How many rows? This gets tricky because the table will grow is spurts.
3. SHOW TABLE STATUS and look at Avg_row_length.
Single/double quotes, backtics, etc
⚈ Backtics _must_ be used when the table/column name is also a reserved word. They may be left off otherwise. However, you run the real risk that in the next version of MySQL one of your variable names will have become a reserved word.
⚈ Strings must have quotes (single or double) around them, quotes may be used around numbers. string_col = number should have quotes around the number to allow for use of an index. However, if you use "prepared" statements, the quotes are supplied for you, so you must not quote the placeholders.
⚈ In PHP, "interpolation" is more concise, but equivalent to concatenation. (OK, I am giving an opinion here.)
⚈ You may use backtics around any/all "names". Note that that is what mysqldump does. Exception: There is an SQL mode that allows double quotes to be used instead of backtics. (Very few users set that mode.)
⚈ In PHP, interpolation (using $var inside quoted string) works only with double-quotes. So this won't work right: '... WHERE col = "$var"'
⚈ In many situations, the use of "prepared statements" or "binding" should definitely be used for security reasons. (cf SQL Injection and other hacks). (But the OP did not list an example of such.) Also, it provides both the escaping and quoting that is required. Using prepared statements allows you to insert your strings exactly as they are.
⚈ PHP is a nightmare. htmlspecialcharacters for <input ...> things, urlencode for pieces of urls, real_escape_string for MySQL, addslashes and htmlentities for ??
These are guidelines, not hard-and-fast rules.
⚈ By convention use the name id for a surrogate PK: id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY
⚈ Alternatively (to previous bullet), prefix the id with the table name: userid or user_id, company_id, etc
⚈ But don't prefix all the columns of a table with the table name. It clutters the queries and provides no useful information.
⚈ When JOINing, use an alias to qualify which table a column comes from. (Prefixing the column name cannot be trusted.)
⚈ When JOINing, it is usually best to give each table a short alias. This is a compromise between the clutter and the specification, above.
⚈ An alias might be the initial letter or the first letter of each 'word' in the name.
⚈ Don't blindly use LEFT JOIN everywhere. If the WHERE clause requires a specific value from the table, then it is really a JOIN.
man page on data size
Original posting: Aug, 2019 (after many years of assimilating ideas)
MySQL Documents by Rick James
HowTo Techniques for Optimizing Tough Tasks:
Partition Maintenance (DROP+REORG) for time series (includes list of PARTITION uses)
Big DELETEs - how to optimize -- and other chunking advice, plus a use for PARTITIONing
Chunking lengthy DELETE/UPDATE/etc.
Data Warehouse techniques:
High speed ingestion
Entity-Attribute-Value -- a common, poorly performing, design pattern (EAV); plus an alternative
5 methods for 'Find Nearest'
Find the nearest 10 pizza parlors -- efficient searching on Latitude + Longitude (another PARITION use)
Lat/Long representation choices
Z-Order 'find nearest'(under construction)
Pagination, not with OFFSET, LIMIT
Techniques on efficiently finding a random row (On beyond ORDER BY RAND())
GUID/UUID Performance (type 1 only)
IP Range Table Performance -- or other disjoint ranges
Rollup Unique User Counts
Alter of a Huge table -- Mostly obviated by 5.6
Latest 10 news articles -- how to optimize the schema and code for such
Build and execute a "Pivot" SELECT (showing rows as columns)
Find largest row for each group ("groupwise max")
Other Tips, Tuning, Debugging, Optimizations, etc...
Rick's RoTs (Rules of Thumb -- lots of tips)
Datatypes and building a good schema
Memory Allocation (caching, etc)
Character Set and Collation problem solver
Trouble with UTF-8
If you want case folding, but accent sensitivity, please file a request at http://bugs.mysql.com .
other language tips
utf8mb4 Collations on 8.0
Converting from MyISAM to InnoDB -- includes differences between them
Compound INDEXes plus other insights into the mysteries of INDEXing
Cookbook for Creating Indexes
Many-to-many mapping table
MySQL Limits -- built-in hard limits
767-byte INDEX limit
Galera, tips on converting to (Percona XtraDB Cluster, MariaDB 10, or manually installed)
5.7's Query Rewrite -- perhaps 5.7's best perf gain, at least for this forum's users
Analyze MySQL Performance
Analyze VARIABLEs and GLOBAL STATUS
My slides from conferences
Percona Live 4/2017 - Rick's RoTs (Rules of Thumb) - MySQL/MariaDB
Percona Live 4/2017 - Index Cookbook - MySQL/MariaDB
Percona Live 9/2015 - PARTITIONing - MySQL/MariaDB
(older ones upon request)
Contact me via LinkedIn; be sure to include a brief teaser in the Invite request:
Did my articles help you out? Like what you see? Consider donating: