Character Set and Collation

Character sets are collection of letters and symbols. Encodings are the internal representations of the members of a character set. Collations are the instructions that dictates how characters are to be compared.

To show a list of character sets supported by your server:

SHOW CHARACTER SET

To show the list of collation supported by your server:

SHOW COLLATION;

To determine the character sets and collations in use:

SHOW VARIABLES LIKE 'character%';
SHOW VARIABLES LIKE 'collation%';

Different tables and even different columns may require different character sets, and so both may be specified when a table is created:

CREATE TABLE mytable 
(
    col1 INT, 
    col2 VARCHAR(10),
    col3 VARCHAR(10) CHARACTER SET latin1 COLLATE latin1_general_ci
) DEFAULT CHARACTER SET hebrew COLLATE hebrew_general_ci;
  1. If both CHARACTER SET and COLLATE are specified, those values are used.
  2. If only CHARACTER SET is specified, it is used along with the default collation for that character set.
  3. If neither CHARACTER SET nor COLLATE are specified, the database default is used.

A useful technique for performing case-sensitive search on a table that is usually not case-sensitive:

SELECT * FROM customer ORDER BY lastname, firstname COLLATE latin1_general_cs

COLLATE can also be used in ORDER BY, GROUP BY, HAVING, aggregate functions, aliases, and more.

Strings may be converted between character sets if absolutely needed. To do this, use Cast() or Convert() functions.

To reduce the amount of disk space required by character sets and the amount of memory used by the server, don't select unneeded character sets when you configure MySQL. This requires that you compile MySQL from source rather than using precompiled binary distribution. To see which character sets are available:

shell> ./configure --help

To configure MySQL to support a given set of character sets, invoke configure with a —with-charset option that specifies the default character set, and a —with-extra-charsets option that names any other character sets to include. The value of the —with-extra-charsets should be a comma-separated list of character set names:

shell> ./configure --with-charset=latin1 --with-extra-charsets=utf8,ucs2

For multi-byte character sets that have variable-length encoding, a variable-length data type may be more appropriate even if stored values have a fixed-length. The utf8 character set uses one to three bytes per character. For fixed-length data types, three bytes per character must always be allocated to allow for the possibility that every character will require the "widest" encoding. Thus, CHAR(32) requires 96 bytes. For variable-length data types, only as much storage is allocated as required. In a VARCHAR(32) column, a 32-character string that consists entirely of three-byte characters requires 96 bytes plus a length byte, whereas it requires only 32 bytes plus a length byte if the string consists entirely of single-byte characters.

If you have a choice between multi-byte character sets, choose the one for which the most commonly used characters take less space. For example, the utf8 and utc2 character sets can both be used for storing Unicode data. In utf8, characters take from one to three bytes, but most non-accented Latin characters take one byte. In ucs2, every character takes two bytes. Therefore, if the majority of your characters are non-accented characters, you'll likely achieve a space saving by using utf8 rather than ucs2. This assumes the use of variable-length data type such as VARCHAR(n). If you use a fixed-length type such as CHAR(n), stored values requires n * 3 bytes for utf8, and n * 2 bytes for ucs2, regardless of the particular characters in stored values.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License