Full text indexing is an amazing feature. Instead of incredibly costly LIKE comparisons, an index can be created by the RDBMS which contains the full text of the column, or table, in question. For things like e-mail message bodies, biographies or other lengthy descriptions, or any other unstructured text, full text indexing is the only solution that makes sense. Using the LIKE method (for example : SELECT * FROM MyTable WHERE col1 LIKE ‘%MySQL’;) is so amazingly slow that it will bring a server to its knees if used often.
You enable full text indexing on MS SQL using the the SSMS gui(right click on a table and use the Full Text Index menu if full text indexing has been enabled on that database), or through T-SQL. For more detailed instructions, check out this great article on developer.com. Once the full text index is created, MS SQL just seems a whole lot faster when searching the indexed fields provided you’re using the right T-SQL. I’ll assume that my MS SQL guru readers are already pretty familiar with that!
Creating a FTI in MySQL requires a little DDL, seen below :
mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> );
Once you’ve created and populated your MyISAM table (try using INSERT INTO ISAM_table_name (x) SELECT x from innoDB_table_name for a quick migration from another table with text values.) you can do some really cool things with the index, just like in MS SQL.
On the documentation page from MySQL.com you can find instructions for ranking result sets by relevance using the MATCH() function, which compares nicely with the CONTAINSTABLE keyword in MSSQL which returns both the key and the rank of each key in result or table form. You can also do simple boolean searches, just like the CONTAINS and FREETEXT keywords using IN BOOLEAN MODE. Weighted results are supported using the syntax below :
(MATCH (title) AGAINST (’keywords’ IN BOOLEAN MODE)*10)+
(MATCH (title) AGAINST (’keywords’ IN BOOLEAN MODE)*50);
Also, much like MS SQL, MySQL has some words that it doesn’t index…
- Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is four characters.
- Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overwritten by a user-defined list.
Some methods I haven’t found in MySQL are :
- Inflectional search - For example, searching for swim but finding swimming, swam, etc.
- Proximity based search - Tennis within a few words of raquet gaining a much higher ranking than either of the keywords by themselves. Apparently, MySQL doesn’t store the position of the words in the index, so the proximity searching may not be in the next few versions, if it ever comes at all!
If any MySQL readers out there know how to do these in MySQL (obviously almost anything can be done in PHP, etc.)please drop me a line. You’ll receive full credit here and maybe even a nice paypal thank you!
So as we can see, MySQL does full text indexing just like MS SQL for the most part, but there is a catch. A BIG catch. You can only full text index a MyISAM table. In MySQL, there are multiple storage engines!! This means that many of the databases out there are going to have a tough time with full text indexing if they plan to scale out and stay with innoDB, unless we can find a solution. Quoted from Wikipedia, the following are some of the major differences between MyISAM and innoDB :
- InnoDB recovers from a crash or other unexpected shutdown by replaying its logs. MyISAM must fully scan and repair or rebuild any indexes or possibly tables which had been updated but not fully flushed to disk. Since the InnoDB approach is approximately fixed time while the MyISAM time grows with the size of the data files, InnoDB offers greater perceived availability and reliability as database sizes grow.
- MyISAM relies on the operating system for caching reads and writes to the data rows while InnoDB does this within the engine itself, combining the row caches with the index caches. Dirty (changed) database pages are not immediately sent to the operating system to be written by InnoDB, which can make it substantially faster than MyISAM in some situations.
- InnoDB stores data rows physically in primary key order while MyISAM typically stores them mostly in the order in which they are added. This corresponds to the MS SQL Server feature of “Clustered Indexes” and the Oracle feature known as “index organized tables.” When the primary key is selected to match the needs of common queries this can give a substantial performance benefit. For example, customer bank records might be grouped by customer in InnoDB but by transaction date with MyISAM, so InnoDB would likely require fewer disk seeks and less RAM to retrieve and cache a customer account history. On the other hand, inserting data in orders that differ substantially from primary key (PK) order will presumably require that InnoDB do a lot of reordering of data in order to get it into PK order. This places InnoDB at a slight disadvantage in that it does not permit insertion order based table structuring.
- InnoDB currently does not provide the compression and terse row formats provided by MyISAM, so both the disk and cache RAM required may be larger. A lower overhead format is available for MySQL 5.0, reducing overhead by about 20% and use of page compression is planned for a future version.
- When operating in fully ACID-compliant modes, InnoDB must do a flush to disk at least once per transaction, though it will combine flushes for inserts from multiple connections. For typical hard drives or arrays, this will impose a limit of about 200 update transactions per second. If you require higher transaction rates, disk controllers with write caching and battery backup will be required in order to maintain transactional integrity. InnoDB also offers several modes which reduce this effect, naturally leading to a loss of transactional integrity. MyISAM has none of this overhead, but only because it does not support transactions.
Now that we’ve read through the details, lets get to the point! InnoDB, MyISAM, and other engines allow a flexibility in decision making for administrators and developers. MS SQL has only one storage engine. This brings us back to the legos vs. furniture philosophy seen in the MySQL backups simplified post from last week. Unix is flexible, and Microsoft has a vision. You’ve got to work a little harder to make decisions in the Unix world, but it pays off when your tools match your problems. I encourage everyone interested in the MySQL storage engines and other database internals to check out Understanding MySQL Internals (Understanding)
For those of you not yet familiar with the SQL Server storage engine, check out Delaney’s amazing work on Inside Microsoft (r) SQL Server ™ 2005: The Storage Engine
My suggestion for this particular case, which requires searching based on a column of RSS text, is that we have both MyISAM and innoDB tables in the same database. This won’t affect our backups, since we use mysqldump and, for the time being, won’t affect our performance too negatively. Since we’ll only be storing the text data in the RSS text table, we don’t need to worry too much about foreign key issues and other things that apparently affect MyISAM. A simple search and JOIN will solve all of our full text needs.
You can create a table using any MySQL compatible engine using the following syntax :
CREATE TABLE t (i INT) ENGINE = MYISAM;
or
CREATE TABLE t (i INT) ENGINE = INNODB;
This is a really cool feature, and allows you to mix and match technologies for storage inside a single database, which is something quite different from MS SQL. Also, you could get really crazy and create an entirely new database for your ISAM needs and use fully qualified data element names like SELECT * FROM databasename.tablename. Most people don’t recommend this.
More details to come with the final implementation, and look for an update to the backup post soon, as we document the FTP process for offsite backups!



