N-Gram Generation for Autocomplete

Jim Quinn
2 min readApr 30, 2021

Every web site that lets a user search needs to be able to do autocomplete. It’s the standard. If a site doesn’t predict what a user is typing, users won’t use the site.

To be useful autocomplete has to be fast. So even if a site is a front for a relational database the users won’t care. They want their typing to offer suggestions quickly. That means using an in-memory database like Redis. There’s no way around it, in-memory databases beat disk based databases every time.

So how do you get data from a relational database to Redis? First you have to generate N-grams. An N-gram is a subset of characters that match what the user types. For example, if you have a dataset of fruits, you would generate:

“a”

“ap”

“app”

“appl”

“apple”

Which all match to “apple”. Notice that “a” and “ap” also could match to “apricot”. Those are the choices that would be offered to a user.

You have to generate these in a format that is suitable for bulk loading to Redis which looks like this:

*4

$4

ZADD

$2

ap

$1

0

$5

apple

The “*4” indicates there are 4 commands. The “$4” indicates there are 4 characters in “ZADD”, the “$2” indicates there are 2 characters in “ap” etc. “ap” is the N-gram, and “apple” is the target.

Of course, you want to generate this syntax from a relational database for every N-gram for every fruit in your database. To do this you need a cursor:

The cursor inserts the N-grams into a table. Export the table into a file and run “unix2dos” on the file. Now you have a file that can be bulk loaded into Redis. The hard part is done. To load the file use the redis-cli and the endpoint of your Redis cluster:

cat fruitngrams.txt | src/redis-cli -c -h dev-redis.xxxx.xxx.com -p 6379 –pipe

To verify that the N-grams have been properly loaded connect to the cluster using the redis-cli and use the command:

ZREVRANGE “ap” 0 20

And you should see all the words that start with “ap”.

When a user types and then selects one of the choices presented that choice needs to be tied to an ID.

The last thing is to make sure that when a user selects a word from the autocomplete choices a code can be returned. Use the HSET command to assign a code to each of the choices that could be selected from an autocomplete. This means that if someone selects “macintosh” it can generate the same FruitCode as if someone selected “apple”.

*4

$4

HSET

$5

apple

$9

FruitCode

$6

123456

This allows for a fast and robust autocomplete on any web site.

--

--

Jim Quinn

M.S. Computer Science. VP Engineering at a startup. Fitness enthusiast. Pizza lover.