Fuzzy Matching in PostgreSQL: A Guide to pg_trgm extension

2 Aug 2024 21:12

—

PostgreSQL, known for its powerful and flexible full-text search capabilities (to_tsvector, to_tsquery etc), includes the pg_trgm extension for improved similarity searches and text indexing.

Here’s a brief look at what pg_trgm offers.

What is pg_trgm?

pg_trgm (trigram) is an extension in PostgreSQL that provides support for similarity searches using trigram matching. A trigram is a group of three consecutive characters in a string. By breaking down text into trigrams, pg_trgm enables efficient pattern matching and similarity searches.

test=# SELECT show_trgm('Corum');
            show_trgm
---------------------------------
 {"  c"," co",cor,oru,rum,"um "}
(1 row)

Even if your search on Google is full of mistakes or doesn’t make much sense, you’ll usually still get an answer. pg_trgm does exactly same by using a distance operator < – > that computes the distance between two strings.

You can do similarity searches using distance operator. It is useful for applications requiring fuzzy matching. This can be helpful in scenarios like searching for misspelled words or variations in names.

Using pg_trgm, you can perform similarity searches with the % operator or use functions like similarity() to calculate how closely two strings match.

I will show a quick demo of how you can use it. First i will create a simple table storing provinces of Türkiye.

test=# CREATE TABLE provinces (
    province_id SERIAL PRIMARY KEY,province_name text );
CREATE TABLE

-- Insert statement for provinces in alphabetical order
test=# INSERT INTO provinces (province_name) VALUES ('Adana'), ('Adıyaman'), ('Afyonkarahisar'), ('Ağrı'), ('Aksaray'), ('Amasya'), ('Ankara'), ('Antalya'), ('Ardahan'), ('Artvin'), ('Aydın'), ('Balıkesir'), ('Bartın'), ('Batman'), ('Bayburt'), ('Bilecik'), ('Bingöl'), ('Bitlis'), ('Bolu'), ('Burdur'), ('Bursa'), ('Çanakkale'), ('Çankırı'), ('Çorum'), ('Denizli'), ('Diyarbakır'), ('Düzce'), ('Edirne'), ('Elazığ'), ('Erzincan'), ('Erzurum'), ('Eskişehir'), ('Gaziantep'), ('Giresun'), ('Gümüşhane'), ('Hakkâri'), ('Hatay'), ('Iğdır'), ('Isparta'), ('İstanbul'), ('İzmir'), ('Kahramanmaraş'), ('Karabük'), ('Karaman'), ('Kars'), ('Kastamonu'), ('Kayseri'), ('Kilis'), ('Kırıkkale'), ('Kırklareli'), ('Kırşehir'), ('Kocaeli'), ('Konya'), ('Kütahya'), ('Malatya'), ('Manisa'), ('Mardin'), ('Mersin'), ('Muğla'), ('Muş'), ('Nevşehir'), ('Niğde'), ('Ordu'), ('Osmaniye'), ('Rize'), ('Sakarya'), ('Samsun'), ('Şanlıurfa'), ('Siirt'), ('Sinop'), ('Sivas'), ('Şırnak'), ('Tekirdağ'), ('Tokat'), ('Trabzon'), ('Tunceli'), ('Uşak'), ('Van'), ('Yalova'), ('Yozgat'), ('Zonguldak');
INSERT 0 81

Now lets see it in action.

Usage of pg_trgm extension

test=# SELECT a.*, a.province_name <-> 'Şorum' distance FROM provinces a order by (province_name <-> 'Şorum') limit 3;
 province_id | province_name | distance
-------------+---------------+-----------
          24 | Çorum         |         0
          63 | Ordu          |      0.75
          31 | Erzurum       | 0.8181818
(3 rows)

Last but not least, With the help of this extension you can create GIST or GIN indexes based on trigrams. These indexes speed up searches for similar strings, making them more efficient compared to full table scans. Hans Jürgen Schönig from Cybertec has a great blog post (More performance for LIKE and ILIKE statements) about how to speed up LIKE and ILIKE statements.

For those looking to implement fuzzy matching in Oracle databases, the UTL_MATCH package offers functions for comparing two strings, with support beginning from version 11g. The 23c release introduces several new algorithms in addition to the existing ones (Jaro-Winkler Distance and Levenshtein Distance), including bigram, trigram, whole_word_match, and longest_common_substring.

Hope it helps.

Fuzzy Matching in PostgreSQL: A Guide to pg_trgm extension

What is pg_trgm?

Discover More from Osman DİNÇ

Comments

Leave your comment Cancel reply

Fuzzy Matching in PostgreSQL: A Guide to pg_trgm extension

What is pg_trgm?

Share this:

Discover More from Osman DİNÇ

Comments

Leave your comment Cancel reply