Search like Google in Oracle
On my blog post (Understanding pg_trgm in PostgreSQL) , I showed how you can make similarity searches in PostgreSQL using pg_trgm extension.
Oracle’s UTL_MATCH package includes functions for performing fuzzy matching between two strings, with support starting from version 11g. The 23ai release introduces several additional algorithms to complement the existing ones (Jaro-Winkler Distance and Levenshtein Distance). New algorithms include bigram, trigram, whole_word_match, and longest_common_substring.
First i will create the sample data for my tests.
SQL> CREATE TABLE provinces (
province_id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,province_name varchar2(100) );
INSERT INTO provinces (province_name) VALUES ('Adana');
INSERT INTO provinces (province_name) VALUES ('Adıyaman');
INSERT INTO provinces (province_name) VALUES ('Afyonkarahisar');
INSERT INTO provinces (province_name) VALUES ('Ağrı');
INSERT INTO provinces (province_name) VALUES ('Aksaray');
INSERT INTO provinces (province_name) VALUES ('Amasya');
INSERT INTO provinces (province_name) VALUES ('Ankara');
INSERT INTO provinces (province_name) VALUES ('Antalya');
INSERT INTO provinces (province_name) VALUES ('Ardahan');
INSERT INTO provinces (province_name) VALUES ('Artvin');
INSERT INTO provinces (province_name) VALUES ('Aydın');
INSERT INTO provinces (province_name) VALUES ('Balıkesir');
INSERT INTO provinces (province_name) VALUES ('Bartın');
INSERT INTO provinces (province_name) VALUES ('Batman');
INSERT INTO provinces (province_name) VALUES ('Bayburt');
INSERT INTO provinces (province_name) VALUES ('Bilecik');
INSERT INTO provinces (province_name) VALUES ('Bingöl');
INSERT INTO provinces (province_name) VALUES ('Bitlis');
INSERT INTO provinces (province_name) VALUES ('Bolu');
INSERT INTO provinces (province_name) VALUES ('Burdur');
INSERT INTO provinces (province_name) VALUES ('Bursa');
INSERT INTO provinces (province_name) VALUES ('Çanakkale');
INSERT INTO provinces (province_name) VALUES ('Çankırı');
INSERT INTO provinces (province_name) VALUES ('Çorum');
INSERT INTO provinces (province_name) VALUES ('Denizli');
INSERT INTO provinces (province_name) VALUES ('Diyarbakır');
INSERT INTO provinces (province_name) VALUES ('Düzce');
INSERT INTO provinces (province_name) VALUES ('Edirne');
INSERT INTO provinces (province_name) VALUES ('Elazığ');
INSERT INTO provinces (province_name) VALUES ('Erzincan');
INSERT INTO provinces (province_name) VALUES ('Erzurum');
INSERT INTO provinces (province_name) VALUES ('Eskişehir');
INSERT INTO provinces (province_name) VALUES ('Gaziantep');
INSERT INTO provinces (province_name) VALUES ('Giresun');
INSERT INTO provinces (province_name) VALUES ('Gümüşhane');
INSERT INTO provinces (province_name) VALUES ('Hakkâri');
INSERT INTO provinces (province_name) VALUES ('Hatay');
INSERT INTO provinces (province_name) VALUES ('Iğdır');
INSERT INTO provinces (province_name) VALUES ('Isparta');
INSERT INTO provinces (province_name) VALUES ('İstanbul');
INSERT INTO provinces (province_name) VALUES ('İzmir');
INSERT INTO provinces (province_name) VALUES ('Kahramanmaraş');
INSERT INTO provinces (province_name) VALUES ('Karabük');
INSERT INTO provinces (province_name) VALUES ('Karaman');
INSERT INTO provinces (province_name) VALUES ('Kars');
INSERT INTO provinces (province_name) VALUES ('Kastamonu');
INSERT INTO provinces (province_name) VALUES ('Kayseri');
INSERT INTO provinces (province_name) VALUES ('Kilis');
INSERT INTO provinces (province_name) VALUES ('Kırıkkale');
INSERT INTO provinces (province_name) VALUES ('Kırklareli');
INSERT INTO provinces (province_name) VALUES ('Kırşehir');
INSERT INTO provinces (province_name) VALUES ('Kocaeli');
INSERT INTO provinces (province_name) VALUES ('Konya');
INSERT INTO provinces (province_name) VALUES ('Kütahya');
INSERT INTO provinces (province_name) VALUES ('Malatya');
INSERT INTO provinces (province_name) VALUES ('Manisa');
INSERT INTO provinces (province_name) VALUES ('Mardin');
INSERT INTO provinces (province_name) VALUES ('Mersin');
INSERT INTO provinces (province_name) VALUES ('Muğla');
INSERT INTO provinces (province_name) VALUES ('Muş');
INSERT INTO provinces (province_name) VALUES ('Nevşehir');
INSERT INTO provinces (province_name) VALUES ('Niğde');
INSERT INTO provinces (province_name) VALUES ('Ordu');
INSERT INTO provinces (province_name) VALUES ('Osmaniye');
INSERT INTO provinces (province_name) VALUES ('Rize');
INSERT INTO provinces (province_name) VALUES ('Sakarya');
INSERT INTO provinces (province_name) VALUES ('Samsun');
INSERT INTO provinces (province_name) VALUES ('Şanlıurfa');
INSERT INTO provinces (province_name) VALUES ('Siirt');
INSERT INTO provinces (province_name) VALUES ('Sinop');
INSERT INTO provinces (province_name) VALUES ('Sivas');
INSERT INTO provinces (province_name) VALUES ('Şırnak');
INSERT INTO provinces (province_name) VALUES ('Tekirdağ');
INSERT INTO provinces (province_name) VALUES ('Tokat');
INSERT INTO provinces (province_name) VALUES ('Trabzon');
INSERT INTO provinces (province_name) VALUES ('Tunceli');
INSERT INTO provinces (province_name) VALUES ('Uşak');
INSERT INTO provinces (province_name) VALUES ('Van');
INSERT INTO provinces (province_name) VALUES ('Yalova');
INSERT INTO provinces (province_name) VALUES ('Yozgat');
INSERT INTO provinces (province_name) VALUES ('Zonguldak');
commit;
Now, I will show you how to perform similarity searches in Oracle versions 11g and later. The results vary depending on the chosen algorithm and are case-sensitive. You can use the UPPER and LOWER functions to handle case sensitivity.
SQL> SELECT * FROM provinces ORDER BY UTL_MATCH.jaro_winkler_similarity(upper(province_name), upper('Şorum')) desc fetch first 1 rows only;
SQL> SELECT * FROM provinces ORDER BY UTL_MATCH.edit_distance_similarity(upper(province_name), upper('Şorum')) desc fetch first 1 rows only;

The newly added algorithms on Oracle Database 23ai are:
LONGEST_COMMON_SUBSTRING finds the longest common substring between the two strings.
LEVENSHTEIN corresponds to UTL_MATCH.EDIT_DISTANCE or UTL_MATCH.EDIT_SIMILARITY and gives a measure of character edit distance or similarity.
DAMERAU_LEVENSHTEIN distance differs from the classical LEVENSHTEIN distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions).
JARO_WINKLER corresponds to UTL_MATCH.JARO_WINKLER (a percentage between 0-1) or UTL_MATCH.JARO_WINKLER_SIMILARITY (the same but scaled from 0-100).
BIGRAM and TRIGRAM are instances of the N-gram matching technique, which counts the number of common contiguous sub-strings (grams) between the two strings.
WHOLE_WORD_MATCH corresponds to Word Match Percentage or Count comparison in Oracle Enterprise Data Quality. It calculates the LEVENSHTEIN or edit distance of two phrases with words (instead of letters) as matching units.
With the Oracle 23ai release, you can take advantage of the new algorithms by using the newly added FUZZY_MATCH function. While the UTL_MATCH package is still available, the FUZZY_MATCH function offers a broader range of algorithms, providing greater flexibility and better alignment with developer needs.
SELECT * FROM provinces ORDER BY fuzzy_match(levenshtein, upper(province_name), upper('Şorum')) desc fetch first 1 rows only;
SELECT * FROM provinces ORDER BY fuzzy_match(jaro_winkler, upper(province_name), upper('Şorum')) desc fetch first 1 rows only;
SELECT * FROM provinces ORDER BY fuzzy_match(bigram, upper(province_name), upper('Şorum')) desc fetch first 1 rows only;
SELECT * FROM provinces ORDER BY fuzzy_match(trigram, upper(province_name), upper('Şorum')) desc fetch first 1 rows only;
SELECT * FROM provinces ORDER BY fuzzy_match(whole_word_match, upper(province_name), upper('Şorum')) desc fetch first 1 rows only;
SELECT * FROM provinces ORDER BY fuzzy_match(longest_common_substring, upper(province_name), upper('Şorum')) desc fetch first 1 rows only;

SQL> SELECT a.*, fuzzy_match(bigram, upper(province_name), upper('Şorum')) as similarity from provinces a ORDER BY fuzzy_match(bigram, upper(province_name), upper('Şorum')) desc fetch first 3 rows only;
PROVINCE_ID PROVINCE_NAME SIMILARITY
----------- -------------------- ----------
186 Çorum 60
193 Erzurum 33
225 Ordu 20
Much more information available Data Quality Operators section of Oracle 23ai SQL Language Reference.
Hope it helps.


Leave your comment