
// Soundex - Phonetic algorithm for indexing names by sound


[PHONETIC]

Sound-Based

Encodes names based on pronunciation, not spelling.

[FUZZY]

Fuzzy Matching

Finds similar-sounding names despite spelling variations.

[GENEALOGY]

Family Research

Essential tool for genealogy and historical records.

>> technical info

How Soundex Works:

Soundex keeps the first letter and replaces the consonants after it with digits based on phonetic groups. Similar-sounding consonants get the same digit, and adjacent letters with the same digit are coded only once. Vowels are not coded. The American variant pads or truncates the result to 4 characters; the Refined variant produces variable-length codes.

Encoding Rules:

1 = B, F, P, V
2 = C, G, J, K, Q, S, X, Z
3 = D, T
4 = L
5 = M, N
6 = R

Robert → R163
Rupert → R163
Rubin → R150
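The rules above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes ASCII names, drops any non-letter characters, and applies the common convention that H and W do not separate two consonants with the same digit.

```python
# Minimal sketch of American Soundex (ASCII letters only).
CODES = {}
for d, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        CODES[letter] = str(d)


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code for name."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")  # its digit still counts for collapsing
    for ch in letters[1:]:
        digit = CODES.get(ch, "")     # "" means the letter is not coded
        if digit and digit != prev:
            code += digit
        if ch not in "HW":            # H and W do not break a same-digit run
            prev = digit
        if len(code) == 4:
            break
    return code.ljust(4, "0")         # pad short codes with zeros


for name in ("Robert", "Rupert", "Rubin"):
    print(name, soundex(name))
```

Keeping `prev` unchanged for H and W is what makes Ashcraft come out as A261; resetting it for vowels is what lets a repeated digit appear when a vowel sits between two same-group consonants.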

Why Use Soundex:

>Database deduplication
>Genealogy research
>Census analysis
>Customer matching
>Spell correction

>> frequently asked questions

What is Soundex?

Soundex is a phonetic algorithm, patented in 1918, for indexing names by sound. It became widely known through its use in indexing US Census records, where it helps find surnames with similar pronunciations despite different spellings.

American vs Refined Soundex?

American Soundex produces 4-character codes (letter + 3 digits). Refined Soundex uses more digit groups and produces variable-length codes, which reduces false matches. Note that SQL Server's SOUNDEX() returns the 4-character American-style code, not Refined Soundex.

Why do different spellings get the same code?

That's the purpose! Soundex groups similar-sounding names together. Smith and Schmidt sound similar, so both encode to S530, which helps find name variations in databases.

What are Soundex limitations?

Soundex works best with English names. It may not handle names from other languages well, and because the first letter is kept as-is, different spellings of the same name can get different codes: Catherine is C365 but Kathryn is K365.

Soundex in SQL — how do I use SOUNDEX() and DIFFERENCE() in databases?

Most major databases ship with native Soundex functions, so you don't need to re-implement the algorithm in application code:

SQL Server / Azure SQL:
SELECT SOUNDEX('Robert'), SOUNDEX('Rupert'); -- Both return 'R163'
SELECT DIFFERENCE('Smith', 'Smythe'); -- Returns 4 (max similarity, scale 0–4)

Oracle: SELECT SOUNDEX(last_name) FROM employees;
MySQL: SELECT SOUNDEX('Johnson'), 'Jonson' SOUNDS LIKE 'Johnson'; (the SOUNDS LIKE operator is a MySQL shortcut for SOUNDEX(a) = SOUNDEX(b); note that MySQL's SOUNDEX() can return codes longer than 4 characters).
PostgreSQL: Not built-in. Install the fuzzystrmatch extension: CREATE EXTENSION fuzzystrmatch; SELECT soundex('Robert'), difference('Robert', 'Rupert');
SQLite: Compile with -DSQLITE_SOUNDEX flag, then use soundex().

Typical use — deduplicate customers:
SELECT a.id, b.id
FROM customers a
JOIN customers b ON a.id < b.id
WHERE SOUNDEX(a.last_name) = SOUNDEX(b.last_name)
  AND DIFFERENCE(a.first_name, b.first_name) >= 3;

Index tip: create a computed column storing SOUNDEX(name) and index it — otherwise the function runs per-row and disables index lookups on large tables.
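To illustrate the index tip, here is a hedged sketch using Python's built-in sqlite3 module. Because stock SQLite builds usually lack soundex(), the code registers a Python function instead. The names `py_soundex`, `last_name_sx`, and `idx_customers_sx` are made up for this example, and the Soundex helper is a simplified ASCII-only version.

```python
import sqlite3

_CODES = {c: str(d)
          for d, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
          for c in group}


def soundex(name):
    """Compact American Soundex (ASCII letters only)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code, prev = letters[0], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev and len(code) < 4:
            code += digit
        if ch not in "HW":
            prev = digit
    return code.ljust(4, "0")


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    # Hypothetical SQL function name; stands in for a native SOUNDEX().
    conn.create_function("py_soundex", 1, soundex)
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, last_name TEXT, last_name_sx TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (last_name) VALUES (?)",
        [("Smith",), ("Smythe",), ("Schmidt",), ("Jones",)],
    )
    # Compute the code once per row and store it, then index the stored column.
    conn.execute("UPDATE customers SET last_name_sx = py_soundex(last_name)")
    conn.execute("CREATE INDEX idx_customers_sx ON customers (last_name_sx)")
    return conn


def find_sound_alikes(conn, name):
    # The lookup compares against the stored, indexed code instead of
    # calling the function on every row.
    cur = conn.execute(
        "SELECT last_name FROM customers WHERE last_name_sx = ? ORDER BY id",
        (soundex(name),),
    )
    return [row[0] for row in cur]


conn = build_demo_db()
print(find_sound_alikes(conn, "Smyth"))
```

In a production schema the stored column would need to be kept in sync on insert and update, for example with a trigger or a generated/computed column where the database supports one.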

Soundex code for specific names — what does S530, J525, R163 mean?

A 4-character Soundex code is the first letter of the name followed by three digits derived from the consonants after it. Common examples:

• Smith / Smyth / Smythe / Schmidt → S530 (S, then M=5, then T=3, padded with 0). This is the classic demonstration case — all four spellings collapse to the same code.
• Johnson / Johnsen / Jonson / Jansen → J525.
• Robert / Rupert / Rubert → R163 (R, B/P=1, R=6, T=3).
• Williams / Williamson → W452 / W452 (truncation after 3 digits).
• Washington → W252.
• Lee / Lea → L000 (no coded consonants after the first letter; padded with zeros). Leigh, by contrast, is L200 because the G is coded as 2.
• O'Brien / Obryan / Obrien → O165.

Digit mapping (memorize 6 groups):
• 1: B, F, P, V
• 2: C, G, J, K, Q, S, X, Z
• 3: D, T
• 4: L
• 5: M, N
• 6: R
• Not coded: A, E, I, O, U, Y, H, W. A vowel (or Y) between two consonants with the same digit separates them, so both are coded; H and W do not separate them.

Double letters / adjacent same-digit rule: consecutive consonants that map to the same digit are collapsed (e.g., TT → one 3; CK → one 2). This is the step beginners most often get wrong when computing Soundex by hand.
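For checking a hand computation, a small tracing helper can show what happens to each letter, including the collapse step. This is an illustrative sketch only; `soundex_trace` and its step messages are invented for this example, and it assumes ASCII input.

```python
CODES = {c: str(d)
         for d, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
         for c in group}


def soundex_trace(name):
    """Return (code, steps), where steps lists what happened to each letter."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return "", []
    code = letters[0]
    steps = [(letters[0], "kept as first letter")]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if len(code) == 4:
            steps.append((ch, "dropped: code already has 4 characters"))
        elif not digit:
            steps.append((ch, "not coded"))
        elif digit == prev:
            steps.append((ch, f"collapsed: repeats digit {digit}"))
        else:
            code += digit
            steps.append((ch, f"coded as {digit}"))
        if ch not in "HW":  # H and W leave the previous digit in place
            prev = digit
    return code.ljust(4, "0"), steps


code, steps = soundex_trace("Jackson")
print(code)
for letter, action in steps:
    print(letter, "->", action)
```

In the Jackson trace, C is coded as 2 and the following K and S are both collapsed into it, which is exactly the step that is easy to miss by hand.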

Implementing Soundex in Python, JavaScript, Java, Ruby — code snippets

Most languages have battle-tested Soundex libraries — don't roll your own unless studying the algorithm:

Python: pip install jellyfish
import jellyfish
jellyfish.soundex('Robert')    # 'R163'
jellyfish.metaphone('Robert')  # 'RBRT'

JavaScript / Node.js: npm install natural
const natural = require('natural');
const sx = natural.SoundEx;
sx.process('Robert'); // 'R163'

Java: Apache Commons Codec
import org.apache.commons.codec.language.Soundex;
new Soundex().encode("Robert"); // "R163"

Ruby: require 'text'; Text::Soundex.soundex('Robert') (via the text gem).

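Go: there is no Soundex in the standard library, but the algorithm is simple enough to write yourself in about 30 lines (ASCII names assumed):
import "unicode"

// codes holds the Soundex digit for each letter A..Z ('0' = not coded).
const codes = "01230120022455012623010202"

func digit(r rune) byte {
	if r < 'A' || r > 'Z' {
		return '0'
	}
	return codes[r-'A']
}

// Soundex returns the 4-character American Soundex code of s.
func Soundex(s string) string {
	if s == "" {
		return ""
	}
	rs := []rune(s)
	first := unicode.ToUpper(rs[0])
	code := []byte{byte(first)}
	prev := digit(first)
	for _, r := range rs[1:] {
		r = unicode.ToUpper(r)
		d := digit(r)
		if d != '0' && d != prev {
			code = append(code, d)
		}
		if r != 'H' && r != 'W' { // H and W do not reset adjacency
			prev = d
		}
		if len(code) == 4 {
			break
		}
	}
	return (string(code) + "000")[:4]
}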

The subtle bit: H and W don't break consonant adjacency (Ashcraft → A261, not A226).

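Soundex and Chinese names: does Soundex work for Chinese names?

Soundex was designed for English surnames only and is of almost no use for Chinese, Japanese, or Korean names. The reasons:

• Character set limit: Soundex maps only the 26 letters A-Z, so Chinese characters cannot be used as input.
• Romanization differences: the same character is spelled very differently across romanization systems. For example, "张" (Zhang / Chang / Cheung / Teoh) encodes to Z520, C520, C520, and T000; the differing first letters Z, C, and T prevent a match.
• Tone information is lost: Soundex cannot represent the tonal difference between "Ma 马", "Ma 麻", and "Ma 骂".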

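Better options for fuzzy matching of Chinese names:
• Pinyin-based matching: convert everything to toneless pinyin, then compare by edit distance.
• Cangjie / Wubi encoding: matches variant characters by glyph structure (张/張).
• Double Metaphone: more forgiving than Soundex for European names (immigrant surnames of Russian, German, or French origin), but still limited for Asian names.
• Levenshtein edit distance: applied to pinyin strings, it is more flexible.
• Jaro-Winkler similarity: often performs better than Levenshtein on short strings such as surnames.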

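Classic Soundex use cases (English-language context):
• US Census records (Soundex indexes exist for the 1880 and 1900-1930 censuses and remain a standard tool in genealogy research)
• Ellis Island immigration records (European surnames were often written down by ear and misspelled; Soundex helps trace the same family across spellings)
• Police databases (when the spelling of a surname reported by a witness is uncertain)
• FamilySearch and Ancestry.com genealogy searches support Soundex fuzzy matching.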

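If you are deduplicating Chinese user data, a suggested pipeline is: convert to pinyin → normalize (merge commonly confused initials such as zh/z, sh/s, n/l) → filter by a Levenshtein distance threshold.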
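The pinyin pipeline described above could look roughly like the following sketch. It assumes names have already been converted to toneless pinyin syllables (that step needs a separate library and is not shown). The merge table, the function names, and the default threshold of 1 are illustrative assumptions, not recommended settings.

```python
# Confusable initials to merge; an illustrative table, not a standard list.
INITIAL_MERGES = (("zh", "z"), ("ch", "c"), ("sh", "s"), ("n", "l"))


def normalize_syllable(syllable):
    """Lower-case one pinyin syllable and merge its initial if confusable."""
    syllable = syllable.lower()
    for src, dst in INITIAL_MERGES:
        if syllable.startswith(src):
            return dst + syllable[len(src):]
    return syllable


def normalize_name(syllables):
    """Join normalized syllables into one comparison key."""
    return "".join(normalize_syllable(s) for s in syllables)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def likely_same_name(a, b, max_distance=1):
    """True if two syllable lists are within max_distance after normalizing."""
    return levenshtein(normalize_name(a), normalize_name(b)) <= max_distance


print(likely_same_name(["zhang", "wei"], ["zang", "wei"]))
print(likely_same_name(["zhang", "wei"], ["wang", "fang"]))
```

Merging only at the start of each syllable matters here: replacing n with l everywhere would also rewrite finals such as -ang, which is why the sketch works on syllable lists rather than one long string.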

Refined Soundex vs American (NARA) Soundex — which should I use?

Two major Soundex variants exist, and the choice affects match quality:

American Soundex (NARA):
• Fixed 4-character output: [A-Z][0-6][0-6][0-6].
• Used by US National Archives for the 1880, 1900, 1910, 1920, 1930 censuses.
• Groups B/F/P/V into one code, which is too coarse for modern fuzzy search.
• Output space: at most 26 × 7³ = 8,918 codes, since only the digits 0-6 occur; fewer in practice, because 0 appears only as trailing padding.

Refined Soundex (e.g. RefinedSoundex in Apache Commons Codec):
• Variable-length output (no truncation at 4).
• Finer-grained groupings: separates B/P from F/V and splits the American group 2 into C/K/S, G/J, and Q/X/Z. Result: fewer false-positive collisions.
• Keeps the first letter and also encodes it as a digit, and codes vowels as 0 instead of dropping them, preserving more information.
• Not what Microsoft SQL Server returns: T-SQL SOUNDEX() produces the 4-character American-style code.
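As a rough illustration of the refined variant, the sketch below follows the letter-to-digit table and collapsing behavior of Apache Commons Codec's RefinedSoundex as I understand them. Treat both the table and the output format as assumptions to verify against that library before relying on them.

```python
# Digit for each letter A..Z (assumed to match Commons Codec RefinedSoundex).
# Vowels, H, W, and Y map to 0 and are kept in the output.
REFINED = dict(zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "01360240043788015936020505"))


def refined_soundex(name):
    """Return a variable-length Refined Soundex code (ASCII letters only)."""
    letters = [ch for ch in name.upper() if ch in REFINED]
    if not letters:
        return ""
    code = letters[0]      # the first letter is kept ...
    last = None
    for ch in letters:     # ... and is also encoded as a digit
        digit = REFINED[ch]
        if digit != last:  # collapse runs of the same digit
            code += digit
        last = digit
    return code


for word in ("Robert", "Rupert", "Braz", "Broz"):
    print(word, refined_soundex(word))
```

Because nothing is truncated, longer names yield longer codes, so comparisons are usually done on the whole string rather than on a fixed-width key.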

When to pick which:
• Genealogy research on US census records → American Soundex (the indexes in NARA/Ancestry were built with it — your queries must match).
• Modern customer data deduplication → Refined Soundex or (better) Double Metaphone, which generally copes better with non-English names.
• Cross-tool compatibility → American Soundex. It is the variant implemented almost everywhere (Python jellyfish, Java Commons Codec, SQL Server, MySQL, Oracle, PostgreSQL fuzzystrmatch), although edge cases such as the H/W rule can differ between implementations. Refined Soundex is niche by comparison.

Other phonetic algorithms worth knowing:
• NYSIIS (1970): better for Slavic/Hispanic surnames than Soundex.
• Metaphone / Double Metaphone (1990/2000): handles more European language origins.
• Caverphone: optimized for New Zealand English.
• Cologne Phonetic (Kölner Phonetik): German equivalent of Soundex.
