Soundex is a system for indexing names by sound. It was designed so that homophones, words that sound the same but are spelt differently, resolve to the same encoding. For example, the names Reid and Reed are both encoded as R300, and McDonald and Macdonald are both encoded as M235.
To create a Soundex:
- The first letter of the Soundex is the first letter of the name.
- After the first letter, vowels and the letters y, h and w are not encoded themselves; they matter only for the repeated-code rules further down.
- The remaining letters are encoded one-by-one according to their place of articulation, i.e. where in the mouth or throat the sound is formed.
- The labial consonants b, f, p and v, which are formed by the lips, are coded as a one.
- The guttural and sibilant consonants, c, g, j, k, q, s, x and z, which are formed at the back of the throat and with the tongue close to the roof of the mouth respectively, are coded as a two.
- The dental consonants, d and t, which are formed by the tongue against the teeth, are coded as a three.
- The long liquid consonant l is encoded as a four.
- The nasal consonants, m and n, in which air escapes through the nose, are encoded as a five.
- The short liquid consonant r is encoded as a six.
- If two letters that are encoded as the same number are next to each other (e.g. the d and t in Schmidt) then the encoding is used only once.
- If two letters that are encoded as the same number are separated by a y, h or w then the encoding is used only once.
- If two letters that are encoded as the same number are separated by a vowel then the encoding is used twice.
- The letters are encoded one-by-one until three digits are produced; any letters after that are ignored. If the name is too short to produce three digits, the remainder of the Soundex is filled with zeroes.
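The rules above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the names `CODES` and `soundex` are chosen here for the example, non-alphabetic characters are simply dropped, and the sketch assumes that the first letter's own code takes part in the repeated-code rule (so the f in Pfister is not encoded again after the P). It follows the rule stated above that y, like h and w, does not separate two letters with the same code.

```python
# Digit for each coded consonant, following the groups listed above.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return a four-character Soundex code for name (illustrative sketch)."""
    # Keep only the letters a-z; anything else (spaces, apostrophes) is dropped.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    # Assumption: the first letter's code counts for the repeated-code rule.
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hwy":
            continue  # y, h and w do not separate letters with equal codes
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            result += digit
            if len(result) == 4:
                break  # first letter plus three digits: done
        previous = digit  # a vowel resets this to "", so a repeat is kept
    return result.ljust(4, "0")  # pad short names with zeroes


print(soundex("Reid"), soundex("Reed"))           # R300 R300
print(soundex("McDonald"), soundex("Macdonald"))  # M235 M235
print(soundex("Schmidt"))                         # S530
```

Tracking the previous code, rather than deleting the vowels up front, is what lets a vowel between two letters of the same group produce the digit twice while y, h or w does not.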
If we use the example of Macdonald from above:
- First letter is M.
- Removing the vowels leaves us with Mcdnld.
- c is encoded as two, giving us M2.
- d is encoded as three, giving us M23.
- n is encoded as five, giving us M235.
- Three digits have now been produced, so the remaining l and d are ignored.
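The walk-through above can be reproduced with a small tracing helper. This is a sketch only: `trace` is a hypothetical name, it assumes a non-empty name made up of letters, and it records the partial code each time a digit is appended instead of returning only the final result.

```python
def trace(name: str) -> list:
    """List the partial Soundex code after each appended digit (illustrative)."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    code_of = {ch: digit for letters, digit in groups.items() for ch in letters}

    name = name.lower()
    partial = name[0].upper()          # step 1: keep the first letter
    steps = [partial]
    previous = code_of.get(name[0], "")
    for ch in name[1:]:
        if ch in "hwy":
            continue                   # skipped without separating equal codes
        digit = code_of.get(ch, "")
        if digit and digit != previous:
            partial += digit
            steps.append(partial)
            if len(partial) == 4:
                break                  # three digits produced; ignore the rest
        previous = digit
    return steps


print(trace("Macdonald"))  # ['M', 'M2', 'M23', 'M235']
```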