HomePage | Optical Illusions | War Stories | QBasic | Dads Navy Days | Bristol | Bristol, USA | Bristol, Canada | Terre Haute | Miscellany | Web Stuff | About Ray | Site Map | Site Search | Messages | Credits | Links | Web Rings

QBasic | Errors | 40lb Weight | Bits | Chance | Colours | Dates | Delays | File Dialog | Files | Input | Matching | Menus | Mouse | Numbers | SeqNo | SIRDS | Sorts | Text | Timer | DLoads

Text matching techniques

When using databases one of the most common proceedures is comparing records, either for deduplication or suppression. Deduplication is the process of removing identical records from a database. Suppression uses two or more databases, records found in one or more databases are removed from the main database, this is also known as purging. Another, similar, process is known as merging. In this process records in the suppression database(s) are compared against the main data records. Those records that are not already contained within the main database are added to it.

Searching for identical text is relatively easy, but problems arise when matching on names and addresses. For example identical text matching would not match the following records :-

Name               Address Line 1       Address Line 2  Address Line 3
Mr. A B Sample     123 Sample Road      Sample Town     Sample County
Mr A Sample        123, Sample Road     SampleTown      SampleCounty
MR A SAMPLE        123 SAMPLE ROAD      SAMPLE TOWN     SAMPLE COUNTY
Mr Alan Sample     123 Sample Rd        Sample Town     Sample County
Alan Sample        123 Sample Road      Sample Town     Sample County
A Sample           123, Sample Rd       SAmple Town     Sample CouNty

MATCHES
     0                  2                    3               3

Because of these discrepancies professional database matching software is very sophisticated and can be very expensive. Most of this software award scores for how close records get to being a perfect match to eachother. Working for a direct mailing house, where we do a lot of record comparisons, we can choose at which score level we decide to either delete or keep almost matching records. At low scores we delete less records but may not catch mail to the same address, hence people sometimes get two or more letters addressed to very slightly addresses but arriving at the same house. Incidentally, this is usually the fault of the suppliers of the database and not us. When we do something like a Mail Preference Service (MPS) suppression where people have specifically asked not to receive direct mail we keep the matching score fairly high, this means that people in the MPS suppression database will not receive the mailing at the risk of leaving people who may like to get it out.

Returning to our three example records, one way to improve the matching process is to strip out any characters that is not A-Z or 0-9. Matching will also improve if the letters are all cased the same. Applying this technique to the above records they now become :-

Name               Address Line 1       Address Line 2  Address Line 3
MRABSAMPLE         123SAMPLEROAD        SAMPLETOWN      SAMPLECOUNTY
MRASAMPLE          123SAMPLEROAD        SAMPLETOWN      SAMPLECOUNTY
MRASAMPLE          123SAMPLEROAD        SAMPLETOWN      SAMPLECOUNTY
MRALANSAMPLE       123SAMPLERD          SAMPLETOWN      SAMPLECOUNTY
ALANSAMPLE         123SAMPLEROAD        SAMPLETOWN      SAMPLECOUNTY
ASAMPLE            123SAMPLERD          SAMPLETOWN      SAMPLECOUNTY

MATCHES
     2                  4                    6               6

This is a little better, but doing a straightforward text comparison will not exactly match all of the fields in all of these records. Now you can see the value of giving scores to the matches. Some matching softwware will also remove the vowels, some will also split the NAME field into fields called TITLE, FORNAMES, INITIALS and SURNAMES. Some can make initials from the forenames as well. Some of this software contains a database of fornames and can make up the title from them.

Title Initial Forename Surname Address Line 1 Address Line 2 Address Line 3
MR    A                SAMPLE  123SMPLRD      SMPLTWN        SMPLCNTY
MR    A                SAMPLE  123SMPLRD      SMPLTWN        SMPLCNTY
MR    A                SAMPLE  123SMPLRD      SMPLTWN        SMPLCNTY
MR    A       ALAN     SAMPLE  123SMPLRD      SMPLTWN        SMPLCNTY
      A       ALAN     SAMPLE  123SMPLRD      SMPLTWN        SMPLCNTY
      A                SAMPLE  123SMPLRD      SMPLTWN        SMPLCNTY

MATCHES
4     6        2         6        6             6                6

To get this far may seem to be a relatively simple process but the work is only just beginning. To cut down on the work that has to be done relatively few matches will be done. We usually use surname, address line 1 and postcode or town. Some clients may want only members of one sex mailed, or only one letter to each household, so further comparisons have to be done. When deduping every record in the database has to be compared against every other record, when doing suppressions every record in every database used has to be compared. To compare any one field in the four records of our pretend database above SIX seperate comparisons have to be done, imagine how many comparisons have to be done on a database of several thousand records. For this reason most databases, Access, FoxPro etc allow the use of key indexes. Indexes are a type of sort on the records so a large number of comparisons can be bypassed. Indexing or sorting a database makes comparisons very much quicker; there is no need for addresses in Edinburgh to be compared to those in, say, Sheffield. A discussion on sorting techniques can be found on the SORT pages.

The following program, PreMatch.bas, shows how text can be prepared for the text comparison process. A word of caution, this code should NOT be used for ANY commercial application. The reason for the caution is that the seperation of names into their componant parts is fraught with difficulties, getting it wrong will certainly annoy the people in your database and may even move them to sue you. You must also remember that by keeping lists of peoples names and addresses, let alone any other details about them, makes you subject to the Data Protection Act.

DECLARE SUB PrintFlds ()
DECLARE SUB TitleCase (Text$)
DECLARE SUB Strip (Text$)
DECLARE SUB MoreStrip (Text$)

'PreMatch.bas   Ray Thomas      August 2000

'*** A program to get text into a form suitable for matching
'*** To simplfy the program I will assume no spaces between hyphenated surnames
'*** It will also contain a "contact" field and a salutation

TYPE InRec
   Title AS STRING * 5
   Initial AS STRING * 5
   Surname AS STRING * 20
   Fullname AS STRING * 20
   Add1 AS STRING * 20
   Add2 AS STRING * 20
   Add3 AS STRING * 20
   Salut AS STRING * 30
   Contact AS STRING * 30
   MName AS STRING * 20
   MAdd1 AS STRING * 20
   MAdd2 AS STRING * 20
   MAdd3 AS STRING * 20
END TYPE

DIM SHARED Record(9) AS InRec   'An array to hold the records
DIM TitleComp(6) AS STRING      'Array to compare titles
DIM Counter AS INTEGER          'A counter for loop counts
DIM Count AS INTEGER            'A counter for loop counts
DIM Number AS INTEGER           'General numeric variable

'*** When defining type arrays they are initialised to NULLS
'*** to make text comparisons easier we are going to fill them with spaces

FOR Counter = 1 TO UBOUND(Record) - 1
   Record(Counter).Title = SPACE$(LEN(Record(Counter).Title))
   Record(Counter).Initial = SPACE$(LEN(Record(Counter).Initial))
   Record(Counter).Surname = SPACE$(LEN(Record(Counter).Surname))
   Record(Counter).Fullname = SPACE$(LEN(Record(Counter).Fullname))
   Record(Counter).Add1 = SPACE$(LEN(Record(Counter).Add1))
   Record(Counter).Add2 = SPACE$(LEN(Record(Counter).Add2))
   Record(Counter).Add3 = SPACE$(LEN(Record(Counter).Add3))
   Record(Counter).Salut = SPACE$(LEN(Record(Counter).Salut))
   Record(Counter).Contact = SPACE$(LEN(Record(Counter).Contact))
   Record(Counter).MName = SPACE$(LEN(Record(Counter).MName))
   Record(Counter).MAdd1 = SPACE$(LEN(Record(Counter).MAdd1))
   Record(Counter).MAdd2 = SPACE$(LEN(Record(Counter).MAdd2))
   Record(Counter).MAdd3 = SPACE$(LEN(Record(Counter).MAdd3))
NEXT Counter

'***  Define some records  ***
DATA Mr A B Sample,123 Sample Road,Sample Town,Sample County
DATA Mr A Sample,"123, Sample Road",SampleTown,SampleCounty
DATA MR A SAMPLE,123 SAMPLE ROAD,SAMPLE TOWN,SAMPLE COUNTY
DATA Mr Alan Sample,123 Sample Rd,Sample Town,Sample County
DATA Alan Sample,"123, Sample Rd",SAmple Town,Sample CouNty
DATA A Sample,123 Sample Road,Sample Town,Sample County
DATA Mr Sample,123 Sample Road,Sample Town,Sample County
DATA Mr Alpha-Sample,123 Sample Road, Sample Town, Sample County
DATA Mr A    B    Sample,123  Sample  Road,Sample  Town,Sample  County

'*** Fill the record array and titlecase the fields ***

FOR Counter = 0 TO UBOUND(Record) - 1
   READ Record(Counter).Fullname
   TitleCase Record(Counter).Fullname
   READ Record(Counter).Add1
   TitleCase Record(Counter).Add1
   READ Record(Counter).Add2
   TitleCase Record(Counter).Add2
   READ Record(Counter).Add3
   TitleCase Record(Counter).Add3
NEXT Counter

'***  Define a simple titles array  ***
'***  Imagine how many other variations may be found  ***

DATA Mr,Miss,Mrs,Ms,Doctor,Dr
FOR Counter = 0 TO UBOUND(TitleComp$) - 1
   READ TitleComp$(Counter)
NEXT Counter

'*** Strip out any "odd" characters

FOR Counter = 0 TO UBOUND(Record) - 1
   Strip Record(Counter).Fullname
   Strip Record(Counter).Add1
   Strip Record(Counter).Add2
   Strip Record(Counter).Add3
NEXT Counter

'*** Split the fullname field into title, first inital and surname ***
'*** title first ***

FOR Counter = 0 TO UBOUND(Record) - 1
   Number = INSTR(Record(Counter).Fullname, " ")
   Text$ = LEFT$(Record(Counter).Fullname, Number - 1)
   FOR Count = 0 TO UBOUND(TitleComp$) - 1
      IF UCASE$(Text$) = UCASE$(TitleComp$(Count)) THEN
         Record(Counter).Title = Text$
         EndTitle = LEN(TitleComp$(Count)) + 2
         EXIT FOR
      END IF
      EndTitle = 1
   NEXT Count
'*** To work out where the surname is we work backwards through the field
'*** The first space after the end spaces must be the start of the surname
       
   LastChar$ = " "
   FOR Count = LEN(Record(Counter).Fullname) TO 1 STEP -1
      LastChar$ = MID$(Record(Counter).Fullname, Count, 1)
      IF LastChar$ <> " " THEN
          Number = Count
          EXIT FOR
      END IF
   NEXT Count
   FOR Count = 1 TO Number
      CharPosn = INSTR(Count, Record(Counter).Fullname, " ")
      IF CharPosn >= Number THEN
	 Num = LEN(Record(Counter).Fullname) - Count
         Record(Counter).Surname = MID$(Record(Counter).Fullname, Count, Num)
         StartSur = Count
         EXIT FOR
      END If
   NEXT Count

'*** Get the initial
       
   IF EndTitle <> StartSur THEN
      Record(Counter).Initial = MID$(Record(Counter).Fullname, EndTitle, 1)
   END IF
NEXT Counter

'*** Create the "Contact" field
Text$ = ""
FOR Counter = 0 TO UBOUND(Record) - 1
   Text$ = RTRIM$(Record(Counter).Title) + " "
   IF Text$ = " " THEN
      Text$ = RTRIM$(Record(Counter).Initial) + " "
   ELSE
      Text$ = Text$ + RTRIM$(Record(Counter).Initial) + " "
   END IF
   Text$ = Text$ + RTRIM$(Record(Counter).Surname)
   Record(Counter).Contact = Text$
NEXT Counter

'*** Create the Salutation field
FOR Counter = 0 TO UBOUND(Record) - 1
   Text$ = "Dear " + Record(Counter).Title
   Text$ = RTRIM$(Text$) + " " + Record(Counter).Surname
   IF LEFT$(Record(Counter).Title, 1) = " " THEN
      Text$ = "Dear Sir/Madam"
   END IF
   Record(Counter).Salut = Text$
NEXT Counter

'***  Now the text is in sort of natural casing, create the matching fields

FOR Counter = 0 TO UBOUND(Record) - 1
   Record(Counter).MName = UCASE$(Record(Counter).Surname)
   Record(Counter).MAdd1 = UCASE$(Record(Counter).Add1)
   Record(Counter).MAdd2 = UCASE$(Record(Counter).Add2)
   Record(Counter).MAdd3 = UCASE$(Record(Counter).Add3)
NEXT Counter

'*** Process the matching fields
'*** This involves stripping spaces
FOR Counter = 0 TO UBOUND(Record) - 1
   MoreStrip Record(Counter).MName
   MoreStrip Record(Counter).MAdd1
   MoreStrip Record(Counter).MAdd2
   MoreStrip Record(Counter).MAdd3
NEXT Counter

PrintFlds

END

SUB MoreStrip (Text$)
NewText$ = ""
Count = 0
   FOR Counter = 1 TO LEN(Text$)
      Char$ = MID$(Text$, Counter, 1)
      IF INSTR("AEIOU ", UCASE$(Char$)) > 0 THEN
         Char$ = ""
         Count = Count + 1
      END IF
'*** For some reason commas need to be taken out specificaly
      IF Char$ = "," THEN
         Char$ = ""
         Count = Count + 1
      END IF
      NewText$ = NewText$ + Char$
   NEXT Counter
   Text$ = NewText$ + SPACE$(Count)
END SUB

SUB PrintFlds
CLS
FOR Count = 0 TO UBOUND(Record) - 1
   PRINT Record(Count).Title;
   PRINT Record(Count).Initial;
   PRINT Record(Count).Surname
   PRINT Record(Count).Fullname;
   PRINT Record(Count).Salut;
   PRINT Record(Count).Contact
   PRINT Record(Count).Add1;
   PRINT Record(Count).Add2;
   PRINT Record(Count).Add3
   PRINT Record(Count).MName;
   PRINT Record(Count).MAdd1;
   PRINT Record(Count).MAdd2;
   PRINT Record(Count).MAdd3
NEXT Count
END SUB

SUB Strip (Text$)
NewText$ = ""
Count = 0
FOR Counter = 1 TO LEN(Text$)
   Char$ = MID$(Text$, Counter, 1)
   IF INSTR("ABCDEFGHIJKLMNOPQRSTUVWXYX 0123456789-", UCASE$(Char$)) < 1 THEN
      Char$ = ""
      Count = Count + 1
   END IF
   NewText$ = NewText$ + Char$
NEXT Counter
Text$ = NewText$ + SPACE$(Count)
END SUB

SUB TitleCase (Text$)

'*** This subroutine title cases text
'*** Split the text into words
'*** The initial letter of each word is uppercased
'*** The rest is lower cased
'*** While we're looking for spaces strip out extra ones

FinalText$ = ""
OldNum = 1
Number = 0
DO
   Number = INSTR(OldNum, Text$, " ")
   IF Number < 1 THEN EXIT DO
   NewText$ = MID$(Text$, OldNum, Number - OldNum)
   IF NewText$ > " " THEN
      InitText$ = UCASE$(LEFT$(NewText$, 1))
      RestText$ = LCASE$(MID$(NewText$, 2, Number - OldNum))
      FinalText$ = FinalText$ + InitText$) + RestText$) + SPACE$(1)
   END IF
   OldNum = Number + 1
LOOP UNTIL Number < 1

'*** Special case for hyphenated names
Number = INSTR(FinalText$, "-")
IF Number > 0 THEN
   MID$(FinalText$, Number + 1, 1) = UCASE$(MID$(FinalText$, Number + 1, 1))
END IF

Text$ = FinalText$
END SUB

If you look at the program you'll see it is just a text manipulation program that produces fields that are more suitable for text matching than simply casing the text. Once the matching fields have been produced they can be matched against eachother. The program also produces Salutation and Contact fields. When testing these sort of programs it is important that all possible forms of the data the program is required to deal with is tested, hence the 9 different forms of the A B Sample data.

Soundex

There is another way that text can be matched. This uses an algorithim that, strange as it seems, matches text by the way it is pronounced. The method is called SOUNDEX.

There are several variations of the Soundex algorithim and they give slightly different results. Here is a list of the common rules of creating a soundex algorithim :-

Double consonants are reduced to one letter
Adjacent letters that belong to the same group in the table below are reduced to one letter
Letters separated by H or W are treated as though the H or W are not there
The letters are substituted for the following numbers

1 for B, F, P and Y
2 for C, G, J, K, Q, S, X and Z
3 for D and T
4 for L
5 for M and N
6 for R

Vowels are removed, except if it is the first letter
The letter Y is removed, except if is the first letter
Shorten the coded word to 4 characters
If the code produced is less than 4 characters long then make it 4 by right padding with 0's

Odd though it seems this system really does work, the US Government uses it to index its census information and several genealogy pages index their databases with it. Several of these Soundex calculators can be found on the net.

Take the words BROADCAST, HELLRAISING, BARE and BEAR
Removing H and W gives BROADCAST, HELLRAISING, BARE and BEAR
Removing double consonants gives BROADCAST, HELRAISING, BARE and BEAR
Removing the vowels gives BRDCST, HLRSNG, BR and BR
Removing Y gives BRDCST, HLRSNG, BR and BR
Removing letters of the same group gives BRDCT, HLRSNG, BR and BR
Shortening the word to four characters gives BRDC, HLRS, BR and BR
Substitution gives B632, H462, B6 and B6
Padding gives B632, H462, B600 and B600

Using this method the surnames PAINE, PAIN, PANE, PAYN, PAYNE etc all give P500

When calculating Soundex's for surnames some programs will ignore prefixes such as Von, de, D', le, du, van etc. Most, however keep Mc, Mac etc.

'Soundex.bas    Ray Thomas      August 2000

'A simple Soundex calculator

'*** Get the word to calculate the soundex code for

CLS
LOCATE 5, 10
PRINT "Please type the word to create the soundex code for"
LOCATE 7, 20
LINE INPUT ; UserWord$

'*** To make processing simpler uppercase the input word
NewText$ = UCASE$(UserWord$)

'*** Keep the first letter for later
InitLet$ = LEFT$(NewText$, 1)

'*** Substitute the letters for numbers

TestWord$ = ""
FOR Counter = 1 TO LEN(NewText$)
        Char$ = MID$(NewText$, Counter, 1)
        IF INSTR("BFPV", Char$) THEN Char$ = "1"
        IF INSTR("CGJKQSXZ", Char$) THEN Char$ = "2"
        IF INSTR("DT", Char$) THEN Char$ = "3"
        IF INSTR("L", Char$) THEN Char$ = "4"
        IF INSTR("MN", Char$) THEN Char$ = "5"
        IF INSTR("R", Char$) THEN Char$ = "6"
        TestWord$ = TestWord$ + Char$
NEXT Counter

' *** Strip out the letters H and W
NewText$ = ""
FOR Counter = 1 TO LEN(TestWord$)
        Char$ = MID$(TestWord$, Counter, 1)
        Number = INSTR("HW", Char$)
        IF Number < 1 THEN NewText$ = NewText$ + Char$
NEXT Counter

'*** Check that no two adjacent codes are the same

TestWord$ = ""
FOR Counter = 1 TO LEN(NewText$)
        Char$ = MID$(NewText$, Counter, 1)
        IF Char$ <> RIGHT$(TestWord$, 1) THEN
                TestWord$ = TestWord$ + Char$
        END IF
NEXT Counter

'*** Strip out any non-alphabetic characters, vowels and Y

NewText$ = ""
FOR Counter = 1 TO LEN(TestWord$)
        Char$ = MID$(TestWord$, Counter, 1)
        Number = INSTR("123456", Char$)
        IF Number > 0 THEN NewText$ = NewText$ + Char$
NEXT Counter

'*** Create the final code

Number = INSTR("BCDFGJKLMNPQRSTVXZ", InitLet$)
IF Number > 0 THEN
        Soundex$ = InitLet$ + MID$(NewText$, 2, 3)
ELSE
        Soundex$ = InitLet$ + MID$(NewText$, 1, 3)
END IF

'*** If less than four characters pad right with "0"s
IF LEN(Soundex$) < 4 THEN
        Soundex$ = Soundex$ + STRING$(4 - LEN(Soundex$), "0")
END IF

LOCATE 10, 10
PRINT "The Soundex code for this word is :-"
LOCATE 12, 30
PRINT Soundex$
END

QBasic | Errors | 40lb Weight | Bits | Chance | Colours | Dates | Delays | File Dialog | Files | Input | Matching | Menus | Mouse | Numbers | SeqNo | SIRDS | Sorts | Text | Timer | DLoads

HomePage | Optical Illusions | War Stories | QBasic | Dads Navy Days | Bristol | Bristol, USA | Bristol, Canada | Terre Haute | Miscellany | Web Stuff | About Ray | Site Map | Site Search | Messages | Credits | Links | Web Rings