HomePage | Optical Illusions | War Stories | QBasic | Dads Navy Days | Bristol | Bristol, USA | Bristol, Canada | Terre Haute | Miscellany | Web Stuff | About Ray | Site Map | Site Search | Messages | Credits | Links | Web Rings
QBasic | Errors | 40lb Weight | Bits | Chance | Colours | Dates | Delays | File Dialog | Files | Input | Matching | Menus | Mouse | Numbers | SeqNo | SIRDS | Sorts | Text | Timer | DLoads
Text matching techniques
When using databases one of the most common proceedures is comparing records, either for deduplication or suppression. Deduplication is the process of removing identical records from a database. Suppression uses two or more databases, records found in one or more databases are removed from the main database, this is also known as purging. Another, similar, process is known as merging. In this process records in the suppression database(s) are compared against the main data records. Those records that are not already contained within the main database are added to it.
Searching for identical text is relatively easy, but problems arise when matching on names and addresses. For example identical text matching would not match the following records :-
Name Address Line 1 Address Line 2 Address Line 3 Mr. A B Sample 123 Sample Road Sample Town Sample County Mr A Sample 123, Sample Road SampleTown SampleCounty MR A SAMPLE 123 SAMPLE ROAD SAMPLE TOWN SAMPLE COUNTY Mr Alan Sample 123 Sample Rd Sample Town Sample County Alan Sample 123 Sample Road Sample Town Sample County A Sample 123, Sample Rd SAmple Town Sample CouNty MATCHES 0 2 3 3
Because of these discrepancies professional database matching software is very sophisticated and can be very expensive. Most of this software award scores for how close records get to being a perfect match to eachother. Working for a direct mailing house, where we do a lot of record comparisons, we can choose at which score level we decide to either delete or keep almost matching records. At low scores we delete less records but may not catch mail to the same address, hence people sometimes get two or more letters addressed to very slightly addresses but arriving at the same house. Incidentally, this is usually the fault of the suppliers of the database and not us. When we do something like a Mail Preference Service (MPS) suppression where people have specifically asked not to receive direct mail we keep the matching score fairly high, this means that people in the MPS suppression database will not receive the mailing at the risk of leaving people who may like to get it out.
Returning to our three example records, one way to improve the matching process is to strip out any characters that is not A-Z or 0-9. Matching will also improve if the letters are all cased the same. Applying this technique to the above records they now become :-
Name Address Line 1 Address Line 2 Address Line 3 MRABSAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY MRASAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY MRASAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY MRALANSAMPLE 123SAMPLERD SAMPLETOWN SAMPLECOUNTY ALANSAMPLE 123SAMPLEROAD SAMPLETOWN SAMPLECOUNTY ASAMPLE 123SAMPLERD SAMPLETOWN SAMPLECOUNTY MATCHES 2 4 6 6
This is a little better, but doing a straightforward text comparison will not exactly match all of the fields in all of these records. Now you can see the value of giving scores to the matches. Some matching softwware will also remove the vowels, some will also split the NAME field into fields called TITLE, FORNAMES, INITIALS and SURNAMES. Some can make initials from the forenames as well. Some of this software contains a database of fornames and can make up the title from them.
Title Initial Forename Surname Address Line 1 Address Line 2 Address Line 3 MR A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY MR A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY MR A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY MR A ALAN SAMPLE 123SMPLRD SMPLTWN SMPLCNTY A ALAN SAMPLE 123SMPLRD SMPLTWN SMPLCNTY A SAMPLE 123SMPLRD SMPLTWN SMPLCNTY MATCHES 4 6 2 6 6 6 6
To get this far may seem to be a relatively simple process but the work is only just beginning. To cut down on the work that has to be done relatively few matches will be done. We usually use surname, address line 1 and postcode or town. Some clients may want only members of one sex mailed, or only one letter to each household, so further comparisons have to be done. When deduping every record in the database has to be compared against every other record, when doing suppressions every record in every database used has to be compared. To compare any one field in the four records of our pretend database above SIX seperate comparisons have to be done, imagine how many comparisons have to be done on a database of several thousand records. For this reason most databases, Access, FoxPro etc allow the use of key indexes. Indexes are a type of sort on the records so a large number of comparisons can be bypassed. Indexing or sorting a database makes comparisons very much quicker; there is no need for addresses in Edinburgh to be compared to those in, say, Sheffield. A discussion on sorting techniques can be found on the SORT pages.
The following program, PreMatch.bas, shows how text can be prepared for the text comparison process. A word of caution, this code should NOT be used for ANY commercial application. The reason for the caution is that the seperation of names into their componant parts is fraught with difficulties, getting it wrong will certainly annoy the people in your database and may even move them to sue you. You must also remember that by keeping lists of peoples names and addresses, let alone any other details about them, makes you subject to the Data Protection Act.
DECLARE SUB PrintFlds () DECLARE SUB TitleCase (Text$) DECLARE SUB Strip (Text$) DECLARE SUB MoreStrip (Text$) 'PreMatch.bas Ray Thomas August 2000 '*** A program to get text into a form suitable for matching '*** To simplfy the program I will assume no spaces between hyphenated surnames '*** It will also contain a "contact" field and a salutation TYPE InRec Title AS STRING * 5 Initial AS STRING * 5 Surname AS STRING * 20 Fullname AS STRING * 20 Add1 AS STRING * 20 Add2 AS STRING * 20 Add3 AS STRING * 20 Salut AS STRING * 30 Contact AS STRING * 30 MName AS STRING * 20 MAdd1 AS STRING * 20 MAdd2 AS STRING * 20 MAdd3 AS STRING * 20 END TYPE DIM SHARED Record(9) AS InRec 'An array to hold the records DIM TitleComp(6) AS STRING 'Array to compare titles DIM Counter AS INTEGER 'A counter for loop counts DIM Count AS INTEGER 'A counter for loop counts DIM Number AS INTEGER 'General numeric variable '*** When defining type arrays they are initialised to NULLS '*** to make text comparisons easier we are going to fill them with spaces FOR Counter = 1 TO UBOUND(Record) - 1 Record(Counter).Title = SPACE$(LEN(Record(Counter).Title)) Record(Counter).Initial = SPACE$(LEN(Record(Counter).Initial)) Record(Counter).Surname = SPACE$(LEN(Record(Counter).Surname)) Record(Counter).Fullname = SPACE$(LEN(Record(Counter).Fullname)) Record(Counter).Add1 = SPACE$(LEN(Record(Counter).Add1)) Record(Counter).Add2 = SPACE$(LEN(Record(Counter).Add2)) Record(Counter).Add3 = SPACE$(LEN(Record(Counter).Add3)) Record(Counter).Salut = SPACE$(LEN(Record(Counter).Salut)) Record(Counter).Contact = SPACE$(LEN(Record(Counter).Contact)) Record(Counter).MName = SPACE$(LEN(Record(Counter).MName)) Record(Counter).MAdd1 = SPACE$(LEN(Record(Counter).MAdd1)) Record(Counter).MAdd2 = SPACE$(LEN(Record(Counter).MAdd2)) Record(Counter).MAdd3 = SPACE$(LEN(Record(Counter).MAdd3)) NEXT Counter '*** Define some records *** DATA Mr A B Sample,123 Sample Road,Sample Town,Sample County DATA Mr A Sample,"123, Sample Road",SampleTown,SampleCounty DATA MR A SAMPLE,123 SAMPLE ROAD,SAMPLE TOWN,SAMPLE COUNTY DATA Mr Alan Sample,123 Sample Rd,Sample Town,Sample County DATA Alan Sample,"123, Sample Rd",SAmple Town,Sample CouNty DATA A Sample,123 Sample Road,Sample Town,Sample County DATA Mr Sample,123 Sample Road,Sample Town,Sample County DATA Mr Alpha-Sample,123 Sample Road, Sample Town, Sample County DATA Mr A B Sample,123 Sample Road,Sample Town,Sample County '*** Fill the record array and titlecase the fields *** FOR Counter = 0 TO UBOUND(Record) - 1 READ Record(Counter).Fullname TitleCase Record(Counter).Fullname READ Record(Counter).Add1 TitleCase Record(Counter).Add1 READ Record(Counter).Add2 TitleCase Record(Counter).Add2 READ Record(Counter).Add3 TitleCase Record(Counter).Add3 NEXT Counter '*** Define a simple titles array *** '*** Imagine how many other variations may be found *** DATA Mr,Miss,Mrs,Ms,Doctor,Dr FOR Counter = 0 TO UBOUND(TitleComp$) - 1 READ TitleComp$(Counter) NEXT Counter '*** Strip out any "odd" characters FOR Counter = 0 TO UBOUND(Record) - 1 Strip Record(Counter).Fullname Strip Record(Counter).Add1 Strip Record(Counter).Add2 Strip Record(Counter).Add3 NEXT Counter '*** Split the fullname field into title, first inital and surname *** '*** title first *** FOR Counter = 0 TO UBOUND(Record) - 1 Number = INSTR(Record(Counter).Fullname, " ") Text$ = LEFT$(Record(Counter).Fullname, Number - 1) FOR Count = 0 TO UBOUND(TitleComp$) - 1 IF UCASE$(Text$) = UCASE$(TitleComp$(Count)) THEN Record(Counter).Title = Text$ EndTitle = LEN(TitleComp$(Count)) + 2 EXIT FOR END IF EndTitle = 1 NEXT Count '*** To work out where the surname is we work backwards through the field '*** The first space after the end spaces must be the start of the surname LastChar$ = " " FOR Count = LEN(Record(Counter).Fullname) TO 1 STEP -1 LastChar$ = MID$(Record(Counter).Fullname, Count, 1) IF LastChar$ <> " " THEN Number = Count EXIT FOR END IF NEXT Count FOR Count = 1 TO Number CharPosn = INSTR(Count, Record(Counter).Fullname, " ") IF CharPosn >= Number THEN Num = LEN(Record(Counter).Fullname) - Count Record(Counter).Surname = MID$(Record(Counter).Fullname, Count, Num) StartSur = Count EXIT FOR END If NEXT Count '*** Get the initial IF EndTitle <> StartSur THEN Record(Counter).Initial = MID$(Record(Counter).Fullname, EndTitle, 1) END IF NEXT Counter '*** Create the "Contact" field Text$ = "" FOR Counter = 0 TO UBOUND(Record) - 1 Text$ = RTRIM$(Record(Counter).Title) + " " IF Text$ = " " THEN Text$ = RTRIM$(Record(Counter).Initial) + " " ELSE Text$ = Text$ + RTRIM$(Record(Counter).Initial) + " " END IF Text$ = Text$ + RTRIM$(Record(Counter).Surname) Record(Counter).Contact = Text$ NEXT Counter '*** Create the Salutation field FOR Counter = 0 TO UBOUND(Record) - 1 Text$ = "Dear " + Record(Counter).Title Text$ = RTRIM$(Text$) + " " + Record(Counter).Surname IF LEFT$(Record(Counter).Title, 1) = " " THEN Text$ = "Dear Sir/Madam" END IF Record(Counter).Salut = Text$ NEXT Counter '*** Now the text is in sort of natural casing, create the matching fields FOR Counter = 0 TO UBOUND(Record) - 1 Record(Counter).MName = UCASE$(Record(Counter).Surname) Record(Counter).MAdd1 = UCASE$(Record(Counter).Add1) Record(Counter).MAdd2 = UCASE$(Record(Counter).Add2) Record(Counter).MAdd3 = UCASE$(Record(Counter).Add3) NEXT Counter '*** Process the matching fields '*** This involves stripping spaces FOR Counter = 0 TO UBOUND(Record) - 1 MoreStrip Record(Counter).MName MoreStrip Record(Counter).MAdd1 MoreStrip Record(Counter).MAdd2 MoreStrip Record(Counter).MAdd3 NEXT Counter PrintFlds END SUB MoreStrip (Text$) NewText$ = "" Count = 0 FOR Counter = 1 TO LEN(Text$) Char$ = MID$(Text$, Counter, 1) IF INSTR("AEIOU ", UCASE$(Char$)) > 0 THEN Char$ = "" Count = Count + 1 END IF '*** For some reason commas need to be taken out specificaly IF Char$ = "," THEN Char$ = "" Count = Count + 1 END IF NewText$ = NewText$ + Char$ NEXT Counter Text$ = NewText$ + SPACE$(Count) END SUB SUB PrintFlds CLS FOR Count = 0 TO UBOUND(Record) - 1 PRINT Record(Count).Title; PRINT Record(Count).Initial; PRINT Record(Count).Surname PRINT Record(Count).Fullname; PRINT Record(Count).Salut; PRINT Record(Count).Contact PRINT Record(Count).Add1; PRINT Record(Count).Add2; PRINT Record(Count).Add3 PRINT Record(Count).MName; PRINT Record(Count).MAdd1; PRINT Record(Count).MAdd2; PRINT Record(Count).MAdd3 NEXT Count END SUB SUB Strip (Text$) NewText$ = "" Count = 0 FOR Counter = 1 TO LEN(Text$) Char$ = MID$(Text$, Counter, 1) IF INSTR("ABCDEFGHIJKLMNOPQRSTUVWXYX 0123456789-", UCASE$(Char$)) < 1 THEN Char$ = "" Count = Count + 1 END IF NewText$ = NewText$ + Char$ NEXT Counter Text$ = NewText$ + SPACE$(Count) END SUB SUB TitleCase (Text$) '*** This subroutine title cases text '*** Split the text into words '*** The initial letter of each word is uppercased '*** The rest is lower cased '*** While we're looking for spaces strip out extra ones FinalText$ = "" OldNum = 1 Number = 0 DO Number = INSTR(OldNum, Text$, " ") IF Number < 1 THEN EXIT DO NewText$ = MID$(Text$, OldNum, Number - OldNum) IF NewText$ > " " THEN InitText$ = UCASE$(LEFT$(NewText$, 1)) RestText$ = LCASE$(MID$(NewText$, 2, Number - OldNum)) FinalText$ = FinalText$ + InitText$) + RestText$) + SPACE$(1) END IF OldNum = Number + 1 LOOP UNTIL Number < 1 '*** Special case for hyphenated names Number = INSTR(FinalText$, "-") IF Number > 0 THEN MID$(FinalText$, Number + 1, 1) = UCASE$(MID$(FinalText$, Number + 1, 1)) END IF Text$ = FinalText$ END SUB
If you look at the program you'll see it is just a text manipulation program that produces fields that are more suitable for text matching than simply casing the text. Once the matching fields have been produced they can be matched against eachother. The program also produces Salutation and Contact fields. When testing these sort of programs it is important that all possible forms of the data the program is required to deal with is tested, hence the 9 different forms of the A B Sample data.
Soundex
There is another way that text can be matched. This uses an algorithim that, strange as it seems, matches text by the way it is pronounced. The method is called SOUNDEX.
There are several variations of the Soundex algorithim and they give slightly different results. Here is a list of the common rules of creating a soundex algorithim :-
Double consonants are reduced to one letter
Adjacent letters that belong to the same group in the table below are reduced to one letter
Letters separated by H or W are treated as though the H or W are not there
The letters are substituted for the following numbers1 for B, F, P and Y
2 for C, G, J, K, Q, S, X and Z
3 for D and T
4 for L
5 for M and N
6 for RVowels are removed, except if it is the first letter
The letter Y is removed, except if is the first letter
Shorten the coded word to 4 characters
If the code produced is less than 4 characters long then make it 4 by right padding with 0's
Odd though it seems this system really does work, the US Government uses it to index its census information and several genealogy pages index their databases with it. Several of these Soundex calculators can be found on the net.
Take the words BROADCAST, HELLRAISING, BARE and BEAR
Removing H and W gives BROADCAST, HELLRAISING, BARE and BEAR
Removing double consonants gives BROADCAST, HELRAISING, BARE and BEAR
Removing the vowels gives BRDCST, HLRSNG, BR and BR
Removing Y gives BRDCST, HLRSNG, BR and BR
Removing letters of the same group gives BRDCT, HLRSNG, BR and BR
Shortening the word to four characters gives BRDC, HLRS, BR and BR
Substitution gives B632, H462, B6 and B6
Padding gives B632, H462, B600 and B600
Using this method the surnames PAINE, PAIN, PANE, PAYN, PAYNE etc all give P500
When calculating Soundex's for surnames some programs will ignore prefixes such as Von, de, D', le, du, van etc. Most, however keep Mc, Mac etc.
'Soundex.bas Ray Thomas August 2000 'A simple Soundex calculator '*** Get the word to calculate the soundex code for CLS LOCATE 5, 10 PRINT "Please type the word to create the soundex code for" LOCATE 7, 20 LINE INPUT ; UserWord$ '*** To make processing simpler uppercase the input word NewText$ = UCASE$(UserWord$) '*** Keep the first letter for later InitLet$ = LEFT$(NewText$, 1) '*** Substitute the letters for numbers TestWord$ = "" FOR Counter = 1 TO LEN(NewText$) Char$ = MID$(NewText$, Counter, 1) IF INSTR("BFPV", Char$) THEN Char$ = "1" IF INSTR("CGJKQSXZ", Char$) THEN Char$ = "2" IF INSTR("DT", Char$) THEN Char$ = "3" IF INSTR("L", Char$) THEN Char$ = "4" IF INSTR("MN", Char$) THEN Char$ = "5" IF INSTR("R", Char$) THEN Char$ = "6" TestWord$ = TestWord$ + Char$ NEXT Counter ' *** Strip out the letters H and W NewText$ = "" FOR Counter = 1 TO LEN(TestWord$) Char$ = MID$(TestWord$, Counter, 1) Number = INSTR("HW", Char$) IF Number < 1 THEN NewText$ = NewText$ + Char$ NEXT Counter '*** Check that no two adjacent codes are the same TestWord$ = "" FOR Counter = 1 TO LEN(NewText$) Char$ = MID$(NewText$, Counter, 1) IF Char$ <> RIGHT$(TestWord$, 1) THEN TestWord$ = TestWord$ + Char$ END IF NEXT Counter '*** Strip out any non-alphabetic characters, vowels and Y NewText$ = "" FOR Counter = 1 TO LEN(TestWord$) Char$ = MID$(TestWord$, Counter, 1) Number = INSTR("123456", Char$) IF Number > 0 THEN NewText$ = NewText$ + Char$ NEXT Counter '*** Create the final code Number = INSTR("BCDFGJKLMNPQRSTVXZ", InitLet$) IF Number > 0 THEN Soundex$ = InitLet$ + MID$(NewText$, 2, 3) ELSE Soundex$ = InitLet$ + MID$(NewText$, 1, 3) END IF '*** If less than four characters pad right with "0"s IF LEN(Soundex$) < 4 THEN Soundex$ = Soundex$ + STRING$(4 - LEN(Soundex$), "0") END IF LOCATE 10, 10 PRINT "The Soundex code for this word is :-" LOCATE 12, 30 PRINT Soundex$ END
QBasic | Errors | 40lb Weight | Bits | Chance | Colours | Dates | Delays | File Dialog | Files | Input | Matching | Menus | Mouse | Numbers | SeqNo | SIRDS | Sorts | Text | Timer | DLoads
HomePage | Optical Illusions | War Stories | QBasic | Dads Navy Days | Bristol | Bristol, USA | Bristol, Canada | Terre Haute | Miscellany | Web Stuff | About Ray | Site Map | Site Search | Messages | Credits | Links | Web Rings