[AccessD] A2003: Checking for Similar Names and or Addresses

Wed May 13 23:26:37 CDT 2015

I use SHA1 in SQL Server to generate a hash.  I append all of the 
address pieces together, then feed the resulting string into the 
hashbytes function.  Obviously the SP updates a HashFamily and 
HashPerson field VarBinary(200) in the same table as the name / address.

ALTER PROCEDURE [dbo].[zzsp_HashFieldsAllAdults_Update]
     -- Add the parameters for the stored procedure here
     @DBName varchar(50), @TblName varchar(50)
AS
BEGIN
     -- SET NOCOUNT ON added to prevent extra result sets from
     -- interfering with SELECT statements.
     SET NOCOUNT ON;

     -- Insert statements for procedure here
declare @SQL varchar(1000)

     BEGIN TRY
         SELECT @SQL = 'UPDATE [' + @DBName + '].[dbo].[' + @TblName + 
'] ' +
                         'SET [HashFamily]= hashbytes(''' + 'sha1' + 
''', Addr + Zip5 + Zip4 + LName), ' +
                         '[HashPerson]= hashbytes(''' + 'sha1' + ''', 
Addr + Zip5 + Zip4 + LName + FName)'
                         --WHERE ADDRValid not in(''ANK'',''INV'')'
         print @SQL
         exec (@SQL)
     END TRY
     BEGIN CATCH
         print 'There was an error UPDATING the hash fields! '
         print ERROR_MESSAGE()
     END CATCH

END

The resulting hash can be used to dedupe the data since an identical 
name/address will create an identical hash.  AZData is my table name.  
Or just select the dupes using a groupby / count kind of construct.

ALTER PROCEDURE [dbo].[usp_TblAZData_DedupePK]
     -- Add the parameters for the stored procedure here
AS
BEGIN
     -- SET NOCOUNT ON added to prevent extra result sets from
     -- interfering with SELECT statements.
     SET NOCOUNT ON;
declare    @RecsAffected int

     -- Insert statements for procedure here

     --
     --Clean out all records with a bad hash
     --
         DELETE FROM f
         FROM [dbo].[AZData] AS f
         WHERE HashFamily is null

     select @RecsAffected = 1

     while (@RecsAffected > 0)
     begin
         --
         --Use the hash number to select identical records, then the PK 
(autonumber) to select between identical records.
         --And delete the dupes
         --
             DELETE FROM f
             FROM [dbo].[AZData] AS f
             INNER JOIN [dbo].[AZData] AS g
             ON g.HashPerson = f.HashPerson AND f.pk < g.pk
             select @RecsAffected = @@RowCount

             print @RecsAffected
             print 'success deduping AZData'
     end

END

I can actually do this for you if you send me a file with the data.   It 
needs to have FName, LName, Addr, City, St and Postal Code, or whatever 
makes it distinguishable from the next record.

John W. Colby

On 5/13/2015 11:47 PM, Darren wrote:
> Hi Team
> I appreciate this may trigger a right vs. wrong discussion. All comments
> appreciated but this is for a Pro-Bono Project so there are no funds and my
> time is not infinite.
> So a perfect solution is not required. Just a working one :-)
> So...
> I have inherited a dB with approx 12K names and addresses for a community
> project I am working on.
> Data entry has been on home built Access dBs and Excel spreadsheets and data
> veracity is poor.
> I can easily identify nearly 1000 records with same first and last names so
> there appear to be a lot of duplicate entries.
> We will handle that and will sanitise the old data. However does anyone have
> a quick and easy "test" where I can quickly check completed names (First and
> Last) once entered, to see if we have exact or similar matches already in
> the dB?
> I really would like the same thing with addresses?
> I have Googled and I have seen some very complex weighting routines etc -
> Too tricky for me and beyond the scope of this project (and my skill set).
> I just need to present the users with a list of exact matches (I can do that
> now) and some potentials, in a list that they can then decide if it's ok to
> proceed or not.
> Many thanks in advance.
> Darren.