[AccessD] A2003: Checking for Similar Names and or Addresses
John W. Colby
jwcolby at gmail.com
Wed May 13 23:26:37 CDT 2015
I use SHA1 in SQL Server to generate a hash. I append all of the
address pieces together, then feed the resulting string into the
hashbytes function. Obviously the SP updates a HashFamily and
HashPerson field VarBinary(200) in the same table as the name / address.
ALTER PROCEDURE [dbo].[zzsp_HashFieldsAllAdults_Update]
-- Add the parameters for the stored procedure here
@DBName varchar(50), @TblName varchar(50)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- Insert statements for procedure here
declare @SQL varchar(1000)
BEGIN TRY
SELECT @SQL = 'UPDATE [' + @DBName + '].[dbo].[' + @TblName +
'] ' +
'SET [HashFamily]= hashbytes(''' + 'sha1' +
''', Addr + Zip5 + Zip4 + LName), ' +
'[HashPerson]= hashbytes(''' + 'sha1' + ''',
Addr + Zip5 + Zip4 + LName + FName)'
--WHERE ADDRValid not in(''ANK'',''INV'')'
print @SQL
exec (@SQL)
END TRY
BEGIN CATCH
print 'There was an error UPDATING the hash fields! '
print ERROR_MESSAGE()
END CATCH
END
The resulting hash can be used to dedupe the data since an identical
name/address will create an identical hash. AZData is my table name.
Or just select the dupes using a groupby / count kind of construct.
ALTER PROCEDURE [dbo].[usp_TblAZData_DedupePK]
-- Add the parameters for the stored procedure here
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
declare @RecsAffected int
-- Insert statements for procedure here
--
--Clean out all records with a bad hash
--
DELETE FROM f
FROM [dbo].[AZData] AS f
WHERE HashFamily is null
select @RecsAffected = 1
while (@RecsAffected > 0)
begin
--
--Use the hash number to select identical records, then the PK
(autonumber) to select between identical records.
--And delete the dupes
--
DELETE FROM f
FROM [dbo].[AZData] AS f
INNER JOIN [dbo].[AZData] AS g
ON g.HashPerson = f.HashPerson AND f.pk < g.pk
select @RecsAffected = @@RowCount
print @RecsAffected
print 'success deduping AZData'
end
END
I can actually do this for you if you send me a file with the data. It
needs to have FName, LName, Addr, City, St and Postal Code, or whatever
makes it distinguishable from the next record.
John W. Colby
On 5/13/2015 11:47 PM, Darren wrote:
> Hi Team
> I appreciate this may trigger a right vs. wrong discussion. All comments
> appreciated but this is for a Pro-Bono Project so there are no funds and my
> time is not infinite.
> So a perfect solution is not required. Just a working one :-)
> So...
> I have inherited a dB with approx 12K names and addresses for a community
> project I am working on.
> Data entry has been on home built Access dBs and Excel spreadsheets and data
> veracity is poor.
> I can easily identify nearly 1000 records with same first and last names so
> there appear to be a lot of duplicate entries.
> We will handle that and will sanitise the old data. However does anyone have
> a quick and easy "test" where I can quickly check completed names (First and
> Last) once entered, to see if we have exact or similar matches already in
> the dB?
> I really would like the same thing with addresses?
> I have Googled and I have seen some very complex weighting routines etc -
> Too tricky for me and beyond the scope of this project (and my skill set).
> I just need to present the users with a list of exact matches (I can do that
> now) and some potentials, in a list that they can then decide if it's ok to
> proceed or not.
> Many thanks in advance.
> Darren.
More information about the AccessD
mailing list