Duplicate Detection with GenAI. How using LLMs and GenAI techniques can… | by Ian Ormesher

How using LLMs and GenAI techniques can improve de-duplication

2D UMAP Musicbrainz 200K nearest neighbour plot

Customer data is often stored as records in Customer Relationship Management systems (CRMs). Data which is manually entered into such systems by one or more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there is no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasingly complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. But it is possible to use the latest advancements in Large Language Models and Generative AI to vastly improve the identification and repair of duplicated records. On common benchmark datasets I found that de-duplication accuracy improved from 30 percent using NLP techniques to almost 60 percent using my proposed method.

I want to explain the technique here in the hope that others will find it helpful and use it for their own de-duplication needs. It is useful in any scenario where you wish to identify duplicate records, not just for customer data. I also wrote and published a research paper about this, which you can view on arXiv if you want to go into more depth.

The task of identifying duplicate records is often done by pairwise record comparisons and is referred to as “Entity Matching” (EM). Typical steps of this process would be:

Data Preparation
Candidate Generation
Blocking
Matching
Clustering

Data Preparation

Data preparation is the cleaning of the data and involves such things as removing non-ASCII characters, normalising capitalisation and tokenising the text. This is an important and necessary step for the NLP matching algorithms later in the process, which don’t work well with mixed case or non-ASCII characters.
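As a rough illustration of this kind of cleaning, here is a minimal sketch using only the Python standard library. The function name `prepare` and the sample string are hypothetical, and the choice to transliterate accented letters to their ASCII base letter (rather than dropping them outright) is my assumption, not something the traditional pipeline prescribes.

```python
import re
import unicodedata


def prepare(text: str) -> list[str]:
    """Clean one field value: reduce to ASCII, lowercase, tokenise."""
    # Decompose accented characters so the base letter survives ASCII encoding
    # (assumption: we prefer "Cafe" over "Caf" for "Café").
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Normalise capitalisation so "ACME" and "Acme" compare equal.
    lowered = ascii_only.lower()
    # Tokenise: keep runs of letters and digits, drop punctuation and spaces.
    return re.findall(r"[a-z0-9]+", lowered)


tokens = prepare("Café Müller, 12 High St.")
print(tokens)
```

Real pipelines often add further rules here, such as expanding abbreviations or standardising phone numbers.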

Candidate Generation

In the usual EM method, we would produce candidate records by combining all the records in the table with themselves to produce a Cartesian product. You would then remove all combinations of a row with itself. For many of the NLP matching algorithms, comparing row A with row B is equivalent to comparing row B with row A, and in those cases you can get away with keeping just one of the two pairs. But even after this, you’re still left with a lot of candidate records. To reduce this number, a technique called “blocking” is often used.
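To make the size of the candidate set concrete, here is a small sketch with a made-up table of four records (the field names and values are hypothetical). `itertools.combinations` never pairs a row with itself and yields each unordered pair once, which is the reduced candidate set described above.

```python
from itertools import combinations

# Hypothetical toy table; real CRM tables are far larger.
records = [
    {"id": 1, "name": "Acme Ltd", "city": "Leeds"},
    {"id": 2, "name": "ACME Limited", "city": "Leeds"},
    {"id": 3, "name": "Acme Ltd", "city": "York"},
    {"id": 4, "name": "Bolt & Co", "city": "York"},
]


def candidate_pairs(rows):
    """Return every unordered pair of distinct rows."""
    # No (A, A) pairs, and (A, B) appears without its mirror (B, A).
    return list(combinations(rows, 2))


pairs = candidate_pairs(records)
print(len(pairs))  # n * (n - 1) / 2 = 4 * 3 / 2 = 6
```

The count grows quadratically: n records give n(n - 1)/2 pairs, so 200,000 records would give roughly 20 billion pairs.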

Blocking

The idea of blocking is to eliminate those pairs of records that we know could not be duplicates of each other because they have different values for the “blocked” column. As an example, if we were considering customer records, a potential column to block on could be something like “City”. This is because we know that even if all the other details of the record are similar enough, they cannot be the same customer if they’re located in different cities. Once we have generated our candidate records, we then use blocking to eliminate those pairs that have different values for the blocked column.
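A minimal sketch of blocking on a single column follows. The toy records, the `city` field and the function name `blocked_pairs` are all hypothetical. It follows the order described above: generate the candidate pairs, then drop those whose blocked column differs.

```python
from itertools import combinations

# Hypothetical toy table.
records = [
    {"id": 1, "name": "Acme Ltd", "city": "Leeds"},
    {"id": 2, "name": "ACME Limited", "city": "Leeds"},
    {"id": 3, "name": "Acme Ltd", "city": "York"},
    {"id": 4, "name": "Bolt & Co", "city": "York"},
]


def blocked_pairs(rows, block_key):
    """Keep only the candidate pairs that agree on the blocked column."""
    return [
        (a, b)
        for a, b in combinations(rows, 2)
        if a[block_key] == b[block_key]
    ]


kept = blocked_pairs(records, "city")
print([(a["id"], b["id"]) for a, b in kept])  # [(1, 2), (3, 4)]
```

On a large table you would more likely group the rows by the blocked value first and only pair rows within each group, which gives the same result without materialising the full candidate set.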

Matching

Following on from blocking, we now examine all the remaining candidate pairs and calculate traditional NLP similarity metrics on the attribute values of the two rows in each pair. Using these metrics, we can classify each pair as a potential match or a non-match.
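The exact similarity metrics vary between EM tools, so the following is only a sketch of the general shape. It uses `difflib.SequenceMatcher` from the standard library as a stand-in string similarity, averages the score across fields, and applies a threshold. The 0.8 threshold, the field names and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(row_a, row_b, fields, threshold=0.8):
    """Average the per-field similarities and compare to a threshold."""
    scores = [field_similarity(row_a[f], row_b[f]) for f in fields]
    return sum(scores) / len(scores) >= threshold


# Hypothetical rows for illustration.
a = {"name": "Acme Ltd", "city": "Leeds"}
b = {"name": "ACME Ltd", "city": "leeds"}
c = {"name": "Bolt & Co", "city": "York"}

print(is_match(a, b, ["name", "city"]))  # True: identical once case is ignored
print(is_match(a, c, ["name", "city"]))  # False: the fields share almost nothing
```

In practice each field usually gets a metric suited to its type, and the threshold is tuned against labelled pairs.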

Clustering

Now that we have a list of candidate records that match, we can then group them into clusters.
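One simple way to do this grouping, shown here as an assumption rather than as the method any particular EM tool uses, is to treat each matched pair as an edge and take the connected components: if A matches B and B matches C, all three land in one cluster. The sketch below uses a small union-find structure, and the record ids are made up.

```python
def cluster(ids, matched_pairs):
    """Group record ids into clusters via connected components."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two groups of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each root into a list.
    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


clusters = cluster([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
print(clusters)  # [[1, 2, 3], [4], [5]]
```

Records that matched nothing end up as single-member clusters, which is what you would expect for unique customers.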

There are several steps to the proposed method, but the most important thing to note is that we no longer need to perform the “Data Preparation” or “Candidate Generation” steps of the traditional methods. The new steps become:

Create Match Sentences
Create Embedding Vectors of those Match Sentences
Clustering

Create Match Sentences

First, a “Match Sentence” is created by concatenating the attributes we are interested in and separating them with spaces. As an example, let’s say we have a customer record which looks like this:

Duplicate Detection with GenAI. How using LLMs and GenAI techniques can… | by Ian Ormesher | Jul, 2024
