How does entity resolution work?
Since the invention of databases, companies have faced the problem of eliminating duplicate records and combining multiple records for the same person or entity. Today, technologies that handle simple datasets, like a list of emails or phone contacts, use a straightforward approach to deduplication and combining records.
For example, data deduplication (also called “duplicate detection”) is a technique that compares two data records to decide whether they describe the same entity. The more attributes each record contains, the more accurate the technique becomes. It relies on attributes matching exactly and works best with structured data.
For example, faced with the following two records, data deduplication would be able to determine that they refer to the same person.
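The exact-match approach described above can be sketched in a few lines of Python. This is only a minimal illustration, not any particular product's method: the field names and sample records are hypothetical, and the light normalization (trimming whitespace and lowercasing) is an added assumption.

```python
def record_key(record):
    """Build a hashable key from every attribute.

    Assumption: values are trimmed and lowercased before comparison.
    After that, all attributes must match exactly.
    """
    return tuple(sorted(
        (field, str(value).strip().lower()) for field, value in record.items()
    ))


def deduplicate(records):
    """Keep the first record seen for each distinct key."""
    seen = set()
    unique = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data, for illustration only.
records = [
    {"name": "Jane Doe", "email": "jane.doe@example.com", "phone": "555-0100"},
    {"name": "jane doe", "email": "Jane.Doe@example.com ", "phone": "555-0100"},
    {"name": "John Roe", "email": "john.roe@example.com", "phone": "555-0199"},
]

unique_records = deduplicate(records)
print(len(unique_records))
```

Because every attribute has to agree, a single typo in any field would make two records for the same person look different, which is the limitation the rest of this section addresses.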
Pair matching vs. iterative combination
As datasets have grown larger and more complex, and aggregated data is imported from multiple sources, the possibility of data errors and inconsistencies has increased. Simple duplicate detection can’t tackle these large volumes of data, such as the datasets stored in an enterprise data warehouse.
Rather than matching pairs of records, the entity resolution process uses an iterative approach that compares and combines attributes from multiple records to determine if they represent the same entity. As attributes are compared with and added to each record, the accuracy of the records increases. By employing self-correcting decision-making in real time, entity resolution techniques can convert vast quantities of low-quality data with multiple duplications or ambiguities into meaningful, accurate descriptions of each entity.
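To make the contrast with pair matching concrete, here is a rough sketch of iterative combination. It assumes, purely for illustration, that records belong together when they share an email address or a phone number; the function names, field names, and sample records are hypothetical, and real entity resolution software uses far richer rules.

```python
def shares_key(a, b, keys):
    """True if two entities have at least one identifying value in common."""
    return any(a.get(k, set()) & b.get(k, set()) for k in keys)


def resolve(records, keys=("email", "phone")):
    """Fold records into entities, re-checking after every merge."""
    entities = []  # each entity maps a field name to the set of values seen
    for record in records:
        merged = {f: {v} for f, v in record.items() if v}
        changed = True
        while changed:  # a merge can expose new links, so repeat until stable
            changed = False
            remaining = []
            for entity in entities:
                if shares_key(entity, merged, keys):
                    for field, values in entity.items():
                        merged.setdefault(field, set()).update(values)
                    changed = True
                else:
                    remaining.append(entity)
            entities = remaining
        entities.append(merged)
    return entities


# Hypothetical sample data: the first two records share no attribute,
# but the third record links them through its email and phone.
records = [
    {"name": "Jane Doe", "email": "jane@example.com", "phone": None},
    {"name": "J. Doe", "email": None, "phone": "555-0100"},
    {"name": "Jane A. Doe", "email": "jane@example.com", "phone": "555-0100"},
    {"name": "John Roe", "email": "john@example.com", "phone": "555-0199"},
]

entities = resolve(records)
print(len(entities))
```

Comparing the first two records as a pair finds nothing in common. Only once the third record contributes both an email and a phone number do all three collapse into one entity, which is the behavior the iterative approach is meant to capture.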
Consolidating and separating records
Some entity resolution software uses fuzzy matching to improve its accuracy. Fuzzy matching links attributes that are very similar but not identical, whereas pair matching requires them to be exactly the same. For example, if one data source has a typo in a customer’s name, fuzzy matching would still be able to match the misspelled name to the correct one.
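As a small illustration of the typo case, the sketch below scores two names with `difflib.SequenceMatcher` from the Python standard library. The 0.85 threshold and the sample names are assumptions chosen for the example; commercial tools may use different similarity measures entirely.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_fuzzy_match(a, b, threshold=0.85):
    """Assumption: 0.85 is an illustrative cut-off, not a recommended value."""
    return similarity(a, b) >= threshold


# Hypothetical names: the second contains a one-letter typo.
print(is_fuzzy_match("Jonathan Smith", "Jonathon Smith"))
print(is_fuzzy_match("Jonathan Smith", "Maria Garcia"))
```

An exact comparison would treat the first pair as two different people, while the similarity score tolerates the single wrong letter. Choosing the threshold is a trade-off: lower values catch more typos but also produce more false matches.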
In this way, fuzzy matching is similar to probabilistic identity matching, which estimates how likely it is that different descriptions in database fields refer to the same person. Traditional data deduplication is similar to deterministic identity matching, which relies on identifying identical descriptions or exact matches across multiple records.
For example, faced with the following three records, an entity resolution platform using fuzzy matching could determine whether they all refer to the same person and, if so, standardize the description in each field and combine them into a single record.
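The match-then-combine step can be sketched as follows. Everything here is a simplifying assumption made for illustration: the three sample records are hypothetical, similarity is averaged across shared fields, the 0.8 threshold is arbitrary, and "standardizing" a field simply means keeping its most common value.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def record_similarity(r1, r2):
    """Average string similarity over the fields both records fill in."""
    shared = [f for f in r1 if r1.get(f) and r2.get(f)]
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio()
        for f in shared
    ]
    return sum(scores) / len(scores)


def same_entity(records, threshold=0.8):
    """Assumption: every pair of records must clear the threshold."""
    return all(
        record_similarity(a, b) >= threshold
        for a, b in combinations(records, 2)
    )


def consolidate(records):
    """Keep the most common value of each field as the standard one."""
    merged = {}
    for field in sorted({f for r in records for f in r}):
        values = [r[field].strip() for r in records if r.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged


# Hypothetical records with one typo in a name and one in a city.
records = [
    {"name": "Katherine Jones", "email": "k.jones@example.com", "city": "Boston"},
    {"name": "Katherine Jones", "email": "k.jones@example.com", "city": "Bostn"},
    {"name": "Kathrine Jones", "email": "k.jones@example.com", "city": "Boston"},
]

single_record = consolidate(records) if same_entity(records) else None
print(single_record)
```

A majority vote is only one possible way to pick the standard value; a real platform might instead prefer the most recent source or the most trusted one.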