Deduplication

RecordManager uses a custom, decision-based algorithm for deduplication: each candidate passes through a series of checks that either discard it or accept it as a match. Deduplication processes, one by one, all records that are marked as needing an update. For each processed record it finds at most one duplicate, and if a duplicate is found, that record is also marked as processed. Larger groups of duplicates are therefore built up incrementally as records are processed.

Deduplicated records are marked with the same dedup id in the database. During the Solr query the individual records are filtered out from the results, but the merged records received from Solr are replaced in the result list with individual records according to a defined priority. There is basic built-in support for this in VuFind (see https://vufind.org/wiki/deduplication).

Managing Deduplication

In most cases deduplication just works by setting dedup = true for two or more data sources and running ./console records:deduplicate regularly, e.g. via cron. There are, however, situations where manual intervention is required. One such case is when the wrong records have been associated with the same dedup group because of, for example, an invalid ISBN in some of them. In that case the invalid ISBN and the associated title can be added to the configuration. Re-deduplicating a single record may not be enough if there are multiple bad matches; in that case the command ./console records:check-dedup --single=record_id --strict may be used to check all members of the dedup group.

Turning Deduplication On Or Off

If deduplication is enabled or disabled with the dedup setting above for an existing data source, the source needs to be renormalized with the ./console records:renormalize --source=source_id command. Renormalization adds or removes the deduplication keys that are used to find candidate records for deduplication.

Deduplication Algorithm

This is the flow for finding a duplicate for a record (17 Feb 2014, subject to change):

Find candidates for deduplication using ISBNs first, other ID fields (MARC 015, 016 and 024) second, and title keys (see below) as a last resort
Discard if the candidate is from the same source or deleted
Discard if formats don't match
Discard if the candidate has a matching ISBN and a previous ISBN match failed
Discard if the candidate has a matching ID field and a previous ID match failed
Discard if the candidate has already been deduplicated with a record from the same source as the one being processed
Positive match if a common ISBN is found, no further checks performed
Positive match if a common ID is found, no further checks performed
Discard if both records have ISSNs but none match
Discard if publication years don't match
Discard if page counts are not within 10 pages
Discard if any series ISSNs don't match
Discard if any series numberings don't match
Discard if the normalized title for either record is empty
Discard if the Levenshtein distance between the titles is 10% or more
Discard if authors don't match (see below) or the Levenshtein distance between them is 20% or more
Positive match
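
The stateless part of the flow above can be sketched in Python as follows. This is only an illustration under stated assumptions, not RecordManager's actual code: the Rec class and its field names are hypothetical, titles are assumed to be already normalized, the percentage is assumed to be relative to the longer title, years and page counts are assumed to be compared only when both records have them, and the author check is a caller-supplied stand-in. The steps that depend on earlier failed matches or on existing dedup groups, as well as the series checks, are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Rec:
    # Hypothetical, simplified record; field names are illustrative only.
    source: str
    fmt: str
    title: str  # assumed to be already normalized
    author: str = ""
    isbns: set = field(default_factory=set)
    ids: set = field(default_factory=set)
    issns: set = field(default_factory=set)
    year: Optional[int] = None
    pages: Optional[int] = None
    deleted: bool = False


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def is_duplicate(
    rec: Rec,
    cand: Rec,
    authors_match: Callable[[str, str], bool] = lambda a, b: a == b,
) -> bool:
    if cand.source == rec.source or cand.deleted:
        return False
    if cand.fmt != rec.fmt:
        return False
    if rec.isbns & cand.isbns or rec.ids & cand.ids:
        return True  # common ISBN or ID: no further checks
    if rec.issns and cand.issns and not rec.issns & cand.issns:
        return False
    # Assumption: a missing year or page count is not treated as a mismatch.
    if rec.year is not None and cand.year is not None and rec.year != cand.year:
        return False
    if rec.pages is not None and cand.pages is not None:
        if abs(rec.pages - cand.pages) > 10:
            return False
    if not rec.title or not cand.title:
        return False
    # Assumption: the 10% threshold is relative to the longer title.
    longest = max(len(rec.title), len(cand.title))
    if levenshtein(rec.title, cand.title) / longest >= 0.10:
        return False
    # Stand-in for the author check; exact equality by default.
    return authors_match(rec.author, cand.author)
```

The early return on a shared ISBN or ID mirrors the "no further checks performed" wording: identifier matches short-circuit all of the fuzzier comparisons that follow.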

String Normalization

The built-in normalization algorithm works best for Finnish material. The following modifications are made to the string to be normalized:

Convert accented and other diacritical characters to their base characters, except ö, ä and å
Remove whitespace and all punctuation
Convert string to lower case
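
A minimal Python sketch of these three steps, using only the standard library, might look like this. It is an approximation and not RecordManager's own implementation: the function name is hypothetical, the upper-case forms Ö, Ä and Å are assumed to be preserved as well so that they survive until lower-casing, and anything that is not a letter or a digit is assumed to count as punctuation.

```python
import unicodedata

# Characters the text says are kept as-is; including the upper-case
# forms is an assumption.
KEEP = set("öäåÖÄÅ")


def normalize(text: str) -> str:
    """Hypothetical normalizer following the three steps listed above."""
    pieces = []
    for ch in unicodedata.normalize("NFC", text):
        if ch not in KEEP:
            # Decompose the character and drop combining marks: "é" -> "e".
            ch = "".join(
                c for c in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(c)
            )
        pieces.append(ch)
    # Assumption: keep only letters and digits, which removes whitespace
    # and punctuation (and also other symbols).
    stripped = "".join(c for c in "".join(pieces) if c.isalnum())
    return stripped.lower()
```

For example, normalize("Émile Zola: L'Assommoir") returns "emilezolalassommoir", while normalize("Seitsemän veljestä") keeps the ä characters and returns "seitsemänveljestä".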

Title Keys

Keys for title-based search for deduplication candidates are created as follows:

Split the title into words
Take a word starting from the beginning
If the word is at least 4 characters long, count it as a word
Repeat until there are three words or the resulting string exceeds 25 characters
Normalize the string
Append the normalized first word of the author's name

A title key won't be created if the record doesn't have an author.
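
The key construction can be sketched in Python as shown here. The steps leave some details open, so this sketch makes assumptions that may differ from the real implementation: words shorter than 4 characters are still appended to the string but are not counted, the 25-character limit is checked before normalization, and the normalize helper is a simplified stand-in for the built-in normalization. Both function names are hypothetical.

```python
import unicodedata
from typing import Optional


def normalize(text: str) -> str:
    """Simplified stand-in for the built-in string normalization."""
    kept = []
    for ch in unicodedata.normalize("NFC", text):
        if ch not in "öäåÖÄÅ":
            ch = "".join(
                c for c in unicodedata.normalize("NFD", ch)
                if not unicodedata.combining(c)
            )
        if ch.isalnum():
            kept.append(ch)
    return "".join(kept).lower()


def title_key(title: str, author: str) -> Optional[str]:
    """Hypothetical title key builder; returns None when there is no author."""
    author_words = author.split()
    if not author_words:
        return None  # no author, no title key
    key = ""
    counted = 0
    for word in title.split():
        key += word  # assumption: short words are appended but not counted
        if len(word) >= 4:
            counted += 1
        if counted >= 3 or len(key) > 25:
            break
    return normalize(key) + normalize(author_words[0])
```

With these assumptions, title_key("Seitsemän veljestä", "Kivi, Aleksis") yields "seitsemänveljestäkivi", and a record with an empty author yields None.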

Author Matching

Positive match if authors match exactly
Discard if either author length is less than 6 characters
Positive match if authors match for the length of the shorter one
Positive match if first words match and at least an initial letter matches
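
Read in order, these rules could be sketched in Python as follows. The function name is hypothetical, and the last rule is interpreted here as "the first words are identical and the second words start with the same letter", which is an assumption about what "an initial letter" refers to. Anything that passes none of the positive rules is treated as a non-match.

```python
def author_match(a: str, b: str) -> bool:
    """Hypothetical author comparison following the four rules above."""
    if a == b:
        return True  # exact match
    if len(a) < 6 or len(b) < 6:
        return False  # too short for a fuzzy comparison
    shorter = min(len(a), len(b))
    if a[:shorter] == b[:shorter]:
        return True  # identical for the length of the shorter name
    words_a, words_b = a.split(), b.split()
    if len(words_a) >= 2 and len(words_b) >= 2:
        # Assumption: compare the first words and the initials of the
        # second words.
        return words_a[0] == words_b[0] and words_a[1][0] == words_b[1][0]
    return False
```

Under this reading, "Kivi, Aleksis" matches "Kivi, A." through the initial-letter rule, "Tolkien, J" matches "Tolkien, J. R. R." through the prefix rule, and "Kivi, Aleksis" does not match "Kivi, Eino".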

Component Parts

Any separately catalogued component parts (e.g. songs belonging to a CD or articles belonging to a journal) are marked as duplicates only if their host records are duplicates and ALL component parts in both host records match.
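
The all-or-nothing rule can be illustrated with a short Python sketch. How component parts are paired up is not described here, so the sketch assumes that both hosts must have the same number of parts and that parts are compared pairwise in their catalogued order. The function name and the part_matches callable are hypothetical.

```python
from typing import Callable, Sequence


def component_parts_are_duplicates(
    hosts_are_duplicates: bool,
    parts_a: Sequence[str],
    parts_b: Sequence[str],
    part_matches: Callable[[str, str], bool] = lambda x, y: x == y,
) -> bool:
    """Hypothetical check: every part must match, or none is marked."""
    if not hosts_are_duplicates:
        return False
    # Assumption: a different number of parts means they cannot all match.
    if len(parts_a) != len(parts_b):
        return False
    # Assumption: parts are compared pairwise in catalogued order.
    return all(part_matches(x, y) for x, y in zip(parts_a, parts_b))
```

A single mismatching track on a CD is therefore enough to keep all component parts of both hosts separate, even when the hosts themselves are duplicates.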
