Deduplication
RecordManager uses a custom algorithm for deduplication, based on a sequence of match/discard decisions. Deduplication processes, one by one, all records that are marked as needing an update, and tries to find a single duplicate for each processed record. If a duplicate is found, it is also marked as processed. As all records are processed, their duplicates are found incrementally.
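As a rough illustration, here is a minimal Python sketch of that incremental loop; the record structure and the `find_single_duplicate` helper are hypothetical stand-ins, not RecordManager's actual (PHP) implementation:

```python
# Hypothetical sketch of the incremental deduplication loop; the record
# dicts and find_single_duplicate() are assumptions for illustration.

def deduplicate(records_needing_update, find_single_duplicate):
    for record in records_needing_update:
        if record.get("processed"):
            continue  # already handled as another record's duplicate
        duplicate = find_single_duplicate(record)
        if duplicate is not None:
            # Both records get the same dedup id; further duplicates in
            # the same group are found when other records are processed.
            dedup_id = duplicate.get("dedup_id") or record["id"]
            record["dedup_id"] = dedup_id
            duplicate["dedup_id"] = dedup_id
            duplicate["processed"] = True
        record["processed"] = True
```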
Deduplicated records are marked with the same dedup id in the Mongo database. The old method of indexing these records in Solr was to index the dedup id and use Solr's field collapsing to deduplicate the results on the fly. This had scalability issues, so the new method is to index all the deduplicated records while also creating a merged record for searching. During a Solr query the individual deduplicated records are filtered out of the results, and the merged records returned by Solr are replaced in the result list with individual records according to a defined priority. There is basic built-in support for this in VuFind (see https://vufind.org/wiki/deduplication).
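At query time the substitution could look roughly like the sketch below; the `merged` and `member_ids` field names, the `source.id` record-id convention and the priority map are placeholders rather than VuFind's actual Solr schema:

```python
# Placeholder sketch of replacing merged records with individual ones;
# field names and the source-priority scheme are assumptions.

def substitute_merged(solr_docs, fetch_record, source_priority):
    results = []
    for doc in solr_docs:
        if not doc.get("merged"):
            results.append(doc)  # plain record, kept as-is
            continue
        # Pick the member record whose source has the best (lowest)
        # configured priority value.
        best_id = min(
            doc["member_ids"],
            key=lambda rid: source_priority.get(rid.split(".")[0], 999),
        )
        results.append(fetch_record(best_id))
    return results
```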
This is the flow for finding a duplicate for a record (as of 17 Feb 2014; subject to change); a condensed code sketch follows the list:
- Find candidates for deduplication using ISBNs first, other ID fields (MARC 015, 016 and 024) second, and title keys (see below) as a last resort
- Discard if the candidate is from the same source or deleted
- Discard if formats don't match
- Discard if the candidate has a matching ISBN and a previous ISBN match failed
- Discard if the candidate has a matching ID field and a previous ID match failed
- Discard if the candidate has already been deduplicated with a record from the same source as the one being processed
- Positive match if a common ISBN is found, no further checks performed
- Positive match if a common ID is found, no further checks performed
- Discard if both records have ISSNs but none match
- Discard if publication years don't match
- Discard if page counts are not within 10 pages
- Discard if any series ISSNs don't match
- Discard if any series numberings don't match
- Discard if the normalized title for either record is empty
- Discard if the Levenshtein distance between the titles is 10% or more
- Discard if the authors don't match (see the author comparison rules below) or the Levenshtein distance between them is 20% or more
- Positive match
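A condensed, hypothetical Python sketch of this flow is below. The record fields and the helpers (`normalize`, `authors_match` and the `distance` edit-distance function) are assumptions, and the bookkeeping for previously failed ISBN/ID matches is omitted for brevity:

```python
# Condensed, hypothetical sketch of the decision flow above; record
# fields and helper functions are assumptions for illustration.

def levenshtein_pct(a: str, b: str, distance) -> float:
    """Edit distance as a percentage of the longer string's length."""
    return 100.0 * distance(a, b) / max(len(a), len(b), 1)

def is_duplicate(rec, cand, distance) -> bool:
    if cand.source == rec.source or cand.deleted:
        return False
    if cand.format != rec.format:
        return False
    if set(rec.isbns) & set(cand.isbns):
        return True  # positive match, no further checks
    if set(rec.ids) & set(cand.ids):  # MARC 015/016/024
        return True  # positive match, no further checks
    if rec.issns and cand.issns and not set(rec.issns) & set(cand.issns):
        return False
    if rec.year != cand.year:
        return False
    if abs(rec.pages - cand.pages) > 10:
        return False
    if rec.series_issns and cand.series_issns \
            and set(rec.series_issns) != set(cand.series_issns):
        return False
    if rec.series_numberings and cand.series_numberings \
            and set(rec.series_numberings) != set(cand.series_numberings):
        return False
    title_a, title_b = normalize(rec.title), normalize(cand.title)
    if not title_a or not title_b:
        return False
    if levenshtein_pct(title_a, title_b, distance) >= 10:
        return False
    if not authors_match(rec.author, cand.author) \
            or levenshtein_pct(rec.author, cand.author, distance) >= 20:
        return False
    return True  # positive match
```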
The built-in normalization algorithm works best for Finnish material. The following modifications are made to the string being normalized (a sketch follows the list):
- Convert accented and other diacritical characters to their base characters, except ö, ä and å
- Remove whitespace and all punctuation
- Convert string to lower case
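A minimal sketch of these steps using only the Python standard library; RecordManager's exact character handling may differ:

```python
# Minimal standard-library sketch of the normalization; RecordManager's
# exact character handling may differ.
import re
import unicodedata

PRESERVED = set("öäåÖÄÅ")  # kept as-is for Finnish material

def normalize(text: str) -> str:
    chars = []
    for ch in text:
        if ch in PRESERVED:
            chars.append(ch)
        else:
            # Decompose and drop combining marks, leaving the base character.
            chars.extend(c for c in unicodedata.normalize("NFD", ch)
                         if not unicodedata.combining(c))
    # Remove whitespace and punctuation, then lower-case.
    return re.sub(r"[\W_]+", "", "".join(chars)).lower()
```

For example, `normalize("Brontë: Poems!")` yields `brontepoems`, while `normalize("Hämäläinen")` keeps the Finnish characters: `hämäläinen`.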
Keys for the title-based search for deduplication candidates are created as follows (see the sketch below):
- Split the title into words
- Take words one by one, starting from the beginning
- Count a word only if it is at least 4 characters long
- Repeat until three words have been counted or the resulting string exceeds 25 characters
- Normalize the string
- Append the normalized first word of the author's name
A title key is not created if the record has no author.
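A sketch of the key construction, reusing `normalize()` from above; whether words shorter than 4 characters are kept in the key (as done here) or skipped entirely is an assumption:

```python
# Sketch of title-key construction, reusing normalize() from above.
# Keeping short words in the key while not counting them is an
# assumption for illustration.

def title_key(title: str, author: str):
    if not author:
        return None  # no key without an author
    parts, counted = [], 0
    for word in title.split():
        parts.append(word)
        if len(word) >= 4:
            counted += 1  # only words of 4+ characters are counted
        if counted == 3 or len("".join(parts)) > 25:
            break
    return normalize("".join(parts)) + normalize(author.split()[0])
```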
The author comparison mentioned above works as follows (a sketch follows the list):
- Positive match if the authors match exactly
- Discard if either author is shorter than 6 characters
- Positive match if the authors match for the length of the shorter one
- Positive match if the first words match and at least an initial matches
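A sketch of these rules; how the first words and initials are extracted from the name strings is an assumption:

```python
# Sketch of the author comparison rules above; the way first words and
# initials are extracted from the name strings is an assumption.

def authors_match(a: str, b: str) -> bool:
    if a == b:
        return True  # exact match
    if len(a) < 6 or len(b) < 6:
        return False  # too short to compare reliably
    n = min(len(a), len(b))
    if a[:n] == b[:n]:
        return True  # match for the length of the shorter one
    words_a, words_b = a.split(), b.split()
    if (words_a[0] == words_b[0]
            and len(words_a) > 1 and len(words_b) > 1
            and words_a[1][:1] == words_b[1][:1]):
        return True  # first words match and at least an initial matches
    return False
```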
Any separately catalogued component parts (e.g. songs on a CD or articles in a journal) are marked as duplicates only if their host records are duplicates and ALL component parts in both host records match. A sketch of this rule follows:
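```python
# Sketch of the component-part rule; part_key(), which produces a
# comparable key for each component part, is a hypothetical helper.

def component_parts_duplicates(host_a_parts, host_b_parts,
                               hosts_are_duplicates: bool, part_key) -> bool:
    if not hosts_are_duplicates:
        return False
    if len(host_a_parts) != len(host_b_parts):
        return False  # ALL parts in both hosts must match
    keys_a = sorted(part_key(p) for p in host_a_parts)
    keys_b = sorted(part_key(p) for p in host_b_parts)
    return keys_a == keys_b
```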