Ere Maijala edited this page Mar 27, 2024 · 15 revisions

RecordManager uses a custom algorithm for deduplication. The algorithm is based on a sequence of decisions (see Deduplication Algorithm below). Deduplication processes, one by one, all records that are marked as needing an update. For each processed record it looks for a single duplicate; if one is found, that duplicate is also marked as processed. Because every record is processed in turn, all duplicates of a record are found incrementally.

Deduplicated records are marked with the same dedup id in the database. During a Solr query the individual records are filtered out of the results, and each merged record received from Solr is then replaced in the result list with an individual record chosen according to a defined priority. There is basic built-in support for this in VuFind (see https://vufind.org/wiki/deduplication).

Managing Deduplication

In most cases deduplication just works once dedup = true is set for two or more data sources and ./console records:deduplicate is run regularly, e.g. via cron. There are, however, situations where manual intervention is required. One such case is when the wrong records have been placed in the same dedup group, for example because some of them carry an invalid ISBN. In that case the invalid ISBN and the associated title can be added to the configuration. Re-deduplicating a single record may not be enough if there are multiple bad matches; in that case the command ./console records:check-dedup --single=record_id --strict may be used to check all members of the dedup group.

Turning Deduplication On Or Off

If deduplication is enabled or disabled with the dedup setting above for an existing data source, the data source needs to be renormalized with the ./console records:renormalize --source=source_id command. This adds or removes the deduplication keys that are used to find candidate records for deduplication.

Deduplication Algorithm

This is the flow for finding a duplicate for a record (17 Feb 2014, subject to change):

  1. Find candidates for deduplication using ISBNs first, other ID fields (MARC 015, 016 and 024) second, and title keys (see below) as a last resort
  2. Discard if the candidate is from the same source or deleted
  3. Discard if formats don't match
  4. Discard if the candidate has a matching ISBN and a previous ISBN match failed
  5. Discard if the candidate has a matching ID field and a previous ID match failed
  6. Discard if the candidate has already been deduplicated with a record from the same source as the one being processed
  7. Positive match if a common ISBN is found, no further checks performed
  8. Positive match if a common ID is found, no further checks performed
  9. Discard if both records have ISSNs but none match
  10. Discard if publication years don't match
  11. Discard if page counts are not within 10 pages
  12. Discard if any series ISSNs don't match
  13. Discard if any series numberings don't match
  14. Discard if the normalized title for either record is empty
  15. Discard if the Levenshtein distance between the titles is equal to or greater than 10%
  16. Discard if authors don't match (see below) or the Levenshtein distance between them is equal to or greater than 20%
  17. Positive match
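
The later part of the flow above can be sketched in Python as follows. This is only an illustrative sketch, not RecordManager's actual code: the Record class and its field names are hypothetical, the series and author checks (steps 12, 13 and 16) are omitted, and two details that the list does not specify are assumptions: the 10% threshold is taken relative to the length of the longer title, and year and page count are compared only when both records have them.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Record:
    # Hypothetical, simplified record; field names are illustrative only.
    title: str = ""  # assumed to be normalized already
    isbns: Set[str] = field(default_factory=set)
    ids: Set[str] = field(default_factory=set)
    issns: Set[str] = field(default_factory=set)
    year: Optional[int] = None
    pages: Optional[int] = None


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def is_duplicate(rec: Record, cand: Record) -> bool:
    # Steps 7-8: a shared ISBN or other ID is an immediate positive match.
    if rec.isbns & cand.isbns or rec.ids & cand.ids:
        return True
    # Step 9: both have ISSNs but none in common.
    if rec.issns and cand.issns and not (rec.issns & cand.issns):
        return False
    # Step 10 (assumption: compared only when both years are known).
    if rec.year is not None and cand.year is not None and rec.year != cand.year:
        return False
    # Step 11 (assumption: compared only when both page counts are known).
    if rec.pages is not None and cand.pages is not None:
        if abs(rec.pages - cand.pages) > 10:
            return False
    # Step 14: an empty normalized title cannot be compared.
    if not rec.title or not cand.title:
        return False
    # Step 15 (assumption: distance relative to the longer title).
    longest = max(len(rec.title), len(cand.title))
    if levenshtein(rec.title, cand.title) / longest >= 0.10:
        return False
    # Step 17: nothing ruled the candidate out.
    return True
```

Ordering the cheap identifier checks before the string comparisons mirrors the list: a shared ISBN or ID short-circuits everything else, and the comparatively expensive Levenshtein calculation runs only for candidates that survive the earlier checks.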

String Normalization

The built-in normalization algorithm works best for Finnish material. The following modifications are made to the string to be normalized:

  1. Convert accented, diacritical etc. characters to the base character, except ö, ä and å
  2. Remove whitespace and all punctuation
  3. Convert string to lower case

Title Keys

Keys for title-based search for deduplication candidates are created as follows:

  1. Split title to words
  2. Take a word starting from the beginning
  3. If the word is at least 4 characters long, count it as a word
  4. Repeat until there are three words or the resulting string exceeds 25 characters
  5. Normalize the string
  6. Append the normalized first word of the author's name

A title key won't be created if the record doesn't have an author.
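
A possible reading of the steps above, sketched in Python. The function names and parameters are hypothetical, and one interpretation is an assumption: every word taken from the title goes into the key, but only words of at least 4 characters count toward the limit of three. The normalize helper is a compact stand-in for the normalization described under String Normalization.

```python
import unicodedata
from typing import Optional


def normalize(text: str) -> str:
    # Compact stand-in: fold diacritics except ö, ä, å; keep alphanumerics; lower-case.
    kept = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in "öäåÖÄÅ":
            kept.append(ch)
        else:
            kept.extend(
                p for p in unicodedata.normalize("NFKD", ch)
                if not unicodedata.combining(p)
            )
    return "".join(c for c in kept if c.isalnum()).lower()


def title_key(title: str, author: str, min_len: int = 4,
              max_words: int = 3, max_chars: int = 25) -> Optional[str]:
    # No key is created for records without an author.
    if not author.strip():
        return None
    taken = []
    counted = 0
    for word in title.split():          # steps 1-2: walk words from the start
        taken.append(word)
        if len(word) >= min_len:        # step 3: only longer words are counted
            counted += 1
        # Step 4: stop at three counted words or once the string exceeds 25 chars.
        if counted >= max_words or len("".join(taken)) > max_chars:
            break
    key = normalize("".join(taken))     # step 5
    return key + normalize(author.split()[0])  # step 6
```

With these assumptions, title_key("The Lord of the Rings", "Tolkien, J. R. R.") uses the whole title, because only "Lord" and "Rings" are long enough to be counted.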

Author Matching

  1. Positive match if authors match exactly
  2. Discard if either author length is less than 6 characters
  3. Positive match if authors match for the length of the shorter one
  4. Positive match if first words match and at least an initial letter matches
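
The rules above can be expressed as a short Python function. This is a hedged sketch: the function name is hypothetical, the inputs are assumed to be lower-cased author strings that still contain spaces between words, and rule 4 is interpreted as "the first words are equal and the second words start with the same letter". Anything that passes none of the rules is treated as a non-match.

```python
def authors_match(a: str, b: str) -> bool:
    # Rule 1: identical strings always match.
    if a == b:
        return True
    # Rule 2: very short names are too ambiguous for fuzzy matching.
    if len(a) < 6 or len(b) < 6:
        return False
    # Rule 3: one name is a prefix of the other.
    shorter, longer = sorted((a, b), key=len)
    if longer.startswith(shorter):
        return True
    # Rule 4 (assumed reading): same first word and same initial of the second word.
    words_a, words_b = a.split(), b.split()
    if len(words_a) > 1 and len(words_b) > 1:
        if words_a[0] == words_b[0] and words_a[1][0] == words_b[1][0]:
            return True
    return False
```

For instance, "tolkien john r" and "tolkien j. r" match through rule 4, while "smith a" and "smith b" do not match at all.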

Component Parts

Any separately catalogued component parts (e.g. songs belonging to a CD or articles belonging to a journal) are only marked duplicates if their host records are duplicates and ALL component parts in both host records match.