Deduplication
RecordManager uses a custom algorithm for deduplication, based on a sequence of match/discard decisions. Deduplication processes, one by one, all records that are marked as needing an update, and tries to find a single duplicate for each processed record. If a duplicate is found, it is also marked as processed. As all records are processed, their duplicates are found incrementally.
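As a rough illustration, here is a minimal Python sketch of that incremental loop; the record structure and the `find_single_duplicate` helper are hypothetical stand-ins, not RecordManager's actual (PHP) implementation:

```python
# Hypothetical sketch of the incremental deduplication loop; the record
# dicts and find_single_duplicate() are assumptions for illustration.

def deduplicate(records_needing_update, find_single_duplicate):
    for record in records_needing_update:
        if record.get("processed"):
            continue  # already handled as another record's duplicate
        duplicate = find_single_duplicate(record)
        if duplicate is not None:
            # Both records get the same dedup id; further duplicates in
            # the same group are found when other records are processed.
            dedup_id = duplicate.get("dedup_id") or record["id"]
            record["dedup_id"] = dedup_id
            duplicate["dedup_id"] = dedup_id
            duplicate["processed"] = True
        record["processed"] = True
```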
Deduplicated records are marked with the same dedup id in the Mongo database. The old method of indexing these records in Solr was to index the dedup id and use Solr's field collapsing to deduplicate the results on the fly. This had scalability issues, so the new method is to index all the deduplicated records while also creating a merged record for searching. During a Solr query the individual deduplicated records are filtered out of the results, and the merged records returned by Solr are replaced in the result list with individual records according to a defined priority. There is basic built-in support for this in VuFind (see https://vufind.org/wiki/deduplication).
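At query time the substitution could look roughly like the sketch below; the `merged` and `member_ids` field names, the `source.id` record-id convention and the priority map are placeholders rather than VuFind's actual Solr schema:

```python
# Placeholder sketch of replacing merged records with individual ones;
# field names and the source-priority scheme are assumptions.

def substitute_merged(solr_docs, fetch_record, source_priority):
    results = []
    for doc in solr_docs:
        if not doc.get("merged"):
            results.append(doc)  # plain record, kept as-is
            continue
        # Pick the member record whose source has the best (lowest)
        # configured priority value.
        best_id = min(
            doc["member_ids"],
            key=lambda rid: source_priority.get(rid.split(".")[0], 999),
        )
        results.append(fetch_record(best_id))
    return results
```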
This is the flow for finding a duplicate for a record (as of 17 Feb 2014; subject to change); a condensed code sketch follows the list:
- Find candidates for deduplication using ISBNs first, other ID fields (MARC 015, 016 and 024) second, and title keys (see below) as a last resort
- Discard if the candidate is from the same source or deleted
- Discard if formats don't match
- Discard if the candidate has a matching ISBN and a previous ISBN match failed
- Discard if the candidate has a matching ID field and a previous ID match failed
- Discard if the candidate has already been deduplicated with a record from the same source as the one being processed
- Positive match if a common ISBN is found, no further checks performed
- Positive match if a common ID is found, no further checks performed
- Discard if both records have ISSNs but none match
- Discard if publication years don't match
- Discard if page counts are not within 10 pages
- Discard if any series ISSNs don't match
- Discard if any series numberings don't match
- Discard if the normalized title for either record is empty
- Discard if the Levenshtein distance between the titles is 10% or more
- Discard if the authors don't match (see the author comparison rules below) or the Levenshtein distance between them is 20% or more
- Positive match
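A condensed, hypothetical Python sketch of this flow is below. The record fields and the helpers (`normalize`, `authors_match` and the `distance` edit-distance function) are assumptions, and the bookkeeping for previously failed ISBN/ID matches is omitted for brevity:

```python
# Condensed, hypothetical sketch of the decision flow above; record
# fields and helper functions are assumptions for illustration.

def levenshtein_pct(a: str, b: str, distance) -> float:
    """Edit distance as a percentage of the longer string's length."""
    return 100.0 * distance(a, b) / max(len(a), len(b), 1)

def is_duplicate(rec, cand, distance) -> bool:
    if cand.source == rec.source or cand.deleted:
        return False
    if cand.format != rec.format:
        return False
    if set(rec.isbns) & set(cand.isbns):
        return True  # positive match, no further checks
    if set(rec.ids) & set(cand.ids):  # MARC 015/016/024
        return True  # positive match, no further checks
    if rec.issns and cand.issns and not set(rec.issns) & set(cand.issns):
        return False
    if rec.year != cand.year:
        return False
    if abs(rec.pages - cand.pages) > 10:
        return False
    if rec.series_issns and cand.series_issns \
            and set(rec.series_issns) != set(cand.series_issns):
        return False
    if rec.series_numberings and cand.series_numberings \
            and set(rec.series_numberings) != set(cand.series_numberings):
        return False
    title_a, title_b = normalize(rec.title), normalize(cand.title)
    if not title_a or not title_b:
        return False
    if levenshtein_pct(title_a, title_b, distance) >= 10:
        return False
    if not authors_match(rec.author, cand.author) \
            or levenshtein_pct(rec.author, cand.author, distance) >= 20:
        return False
    return True  # positive match
```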
The built-in normalization algorithm works best for Finnish material. The following modifications are made to the string being normalized (a sketch follows the list):
- Convert accented and other diacritical characters to their base characters, except ö, ä and å
- Remove whitespace and all punctuation
- Convert string to lower case
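A minimal sketch of these steps using only the Python standard library; RecordManager's exact character handling may differ:

```python
# Minimal standard-library sketch of the normalization; RecordManager's
# exact character handling may differ.
import re
import unicodedata

PRESERVED = set("öäåÖÄÅ")  # kept as-is for Finnish material

def normalize(text: str) -> str:
    chars = []
    for ch in text:
        if ch in PRESERVED:
            chars.append(ch)
        else:
            # Decompose and drop combining marks, leaving the base character.
            chars.extend(c for c in unicodedata.normalize("NFD", ch)
                         if not unicodedata.combining(c))
    # Remove whitespace and punctuation, then lower-case.
    return re.sub(r"[\W_]+", "", "".join(chars)).lower()
```

For example, `normalize("Brontë: Poems!")` yields `brontepoems`, while `normalize("Hämäläinen")` keeps the Finnish characters: `hämäläinen`.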
Keys for the title-based search for deduplication candidates are created as follows (see the sketch below):
- Split the title into words
- Take words one by one, starting from the beginning
- Count a word only if it is at least 4 characters long
- Repeat until three words have been counted or the resulting string exceeds 25 characters
- Normalize the string
- Append the normalized first word of the author's name
A title key is not created if the record has no author.
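A sketch of the key construction, reusing `normalize()` from above; whether words shorter than 4 characters are kept in the key (as done here) or skipped entirely is an assumption:

```python
# Sketch of title-key construction, reusing normalize() from above.
# Keeping short words in the key while not counting them is an
# assumption for illustration.

def title_key(title: str, author: str):
    if not author:
        return None  # no key without an author
    parts, counted = [], 0
    for word in title.split():
        parts.append(word)
        if len(word) >= 4:
            counted += 1  # only words of 4+ characters are counted
        if counted == 3 or len("".join(parts)) > 25:
            break
    return normalize("".join(parts)) + normalize(author.split()[0])
```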
The author comparison mentioned above works as follows (a sketch follows the list):
- Positive match if the authors match exactly
- Discard if either author is shorter than 6 characters
- Positive match if the authors match for the length of the shorter one
- Positive match if the first words match and at least an initial matches
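A sketch of these rules; how the first words and initials are extracted from the name strings is an assumption:

```python
# Sketch of the author comparison rules above; the way first words and
# initials are extracted from the name strings is an assumption.

def authors_match(a: str, b: str) -> bool:
    if a == b:
        return True  # exact match
    if len(a) < 6 or len(b) < 6:
        return False  # too short to compare reliably
    n = min(len(a), len(b))
    if a[:n] == b[:n]:
        return True  # match for the length of the shorter one
    words_a, words_b = a.split(), b.split()
    if (words_a[0] == words_b[0]
            and len(words_a) > 1 and len(words_b) > 1
            and words_a[1][:1] == words_b[1][:1]):
        return True  # first words match and at least an initial matches
    return False
```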
Any separately catalogued component parts (e.g. songs on a CD or articles in a journal) are marked as duplicates only if their host records are duplicates and ALL component parts in both host records match. A sketch of this rule follows:
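```python
# Sketch of the component-part rule; part_key(), which produces a
# comparable key for each component part, is a hypothetical helper.

def component_parts_duplicates(host_a_parts, host_b_parts,
                               hosts_are_duplicates: bool, part_key) -> bool:
    if not hosts_are_duplicates:
        return False
    if len(host_a_parts) != len(host_b_parts):
        return False  # ALL parts in both hosts must match
    keys_a = sorted(part_key(p) for p in host_a_parts)
    keys_b = sorted(part_key(p) for p in host_b_parts)
    return keys_a == keys_b
```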