clean_doi() should lower-case returned DOI
Code in a number of places (including the Pubmed importer) assumed that this
function was already lower-casing DOIs, which resulted in some broken
metadata being created.

See also: #83

This is just the first step of mitigation.
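
To illustrate the failure mode described above, here is a minimal sketch. It assumes a simplified stand-in for clean_doi() (the hypothetical `clean_doi_sketch`, which is not the fatcat implementation) and a hypothetical in-memory index with a made-up identifier. It shows how two spellings of the same DOI only match in a lookup once both are lower-cased.

```python
from typing import Optional


def clean_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Hypothetical, simplified stand-in for clean_doi(); not the fatcat code."""
    if not raw:
        return None
    # Trim padding whitespace and lower-case, as the patched line does.
    doi = raw.strip().lower()
    # Strip a leading 'doi:' prefix, if present.
    if doi.startswith("doi:"):
        doi = doi[4:]
    # Assumption: reject any non-ASCII character, loosely modeled on the
    # test cases in this commit (accented, Cyrillic, and dash characters).
    if not doi.isascii():
        return None
    # Assumption: a DOI must start with the "10." directory indicator.
    if not doi.startswith("10."):
        return None
    return doi


# Hypothetical lookup table keyed by normalized DOI.
index = {clean_doi_sketch("10.7326/m20-6817"): "release-1"}

# An importer that sees the upper-case spelling still finds the record...
assert index.get(clean_doi_sketch("10.7326/M20-6817")) == "release-1"
# ...whereas a raw, un-normalized lookup misses it.
assert index.get("10.7326/M20-6817") is None
```

Lower-casing at the single normalization point means callers that already assumed this behavior become correct without any change on their side.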
bnewbold committed Jun 7, 2021
1 parent 9779781 commit e649003
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion python/fatcat_tools/normal.py
@@ -22,13 +22,15 @@ def clean_doi(raw):
     - 'doi:' prefix
     - URL prefix
 
+    Lower-cases the DOI.
+
     Does not try to un-URL-encode
 
     Returns None if not a valid DOI
     """
     if not raw:
         return None
-    raw = raw.strip()
+    raw = raw.strip().lower()
     if '\u2013' in raw:
         # Do not attempt to normalize "en dash" and since FC does not allow
         # unicode in DOI, treat this as invalid.
@@ -84,6 +86,7 @@ def test_clean_doi():
     assert clean_doi("10.4025/diálogos.v17i2.36030") == None
     assert clean_doi("10.19027/jai.10.106‒115") == None
     assert clean_doi("10.15673/атбп2312-3125.17/2014.26332") == None
+    assert clean_doi("10.7326/M20-6817") == "10.7326/m20-6817"
 
 
 ARXIV_ID_REGEX = re.compile(r"^(\d{4}.\d{4,5}|[a-z\-]+(\.[A-Z]{2})?/\d{7})(v\d+)?$")