Correction: Diacritics missing from author name though present in PDF #333

nschneid · 2019-05-13T23:23:37Z

Ironically, Diacritics Restoration Using Neural Networks lists "Jan Hajic" on the page and in the BibTeX whereas it's spelled "Jan Hajič" in the PDF.

I see he is listed with the diacritic in some venues but not others, though going by the PDFs, Jan Hajič seems to be the preferred spelling.

Should the policy be that if an author is listed with multiple spellings differing only in diacritics, the one with the most diacritics should be applied?

mjpost · 2019-05-14T01:02:08Z

The goal is generally for (a) BibTeX to reflect PDF and (b) author pages to collect all observed name variants. So this is a mistake that should be corrected in the XML.

Want to open a PR (and be added to our list of volunteers)?

nschneid · 2019-05-14T01:51:23Z

Before making a one-time change I'd like to understand the underlying problem. Could it be that his name is listed without the diacritic in START, so it is showing up that way in the metadata for many of the venues?

davidweichiang · 2019-05-14T01:56:52Z

I just checked -- indeed, his name in START is just Jan Hajic.

davidweichiang · 2019-05-14T02:01:38Z

Missing diacritics are a widespread problem that we hoped to sidestep by allowing name variants. I suppose one could write a scraper to try to detect them.

If we knew when the switch was made to using START names, then looking for frequent mismatches after that year would help to identify people to contact and ask them to consider updating their profile.

davidweichiang · 2019-05-14T13:12:12Z

I adapted the auto_first_names.py script and am running it on L18 now. It's catching quite a few errors: not just the kind @nschneid pointed out; it is also removing extra accents, decapitalizing an all-caps name, and flagging (but not autocorrecting, alas) a couple of misspelled names.

davidweichiang · 2019-05-14T14:37:57Z

In L18 (528 papers, wow), the script made 150 changes (also wow) and printed another 100+ warnings that usually indicate a typo or missing word.

The automatic changes are easy to check, and they all look good except for a few:

INFO:L18-1066 author Tomasz Pędzimąż: changing: Pędzimąż -> Pȩdzima̧ż
INFO:L18-1495 author Anna Björk Nikulásdóttir: changing: Nikulásdóttir -> Nikulasdóttir
INFO:L18-1632 author Huda Almuzaini: changing: Almuzaini -> almuzaini

The first changes ogoneks to cedillas, I believe incorrectly.
The second one looks incorrect to me based on a Google search.
The third one lowercases the last name of someone who doesn't appear to do that regularly.
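The ogonek/cedilla difference in the first change can be seen at the code-point level. Below is a minimal sketch using only Python's standard `unicodedata` module. The two strings are written as escapes on the assumption that the XML has ogoneks and the PDF text layer yields cedillas, as described above; `combining_marks` is a hypothetical helper and not part of the script.

```python
import unicodedata

def combining_marks(name):
    """Return the Unicode names of all combining marks in `name` after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", name)
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]

# Assumed encodings of the two spellings, written as escapes for clarity.
xml_name = "P\u0119dzim\u0105\u017c"    # e with ogonek, a with ogonek, z with dot above
pdf_name = "P\u0229dzima\u0327\u017c"   # e with cedilla, a + combining cedilla, z with dot above

print(combining_marks(xml_name))
print(combining_marks(pdf_name))
```

Comparing the two lists shows which marks a change would swap, which could be used to hold back substitutions such as ogonek to cedilla for manual review.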

@mjpost, do you think the PDF should be followed in such cases?

davidweichiang · 2019-05-15T13:00:21Z

What system does LREC use to fill metadata? Do they use START also? I'm running the script on L16 now (for #341) and seeing some PDF/XML mismatches that are the same as in L18.

For example (not an exhaustive list):

Phillippe Langlais -> Philippe Langlais
Ina Roesiger -> Ina Rösiger

So it would be nice to fix these at the source instead of on our end.

kilian-gebhardt · 2019-05-15T14:02:02Z

Seems to be START for both years:
http://lrec2016.lrec-conf.org/en/submission/
http://lrec2018.lrec-conf.org/en/submission/

davidweichiang · 2019-05-15T14:50:14Z

@mjpost what are your thoughts about editing XML to match PDF in these cases where the PDF has less information than the current XML:

1. XML currently has Matt Post, PDF has Matt POST
2. XML Matt Post, PDF M. Post
3. XML Matt Post, PDF Mat Post
4. XML Matt J. Post, PDF Matt Post
5. XML Matt Póst, PDF Matt Post (supposing that the accent is correct)
6. XML Matt Post, PDF matt post
[Edit: numbered list]
[Edit: 6]

mjpost · 2019-05-15T15:02:00Z

1. I approve on the grounds of superseding another conference's convention.
2. I like only when it is clear that initials were used because of a conference-level editorial decision (in which case we are overriding their convention with our superior one). If this were a one-off, we don't have the evidence that this wasn't the author's choice.
3. I approve as a typo correction.
4. I dislike, because there is no evidence that this is a correction. (And in particular, I strongly dislike my name being written as Matthew, Matthew J, Matt J, etc.)
5. Is murky but I think wrong. For example, the same corrective principle might change Koehn → Köhn, which would be wrong. We could set a general rule that acknowledges typing Latin-1 characters was harder, say, 20 years ago, but I think it's more straightforward to list this as an ASCII variant.

Just to be clear, since my tone may indicate otherwise, we can discuss any of these.

danielgildea · 2019-05-15T15:16:11Z

As a general rule, I would say the xml should reflect how you would want to cite the paper, and not necessarily have to match the PDF 100% of the time. On that basis, I would say that the xml should have:

1. No full caps.
2. Full first name if we know that the author usually uses it, and this conference/paper just didn't allow it.
3. Typos fixed if we are absolutely sure it's a typo.
4. Middle initial and form "Matt" vs "Matthew" etc. as they appear in the PDF.
5. Any diacritics if we are sure they are correct and are generally used by that person.

Unfortunately these rules require some research/judgment, but I think it is better to leave things the way they are in case of doubt than it is to exactly mirror the PDFs.

akoehn · 2019-05-15T15:33:46Z

@mjpost : Approve means that you would like to keep the XML data and not the PDF one, correct?

> For example, the same corrective principle might change Koehn → Köhn, which would be wrong

As an expert in this field [ :-) ]: it depends. You cannot change an oe to ö without any evidence. However, if either the PDF or the XML actually has Köhn in it, it is probably safe to say that the umlaut is the preferred version. Case in point: Philipp Köhn spells his name Koehn in all publications and probably has it that way in Softconf as well. No algorithm would try to change it to Köhn. I try to use the umlaut, but in some cases have to enter an ASCII-only name, so Koehn will be in some database as well.

> Ina Roesiger -> Ina Rösiger

In this case, one should go with Rösiger.

mjpost · 2019-05-15T17:01:47Z

@akoehn—yes, approve means I was in favor of the XML diverging from the PDF in the cases mentioned above.

I like @danielgildea's concise summary above. We should start throwing conclusions from these discussions into a wiki page that describes our approach.

Just seeing (6) above: I think capitalization falls under the Gildea Principles: we correct it to English conventions unless we have evidence the author prefers it that way (e.g., danah boyd, e e cummings).

davidweichiang · 2019-05-15T17:42:27Z

I think I hear a consensus about

(3) Don't copy errors from the PDF into the XML. Note that this can occasionally be a tough call: for example, I had difficulty figuring out Elahe Khorasani vs. Elahe Khorashani.

(4-5) Assuming that neither the PDF nor the XML has an error, go with the PDF.

(1-2) Override styles (like first initials or all caps) imposed by a conference (which is rare).

But:

(1-2) There's less clarity about individual papers that use first initials or all caps -- @mjpost says go with the PDF.

So for the examples mentioned in this thread, and a couple more:

Consensus cases:

PDF Hajič, XML Hajic: change to Hajič
PDF Philippe, XML Phillippe -> change to Philippe
PDF Rösiger, XML Roesiger -> change to Rösiger
PDF almuzaini, XML Almuzaini: keep Almuzaini

Not sure about whether these are considered typos or not:

PDF Pȩdzima̧ż, XML Pędzimąż: change to Pȩdzima̧ż
PDF Nikulasdóttir, XML Nikulásdóttir: change to Nikulasdóttir
PDF Khorashani, XML Khorasani -> change to Khorashani

And the cases where there is difference of opinion:

PDF WANG, XML Wang -> change to WANG
PDF P., XML Philipp -> change to P.

nschneid · 2019-05-15T18:41:25Z

> (1-2) There's less clarity about individual papers that use first initials or all caps -- @mjpost says go with the PDF.

I'm not sure about this one. Why would an author choose to abbreviate their name in some publications but not others? It could be that the names in the PDF all follow one convention which is inconsistent with what some of the authors normally do. I would generally prefer more information over less information, so if the name was spelled out in START but abbreviated in the PDF, I'd go with the non-abbreviated version.

mjpost · 2019-05-15T22:15:41Z

I agree about having more information whenever possible; I just want us to have some degree of certainty about it, so that we don't "Gilbert Keith" someone's "GK". If we have information from another source (START ID, inference about a conference convention, etc.) that suggests the author is fine with an evidenced fuller version, I'm fine with it. But part of the reason to have a strong preference for the PDF is that without that convention, one can spend endless time trying to figure out what's right in all these situations.

davidweichiang · 2019-05-16T00:56:21Z

@nschneid we previously discussed first initials at length in #245; @mjpost sorry to bring it up again. In the current situation (LREC and other conferences that use START), the full names are known to be correct because they are provided by the authors, so it seems especially sad to delete information (and indeed, I didn't do it in PR #340).

I will try to summarize the above discussion in the wiki, and I will back out the changes in L18 that made some last names all-caps.

davidweichiang · 2019-05-16T01:57:04Z

Do you want to further discuss how to get people to change their names in START? If not, we can close this issue.

davidweichiang · 2019-05-16T23:52:01Z

I think we can pretty reliably restore accents now by scraping them from PDFs. What's the best way to use this -- to identify people to ask to update their START accounts, or just run the scraper as part of ingestion?

The scraper also changes casing and inserts/deletes spaces and hyphens. But it can only flag, not autocorrect, changes in spelling or insertion/deletion of names or initials.
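The split between what can be autocorrected and what can only be flagged might be sketched as follows. This is a rough illustration and not the actual scraper: `skeleton` and `classify` are hypothetical names, and the rule (treat a pair as auto-correctable only when the names agree once diacritics, casing, spaces, hyphens, and other punctuation are ignored) is an assumption based on the description above.

```python
import unicodedata

def skeleton(name):
    """Reduce a name to case-folded base letters and digits (no diacritics, spaces, or punctuation)."""
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in base.casefold() if ch.isalnum())

def classify(xml_name, pdf_name):
    """Return 'same', 'auto' (surface difference only), or 'flag' (spelling differs)."""
    if xml_name == pdf_name:
        return "same"
    if skeleton(xml_name) == skeleton(pdf_name):
        return "auto"
    return "flag"

print(classify("Jan Hajic", "Jan Hajič"))                    # accents only
print(classify("Phillippe Langlais", "Philippe Langlais"))   # spelling differs
print(classify("Ina Roesiger", "Ina Rösiger"))               # oe vs ö is not a surface difference here
```

Note that 'auto' only says the difference is mechanical. Whether the PDF form should win (for example, all-caps or lowercased last names) remains a policy question, as discussed earlier in this thread.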

danielgildea · 2019-05-17T13:43:51Z

As far as getting people to update their names in START, it seems like there are a few things we might try:

1. Try to get everyone's emails from START, and send emails to people with a mismatch.
2. Ask ACL organizers to include a note in their email to authors about checking that the names in START match what people want, possibly including the authors' names from START in the email so that people can see easily how their names appear now.
3. Provide pub chairs with a script to check names against the PDFs, so that they can edit the metadata and possibly bug the authors themselves.

Any thoughts on which of these to pursue?

mjpost · 2019-05-21T15:38:48Z

I think we should focus our efforts on implementing this ourselves when we generate the XML (say, in anthology_xml.py). Reasoning:

Authors sometimes don't add names via START userids (the interface for this is actually pretty confusing)
Some folks are not using START so there may be other errors there

(1) and (2) are still good ideas to reduce the amount that has to be fixed, though.

davidweichiang · 2019-05-21T15:51:38Z

I agree, it would be annoying for everyone involved to email individual people. So, we have an author-name scraper (https://github.com/acl-org/acl-anthology/blob/auto_accents/bin/auto_authors.py) that could be incorporated into normalize_anth.py and run as part of ingestion.

Currently, it downloads the PDF by HTTP (as you may have noticed if you look at the server log for the last few days), but if part of ingestion, it should be an option to read from a local directory.
Improve heuristics to use some kind of minimum-edit distance like auto_name_variants.py does; it has to be really high precision, though.
Improve heuristics to know about some letter relationships like oe and ö.
Add special cases, especially for nicknames where the edit distance may be high, like Kathleen-Kathy.
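The letter-relationship heuristic could look roughly like this. It is a sketch under assumptions, not code from auto_authors.py: the table `ASCII_EXPANSIONS` and the helper names are hypothetical, and only German-style expansions are covered. The check is deliberately one-directional: it accepts an ASCII spelling as a rendering of an attested accented spelling, and never goes from oe to ö on its own.

```python
import unicodedata

# Assumed transliteration table (German-style); extend as needed.
ASCII_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def strip_marks(text):
    """Drop combining marks after NFD decomposition (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ascii_forms(accented_name):
    """Return lowercase spellings that could stand for `accented_name`: expanded and mark-stripped."""
    lowered = unicodedata.normalize("NFC", accented_name).lower()
    expanded = "".join(ASCII_EXPANSIONS.get(ch, ch) for ch in lowered)
    return {strip_marks(expanded), strip_marks(lowered)}

def is_ascii_variant(ascii_name, accented_name):
    """True if `ascii_name` is a known ASCII rendering of the attested `accented_name`."""
    return ascii_name.lower() in ascii_forms(accented_name)

print(is_ascii_variant("Ina Roesiger", "Ina Rösiger"))
print(is_ascii_variant("Philipp Köhn", "Philipp Koehn"))
```

Nickname pairs such as Kathleen/Kathy would still need an explicit special-case list, since no character-level rule relates them.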

mjpost · 2019-06-12T22:07:06Z

@davidweichiang, do you want a local copy of the Anthology PDFs? It's 35 GB. If you have a CLSP account we could set this up, or find another way.

davidweichiang · 2019-06-12T23:25:58Z

I don’t have a CLSP account (I don’t think). But a local copy might be a good idea if we can figure out a way.

akoehn · 2019-06-17T08:31:06Z

Short cross link: #295 (comment) for a discussion of how to mirror PDFs in bulk. Should be ~5mins to implement.

mjpost · 2019-06-17T12:59:53Z

I've posted a file with checksums here [14 MB].

davidweichiang · 2019-06-17T14:37:50Z

Can this file (as well as the mirroring script) become part of the repo?

akoehn · 2019-06-17T14:50:09Z

Can we discuss that further in #295 (the mirroring issue)? I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me.

danielgildea · 2020-11-10T12:43:17Z

Hi,

I just ran find_name_variants.py, which finds names that slugify to the same thing. It found over 300 cases of people with essentially the same name that are currently considered to be different people in the database because they are not entered into name_variants.yaml. Most are missing accents, and some are different first/last split for multiword names. It looks like in all cases it is the same person.

I wonder if we should change the anthology code to consider any two names that slugify to the same thing to be the same person. That way these people could have one author page without us having to track down every name variant during ingestion, which we don't have any consistent process for currently.
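To make the idea concrete, here is a small sketch of grouping names by slug. The `slugify` below is a stand-in written for illustration (NFKD fold to ASCII, lowercase, hyphenate) and is not necessarily what the Anthology code or find_name_variants.py does; `group_by_slug` is likewise a hypothetical helper.

```python
import re
import unicodedata
from collections import defaultdict

def slugify(name):
    """ASCII-fold, lowercase, and hyphenate a name (illustrative stand-in only)."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

def group_by_slug(names):
    """Map each slug to the set of distinct spellings sharing it, keeping only real collisions."""
    groups = defaultdict(set)
    for name in names:
        groups[slugify(name)].add(name)
    return {slug: variants for slug, variants in groups.items() if len(variants) > 1}

names = ["Jan Hajič", "Jan Hajic", "Ina Rösiger", "Ina Roesiger"]
print(group_by_slug(names))
```

With this stand-in, Hajič and Hajic collapse to one slug, but Rösiger and Roesiger do not, so transliterated variants would still need an entry in name_variants.yaml or a separate rule.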

mjpost · 2020-11-10T14:53:07Z

I like that idea. We should wrap up discussion in #623 and come up with a solution.

nschneid added the correction label May 13, 2019

davidweichiang mentioned this issue May 15, 2019

Automatically restore accents in L18 to match PDF #340

Merged

mjpost closed this as completed in #340 May 15, 2019

davidweichiang reopened this May 15, 2019

davidweichiang mentioned this issue May 15, 2019

Correcting the corrections to L18 #344

Merged

This was referenced Jun 2, 2019

Auto accents for 2015-2019 #376

Merged

Auto accents for years 2005-2014 #401

Merged

akoehn mentioned this issue Jun 17, 2019

Anthology mirrors #295

Closed

3 tasks

mjpost mentioned this issue Oct 21, 2019

Fix some overcapitalization #590

Closed

danielgildea mentioned this issue Nov 12, 2020

Slugify name #1064

Merged
