This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

Lost data from grobid TEI XML references transformation #89

Open
Vitaliy-1 opened this issue Jan 8, 2017 · 3 comments


@Vitaliy-1

Vitaliy-1 commented Jan 8, 2017

Here is a typical parsed bibliography list item from the Open Typesetting Stack at http://pkp-udev.lib.sfu.ca/, in JATS format (authors omitted):

<ref id="R24">
<element-citation>
<article-title>
Randomized Controlled Trial of Family Therapy in Advanced Cancer Continued Into Bereavement
</article-title>
<source>Journal of Clinical Oncology</source>
<year>2016-apr</year>
<fpage>1921</fpage>
<lpage>1927</lpage>
</element-citation>
</ref>

And here is the same item as produced by the Grobid transformation alone:

<biblStruct coords="7,103.10,142.06,449.34,10.80;7,103.10,155.86,449.44,10.80;7,103.10,167.17,449.72,13.30"  xml:id="b12">
                    <analytic>
                        <title level="a" type="main">Randomized Controlled Trial of Family Therapy in Advanced Cancer Continued Into Bereavement</title>
                    </analytic>
                    <monogr>
                        <title level="j">J Clin Oncol</title>
                        <imprint>
                            <biblScope unit="volume">1</biblScope>
                            <biblScope unit="issue">16</biblScope>
                            <biblScope unit="page" from="34" to="1621" />
                            <date type="published" when="2016" />
                        </imprint>
                    </monogr>
</biblStruct>

As you can see, the volume and issue information is lost in the resulting JATS XML. I suppose the Grobid module parses this data from the DOI or PubMed links that are included in all our bibliographic citation list items, and that it gets lost somewhere at the TEI-to-JATS transformation stage. This issue affects all the articles I have processed with this online service so far (about 20).
The pages, journal title and year are also different, so maybe the references come from another module. In that case the volume and issue could be taken from Grobid.
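
To make the request concrete, here is a minimal Python sketch of carrying the volume and issue from a Grobid `biblStruct` over into a JATS `element-citation`. It is not the stack's actual transformation (that is done elsewhere, and in XSLT); the function name `tei_to_jats` and the ref id are hypothetical, and the snippet assumes the un-namespaced TEI fragment quoted above (full Grobid output declares the TEI namespace, which would need handling). The values are copied exactly as Grobid produced them, without checking that they are right.

```python
import xml.etree.ElementTree as ET

# TEI fragment as quoted in this issue (coords attribute dropped for brevity).
TEI_SAMPLE = """<biblStruct xml:id="b12">
  <analytic>
    <title level="a" type="main">Randomized Controlled Trial of Family Therapy in Advanced Cancer Continued Into Bereavement</title>
  </analytic>
  <monogr>
    <title level="j">J Clin Oncol</title>
    <imprint>
      <biblScope unit="volume">1</biblScope>
      <biblScope unit="issue">16</biblScope>
      <biblScope unit="page" from="34" to="1621" />
      <date type="published" when="2016" />
    </imprint>
  </monogr>
</biblStruct>"""


def tei_to_jats(bibl, ref_id):
    """Build a JATS <ref> from one TEI <biblStruct> (hypothetical helper)."""
    ref = ET.Element("ref", {"id": ref_id})
    citation = ET.SubElement(ref, "element-citation")

    def add(tag, text):
        # Only emit an element when there is something to put in it.
        if text:
            ET.SubElement(citation, tag).text = text.strip()

    add("article-title", bibl.findtext("analytic/title"))
    add("source", bibl.findtext("monogr/title"))

    imprint = bibl.find("monogr/imprint")
    if imprint is not None:
        date = imprint.find("date")
        if date is not None:
            add("year", date.get("when", "")[:4])
        for scope in imprint.findall("biblScope"):
            unit = scope.get("unit")
            if unit in ("volume", "issue"):
                # These are the two fields missing from the JATS output.
                add(unit, scope.text)
            elif unit == "page":
                add("fpage", scope.get("from"))
                add("lpage", scope.get("to"))
    return ref


jats_ref = tei_to_jats(ET.fromstring(TEI_SAMPLE), "R24")
print(ET.tostring(jats_ref, encoding="unicode"))
```

The resulting `element-citation` has the same children as the JATS item above plus `<volume>` and `<issue>`.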

@Vitaliy-1
Author

Hmm, as I see from the Grobid TEI-to-JATS XSLT, Grobid is not used for reference rendering at all. That's a pity, since it may parse references better than the other modules :)

@axfelix
Contributor

axfelix commented Jan 9, 2017

That's correct, we don't use Grobid for reference parsing -- either Cermine or meTypeset is used to detect the reference section, which is then sent to CrossRef to match known-good data, and ParsCit is used to parse any references that didn't have a DOI and couldn't be looked up. ParsCit still outperforms all other local reference parsing solutions that we've tried.
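
The lookup-then-fallback order described above can be sketched roughly as follows. This is only an illustration of the control flow, not the stack's code: `resolve_reference`, `fake_crossref_lookup` and `fake_local_parser` are hypothetical names, and the two stand-in functions do not call CrossRef or ParsCit at all.

```python
def resolve_reference(raw_reference, crossref_lookup, local_parser):
    """Prefer a matched CrossRef record; fall back to local parsing."""
    matched = crossref_lookup(raw_reference)
    if matched is not None:
        return {**matched, "origin": "crossref"}
    # No match was found, so parse the raw string locally instead.
    return {**local_parser(raw_reference), "origin": "local-parser"}


# Stand-ins for the real services (assumptions, for demonstration only).
def fake_crossref_lookup(raw_reference):
    # Pretend that only references carrying a DOI can be matched.
    if "doi:" in raw_reference:
        return {"title": raw_reference.split(" doi:")[0]}
    return None


def fake_local_parser(raw_reference):
    return {"title": raw_reference}


references = [
    "Some article title doi:10.0000/example",
    "Another reference without an identifier",
]
origins = [
    resolve_reference(r, fake_crossref_lookup, fake_local_parser)["origin"]
    for r in references
]
print(origins)
```

Recording the origin of each parsed reference makes it easier to see afterwards which stage a missing field went through.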

@Vitaliy-1
Author

Vitaliy-1 commented Jan 10, 2017

Sometimes Cermine does not detect the reference section correctly. For example, in an article that I tested with Cermine only, the first 2 references were lost: they were parsed as article text. That was not the case with Grobid. I have not done many tests with the latter, so I cannot say for sure which is better. I am also planning to process all our articles with the Open Typesetting Stack, and I can compare the resulting reference sections with the Grobid equivalents to see the difference, if that would help you in development.

Nevertheless, it would be great to add volume and issue tags to the JATS during transformation, because for now that is manual work for us. I think they are lost at the stage of rendering the CrossRef data (this is the case when the referenced article has a DOI or PMID).
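
As a rough illustration of what I mean, here is a Python sketch that renders a CrossRef-style record into JATS and keeps `volume` and `issue`. The field names (`title`, `container-title`, `issued`, `volume`, `issue`, `page`) follow the JSON returned by the public CrossRef REST API; whether the stack receives CrossRef data in this shape is an assumption on my side. The function name `crossref_to_jats` is hypothetical, and the sample record holds placeholder values, not real metadata.

```python
import xml.etree.ElementTree as ET


def crossref_to_jats(work, ref_id):
    """Render one CrossRef-style work record as a JATS <ref> (hypothetical)."""
    ref = ET.Element("ref", {"id": ref_id})
    citation = ET.SubElement(ref, "element-citation")

    def add(tag, value):
        # Skip fields that are missing or empty in the record.
        if value:
            ET.SubElement(citation, tag).text = str(value)

    add("article-title", (work.get("title") or [None])[0])
    add("source", (work.get("container-title") or [None])[0])

    date_parts = work.get("issued", {}).get("date-parts") or []
    add("year", date_parts[0][0] if date_parts and date_parts[0] else None)

    # The two fields that are currently missing from the JATS output.
    add("volume", work.get("volume"))
    add("issue", work.get("issue"))

    first_page, _, last_page = (work.get("page") or "").partition("-")
    add("fpage", first_page)
    add("lpage", last_page)
    return ref


# Placeholder record for demonstration only.
sample_work = {
    "title": ["Example article title"],
    "container-title": ["Example Journal"],
    "issued": {"date-parts": [[2020, 4]]},
    "volume": "12",
    "issue": "3",
    "page": "101-110",
}
crossref_ref = crossref_to_jats(sample_work, "R1")
print(ET.tostring(crossref_ref, encoding="unicode"))
```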
