This issue was moved to a discussion.

Listing document titles #74

Closed
AstridDBJ opened this issue Jan 9, 2024 · 3 comments

Comments

@AstridDBJ

Hi!

I'm quite new to Python (so this might be an easy fix), but I found LitStudy really interesting to look into.

I have tried to load documents from different files (from different databases) and used litstudy.types.DocumentSet.union to get a DocumentSet without duplicates. However, I would like to know which papers are then in this new collection/dataset. Is it possible to get LitStudy to list (e.g. in a Pandas DataFrame?) the titles of the documents in a specific dataset? Or provide a list/table of the titles just at any stage in the process?

@stijnh
Member

stijnh commented Jan 9, 2024

Hi Astrid! Thanks for using litstudy and thanks for reporting this issue!

Unfortunately, at the moment there is no functionality to see which papers were removed when taking the union of multiple document sets.

Issue #68 discussed a similar problem: there is currently no way to find the papers removed by unique(). One idea there was to add a duplicates() method that returns the papers removed by unique(), such that len(docset) == len(docset.unique()) + len(docset.duplicates()). Something similar could be implemented for union().
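
The invariant behind the duplicates() idea can be sketched in plain Python. This is only an illustration, not litstudy code: split_duplicates is a hypothetical helper, the documents are stand-in objects with just a title attribute, and matching on a lower-cased, stripped title is an assumption rather than the rule litstudy actually uses to detect duplicates.

```python
from types import SimpleNamespace


def split_duplicates(docs, key=lambda d: d.title.strip().lower()):
    """Hypothetical helper: split docs into (unique, duplicates).

    The first document seen for each key is kept as unique; later
    documents with the same key are collected as duplicates.
    """
    seen = set()
    unique, duplicates = [], []
    for doc in docs:
        k = key(doc)
        if k in seen:
            duplicates.append(doc)
        else:
            seen.add(k)
            unique.append(doc)
    return unique, duplicates


# Stand-in documents (not litstudy objects), two of which share a title
docs = [SimpleNamespace(title=t) for t in ["Paper A", "Paper B", "paper a "]]
unique, duplicates = split_duplicates(docs)

# The invariant from the proposal: nothing is lost, only partitioned
assert len(docs) == len(unique) + len(duplicates)
```

Returning both lists from one pass keeps the two halves consistent by construction, which is what the proposed len() identity requires.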

We are open to contributions and will accept relevant pull requests that add this functionality.

@AstridDBJ
Author

Good to know, thanks! However, I'm actually more interested in the documents that are kept after the union (so not the removed duplicates); e.g. to know which documents I should look into for my review, and thus also the titles of the documents that the different kinds of histograms are based on. Is that possible to do with LitStudy?

@stijnh
Member

stijnh commented Jan 18, 2024

You can always print the documents like this:

# Union of the two document sets
docs_csv = docs_ieee | docs_springer

# Print the title of every document kept after the union
for doc in docs_csv:
    print(doc.title)

Would that work? Each document has many attributes that you can access (such as the title, authors, publisher, etc.). See here: https://nlesc.github.io/litstudy/api/types.html#litstudy.types.Document
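
If a table is more convenient than printed output, one possible approach is to collect the attributes into one row per document. The sketch below is a minimal example under assumptions: document_rows is a hypothetical helper (not part of litstudy), and the documents are stand-in objects carrying only the title and publisher attributes mentioned above, so that the snippet runs without any data files. In practice the union result would take their place.

```python
from types import SimpleNamespace


def document_rows(docs, fields=("title", "publisher")):
    """Hypothetical helper: one dict per document with the requested attributes.

    Attributes that a document does not have are filled in as None.
    """
    return [{f: getattr(doc, f, None) for f in fields} for doc in docs]


# Stand-in documents (not litstudy objects), used only for illustration
docs_csv = [
    SimpleNamespace(title="Paper A", publisher="IEEE"),
    SimpleNamespace(title="Paper B", publisher="Springer"),
]

rows = document_rows(docs_csv)
for row in rows:
    print(row["title"])
```

With pandas installed, pandas.DataFrame(rows) should turn these rows into a DataFrame with one column per field.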

@NLeSC NLeSC locked and limited conversation to collaborators Jan 25, 2024
@isazi isazi converted this issue into discussion #79 Jan 25, 2024
