In this example, we are looking for mentions of countries, regions, or locations on the basis of the Abstract and the Author and Index Keywords fields. For this, we are using
corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)
The ngram threshold, even at its lowest possible value (0.1), returns a list of common words found in the abstracts of these papers. However, the frequency does not go below 5 mentions, meaning that references to a number of countries are excluded from the word distribution.
Is there a way to reduce the ngram threshold further, or some other method, so that we can capture all word mentions, that is, a count of 1 or greater? From this we can then see which refer to geographical areas, and use the filter(like='_', axis=0) function to select relevant bigrams (e.g. United States).
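Since lowering the ngram threshold has no effect on the count floor, the cutoff may be controlled by a separate minimum-frequency setting rather than the threshold itself. The sketch below assumes build_corpus accepts a keyword along the lines of min_docs (this is an assumption about the API, worth checking against help(litstudy.build_corpus)) and that compute_word_distribution can then be used to inspect the resulting counts:

import litstudy

# Assumption: build_corpus may expose a minimum document-frequency parameter
# (called min_docs here), which would explain the observed floor of 5 mentions.
corpus = litstudy.build_corpus(
    docs_springer,
    ngram_threshold=0.8,
    min_docs=1,  # hypothetical keyword: keep words appearing in a single document
)

# Inspect the full distribution, then keep only underscore-joined bigrams
# (e.g. united_states) for manual review of geographical names.
freq = litstudy.compute_word_distribution(corpus)
print(freq.filter(like='_', axis=0))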
Thanks,
S
Maybe the simplest solution would be to simply tokenise all of the words mentioned across the DocumentSet and print them to a .csv, from which we can refine for mentions of geographical locations?
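A minimal sketch of that route, assuming each document in the DocumentSet exposes an .abstract attribute (which may be None when Springer supplies no abstract); the author and index keywords could be appended in the same loop if doc.keywords is populated:

import csv
import re
from collections import Counter

counts = Counter()
for doc in docs_springer:
    text = doc.abstract or ""
    # simple word tokenisation; adjust the pattern if hyphenated names matter
    counts.update(re.findall(r"[a-z]+", text.lower()))

# Write every token, including those mentioned only once, for manual filtering
# of geographical names.
with open("word_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        writer.writerow([word, count])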