Add Perseus data #3

Open
scott-fleischman opened this issue May 28, 2015 · 7 comments

@scott-fleischman
Owner

Load XML from https://github.com/PerseusDL

Data we may want to extract:

  • Author
  • Work
  • Date?
  • A way to reference a part of the work: chapter / paragraph / line number
  • Greek text
    • Attested form
    • Variant readings?
    • Distinguish between editor additions and MSS forms (anything else?)

Here is how I imagine we want to access the data:

  • Access all works as one large list of works. I think a work or book is the best top-level division.
    • Allow access to author, date, etc. so the list can be filtered on those parameters
  • Each work should at least provide a list of "words", with metadata on the word.
    • Words can be broken down into a list of Tokens (a data type representing letter, case and mark information)
    • It may be useful to know the punctuation at the beginning or end of a word (period, Greek question mark, semicolon, etc.). Some editions use punctuation to indicate a variant reading (found in an apparatus).
    • Each word should have a reference (author, work, line/chapter number) so we can trace where it came from.
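
To make the shape of this concrete, here is a rough Haskell sketch of the levels listed above (work, word, token, reference). Every type and field name is hypothetical, nothing here comes from the Perseus XML schema or an existing library, and the sample value is typed in by hand rather than loaded from the data. Variant readings and editor additions are left out.

```haskell
-- Hypothetical data model; every name here is illustrative only.

-- | Where a word came from, so it can be traced back to its source.
data Reference = Reference
  { refAuthor  :: String
  , refWork    :: String
  , refSection :: String   -- e.g. "1.1" for book 1, line 1
  } deriving (Eq, Show)

-- | One letter with its case and its marks (accents, breathings).
data Token = Token
  { tokenLetter :: Char
  , tokenUpper  :: Bool
  , tokenMarks  :: [Char]  -- combining code points
  } deriving (Eq, Show)

-- | A word as a list of tokens, plus surrounding punctuation and its source.
data GreekWord = GreekWord
  { wordTokens      :: [Token]
  , wordPunctBefore :: String
  , wordPunctAfter  :: String
  , wordReference   :: Reference
  } deriving (Eq, Show)

-- | Top-level division: one work with filterable metadata.
data Work = Work
  { workAuthor :: String
  , workTitle  :: String
  , workDate   :: Maybe String  -- the date may be unknown
  , workWords  :: [GreekWord]
  } deriving (Eq, Show)

-- | Filter the list of all works by author.
worksByAuthor :: String -> [Work] -> [Work]
worksByAuthor author = filter ((== author) . workAuthor)

-- | Hand-written sample data: the first word of the Iliad, and an empty work.
sampleWorks :: [Work]
sampleWorks =
  [ Work "Homer" "Iliad" Nothing [menin]
  , Work "Plato" "Republic" Nothing []
  ]
  where
    plain c = Token c False []
    menin = GreekWord
      { wordTokens =
          [ plain 'μ'
          , Token 'η' False ['\x0342']  -- eta with a circumflex (U+0342)
          , plain 'ν', plain 'ι', plain 'ν'
          ]
      , wordPunctBefore = ""
      , wordPunctAfter  = ""
      , wordReference   = Reference "Homer" "Iliad" "1.1"
      }

main :: IO ()
main = do
  print (map workTitle (worksByAuthor "Homer" sampleWorks))            -- ["Iliad"]
  print (map (length . wordTokens) (concatMap workWords sampleWorks))  -- [5]
```

Keeping the reference on each word duplicates the author and work, but it means a word stays traceable even after it has been pulled out of its work by a query.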

LSJ: Is the full work available here? If so, we will want to extract more precise information from it for lemmas, principal parts, dialect forms, etc.

@scott-fleischman
Owner Author

I'd be curious to see what an SML development of these ideas would look like. Perhaps modules would give a good abstraction for each level at which we want to interact with the language (e.g., letter, syllable, word, clause, etc.)?

@scott-fleischman
Owner Author

We probably should add the Perseus repos as submodules. Adding a test to verify successful loading of the data would be helpful.

@jonsterling
Collaborator

@scott-fleischman Thanks for the added detail!

As you mention, ML modules might be a good fit for the kind of abstractions you're talking about. On the other hand, it feels like Haskell's laziness could be a big win when considering "the whole Perseus corpus" as a big datastructure that we want to query into...
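
As a toy illustration of the laziness point, the sketch below treats the corpus as an unbounded list and only computes the parts a query touches. `loadWork` is a hypothetical pure stand-in for parsing one work; real loading would involve IO and XML parsing, where laziness needs more care than it does here.

```haskell
-- Hypothetical stand-in for parsing one work: a title plus its words.
loadWork :: Int -> (String, [String])
loadWork n = ("work-" ++ show n, ["word" ++ show i | i <- [1 .. n]])

-- An unbounded "corpus". It is never built in full; elements are
-- computed only when something demands them.
corpus :: [(String, [String])]
corpus = map loadWork [1 ..]

main :: IO ()
main = do
  -- Listing titles does not force any work's word list.
  print (take 3 (map fst corpus))   -- ["work-1","work-2","work-3"]
  -- Looking up one work stops as soon as it is found.
  print (lookup "work-2" corpus)    -- Just ["word1","word2"]
```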

@scott-fleischman
Owner Author

I wonder if it would be worthwhile to create a separate Haskell library that provides access to the Perseus data.

@jonsterling
Collaborator

It may be a good idea, but I'm afraid I probably won't have time to look into this during the next week. So if anyone else wants to take a look, you won't be stepping on my toes.

@scott-fleischman
Owner Author

I won't be getting to it in the next week either; maybe in the next month or two it will become useful as we work on our own from-scratch morphological analysis.

@scott-fleischman
Owner Author

One thing I would like to find out sooner rather than later is whether LSJ is available in the data.

scott-fleischman removed their assignment on Jan 30, 2020