Better fancify.sh #3

Open
fabi1cazenave opened this issue Dec 9, 2024 · 3 comments

Comments

@fabi1cazenave
Contributor

Given that:

  • most keyboard layouts have no support for fancy letters or punctuation marks such as æ, ’, “”, …, etc.
  • many corpus texts don’t use these fancy characters either
  • the kalamine analyzer can default to ASCII when these characters are not supported by a keyboard layout: ae instead of æ, ' instead of ’, ... instead of …, "" instead of “”, etc.

our corpus should be “fancified” before being transformed into a JSON dictionary, so as not to penalize keyboard layouts that have proper support for these special characters. That’s what the fancify.sh script (or make fancy target) does. But this is still a work in progress, and several substitutions are still missing, e.g.:

  • straight quote pairs into “”, « », „“ depending on the language
  • fine no-break space before ?:;! in French
  • ¿ sign in Spanish
  • dashes rather than --
  • etc.
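
A minimal Python sketch of what the generic part of that list could look like. This is not what fancify.sh does today: the rule list, the name `fancify_generic`, and the choice of an em dash for `--` are assumptions made for illustration.

```python
import re

# Hypothetical generic rules; none of this is taken from fancify.sh.
GENERIC_RULES = [
    (re.compile(r"\.\.\."), "…"),          # three dots -> ellipsis
    (re.compile(r"(?<=\w)'(?=\w)"), "’"),  # straight apostrophe between word characters
    (re.compile(r"(?<!-)--(?!-)"), "—"),   # exactly two hyphens -> em dash (assumed)
]


def fancify_generic(text: str) -> str:
    """Apply each generic substitution in order and return the result."""
    for pattern, replacement in GENERIC_RULES:
        text = pattern.sub(replacement, text)
    return text


print(fancify_generic("l'eau... oui -- non"))
```

Runs of three or more hyphens are deliberately left alone, so that edition-specific oddities can be handled separately.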
@fabi1cazenave fabi1cazenave changed the title Better fanciftcation Better fancification Dec 9, 2024
@fabi1cazenave fabi1cazenave changed the title Better fancification Better fancify.sh Dec 9, 2024
@Ced-C

Ced-C commented Dec 19, 2024

One might want to also add accent marks on capital letters (e.g. Etre→Être)

Also, we might eventually want to support several languages, so it should be an entry parameter.

Given that the above use case and some functionality such as quote pair matching might be a little complex to implement, I would lean towards switching from a bash script to Python. WDYT?

@Ced-C

Ced-C commented Dec 19, 2024

I was comparing the Alice ebook I had:

Publication: 1865
Category(ies): Fiction, Fantasy, Youth
Source: http://www.ebooksgratuits.com

to the edition in this repo…
The edition I have, albeit different, already has proper typography throughout (œ, a no-break space before ?, « », etc.).
This makes me wonder: does a generic parsing process for corpora make much sense?

I mean, sure, some standard mistakes could be addressed (like the ones pointed out at the beginning of this issue),
but it will never be perfect.

For instance, in Alice (Gutenberg edition), ellipses are typeset as 4 hyphens (----). Is that standard, or does it apply only to this edition? Looking on the web, I have found no mention of this use of hyphens.

Wouldn’t it be better to choose editions that are already well typeset?

@fabi1cazenave
Contributor Author

One might want to also add accent marks on capital letters (e.g. Etre→Être)

+1

Also, we might eventually want to support several languages, so it should be an entry parameter.

True. Some substitutions are generic, others are language-specific.
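
As a rough illustration of the language-specific side, here is a Python sketch that takes the language as a parameter. The table `QUOTE_PAIRS`, the function name `fancify_for_language`, and the inner spacing of French guillemets are assumptions; only the quote styles and the fine no-break space before ?:;! come from this issue.

```python
import re

NNBSP = "\u202f"  # narrow ("fine") no-break space

# Assumed per-language quote pairs (opening, closing).
QUOTE_PAIRS = {
    "en": ("“", "”"),
    "fr": ("«" + NNBSP, NNBSP + "»"),  # inner spacing is an assumption
    "de": ("„", "“"),
}


def fancify_for_language(text: str, lang: str) -> str:
    """Apply quote pairing, plus French punctuation spacing when lang is 'fr'."""
    open_q, close_q = QUOTE_PAIRS[lang]
    # Pair up straight double quotes that open and close on the same line.
    text = re.sub(r'"([^"\n]*)"', lambda m: open_q + m.group(1) + close_q, text)
    if lang == "fr":
        # Put a fine no-break space before ? : ; ! when they follow a word
        # character, replacing one ordinary or no-break space if present.
        text = re.sub(r"(?<=\w)[ \u00a0]?([?:;!])", NNBSP + r"\1", text)
    return text


print(fancify_for_language('Il dit: "Quoi?"', "fr"))
```

The French rule is naive: it would also touch URLs (`http:`) and times (`12:30`), so a real implementation would need exclusions. Unbalanced or nested quotes are not handled either.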

Given that the above use case and some functionality such as quote pair matching might be a little complex to implement, I would lean towards switching from a bash script to Python. WDYT?

I’d bet sed would be enough. Moving to a full-Python solution can still be a later option.

I mean, sure, some standard mistakes could be addressed (like the ones pointed out at the beginning of this issue),
but it will never be perfect.

I agree it won’t be perfect. But even imperfect typographic improvements would be welcome for conversational corpora such as Leipzig.

For instance, in Alice (Gutenberg edition), ellipses (…) are typeset as 4 hyphens (----). Is that standard, or does it apply only to this edition?

This is absolutely non-standard and should be fixed. I expect most books to require specific tuning/replacements.
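
One possible way to keep such per-book tuning apart from the generic rules, sketched in Python. The identifier `alice-gutenberg`, the table `BOOK_RULES`, and the function `apply_book_rules` are hypothetical; the only rule shown is the `----` to `…` fix discussed here.

```python
# Hypothetical per-book overrides, meant to run before generic substitutions.
BOOK_RULES = {
    "alice-gutenberg": [("----", "…")],  # four hyphens used as an ellipsis
}


def apply_book_rules(text: str, book_id: str) -> str:
    """Apply the literal replacements registered for one book, if any."""
    for old, new in BOOK_RULES.get(book_id, []):
        text = text.replace(old, new)
    return text


print(apply_book_rules("I wonder----", "alice-gutenberg"))
```

Books without an entry pass through unchanged, so the table can grow one edition at a time.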

Wouldn’t it be better to choose editions that are already well typeset?

I agree it would be better but I’m not sure we can find a significant book collection with an open licence.
