Add lowercase script? #82
Comments
This is actually quite trivial in Python and on the command line, so I'm not sure whether adding a lowercase script would be beneficial. In Python:
s = "abc"
s.lower()
On command line:
But if more people vote +1 on the idea, it's not hard to implement and add it =) |
It's definitely easy to do (although I always have to google the command line version), but I often find myself looking for a script to do it, and the original moses had one. |
Caution with |
For what it's worth,
$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
он жил в москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
έζησε στη μόσχα |
Not in GNU coreutils 8.28 (Ubuntu 18.04.03):
$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
Он жил в Москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα. |
Interesting. Hmmm, so is that feature in the @noe's point to https://stackoverflow.com/questions/13381746/tr-upper-lower-with-cyrillic-text/13383175#13383175 is right, on Ubuntu
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα.
$ echo "Έζησε στη Μόσχα." | sed 's/[[:upper:]]*/\L&/'
έζησε στη Μόσχα. |
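For comparison with the `tr` and `sed` results above: Python's built-in `str.lower()` applies Unicode case mappings and does not depend on the system locale, so the Cyrillic and Greek examples behave the same on any platform. The following is only a minimal sketch; `lowercase_lines` is a hypothetical helper name, not an existing sacremoses function.

```python
from typing import Iterable, Iterator


def lowercase_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each input line lowercased, one line at a time.

    Hypothetical helper for illustration; not part of sacremoses.
    """
    for line in lines:
        # str.lower() uses Unicode case mappings, independent of locale.
        yield line.lower()


examples = ["He lived in Moscow.", "Он жил в Москве.", "Έζησε στη Μόσχα."]
for lowered in lowercase_lines(examples):
    print(lowered)
```

Because the helper accepts any iterable of strings, passing `sys.stdin` instead of the list would lowercase a piped corpus line by line.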
To me, having the lowercasing in
|
+1. It would also be nice to chain operations, e.g., echo This is a test | sacremoses normalize [options] lowercase [options]... |
@mayhewsw @noe @mjpost No promises, but lowercase is low-hanging fruit. Let's see how far I get by the end of the week of this sprint =) @mjpost good idea on pipelining. Any other interface to follow? Can anyone point to a similar pipelining interface in a CLI? Maybe it should start with how we want to do it within Python first, then move to the CLI? References: |
@mjpost Good news on chaining the commands for pipelining https://click.palletsprojects.com/en/7.x/commands/#multi-command-pipelines =) Gonna be a fun Tuesday tomorrow, implementing this!! |
Here's an update on a POC of the pipeline: it seems that doing any simplistic stdin pipelining with click requires storing the full data in memory first. https://github.com/alvations/warppipe I'm not sure how UNIX does it, but keeping stdin / stdout in memory might be painful when the corpus is huge. Currently, if we do the processing stepwise, streaming in and out, theoretically nothing would be kept in memory, but I/O time is costly since we have to save the stdout somewhere. Does anyone know how UNIX does streams and pipes? Any pointers? |
All the processors are generators. So at the top level you should be able to just pass one sentence at a time through each of them, right? I don't see what about this requires you to load all the data (but I agree that you cannot do that!)
matt (replying by email, Apr 14, 2020)
|
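The point about generators can be sketched as follows. This is an illustration under assumptions, not the sacremoses implementation: `normalize_spaces`, `lowercase`, and `chain_stages` are hypothetical names, and the "normalization" step is a simple stand-in. Each stage takes an iterator of lines and returns another iterator, so chaining them pulls one sentence at a time through the whole pipeline without holding the corpus in memory.

```python
from typing import Callable, Iterable, Iterator, Sequence

# A stage consumes an iterable of lines and lazily yields processed lines.
Stage = Callable[[Iterable[str]], Iterator[str]]


def normalize_spaces(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for a normalization step: collapse whitespace."""
    for line in lines:
        yield " ".join(line.split())


def lowercase(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical lowercase step."""
    for line in lines:
        yield line.lower()


def chain_stages(lines: Iterable[str], stages: Sequence[Stage]) -> Iterable[str]:
    """Wrap the input in each stage in turn; no line is read until iteration."""
    for stage in stages:
        lines = stage(lines)
    return lines


corpus = ["This   is a  TEST", "Another    Line"]
for out in chain_stages(corpus, [normalize_spaces, lowercase]):
    print(out)
```

Since every stage is a generator, requesting the next output line reads exactly one input line, so memory use stays constant regardless of corpus size.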
Maybe there's some usefulness in loading the whole dataset into memory instead of processing one sentence at a time. Empirically it seems to be a few seconds faster on a dataset that takes 20-30 seconds to tokenize. Maybe this should be an option too, ---load-in-ram or something. Given that processing usually gets done on servers with much more RAM than plain text data (nowadays), this isn't a problem. From some playing around with the UNIX CLI, it looks like it's processing the full pipeline by chunks instead of performing the processes sequentially. Got to look at this a little more carefully. https://linux.die.net/man/7/pipe P/S: Fixing the last issues with the kwargs things in click, and then the pipeline feature should be good to go for a PR. |
Loading into RAM by default is a huge mistake. You are introducing a hardware constraint where there doesn't need to be one. IMO it's not worth the complexity to even permit preloading, just to save a few seconds. It doesn't matter.
matt (from my phone; replying by email, Apr 14, 2020)
|
With I guess with the pipeline global, the lowercase command would look something like this:
cat big.txt | sacremoses -j 4 -l en lowercase [OPTIONS]
|
I can't think of any options for lowercase. That looks good above. |
I wonder if a "reverse lowercase" option would be useful. Sometimes you want everything in upper case. |
@mayhewsw I can't think of a frequent NLP use case where everything needs to be uppercase. What did you have in mind? |
I agree that it's not frequent, but sometimes it's useful, and if the pipeline is already there, it shouldn't be hard to add |
There's something better coming up, upper, lower and a surprise. But it'll take a couple of days to free myself up for some more coding and finishing up the feature =) |
Moses scripts included a useful lowercasing script. Are there any plans to add this?