Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Case-insensitive parsing and validation. #51

Open
robertodr opened this issue Mar 27, 2019 · 30 comments
Open

Case-insensitive parsing and validation. #51

robertodr opened this issue Mar 27, 2019 · 30 comments
Labels
enhancement New feature or request RFC Planning and design discussions
Milestone

Comments

@robertodr
Copy link
Contributor

As pointed out by @stigrj it would be good to not force case-sensitivity upon the users. I see some options:

  1. Do it before we parse. This would mean the programmer using the library has to normalize the case of the input file and the validation template has to adhere to this choice. That is, if normalization is uppercase (lowercase), then validation template use uppercase (lowercase).
  2. As point 1, but we add an option to the api.lex (and expose it in the CLI). Something like --case upper (--case lower)
  3. We let the grammar take care of case normalization, by adding a parse action to the various tokens. In case we change the parsing library (from pyparsing to lark, for example) this will not carry through.
  4. We do case normalization at the validation level. I think this is the most invasive option of all.
@robertodr robertodr added enhancement New feature or request RFC Planning and design discussions labels Mar 27, 2019
@stigrj
Copy link
Collaborator

stigrj commented Mar 27, 2019

  1. 👍

@bast
Copy link
Member

bast commented Mar 27, 2019

I agree with 2.

@robertodr
Copy link
Contributor Author

Noting this here, as it might be helpful: https://docs.python.org/3/library/stdtypes.html#str.casefold

@robertodr
Copy link
Contributor Author

Does it make sense to you that the case of quoted strings should be preserved? The way Getkw works, you can have quoted and unquoted strings. The latter can only look like valid C user tokens (first character can be an underscore or an alphabetic chars, followed by more underscores and alphanumeric chars) whereas the former can be almost anything (unicode too, I think) Most importantly, quoted string can contain slashes, so that this is the format one should use to specify file paths. I'd expect file paths to always be intended to be case-sensitive, hence my question.

@bast
Copy link
Member

bast commented Mar 28, 2019

I agree that changing case of paths is dangerous. Also until now I did not realize that we wanted to possibly normalize everything - somehow I imagined that we only normalize keys but not values.

@robertodr
Copy link
Contributor Author

Yeah, it's thorny. Are all of these acceptable?

Functional = B3LYP
functional = b3lyp
FunCtioNAL = b3LYP

@bast
Copy link
Member

bast commented Mar 28, 2019

Every time I saw case insensitivity as user (e.g. Fortran, old OS X file system (?)), I thought: "wow what a terrible idea!" - but I guess we can offer the possibility.

@bast
Copy link
Member

bast commented Mar 28, 2019

How will this work:

datapath1 = /home/user/doNotNormalizeMe
datapath2 = "/home/user/doNotNormalizeMe"

@robertodr
Copy link
Contributor Author

The current grammar will not be able to parse the former.

@bast
Copy link
Member

bast commented Mar 28, 2019

How about we only offer ignore-casing of keys? But we preserve values?

@stigrj
Copy link
Collaborator

stigrj commented Mar 28, 2019

How about this

  a_bool = True
  another_bool = false

Is it defined in the grammar what to accept?

@robertodr
Copy link
Contributor Author

robertodr commented Mar 28, 2019

@stigrj All of these are valid for booleans as caseless literals already.

truthy = ["TRUE", "ON", "YES", "Y"]
"""List[str]: List of true-like values."""
falsey = ["FALSE", "OFF", "NO", "N"]
"""List[str]: List of false-like values."""

@bast I think I would still like b3lyp to be as valid as B3LYP though. Regarding floating point numbers, only caseless E is recognized now. D can be arranged too, but I'd rather not encourage using D...

@bast
Copy link
Member

bast commented Mar 28, 2019

Good point about the booleans, this is a frequent papercut. Also scientific notation floats should allow ignore-case "D" and "E".

@bast
Copy link
Member

bast commented Mar 28, 2019

Yes, b3lyp example is convincing. Also I would like to type cc-pvdz.
Some half baked ideas:

  • we allow to ignore ignore case in the template for some sensitive nodes (sounds not elegant)
  • we add "path" to the set of allowed types and we do not touch "paths"

@stigrj
Copy link
Collaborator

stigrj commented Mar 28, 2019

Yes, I think the most realistic use case is to allow both

functional = b3lyp
functional = B3LYP

The keywords can always be set to follow a convention, like in MRChem:

  • sections: capitalized camel-case
  • keywords: lowercase with underscores

It's not that easy with values

@bast
Copy link
Member

bast commented Mar 28, 2019

About the "NO" interpreted as falsey - we had a nasty bug in a web thingy where we were parsing country codes for Nordic projects and NO (Norway) was interpreted as false :-)

@robertodr
Copy link
Contributor Author

I think paths are the only case where we would never want to modify the case of the string. Are there any other examples? I feel adding a Path and List[Path] type could be the elegant and easy way out of this:

  • everything case-sensitive by default (except booleans and floats in scientific notation)
  • case-sensitive with case normalization selected by the user. Paths must be given as quoted strings, types must be declared accordingly. Case normalization is only done on values that are strings At what level I still have to figure out. Probably needs two passes: one before going into the lexer proper, we read the file and normalize skipping all singly and doubly quoted strings, and another when fixing defaults, to make sure all strings are normalized.

@bast I was bitten by the same two days ago (in the randomized testing of lists of unquoted strings) Should we only allow true and false? I was trying to reproduce the original grammar: https://github.com/dev-cafe/libgetkw/blob/master/Python/getkw.py#L757

@robertodr
Copy link
Contributor Author

Bonus of having the Path type. Predicates to check that a file already exists are trivial: from pathlib import Path; Path(value).exists()

@stigrj
Copy link
Collaborator

stigrj commented Mar 28, 2019

How about adding actions to the keywords similar to predicates, so that you can get normalized values where you want them, like functionals and basis sets

keyword:
  - name: functional
    type: str
    predicates:
      - 'len(value) < 20'
    actions:
      - 'value = value.lower()'

A Path type still sounds like a good idea!

@robertodr
Copy link
Contributor Author

So much power... I like the idea, but I don't think it scales well (I need to do it for all keywords) I need to think about it more.

@bast
Copy link
Member

bast commented Mar 28, 2019

After some more thinking I still like the idea of a Path type. Strings and paths are really distinct types.

As to "NO": I would not miss yes/no. I am still unsure whether I would miss on/off.

@bast
Copy link
Member

bast commented Mar 28, 2019

Another thing that I would not like to case-normalize are checksums/hashes.

@robertodr
Copy link
Contributor Author

Let's split the str type into two: case-sensitive and case-insensitive. The latter would have the case normalization action on its value "embedded". The Path type is a valuable addition regardless of this discussion. I'll open a new issue (and possibly have a PR ready later this evening)

@bast
Copy link
Member

bast commented Mar 29, 2019

I like the suggestion of having two str types. This makes everything less surprising.

@robertodr
Copy link
Contributor Author

Name suggestions for the types? istr and sstr?

@bast
Copy link
Member

bast commented Mar 29, 2019

How about str and str_ignorecase?

@robertodr
Copy link
Contributor Author

I have written some preliminary design docs. Let me know if you agree. See also my comment on #55 as to why, contrary to my initial enthusiam, we do not need/want a path type.

Case-sensitivity is a sensitive issue. It is desirable to have the following two examples be equally valid:

$ functional = b3lyp
$ Functional = B3LYP

We can ensure case-insensitive comparisions by preliminarly normalizing the case of the input. The programmer would have the choice of normalizing to uppercase or lowercase. Hence, the above examples would both decay to either:

$ functional = b3lyp

when normalizing to lowercase, or:

$ FUNCTIONAL = B3LYP

when normalizing to uppercase. Note how normalization happens on the left-hand (section and keyword labels) and the right-hand side (values).
This means that the validation specifications in the template.yml file need to conform to whatever the chosen normalization is, both for keyword/section names and possible string defaults. Note that for booleans case normalization is not problematic: their type will be coerced upon reading in, be it through tokenization of the input file or loading of the template.yml.

There are however cases where normalizing the case is not desirable. Two notable examples are filesystem paths and checksums. For these cases, the case from input needs to be respected. Once again, note that the case of default values in the template.yml is completely up to the programmer. No normalization will ever be performed by the library.

We offer two string types:

  • Case sensitive: str. No case normalization will be performed for values of this type.
  • Case insensitive: str_ignorecase. For values of this type, the case
    will be normalized - either upper or lower - according to the request of the
    programmer. If no case normalization is requested, this string type decays to str.

@stigrj
Copy link
Collaborator

stigrj commented Mar 30, 2019

Do we really need/want insensitive keywords and sections? Wouldn't it be more sensible to only normalize the str_ignorecase values and keep everything else, instead of normalizing everything and keep the sensitive str? It feels a bit strange that if you want to accept case insensitive strings from the user you have to declare it as str_ignorecase and at the same time make all section and keyword names insensitive.

How about three types:

  • str: case sensitive
  • str_lower: value will be normalized to lower
  • str_upper: value will be normalized to upper

@bast
Copy link
Member

bast commented Mar 30, 2019

Thanks for convincing arguments about path. Also I agree that we should perhaps only allow ignorecase/lower/upper on the values but not the keys. The reason is that later the code wants to fish out the values using well defined keys and perhaps we don't want another degree of freedom there.
I like the suggestion with str, str_lower, str_upper - this way the code knows what they get and do not need to normalize inside again just to be sure.

@bast
Copy link
Member

bast commented Jul 18, 2019

Discussion with Stig and Roberto:

  • keys (in other words keywords and section names) are normalized to lowercase and the dictionary is then accessed with lowercase keys
  • values: we postpone this decision until we decide about actions later since actions would allow lower() or upper()

@robertodr robertodr added this to the v1.0.0 milestone Aug 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request RFC Planning and design discussions
Projects
None yet
Development

No branches or pull requests

3 participants