[stdlib] Fix atol parsing for prefixed integer literals with leading underscores #3180
Thanks for taking the time to do this. Could you add some of the suggestions and tests for the other PR here? I see the same behavior missing (underscore and 0 at the beginning of non-base-10 numbers).
LGTM!
btw. loved this refactor
var start = pos
if start + 1 < str_len:
    var prefix_char = str_ref[start + 1]
    if str_ref[start] == "0" and (
        (base == 2 and (prefix_char == "b" or prefix_char == "B"))
        or (base == 8 and (prefix_char == "o" or prefix_char == "O"))
        or (base == 16 and (prefix_char == "x" or prefix_char == "X"))
    ):
        start += 2
return start
Thanks @martinvuyk :) But a single newline seems to be consistent with the other docstrings, no?
I have repurposed this PR to focus on the mentioned issue.
stdlib/src/builtin/string.mojo
Outdated
  # starting "was_last_digit_undescore" to true such that
  # if the first digit is an undesrcore an error is raised
- var was_last_digit_undescore = True
+ var was_last_digit_undescore = real_base == 10
We want `was_last_underscore` to be false only when an accepted prefix is present, which is not necessarily when `real_base != 10`, but rather only if `real_base` is 2, 8, or 16 and the string contains a prefix, according to the lexical syntax of Python.
Thoughts @soraros, @martinvuyk
You're right, I hadn't seen "Base must be >= 2 and <= 36, or 0."
so that means this could receive other bases that have no prefix. So yeah totally, leading underscore should fail
Python:
print(int("_123", 30))
ValueError: invalid literal for int() with base 30: '_123'
so it should be `real_base in (2, 8, 16)`
nice catch, maybe we could add some tests with assert raises (and also passing ones) with bases other than the classics
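The Python behavior being matched here can be checked directly. A small sketch using only the built-in `int()`; the helper name `parses` is made up for illustration and is not part of any library:

```python
def parses(text: str, base: int) -> bool:
    """Return True if Python's built-in int() accepts `text` in `base`."""
    try:
        int(text, base)
        return True
    except ValueError:
        return False


# A leading underscore with no prefix is rejected, whatever the base.
print(parses("_123", 30))    # False
print(parses("_ff", 16))     # False
print(parses("_1", 10))      # False

# A single underscore right after an accepted prefix is allowed.
print(parses("0x_ff", 16))   # True
print(parses("0b_101", 0))   # True (base 0 infers base 2 from the prefix)

# Two underscores in a row are rejected even after a prefix.
print(parses("0x__ff", 16))  # False
```

These are the kinds of passing and failing cases that tests with non-classic bases would need to cover.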
Will for sure add some test cases with the next commit. We will have to add a check for whether there exists a prefix as well though.
So something along the lines of `real_base in (2, 8, 16) and has_prefix`, as we could have `atol("_ff", 16)`, which should raise as invalid.
Was playing around with `(real_base & 0b11010) != real_base and has_prefix`, which would have been fun... 10 and 18 break it though.
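Why the mask trick breaks: `0b11010` is `2 | 8 | 16`, so `(real_base & 0b11010) == real_base` holds for any base whose set bits are a subset of those three, not just for 2, 8, and 16 themselves. A quick Python enumeration over the valid base range (the names `MASK` and `passes_mask` are illustrative only):

```python
MASK = 0b11010  # bits for 2, 8 and 16 OR-ed together


def passes_mask(base: int) -> bool:
    """True when every set bit of `base` is also set in MASK."""
    return (base & MASK) == base


# Bases in 2..36 that pass the mask test without being 2, 8 or 16.
false_positives = [
    b for b in range(2, 37) if passes_mask(b) and b not in (2, 8, 16)
]
print(false_positives)  # [10, 18, 24, 26]
```

Besides the 10 and 18 mentioned above, 24 and 26 slip through as well.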
> So something along the lines of real_base in (2, 8, 16) and has_prefix, as we could have atol("_ff", 16) which should raise as invalid.
yeah I guess `str_ref[start] == "0"` can be taken out of the if to a `has_prefix` var, nice catch again :D
> Was playing around with (real_base & 0b11010) != real_base and has_prefix which would've have been fun...
Niiice, maybe this will work?
from bit import is_power_of_two
...
# first character as underscore is only valid for bases 2, 8, 16 with prefix
var was_last_digit_undescore = not (is_power_of_two(real_base) and has_prefix and real_base != 32)
...
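To see which bases the power-of-two test selects on its own, here is a Python stand-in. The function below only assumes the usual meaning of "power of two" and is not the Mojo `bit.is_power_of_two` implementation:

```python
def is_power_of_two(n: int) -> bool:
    """Assumed semantics: exactly one bit set in a positive integer."""
    return n > 0 and (n & (n - 1)) == 0


# Valid bases (2..36) accepted by `is_power_of_two(base) and base != 32`.
accepted = [b for b in range(2, 37) if is_power_of_two(b) and b != 32]
print(accepted)  # [2, 4, 8, 16]
```

Base 4 is a power of two without a literal prefix, so it would only be excluded if `has_prefix` is never true for it.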
I like the concept! Will have to catch 4 explicitly there as well.
We can use a mask, `bitmask = (1 << 2) | (1 << 8) | (1 << 16)`, and then `not ((1 << real_base) & bitmask) and has_prefix`.
This seems to work. But it might be overcomplicated for a simple value range check.
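A Python sketch of the mask idea, checked against plain tuple membership over every valid base. The names `BITMASK`, `in_prefixed_bases`, and `initial_flag` are illustrative, and the second function assumes the intent is "not (base is 2, 8, or 16 and a prefix is present)":

```python
BITMASK = (1 << 2) | (1 << 8) | (1 << 16)  # one bit per prefixed base


def in_prefixed_bases(base: int) -> bool:
    """Set-membership test: is `base` one of 2, 8, 16?"""
    return bool((1 << base) & BITMASK)


def initial_flag(base: int, has_prefix: bool) -> bool:
    """Initial 'last character was an underscore' flag (assumed intent)."""
    # The outer parentheses matter: in Python, `not x and y` parses as
    # `(not x) and y`, which is a different condition.
    return not (in_prefixed_bases(base) and has_prefix)


# The mask agrees with tuple membership for every base in 2..36.
print(all(in_prefixed_bases(b) == (b in (2, 8, 16)) for b in range(2, 37)))  # True
print(initial_flag(16, True))   # False: underscore allowed after "0x"
print(initial_flag(16, False))  # True: "_ff" must raise
```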
yeah I think this was a fun exercise lol but let's just stick with tuple contains
I agree. At least we can defer further optimisation to a later PR should we find it necessary.
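For reference, the tuple-membership rule can be exercised in a simplified Python parser. This is a sketch under stated assumptions, not the stdlib `atol`: the name `atol_sketch` is hypothetical, and it skips signs, whitespace, and base-0 inference.

```python
def atol_sketch(text: str, base: int = 10) -> int:
    """Parse an unsigned integer string using the rule from this thread."""
    if not 2 <= base <= 36:
        raise ValueError("Base must be >= 2 and <= 36")
    prefixes = {2: "0b", 8: "0o", 16: "0x"}
    has_prefix = base in (2, 8, 16) and text[:2].lower() == prefixes[base]
    start = 2 if has_prefix else 0

    # Starting as True turns a leading underscore into an error; the flag
    # starts as False only directly after an accepted prefix.
    was_last_underscore = not (base in (2, 8, 16) and has_prefix)
    seen_digit = False
    result = 0
    for ch in text[start:]:
        if ch == "_":
            if was_last_underscore:
                raise ValueError(f"invalid literal with base {base}: {text!r}")
            was_last_underscore = True
            continue
        # Map 0-9 and a-z/A-Z to 0..35; anything else is out of range.
        digit = int(ch, 36) if ch.isascii() and ch.isalnum() else base
        if digit >= base:
            raise ValueError(f"invalid literal with base {base}: {text!r}")
        result = result * base + digit
        was_last_underscore = False
        seen_digit = True
    if was_last_underscore or not seen_digit:
        raise ValueError(f"invalid literal with base {base}: {text!r}")
    return result


print(atol_sketch("0x_ff", 16))   # 255
print(atol_sketch("0b_1010", 2))  # 10
print(atol_sketch("1_000"))       # 1000
```

Calls such as `atol_sketch("_ff", 16)` and `atol_sketch("_123", 30)` raise `ValueError`, matching Python's `int()`.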
@martinvuyk, I couldn't get it to run. Cannot call `cast` on a Boolean.
Then it will probably work by just putting int(real_base < 17 and has_prefix)
Same issue @martinvuyk, but now for `Int`.
yeah but removing the `.cast` since it is a SIMD method, but never mind it's not important
!sync |
- Support leading underscores (bug fix) for prefixed literal
- Added relevant tests

Signed-off-by: Joshua James Venter <[email protected]>
!sync |
✅🟣 This contribution has been merged 🟣✅ Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours. We use Copybara to merge external contributions, click here to learn more. |
…th leading underscores (#43607)

[External] [stdlib] Fix atol parsing for prefixed integer literals with leading underscores

This addresses #3182 to correctly parse prefixed string literals. Changes include:
- Bug-fix
- Added appropriate test cases to verify the new behavior

When base 2, 8, or 16 is set or inferred (base 0), for a string prefixed appropriately, leading underscores should be allowed as per Python's integer literal grammar:
```
integer      ::= decinteger | bininteger | octinteger | hexinteger
decinteger   ::= nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger   ::= "0" ("b" | "B") (["_"] bindigit)+
octinteger   ::= "0" ("o" | "O") (["_"] octdigit)+
hexinteger   ::= "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::= "1"..."9"
digit        ::= "0"..."9"
bindigit     ::= "0" | "1"
octdigit     ::= "0"..."7"
hexdigit     ::= digit | "a"..."f" | "A"..."F"
```
Currently, such a string passed to `atol` raises. The changes in this PR fix this issue.

ORIGINAL_AUTHOR=Joshua James Venter <[email protected]>
PUBLIC_PR_LINK=#3180
Co-authored-by: Joshua James Venter <[email protected]>
Closes #3180
MODULAR_ORIG_COMMIT_REV_ID: 71bba679c2eeeea09f0bd4393611da50bc6570a2
Landed in 2aa4060! Thank you for your contribution 🎉 |
This addresses #3182 to correctly parse prefixed string literals. Changes include:
- Bug-fix
- Added appropriate test cases to verify the new behavior

When base 2, 8, or 16 is set or inferred (base 0), for a string prefixed appropriately, leading underscores should be allowed as per Python's integer literal grammar.
Currently, such a string passed to `atol` raises. The changes in this PR fix this issue.
I should note that I have another PR (#3207) to refactor and clean up `atol` for reusability, in order to introduce `stol` as per request #2639. If this PR is merged, I will update #3207, which will require a minor tweak.
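The prefixed productions of Python's integer literal grammar (`bininteger`, `octinteger`, `hexinteger`) translate fairly directly into regular expressions. A minimal Python sketch, where the pattern and function names are illustrative only:

```python
import re

# One pattern per prefixed production: a "0", the prefix letter, then one
# or more digits, each optionally preceded by a single underscore.
BININTEGER = re.compile(r"0[bB](?:_?[01])+")
OCTINTEGER = re.compile(r"0[oO](?:_?[0-7])+")
HEXINTEGER = re.compile(r"0[xX](?:_?[0-9a-fA-F])+")


def is_prefixed_literal(text: str) -> bool:
    """True if `text` matches bininteger, octinteger or hexinteger in full."""
    patterns = (BININTEGER, OCTINTEGER, HEXINTEGER)
    return any(p.fullmatch(text) for p in patterns)


for sample in ("0x_ff", "0b_1010", "0o_17", "0x__ff", "_0xff", "0xff_"):
    print(sample, is_prefixed_literal(sample))
```

The first three samples match and the last three do not: the grammar allows at most one underscore before each digit, and none before the prefix or after the last digit.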