Regex generator generates invalid unicode sequences (and is unmaintained) #1848

Open
timvahlbrock opened this issue Jan 6, 2025 · 4 comments

Comments

@timvahlbrock
Contributor

We're using regular expressions like .{0,24} to limit the length of string fields and rely on the generator to produce matching example values. However, we're having trouble uploading the contracts to the Pact Broker, because the CLI (Ruby) reports incomplete surrogate pairs (like \uDAA5) in the contract.

A month ago I reported this on the Pact Slack and in the Ruby repository as a suspected defect in the CLI. However, having had some time to take a closer look at how surrogates work, I believe the CLI's behavior is valid. In my opinion the problem is in fact in the Pact-JVM library, which generates invalid Unicode sequences, or more precisely in the Generex library. The latter seems to have been unmaintained for at least 5 years. Maybe it's time to switch to another generator, or to implement our own?
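
For context, an incomplete (lone) surrogate is a UTF-16 code unit in the range U+D800 to U+DFFF that is not part of a valid high/low pair, so the string cannot be encoded as well-formed UTF-8. A minimal Java sketch of how a generated example value could be checked before it ends up in a contract; the names `SurrogateCheck` and `hasLoneSurrogate` are hypothetical and not part of Pact-JVM or Generex:

```java
final class SurrogateCheck {

    // Returns true if s contains a surrogate code unit that is not part of
    // a valid high/low pair.
    static boolean hasLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed directly by a low surrogate.
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of the valid pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate without a preceding high surrogate.
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Lone high surrogate, as reported by the CLI.
        System.out.println(hasLoneSurrogate("\uDAA5"));
        // Valid surrogate pair encoding U+1F600.
        System.out.println(hasLoneSurrogate("\uD83D\uDE00"));
        // Non-ASCII text without any surrogates.
        System.out.println(hasLoneSurrogate("h\u00E9llo"));
    }
}
```

A check like this could run in a test over the generated example values to catch the problem before the upload step.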

@timvahlbrock
Contributor Author

Even generex doesn't seem to be directly responsible for this problem; it seems to originate in automaton: cs-au-dk/dk.brics.automaton#15. But that issue also seems to have existed for more than 5 years, and generex doesn't even use the currently published automaton version.

@rholshausen
Contributor

rholshausen commented Jan 12, 2025

Just a note on the use of . in the regexes. This will generate any byte value, not a character value. Better to use \w, which will generate character values (i.e. \w{0,24} or even [0-9a-zA-Z]{0,24} if the class shorthand does not work).

@timvahlbrock
Contributor Author

Yes, I tried that, but the problem is that \w wouldn't create any non-ASCII characters, which I want to generate.

@timvahlbrock
Contributor Author

I can try \w or other character classes as a workaround again, but I think I had issues with those too.
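
As a rough illustration of what an "own" generator for this case might look like, here is a sketch that produces up to 24 code points including non-ASCII characters while never emitting a surrogate. It is an assumption-laden stand-in, not how Pact-JVM, Generex, or automaton work: `SafeStringGenerator` and `generate` are hypothetical names, and the sketch only covers the `.{0,24}` shape rather than arbitrary regular expressions.

```java
import java.util.Random;

final class SafeStringGenerator {

    // Generates a string of 0..maxCodePoints code points drawn from the Basic
    // Multilingual Plane, skipping surrogates, unassigned code points and
    // ISO control characters. Hypothetical helper, not a Pact-JVM API.
    static String generate(Random random, int maxCodePoints) {
        int length = random.nextInt(maxCodePoints + 1);
        StringBuilder sb = new StringBuilder();
        int count = 0;
        while (count < length) {
            int cp = random.nextInt(0x10000); // candidate in U+0000..U+FFFF
            boolean surrogate =
                    cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (surrogate || !Character.isDefined(cp) || Character.isISOControl(cp)) {
                continue; // reject and draw again
            }
            sb.appendCodePoint(cp);
            count++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed for reproducible examples
        String value = generate(random, 24);
        System.out.println(value.codePointCount(0, value.length()) + " code points: " + value);
    }
}
```

Drawing code points instead of individual UTF-16 code units is the design choice that matters here: a surrogate can then only ever be appended as half of a valid pair, or, as in this sketch, not at all.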
