Ensure Java Compatibility Check for Regex Patterns #11912

Open: 2 commits to be merged into base: branch-25.02

@SurajAralihalli (Collaborator) commented Jan 3, 2025

Resolves #11651

Before passing the regex pattern to CudfRegexTranspiler and validating its compatibility with cuDF, we first need to ensure that the pattern is valid in Java by calling java.util.regex.Pattern.compile(pattern). If the pattern is not valid in Java, the same error should be raised for both CPU and GPU runs.
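As a rough illustration of the check described above (a sketch only, not the code in this PR; `JavaRegexCheck` and `validateJavaRegex` are hypothetical names), `Pattern.compile` throws `PatternSyntaxException` for a pattern that Java rejects, so the Java-level error can be captured before any cuDF-specific handling runs:

```scala
import java.util.regex.{Pattern, PatternSyntaxException}

object JavaRegexCheck {
  // Hypothetical helper: Right(pattern) when Java accepts the pattern,
  // Left(description) when Pattern.compile rejects it.
  def validateJavaRegex(pattern: String): Either[String, String] =
    try {
      Pattern.compile(pattern)
      Right(pattern)
    } catch {
      case e: PatternSyntaxException => Left(e.getDescription)
    }
}
```

For example, `JavaRegexCheck.validateJavaRegex("a(b")` yields a `Left` because the group is never closed, while `"a+b"` yields a `Right`.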

Signed-off-by: Suraj Aralihalli <[email protected]>
@SurajAralihalli (Collaborator, Author) commented:

build

Signed-off-by: Suraj Aralihalli <[email protected]>
@@ -51,6 +53,8 @@ class GpuRegExpReplaceMeta(
    expr.regexp match {
      case Literal(s: UTF8String, DataTypes.StringType) if s != null =>
        javaPattern = Some(s.toString())
        // Check that the pattern is valid in Java. Compile the raw string:
        // calling toString on the Option would yield "Some(...)" instead.
        Pattern.compile(s.toString())
Collaborator review comment:

I think we should move this into the RegexTranspiler code. It could be done in both getTranspiledAST(...) and RegexParser(...).

Otherwise this line will have to be called in every Regexp Expression class, and it could easily be lost in a few places. The transpiler is used by all of these methods, so this makes sense as a shortcut.
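A minimal sketch of that idea, assuming a single shared entry point: `SketchTranspiler`, `transpile`, and `toCudfPattern` are hypothetical stand-ins and not the actual CudfRegexTranspiler API. Because the Java check sits inside the shared entry point, every regexp expression that goes through it gets the check without repeating the call:

```scala
import java.util.regex.Pattern

// Hypothetical stand-in for the transpiler; names and signatures are assumptions.
object SketchTranspiler {
  // Single entry point used by every regexp expression.
  def transpile(pattern: String): String = {
    // Throws java.util.regex.PatternSyntaxException for patterns Java rejects,
    // which is the same exception type the CPU path would raise.
    Pattern.compile(pattern)
    // Placeholder for the real cuDF translation step.
    toCudfPattern(pattern)
  }

  // Identity placeholder; the real translation logic is out of scope here.
  private def toCudfPattern(pattern: String): String = pattern
}
```

With this shape, callers such as a replace or extract expression would only call `SketchTranspiler.transpile(...)` and could not accidentally skip the validation.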

@gerashegalov (Collaborator) commented:

The failing spark400 check should succeed now if you rerun it, per apache/spark@3ecb290.


Successfully merging this pull request may close these issues.

[BUG] Parse regular expressions using JDK to make error behavior more consistent between CPU and GPU
3 participants