Adjust the launch bounds to get_json_object to avoid spilling #2015

revans2 · 2024-05-03T16:22:38Z

We found in practice that the get_json_object kernel could be very slow due to register spilling. This asks CUDA to avoid spilling if possible.

Note that this has been upmerged to include changes equivalent to #1924

The comparison of the benchmarks are the following

['baseline.json', 'updated.json']
# get_json_object

## [0] NVIDIA RTX A6000

|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|----------------|---------|----------|
|   1000000    |      2      |   4.418 ms |       0.67% |   3.901 ms |       0.64% |    -516.929 us | -11.70% |   FAIL   |
|   10000000   |      2      |   4.923 ms |       0.40% |   4.369 ms |       1.21% |    -554.720 us | -11.27% |   FAIL   |
|  100000000   |      2      |  57.260 ms |       1.21% |  17.878 ms |       0.43% |  -39382.238 us | -68.78% |   FAIL   |
|  1000000000  |      2      | 513.896 ms |       0.31% | 151.902 ms |       0.22% | -361993.797 us | -70.44% |   FAIL   |
|   1000000    |      4      |   4.491 ms |       0.26% |   3.957 ms |       0.13% |    -534.262 us | -11.90% |   FAIL   |
|   10000000   |      4      |   5.006 ms |       3.18% |   4.419 ms |       1.82% |    -586.499 us | -11.72% |   FAIL   |
|  100000000   |      4      |  57.266 ms |       1.15% |  17.909 ms |       0.48% |  -39357.476 us | -68.73% |   FAIL   |
|  1000000000  |      4      | 513.419 ms |       0.50% | 150.293 ms |       0.13% | -363125.961 us | -70.73% |   FAIL   |
|   1000000    |      6      |   4.609 ms |       0.29% |   4.024 ms |       0.25% |    -585.725 us | -12.71% |   FAIL   |
|   10000000   |      6      |   5.097 ms |       0.50% |   4.455 ms |       0.44% |    -642.529 us | -12.61% |   FAIL   |
|  100000000   |      6      |  58.795 ms |       1.04% |  17.975 ms |       0.46% |  -40820.301 us | -69.43% |   FAIL   |
|  1000000000  |      6      | 518.776 ms |       0.47% | 151.316 ms |       0.21% | -367460.065 us | -70.83% |   FAIL   |
|   1000000    |      8      |   4.624 ms |       0.25% |   4.049 ms |       0.19% |    -575.162 us | -12.44% |   FAIL   |
|   10000000   |      8      |   5.134 ms |       0.30% |   4.499 ms |       2.72% |    -635.060 us | -12.37% |   FAIL   |
|  100000000   |      8      |  59.252 ms |       1.10% |  18.132 ms |       0.26% |  -41119.616 us | -69.40% |   FAIL   |
|  1000000000  |      8      | 526.177 ms |       0.54% | 152.719 ms |       0.21% | -373457.061 us | -70.98% |   FAIL   |

# Summary

- Total Matches: 16
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  16

In my other tests based on NVIDIA/spark-rapids#10729 I saw speedups against the current version of get_json_object of 1.6x to 3.5x

For the legacy version that didn't have any validation I saw improvements between 1.04x and and 1.8x.

I ran the integration tests and they all passed also, so I think this is good to go.

the launch bounds to remove spilling Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 · 2024-05-03T16:22:47Z

build

revans2 · 2024-05-03T16:23:17Z

@ttnghia could you take a look at this over #1924

revans2 · 2024-05-03T16:27:16Z

build

thirtiseven · 2024-05-06T03:57:27Z

Just merged #1924 with a new comment address commit 28387a9 , please upmerge this, thanks

res-life · 2024-05-06T10:05:30Z

src/main/cpp/src/json_parser.cuh

+    return char_range(_data + pos, len);
+  }
+
+  __device__ inline char_range slice_from(cudf::size_type pos) const


slice_from is useless.

res-life · 2024-05-06T10:33:14Z

src/main/cpp/src/json_parser.cuh

  // save current token start pos, used by coping current token text
-  char const* current_token_start_pos;
+  cudf::size_type current_token_start_pos;


Both the following variables can be removed.

cudf::size_type current_token_start_pos; cudf::size_type number_token_len;

If postpone the validatation into write_raw and copy_current_structure
Then we can remove these two variables.

Current logic is:
Validation a token will go to the end char of the token and save the start pos the the token. If need to copy the current token, have to go back and scan the token again.

Updated logic will be:
Change current token to expected token by only detect the first char of the next token. Postpone the validatation into write_raw and copy_current_structure.
e.g.:
{'k' : '123456789'}
When executing next_token when current pos is pointing to the string '123456789', just scanning 1st char ' decides the next expected token is String. Let curr pos point to the start of '123456789' and set current expected token as STRING. Then when executing write_raw and copy_current_structure, do the validation when copying.
After change, the get_json_object will only scan the whole JSON 1 time without go back.

I have a draft (about 30%) for this, but not sure the improvement gain.

revans2 · 2024-05-06T16:05:01Z

build

revans2 · 2024-05-06T16:08:01Z

I have upmerged the code relative to the changes in #1924 and addressed the review comments here.

I reran the benchmark comparison and got the following.

['baseline.json', 'updated.json']
# get_json_object

## [0] NVIDIA RTX A6000

|  size_bytes  |  max_depth  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |  Status  |
|--------------|-------------|------------|-------------|------------|-------------|----------------|---------|----------|
|   1000000    |      2      |   4.610 ms |       0.77% |   3.889 ms |       0.85% |    -721.561 us | -15.65% |   FAIL   |
|   10000000   |      2      |   5.030 ms |       0.20% |   4.364 ms |       4.69% |    -665.711 us | -13.23% |   FAIL   |
|  100000000   |      2      |  50.644 ms |       1.09% |  18.459 ms |       0.50% |  -32185.049 us | -63.55% |   FAIL   |
|  1000000000  |      2      | 453.540 ms |       0.68% | 151.467 ms |       0.24% | -302072.676 us | -66.60% |   FAIL   |
|   1000000    |      4      |   4.687 ms |       0.15% |   3.955 ms |       0.15% |    -732.169 us | -15.62% |   FAIL   |
|   10000000   |      4      |   5.139 ms |       0.48% |   4.412 ms |       2.38% |    -726.866 us | -14.14% |   FAIL   |
|  100000000   |      4      |  51.086 ms |       1.32% |  17.707 ms |       0.27% |  -33378.536 us | -65.34% |   FAIL   |
|  1000000000  |      4      | 453.716 ms |       0.30% | 150.117 ms |       0.18% | -303599.867 us | -66.91% |   FAIL   |
|   1000000    |      6      |   4.741 ms |       0.16% |   4.017 ms |       0.16% |    -724.545 us | -15.28% |   FAIL   |
|   10000000   |      6      |   5.168 ms |       0.22% |   4.456 ms |       9.16% |    -711.311 us | -13.76% |   FAIL   |
|  100000000   |      6      |  52.284 ms |       1.17% |  17.854 ms |       0.50% |  -34429.908 us | -65.85% |   FAIL   |
|  1000000000  |      6      | 459.636 ms |       0.55% | 151.526 ms |       0.19% | -308110.211 us | -67.03% |   FAIL   |
|   1000000    |      8      |   4.782 ms |       0.14% |   4.039 ms |       0.18% |    -743.326 us | -15.54% |   FAIL   |
|   10000000   |      8      |   5.239 ms |       1.00% |   4.492 ms |       0.17% |    -747.483 us | -14.27% |   FAIL   |
|  100000000   |      8      |  52.991 ms |       0.99% |  18.085 ms |       0.46% |  -34906.481 us | -65.87% |   FAIL   |
|  1000000000  |      8      | 464.851 ms |       0.49% | 152.728 ms |       0.20% | -312122.459 us | -67.14% |   FAIL   |

# Summary

- Total Matches: 16
  - Pass    (diff <= min_noise): 0
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  16

The numbers are really unchanged from the previous results, except that there are some small movement on the baseline numbers.

src/main/cpp/src/json_parser.cuh

revans2 · 2024-05-06T17:30:04Z

build

nvdbaranec

Almost entirely nits.

nvdbaranec · 2024-05-06T19:09:16Z

src/main/cpp/src/json_parser.cuh

+  __device__ inline bool eof(cudf::size_type pos) { return pos >= _len; }
+  __device__ inline bool is_null() { return _data == nullptr; }
+  __device__ inline bool is_empty() { return _len == 0; }


These 3 could be const.

nvdbaranec · 2024-05-06T19:09:45Z

src/main/cpp/src/json_parser.cuh

+  __device__ inline bool eof() { return _range.eof(_pos); }
+  __device__ inline bool is_null() { return _range.is_null(); }


const here and current_char() and pos()

nvdbaranec · 2024-05-06T19:10:24Z

src/main/cpp/src/json_parser.cuh

+  __device__ inline bool eof(cudf::size_type pos) { return pos >= chars.size(); }
+  __device__ inline bool eof() { return curr_pos >= chars.size(); }


nvdbaranec · 2024-05-06T19:18:45Z

src/main/cpp/src/json_parser.cuh

+      char c = str.current_char();
      int v  = static_cast<int>(c);


Looks like both of these could be const.

nvdbaranec · 2024-05-06T19:19:04Z

src/main/cpp/src/json_parser.cuh

+          if (copy_destination != nullptr) { *copy_destination++ = str.current_char(); }
+        } else {
+          // escape_style::ESCAPED
+          int escape_chars = escape_char(str.current_char(), copy_destination);


src/main/cpp/src/json_parser.cuh

src/main/cpp/src/get_json_object.cu

src/main/cpp/src/json_parser.cuh

src/main/cpp/src/get_json_object.cu

revans2 · 2024-05-06T21:54:43Z

build

revans2 · 2024-05-06T21:54:59Z

@nvdbaranec please take another look

nvdbaranec · 2024-05-07T16:43:41Z

src/main/cpp/src/get_json_object.cu

+// We have 1 for the minBlocksPerMultiprocessor in the launch bounds to avoid spilling from
+// the kernel itself. By default NVCC uses a heuristic to find a balance between the
+// maximum number of registers used by a kernel and the parallelism of the kernel.
+// The lots of registers are used the parallelism will suffer. But in our case


Suggested change

// The lots of registers are used the parallelism will suffer. But in our case

// If lots of registers are used the parallelism may suffer. But in our case

ttnghia · 2024-05-07T17:00:43Z

src/main/cpp/src/json_parser.cuh

- * Note: when setting `allow_single_quotes` or `allow_unescaped_control_chars`,
- * then JSON format is not conventional.
+ * Note: This is not conventional as it allows
+ * single quotes and  unescaped control characters


Suggested change

* single quotes and unescaped control characters

* single quotes and unescaped control characters

ttnghia · 2024-05-07T17:01:25Z

src/main/cpp/src/json_parser.cuh

+  {
+  }
+
+  __device__ inline char_range(const cudf::string_view& input)


Suggested change

__device__ inline char_range(const cudf::string_view& input)

__device__ inline char_range(cudf::string_view const& input)

ttnghia · 2024-05-07T17:01:40Z

src/main/cpp/src/json_parser.cuh

+  // a member variable with a static method like this.
+  __device__ inline static char_range null() { return char_range(nullptr, 0); }
+
+  __device__ inline char_range(const char_range&)            = default;


Suggested change

__device__ inline char_range(const char_range&) = default;

__device__ inline char_range(char_range const&) = default;

ttnghia · 2024-05-07T17:01:58Z

src/main/cpp/src/json_parser.cuh

+
+  __device__ inline char_range(const char_range&)            = default;
+  __device__ inline char_range(char_range&&)                 = default;
+  __device__ inline char_range& operator=(const char_range&) = default;


Suggested change

__device__ inline char_range& operator=(const char_range&) = default;

__device__ inline char_range& operator=(char_range const&) = default;

ttnghia · 2024-05-07T17:03:15Z

src/main/cpp/src/json_parser.cuh

+      const char c = str.current_char();
+      const int v  = static_cast<int>(c);


Suggested change

const char c = str.current_char();

const int v = static_cast<int>(c);

char const c = str.current_char();

int const v = static_cast<int>(c);

ttnghia · 2024-05-07T17:04:27Z

src/main/cpp/src/json_parser.cuh

+          if (copy_destination != nullptr) { *copy_destination++ = str.current_char(); }
+        } else {
+          // escape_style::ESCAPED
+          const int escape_chars = escape_char(str.current_char(), copy_destination);


Suggested change

const int escape_chars = escape_char(str.current_char(), copy_destination);

int const escape_chars = escape_char(str.current_char(), copy_destination);

ttnghia · 2024-05-07T17:04:46Z

src/main/cpp/src/json_parser.cuh

-      char c = *str_pos;
-      switch (*str_pos) {
+    if (!str.eof()) {
+      const char c = str.current_char();


Suggested change

const char c = str.current_char();

char const c = str.current_char();

ttnghia · 2024-05-07T17:05:09Z

src/main/cpp/src/json_parser.cuh

+    cudf::char_utf8 code_point = 0;
+    for (size_t i = 0; i < 4; i++) {
+      if (str.eof()) { return false; }
+      const char c = str.current_char();


Suggested change

const char c = str.current_char();

char const c = str.current_char();

ttnghia · 2024-05-07T17:05:28Z

src/main/cpp/src/json_parser.cuh

+    // In UTF-8, the maximum number of bytes used to encode a single character
+    // is 4
+    char buff[4];
+    const cudf::size_type bytes = from_char_utf8(utf_char, buff);


Suggested change

const cudf::size_type bytes = from_char_utf8(utf_char, buff);

cudf::size_type const bytes = from_char_utf8(utf_char, buff);

ttnghia · 2024-05-07T17:06:20Z

src/main/cpp/src/json_parser.cuh

-    if (!eof(curr_pos)) {
-      char c = *curr_pos;
+    if (!eof()) {
+      const char c = chars[curr_pos];


Suggested change

const char c = chars[curr_pos];

char const c = chars[curr_pos];

ttnghia · 2024-05-07T17:06:33Z

src/main/cpp/src/json_parser.cuh

-        if (!eof(curr_pos)) {
-          char next_char_after_zero = *curr_pos;
+        if (!eof()) {
+          const char next_char_after_zero = chars[curr_pos];


Suggested change

const char next_char_after_zero = chars[curr_pos];

char const next_char_after_zero = chars[curr_pos];

ttnghia · 2024-05-07T17:06:47Z

src/main/cpp/src/json_parser.cuh

-      char c = *curr_pos;
+    skip_whitespaces();
+    if (!eof()) {
+      const char c = chars[curr_pos];


Suggested change

const char c = chars[curr_pos];

char const c = chars[curr_pos];

ttnghia

Sorry for the additional style comments. cudf enforces a very annoying east-const style that we need to follow.

revans2 · 2024-05-07T17:49:27Z

@ttnghia and @nvdbaranec please take another look. I think all of the comments are now addressed.

revans2 · 2024-05-07T17:49:39Z

build

gerashegalov · 2024-05-07T19:08:59Z

Sorry for the additional style comments. cudf enforces a very annoying east-const style that we need to follow.

I don't see this being enforced in thirdparty/cudf/.clang-format since it does not have QualifierAlignment: Right

if I add this rule the clang-format pre-commit finds and auto-corrects many violations in spark-rapids-jni and cudf

revans2 added 4 commits May 2, 2024 16:09

Clean up some of the get_json_object code and adjust

9dd2767

the launch bounds to remove spilling Signed-off-by: Robert (Bobby) Evans <[email protected]>

Some cleanup and comments addressed

369e641

More comments and cleanup

1a697f8

Removed some checks that were not tested/needed and had a bug in it

c96ac4e

Fixed formatting

e040f8a

sameerz added the performance label May 4, 2024

res-life reviewed May 6, 2024

View reviewed changes

revans2 mentioned this pull request May 6, 2024

Get json object comments address #1924

Merged

revans2 added 3 commits May 6, 2024 10:30

Make changes similar to related PR

69b4c05

Merge branch 'branch-24.06' into get_json_obj_adjust_launch_bounds

2150bf8

Remove unused method

1c149b1

revans2 requested a review from ttnghia May 6, 2024 16:05

ttnghia reviewed May 6, 2024

View reviewed changes

src/main/cpp/src/json_parser.cuh Show resolved Hide resolved

revans2 mentioned this pull request May 6, 2024

[WIP] get-json-object perf improvement: handle 32 rows per thread #2004

Closed

Comments

c1286d2

nvdbaranec reviewed May 6, 2024

View reviewed changes

nvdbaranec requested changes May 6, 2024

View reviewed changes

src/main/cpp/src/get_json_object.cu Show resolved Hide resolved

revans2 added 2 commits May 6, 2024 16:52

Review comments

2168f27

Style

314ae37

ttnghia previously approved these changes May 7, 2024

View reviewed changes

nvdbaranec previously approved these changes May 7, 2024

View reviewed changes

ttnghia reviewed May 7, 2024

View reviewed changes

Review Comments

2183f69

revans2 dismissed stale reviews from nvdbaranec and ttnghia via 2183f69 May 7, 2024 17:48

ttnghia approved these changes May 7, 2024

View reviewed changes

revans2 merged commit 980633d into NVIDIA:branch-24.06 May 7, 2024
3 checks passed

revans2 deleted the get_json_obj_adjust_launch_bounds branch May 7, 2024 22:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjust the launch bounds to get_json_object to avoid spilling #2015

Adjust the launch bounds to get_json_object to avoid spilling #2015

revans2 commented May 3, 2024

revans2 commented May 3, 2024

revans2 commented May 3, 2024

revans2 commented May 3, 2024

thirtiseven commented May 6, 2024

res-life May 6, 2024

res-life May 6, 2024 •

edited

Loading

revans2 commented May 6, 2024

revans2 commented May 6, 2024

revans2 commented May 6, 2024

nvdbaranec left a comment

nvdbaranec May 6, 2024

nvdbaranec May 6, 2024

nvdbaranec May 6, 2024

nvdbaranec May 6, 2024

nvdbaranec May 6, 2024

revans2 commented May 6, 2024

revans2 commented May 6, 2024

nvdbaranec May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia May 7, 2024

ttnghia left a comment

revans2 commented May 7, 2024

revans2 commented May 7, 2024

gerashegalov commented May 7, 2024

		__device__ inline bool eof() { return _range.eof(_pos); }
		__device__ inline bool is_null() { return _range.is_null(); }

		__device__ inline bool eof(cudf::size_type pos) { return pos >= chars.size(); }
		__device__ inline bool eof() { return curr_pos >= chars.size(); }

	// The lots of registers are used the parallelism will suffer. But in our case
	// If lots of registers are used the parallelism may suffer. But in our case

	* single quotes and unescaped control characters
	* single quotes and unescaped control characters

	__device__ inline char_range(const cudf::string_view& input)
	__device__ inline char_range(cudf::string_view const& input)

	__device__ inline char_range(const char_range&) = default;
	__device__ inline char_range(char_range const&) = default;

	__device__ inline char_range& operator=(const char_range&) = default;
	__device__ inline char_range& operator=(char_range const&) = default;

		const char c = str.current_char();
		const int v = static_cast<int>(c);

	const int escape_chars = escape_char(str.current_char(), copy_destination);
	int const escape_chars = escape_char(str.current_char(), copy_destination);

	const char c = str.current_char();
	char const c = str.current_char();

	const cudf::size_type bytes = from_char_utf8(utf_char, buff);
	cudf::size_type const bytes = from_char_utf8(utf_char, buff);

	const char c = chars[curr_pos];
	char const c = chars[curr_pos];

	const char next_char_after_zero = chars[curr_pos];
	char const next_char_after_zero = chars[curr_pos];

Adjust the launch bounds to get_json_object to avoid spilling #2015

Adjust the launch bounds to get_json_object to avoid spilling #2015

Conversation

revans2 commented May 3, 2024

revans2 commented May 3, 2024

revans2 commented May 3, 2024

revans2 commented May 3, 2024

thirtiseven commented May 6, 2024

Choose a reason for hiding this comment

res-life May 6, 2024 • edited Loading

Choose a reason for hiding this comment

revans2 commented May 6, 2024

revans2 commented May 6, 2024

revans2 commented May 6, 2024

nvdbaranec left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

revans2 commented May 6, 2024

revans2 commented May 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia left a comment

Choose a reason for hiding this comment

revans2 commented May 7, 2024

revans2 commented May 7, 2024

gerashegalov commented May 7, 2024

res-life May 6, 2024 •

edited

Loading