Cleanup CPU predict function. #11139

trivialfis · 2025-01-02T14:02:20Z

- Remove predict instance. It's dead code, as we can't use it outside of XGBoost even with the C++ headers included.
- Remove unroll. It brings no performance benefit.
- Optimize dense QDM (QuantileDMatrix) inference.
- Optimize data loading by copying data directly to the feature vector instead of going through a workspace.
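
The data-loading change can be pictured with a small sketch. It is illustrative only and is not the code from this PR: the function names, the `cut_values` layout, and the assumption that a dense quantized row stores exactly one bin index per feature are all hypothetical. Python is used for brevity, so the sketch shows the two data paths rather than the speedup itself.

```python
from typing import List, Sequence


def fill_via_workspace(
    bins: Sequence[int], cut_values: Sequence[Sequence[float]]
) -> List[float]:
    """Two-step path: build (feature, value) entries first, then scatter them."""
    assert len(bins) == len(cut_values), "dense row: one bin per feature"
    # Step 1: materialise the row in a scratch buffer (the "workspace").
    workspace = [(fidx, cut_values[fidx][b]) for fidx, b in enumerate(bins)]
    # Step 2: copy every entry from the workspace into the feature vector.
    fvec = [float("nan")] * len(bins)
    for fidx, value in workspace:
        fvec[fidx] = value
    return fvec


def fill_direct(
    bins: Sequence[int], cut_values: Sequence[Sequence[float]]
) -> List[float]:
    """One-step path: write each recovered value straight into the feature vector."""
    assert len(bins) == len(cut_values), "dense row: one bin per feature"
    fvec = [float("nan")] * len(bins)
    for fidx, b in enumerate(bins):
        fvec[fidx] = cut_values[fidx][b]
    return fvec
```

Both functions produce the same feature vector. The direct version simply skips the intermediate buffer, which is the kind of saving the last item describes.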

Partially address #10793

The optimization mostly focuses on dense data, and the result varies between CPUs (timings in seconds, lower is better):

| Xeon(R) Gold 6128 |            DMatrix |    QuantileDMatrix |
|-------------------+--------------------+--------------------|
| Master            | 27.980122327804565 | 55.665775775909424 |
| PR                |  23.63674759864807 | 30.158272981643677 |

| Ryzen 9 7900X3D |            DMatrix |    QuantileDMatrix |
|-----------------+--------------------+--------------------|
| Master          | 24.764960527420044 | 31.460495710372925 |
| PR              | 22.532921314239502 | 21.412014961242676 |
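
To make the tables easier to compare, the relative speedup (Master time divided by PR time) can be computed as in the short sketch below. The numbers are copied from the tables above; treating them as seconds is an assumption based on the later comments, and the helper names are made up for this example.

```python
from typing import Dict, Tuple

# Timings copied from the tables above (assumed to be seconds).
TIMINGS: Dict[str, Dict[str, Dict[str, float]]] = {
    "Xeon(R) Gold 6128": {
        "DMatrix": {"master": 27.980122327804565, "pr": 23.63674759864807},
        "QuantileDMatrix": {"master": 55.665775775909424, "pr": 30.158272981643677},
    },
    "Ryzen 9 7900X3D": {
        "DMatrix": {"master": 24.764960527420044, "pr": 22.532921314239502},
        "QuantileDMatrix": {"master": 31.460495710372925, "pr": 21.412014961242676},
    },
}


def speedup(master: float, pr: float) -> float:
    """Return how many times faster the PR is than Master."""
    if pr <= 0:
        raise ValueError("PR timing must be positive")
    return master / pr


def summarize(
    timings: Dict[str, Dict[str, Dict[str, float]]]
) -> Dict[Tuple[str, str], float]:
    """Map (CPU, matrix type) to the Master/PR speedup."""
    return {
        (cpu, kind): speedup(t["master"], t["pr"])
        for cpu, kinds in timings.items()
        for kind, t in kinds.items()
    }


if __name__ == "__main__":
    for (cpu, kind), ratio in summarize(TIMINGS).items():
        print(f"{cpu:20s} {kind:16s} {ratio:.2f}x")
```

By this measure the QuantileDMatrix path gains the most on both machines, with roughly a 1.8x speedup on the Xeon and about 1.5x on the Ryzen, while the DMatrix path improves by less than 1.2x on either CPU.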

trivialfis · 2025-01-06T13:50:37Z

@razdoburdin Could you please help take a look at the optimization?

I'm not an expert in CPU optimization. The changes in the predictor affect the Xeon much more significantly than the Ryzen. If I remove the dense optimization, it adds about 3 seconds on the Ryzen, but 20 seconds on the Xeon.

Looking at some profiling results on the Ryzen, the bottleneck seems to be in data loading (movss/movl). Would love to get some opinions.

razdoburdin · 2025-01-09T10:26:37Z

> @razdoburdin Could you please help take a look at the optimization?
>
> I'm not an expert in CPU optimization. The changes in the predictor affect the Xeon much more significantly than the Ryzen. If I remove the dense optimization, it adds about 3 seconds on the Ryzen, but 20 seconds on the Xeon.
>
> Looking at some profiling results on the Ryzen, the bottleneck seems to be in data loading (movss/movl). Would love to get some opinions.

It is hard to give an exact answer without a deep investigation of the changes. My hypotheses are:

- The Xeon benefits more from vectorization due to its AVX512 support.
- The Xeon has a much smaller L3 cache, which makes memory access optimizations more critical.
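
Both hypotheses depend on hardware properties that are easy to check on a Linux machine. The sketch below is one way to do that, under stated assumptions: it relies on the Linux-specific `/proc/cpuinfo` and sysfs cache directories, and the helper names are made up for this example. It only reports what the CPU offers, not whether a given XGBoost build actually emits AVX512 instructions.

```python
from pathlib import Path
from typing import Optional


def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears in the first `flags` line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return flag in value.split()
    return False


def read_l3_size(
    cache_dir: Path = Path("/sys/devices/system/cpu/cpu0/cache"),
) -> Optional[str]:
    """Return the L3 size string reported by sysfs (e.g. '32768K'), or None."""
    for index_dir in sorted(cache_dir.glob("index*")):
        try:
            level = (index_dir / "level").read_text().strip()
            if level == "3":
                return (index_dir / "size").read_text().strip()
        except OSError:
            # Entry missing or unreadable: try the next cache index.
            continue
    return None


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print("avx512f supported:", has_cpu_flag(text, "avx512f"))
    print("L3 cache size:", read_l3_size())
```

Running this on both benchmark machines would show whether they differ in AVX512 support and L3 size, which is a starting point for testing the hypotheses rather than a confirmation of either.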

trivialfis · 2025-01-09T10:46:05Z

@razdoburdin Thank you for sharing. Could you please help review the changes in the CPU predictor when you are available?

> It is hard to give the exact answer without deep investigation of the changes

Currently, evaluation might be even more expensive than training for some datasets. It would be great if we could get some help with that.

Cleanup CPU predict function.

d40949e

trivialfis force-pushed the cleanup-predict branch from 7a9c90f to d40949e Compare January 2, 2025 14:07

trivialfis added 6 commits January 3, 2025 02:50

Optimize QDM inference.

e86e93d

Fixes, lint.

e51877b

cat.

06993ef

Direct fill.

b2caadb

Fixes.

d610455

lint.

25d1d7c

trivialfis mentioned this pull request Jan 6, 2025

Auto encoding for categorical data during inference. #11088

Open

6 tasks

Merge branch 'master' into cleanup-predict

d32fdb8
