-
Notifications
You must be signed in to change notification settings - Fork 702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Followup items from #758 Async IO threads #761
Comments
|
Edit: Let's not do it. See discussion below. |
Hi all, I'll start working on the "Offload PING command". |
I've been working on the offload of the PING command and I came up with two approaches that I would like your opinion on. Approach A) The IO thread that is choose to handle the read of the PING command, read, parses, and executes the command, and then also writes the reply, before processing a new IO job, Approach B) After the PING command has been read and parsed, the main thread schedules the execution of the PING command (enqueues a new job in the IO queue). The IO thread that picks this "execution" job, executes the job and enqueues the write job in the IO queue. Approach A is more simple, but it might block the IO thread if the socket is not available to write, while approach B is more complex but more asynchronous. I'm in favor of approach B, even though it will require a generalization of the IO threads component to support a different kind (non-IO job), and also support the concurrent insertion of jobs in the shared queue. @zuiderkwast @madolson thoughts? |
@uriyage, you know IO threading better than anyone. Can you help? @rjd15372 Approach B sounds like it will not be worth the overhead. Executing PING is very fast, maybe even faster executing in the main thread than delegating it to some thread to execute it. I like approach A, which should add zero work to the main thread if the socket is writable, which is probably the normal case. If it's not writable, we need to register a write handler in the event loop and add the client to the global list When the main thread delegates a read-and-parse-command job to an IO thread, it does among other things.. c->io_read_state = CLIENT_PENDING_IO;
connSetPostponeUpdateState(c->conn, 1);
IOJobQueue_push(jq, ioThreadReadQueryFromClient, c);
c->flag.pending_read = 1;
listLinkNodeTail(server.clients_pending_io_read, &c->pending_read_list_node); Later, the main thread checks if the clients are ready by calling Now, I'm just thinking, is it possible that the main thread has delegated read-and-parse for a client to one IO thread and also delegated write to that same client to another IO thread? That may be a problem... Have we've already prevented this due to TLS? |
@zuiderkwast @rjd15372
If the socket is not writeable, we will have to store the command output (or partial output if we succeed in writing part of it) in the client's reply buffer to be sent later. Overall, I believe the performance gain from both approaches is marginal and may not justify the added complexity. Perhaps we should consider this approach only for more time-consuming commands that can be delegated to the I/O threads. Regarding your question, we already ensure that the same thread that reads is the one that writes to the client. |
@uriyage @zuiderkwast My idea is to make it generic, and add a new command flag that states if the command can be offloaded. |
Thanks Uri! Good points. I guess offloading PING may be not worth it then. I don't see the point of doing it for PING just as a preparation for other commands. It's better to do it for another command where it actually matters. I don't know any command that doesn't need to access any global data though. Do you? Perhaps it's better to focus on another of the follow-up items? |
|
|
Approach B is what we want, PING command, even including hello command, they are health status commands of the server, they should be executed by separated thread without any data command blocking. |
I don't want to offload the ping command. I feel like part of the job of the ping command is to determine the health of the server, so it could be offloaded even if the server is in some type of dead loop. |
|
|
## TLS Negotiation Offloading to I/O Threads ### Overview This PR introduces the ability to offload TLS handshake negotiations to I/O threads, significantly improving performance under high TLS connection loads. ### Key Changes - Added infrastructure to offload TLS negotiations to I/O threads - Refactored SSL event handling to allow I/O threads modify conn flags. - Introduced new connection flag to identify client connections ### Performance Impact Testing with 650 clients with SET commands and 160 new TLS connections per second in the background: #### Throughput Impact of new TLS connections - **With Offloading**: Minimal impact (1050K → 990K ops/sec) - **Without Offloading**: Significant drop (1050K → 670K ops/sec) #### New Connection Rate - **With Offloading**: - 1,757 conn/sec - **Without Offloading**: - 477 conn/sec ### Implementation Details 1. **Main Thread**: - Initiates negotiation-offload jobs to I/O threads - Adds connections to pending-read clients list (using existing read offload mechanism) - Post-negotiation handling: - Creates read/write events if needed for incomplete negotiations - Calls accept handler for completed negotiations 2. **I/O Thread**: - Performs TLS negotiation - Updates connection flags based on negotiation result Related issue:#761 --------- Signed-off-by: Uri Yagelnik <[email protected]> Signed-off-by: ranshid <[email protected]> Co-authored-by: ranshid <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
Support Primary client IO offload. Related issue: #761 --------- Signed-off-by: Uri Yagelnik <[email protected]>
# Refactor client structure to use modular data components ## Current State The client structure allocates memory for replication / pubsub / multi-keys / module / blocked data for every client, despite these features being used by only a small subset of clients. In addition the current field layout in the client struct is suboptimal, with poor alignment and unnecessary padding between fields, leading to a larger than necessary memory footprint of 896 bytes per client. Furthermore, fields that are frequently accessed together during operations are scattered throughout the struct, resulting in poor cache locality. ## This PR's Change 1. Lazy Initialization - **Components are only allocated when first used:** - PubSubData: Created on first SUBSCRIBE/PUBLISH operation - ReplicationData: Initialized only for replica connections - ModuleData: Allocated when module interaction begins - BlockingState: Created when first blocking command is issued - MultiState: Initialized on MULTI command 2. Memory Layout Optimization: - Grouped related fields for better locality - Moved rarely accessed fields (e.g., client->name) to struct end - Optimized field alignment to eliminate padding 3. Additional changes: - Moved watched_keys to be static allocated in the `mstate` struct - Relocated replication init logic to replication.c ### Key Benefits - **Efficient Memory Usage:** - 45% smaller base client structure - Basic clients now use 528 bytes (down from 896). - Better memory locality for related operations - Performance improvement in high throughput scenarios. No performance regressions in other cases. ### Performance Impact Tested with 650 clients and 512 bytes values. #### Single Thread Performance | Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % | |------------|---------|---------------|---------------|-----------| | SET | 1 key | 261,799 | 258,261 | +1.37% | | SET | 3M keys | 209,134 | ~209,000 | ~0% | | GET | 1 key | 281,564 | 277,965 | +1.29% | | GET | 3M keys | 231,158 | 228,410 | +1.20% | #### 8 IO Threads Performance | Operation | Dataset | New (ops/sec) | Old (ops/sec) | Change % | |------------|---------|---------------|---------------|-----------| | SET | 1 key | 1,331,578 | 1,331,626 | -0.00% | | SET | 3M keys | 1,254,441 | 1,152,645 | +8.83% | | GET | 1 key | 1,293,149 | 1,289,503 | +0.28% | | GET | 3M keys | 1,152,898 | 1,101,791 | +4.64% | #### Pipeline Performance (3M keys) | Operation | Pipeline Size | New (ops/sec) | Old (ops/sec) | Change % | |-----------|--------------|---------------|---------------|-----------| | SET | 10 | 548,964 | 538,498 | +1.94% | | SET | 20 | 606,148 | 594,872 | +1.89% | | SET | 30 | 631,122 | 616,606 | +2.35% | | GET | 10 | 628,482 | 624,166 | +0.69% | | GET | 20 | 687,371 | 681,659 | +0.84% | | GET | 30 | 725,855 | 721,102 | +0.66% | ### Observations: 1. Single-threaded operations show consistent improvements (1-1.4%) 2. Multi-threaded performance shows significant gains for large datasets: - SET with 3M keys: +8.83% improvement - GET with 3M keys: +4.64% improvement 3. Pipeline operations show consistent improvements: - SET operations: +1.89% to +2.35% - GET operations: +0.66% to +0.84% 4. No performance regressions observed in any test scenario Related issue:#761 --------- Signed-off-by: Uri Yagelnik <[email protected]> Signed-off-by: uriyage <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>
List of pending items from #758, all of these are future enhancements, and not pending issues that need to be merged before Valkey 8 (but they could be).
The text was updated successfully, but these errors were encountered: