Spill framework refactor for better performance and extensibility [databricks] #11747

Merged
merged 37 commits into from
Dec 13, 2024

abellina
Collaborator

This is a very large PR that I'd like some 👀 on. Marked it as draft as I still have some TODOs around more tests. The PR is NOT going to go to 24.12, it's just that we don't have a 25.02 available.

The main file I think one should focus on is SpillFramework.scala (yep, one file; let me know if you want me to break it into multiple files). SpillFramework.scala has a comment describing how things should work, please take a look at that.

The main contribution here is a simplification of the framework where we remove the idea of a RapidsBuffer that has to be acquired and unacquired, for the idea of a handle that just knows how to materialize. There isn't a concept of acquisition in the new framework.
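
To make the contrast with acquire/unacquire concrete, here is a minimal sketch of what a handle that "just knows how to materialize" could look like. The names `SketchHandle` and `InMemoryHandle` are hypothetical and are not taken from SpillFramework.scala; byte arrays stand in for device and host buffers.

```scala
// Hypothetical sketch: this is not the API defined in SpillFramework.scala.
trait SketchHandle[T] extends AutoCloseable {
  /** Returns a usable copy of the data, wherever it currently lives. */
  def materialize(): T
  /** Moves the data to a lower tier; returns the bytes moved (0 if nothing was done). */
  def spill(): Long
}

final class InMemoryHandle(data: Array[Byte]) extends SketchHandle[Array[Byte]] {
  private var fast: Option[Array[Byte]] = Some(data) // stands in for device memory
  private var slow: Option[Array[Byte]] = None       // stands in for host or disk

  // No acquire/unacquire: the caller simply asks for the data.
  override def materialize(): Array[Byte] = synchronized {
    fast.orElse(slow).map(_.clone())
      .getOrElse(throw new IllegalStateException("handle is closed"))
  }

  // The handle, not the store, decides how it spills.
  override def spill(): Long = synchronized {
    fast match {
      case Some(bytes) =>
        slow = Some(bytes)
        fast = None
        bytes.length.toLong
      case None => 0L
    }
  }

  def isSpilled: Boolean = synchronized { fast.isEmpty && slow.isDefined }

  override def close(): Unit = synchronized { fast = None; slow = None }
}
```

In this sketch a caller never pins the handle; it calls `materialize()` when it needs the data and `close()` when it is done with the handle.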

There is a SpillableColumnarBatch api and a lazy-spillable api for Join that I did not touch and left there on purpose, but we can start to remove that API and create spillable handles that replicate the lazy behavior we wanted in lazy spillable, or the recomputing behavior we want for broadcasts. This is the second contribution of the PR: handles decide how to spill, not the framework.

There is one easily fixable shortcoming today in the multiple-spiller case that I will fix in a follow-on PR. While we are spilling a handle, the handle holds a lock. The same lock is used to figure out if the handle is spillable. A second thread that is trying to spill may need to wait for this lock (and the spill) to finish to figure out whether it needs to spill that handle. We can make this more straightforward by handling the spill state separately from the materialization/data state, but I'd like to submit that work as an improvement.
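
One possible shape for separating spill state from data state is sketched below. This is an assumption for illustration only, not the design of the follow-on work; `TwoStateHandle` and `trySpill` are hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch: spill state is tracked lock-free, data state keeps its own lock.
final class TwoStateHandle(sizeInBytes: Long) {
  private val spillInProgress = new AtomicBoolean(false) // spill state
  private var onDevice = true                            // data state, guarded by `this`

  /** A second spiller can skip this handle without waiting on the data lock. */
  def trySpill(): Long = {
    if (!spillInProgress.compareAndSet(false, true)) {
      0L // another thread is already spilling this handle
    } else {
      try {
        synchronized {
          if (onDevice) { onDevice = false; sizeInBytes } else 0L
        }
      } finally {
        spillInProgress.set(false)
      }
    }
  }
}
```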

I have run this against NDS @ 3TB in our perf cluster and I don't see regressions, and have run it against spill prone cases and I am able to see multiple threads in the "spill path", and no deadlocks. I'll post more results when I can run them.

@@ -44,22 +44,20 @@ class RapidsSerializerManager (conf: SparkConf) {

private lazy val compressionCodec: CompressionCodec = TrampolineUtil.createCodec(conf)
Collaborator Author

I'd like to remove this class, or make it much simpler.

Collaborator

@revans2 revans2 left a comment

didn't finish, but things keep changing under me so I thought I would publish what I have so far

true
}

shouldRetry
Collaborator

nit: why is this shouldRetry change in there? I assume it was for debugging at some point.

Collaborator Author

undoing

@@ -1110,7 +1110,7 @@ class CudfSpillableHostConcatResult(
val hmb: HostMemoryBuffer) extends SpillableHostConcatResult {

override def toBatch: ColumnarBatch = {
- closeOnExcept(buffer.getHostBuffer()) { hostBuf =>
+ closeOnExcept(buffer.getHostBuffer) { hostBuf =>
Collaborator

nit: why change this?

Collaborator Author

will fix

@@ -307,7 +307,6 @@ class LazySpillableColumnarBatchImpl(
spill = Some(SpillableColumnarBatch(cached.get,
SpillPriorities.ACTIVE_ON_DECK_PRIORITY))
} finally {
// Putting data in a SpillableColumnarBatch takes ownership of it.
Collaborator

Is this no longer true?

Collaborator Author

I was debugging something here, and I must have forgotten to undo this change. coming up

Collaborator Author

but to answer your question, the spill framework takes ownership, always

@abellina
Collaborator Author

Sorry for the rapid movement @revans2. I'll pause for a bit, and come back to address the ChunkedPacker comments and other comments I get.

Collaborator

@zpuller zpuller left a comment

likewise leaving partial review

@abellina abellina changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 19:59
Collaborator

@revans2 revans2 left a comment

I still have a lot more to go through, but I thought I would at least get some of my comments in.

* Set a new spill priority.
*/
override def setSpillPriority(priority: Long): Unit = {
// TODO: handle.setSpillPriority(priority)
Collaborator

Or remove this entirely.

Collaborator Author

Other code in the plugin calls this. I kept it for now, but can clean up if you really want me to.

@@ -245,18 +338,29 @@ object SpillableColumnarBatch {
*/
def apply(batch: ColumnarBatch,
priority: Long): SpillableColumnarBatch = {
Cuda.DEFAULT_STREAM.sync()
Collaborator

why?

Collaborator Author

I'll add comments in the code around why we have the sync.

The reason is that once you hand an object off to the framework, we could turn around and spill it immediately. We need to make sure the object is immutable when it hits the store.

We can move the sync back to the factory methods in the spill framework if that is desired?

RapidsBufferCatalog.addBatch(batch, initialSpillPriority)
}
}
Cuda.DEFAULT_STREAM.sync()
Collaborator

again why?

catalog: RapidsBufferCatalog): RapidsBufferHandle = {
withResource(batch) { batch =>
catalog.addBatch(batch, initialSpillPriority)
val handle = SpillableHostColumnarBatchHandle(batch)
Collaborator

Why no stream sync if it is needed for the other APIs?

Collaborator Author

this is a host batch, so we didn't need a device sync on this :)

val handle = withResource(buffer) { _ =>
RapidsBufferCatalog.addBuffer(buffer, meta, priority)
}
Cuda.DEFAULT_STREAM.sync()
Collaborator

why?

Collaborator

@revans2 revans2 left a comment

Still have a lot more to look at, but I am making progress

buffOffset += blockRange.rangeSize()
}
needsCleanup = false
} catch {
- case ioe: IOException =>
+ case ex: Throwable =>
Collaborator

Why do we need to catch Errors?

Collaborator Author

Let me fix to catch Exception

Collaborator

@revans2 revans2 left a comment

Still not done, but I need to switch to some other things so I am going to comment on what I have looked at so far.

*
* CUDA/Host synchronization:
*
* We assume all device backed handles are completely materialized on the device (before adding
Collaborator

Why do we need this? Please add that to the comments so that it is clear what is happening.

Collaborator Author

I added info in this comment. We could, at least for the GPU handles, add an event per handle and record on that event instead of synchronizing. I think this means I should move the synchronization into the framework. Right now (as you spotted), I let callers synchronize before they call the factory methods.

* with extra locking is the `SpillableHostStore`, to maintain a `totalSize` number that is
* used to figure out cheaply when it is full.
*
* Handles hold a lock to protect the user against when it is either in the middle of
Collaborator

This is a bit confusing. So essentially the lock prevents race conditions when the object is in the middle of spilling or being closed?

Collaborator Author

Yeah, this is a confusing comment. It seems important to someone designing a new spillable handle, not to an end user. I have reworded it.

* - If sizeInBytes is 0, the object is tracked by the stores so it can be
* removed on shutdown, or by handle.close, but 0-byte handles are not spillable.
*/
val sizeInBytes: Long
Collaborator

If this is approximate we should name it such. So there is no confusion about how that can be used.

Collaborator

Also this is a val, but if something spills wouldn't it change to 0 from whatever it was before? Shouldn't it be a def?

Collaborator Author

So it is, and it isn't. That's the issue with sizeInBytes. I am going to rework that and make it clearer. I think something like approxSizeInBytes makes sense, or trackedSizeInBytes, and it should be private[spill] going with the other things that are marked that way. But some handles do have a size that is not approximate, and IS used for creating buffers, so I kind of want that size reported with a different API, specific to each handle.

It is a val because I didn't see a point in changing it. If the object is spilled that is determined by dev, or host being empty.

Collaborator Author

It is also a val, because I wanted to set spillable at construction for 0-byte handles. Any other handle is spillable by definition at the level of the trait SpillableHandle, but if it's not a val then there's a chance that approxSizeInBytes is not ready at construction time, so I don't know if I can call it reliably.

Collaborator Author

@abellina abellina Dec 2, 2024

agree this is confusing. The val part is because I want to mark objects that are 0 sized as not spillable, and I'd like to do that at construction time. It removes some locking that otherwise I have to do.

I have renamed this approxSizeInBytes and left sizeInBytes public in handles that do support this non-approximate value. I default this so in most cases approxSizeInBytes = sizeInBytes, but I do think this makes it clearer. Please take a look at d9490ee
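
A rough sketch of the split described here, assuming a simplified trait: only the names `approxSizeInBytes`, `sizeInBytes` and `spillable` come from this thread; `SizedHandle` and `ExactBufferHandle` are hypothetical.

```scala
// Hypothetical sketch of the approximate vs. exact size split.
trait SizedHandle {
  /** Size used for store accounting only; it may be an estimate. */
  val approxSizeInBytes: Long

  /** Zero-byte handles are still tracked, but they are never spill candidates. */
  def spillable: Boolean = approxSizeInBytes > 0
}

// A handle with an exact size exposes it publicly and reuses it as the tracked size.
final class ExactBufferHandle(val sizeInBytes: Long) extends SizedHandle {
  override val approxSizeInBytes: Long = sizeInBytes
}
```

Because the size is fixed at construction, the zero-byte check needs no locking in this sketch.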

* or directly against the handle.
* @return sizeInBytes if spilled, 0 for any other reason (not spillable, closed)
*/
def spill: Long
Collaborator

nit: I thought the convention in Scala was that a method with no parameters should have parens if it did work (like side effects). Should this be def spill(): Long then?

Collaborator Author

fixing all of these

Collaborator Author

Should be fixed with d9dbf36

/**
* Method called to spill this handle. It can be triggered from the spill store,
* or directly against the handle.
* @return sizeInBytes if spilled, 0 for any other reason (not spillable, closed)
Collaborator

Just to clarify do we want them to return the result of calling sizeInBytes? Especially if it is just an approximate size? Can we clarify a little bit here about what is expected to be returned by this API?

Collaborator Author

Hopefully this is clearer with a comment I added.

* is a `val`.
* @return true if currently spillable, false otherwise
*/
private[spill] def spillable: Boolean = sizeInBytes > 0
Collaborator

Why is this private to spill? Just curious, because if we want others to add spillable things in does that mean they have to put the handles in com.nvidia.spark.rapids.spill to make them work?

Collaborator Author

I think so. We can try to open that up if we want spillables in other packages, but right now it's all in here so I made this change in response to a comment by @jlowe. I did want to access this from unit test, and that's why it is private[spill] specifically.

// do we care if:
// handle didn't fit in the store as it is too large.
// we made memory for this so we are going to hold our noses and keep going
// we could spill `handle` at this point.
Collaborator

I thought we spoke about this already. There are currently two cases.

  1. We are running with no limit on the host memory, but a spill store limit
  2. We have a host memory limit.

If we have a host memory limit, then the spill store is constrained by the host memory limit so we can ignore it here. We will not have allocated a HostMemoryBuffer that is too large to fit.

If we have unlimited host memory, then making a SpillableHostBufferHandle is there for code compatibility, but it should not be added to the host store at all. It should never spill. We have unlimited host memory.

We should also document this somewhere.
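
The two cases above could be written down roughly as follows. This is only an illustrative sketch; `HostMemoryMode`, `HostStorePolicy` and `trackInHostStore` are hypothetical names, not plugin configs or framework APIs.

```scala
// Hypothetical sketch of the two host memory cases.
sealed trait HostMemoryMode
// Case 1: no limit on host memory (a spill store limit may still be configured).
case object UnlimitedHostMemory extends HostMemoryMode
// Case 2: a host memory limit is set, which also bounds the spill store.
final case class LimitedHostMemory(limitBytes: Long) extends HostMemoryMode

object HostStorePolicy {
  /** Whether a host buffer handle should be tracked by the host store at all. */
  def trackInHostStore(mode: HostMemoryMode): Boolean = mode match {
    case UnlimitedHostMemory  => false // kept for code compatibility, never spills
    case LimitedHostMemory(_) => true  // the allocation was already bounded by the limit
  }
}
```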

@pxLi
Collaborator

pxLi commented Dec 9, 2024

build - re-deploy jenkins instance for an internal mandatory ops. rekicked the blossom-ci

@pxLi
Collaborator

pxLi commented Dec 9, 2024

build

@abellina abellina marked this pull request as ready for review December 9, 2024 14:31
private val MAX_TABLE_ID = Integer.MAX_VALUE
private val TABLE_ID_UPDATER = new IntUnaryOperator {
override def applyAsInt(i: Int): Int = if (i < MAX_TABLE_ID) i + 1 else 0
def getColumnarBatchAndRemove(handle: RapidsShuffleHandle,
Contributor

I'm wondering why this method exists on the catalog rather than a getColumnarBatch method and a sizeInBytes method on the RapidsShuffleHandle. Then the caller can just use withResource on the shuffle handle directly and call those methods within the withResource block.

Collaborator Author

tableMeta is not part of the spill framework for UCX, it is stored in the shuffle catalogs instead using a RapidsShuffleHandle (which doesn't live in the spill framework... yet).

I think we can make these first class spill handles.

Collaborator

@zpuller zpuller left a comment

I just did a pass on the SpillFramework tests and had some questions

}
}

def initialize(rapidsConf: RapidsConf): Unit = synchronized {
Collaborator

Should we check if it was already initialized, and if so, throw IllegalStateException ?
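
A minimal sketch of the suggested guard, assuming a simple boolean flag; `FrameworkState` is a hypothetical stand-in and does not reflect how `SpillFramework.initialize` is actually written.

```scala
// Hypothetical sketch of an "already initialized" check.
object FrameworkState {
  private var initialized = false

  def initialize(): Unit = synchronized {
    if (initialized) {
      throw new IllegalStateException("spill framework already initialized")
    }
    initialized = true
  }

  // Tests that re-initialize would call shutdown() first.
  def shutdown(): Unit = synchronized { initialized = false }
}
```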

*
* We handle aliasing of objects, either in the spill framework or outside, by looking at the
* reference count. All objects added to the store should support a ref count.
* If the ref count is greater than the expected value, we assume it is being aliased,
Collaborator

How does this work? What is the "expected value" derived from?

Collaborator Author

each handle overrides def spillable and does things differently. A buffer is super easy, it's just MemoryBuffer.getRefCount==1, a ColumnarBatch needs to figure out if it has repetition: A CB of [col0, col0, col0] will have three columns, but they will have a refcount of 3, so 3 is the spillable ref count, not 1, and the spillable method checks this for every column of the batch.
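
The repeated-column check can be sketched as below. `FakeColumn` and `RefCountCheck` are hypothetical stand-ins for cudf columns and the handle's `spillable` logic; the real code works on actual column vectors.

```scala
// Hypothetical stand-in for a ref-counted column.
final class FakeColumn(val name: String) {
  private var refCount = 1
  def incRefCount(): FakeColumn = synchronized { refCount += 1; this }
  def getRefCount: Int = synchronized { refCount }
}

object RefCountCheck {
  /** Spillable only if each column's ref count equals how often it appears in the batch. */
  def spillable(batch: Seq[FakeColumn]): Boolean = {
    // FakeColumn does not override equals, so grouping is by object identity.
    val expected = batch.groupBy(identity).map { case (col, uses) => col -> uses.size }
    expected.forall { case (col, count) => col.getRefCount == count }
  }
}
```

For a batch of `[col0, col0, col0]` the expected count is 3, so a ref count of 3 is spillable and a ref count of 4 means something outside the batch is aliasing the column.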

}
}

test("an aliased contiguous table is not spillable (until closing the original) ") {
Collaborator

Shouldn't this be "until closing the alias?"

assertResult(2)(SpillFramework.stores.deviceStore.numHandles)
assert(!handle.spillable)
assert(!aliasHandle.spillable)
} // we now have two copies in the store
Collaborator

Is this comment supposed to apply to the code inside the curly brackets, or just after closing? It's a bit confusing


test("a buffer is not spillable until the owner closes columns referencing it") {
val (ct, _) = buildContiguousTable()
// the contract for spillable handles is that they take ownership
Collaborator

naive question - why can't the call to getBuffer contain the incRefCount? Are there instances where we want/need a non-owning reference (despite the contract)?

Collaborator Author

It should. We should file a follow on to fix that.

Collaborator Author

assertResult(0)(SpillFramework.stores.hostStore.numHandles)
assertResult(1)(SpillFramework.stores.diskStore.numHandles)
assert(handle.dev.isEmpty)
assert(handle.host.isDefined)
Collaborator

can you explain/add a comment as to why handle.host is defined but handle.host.get.host is not? I'm sure there's a good reason but it's not obvious

Collaborator Author

The way we have architected the spill handles is that they can cascade device->host->disk. But, the device handles don't have a host AND disk handles, they just have a host handle. If the host handle itself spilled, or could not fit on host to begin with, its host component is empty, and its disk component is set. So the view of the world from the handle's perspective is either set in the place where they are supposed to be, or they have a handle to something that can help them find the object later.
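
The cascade described here can be sketched with two toy classes. Only the field names `dev`, `host` and `disk` follow the test under review; `DeviceHandleSketch` and `HostHandleSketch` are hypothetical, and strings stand in for real buffers and files.

```scala
// Hypothetical sketch of the device -> host -> disk cascade.
final class HostHandleSketch(data: String) {
  var host: Option[String] = Some(data) // set while the data is in host memory
  var disk: Option[String] = None       // set once the host copy has spilled
  def spillToDisk(): Unit = { disk = host; host = None }
  def materialize(): String = host.orElse(disk).get
}

final class DeviceHandleSketch(data: String) {
  var dev: Option[String] = Some(data)
  // A device handle only knows its host handle, never the disk handle directly.
  var host: Option[HostHandleSketch] = None
  def spillToHost(): Unit = { host = dev.map(new HostHandleSketch(_)); dev = None }
  def materialize(): String = dev.getOrElse(host.get.materialize())
}
```

After a device-to-host spill followed by a host-to-disk spill, `handle.host` is defined while `handle.host.get.host` is empty, which matches the assertions in the test above.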

Collaborator

Ah I see, it's kind of like how the spill stores had a next store reference

}

test("host originated: get host memory buffer") {
val spillPriority = -10
Collaborator

Why?

Collaborator Author

We had this value in the test before, I just copied it. I didn't want to remove spill priorities because we still want to implement them, and was hoping for tests to fail on me when I change the interface (or I could easily find/replace things)

test("host originated: a host batch supports aliasing and duplicated columns") {
SpillFramework.shutdown()
val sc = new SparkConf
// disables the host store limit
Collaborator

Is this just a confusing config name where setting it (enabled) to true disables it (because it enables some alternative mechanism or something), or is the comment wrong?

Collaborator Author

The way we have defined host limits is that if they are set, the host spill store's own limits are ignored.

The idea is that host spills will be triggered only by a host OOM, and not via a host store limit. So yes, enabling the off-heap limit disables the host store limit. I'll try to change the comment.

// this is a key behavior that we wanted to keep during the spill refactor
// where host objects that are added directly to the store do not cause a
// host->disk spill on their own, instead they will get spilled later
// due to device->host spills.
Collaborator

Is this because we are assuming the memory is already allocated/accounted for? Are we still adding the new buffer size to the total store allocated size?

Collaborator Author

yes and yes. The host memory was allocated, and we track it as such. But we are not going to actively spill right now. We'll spill later when device wants to spill.

This may change later. And when host limits are enabled (not host store limits) then it fits that model better.

}
}

val hostSpillStorageSizes = Seq("-1", "1MB", "16MB")
Collaborator

what does size -1 do?

Collaborator Author

-1 means no host store limit. Will add comment.

@abellina
Collaborator Author

build

@abellina
Collaborator Author

build

@abellina
Collaborator Author

build

@abellina
Collaborator Author

build

@sameerz sameerz added the performance A performance related task/issue label Dec 11, 2024
@abellina
Collaborator Author

@zpuller @jlowe @revans2 Thank you for your review so far. Please let me know if there is anything else.

if (!spillable) {
0L
} else {
synchronized {
Collaborator

@zpuller zpuller Dec 11, 2024

Is there a race condition here? If Thread t1 calls spill(), and they see that spillable is true, but then thread t2 jumps in and calls materialize() and gets the lock and materializes. Say the buffer is on the device only at the moment, so t2 gets a ref to the DeviceMemoryBuffer, and ref count goes to 2. Now t2 drops the lock and goes on to actually use the buffer, and t1 proceeds and starts spilling even though it's no longer spillable.

Collaborator Author

Yes, this is an acceptable race.

In this case we copy to host but we can't free (we call close but we don't actually free). From the perspective of the spill framework, we are going to synchronize up to the point where we call .close on the DeviceMemoryBuffer. The rest is left to the caller. The caller needs to synchronize w.r.t. the APIs that it calls, and I think that part is understood (it's part of the original spill framework as well).
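
Why the race is tolerable can be illustrated with a toy ref-counted buffer. `RefCountedBuffer` is a hypothetical stand-in for a DeviceMemoryBuffer; the real class has a different API.

```scala
// Hypothetical ref-counted buffer: memory is released only by the last owner.
final class RefCountedBuffer {
  private var refCount = 1
  private var freed = false

  def incRefCount(): Unit = synchronized { refCount += 1 }

  def close(): Unit = synchronized {
    refCount -= 1
    if (refCount == 0) freed = true
  }

  def isFreed: Boolean = synchronized { freed }
}
```

In the scenario above, t2 materializes and bumps the count to 2, then t1 finishes its spill and closes the handle's reference. The count drops to 1, so nothing is freed until t2 closes its own reference; the spill copied data to host without reclaiming device memory.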

Signed-off-by: Alessandro Bellina <[email protected]>
@abellina
Collaborator Author

build

@abellina abellina changed the title from "Spill framework refactor for better performance and extensibility" to "Spill framework refactor for better performance and extensibility [databricks]" Dec 12, 2024
@abellina
Collaborator Author

running it again with db enabled.

@abellina
Collaborator Author

build

@abellina
Collaborator Author

build

@abellina abellina merged commit e3798d2 into NVIDIA:branch-25.02 Dec 13, 2024
53 checks passed
Labels
performance A performance related task/issue
6 participants