Faster dict #103
base: master
Conversation
This improves the speed of the previous dict implementation by reducing the number of atomic loads in the read path (at most 1 when the dict is read-only - think globals) as well as the number of allocations needed in the write path. Overall, performance is improved by about 30%. Some of the major changes are as follows:

* The internal table layout was changed from `[]*dictEntry` to `[]dictEntry`, removing a memory indirection and hopefully speeding up slot probing in `insertAbsentEntry` and `lookupEntry`.
* Many iteration operations which previously might have needed to grab a relatively expensive lock can now proceed without locking when the dict is in read-only mode.
* `sizeof(Dict)` increased somewhat, as a few variables (`used` and `fill`) were moved from the `dictTable` to `Dict` itself. The addition of `write` and `misses` values to `Dict` makes the overall memory usage of `Dict` generally larger. This is offset by the type change of `dictTable` and the reduction of additional pointers there. An empty dict weighs in at 304 bytes compared to 176 previously. At 4 elements, both the old and new implementations use 304 bytes of memory. From that point on, the new implementation actually uses less memory.
Looks like I messed with some calls during the merge...

@rezaghanbari I'm looking on
I like the change `[]*dictEntry` -> `[]dictEntry`, and I bet this change has the most positive impact on performance due to better cache locality and a lower allocation count.

I don't like the copy-on-write part. With the write path under a lock, the read path and iteration could easily be done lock-free without copy-on-write:

- Never overwrite the `hash` and `key` of an already written `dictEntry`. Delete should be accomplished by writing `deletedValue` into `value` instead of `deletedKey` into `key`.
- Writing should be done in three `atomic.Store`s: `hash`, then `key`, then `value`.
- Reading an entry should be done as follows:

```go
key := atomic.Load(&entry.key)
if key == nil {
	return emptyValue
}
for {
	val := atomic.Load(&entry.value)
	if val == &deletedValue {
		return empty
	} else if val != nil {
		return val
	}
	runtime.Gosched()
}
```

The hash value should also be read after checking `key`.

Remark: `hash == 0` could be used instead of `key == nil` to mark a never-written entry. This order is preferred for `GetEntry`, because it will more often touch only `hash` without needing to check `key`. Then, if a key's hash happens to equal zero, it should be replaced with some constant.

Probably I'm mistaken, and this way `GetEntry` will be considerably slower because of more atomic operations. But this design should at least be tried and compared with.

There is another "issue" with dict: Python 3.7 adopts an order-preserving dict by storing entries in an array and storing indices in the hash part. This is the way `array` works in PHP7, `Hash` is implemented in Ruby 2.5, `Map` in JavaScript (I suppose), and `dict` in PyPy (IIRC). While "order preserving" has not been declared part of the language specification yet, it is quite a pretty property.

So, if a change of dict implementation is desirable, then it is better to implement the "order preserving" variant right now.
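The three-store write and spinning read described above can be made concrete with Go's typed atomics (Go 1.19+). This is a hedged sketch of the proposed protocol, not the grumpy runtime's actual types: a single tombstone pointer marks deletion so `hash` and `key` are never overwritten once published.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// deletedValue is a sentinel tombstone; comparing pointers identifies it.
var deletedValue = new(int)

// entry publishes its fields with atomics so readers need no lock.
type entry struct {
	hash  atomic.Uint32
	key   atomic.Pointer[string]
	value atomic.Pointer[int]
}

// store follows the proposed write order: hash, then key, then value.
func (e *entry) store(h uint32, k string, v int) {
	e.hash.Store(h)
	e.key.Store(&k)
	e.value.Store(&v)
}

// del tombstones the value; hash and key stay intact for later probes.
func (e *entry) del() { e.value.Store(deletedValue) }

// load follows the proposed read order: key first, then spin until the
// writer has published a value (or a tombstone).
func (e *entry) load() (int, bool) {
	if e.key.Load() == nil {
		return 0, false // never written
	}
	for {
		v := e.value.Load()
		if v == deletedValue {
			return 0, false // deleted
		} else if v != nil {
			return *v, true
		}
		runtime.Gosched() // writer stored key but not value yet
	}
}

func main() {
	var e entry
	e.store(42, "answer", 7)
	v, ok := e.load()
	fmt.Println(v, ok)
	e.del()
	_, ok = e.load()
	fmt.Println(ok)
}
```

Note that pointer comparison against `deletedValue` is what distinguishes a deleted slot from a live one, so the tombstone must never escape to callers.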
I agree with @funny-falcon about 3.7 order preservation. If a behavior of 3.x is 2.7-compatible, I will almost always be in favor of it. Python 3 is a goal. It is OK to have many PRs refactoring dict, for benchmark comparison... The actual benchmarks look like this PR is better than the existing implementation. @funny-falcon, do you think we should wait for challenge implementations before merging, or merge and then compare with the challenges?
Most dicts are "readonly" or at least written rarely. Do benchmarks with real code. This is by far the closest you can get safely to py27's performance without requiring additional external synchronization that isn't needed in normal CPython.

I implemented a fully lock-free map, but it was slower. The implementation I followed was the one described by Cliff Click here: https://youtu.be/HJ-719EGIts. Go doesn't directly expose "light" enough atomic loads - at least that was the case at the time I wrote the initial patch.

Regardless, measure with real benchmarks; they tell a good story.
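For reference, the kind of `testing.B` microbenchmark mentioned in this thread can also be driven from a plain program via `testing.Benchmark`. This is a generic illustration only - it uses a built-in map as a stand-in for the dict, and real measurements should come from the grumpy benchmark suite.

```go
package main

import (
	"fmt"
	"testing"
)

// BenchmarkReadMostly models a read-heavy access pattern, like
// repeated lookups of module globals.
func BenchmarkReadMostly(b *testing.B) {
	m := map[string]int{"a": 1, "b": 2, "c": 3}
	b.ResetTimer()
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += m["a"]
	}
	_ = sum // keep the loop from being optimized away
}

func main() {
	// testing.Benchmark runs the function with an auto-tuned b.N.
	r := testing.Benchmark(BenchmarkReadMostly)
	fmt.Println(r.N > 0, r.NsPerOp() >= 0)
}
```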
Doesn't matter. I'll try to find time this week to make an alternative variant, but I don't claim I will complete it.
Yes. But some are not.
Yes, I've said already that I could be mistaken, and "lock-free" could be remarkably slower.
I've spent a day on it today... my first attempt was noticeably slower. Will try again next week.
I've done an ordered dictionary at #111. I didn't fix the comments yet; that is why it is marked as WIP.
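The "order preserving" design discussed in this thread (CPython 3.7-style) can be sketched as a dense entries array holding insertion order, with the hash part storing only indices into it. This is a hedged illustration with made-up names; a built-in map stands in for the index's hash table, and it omits deletion and compaction.

```go
package main

import "fmt"

// odEntry lives in a dense array that records insertion order.
type odEntry struct {
	key   string
	value int
}

type orderedDict struct {
	entries []odEntry      // insertion order
	index   map[string]int // key -> position in entries (the "hash part")
}

func newOrderedDict() *orderedDict {
	return &orderedDict{index: make(map[string]int)}
}

// set overwrites in place, so a re-assigned key keeps its original position.
func (d *orderedDict) set(k string, v int) {
	if i, ok := d.index[k]; ok {
		d.entries[i].value = v
		return
	}
	d.index[k] = len(d.entries)
	d.entries = append(d.entries, odEntry{k, v})
}

func (d *orderedDict) get(k string) (int, bool) {
	i, ok := d.index[k]
	if !ok {
		return 0, false
	}
	return d.entries[i].value, true
}

// keys returns the keys in insertion order.
func (d *orderedDict) keys() []string {
	out := make([]string, len(d.entries))
	for i, e := range d.entries {
		out[i] = e.key
	}
	return out
}

func main() {
	d := newOrderedDict()
	d.set("b", 2)
	d.set("a", 1)
	d.set("b", 3) // overwrite: "b" stays first
	fmt.Println(d.keys()) // prints [b a]
}
```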
#111 is ready for review
Today I was thinking about TimSort and wondering if a Dict can adapt itself to usage behavior. What about bookkeeping reads vs. writes and changing from copy-on-write (good for reads) into something else when the dict is perceived to be write-heavy?
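The bookkeeping part of that idea could be as small as two counters plus a heuristic. This is a speculative sketch (names and the 25% threshold are arbitrary assumptions, not anything from the PR) of how a dict might detect that it has become write-heavy and should switch strategies.

```go
package main

import "fmt"

// adaptiveStats tracks operation counts so a dict could decide
// when copy-on-write stops paying off.
type adaptiveStats struct {
	reads, writes uint64
}

func (s *adaptiveStats) noteRead()  { s.reads++ }
func (s *adaptiveStats) noteWrite() { s.writes++ }

// writeHeavy reports true once writes exceed a quarter of all
// operations, after enough samples to judge (both thresholds are
// arbitrary for this illustration).
func (s *adaptiveStats) writeHeavy() bool {
	total := s.reads + s.writes
	return total >= 100 && s.writes*4 > total
}

func main() {
	var s adaptiveStats
	for i := 0; i < 90; i++ {
		s.noteRead()
	}
	for i := 0; i < 40; i++ {
		s.noteWrite()
	}
	// 40 writes out of 130 ops (~31%) crosses the 25% threshold.
	fmt.Println(s.writeHeavy())
}
```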
I still don't believe Copy-On-Write gives a noticeable performance improvement.
I had used https://github.com/grumpyhome/grumpy/tree/master/grumpy-runtime-src/benchmarks for the primary benchmarks. The testing.B benchmarks added to the Go code (and referenced in the initial PR) were created as an early indicator of how real programs might behave. The proof is in real code - microbenchmarks allow isolating and measuring opportunities. Integrating more of https://github.com/python/performance wouldn't be a bad idea; I was using it for measuring the difference in implementation performance. All that being said, getting other "real" sizable code chunks to measure is always useful.
You mean this new test case?

```go
bigDict := newTestDict(
	"abc", "def",
	"ghi", "jkl",
	"mno", "pqr",
	"stu", "vwx",
	"yzA", "BCD",
	"EFG", "HIJ",
	"KLM", "OPQ",
	"RST", "UVW",
	"XYZ", "123",
	"456", "789",
	"10!", "@#$",
	"%^&", "*()")
cases := []invokeTestCase{
	{args: wrapArgs(NewDict()), want: NewList().ToObject()},
	{args: wrapArgs(newTestDict("foo", None, 42, None)), want: newTestList(42, "foo").ToObject()},
	{args: wrapArgs(bigDict), want: newTestList("abc", "yzA", "KLM", "XYZ", "10!", "456", "stu", "%^&", "mno", "RST", "ghi", "EFG").ToObject()},
}
```

If so, it should be added to the test suite anyway...
Agree. Would be nice.
Rebased PR google#259 over #85
Kudos @nairb774 for the original code and @trotterdylan for original guidance and comments.