DemoStorage does not take whiteouts into account -> leading to data corruption #318
Perhaps I misunderstand, but wouldn't that require (infinitely) storing the history of deletions, or at the very least storing an abbreviated version of it? Wouldn't that thus largely defeat a lot of the point of packing a storage? If not that, then it requires new auxiliary data structures to track it that are "packed" separately; e.g., for FileStorage, a whole new file. That doesn't seem particularly feasible. Likewise, it would also require substantial redesign for RelStorage to be able to provide that information in either history-preserving or history-free mode.
@jamadden, thanks for the feedback. Maybe I too misunderstand something, but I think there is no need for an additional side history: if the deletion record is not yet packed, we have its serial. If the deletion record is covered by a pack and removed, it is OK to return 0 as the serial from loadBefore, i.e. as if the object was "not yet created and later removed". In other words, for deletion records it is the same situation as for a plain data record: if that record is still covered by the database, we have it in the history and take it into account; if it is not covered, the database operates as if there were no previous record for the object in question. For FileStorage this should work well without any auxiliary file, because FileStorage records object deletion via its regular data record format: https://github.com/zopefoundation/ZODB/blob/a89485c1/src/ZODB/FileStorage/format.py#L57-L75 About RelStorage - I quickly looked, and, if I understood correctly, it is:
https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/storage/load.py#L245-L246
https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/adapters/packundo.py#L377-L384 which could be instead changed to be something like
i.e. adding a next data record that indicates object deletion, instead of removing the object state for the record with oldserial. This will maintain all the information that loadBefore needs to correctly return the deletedAt serial for a loadBefore query with By the way - is it correct to remove the data for the record with oldserial like it is currently done? At that oldserial database view the object data has to be preserved, and loadBefore with https://github.com/zopefoundation/ZODB/blob/5.6.0-3-g22df1fd11/src/ZODB/interfaces.py#L1276-L1280 maybe I'm misunderstanding something though...
i.e. object data is removed in place. And it is correct to do so: history-free mode can be perceived as a mode where pack is run after every commit, preserving only data reachable from the latest database view. And if we imagine such a pack after a commit with So, once again, maybe I'm missing something, but here I don't see any problem for either RelStorage or traditional storages like FileStorage.
For the reference, here is symmetric proposal for
return only
i.e. don't return next_serial at all. This would unify the return signature to always be For FileStorage this should not make a difference, but for NEO and RelStorage it would offload loadBefore from performing an additional SQL query, which should improve loadBefore latency and speed things up. Here is the relevant place in RelStorage with that additional SQL query to retrieve https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/storage/load.py#L259-L264 The reason I say that
query, the cache can remember When an invalidation message from the server is received, cache entries that had For the record: some of my older notes regarding this topic, from http://navytux.spb.ru/~kirr/neo.html#development-overview and other places, go below. Kirill /cc @vpelletier, @jmuchemb, @arnaud-fontaine ---- 8< ---- A note goes right away that by insisting on specifying the ZODB/go interfaces right, a performance bug in current ZODB/py was discovered: in ZODB/py Currently for a GetObject request NEO performs 2 SQL queries - one to load the object data and another one to get next_serial for it: (mariadb backend) (sqlite backend) I originally wanted to remove next_serial from the load return for interface clarity, but it turns out it also reduces the load on the ZODB storage server and in particular can reduce the number of SQL queries by 2x.
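To make the cache scheme described above concrete, here is a minimal illustration of the idea - hypothetical code, not the actual NEO client cache: each entry remembers the [serial, at] validity range of the data it holds, and an invalidation message (which must list both modified and just-created oids) either extends that range to the new head or drops the entry.

```python
class CacheEntry(object):
    def __init__(self, data, serial, at):
        self.data = data      # object state; None would mean "deleted"
        self.serial = serial  # tid of the revision that created this state
        self.at = at          # highest tid at which this state is known to be current

class ClientCache(object):
    def __init__(self, head):
        self.head = head      # last tid this client knows about
        self.entries = {}     # {oid -> CacheEntry}

    def put(self, oid, data, serial, at):
        # after loadAt(oid, at) -> (data, serial): data is valid over [serial, at]
        self.entries[oid] = CacheEntry(data, serial, at)

    def invalidate(self, tid, changed_oids):
        # the server reports committed transaction `tid` together with the *full*
        # set of modified + just-created oids; entries that were current at the
        # old head stay valid up to the new head unless their oid changed.
        for oid, e in self.entries.items():
            if e.at == self.head and oid not in changed_oids:
                e.at = tid                       # extend validity to the new head
        for oid in changed_oids:
            self.entries.pop(oid, None)          # changed entries are not extended
        self.head = tid
```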
…s and not to skip transactions

Currently the invalidate documentation is not clear whether it should be called for every transaction and whether it should include the full set of objects created/modified by that transaction.

Until now this was working relatively well for the sole purpose of invalidating the client ZEO cache, because for that particular task it is relatively OK not to include just-created objects into invalidation messages, and even to completely skip sending an invalidation if a transaction only creates - not modifies - objects. Due to this fact the workings of the client cache were indifferent to the ambiguity of the interface.

In 2016 skipping transactions with only created objects was reconsidered as a bug and fixed in ZEO5, because ZODB5 relies more heavily on MVCC semantics and needs to be notified about every transaction committed to storage to be able to properly update the ZODB.Connection view:

zopefoundation/ZEO@02943acd#diff-52fb76aaf08a1643cdb8fdaf69e37802L889-R834
zopefoundation/ZEO@9613f09b

However just-created objects were not included into invalidation messages until, hopefully, recently: zopefoundation/ZEO#160

As ZODB starts to be used more widely in areas where it was not traditionally used before, the ambiguity in the invalidate interface, and the lack of guarantees - for any storage - to be notified with the full set of information, creates at least the following problems:

- a ZODB client (not necessarily a native ZODB/py client) can maintain a raw cache for the storage. If such a client tries to load an oid at a database view where that object did not exist yet, gets a "no object" reply and stores that information into the raw cache, then to properly invalidate the cache it needs an invalidation message from the ZODB server that *includes* the created object.

- tools like `zodb watch` [1,2,3] don't work properly (give incorrect output) if not all objects modified/created by a transaction are included into invalidation messages.

- similarly to `zodb watch`, a monitoring tool that wants to be notified of all created/modified objects won't see the full database-change picture, and so won't work properly, without knowing which objects were created.

- wendelin.core 2 - which builds data from ZODB BTrees and data objects into a virtual filesystem - needs to get invalidation messages with both modified and created objects to properly implement its own lazy invalidation and isolation protocol for file blocks in the OS cache: when a block of a file is accessed, all clients that have this block mmapped need to be notified and asked to remmap that block into a particular revision of the file, depending on the client's view of the filesystem and database [4,5]. To compute where a client needs to remmap the block, the WCFS server (which in turn acts as a ZODB client wrt the ZEO/NEO server) needs to be able to see whether the client's view of the filesystem is before object creation (and then ask that client to pin that block to a hole), or after creation (and then ask the client to pin that block to the corresponding revision). This computation needs the ZODB server to send invalidation messages in full: with both modified and just-created objects.

Also:

- the property that all objects - both modified and just created - are included into invalidation messages is required and can help to remove `next_serial` from the `loadBefore` return in the future. This, in turn, can help to do 2x less SQL queries in loadBefore for NEO and RelStorage (and maybe other storages too): zopefoundation#318 (comment)

Current state of storages with respect to the new requirements:

- ZEO: does not skip transactions, but includes only modified - not created - objects. This is fixed by zopefoundation/ZEO#160
- NEO: already implements the requirements in full
- RelStorage: already implements the requirements in full, if I understand correctly: https://github.com/zodb/relstorage/blob/3.1.2-1-gaf57d6c/src/relstorage/adapters/poller.py#L28-L145

While editing the invalidate documentation, use the occasion to document the recently added property that invalidate(tid) is always called before the storage starts to report its lastTransaction() ≥ tid - see 4a6b028 (mvccadapter: check if the last TID changed without invalidation).

/cc @jimfulton, @jamadden, @jmuchemb, @vpelletier, @arnaud-fontaine, @gidzit, @klawlf82, @jwolf083

[1] https://lab.nexedi.com/kirr/neo/blob/049cb9a0/go/zodb/zodbtools/watch.go
[2] https://lab.nexedi.com/kirr/neo/commit/e0d59f5d
[3] https://lab.nexedi.com/kirr/neo/commit/c41c2907
[4] https://lab.nexedi.com/kirr/wendelin.core/blob/1efb5876/wcfs/wcfs.go#L94-182
[5] https://lab.nexedi.com/kirr/wendelin.core/blob/1efb5876/wcfs/client/wcfs.h#L20-71
loadAt is a new optional storage interface that is intended to replace loadBefore with a cleaner and more uniform semantic. Compared to loadBefore, loadAt:

1) returns data=None and the serial of the removal when the loaded object was found to be deleted. loadBefore returns only data=None in such a case. This loadAt property allows to fix the DemoStorage data corruption where whiteouts in the overlay part were not previously correctly taken into account. zopefoundation#318

2) for regular data records, does not require storages to return next_serial in addition to (data, serial). The loadBefore requirement to return both serial and next_serial constrains storages unnecessarily, and, while for FileStorage it is free to implement, for other storages it is not - for example for NEO and RelStorage, finding out next_serial, after looking up the oid@at data record, costs one more SQL query:

https://lab.nexedi.com/nexedi/neoppod/blob/fb746e6b/neo/storage/database/mysqldb.py#L484-508
https://lab.nexedi.com/nexedi/neoppod/blob/fb746e6b/neo/storage/database/mysqldb.py#L477-482
https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/storage/load.py#L259-L264
https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/adapters/mover.py#L177-L199

next_serial is not only about execution overhead - it is semantically redundant and can be removed from the load return. The reason I say that next_serial can be removed is that in ZODB/py the only place, that I could find, where next_serial is used on the client side is in the client cache (e.g. in the NEO client cache), and that cache can be remade to work without using next_serial at all. In simple words: after a loadAt(oid, at) -> (data, serial) query, the cache can remember data for oid in the [serial, at] range. Next, when an invalidation message from the server is received, cache entries that had at == client_head are extended (at -> new_head) for oids that are not present in the invalidation message, while for oids that are present in the invalidation message no such extension is done. This allows to maintain the cache in a correct state, invalidate it when there is a need to invalidate, and not to throw away cache entries that should remain live. This of course requires the ZODB server to include both modified and just-created objects into invalidation messages ( zopefoundation/ZEO#160 , zopefoundation#319 ).

Switching to loadAt should thus allow storages like NEO and, maybe, RelStorage, to do 2x less SQL queries on every object access. zopefoundation#318 (comment)

In other words loadAt unifies the return signature to always be (data, serial) instead of:

POSKeyError                   object does not exist at all
None                          object was removed
(data, serial, next_serial)   regular data record

used by loadBefore.

This patch:

- introduces the new interface.
- introduces a ZODB.utils.loadAt helper, that uses either storage.loadAt, or, if the storage does not implement the loadAt interface, tries to mimic the loadAt semantic via storage.loadBefore to the possible extent + emits a corresponding warning.
- converts MVCCAdapter to use loadAt instead of loadBefore.
- changes DemoStorage to use loadAt, and this way fixes the above-mentioned data corruption issue; adds a corresponding test; converts DemoStorage.loadBefore to be a wrapper around DemoStorage.loadAt.
- adds a loadAt implementation to FileStorage and MappingStorage.
- adapts other tests/code correspondingly.

/cc @jimfulton, @jamadden, @vpelletier, @jmuchemb, @arnaud-fontaine, @gidzit, @klawlf82, @hannosch
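For illustration, here is a rough sketch of how such a helper could fall back to loadBefore on storages that do not implement the new method - hypothetical code, not the actual ZODB.utils.loadAt from the patch. Note that in the fallback the serial of a whiteout cannot be recovered, which is exactly the information loadAt adds.

```python
from ZODB.utils import p64, u64, z64
from ZODB.POSException import POSKeyError

def load_at(storage, oid, at):
    """Hypothetical helper: loadAt(oid, at) -> (data, serial), with a loadBefore fallback."""
    meth = getattr(storage, 'loadAt', None)
    if meth is not None:
        return meth(oid, at)
    # Fallback: loadBefore takes an exclusive "before" bound, so query at+1
    # to observe the database state at `at` inclusively.
    try:
        r = storage.loadBefore(oid, p64(u64(at) + 1))
    except POSKeyError:
        # oid not known at all (FileStorage also raises this after deletion
        # since ZODB#64, which is part of what this issue discusses)
        return None, z64
    if r is None:
        # deleted per the loadBefore interface text; the serial of the whiteout
        # is not available here - that is exactly the information loadAt adds
        return None, z64
    data, serial, _next_serial = r
    return data, serial
```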
Patch for
This issue can be fixed by #323.
For the reference - see also https://lab.nexedi.com/kirr/neo/commit/4fb6bd0a?expanded=1 which provides a DemoStorage analogue for ZODB/go with a fix for this particular issue.
/cc @d-maurer
This issue about data corruption has been here for almost a year. Could we please move on with getting it fixed?
Kirill Smelkov wrote at 2021-5-3 04:04 -0700:
...
Could we please move on with getting this issue fixed?
I will take a look in the next few days.
When you need a review, you can use a `github` workflow for the
request: you find the corresponding actions in the right column
of the `github` pages named "reviewer".
Thanks, @d-maurer.
Thanks for the advice. Just for the reference: for
Jim is currently not interested in any ZEO/ZODB stuff, as his day job has nothing to do with these. I hope I got this right. I met him at the April 2021 Zope sprint.
I was just going to say the same thing. In a conference call a couple weeks ago Jim made it clear that he isn't using ZODB for anything right now, and he won't spend time on it anymore. Don't wait for his feedback, it will not come.
I've been very busy on other things but hope to find time to examine these PRs in depth "soon."
Thanks a lot for the feedback, everyone. It is sad to hear about Jim leaving ZODB land, but at the same time it helps to have it clearly stated. @jamadden, thanks for the update. I understand we are all busy, and, due to that, in practice "soon" might turn out somewhat distant. Still, it is very helpful to read your feedback, because it helps to keep up the motivation, knowing that "this topic is not orphaned; people are just busy, but they are still planning to have a look". So, thanks a lot, again, for this. Kirill P.S. Has @tseaver maybe provided a similar update about his status somewhere? At Pylons/zodburi#29 I'm struggling waiting for review feedback and it is not coming despite many ping attempts. And, maybe it's just a coincidence, but these days I see Jim and Tres being active at the same places on GitHub...
I am sure the situation is similar: Tres (and Jim) are listed as maintainers on many packages that they haven't been using in years. They have moved on to other work and other interests.
@dataflake, thanks for the feedback on this.
When loading after the deleted tid, NEO raises `POSException.POSKeyError`. Any comment about this?
Here is the interface specification for loadBefore - lines 710 to 724 in 1fb097b:
My reading of that text is that the storage should return
And this interpretation coincides with actual FileStorage behaviour. That means that here NEO is not following Anyway, what really matters for the fix is that in addition to "deleted"
Julien Muchembled wrote at 2021-6-7 08:20 -0700:
> ```python
> The bug could be fixed if we change IStorage.loadBefore(oid) to return
>
> (None, serial) # serial = revision where object was deleted
>
> instead of just
>
> None
>
> for deleted records.
> ```
When loading after the deleted tid, NEO raises `POSException.POSKeyError`. Any comment about this ?
A stacking storage (such as `DemoStorage`) which stacks *T* on
top of *B* must distinguish the cases "*o* not in *T*" and
"*o* deleted in *T*": in the first case, it delegates the lookup
for *o* to *B*; in the second case, it reports that "*o* is deleted".
Thus, any storage *T* which does not distinguish the two cases above
does not truely support a stacking storage.
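To illustrate the distinction, here is a minimal sketch - hypothetical code, not DemoStorage's actual implementation - of how a stacking storage could resolve a load once the overlay can report deletions explicitly (loadAt-style (data, serial) results are assumed):

```python
from ZODB.utils import z64

def stacked_load_at(changes, base, oid, at):
    """Sketch: resolve oid@at for an overlay storage `changes` stacked on top of `base`.

    Assumes a loadAt-style API where a storage answers:
      (data, serial)  - regular data record
      (None, serial)  - whiteout: object deleted as of `serial`
      (None, z64)     - the storage knows nothing about oid at `at`
    """
    data, serial = changes.loadAt(oid, at)
    if data is not None:
        return data, serial            # present in the overlay
    if serial != z64:
        # "deleted in T": report the deletion, do NOT fall back to the base
        return None, serial
    # "not in T": delegate the lookup to the base storage
    return base.loadAt(oid, at)
```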
Dieter Maurer wrote at 2021-6-7 18:24 +0200:
Julien Muchembled wrote at 2021-6-7 08:20 -0700:
>> ```python
>> The bug could be fixed if we change IStorage.loadBefore(oid) to return
>>
>> (None, serial) # serial = revision where object was deleted
>>
>> instead of just
>>
>> None
>>
>> for deleted records.
>> ```
>
>When loading after the deleted tid, NEO raises `POSException.POSKeyError`. Any comment about this ?
A stacking storage (such as `DemoStorage`) which stacks *T* on
top of *B* must distinguish the cases "*o* not in *T*" and
"*o* deleted in *T*": in the first case, it delegates the lookup
for *o* to *B*; in the second case, it reports that "*o* is deleted".
Thus, any storage *T* which does not distinguish the two cases above
does not truely support a stacking storage.
If we consider (storage level) caching, a deletion record should
be treated in analogy with a ("normal") data record: especially
it has a "start tid" (aka "serial") and (potentially) an "end tid";
both together specify the validity interval.
Raising an exception provides not that much information.
I understand well the need to distinguish between the 2 cases. But I wondered at least whether I'd have something to fix in NEO, which surprised me because that's something on which I spent time. I've just checked the code of FileStorage and it behaves like NEO: since #64, a deletion results in POSKeyError, contrary to what the initial comment states. What's even less clear is whether returning None for a deleted record (along with changes in DemoStorage) would actually fix this issue: of course, I admit it would not be ideal.
Ok, it was my mistake in #318 (comment) - I misunderstood how FileStorage behaves. Here is the proof that FileStorage indeed raises POSKeyError after object deletion, but returns None before the object was created: ---- 8< ---- (z.dump)
(neo) (z-dev) (g.env) kirr@deco:~/src/wendelin/z/zodbtools/t$ zodb restore x.fs <z.dump
0000000000000001
0000000000000002
0000000000000003
0000000000000004
0000000000000005
0000000000000006
(neo) (z-dev) (g.env) kirr@deco:~/src/wendelin/z/zodbtools/t$ fs1 dump x.fs
Trans #00000 tid=0000000000000001 time=1900-01-01 00:00:00.000000 offset=27
status=' ' user='' description=''
data #00000 oid=0000000000000000 size=4 class=?.?
Trans #00001 tid=0000000000000002 time=1900-01-01 00:00:00.000000 offset=104
status=' ' user='' description=''
data #00000 oid=0000000000000001 size=5 class=?.?
Trans #00002 tid=0000000000000003 time=1900-01-01 00:00:00.000000 offset=182
status=' ' user='' description=''
data #00000 oid=0000000000000001 size=5 class=?.?
Trans #00003 tid=0000000000000004 time=1900-01-01 00:00:00.000000 offset=260
status=' ' user='' description=''
data #00000 oid=0000000000000001 class=undo or abort of object creation
Trans #00004 tid=0000000000000005 time=1900-01-01 00:00:00.000000 offset=341
status=' ' user='' description=''
data #00000 oid=0000000000000001 size=8 class=?.?
Trans #00005 tid=0000000000000006 time=1900-01-01 00:00:00.000000 offset=422
status=' ' user='' description=''
data #00000 oid=0000000000000001 class=undo or abort of object creation
---- 8< ---- x.py
#!/usr/bin/env python
from __future__ import print_function
from ZODB.FileStorage import FileStorage
from ZODB.utils import p64, u64
import traceback as tb
import sys
oid = p64(1)
zstor = FileStorage('x.fs', read_only=True)
for tid in (1,2,3,4,5,6):
    print('load %s@%s -> ' % (u64(oid), tid), end='')
    sys.stdout.flush()
    try:
        x = zstor.loadBefore(oid, p64(tid+1))
    except:
        tb.print_exc()
    else:
        print('%r' % (x,))

(neo) (z-dev) (g.env) kirr@deco:~/src/wendelin/z/zodbtools/t$ ./x.py
load 1@1 -> None
load 1@2 -> ('DATA1', '\x00\x00\x00\x00\x00\x00\x00\x02', '\x00\x00\x00\x00\x00\x00\x00\x03')
load 1@3 -> ('DATA2', '\x00\x00\x00\x00\x00\x00\x00\x03', '\x00\x00\x00\x00\x00\x00\x00\x04')
load 1@4 -> Traceback (most recent call last):
File "./x.py", line 16, in <module>
x = zstor.loadBefore(oid, p64(tid+1))
File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
raise POSKeyError(oid)
POSKeyError: 0x01
load 1@5 -> ('UNDELETE', '\x00\x00\x00\x00\x00\x00\x00\x05', '\x00\x00\x00\x00\x00\x00\x00\x06')
load 1@6 -> Traceback (most recent call last):
File "./x.py", line 16, in <module>
x = zstor.loadBefore(oid, p64(tid+1))
File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
raise POSKeyError(oid)
POSKeyError: 0x01

DemoStorage comments seem to agree with my original understanding and with the text of the loadBefore interface (see lines 220 to 229 in 1fb097b), but FileStorage behaviour was indeed changed in #64.
ZODB specifies deleteObject to create a new revision that indicates object removal:

    def deleteObject(oid, serial, transaction):
        """Mark an object as deleted

        This method marks an object as deleted VIA A NEW OBJECT REVISION.
        Subsequent attempts to load current data for the object will fail
        with a POSKeyError, but loads for non-current data will succeed if
        there are previous non-delete records. The object will be removed
        from the storage when all not-delete records are removed.

https://github.com/zopefoundation/ZODB/blob/bc13ca74/src/ZODB/interfaces.py#L1292-L1307 (emphasis mine)

However currently for history-preserving mode, as explained in zopefoundation/ZODB#318 (comment), RelStorage purges the latest object revision instead of creating a new one with a whiteout indication. This goes against the deleteObject specification and, as demonstrated by the attached test program, against the behaviour of FileStorage in particular.

-> Fix it.

P.S. I'm a complete RelStorage newbie and looked only briefly. It could be that my patch is e.g. incomplete, or not optimal. However it demonstrates a real problem, and it fixes both the adjusted testcase and the failure of the attached tdelete.py

P.P.S. Tested only with the SQLite backend.

---- 8< ---- (tdelete.py)
#!/usr/bin/env python
"""tdelete.py demonstrates that deleteObject should create a new whiteout record,
and that older data records should be still accessible.

e.g. with FileStorage:

    $ ./tdelete.py file://1.db
    @03e40964a0766f33 (= 280359404597309235) obj<0000000000000001> -> int(0)
    @03e40964a0790944 (= 280359404597479748) obj<0000000000000001> -> int(1)

    --------

    @03e40964a0766f33 obj<0000000000000001> -> int(0)  # must be int(0)
    @03e40964a0790944 obj<0000000000000001> -> int(1)  # must be int(1)

However it currently fails with RelStorage, because deleteObject does not
create a new whiteout revision and instead purges already committed data:

    $ rm x/*; ./tdelete.py sqlite://?data_dir=`pwd`/x
    @03e40972d5408022 (= 280359465612509218) obj<0000000000000001> -> int(0)
    @03e40972d541ddee (= 280359465612598766) obj<0000000000000001> -> int(1)

    --------

    @03e40972d5408022 obj<0000000000000001> -> int(0)  # must be int(0)
    Traceback (most recent call last):
      File "./tdelete.py", line 84, in <module>
        main()
      File "./tdelete.py", line 80, in main
        dumpObjAt(at1, "must be int(1)")
      File "./tdelete.py", line 75, in dumpObjAt
        obj = conn.get(oid)
      File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/Connection.py", line 238, in get
        obj = self._reader.getGhost(p)
      File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/serialize.py", line 598, in getGhost
        unpickler = self._get_unpickler(pickle)
      File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/serialize.py", line 478, in _get_unpickler
        file = BytesIO(pickle)
    TypeError: StringIO() argument 1 must be string or buffer, not None
"""
from __future__ import print_function

import zodburi
from persistent import Persistent
from ZODB.DB import DB
from ZODB.Connection import TransactionMetaData
from ZODB.utils import u64
import transaction
import sys

class PInt(Persistent):
    def __init__(self, i):
        self.i = i
    def __str__(self):
        return "int(%d)" % self.i

def h(tid):
    return tid.encode('hex')

def dump(obj):
    print("@%s (= %d) obj<%s> -> %s" % (h(obj._p_serial), u64(obj._p_serial), h(obj._p_oid), obj))

def main():
    zurl = sys.argv[1]
    zstoropen, dbkw = zodburi.resolve_uri(zurl)
    stor = zstoropen()
    db = DB(stor, **dbkw)
    conn = db.open()
    root = conn.root()

    root['X'] = obj = PInt(0)
    transaction.commit()
    dump(obj)
    at0 = obj._p_serial
    oid = obj._p_oid

    obj.i += 1
    transaction.commit()
    dump(obj)
    at1 = obj._p_serial

    txn_meta = TransactionMetaData()
    stor.tpc_begin(txn_meta)
    stor.deleteObject(oid, at1, txn_meta)
    stor.tpc_vote(txn_meta)
    stor.tpc_finish(txn_meta)

    print('\n--------\n')

    def dumpObjAt(at, comment):
        conn = db.open(at=at)
        obj = conn.get(oid)
        print("@%s obj<%s> -> %s\t# %s" % (h(at), h(oid), obj, comment))
        conn.close()

    dumpObjAt(at0, "must be int(0)")
    dumpObjAt(at1, "must be int(1)")

if __name__ == '__main__':
    main()

P.P.P.S. The SQLite URI resolver is currently broken after 08259fa (Finer control over sqlite storage locking, oid allocation and stats). I've used the following local patch as a workaround:

--- a/src/relstorage/zodburi_resolver.py
+++ b/src/relstorage/zodburi_resolver.py
@@ -121,14 +121,14 @@ def factory(options):
         return factory, unused

 class SqliteAdapterHelper(Resolver):
-    _string_args = ('path',)
+    _string_args = ('data_dir',)

     def __call__(self, parsed_uri, kw):
         kw, unused = self.interpret_kwargs(kw)

         def factory(options):
             from relstorage.adapters.sqlite.adapter import Sqlite3Adapter
-            return Sqlite3Adapter(options=options, **kw)
+            return Sqlite3Adapter(options=options, pragmas={}, **kw)
         return factory, unused

 # The relstorage support is inspired by django-zodb.
loadBeforeEx is like loadBefore, but simpler, provides better information for object delete records and can be more efficiently implemented by many storages: zopefoundation/ZODB#323

On RelStorage, loadBefore is currently implemented via 3 SQL queries:

1) check whether the object record exists at all
2) retrieve the object state
3) retrieve the serial of the next object revision

Compared to that, loadBeforeEx is implemented via only one SQL query - "2" from the above, "retrieve object state". It is exactly the same query that loadBefore uses, and after the patch loadBefore actually invokes loadBeforeEx for step 2.

This change was outlined in zopefoundation/ZODB#318 (comment) and zopefoundation/ZODB#318 (comment) and, as explained in the first link, this patch is also semantically coupled with zodb#484

This patch passes tests with both ZODB5 and with ZODB5+zopefoundation/ZODB#323:

- when run with ZODB5 it verifies that the loadBefore implementation does not become broken.
- when run with ZODB5+zopefoundation/ZODB#323 it verifies that the loadBeforeEx implementation is correct.

For tests to pass with ZODB5+zopefoundation/ZODB#323 we also need zopefoundation/zc.zlibstorage#11, because without that fix zc.zlibstorage does not decompress data on loadBeforeEx.
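For illustration, a rough sketch of the layering this describes - hypothetical code, not RelStorage's actual implementation - where legacy loadBefore is built on top of a loadBeforeEx-style call and pays the extra "next revision" lookup only for its own legacy return value:

```python
from ZODB.utils import z64
from ZODB.POSException import POSKeyError

def load_before(storage, oid, before, get_next_serial):
    """Hypothetical layering of legacy loadBefore on top of a loadBeforeEx-style call.

    Assumed convention: loadBeforeEx(oid, before) -> (data, serial), where
    (None, z64) means "no record before `before`" and (None, serial) means
    "deleted as of serial".  `get_next_serial(oid, serial)` stands for the
    extra per-storage lookup (e.g. one more SQL query) that only legacy
    loadBefore needs.
    """
    data, serial = storage.loadBeforeEx(oid, before)   # the single "retrieve state" query
    if data is None:
        if serial == z64:
            raise POSKeyError(oid)   # object not known before `before`
        # deleted: the loadBefore interface text says return None here
        # (FileStorage itself raises POSKeyError since ZODB#64)
        return None
    # only legacy loadBefore needs next_serial; it costs one more lookup
    return data, serial, get_next_serial(oid, serial)
```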
Hello up there. I've discovered that DemoStorage wrongly handles load for objects deleted in the read-write overlay, leading to data corruption. It can be fixed if we correct the interface for IStorage.loadBefore. Please find details in the example program provided below.

Kirill
/cc @jimfulton, @hannosch
---- 8< ---- (bug-demostorage-delete)