-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix XBRL extraction clobber #3026
Conversation
src/pudl/extract/xbrl.py
Outdated
sql_path=sql_path, | ||
clobber=False, # if we set clobber=True, clobbers on *every* call to run_main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this makes the clobber
option non-functional, at least in this location. The idea with clobber
is to functionally rm -f the_database.sqlite
before starting a new ETL run. Maybe it should only exist at the highest level in the ferc_to_sqlite
script and actually unlink the paths if set -- and then everything in here can just be like "You gave me a DB URL and I'm going to write to it. Not my problem if there was already something incompatible there."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the unlink()
on line 92 does the rm -f
you're looking for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously when I failed to clobber, the XBRL extractor would write duplicate data - when I ran xbrl_to_sqlite
multiple times I still only had one copy of the data afterwards. I guess it could have not written any data at all, the later times. I can run again and see if the database is gone between the unlink()
and the database write.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the database gets deleted at the unlink()
step if clobber
is set to True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DELETED. Strongbad would be proud. I guess the other half of the logic we've previously associated with --clobber
is that if the DB already exists and you attempt to run the ETL, it bails out early, to avoid unintentionally modifying an existing DB. It's not trying to be anything fancy -- just a binary switch "Do you want me to make something new and blow everything that exists away if it's there? Or do you expect that there's a clean slate, and there should be no possible conflicts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the current behavior though. If you don't say clobber and there is an existing DB what happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past if you don't say clobber and there's an existing DB, we actually just append to the existing DB. Which is pretty bad. I'll add a quick check to just bail out if there's an existing DB instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh i don't understand all of the whys for the changes in here but the broad banner of make pudl do a big clobber, always clobber=False
our run of the year-based extractor makes sense to me and looks good!
src/pudl/extract/xbrl.py
Outdated
sql_path = Path(urlparse(PudlPaths().sqlite_db(f"ferc{form.value}_xbrl")).path) | ||
|
||
if clobber: | ||
sql_path.unlink(missing_ok=True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay so this is now the guy that removes the db and we always run clobber=True
in pudl when we call the extractor's run_main
bc each run is per year. cool cool!
4f947d9
to
67e14b4
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## dev #3026 +/- ##
=====================================
Coverage 88.7% 88.7%
=====================================
Files 90 90
Lines 10988 10994 +6
=====================================
+ Hits 9752 9758 +6
Misses 1236 1236
☔ View full report in Codecov by Sentry. |
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
On dev, if you set
clobber: True
in theferc_to_sqlite
job config, it will clobber the existing database when extracting 2021 data, and then clobber it again when extracting 2022 data.This was probably the cause of the missing 2021 XBRL data that @cmgosnell ran into yesterday.