feat: Export markdown files during ingestion #171
Follow-up tickets:
@@ -20,7 +22,16 @@ def prep_json_item(item: dict[str, str]) -> dict[str, str]:
return item

common_base_url = "https://epolicy.dpss.lacounty.gov/epolicy/epolicy/server/general/projects_responsive/ePolicyMaster/mergedProjects/"
ingest_json(db_session, json_filepath, doc_attribs, common_base_url, resume, prep_json_item)
ingest_json(
Not a blocker, but we're passing a lot of parameters to this function; it might be worth refactoring.
Definitely! ... in another PR
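A minimal sketch of what that refactor could look like, assuming the arguments are bundled into one object. IngestConfig, its field types, and describe_ingest() are hypothetical names for illustration; only the parameter names come from the call in this diff.

```python
from dataclasses import dataclass
from typing import Callable, Optional

JsonItem = dict[str, str]


@dataclass(frozen=True)
class IngestConfig:
    """Hypothetical bundle of the arguments currently passed to ingest_json()."""

    json_filepath: str
    doc_attribs: dict[str, str]  # assumed type; the real one is not shown in the diff
    common_base_url: str
    resume: bool = False
    prep_json_item: Optional[Callable[[JsonItem], JsonItem]] = None


def describe_ingest(config: IngestConfig) -> str:
    """Stand-in for ingest_json(db_session, config): reports what would be ingested."""
    mode = "resuming" if config.resume else "starting"
    return f"{mode} ingest of {config.json_filepath} (base URL: {config.common_base_url})"


edd_config = IngestConfig(
    json_filepath="src/ingestion/edd_scrapings.json",
    doc_attribs={"dataset": "edd"},  # placeholder attributes
    common_base_url="https://edd.ca.gov/en/",
)
print(describe_ingest(edd_config))
```

With something like this, each per-source script would build one config and pass it along, so adding a new option would not change every call site.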
)

def _fix_input_markdown(markdown: str) -> str:
This function was moved; its contents were not modified.
This is a good PR to get a better idea of your ingestion implementations (EDD, LA County, Imagine LA).
I'm wondering where we should continue to standardize. For example, ingest_json() currently lives in ingest_edd_web.py but is used by both EDD and LA County, so we could move it to ingest_utils.py.
URL mapping could also be standardized across sources, perhaps with a configuration schema/system (yaml)?
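One possible shape for that per-source configuration, sketched as plain Python data that a YAML file could be loaded into. SourceConfig, SOURCES, get_source_config(), and the source keys are all hypothetical; the two base URLs are the ones that appear in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical per-source settings, e.g. as loaded from a YAML file."""

    common_base_url: str


# Keys are illustrative labels; the URLs are the ones in this diff.
SOURCES: dict[str, SourceConfig] = {
    "edd": SourceConfig(common_base_url="https://edd.ca.gov/en/"),
    "la_policy": SourceConfig(
        common_base_url="https://epolicy.dpss.lacounty.gov/epolicy/epolicy/server/general/projects_responsive/ePolicyMaster/mergedProjects/"
    ),
}


def get_source_config(name: str) -> SourceConfig:
    """Look up a source by name, failing loudly on unknown sources."""
    try:
        return SOURCES[name]
    except KeyError:
        raise ValueError(f"Unknown source {name!r}; expected one of {sorted(SOURCES)}") from None
```

Each ingestion script would then read its base URL from one place instead of hard-coding it next to the ingest_json() call.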
return split

class HeadingBasedSplit(Split):
We could use a clearer naming convention for HeadingBasedSplit and Split. Since Split acts as a base class, it could be named BaseSplit, which would make it clearer that HeadingBasedSplit is a specific implementation of the base Split.
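To illustrate the suggested naming, a small sketch follows. The fields and the context() method are invented for this example and are not the project's actual Split API; only the class names come from the comment above.

```python
from dataclasses import dataclass, field


@dataclass
class BaseSplit:
    """Hypothetical base class; the real Split's fields are not shown in this diff."""

    text: str

    def context(self) -> str:
        # A generic split carries no extra context.
        return ""


@dataclass
class HeadingBasedSplit(BaseSplit):
    """Specific implementation that tracks the headings above the split."""

    headings: list[str] = field(default_factory=list)

    def context(self) -> str:
        # Join the heading trail so the split can be located in the document.
        return " > ".join(self.headings)


split = HeadingBasedSplit(text="body text", headings=["Benefits", "Eligibility"])
print(split.context())
```

The Base prefix signals at the call site which class is meant to be subclassed and which is a concrete strategy.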
@@ -79,37 +99,81 @@ def prep_json_item(item: dict[str, str]) -> dict[str, str]:
return item

common_base_url = "https://edd.ca.gov/en/"
ingest_json(db_session, json_filepath, doc_attribs, common_base_url, resume, prep_json_item)
ingest_json(
we could move this to ingest_utils.py
Yup, we can refactor in a separate PR
Ticket
https://navalabs.atlassian.net/browse/DST-669
Changes
For the 3 datasets (EDD, Imagine LA's Information Hub, LA County ePolicy manual):
Testing
If you don't have src/ingestion/edd_scrapings.json, refer to testing description in PR #121 or do the following:
Then create md files under folder edd_md:
If you don't have src/ingestion/la_policy_scrapings.json, refer to testing description in PR #161 or do the following:
Then create md files under folder la_policy_md:
If you don't have src/ingestion/imagine_la/scrape/pages, refer to PR #141 or do the following:
Then create md files under folder imagine_la_md: