Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Export markdown files during ingestion #171

Merged
merged 7 commits into from
Jan 10, 2025
Merged

Conversation

yoomlam
Copy link
Contributor

@yoomlam yoomlam commented Jan 9, 2025

Ticket

https://navalabs.atlassian.net/browse/DST-669

Changes

For the 3 datasets (EDD, Imagine LA's Information Hub, LA County ePolicy manual):

  • Write markdown files for each document content
  • Write split/chunk files for each document

Testing

If you don't have src/ingestion/edd_scrapings.json, refer to testing description in PR #121 or do the following:

cd app
make scrape-edd-web

Then create md files under folder edd_md:

make ingest-edd-web DATASET_ID="CA EDD test" BENEFIT_PROGRAM=employment BENEFIT_REGION=California FILEPATH=src/ingestion/edd_scrapings.json INGEST_ARGS="--skip_db"

If you don't have src/ingestion/la_policy_scrapings.json, refer to testing description in PR #161 or do the following:

make scrape-la-county-policy

Then create md files under folder la_policy_md:

make ingest-la-county-policy DATASET_ID="LA County Policy test" BENEFIT_PROGRAM=mixed BENEFIT_REGION="California:LA County" FILEPATH=src/ingestion/la_policy_scrapings.json INGEST_ARGS="--skip_db"

If you don't have src/ingestion/imagine_la/scrape/pages, refer to PR #141 or do the following:

make scrape-imagine-la CONTENTHUB_PASSWORD=...

Then create md files under folder imagine_la_md:

make ingest-imagine-la DATASET_ID=imaginelaTest BENEFIT_PROGRAM=mixed BENEFIT_REGION=California FILEPATH=src/ingestion/imagine_la/scrape/pages INGEST_ARGS="--skip_db"

Copy link

github-actions bot commented Jan 10, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
2912 2607 90% 80% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
app/src/ingest_edd_web.py 90% 🟢
app/src/ingest_la_county_policy.py 93% 🟢
app/src/ingestion/imagine_la/ingest.py 96% 🟢
app/src/util/ingest_utils.py 97% 🟢
TOTAL 94% 🟢

updated for commit: 5d55dd7 by action🐍

@yoomlam
Copy link
Contributor Author

yoomlam commented Jan 10, 2025

Follow-up tickets:

  • make running ingestion more consistent across datasets
  • clean up code

@@ -20,7 +22,16 @@ def prep_json_item(item: dict[str, str]) -> dict[str, str]:
return item

common_base_url = "https://epolicy.dpss.lacounty.gov/epolicy/epolicy/server/general/projects_responsive/ePolicyMaster/mergedProjects/"
ingest_json(db_session, json_filepath, doc_attribs, common_base_url, resume, prep_json_item)
ingest_json(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a blocker, but noting we're feeding in a lot of parameters in this fn, might be something to consider refactoring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely! ... in another PR

)


def _fix_input_markdown(markdown: str) -> str:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was moved and its contents was not modified.

Copy link
Contributor

@fg-nava fg-nava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good PR to get a better idea of your ingestion implementations (EDD, LA County, Imagine LA).

Wondering about thoughts on where to continue to standardize, e.g. I noticed ingest_json() is currently in ingest_edd_web.py but used by both EDD and LA County - we could move it to ingest_utils.py

URL mapping could also be standardized across sources, perhaps with a configuration schema/system (yaml)?

return split


class HeadingBasedSplit(Split):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a more clear naming convention between HeadingBasedSplit and Split. Since Split is acting as a base class, maybe named BaseSplit. Will be more clear HeadingBasedSplit is a specific implementation from base Split

app/src/ingest_edd_web.py Show resolved Hide resolved
app/src/ingestion/imagine_la/ingest.py Show resolved Hide resolved
@@ -79,37 +99,81 @@ def prep_json_item(item: dict[str, str]) -> dict[str, str]:
return item

common_base_url = "https://edd.ca.gov/en/"
ingest_json(db_session, json_filepath, doc_attribs, common_base_url, resume, prep_json_item)
ingest_json(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could move this to ingest_utils.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we can refactor in a separate PR

@yoomlam yoomlam merged commit a511080 into main Jan 10, 2025
4 checks passed
@yoomlam yoomlam deleted the yl/export-md-files branch January 10, 2025 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants