feat: Ingest LA County Policy Manual #161

Merged Dec 20, 2024 (63 commits into main from yl/ingest_la_county_policy)

Contributor

yoomlam commented Dec 19, 2024

Ticket

https://navalabs.atlassian.net/browse/DST-668

Changes

  • Add app/src/ingest_la_county_policy.py to chunk content and add to DB
  • Add ingest-la-county-policy to Makefile and poetry
  • Update app/src/ingest_edd_web.py to produce valid markdown so chunking doesn't error
  • Update chat engine to use 'LA County Policy' dataset
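
As a rough illustration of why valid Markdown matters for chunking, here is a minimal sketch of heading-based splitting. It is not the code in app/src/ingest_la_county_policy.py; the function name `split_by_headings` and the chunk layout are hypothetical, and it ignores edge cases such as `#` lines inside code blocks.

```python
import re

# ATX headings only ("# Title" to "###### Title"); an assumption for this sketch.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per section, keeping the heading trail."""
    chunks: list[dict] = []
    trail: list[tuple[int, str]] = []  # (level, title) of the enclosing headings
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading trail.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": [title for _, title in trail], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes any heading at the same or a deeper level.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Malformed heading markup would leave text attached to the wrong trail in a splitter like this, which is the kind of problem the ingest_edd_web.py change is meant to avoid.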

Testing

Use Playwright to expand the nav bar and get the list of pages to scrape.
First, go to this folder:

cd app/src/ingestion/la_policy/scrape

Run:

pip install -r requirements.txt
python scrape_la_policy_nav_bar.py

or

uv run --no-project scrape_la_policy_nav_bar.py

This creates la_policy_nav_bar.html, required for the next step.
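
To spot-check the saved nav bar before scraping, a sketch like the following lists the links it contains. The structure of la_policy_nav_bar.html is not shown in this PR, so this assumes the pages appear as plain `<a href="...">` elements; `LinkCollector` and `extract_links` are hypothetical helper names, not part of the repo.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect unique href values from <a> tags, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href not in self.links:
                self.links.append(href)


def extract_links(html: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    # Assumes the file produced by scrape_la_policy_nav_bar.py is in the current folder.
    with open("la_policy_nav_bar.html", encoding="utf-8") as f:
        for link in extract_links(f.read()):
            print(link)
```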

Then use scrapy:

cd ../..
rm -rf scraped  # Optionally clear out old debug scrapings artifacts
DEBUG_SCRAPINGS=true python scrape_la_policy.py &> out.log
# Alternatively: DEBUG_SCRAPINGS=true uv run --no-project scrape_la_policy.py &> out.log
# Alternatively: `DEBUG_SCRAPINGS=true poetry run scrape-la-policy` in the `app/` folder

# Check for errors
grep "ERROR:" out.log

# Examine warnings, ignoring "MarkupResemblesLocatorWarning" ones
grep "WARNING:" out.log | grep -v "MarkupResemblesLocatorWarning"
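
The same two log checks can be done in Python if grep is not available. This is a sketch that mirrors the grep commands above; `summarize_log` is a hypothetical helper, not part of the repo.

```python
def summarize_log(lines, ignore=("MarkupResemblesLocatorWarning",)):
    """Return (errors, warnings), skipping warnings that mention an ignored name."""
    errors = [line.rstrip("\n") for line in lines if "ERROR:" in line]
    warnings = [
        line.rstrip("\n")
        for line in lines
        if "WARNING:" in line and not any(name in line for name in ignore)
    ]
    return errors, warnings


if __name__ == "__main__":
    with open("out.log", encoding="utf-8", errors="replace") as f:
        errors, warnings = summarize_log(f)
    print(f"{len(errors)} errors, {len(warnings)} warnings to review")
```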

The scraped HTML and the resulting Markdown files are under scraped/.

la_policy_scrapings.json is generated for use in chunking in the next step. la_policy_scrapings.json-pretty.json is for easier review.
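
If the pretty copy is missing or stale, it can be regenerated from the main file. The PR does not say how the scraper writes it, so this is only one way to produce an equivalent file; `write_pretty_copy` is a hypothetical helper and makes no assumption about the JSON schema.

```python
import json
from pathlib import Path


def write_pretty_copy(src: Path) -> Path:
    """Write an indented copy of a JSON file next to it, suffixed '-pretty.json'."""
    dest = src.with_name(src.name + "-pretty.json")
    data = json.loads(src.read_text(encoding="utf-8"))
    dest.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return dest


if __name__ == "__main__":
    print(write_pretty_copy(Path("la_policy_scrapings.json")))
```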

Create chunks and add to DB:

cd app
# Enable saving chunks to `chunks-log` subfolder
touch SAVE_CHUNKS
make ingest-la-county-policy DATASET_ID="LA County Policy" BENEFIT_PROGRAM=mixed BENEFIT_REGION="California:LA County" FILEPATH=src/ingestion/la_policy_scrapings.json INGEST_ARGS="--resume"

Expect to see in about 4 minutes:

Processing: ...
Split into ... chunks
...
Done splitting 472 webpages into 2520 chunks
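
For a quick sanity check on the run, the summary line can be parsed and compared with the numbers above. This is a sketch that assumes the log line has exactly the format shown; `parse_summary` is a hypothetical helper.

```python
import re

SUMMARY = re.compile(r"Done splitting (\d+) webpages into (\d+) chunks")


def parse_summary(line: str):
    """Return (webpages, chunks) from the summary line, or None if it does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    pages, chunks = parse_summary("Done splitting 472 webpages into 2520 chunks")
    print(f"{chunks / pages:.1f} chunks per page on average")
```

With the counts above, that works out to roughly 5.3 chunks per page.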

Creating the embeddings then takes much longer (about 30 minutes):

...
Adding embeddings for...
...

Start the chatbot and test: make start

Contributor Author

Refactored so that the functions can be reused by app/src/ingest_la_county_policy.py.

github-actions bot commented Dec 19, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 2817  | 2517    | 89%      | 80%       | 🟢     |

New Files

| File | Coverage | Status |
|------|----------|--------|
| app/src/ingest_la_county_policy.py | 93% | 🟢 |
| TOTAL | 93% | 🟢 |

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| app/src/chat_engine.py | 80% | 🟢 |
| app/src/ingest_edd_web.py | 89% | 🟢 |
| app/src/ingestion/markdown_tree.py | 96% | 🟢 |
| app/src/util/ingest_utils.py | 98% | 🟢 |
| TOTAL | 91% | 🟢 |

updated for commit: 1ee97a6 by action🐍

yoomlam requested a review from a team December 20, 2024 15:11
html2text
Contributor

Running the code now and getting another dependency error, this time on mistletoe.

Contributor Author

Added mistletoe. Try again.

Contributor

Nice, everything ran successfully.

Contributor

fg-nava commented Dec 20, 2024

I haven't been able to run this end to end yet, but I'm giving tentative approval on the expectation that tests are passing, since @yoomlam is OOO after today.

Contributor Author

yoomlam commented Dec 20, 2024

Updated instructions to run using uv (https://docs.astral.sh/uv/getting-started/installation/)

yoomlam merged commit 48c1f43 into main Dec 20, 2024
4 checks passed
yoomlam deleted the yl/ingest_la_county_policy branch December 20, 2024 22:13