feat: Ingest LA County Policy Manual #161
Conversation
app/src/ingest_edd_web.py (Outdated)
Refactored so that functions can be reused by app/src/ingest_la_county_policy.py.
Running the code, getting a deps error on html2text.
Running the code now, getting another deps error, this time on mistletoe.
Added mistletoe. Try again.
Nice, everything ran successfully.
I haven't been able to run this e2e yet, but giving a tentative approval with the expectation that tests are passing, since @yoomlam is OOO after today.
Updated instructions to run using …
Ticket
https://navalabs.atlassian.net/browse/DST-668
Changes
- `app/src/ingest_la_county_policy.py` to chunk content and add to DB (see the simplified chunking sketch after this list)
- `ingest-la-county-policy` to Makefile and poetry
- `app/src/ingest_edd_web.py` to produce valid markdown so chunking doesn't error
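For intuition, here is a simplified heading-based chunker. It is only a sketch: the actual ingest script parses markdown with mistletoe and has its own size limits, metadata, and DB write step, none of which are shown here.

```python
# Simplified illustration of heading-based chunking; the real ingest script
# parses markdown with mistletoe and handles metadata and DB writes.
def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # Start a new chunk at each heading so chunks follow the document structure.
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # Split any oversized section so chunks stay under the embedding model's limit.
    chunks: list[str] = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return [c for c in chunks if c]
```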
Testing

Use playwright to expand the nav bar in order to get a list of pages to scrape:
First go to this folder:
Run:
or
This creates `la_policy_nav_bar.html`, required for the next step.
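The actual command and script live in the repo; as a rough sketch, a Playwright script for this step might look like the following (the URL and selectors below are placeholders, not the real ones):

```python
# Hypothetical sketch only: the real URL, selectors, and script come from the repo.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Placeholder URL; the real LA County policy manual URL is used by the actual script.
    page.goto("https://example.org/la-county-policy-manual")
    # Click every collapsed nav-bar toggle so all page links end up in the DOM.
    for toggle in page.locator("nav .collapsed").all():
        toggle.click()
    # Save the fully expanded page for the scrapy step to read.
    with open("la_policy_nav_bar.html", "w", encoding="utf-8") as f:
        f.write(page.content())
    browser.close()
```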
Then use scrapy:

Scraped HTML and the resulting markdown files are under `scraped/`. `la_policy_scrapings.json` is generated for use in chunking in the next step; `la_policy_scrapings.json-pretty.json` is for easier review.
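A sketch of what such a spider might look like, assuming the nav-bar HTML contains absolute links and that html2text does the HTML-to-markdown conversion (class names and selectors here are illustrative, not the repo's actual code):

```python
# Hypothetical spider: reads the saved nav-bar HTML for start URLs, saves each
# page's HTML, converts it to markdown, and writes a JSON feed of the results.
import html2text
import scrapy
from pathlib import Path
from scrapy.crawler import CrawlerProcess


class LaPolicySpider(scrapy.Spider):
    name = "la_policy"

    def start_requests(self):
        # Pull page links out of the nav bar captured by the Playwright step.
        # Relative links would need to be joined against the site's base URL.
        nav_html = Path("la_policy_nav_bar.html").read_text(encoding="utf-8")
        for url in scrapy.Selector(text=nav_html).css("a::attr(href)").getall():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        Path("scraped").mkdir(exist_ok=True)
        name = response.url.rstrip("/").split("/")[-1] or "index"
        Path(f"scraped/{name}.html").write_bytes(response.body)
        markdown = html2text.html2text(response.text)
        Path(f"scraped/{name}.md").write_text(markdown, encoding="utf-8")
        yield {"url": response.url, "markdown": markdown}


if __name__ == "__main__":
    # The FEEDS setting writes the scraped items to la_policy_scrapings.json.
    process = CrawlerProcess(
        settings={"FEEDS": {"la_policy_scrapings.json": {"format": "json"}}}
    )
    process.crawl(LaPolicySpider)
    process.start()
```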
Create chunks and add to DB:

Expect to see in about 4 minutes:
Then it will take a long while (30 mins) to create the embeddings:
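The embedding library and model are not shown in this PR; purely as an illustration of what this step amounts to, assuming a sentence-transformers style model:

```python
# Illustrative only: the real model name, batching, and DB storage are defined in the repo.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not necessarily the one used
chunks = ["first chunk of policy text...", "second chunk of policy text..."]
# Encoding every chunk through the model is the slow part (~30 minutes for the full manual).
embeddings = model.encode(chunks, show_progress_bar=True)
print(embeddings.shape)  # (number of chunks, embedding dimension)
```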
Start the chatbot and test:
```
make start
```