feat: Export markdown files during ingestion #171

Merged
merged 7 commits into from
Jan 10, 2025
11 changes: 11 additions & 0 deletions README.md
@@ -67,6 +67,17 @@ poetry run scrape-edd-web

To load into the vector database, see sections above but use `make ingest-edd-web DATASET_ID="CA EDD" BENEFIT_PROGRAM=employment BENEFIT_REGION=California FILEPATH=src/ingestion/edd_scrapings.json`.


### To Skip Access to the DB

For dry runs or for exporting markdown files, add the `--skip_db` argument so that ingestion neither reads from nor writes to the DB:
```
make ingest-edd-web DATASET_ID="CA EDD test" BENEFIT_PROGRAM=employment BENEFIT_REGION=California FILEPATH=src/ingestion/edd_scrapings.json INGEST_ARGS="--skip_db"
```

See PR #171 for other examples.
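
As a rough illustration of how a flag like `--skip_db` could gate database access, here is a minimal sketch. It is not the project's actual implementation: the function names `parse_ingest_args` and `run_ingestion`, the callbacks `save_to_db` and `export_markdown`, and the assumption that markdown export always runs are all hypothetical. Only the flag name and the four positional arguments come from the command above.

```python
import argparse


def parse_ingest_args(argv):
    """Parse the four positional ingestion arguments plus an optional --skip_db flag.

    Hypothetical helper: the real ingestion scripts may parse arguments differently.
    """
    parser = argparse.ArgumentParser(description="Hypothetical ingestion entry point")
    parser.add_argument("dataset_id")
    parser.add_argument("benefit_program")
    parser.add_argument("benefit_region")
    parser.add_argument("file_path")
    # store_true makes the flag default to False when it is not passed
    parser.add_argument("--skip_db", action="store_true",
                        help="do not read from or write to the database")
    return parser.parse_args(argv)


def run_ingestion(args, save_to_db, export_markdown):
    """Export markdown (assumed to always run), then write to the DB unless --skip_db is set."""
    export_markdown(args.file_path)
    if args.skip_db:
        return "skipped_db"
    save_to_db(args.dataset_id)
    return "wrote_db"
```

Passing the database and export steps in as callables keeps the sketch self-contained; in a dry run only the export callback is invoked.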


## Batch processing

To have answers generated for multiple questions at once, create a .csv file with a `question` column, for example:
6 changes: 4 additions & 2 deletions app/.gitignore
@@ -43,5 +43,7 @@ documents/
.chainlit/translations/*.json
!.chainlit/translations/en-US.json

/chunks-log/
/src/ingestion/imagine_la/scrape/pages/*
/src/ingestion/imagine_la/scrape/pages/*

# intermediate markdown files during ingestion
/*_md/
17 changes: 17 additions & 0 deletions app/Makefile
@@ -248,11 +248,28 @@ ifndef FILEPATH
$(error FILEPATH is undefined)
endif

scrape-edd-web:
	$(PY_RUN_CMD) scrape-edd-web

ingest-edd-web: check-ingest-arguments
	$(PY_RUN_CMD) ingest-edd-web "$(DATASET_ID)" "$(BENEFIT_PROGRAM)" "$(BENEFIT_REGION)" "$(FILEPATH)" $(INGEST_ARGS)


scrape-imagine-la:
	cd src/ingestion/imagine_la/scrape; uv run --no-project scrape_content_hub.py https://socialbenefitsnavigator25.web.app/contenthub $(CONTENTHUB_PASSWORD)

ingest-imagine-la: check-ingest-arguments
	$(PY_RUN_CMD) ingest-imagine-la "$(DATASET_ID)" "$(BENEFIT_PROGRAM)" "$(BENEFIT_REGION)" "$(FILEPATH)"


scrape-la-county-policy:
	# Use playwright to scrape dynamic la_policy_nav_bar.html, required for the next step
	cd src/ingestion/la_policy/scrape; uv run --no-project scrape_la_policy_nav_bar.py

	# Now that we have the expanded nav bar, scrape all the links in the nav bar
	# Either should work:
	# DEBUG_SCRAPINGS=true uv run --no-project scrape_la_policy.py &> out.log
	$(PY_RUN_CMD) scrape-la-policy 2>&1 | tee out.log

ingest-la-county-policy: check-ingest-arguments
	$(PY_RUN_CMD) ingest-la-policy "$(DATASET_ID)" "$(BENEFIT_PROGRAM)" "$(BENEFIT_REGION)" "$(FILEPATH)" $(INGEST_ARGS)