feat: Export markdown files during ingestion (#171)
yoomlam authored Jan 10, 2025
1 parent 9596f57 commit a511080
Showing 13 changed files with 354 additions and 173 deletions.
11 changes: 11 additions & 0 deletions README.md
@@ -67,6 +67,17 @@ poetry run scrape-edd-web

To load into the vector database, see sections above but use `make ingest-edd-web DATASET_ID="CA EDD" BENEFIT_PROGRAM=employment BENEFIT_REGION=California FILEPATH=src/ingestion/edd_scrapings.json`.


### To Skip DB Access

For dry runs or exporting markdown files, skip reading from and writing to the DB during ingestion by adding the `--skip_db` argument like so:
```sh
make ingest-edd-web DATASET_ID="CA EDD test" BENEFIT_PROGRAM=employment BENEFIT_REGION=California FILEPATH=src/ingestion/edd_scrapings.json INGEST_ARGS="--skip_db"
```

See PR #171 for other examples.


## Batch processing

To have answers generated for multiple questions at once, create a .csv file with a `question` column, for example:
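The `--skip_db` flag added to the README above is a standard boolean CLI switch. A minimal sketch of how such a flag could be wired up with `argparse` — only the `--skip_db` flag name and the four positional arguments come from this commit; `build_parser` and the description text are hypothetical:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the ingestion CLI; not the project's actual code.
    parser = argparse.ArgumentParser(description="Ingest scraped pages")
    parser.add_argument("dataset_id")
    parser.add_argument("benefit_program")
    parser.add_argument("benefit_region")
    parser.add_argument("filepath")
    parser.add_argument(
        "--skip_db",
        action="store_true",
        help="skip DB reads/writes (useful for dry runs and markdown export)",
    )
    return parser


args = build_parser().parse_args(
    ["CA EDD test", "employment", "California",
     "src/ingestion/edd_scrapings.json", "--skip_db"]
)
print(args.skip_db)  # → True
```

With `action="store_true"` the flag defaults to `False`, so omitting `INGEST_ARGS` leaves normal DB access in place.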
6 changes: 4 additions & 2 deletions app/.gitignore
@@ -43,5 +43,7 @@ documents/
.chainlit/translations/*.json
!.chainlit/translations/en-US.json

/chunks-log/
/src/ingestion/imagine_la/scrape/pages/*

# intermediate markdown files during ingestion
/*_md/
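The new `/*_md/` ignore pattern matches per-dataset directories of intermediate markdown files at the repo root. A hypothetical helper (not the project's actual code) illustrating the kind of export step that would produce a `<name>_md/` directory during ingestion:

```python
import tempfile
from pathlib import Path


def export_markdown(dataset_id: str, pages: dict[str, str], root: Path) -> Path:
    # Hypothetical sketch: write each scraped page out as an intermediate
    # .md file under a per-dataset "<name>_md/" directory, which the new
    # "/*_md/" .gitignore rule would exclude from version control.
    out_dir = root / f"{dataset_id.lower().replace(' ', '_')}_md"
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, markdown in pages.items():
        (out_dir / f"{name}.md").write_text(markdown, encoding="utf-8")
    return out_dir


out = export_markdown(
    "CA EDD", {"overview": "# Overview\n\nScraped text."}, Path(tempfile.mkdtemp())
)
print(out.name)  # → ca_edd_md
```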
17 changes: 17 additions & 0 deletions app/Makefile
@@ -248,11 +248,28 @@ ifndef FILEPATH
$(error FILEPATH is undefined)
endif

scrape-edd-web:
$(PY_RUN_CMD) scrape-edd-web

ingest-edd-web: check-ingest-arguments
$(PY_RUN_CMD) ingest-edd-web "$(DATASET_ID)" "$(BENEFIT_PROGRAM)" "$(BENEFIT_REGION)" "$(FILEPATH)" $(INGEST_ARGS)


scrape-imagine-la:
cd src/ingestion/imagine_la/scrape; uv run --no-project scrape_content_hub.py https://socialbenefitsnavigator25.web.app/contenthub $(CONTENTHUB_PASSWORD)

ingest-imagine-la: check-ingest-arguments
$(PY_RUN_CMD) ingest-imagine-la "$(DATASET_ID)" "$(BENEFIT_PROGRAM)" "$(BENEFIT_REGION)" "$(FILEPATH)"


scrape-la-county-policy:
# Use playwright to scrape dynamic la_policy_nav_bar.html, required for the next step
cd src/ingestion/la_policy/scrape; uv run --no-project scrape_la_policy_nav_bar.py

# Now that we have the expanded nav bar, scrape all the links in the nav bar
# Either should work:
# DEBUG_SCRAPINGS=true uv run --no-project scrape_la_policy.py &> out.log
$(PY_RUN_CMD) scrape-la-policy 2>&1 | tee out.log

ingest-la-county-policy: check-ingest-arguments
$(PY_RUN_CMD) ingest-la-policy "$(DATASET_ID)" "$(BENEFIT_PROGRAM)" "$(BENEFIT_REGION)" "$(FILEPATH)" $(INGEST_ARGS)
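The `2>&1 | tee out.log` in the `scrape-la-county-policy` recipe merges stderr into stdout and duplicates the combined stream to the console and a log file. A minimal Python sketch of that tee behavior, assuming nothing beyond the standard library (the `Tee` class and the printed message are illustrative):

```python
import sys
from pathlib import Path


class Tee:
    # Minimal sketch of what "2>&1 | tee out.log" does in the recipe above:
    # every write is sent both to the original stream and to a log file.
    def __init__(self, stream, logfile: Path):
        self.stream = stream
        self.log = logfile.open("a", encoding="utf-8")

    def write(self, text: str) -> int:
        self.log.write(text)
        self.log.flush()
        return self.stream.write(text)

    def flush(self) -> None:
        self.stream.flush()


sys.stdout = Tee(sys.stdout, Path("out.log"))
print("scraping page 1 of the LA policy site ...")
```

This keeps scraper progress visible on the terminal while preserving a full `out.log` for inspecting failures after a long run.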
