Retrieving data

Brief overview of steps 1 and 2 in meting_mate/ingest

Fetching changes from drive

Call _1_crawl_drive.py.

This script loops through all app users and uses the Google Docs API to locate the Google Docs documents they own and those shared with them. Whenever a new or modified document is discovered, it is checked against the database. If the database timestamp differs from the Google Docs timestamp (or the document is missing from the database altogether), the document is upserted.
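The upsert decision described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in `_1_crawl_drive.py`: the in-memory `db` dict stands in for the real database, and the field names `id` and `modifiedTime` as well as the helper names `needs_upsert` and `sync_documents` are assumptions made for the example.

```python
from typing import Optional


def needs_upsert(db_doc: Optional[dict], drive_doc: dict) -> bool:
    """Return True when the stored copy is missing or its timestamp differs."""
    if db_doc is None:
        # The document has never been stored.
        return True
    # Any deviation between the two timestamps triggers an upsert.
    return db_doc.get("modifiedTime") != drive_doc["modifiedTime"]


def sync_documents(db: dict, drive_docs: list) -> list:
    """Upsert new or changed documents into `db`; return the ids that were written.

    `db` is a hypothetical stand-in for the database, keyed by document id.
    """
    upserted = []
    for doc in drive_docs:
        if needs_upsert(db.get(doc["id"]), doc):
            db[doc["id"]] = dict(doc)  # insert or overwrite
            upserted.append(doc["id"])
    return upserted


# Usage: one unchanged document, one modified, one new.
db = {
    "a": {"id": "a", "modifiedTime": "2024-01-01T00:00:00Z"},
    "b": {"id": "b", "modifiedTime": "2024-01-01T00:00:00Z"},
}
discovered = [
    {"id": "a", "modifiedTime": "2024-01-01T00:00:00Z"},
    {"id": "b", "modifiedTime": "2024-02-01T00:00:00Z"},
    {"id": "c", "modifiedTime": "2024-03-01T00:00:00Z"},
]
changed = sync_documents(db, discovered)
```

Comparing for inequality rather than "newer than" matches the wording above, where any deviation between the two timestamps causes the document to be written again.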

Call _2_get_contents.py.

This script locates all documents with no content in the "docs" collection, then downloads each document's contents in multiple formats: the native JSON returned by the Google Docs API and an MS Word .docx export. The Word export is then converted to HTML and Markdown using Mammoth.
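The flow of this step might look something like the sketch below. It is only an outline under stated assumptions: the `content` field name, the layout of the stored formats, and the helpers `docs_missing_content` and `fill_content` are hypothetical, and the download and conversion steps are passed in as callables so the example stays self-contained instead of calling the Google Docs API or Mammoth directly.

```python
from typing import Callable, Iterable


def docs_missing_content(docs: Iterable[dict]) -> list:
    """Select records whose 'content' field is absent or empty."""
    return [d for d in docs if not d.get("content")]


def fill_content(
    doc: dict,
    fetch_json: Callable[[str], dict],     # native JSON download (assumed helper)
    export_docx: Callable[[str], bytes],   # .docx export download (assumed helper)
    to_html: Callable[[bytes], str],       # .docx -> HTML converter
    to_markdown: Callable[[bytes], str],   # .docx -> Markdown converter
) -> dict:
    """Download one document in both formats and derive HTML and Markdown."""
    docx_bytes = export_docx(doc["id"])
    doc["content"] = {
        "json": fetch_json(doc["id"]),
        "docx": docx_bytes,
        "html": to_html(docx_bytes),
        "markdown": to_markdown(docx_bytes),
    }
    return doc


# Usage with stub callables in place of the real API and converters.
collection = [
    {"id": "a", "content": {"html": "<p>done</p>"}},
    {"id": "b"},
    {"id": "c", "content": None},
]
pending = docs_missing_content(collection)
for record in pending:
    fill_content(
        record,
        fetch_json=lambda doc_id: {"documentId": doc_id},
        export_docx=lambda doc_id: b"docx:" + doc_id.encode(),
        to_html=lambda data: "<p>" + data.decode() + "</p>",
        to_markdown=lambda data: data.decode(),
    )
```

In a real run the two converter arguments would presumably wrap Mammoth, for example its Python `convert_to_html` function, whose result exposes the generated markup as `.value`.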