Download pre-processed document files from Google Drive into folder evaluation_examples/documents/
. Unzip this file and we will get the following folder structure:
- evaluation_examples/documents/:
- docs.zip
- docs/
- doc_orig.zip
- doc_txt.zip
- doc_md.zip
- doc_html.zip
Among them, suffix _orig
means the crawled raw HTML web pages (using tool HTTrack), and other suffixes represent
Let's take .txt
format as an example. Unzip the archive file doc_txt.zip
, we will get the following structure:
- evaluation_examples/documents/docs/doc_txt/:
- webpage_title.json
- airbyte.com/
- quickstarts.txt
- quickstart/
- aggregating-data-from-mysql-and-postgres-into-bigquery-with-airbyte.txt
- ...
- blog/
- tutorials/
- docs.airbyte.com/
- cloud.google.com/bigquery/docs/
- access-control-basic-roles.txt
- admin-intro.txt
- ...
- ...
Here, each .txt
in a sub-folder represent a pre-processed text file which can be retrieved. We also preserve the website structure for each .txt
file. For the meta file webpage_title.json
, it records the web page title for each .txt
file. We believe both the web page path and title are meaningful during retrieval. For the other two formats, namely .html
and .md
, the only difference is that each .txt
file is pre-processed into Markdown or simplified HTML formats.
We present a simple retrieval baseline (mm_agents/retrieval/llama_index_retrieval.py
) in Spider2-V. It utilizes the task instruction as the query vector, huggingface BAAI/bge-large-en-v1.5
as the embedding model, and LlamaIndex framework as the retrieval to generate context for each task example. The retrieved context file is also provided under each task-specific folder (e.g., retrieved_chunk_size_512_chunk_overlap_20_topk_4_embed_bge-large-en-v1.5.txt
). We get all context files via script:
cd evaluation_examples/documents
git lfs install
git clone https://huggingface.co/BAAI/bge-large-en-v1.5
cd ../../
# you can use other document formats, e.g., --doc_directory evaluation_examples/documents/docs/doc_md
# this step may take some time to build the embedding index
python mm_agents/retrieval/llama_index_retrieval.py --embed_model_name_or_path evaluation_examples/documents/bge-large-en-v1.5 --doc_directory evaluation_examples/documents/docs/doc_txt
Feel free to use more advanced RAG method based on the document warehouse!