diff --git a/fern/docs/text-gen-solution/using-unstructured-io-for-embedding-documents.mdx b/fern/docs/text-gen-solution/using-unstructured-io-for-embedding-documents.mdx deleted file mode 100644 index 12f306d..0000000 --- a/fern/docs/text-gen-solution/using-unstructured-io-for-embedding-documents.mdx +++ /dev/null @@ -1,50 +0,0 @@ ---- -title: Using Unstructured.io for embedding documents -subtitle: Fast and easy document parsing and embedding using Unstuctured.io and OctoAI. -slug: text-gen-solution/using-unstructured-io-for-embedding-documents ---- - -## Introduction - -Unstructured is both an open-source library and an API service. The library provides components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. - -It also provides components to very easily embed these documents. In Unstructured's jargon this component is called an `EmbeddingEncoder`. The `OctoAIEmbedingEncoder` is available, so documents parsed with Unstructured can easily be embedded with the OctoAI embeddings endpoint. - -## Using the OctoAIEmbeddingEncocer - -The `OctoAIEmbeddingEncoder` class connects to the OctoAI Text&Embedding API to obtain embeddings for pieces of text. - -`embed_documents` will receive a list of Elements, and return an updated list which includes the embeddings attribute for each Element. - -`embed_query` will receive a query as a string, and return a list of floats which is the embedding vector for the given query string. - -`num_of_dimensions` is a metadata property that denotes the number of dimensions in any embedding vector obtained via this class. - -`is_unit_vector` is a metadata property that denotes if embedding vectors obtained via this class are unit vectors. - -The following code block shows an example of how to use `OctoAIEmbeddingEncoder`. -You will see the updated elements list (with the `embeddings` attribute included for each element), -the embedding vector for the query string, and some metadata properties about the embedding model. -You will need to set an environment variable named `OCTOAI_API_KEY` to be able to run this example. -To obtain an API key, visit: [How to create an OctoAI API token](https://octo.ai/docs/getting-started/how-to-create-octoai-access-token). - -```Python -import os - -from unstructured.documents.elements import Text -from unstructured.embed.octoai import OctoAiEmbeddingConfig, OctoAIEmbeddingEncoder - -embedding_encoder = OctoAIEmbeddingEncoder( - config=OctoAiEmbeddingConfig(api_key=os.environ["OCTOAI_API_KEY"]) -) -elements = embedding_encoder.embed_documents( - elements=[Text("This is sentence 1"), Text("This is sentence 2")], -) - -query = "This is the query" -query_embedding = embedding_encoder.embed_query(query=query) - -[print(e.embeddings, e) for e in elements] -print(query_embedding, query) -print(embedding_encoder.is_unit_vector(), embedding_encoder.num_of_dimensions()) -```