-
Notifications
You must be signed in to change notification settings - Fork 21
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #5 from pinecone-io/jhamon/v1-update
Updates for v1 node client
- Loading branch information
Showing
13 changed files
with
207 additions
and
172 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
PINECONE_API_KEY= | ||
PINECONE_ENVIRONMENT= | ||
PINECONE_INDEX= | ||
PINECONE_INDEX=semantic-search |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,21 +1,40 @@ | ||
# Semantic Search | ||
|
||
In this walkthrough we will see how to use Pinecone for semantic search. To begin we must install the required prerequisite libraries: | ||
In this walkthrough we will see how to use Pinecone for semantic search. | ||
|
||
## Setup | ||
|
||
Ensure you have `Node.js` version 19.7.0 and `npm` version 9.5.0 installed. Clone the repository and install the dependencies using `npm install`. | ||
Prerequisites: | ||
- `Node.js` version >=18.0.0 | ||
|
||
Clone the repository and install the dependencies. | ||
|
||
``` | ||
git clone [email protected]:pinecone-io/semantic-search-example.git | ||
cd semantic-search-example | ||
npm install | ||
``` | ||
|
||
### Configuration | ||
|
||
Create an `.env` file in the root of the project and add your Pinecone API key and environment details: | ||
In order to run this example, you have to supply the Pinecone credentials needed to interact with the Pinecone API. You can find these credentials in the Pinecone web console. This project uses `dotenv` to easily load values from the `.env` file into the environment when executing. | ||
|
||
Copy the template file: | ||
|
||
```sh | ||
cp .env.example .env | ||
``` | ||
|
||
And fill in your API key and environment details: | ||
|
||
```sh | ||
PINECONE_API_KEY=<your-api-key> | ||
PINECONE_ENVIRONMENT=<your-environment> | ||
PINECONE_INDEX=<index-name> | ||
PINECONE_INDEX=semantic-search | ||
``` | ||
|
||
`PINECONE_INDEX` is the name of the index where this demo will store and query embeddings. You can change `PINECONE_INDEX` to any name you like, but make sure the name not going to collide with any indexes you are already using. | ||
|
||
### Building | ||
|
||
To build the project please run the command: | ||
|
@@ -26,7 +45,7 @@ npm run build | |
|
||
## Application structure | ||
|
||
There are two main components to this application: the data loader and the search engine. The data loader is responsible for loading the data into Pinecone. The search engine is responsible for querying the index and returning the results. These two components share two common modules: the `embedder` and the `pinecone` utility module. | ||
There are two main components to this application: the data loader (load.ts) and the search engine (query.ts). The data loader is responsible for loading the data into Pinecone. The search engine is responsible for querying the index and returning similar results. These two components share a common modules, the `embedder`, which transforms natural language strings into embeddings using the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model. | ||
|
||
## Data Preprocessing | ||
|
||
|
@@ -66,7 +85,8 @@ export default loadCSVFile; | |
The text embedding operation is performed in the `Embedder` class. This class uses a pipeline from the [`@xenova/transformers`](https://github.com/xenova/transformers.js) library to generate embeddings for the input text. We use the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model to generate the embeddings. The class provides methods to embed a single string or an array of strings in batches - which will come in useful a bit later. | ||
|
||
```typescript | ||
import { Vector } from "@pinecone-database/pinecone"; | ||
import type { PineconeRecord } from "@pinecone-database/pinecone"; | ||
import type { TextMetadata } from "./types.js"; | ||
import { Pipeline } from "@xenova/transformers"; | ||
import { v4 as uuidv4 } from "uuid"; | ||
import { sliceIntoChunks } from "./utils/util.js"; | ||
|
@@ -81,7 +101,7 @@ class Embedder { | |
} | ||
|
||
// Embed a single string | ||
async embed(text: string): Promise<Vector> { | ||
async embed(text: string): Promise<PineconeRecord<TextMetadata>> { | ||
const result = this.pipe && (await this.pipe(text)); | ||
return { | ||
id: uuidv4(), | ||
|
@@ -97,7 +117,7 @@ class Embedder { | |
async embedBatch( | ||
texts: string[], | ||
batchSize: number, | ||
onDoneBatch: (embeddings: Vector[]) => void | ||
onDoneBatch: (embeddings: PineconeRecord<TextMetadata>[]) => void | ||
) { | ||
const batches = sliceIntoChunks<string>(texts, batchSize); | ||
for (const batch of batches) { | ||
|
@@ -112,53 +132,25 @@ class Embedder { | |
const embedder = new Embedder(); | ||
|
||
export { embedder }; | ||
``` | ||
|
||
## Pinecone utility function | ||
|
||
This function ensures that the required environment variables are set, and then initializes the Pinecone client. To save unnecessary instantiations of the Pinecone client, we use a singleton pattern to ensure that only one instance of the client is created. | ||
|
||
```typescript | ||
import { PineconeClient } from "@pinecone-database/pinecone"; | ||
import { config } from "dotenv"; | ||
import { getEnv, validateEnvironmentVariables } from "./utils/util.js"; | ||
|
||
config(); | ||
|
||
let pineconeClient: PineconeClient | null = null; | ||
|
||
// Returns a Promise that resolves to a PineconeClient instance | ||
export const getPineconeClient = async (): Promise<PineconeClient> => { | ||
validateEnvironmentVariables(); | ||
|
||
if (pineconeClient) { | ||
return pineconeClient; | ||
} else { | ||
pineconeClient = new PineconeClient(); | ||
|
||
await pineconeClient.init({ | ||
apiKey: getEnv("PINECONE_API_KEY"), | ||
environment: getEnv("PINECONE_ENVIRONMENT"), | ||
}); | ||
} | ||
return pineconeClient; | ||
}; | ||
``` | ||
|
||
## Loading embeddings into Pinecone | ||
|
||
Now that we have a way to load data and create embeddings, let put the two together and save the embeddings in Pinecone. In the following section, we get the path of the file we need to process from the command like. We load the CSV file, create the Pinecone index and then start the embedding process. The embedding process is done in batches of 1000. Once we have a batch of embeddings, we insert them into the index. | ||
|
||
```typescript | ||
import { utils } from "@pinecone-database/pinecone"; | ||
import cliProgress from "cli-progress"; | ||
import { config } from "dotenv"; | ||
import loadCSVFile from "./csvLoader.js"; | ||
|
||
import { embedder } from "./embeddings.js"; | ||
import { getPineconeClient } from "./pinecone.js"; | ||
import { getEnv } from "./utils/util.js"; | ||
const { createIndexIfNotExists, chunkedUpsert } = utils; | ||
import { Pinecone } from '@pinecone-database/pinecone'; | ||
import { getEnv, validateEnvironmentVariables } from "./utils/util.js"; | ||
|
||
import type { TextMetadata } from "./types.js"; | ||
|
||
// Load environment variables from .env | ||
config(); | ||
|
||
const progressBar = new cliProgress.SingleBar( | ||
|
@@ -169,11 +161,10 @@ const progressBar = new cliProgress.SingleBar( | |
let counter = 0; | ||
|
||
export const load = async (csvPath: string, column: string) => { | ||
// Get index name | ||
const indexName = getEnv("PINECONE_INDEX"); | ||
|
||
// Get a PineconeClient instance | ||
const pineconeClient = await getPineconeClient(); | ||
validateEnvironmentVariables(); | ||
|
||
// Get a Pinecone instance | ||
const pinecone = new Pinecone(); | ||
|
||
// Create a readable stream from the CSV file | ||
const { data, meta } = await loadCSVFile(csvPath); | ||
|
@@ -187,21 +178,31 @@ export const load = async (csvPath: string, column: string) => { | |
// Extract the selected column from the CSV file | ||
const documents = data.map((row) => row[column] as string); | ||
|
||
// Create a Pinecone index with the name "word-embeddings" and a dimension of 384 | ||
await createIndexIfNotExists(pineconeClient, indexName, 384); | ||
// Get index name | ||
const indexName = getEnv("PINECONE_INDEX"); | ||
|
||
// Check whether the index already exists. If it doesn't, create | ||
// a Pinecone index with a dimension of 384 to hold the outputs | ||
// of our embeddings model. | ||
const indexList = await pinecone.listIndexes(); | ||
if (indexList.indexOf({ name: indexName }) === -1) { | ||
await pinecone.createIndex({ name: indexName, dimension: 384, waitUntilReady: true }) | ||
} | ||
|
||
// Select the target Pinecone index | ||
const index = pineconeClient.Index(indexName); | ||
// Select the target Pinecone index. Passing the TextMetadata generic type parameter | ||
// allows typescript to know what shape to expect when interacting with a record's | ||
// metadata field without the need for additional type casting. | ||
const index = pinecone.index<TextMetadata>(indexName); | ||
|
||
// Start the progress bar | ||
progressBar.start(documents.length, 0); | ||
|
||
// Start the batch embedding process | ||
await embedder.init(); | ||
await embedder.embedBatch(documents, 1, async (embeddings) => { | ||
await embedder.embedBatch(documents, 100, async (embeddings) => { | ||
counter += embeddings.length; | ||
// Whenever the batch embedding process returns a batch of embeddings, insert them into the index | ||
await chunkedUpsert(index, embeddings, "default"); | ||
await index.upsert(embeddings) | ||
progressBar.update(counter); | ||
}); | ||
|
||
|
@@ -247,31 +248,31 @@ Now that our index is populated we can begin making queries. We are performing a | |
```typescript | ||
import { config } from "dotenv"; | ||
import { embedder } from "./embeddings.js"; | ||
import { getPineconeClient } from "./pinecone.js"; | ||
import { Pinecone } from "@pinecone-database/pinecone"; | ||
import { getEnv, validateEnvironmentVariables } from "./utils/util.js"; | ||
import type { TextMetadata } from "./types.js"; | ||
|
||
config(); | ||
|
||
export const query = async (query: string, topK: number) => { | ||
const indexName = getEnv("PINECONE_INDEX"); | ||
validateEnvironmentVariables(); | ||
const pineconeClient = await getPineconeClient(); | ||
const pinecone = new Pinecone(); | ||
|
||
// Insert the embeddings into the index | ||
const index = pineconeClient.Index(indexName); | ||
// Target the index | ||
const indexName = getEnv("PINECONE_INDEX"); | ||
const index = pinecone.index<TextMetadata>(indexName); | ||
|
||
await embedder.init(); | ||
|
||
// Embed the query | ||
const queryEmbedding = await embedder.embed(query); | ||
|
||
// Query the index | ||
// Query the index using the query embedding | ||
const results = await index.query({ | ||
queryRequest: { | ||
vector: queryEmbedding.values, | ||
topK, | ||
includeMetadata: true, | ||
includeValues: false, | ||
namespace: "default", | ||
}, | ||
vector: queryEmbedding.values, | ||
topK, | ||
includeMetadata: true, | ||
includeValues: false | ||
}); | ||
|
||
// Print the results | ||
|
@@ -284,6 +285,7 @@ export const query = async (query: string, topK: number) => { | |
})) | ||
); | ||
}; | ||
|
||
``` | ||
The querying process is very similar to the indexing process. We create a Pinecone client, select the index we want to query, and then embed the query. We then use the `query` method to search the index for the most similar embeddings. The `query` method returns a list of matches. Each match contains the metadata associated with the embedding, as well as the score of the match. | ||
|
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
Oops, something went wrong.