update environment variables to support cloud and region, update associated readme examples
austin-denoble committed Jan 15, 2024
1 parent 4816bb4 commit 80cb075
Showing 7 changed files with 65 additions and 96 deletions.
4 changes: 3 additions & 1 deletion .env.example
@@ -1,2 +1,4 @@
PINECONE_API_KEY=
PINECONE_INDEX=semantic-search
PINECONE_INDEX="semantic-search"
PINECONE_CLOUD="aws"
PINECONE_REGION="us-west-2"
2 changes: 2 additions & 0 deletions .github/actions/integrationTests/action.yml
@@ -13,6 +13,8 @@ runs:
CI: true
PINECONE_API_KEY: ${{ inputs.pinecone_api_key }}
PINECONE_INDEX: "semantic-search-testing"
+PINECONE_CLOUD: "aws"
+PINECONE_REGION: "us-west-2"
run: npm run test
- name: "Report Coverage"
if: always() # Also generate the report if tests are failing
81 changes: 43 additions & 38 deletions README.md
@@ -5,6 +5,7 @@ In this walkthrough we will see how to use Pinecone for semantic search.
## Setup

Prerequisites:

- `Node.js` version >=18.0.0

Clone the repository and install the dependencies.
@@ -17,7 +18,7 @@ npm install

### Configuration

To run this example, you must supply the Pinecone credentials needed to interact with the Pinecone API. You can find these credentials in the Pinecone web console. This project uses `dotenv` to load values from the `.env` file into the environment at runtime.

Copy the template file:

@@ -29,11 +30,15 @@ And fill in your API key and index name:

```sh
PINECONE_API_KEY=<your-api-key>
-PINECONE_INDEX=semantic-search
+PINECONE_INDEX="semantic-search"
+PINECONE_CLOUD="aws"
+PINECONE_REGION="us-west-2"
```

`PINECONE_INDEX` is the name of the index where this demo will store and query embeddings. You can change `PINECONE_INDEX` to any name you like, but make sure the name will not collide with an index you are already using.

`PINECONE_CLOUD` and `PINECONE_REGION` define where the index is deployed. At the time of this change, `aws` and `us-west-2` is the only available cloud and region combination for serverless indexes, so it is recommended to leave the defaults in place.
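These variables are read through small helpers in `src/utils/util.ts`, which fail fast when a required value is missing. A minimal sketch of that pattern (the variable names match the `.env` template above; the exact error message is an assumption):

```typescript
// Sketch of the env-var helpers from src/utils/util.ts: getEnv throws
// early if a required variable is unset, so a misconfigured .env fails
// at startup instead of surfacing as a confusing Pinecone API error.
function getEnv(key: string): string {
  const value = process.env[key];
  if (!value) {
    throw new Error(`${key} environment variable not set`);
  }
  return value;
}

// Called once before any indexing or querying work begins.
const validateEnvironmentVariables = () => {
  getEnv('PINECONE_API_KEY');
  getEnv('PINECONE_INDEX');
  getEnv('PINECONE_CLOUD');
  getEnv('PINECONE_REGION');
};
```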

### Building

To build the project please run the command:
@@ -51,8 +56,8 @@ There are two main components to this application: the data loader (load.ts) and the
The data loading process starts with the CSV file. This file contains the articles that will be indexed and made searchable. To load this data, the project uses the `papaparse` library. The `loadCSVFile` function in `csvLoader.ts` reads the file and uses `papaparse` to parse the CSV data into JavaScript objects. The `dynamicTyping` option is set to `true` to automatically convert the data to the appropriate types. After this step, you will have an array of objects, where each object represents an article.

```typescript
-import fs from "fs/promises";
-import Papa from "papaparse";
+import fs from 'fs/promises';
+import Papa from 'papaparse';

async function loadCSVFile(
filePath: string
@@ -62,7 +67,7 @@ async function loadCSVFile(
const csvAbsolutePath = await fs.realpath(filePath);

// Create a readable stream from the CSV file
-const data = await fs.readFile(csvAbsolutePath, "utf8");
+const data = await fs.readFile(csvAbsolutePath, 'utf8');

// Parse the CSV file
return await Papa.parse(data, {
```

@@ -84,19 +89,19 @@ export default loadCSVFile;
The text embedding operation is performed in the `Embedder` class. This class uses a pipeline from the [`@xenova/transformers`](https://github.com/xenova/transformers.js) library to generate embeddings for the input text. We use the [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model to generate the embeddings. The class provides methods to embed a single string or an array of strings in batches, which will come in handy a bit later.

```typescript
-import type { PineconeRecord } from "@pinecone-database/pinecone";
-import type { TextMetadata } from "./types.js";
-import { Pipeline } from "@xenova/transformers";
-import { v4 as uuidv4 } from "uuid";
-import { sliceIntoChunks } from "./utils/util.js";
+import type { PineconeRecord } from '@pinecone-database/pinecone';
+import type { TextMetadata } from './types.js';
+import { Pipeline } from '@xenova/transformers';
+import { v4 as uuidv4 } from 'uuid';
+import { sliceIntoChunks } from './utils/util.js';

class Embedder {
private pipe: Pipeline | null = null;

// Initialize the pipeline
async init() {
-const { pipeline } = await import("@xenova/transformers");
-this.pipe = await pipeline("embeddings", "Xenova/all-MiniLM-L6-v2");
+const { pipeline } = await import('@xenova/transformers');
+this.pipe = await pipeline('embeddings', 'Xenova/all-MiniLM-L6-v2');
}

// Embed a single string
@@ -131,23 +136,22 @@ class Embedder {
const embedder = new Embedder();

export { embedder };

```

## Loading embeddings into Pinecone

Now that we have a way to load data and create embeddings, let's put the two together and save the embeddings in Pinecone. In the following section, we get the path of the file we need to process from the command line. We load the CSV file, create the Pinecone index, and then start the embedding process. The embedding is done in batches of 100. Once we have a batch of embeddings, we insert them into the index.

```typescript
-import cliProgress from "cli-progress";
-import { config } from "dotenv";
-import loadCSVFile from "./csvLoader.js";
+import cliProgress from 'cli-progress';
+import { config } from 'dotenv';
+import loadCSVFile from './csvLoader.js';

-import { embedder } from "./embeddings.js";
+import { embedder } from './embeddings.js';
import { Pinecone } from '@pinecone-database/pinecone';
-import { getEnv, validateEnvironmentVariables } from "./utils/util.js";
+import { getEnv, validateEnvironmentVariables } from './utils/util.js';

-import type { TextMetadata } from "./types.js";
+import type { TextMetadata } from './types.js';

// Load environment variables from .env
config();
@@ -161,7 +165,7 @@ let counter = 0;

export const load = async (csvPath: string, column: string) => {
validateEnvironmentVariables();

// Get a Pinecone instance
const pinecone = new Pinecone();

@@ -177,8 +181,10 @@ export const load = async (csvPath: string, column: string) => {
// Extract the selected column from the CSV file
const documents = data.map((row) => row[column] as string);

-// Get index name
-const indexName = getEnv("PINECONE_INDEX");
+// Get index name, cloud, and region
+const indexName = getEnv('PINECONE_INDEX');
+const indexCloud = getEnv('PINECONE_CLOUD');
+const indexRegion = getEnv('PINECONE_REGION');

// Create a Pinecone index with a dimension of 384 to hold the outputs
// of our embeddings model. Use suppressConflicts in case the index already exists.
@@ -187,8 +193,8 @@
dimension: 384,
spec: {
serverless: {
-region: "us-west-2",
-cloud: "aws",
+region: indexRegion,
+cloud: indexCloud,
},
},
waitUntilReady: true,
@@ -208,7 +214,7 @@
await embedder.embedBatch(documents, 100, async (embeddings) => {
counter += embeddings.length;
// Whenever the batch embedding process returns a batch of embeddings, insert them into the index
-await index.upsert(embeddings)
+await index.upsert(embeddings);
progressBar.update(counter);
});

```

@@ -252,11 +258,11 @@ Index is ready.
Now that our index is populated we can begin making queries. We are performing a semantic search for similar questions, so we should embed and search with another question.
```typescript
-import { config } from "dotenv";
-import { embedder } from "./embeddings.js";
-import { Pinecone } from "@pinecone-database/pinecone";
-import { getEnv, validateEnvironmentVariables } from "./utils/util.js";
-import type { TextMetadata } from "./types.js";
+import { config } from 'dotenv';
+import { embedder } from './embeddings.js';
+import { Pinecone } from '@pinecone-database/pinecone';
+import { getEnv, validateEnvironmentVariables } from './utils/util.js';
+import type { TextMetadata } from './types.js';

config();

@@ -265,9 +271,9 @@ export const query = async (query: string, topK: number) => {
const pinecone = new Pinecone();

// Target the index
-const indexName = getEnv("PINECONE_INDEX");
+const indexName = getEnv('PINECONE_INDEX');
const index = pinecone.index<TextMetadata>(indexName);

await embedder.init();

// Embed the query
@@ -278,7 +284,7 @@ export const query = async (query: string, topK: number) => {
vector: queryEmbedding.values,
topK,
includeMetadata: true,
-includeValues: false
+includeValues: false,
});

// Print the results
@@ -291,7 +297,6 @@ export const query = async (query: string, topK: number) => {
}))
);
};

```
The querying process is very similar to the indexing process. We create a Pinecone client, select the index we want to query, and then embed the query. We then use the `query` method to search the index for the most similar embeddings. The `query` method returns a list of matches. Each match contains the metadata associated with the embedding, as well as the score of the match.
@@ -307,11 +312,11 @@ The result for this will be something like:
```js
[
{
-text: "Which country in the world has the largest population?",
+text: 'Which country in the world has the largest population?',
score: 0.79473877,
},
{
-text: "Which cities are the most densely populated?",
+text: 'Which cities are the most densely populated?',
score: 0.706895828,
},
];
```

@@ -328,11 +333,11 @@ And the result:
```js
[
{
-text: "Which cities are the most densely populated?",
+text: 'Which cities are the most densely populated?',
score: 0.66688776,
},
{
-text: "What are the most we dangerous cities in the world?",
+text: 'What are the most we dangerous cities in the world?',
score: 0.556335568,
},
];
```
57 changes: 5 additions & 52 deletions package-lock.json


2 changes: 1 addition & 1 deletion package.json
@@ -16,7 +16,7 @@
"format:check": "npx prettier --check src"
},
"dependencies": {
-"@pinecone-database/pinecone": "^1.1.2-spruceDev.20231211000839",
+"@pinecone-database/pinecone": "^1.1.3-spruceDev.20240115214739",
"@xenova/transformers": "2.0.1",
"cli-progress": "^3.12.0",
"dotenv": "^16.0.3",
13 changes: 9 additions & 4 deletions src/load.ts
@@ -3,7 +3,10 @@ import { config } from "dotenv";
import loadCSVFile from "./csvLoader.js";

import { embedder } from "./embeddings.js";
-import { Pinecone } from "@pinecone-database/pinecone";
+import {
+  Pinecone,
+  type ServerlessSpecCloudEnum,
+} from "@pinecone-database/pinecone";
import { getEnv, validateEnvironmentVariables } from "./utils/util.js";

import type { TextMetadata } from "./types.js";
@@ -36,8 +39,10 @@ export const load = async (csvPath: string, column: string) => {
// Extract the selected column from the CSV file
const documents = data.map((row) => row[column] as string);

-// Get index name
+// Get index name, cloud, and region
const indexName = getEnv("PINECONE_INDEX");
+const indexCloud = getEnv("PINECONE_CLOUD") as ServerlessSpecCloudEnum;
+const indexRegion = getEnv("PINECONE_REGION");

// Create a Pinecone index with a dimension of 384 to hold the outputs
// of our embeddings model. Use suppressConflicts in case the index already exists.
@@ -46,8 +51,8 @@ export const load = async (csvPath: string, column: string) => {
dimension: 384,
spec: {
serverless: {
-region: "us-west-2",
-cloud: "aws",
+region: indexRegion,
+cloud: indexCloud,
},
},
waitUntilReady: true,
2 changes: 2 additions & 0 deletions src/utils/util.ts
@@ -15,6 +15,8 @@ function getEnv(key: string): string {
const validateEnvironmentVariables = () => {
getEnv("PINECONE_API_KEY");
getEnv("PINECONE_INDEX");
+getEnv("PINECONE_CLOUD");
+getEnv("PINECONE_REGION");
};

export { getEnv, sliceIntoChunks, validateEnvironmentVariables };
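One helper exported from `util.ts` that this diff references but never shows is `sliceIntoChunks`, which `Embedder.embedBatch` uses to split documents into batches. A plausible implementation sketch (only the name and usage come from the diff; the body is an assumption):

```typescript
// Hypothetical sketch of the sliceIntoChunks helper imported in
// src/embeddings.ts: splits an array into consecutive chunks of at
// most chunkSize elements, with the last chunk possibly shorter.
function sliceIntoChunks<T>(arr: T[], chunkSize: number): T[][] {
  return Array.from({ length: Math.ceil(arr.length / chunkSize) }, (_, i) =>
    arr.slice(i * chunkSize, (i + 1) * chunkSize)
  );
}

// e.g. sliceIntoChunks([1, 2, 3, 4, 5], 2) → [[1, 2], [3, 4], [5]]
```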
