Merge pull request #796 from NVIDIA/branch-24.12
[auto-merge] branch-24.12 to branch-25.02 [skip ci] [bot]
nvauto authored Dec 6, 2024
2 parents 34c4c14 + 557ddf0 commit 8d65e6d
Showing 1 changed file with 72 additions and 3 deletions.
75 changes: 72 additions & 3 deletions notebooks/approx-nearest-neighbors.ipynb
@@ -138,7 +138,8 @@
 "id": "204b7e72-737e-4a8d-81ce-5a275cb7446a",
 "metadata": {},
 "source": [
-"## Spark RAPIDS ML (GPU)"
+"## Spark RAPIDS ML (GPU)\n",
+"The ApproximateNearestNeighbors class of Spark RAPIDS ML uses the ivfflat algorithm by default."
 ]
 },
 {
@@ -318,9 +319,9 @@
 "id": "8cd56670-7633-4fe6-ab75-0fd680c63baa",
 "metadata": {},
 "source": [
-"# PySpark\n",
+"## PySpark\n",
 "\n",
-"PySpark does not have an exact kNN implementation, but it does have an LSH-based Approximate Nearest Neighbors implementation, shown here to illustrate the similarity between the APIs. However, the algorithms are very different, so their results are only roughly comparable, and it would require elaborate tuning of parameters to produce similar results."
+"PySpark has an LSH-based Approximate Nearest Neighbors implementation, shown here to illustrate the similarity between the APIs. However, the algorithms are very different, so their results are only roughly comparable, and it would require elaborate tuning of parameters to produce similar results."
 ]
 },
 {
@@ -440,6 +441,74 @@
 "# saves the LSH hashes for the input rows\n",
 "model.write().overwrite().save(\"/tmp/ann_model\")"
 ]
+},
+{
+"cell_type": "markdown",
+"id": "b1398af2",
+"metadata": {},
+"source": [
+"## Spark RAPIDS ML (GPU CAGRA algorithm)\n",
+"CAGRA is a cutting-edge graph-based algorithm available in cuVS, and is now integrated into the ApproximateNearestNeighbors class of Spark RAPIDS ML. CAGRA currently supports only the sqeuclidean distance metric, and the metric must be set before using the main APIs."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "0e0bef26",
+"metadata": {},
+"outputs": [],
+"source": [
+"knn = ApproximateNearestNeighbors(k=2, algorithm='cagra', metric='sqeuclidean', algoParams={\"build_algo\" : \"nn_descent\"})\n",
+"knn.setInputCol(\"features\")\n",
+"knn_model = knn.fit(item_df)\n",
+"item_id_df, query_id_df, neighbor_df = knn_model.kneighbors(query_df)\n",
+"neighbor_df.show()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "0b22ac85",
+"metadata": {},
+"outputs": [],
+"source": [
+"result_df = knn_model.approxSimilarityJoin(query_df)\n",
+"result_df.show()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "87fb3f48",
+"metadata": {},
+"source": [
+"## Spark RAPIDS ML (GPU IVFPQ algorithm)\n",
+"The IVFPQ algorithm combines inverted file indexing with product quantization to deliver fast, memory-efficient approximate nearest neighbor search. It is now integrated into Spark RAPIDS ML."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "d40b73ef",
+"metadata": {},
+"outputs": [],
+"source": [
+"knn = ApproximateNearestNeighbors(k=2, algorithm='ivfpq', algoParams={\"M\": 2, \"n_bits\": 8})\n",
+"knn.setInputCol(\"features\")\n",
+"knn_model = knn.fit(item_df)\n",
+"item_id_df, query_id_df, neighbor_df = knn_model.kneighbors(query_df)\n",
+"neighbor_df.show()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "11224698",
+"metadata": {},
+"outputs": [],
+"source": [
+"result_df = knn_model.approxSimilarityJoin(query_df)\n",
+"result_df.show()"
+]
 }
 ],
 "metadata": {
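
Note on checking the approximate results: the cells added in this commit return approximate neighbors, and the CAGRA cell uses the sqeuclidean metric. One way to sanity-check such output on a small sample is to compare it with an exact brute-force search under the same metric. The following is a minimal pure-Python sketch of that idea, not part of the notebook or of Spark RAPIDS ML. The helper names (`sqeuclidean`, `exact_knn`, `recall_at_k`) and the toy data are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def sqeuclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance: the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(
    items: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Return (item_index, squared_distance) for the k nearest items, closest first."""
    scored = [(i, sqeuclidean(item, query)) for i, item in enumerate(items)]
    # Sort by distance, breaking ties by index so the result is deterministic.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact neighbors that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical toy data, not taken from the notebook.
items = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
query = (1.0, 1.0)

neighbors = exact_knn(items, query, k=2)
exact_ids = [i for i, _ in neighbors]
print(neighbors)

# Pretend an approximate search returned items 0 and 2 for the same query.
print(recall_at_k([0, 2], exact_ids))
```

Because squaring is monotonic for non-negative values, ranking by squared Euclidean distance yields the same neighbor order as ranking by Euclidean distance; only the reported distance values differ.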
