Merge pull request #796 from NVIDIA/branch-24.12

[auto-merge] branch-24.12 to branch-25.02 [skip ci] [bot]
NVIDIA · Dec 6, 2024 · 8d65e6d
2 parents 34c4c14 + 557ddf0
commit 8d65e6d
Showing 1 changed file with 72 additions and 3 deletions.
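
The CAGRA cell added in this diff sets `metric='sqeuclidean'`, and each new cell returns approximate neighbors from `kneighbors`. One way to sanity-check such results on a small sample is to compare them with an exact brute-force search under the same metric. The sketch below is pure Python and does not use Spark or a GPU; the helper names (`sqeuclidean`, `exact_knn`, `recall_at_k`) and the toy vectors are hypothetical and are not part of Spark RAPIDS ML or the notebook.

```python
# Brute-force reference for k nearest neighbors under squared Euclidean distance.
# All names and data here are illustrative assumptions, not library APIs.
from typing import List, Sequence, Tuple


def sqeuclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance: the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_knn(
    items: Sequence[Sequence[float]],
    queries: Sequence[Sequence[float]],
    k: int,
) -> List[List[Tuple[int, float]]]:
    """For each query, return the k closest (item_index, distance) pairs, nearest first."""
    results = []
    for q in queries:
        # Sort by (distance, index) so ties are broken by the lower item index.
        scored = sorted((sqeuclidean(q, item), idx) for idx, item in enumerate(items))
        results.append([(idx, dist) for dist, idx in scored[:k]])
    return results


def recall_at_k(
    approx_ids: Sequence[Sequence[int]],
    exact_ids: Sequence[Sequence[int]],
) -> float:
    """Fraction of exact neighbors that the approximate result also returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    total = sum(len(e) for e in exact_ids)
    return hits / total if total else 0.0


# Toy data, made up for illustration only.
items = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
queries = [[1.0, 1.0], [3.5, 3.5]]
exact = exact_knn(items, queries, k=2)
```

To use it against a real run, the neighbor ids collected from an approximate search on a small sample could be passed as `approx_ids`, with `[[idx for idx, _ in row] for row in exact]` as `exact_ids`.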
diff --git a/notebooks/approx-nearest-neighbors.ipynb b/notebooks/approx-nearest-neighbors.ipynb
@@ -138,7 +138,8 @@
    "id": "204b7e72-737e-4a8d-81ce-5a275cb7446a",
    "metadata": {},
    "source": [
-    "## Spark RAPIDS ML (GPU)"
+    "## Spark RAPIDS ML (GPU)\n",
+    "The ApproximateNearestNeighbors class of Spark RAPIDS ML uses the ivfflat algorithm by default."
    ]
   },
   {
@@ -318,9 +319,9 @@
    "id": "8cd56670-7633-4fe6-ab75-0fd680c63baa",
    "metadata": {},
    "source": [
-    "# PySpark\n",
+    "## PySpark\n",
     "\n",
-    "PySpark does not have an exact kNN implementation, but it does have an LSH-based Approximate Nearest Neighbors implementation, shown here to illustrate the similarity between the APIs.  However, the algorithms are very different, so their results are only roughly comparable, and it would require elaborate tuning of parameters to produce similar results."
+    "PySpark has an LSH-based Approximate Nearest Neighbors implementation, shown here to illustrate the similarity between the APIs. The algorithms are very different, however, so their results are only roughly comparable, and producing similar results would require elaborate parameter tuning."
    ]
   },
   {
@@ -440,6 +441,74 @@
     "# saves the LSH hashes for the input rows\n",
     "model.write().overwrite().save(\"/tmp/ann_model\")"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1398af2",
+   "metadata": {},
+   "source": [
+    "## Spark RAPIDS ML (GPU CAGRA algorithm)\n",
+    "CAGRA is a graph-based algorithm available in cuVS, and is now integrated into the ApproximateNearestNeighbors class of Spark RAPIDS ML. CAGRA currently supports only the sqeuclidean distance metric, so the metric must be set to sqeuclidean before using the main APIs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0e0bef26",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "knn = ApproximateNearestNeighbors(k=2, algorithm='cagra', metric='sqeuclidean', algoParams={\"build_algo\" : \"nn_descent\"})\n",
+    "knn.setInputCol(\"features\")\n",
+    "knn_model = knn.fit(item_df)\n",
+    "item_id_df, query_id_df, neighbor_df = knn_model.kneighbors(query_df)\n",
+    "neighbor_df.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b22ac85",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result_df = knn_model.approxSimilarityJoin(query_df)\n",
+    "result_df.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "87fb3f48",
+   "metadata": {},
+   "source": [
+    "## Spark RAPIDS ML (GPU IVFPQ algorithm)\n",
+    "The IVFPQ algorithm combines inverted file indexing with product quantization to deliver fast and memory-efficient approximate nearest neighbor search. It is now integrated into Spark RAPIDS ML."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d40b73ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "knn = ApproximateNearestNeighbors(k=2, algorithm='ivfpq', algoParams={\"M\": 2, \"n_bits\": 8})\n",
+    "knn.setInputCol(\"features\")\n",
+    "knn_model = knn.fit(item_df)\n",
+    "item_id_df, query_id_df, neighbor_df = knn_model.kneighbors(query_df)\n",
+    "neighbor_df.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "11224698",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "result_df = knn_model.approxSimilarityJoin(query_df)\n",
+    "result_df.show()"
+   ]
   }
  ],
  "metadata": {