release 0.0.4
marsupialtail committed Oct 3, 2022
1 parent af99d94 commit d6ba121
Showing 3 changed files with 51 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/docs/simple.md
@@ -2,7 +2,7 @@

This section is for learning how to use Quokka's DataStream API. **Quokka's DataStream API is basically a dataframe API.** It takes heavy inspiration from SparkSQL and Polars, and adopts a lazy execution model. This means that in contrast to Pandas, your operations are not executed immediately after you define them. Instead, Quokka builds a logical plan under the hood and executes it only when the user wants to "collect" the result, just like Spark.
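
For example, here is a minimal sketch of what the lazy model looks like in practice (this assumes a Quokka context object `qc` as used in the examples later in this tutorial; the exact `collect()` spelling is an assumption based on the description above):

~~~python
# nothing is read or computed yet -- these calls only build up a logical plan
lineitem = qc.read_csv("lineitem.csv", has_header = True)
orders = qc.read_csv("orders.csv", has_header = True)
result = lineitem.join(orders, left_on = "l_orderkey", right_on = "o_orderkey")
result = result.select(["l_orderkey"])

# only now does Quokka optimize and execute the whole plan
df = result.collect()
~~~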

For the first part of our tutorial, we are going to go through implementing a few SQL queries in the TPC-H benchmark suite. You can download the data [here](https://drive.google.com/file/d/1a4yhPoknXgMhznJ9OQO3BwHz2RMeZr8e/view?usp=sharing). It is about 1GB unzipped. Please download the data (should take 2 minutes) and extract it to some directory locally. The SQL queries themselves can be found on this awesome [interface](https://umbra-db.com/interface/).
For the first part of our tutorial, we are going to go through implementing a few SQL queries in the TPC-H benchmark suite. You can download the data [here](https://drive.google.com/file/d/19hgYxZ4u28Cxe0s616Q3yAfkuRdQlmvO/view?usp=sharing). It is about 1GB unzipped. Please download the data (should take 2 minutes) and extract it to some directory locally. The SQL queries themselves can be found on this awesome [interface](https://umbra-db.com/interface/).

These tutorials will use your local machine. They shouldn't take too long to run. It would be great if you can follow along, not just for fun -- **if you find a bug in this tutorial I will buy you a cup of coffee!**

53 changes: 44 additions & 9 deletions pyquokka/df.py
@@ -12,17 +12,12 @@ def __init__(self, cluster = None) -> None:
self.nodes = {}
self.cluster = LocalCluster() if cluster is None else cluster

'''
The API layer for read_csv mainly does four things in sequence:
- Detect if it's an S3 location or a disk location.
- Detect if it's a list of files or a single file.
- Detect if the data is small enough (< 10MB and a single file) to be immediately materialized into a Polars dataframe.
- Detect the schema if not supplied. This is not needed by the reader but is needed by the logical plan optimizer.
After it has done these things, if the dataset is not materialized, we will instantiate a logical plan node and return a DataStream.
'''

def read_files(self, table_location: str):

"""
This doesn't work yet due to difficulty handling Object types in Polars
"""

if table_location[:5] == "s3://":

if type(self.cluster) == LocalCluster:
@@ -59,9 +54,49 @@ def read_files(self, table_location: str):
self.latest_node_id += 1
return DataStream(self, ["filename","object"], self.latest_node_id - 1)

'''
The API layer for read_csv mainly does four things in sequence:
- Detect if it's an S3 location or a disk location.
- Detect if it's a list of files or a single file.
- Detect if the data is small enough (< 10MB and a single file) to be immediately materialized into a Polars dataframe.
- Detect the schema if not supplied. This is not needed by the reader but is needed by the logical plan optimizer.
After it has done these things, if the dataset is not materialized, we will instantiate a logical plan node and return a DataStream.
'''
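
# Illustrative sketch only, not part of this commit: roughly how the four dispatch
# steps described above could be expressed. The helper name and the glob check are
# assumptions; only the 10MB single-file cutoff comes from the comment above.
def _sketch_csv_dispatch(table_location: str, schema = None):
    import os
    on_s3 = table_location.startswith("s3://")                  # step 1: S3 location vs. disk location
    many_files = "*" in table_location                           # step 2: list/glob of files vs. a single file
    materialize = (not on_s3) and (not many_files) and \
        os.path.getsize(table_location) < 10 * 1024 * 1024       # step 3: small single file -> Polars dataframe
    need_schema_detection = schema is None                       # step 4: schema needed by the logical plan optimizer
    return on_s3, many_files, materialize, need_schema_detection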

def read_csv(self, table_location: str, schema = None, has_header = False, sep=","):

"""
Read in CSV file(s) from a table location. It can be a single CSV or a list of CSVs, and the CSV(s) can live
on disk or on S3. Other clouds are currently not supported. You can either supply a schema as a list of
column names through the schema argument, or specify that the CSV has a header row and let Quokka read the
schema from it. You should also specify the CSV's separator.
Args:
table_location (str): where the CSV(s) are. This mostly mimics Spark behavior. Look at the examples.
schema (list): a list of column names. Must be supplied if the CSV(s) have no header row.
has_header (bool): whether the CSV(s) have a header row. If False, a schema must be supplied.
sep (str): the separator used in the CSV(s). Defaults to ",".
Return:
A new DataStream whose source is the CSV(s).
Examples:
~~~python
# read a CSV with a header row and let Quokka infer the schema
>>> lineitem = qc.read_csv("lineitem.csv", has_header = True)
# or supply the schema explicitly if there is no header row
>>> lineitem = qc.read_csv("lineitem.csv", schema = ["l_orderkey", "l_partkey"], sep = ",")
~~~
"""

if schema is None:
assert has_header, "if no schema is provided, the CSV must have a header."
if schema is not None and has_header:
8 changes: 6 additions & 2 deletions setup.py
@@ -1,8 +1,12 @@
from setuptools import setup, find_packages

VERSION = '0.0.2'
VERSION = '0.0.4'
DESCRIPTION = 'Quokka'
LONG_DESCRIPTION = 'Dope way to do cloud analytics'
LONG_DESCRIPTION = """
Dope way to do cloud analytics
Check out https://github.com/marsupialtail/quokka
or https://marsupialtail.github.io/quokka/
"""

# Setting up
setup(