release 0.0.4
marsupialtail committed Oct 3, 2022
1 parent af99d94 commit d6ba121
Showing 3 changed files with 51 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/docs/simple.md
@@ -2,7 +2,7 @@

This section is for learning how to use Quokka's DataStream API. **Quokka's DataStream API is basically a dataframe API.** It takes heavy inspiration from SparkSQL and Polars, and adopts a lazy execution model. This means that in contrast to Pandas, your operations are not executed immediately after you define them. Instead, Quokka builds a logical plan under the hood and executes it only when the user wants to "collect" the result, just like Spark.
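
For example, here is a minimal sketch of what the lazy model looks like in practice (this assumes a Quokka context object `qc` as used in the examples later in this tutorial; the exact `collect()` spelling is an assumption based on the description above):

~~~python
# nothing is read or computed yet -- these calls only build up a logical plan
lineitem = qc.read_csv("lineitem.csv", has_header = True)
orders = qc.read_csv("orders.csv", has_header = True)
result = lineitem.join(orders, left_on = "l_orderkey", right_on = "o_orderkey")
result = result.select(["l_orderkey"])

# only now does Quokka optimize and execute the whole plan
df = result.collect()
~~~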

For the first part of our tutorial, we are going to go through implementing a few SQL queries in the TPC-H benchmark suite. You can download the data [here](https://drive.google.com/file/d/1a4yhPoknXgMhznJ9OQO3BwHz2RMeZr8e/view?usp=sharing). It is about 1GB unzipped. Please download the data (should take 2 minutes) and extract it to some directory locally. The SQL queries themselves can be found on this awesome [interface](https://umbra-db.com/interface/).
For the first part of our tutorial, we are going to go through implementing a few SQL queries in the TPC-H benchmark suite. You can download the data [here](https://drive.google.com/file/d/19hgYxZ4u28Cxe0s616Q3yAfkuRdQlmvO/view?usp=sharing). It is about 1GB unzipped. Please download the data (should take 2 minutes) and extract it to some directory locally. The SQL queries themselves can be found on this awesome [interface](https://umbra-db.com/interface/).

These tutorials will use your local machine. They shouldn't take too long to run. It would be great if you can follow along, not just for fun -- **if you find a bug in this tutorial I will buy you a cup of coffee!**

53 changes: 44 additions & 9 deletions pyquokka/df.py
@@ -12,17 +12,12 @@ def __init__(self, cluster = None) -> None:
self.nodes = {}
self.cluster = LocalCluster() if cluster is None else cluster

'''
The API layer for read_csv mainly does four things in sequence:
- Detect if it's an S3 location or a disk location.
- Detect if it's a list of files or a single file.
- Detect if the data is small enough (< 10MB and a single file) to be immediately materialized into a Polars dataframe.
- Detect the schema if not supplied. This is not needed by the reader but is needed by the logical plan optimizer.
After it has done these things, if the dataset is not materialized, we will instantiate a logical plan node and return a DataStream.
'''

def read_files(self, table_location: str):

"""
This doesn't work yet due to difficulty handling Object types in Polars
"""

if table_location[:5] == "s3://":

if type(self.cluster) == LocalCluster:
@@ -59,9 +54,49 @@ def read_files(self, table_location: str):
self.latest_node_id += 1
return DataStream(self, ["filename","object"], self.latest_node_id - 1)

'''
The API layer for read_csv mainly does four things in sequence:
- Detect if it's an S3 location or a disk location.
- Detect if it's a list of files or a single file.
- Detect if the data is small enough (< 10MB and a single file) to be immediately materialized into a Polars dataframe.
- Detect the schema if not supplied. This is not needed by the reader but is needed by the logical plan optimizer.
After it has done these things, if the dataset is not materialized, we will instantiate a logical plan node and return a DataStream.
'''
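
# Illustrative sketch only, not part of this commit: roughly how the four dispatch
# steps described above could be expressed. The helper name and the glob check are
# assumptions; only the 10MB single-file cutoff comes from the comment above.
def _sketch_csv_dispatch(table_location: str, schema = None):
    import os
    on_s3 = table_location.startswith("s3://")                  # step 1: S3 location vs. disk location
    many_files = "*" in table_location                           # step 2: list/glob of files vs. a single file
    materialize = (not on_s3) and (not many_files) and \
        os.path.getsize(table_location) < 10 * 1024 * 1024       # step 3: small single file -> Polars dataframe
    need_schema_detection = schema is None                       # step 4: schema needed by the logical plan optimizer
    return on_s3, many_files, materialize, need_schema_detection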

def read_csv(self, table_location: str, schema = None, has_header = False, sep=","):

"""
Read in CSV file(s) from a table location. It can be a single CSV or a list of CSVs, and the CSV(s) can live
on disk or on S3. Other clouds are currently not supported. You can either supply a schema as a list of
column names through the schema argument, or specify that the CSV has a header row and let Quokka read the
schema from it. You should also specify the CSV's separator.
Args:
table_location (str): where the CSV(s) are. This mostly mimics Spark behavior. Look at the examples.
schema (list): a list of column names. Must be supplied if the CSV(s) have no header row.
has_header (bool): whether the CSV(s) have a header row. If False, a schema must be supplied.
sep (str): the separator used in the CSV(s). Defaults to ",".
Return:
A new DataStream whose source is the CSV(s).
Examples:
~~~python
# read a CSV with a header row and let Quokka infer the schema
>>> lineitem = qc.read_csv("lineitem.csv", has_header = True)
# or supply the schema explicitly if there is no header row
>>> lineitem = qc.read_csv("lineitem.csv", schema = ["l_orderkey", "l_partkey"], sep = ",")
~~~
"""

if schema is None:
assert has_header, "if no schema is provided, the CSV must have a header."
if schema is not None and has_header:
8 changes: 6 additions & 2 deletions setup.py
@@ -1,8 +1,12 @@
from setuptools import setup, find_packages

VERSION = '0.0.2'
VERSION = '0.0.4'
DESCRIPTION = 'Quokka'
LONG_DESCRIPTION = 'Dope way to do cloud analytics'
LONG_DESCRIPTION = """
Dope way to do cloud analytics
Check out https://github.com/marsupialtail/quokka
or https://marsupialtail.github.io/quokka/
"""

# Setting up
setup(