MosaicML-Streaming on Databricks #801
Hi @gtmdotme, Petastorm is no longer supported as a DL dataloader on Databricks, and my knowledge of Petastorm is also very limited :face_palm: Let us know if you need more help.
Hi @XiaohanZhangCMU, thanks for your answers. I have a follow-up question on the third point. The tutorial you shared seems to indicate that we need to first convert Parquet (a PySpark dataframe) to the MDS format (a streaming dataset). Does that mean we will consume double the storage: one copy for the stored Parquet data and another for the MDS data?
@gtmdotme yes, that's correct. You will need to persist two copies.
We often store data in Parquet format (for PySpark dataframes), but to use your library we also need to convert it to the MDS format. This creates a second copy of the data, which could lead to a significant increase in disk storage usage. Is that a known tradeoff? In a comparison on a synthetic dataset (example code below), I found that the MDS format takes nearly 30x more disk space than the Parquet format. I'm wondering if I'm doing something wrong or if this is expected behavior. This code was tested on Databricks Runtime 15.4 LTS ML. Thanks for your help!

Example:

```python
import os, shutil
import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from streaming.base.converters import dataframe_to_mds

local_dir = './data_loading/'
shutil.rmtree(local_dir, ignore_errors=True)
os.makedirs(local_dir, exist_ok=True)

def get_dummy_data(n_samples, n_features, n_labels, seed):
    n_classes = n_labels  ## Max number of labels
    avg_labels_per_class = int(n_classes * 0.1)  ## Average number of labels per instance, here 10%
    # Generate dummy data
    X, y = make_multilabel_classification(n_samples=n_samples, n_features=n_features,
                                          n_classes=n_classes, n_labels=avg_labels_per_class,
                                          random_state=seed)
    # Convert to pandas dataframes
    feature_cols = [f'feature_{i}' for i in range(n_features)]
    target_cols = [f'target_{i}' for i in range(n_classes)]
    X = pd.DataFrame(X, columns=feature_cols)
    y = pd.DataFrame(y, columns=target_cols)
    # Merge features and targets into one dataset
    data = pd.concat([X, y], axis=1)
    return data, feature_cols, target_cols

df, feature_cols, target_cols = get_dummy_data(n_samples=100_000, n_features=128, n_labels=32, seed=42)
print(df.shape)

# save as parquet, csv and numpy array
df.to_parquet(local_dir + 'dummy_data.parquet')
df.to_csv(local_dir + 'dummy_data.csv', index=False)
np.save(local_dir + 'dummy_numpy.npy', df.to_numpy())

# convert pandas df to pyspark df
df_spark = spark.createDataFrame(df)
df_spark.write.mode("overwrite").save(local_dir + "dummy_spark")

# save the dataset in the MDS format
out_path = os.path.join(local_dir, 'dummy_mosaic')
shutil.rmtree(out_path, ignore_errors=True)
mds_kwargs = {'out': out_path}
dataframe_to_mds(df_spark.repartition(4), merge_index=True, mds_kwargs=mds_kwargs)
```

Output:

```
!du -achd1 {local_dir}
 55M    ./data_loading/dummy_data.csv
4.4M    ./data_loading/dummy_data.parquet
123M    ./data_loading/dummy_mosaic
123M    ./data_loading/dummy_numpy.npy
5.2M    ./data_loading/dummy_spark
310M    ./data_loading/
310M    total
```
Hey @gtmdotme, yes, the trade-off of keeping an additional copy is known. To help reduce the size of the MDS copy, you may want to add a compression method to `mds_kwargs` (for details, take a look at this page). Parquet compresses data efficiently when there are many repetitive values, so I am not surprised it ends up much smaller: without compression, your MDS copy is just a serialized binary format. Let me know what the size looks like after you apply a compression method.
Awesome! Compression did help. I used the `zstd:9` compression. Here is the final code:

```python
import os, shutil
import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from streaming.base.converters import dataframe_to_mds
from pyspark.sql.functions import col

local_dir = './data_loading/'
shutil.rmtree(local_dir, ignore_errors=True)
os.makedirs(local_dir, exist_ok=True)

def get_dummy_data(n_samples, n_features, n_labels, seed):
    n_classes = n_labels  ## Max number of labels
    avg_labels_per_class = int(n_classes * 0.1)  ## Average number of labels per instance, here 10%
    # Generate dummy data
    X, y = make_multilabel_classification(n_samples=n_samples, n_features=n_features,
                                          n_classes=n_classes, n_labels=avg_labels_per_class,
                                          random_state=seed)
    # Convert to pandas dataframes
    feature_cols = [f'feature_{i}' for i in range(n_features)]
    target_cols = [f'target_{i}' for i in range(n_classes)]
    X = pd.DataFrame(X, columns=feature_cols)
    y = pd.DataFrame(y, columns=target_cols)
    # Merge features and targets into one dataset
    data = pd.concat([X, y], axis=1)
    return data, feature_cols, target_cols

df, feature_cols, target_cols = get_dummy_data(n_samples=100_000, n_features=128, n_labels=32, seed=42)
print(df.shape)

# save as parquet, csv and numpy array
df.to_parquet(local_dir + 'dummy_data.parquet')
df.to_csv(local_dir + 'dummy_data.csv', index=False)
np.save(local_dir + 'dummy_numpy.npy', df.to_numpy())

# convert pandas df to pyspark df
df_spark = spark.createDataFrame(df)
df_spark.write.mode("overwrite").save(local_dir + "dummy_spark")

# save the dataset in the MDS format, this time with zstd compression
out_path = os.path.join(local_dir, 'dummy_mosaic')
shutil.rmtree(out_path, ignore_errors=True)
mds_kwargs = {'out': out_path, 'compression': 'zstd:9'}
dataframe_to_mds(df_spark.repartition(4), merge_index=True, mds_kwargs=mds_kwargs)
```

Output:

```
!du -achd1 {local_dir}
 55M    ./data_loading/dummy_data.csv
4.4M    ./data_loading/dummy_data.parquet
6.1M    ./data_loading/dummy_mosaic
123M    ./data_loading/dummy_numpy.npy
5.2M    ./data_loading/dummy_spark
193M    total
```
Q1. The DB Runtime 15.4 LTS ML comes with streaming … Do I need to upgrade it?

Q2. I have a column of type … I tried to set those empty arrays to … Is there a solution to this?

Q3. On this page, I can't find a data type for boolean values. Is there a recommended way to convert boolean columns from PySpark to the MDS format? Here is the error: …
@gtmdotme The answer to Q1 is yes. You will need to upgrade to a version that has the array encoder. For Q2, it's not a desired failure, but it is sort of expected: the MDS converter cannot do any imputation, so it expects every record to have the same dtype and to hold valid values. For Q3, you are right, there is currently no boolean type; a workaround is to cast the column to an integer and use "int".
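For anyone hitting the same errors, here is a minimal, untested sketch of those two workarounds. The column names (`is_valid`, `embeddings`) and the array length of 128 are made-up placeholders, and `df_spark` is assumed to be the PySpark dataframe being converted:

```python
from pyspark.sql import functions as F

# df_spark is assumed to exist already (e.g. spark.createDataFrame(...)).

# Q3 workaround: MDS has no boolean encoding, so cast the (hypothetical)
# boolean column to an integer and let the converter treat it as 'int'.
df_fixed = df_spark.withColumn('is_valid', F.col('is_valid').cast('int'))

# Q2: the converter does no imputation, so every record must already hold
# valid, consistently typed values. Replace empty arrays before converting,
# e.g. with a zero vector of the expected length (128 here, as an assumption).
zero_vec = F.array(*[F.lit(0.0) for _ in range(128)])
df_fixed = df_fixed.withColumn(
    'embeddings',
    F.when(F.size('embeddings') == 0, zero_vec).otherwise(F.col('embeddings')),
)
```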
Thanks so much for the clarification. I think these are two limitations that could be addressed as improvements in a future release.
Hi all, I'm a new user of mosaicml-streaming on Databricks. I stumbled upon Mosaic ML (and Petastorm) for loading large data from PySpark into PyTorch tensors. Here is an example Jupyter notebook that I'm trying to replicate on my Databricks clusters; however, I have a few questions:

1. The notebook's requirements say that we need "Databricks Runtime for ML 15.2 or higher". However, my organization is on an earlier version. Can we use mosaic-streaming on earlier runtime versions?
2. The notebook imports "from petastorm import TransformSpec", but this blog says that Petastorm is deprecated and suggests using mosaic-streaming instead. I checked the code, and it only imports petastorm without using it. Can someone confirm whether this import is a mistake?
3. My understanding of mosaic-streaming is that it takes a PySpark dataframe as input and provides an API that returns a PyTorch dataloader, which can be used for ML training on the fly without writing the whole dataset out in some MDS format. Is my understanding correct? (See the sketch after this list.)
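For reference, the pattern I see in the tutorial looks roughly like the sketch below (the dataframe, paths, and batch size are placeholders I made up); my question is essentially whether the intermediate MDS write can be avoided:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset
from streaming.base.converters import dataframe_to_mds

# Assumed: a Databricks notebook where `spark` is available.
df_spark = spark.createDataFrame([(1, 0.5), (2, 0.7)], ['id', 'feature'])

# Step 1: materialize the dataframe as MDS shards on disk.
out_path = './data_loading/dummy_mosaic'
dataframe_to_mds(df_spark.repartition(2), merge_index=True,
                 mds_kwargs={'out': out_path})

# Step 2: stream those shards into a PyTorch DataLoader.
dataset = StreamingDataset(local=out_path, shuffle=True, batch_size=32)
loader = DataLoader(dataset, batch_size=32)
```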
PS: I started a discussion on your Slack community but was redirected to submit an issue here.