It provides tools to read data from various sources, apply transformations to the data, and load the results to a destination. Supported sources:
- Delimited files
- JSON files
- Fixed-length record files
- ZIP files containing delimited text files
- Avro (dependency: databricks-avro jar)
- Parquet
- Hive tables
Supported destinations:
- Delimited files
- JSON files
- Avro (dependency: databricks-avro jar)
- ORC
- Parquet
- Hive tables
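The read, transform, and load steps above can be sketched end to end. This is a hypothetical pure-Python illustration of the flow only; the library itself runs on Spark, and the function names here are not its real API.

```python
import csv
import json
import os
import tempfile

def extract_delimited(path, delimiter=","):
    """Read a delimited file into a list of dicts (one per record)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, delimiter=delimiter))

def transform(rows):
    """Example transformation: strip whitespace from every value."""
    return [{k: v.strip() for k, v in row.items()} for row in rows]

def load_json(rows, path):
    """Write the transformed records out as a JSON array."""
    with open(path, "w") as f:
        json.dump(rows, f)

# Tiny end-to-end run on a temporary pipe-delimited file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.psv")
dst = os.path.join(workdir, "out.json")
with open(src, "w") as f:
    f.write("name|city\nasha | Pune \nravi|Delhi\n")

load_json(transform(extract_delimited(src, delimiter="|")), dst)
with open(dst) as f:
    print(json.load(f))  # [{'name': 'asha', 'city': 'Pune'}, {'name': 'ravi', 'city': 'Delhi'}]
```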
- Performs upserts
- Does not support deletion of records (behaves like SCD Type 1)
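The SCD Type 1 upsert semantics can be sketched as follows: incoming records overwrite existing rows on a matching key or are inserted as new rows, and rows absent from the incoming batch are never deleted. This is an illustrative pure-Python sketch of the merge logic, not the library's Spark implementation.

```python
def upsert(target, incoming, key):
    """Merge incoming records into target on `key` (SCD Type 1 style)."""
    merged = {row[key]: row for row in target}
    for row in incoming:
        merged[row[key]] = row  # overwrite on key match, insert otherwise
    return list(merged.values())

existing = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}]
batch = [{"id": 2, "city": "Mumbai"}, {"id": 3, "city": "Chennai"}]
result = upsert(existing, batch, "id")
# id 2 is updated, id 3 is inserted, id 1 is retained (no deletes)
```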
Supported transformations:
- Apply a UDF to all string columns
- Single-column transformations
- Drop multiple columns
- Keep only specified columns
- Outlier detection and handling
- Missing value imputation with mean, median, mode, or a constant
- Report on all columns
- Comparison of data in two DataFrames
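The missing value imputation listed above can be sketched with the four fill strategies. This is a pure-Python illustration of the semantics; the library applies the same idea to Spark DataFrame columns, and the function name here is hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", constant=None):
    """Replace None entries with mean, median, mode, or a constant."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "mode":
        fill = mode(present)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

print(impute([1, None, 3], "mean"))  # [1, 2, 3]
```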
This library is a work in progress.