diff --git a/.ipynb_checkpoints/EDA_code-checkpoint.ipynb b/.ipynb_checkpoints/EDA_code-checkpoint.ipynb deleted file mode 100644 index b46f761..0000000 --- a/.ipynb_checkpoints/EDA_code-checkpoint.ipynb +++ /dev/null @@ -1,261 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "e488cd55-9305-4d28-ab18-d38621bf2c0d", - "metadata": {}, - "source": [ - "## Data Source Access in JupyterLab\n", - "\n", - "This code snippet demonstrates how to interact with a data source using the `domino.data_sources.DataSourceClient` from the Domino Data Lab environment. Specifically, it performs the following operations:\n", - "\n", - "1. **Initialization**: Instantiates a `DataSourceClient` object to interact with the available data sources.\n", - "2. **Data Source Fetching**: Retrieves a specific data source instance named \"winequality\".\n", - "3. **Object Listing**: Lists all objects available in the \"winequality\" data source.\n", - "\n", - "The commented sections of the code provide examples of additional operations:\n", - "- **Binary Content Retrieval**: Shows how to fetch the binary content of a specified object.\n", - "- **File Download**: Illustrates downloading the content of a specified object to a local file.\n", - "- **File Object Download**: Demonstrates downloading content directly into a Python `io.BytesIO()` file object for further manipulation within the notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e2745b72-251d-443e-89b9-3575823295e0", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from domino.data_sources import DataSourceClient\n", - "\n", - "# instantiate a client and fetch the datasource instance\n", - "object_store = DataSourceClient().get_datasource(\"mlops-best-practices\")\n", - "\n", - "# list objects available in the datasource\n", - "objects = object_store.list_objects()\n", - "\n", - "## get content as binary\n", - "# content = object_store.get(\"key\")\n", - "\n", - "## download content to file\n", - "# object_store.download_file(\"key\", \"./path/to/local/file\")\n", - "\n", - "## Download content to file object\n", - "# f = io.BytesIO()\n", - "# object_store.download_fileobj(\"key\", f)" - ] - }, - { - "cell_type": "markdown", - "id": "738c4a7a-e575-46c3-95ab-e25b45369922", - "metadata": {}, - "source": [ - "## Data Loading and Display in JupyterLab\n", - "\n", - "This code snippet is designed to demonstrate the process of loading and displaying data within a JupyterLab environment, particularly using the `pandas` library for handling CSV data. The operations performed are as follows:\n", - "\n", - "1. **Data Retrieval**: Retrieves the binary content of the \"WineQualityData.csv\" file from a data source, converting it to a UTF-8 string.\n", - "2. **String to Data Stream**: Converts the string data into a stream using `StringIO`, making it readable by pandas.\n", - "3. **Data Frame Creation**: Loads the data into a pandas DataFrame by reading from the StringIO object.\n", - "4. **Display Data**: Displays the first few rows of the DataFrame to provide a snapshot of the dataset.\n", - "\n", - "This snippet is particularly useful for quickly visualizing the structure and a portion of the data directly from a data source managed by Domino's `DataSourceClient`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e9c04968-a681-4715-88ce-b9b68cc927e3", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from io import StringIO\n", - "import pandas as pd\n", - "\n", - "s = str(object_store.get(\"credit_card_default.csv\"), 'utf-8')\n", - "data = StringIO(s)\n", - "\n", - "# Load only the specified columns\n", - "columns_to_load = ['ID', 'PAY_0', 'PAY_2', 'PAY_4', 'LIMIT_BAL', 'PAY_3', 'BILL_AMT1', 'default payment next month']\n", - "df = pd.read_csv(data, usecols=columns_to_load)\n", - "\n", - "# Rename the column after loading the data\n", - "df.rename({'default payment next month': 'DEFAULT'}, axis=1, inplace=True)\n", - "\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "5832f796-419b-4a8e-87a2-66cf212a276c", - "metadata": {}, - "source": [ - "## Visualizing Data Correlations in JupyterLab\n", - "\n", - "This code snippet uses Python libraries `seaborn` and `matplotlib` to visualize correlations between numeric features of a dataset within a JupyterLab environment. The operations performed include:\n", - "\n", - "1. **Column Creation**: Adds a new column `is_red` to the DataFrame `df`. This column is a binary indicator where 1 represents 'red' wine types based on the `type` column of the DataFrame.\n", - "2. **Figure Setup**: Sets up a figure with a specified size (10x10 inches) using `matplotlib`.\n", - "3. **Heatmap Generation**: Generates a heatmap of the correlation matrix of numeric-only columns in `df` using `seaborn`. Correlation values are annotated and formatted to one decimal place.\n", - "\n", - "This visualization is helpful for identifying relationships between different numeric features, especially in contexts like feature selection or initial data analysis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fb7bbd22-532e-4c71-b3c6-c3b0ad5ecd4d", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "import matplotlib.pyplot as plt\n", - "fig = plt.figure(figsize=(10,10))\n", - "sns.heatmap(df.corr(numeric_only=True), annot = True, fmt='.1g')" - ] - }, - { - "cell_type": "markdown", - "id": "c43a0bd9-b79c-4b47-8db2-826421a4990f", - "metadata": {}, - "source": [ - "## Feature Importance Visualization\n", - "\n", - "This code snippet demonstrates how to identify and visualize important features related to the 'quality' variable of a dataset using Python libraries `seaborn` and `matplotlib`. The snippet performs the following steps:\n", - "\n", - "1. **Correlation Calculation**: Computes the Pearson correlation coefficients between all numeric features and the 'quality' feature of the DataFrame `df`.\n", - "2. **Sorting and Filtering**: Sorts these coefficients by their values associated with 'quality' and filters out the 'quality' column itself. It then selects features with an absolute correlation value greater than 0.08, considering these as important features.\n", - "3. **Visualization Setup**: Sets the theme for the plot using `seaborn` and initializes a figure with a size of 16x5 inches.\n", - "4. **Bar Plot Creation**: Creates a bar plot to display the Pearson correlation values of the identified important features. The plot has a title and labels for clarity, and uses a 'seismic_r' color palette to differentiate the values.\n", - "\n", - "This approach is useful for quickly identifying which features have a significant correlation with the target variable 'quality', aiding in feature selection and preliminary data analysis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "17a5dd2e-e372-4f24-a9dd-73e32754a0e3", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Calculate the correlation and sort by 'DEFAULT'\n", - "corr_values = df.corr(numeric_only=True).sort_values(by='DEFAULT')['DEFAULT']\n", - "important_feats = corr_values[abs(corr_values) > 0.08]\n", - "print(important_feats)\n", - "\n", - "# Set the theme\n", - "sns.set_theme(style=\"darkgrid\")\n", - "\n", - "# Prepare the figure\n", - "plt.figure(figsize=(16, 5))\n", - "plt.title('Feature Importance for Credit Scoring')\n", - "plt.ylabel('Pearson Correlation')\n", - "\n", - "# Create a barplot without a palette argument, using a default color temporarily\n", - "ax = sns.barplot(x=important_feats.keys(), y=important_feats.values, color='gray')\n", - "\n", - "# Get colors from the 'seismic_r' palette based on the number of entries\n", - "palette = sns.color_palette(\"seismic_r\", len(important_feats))\n", - "\n", - "# Set the colors for each bar individually\n", - "for bar, color in zip(ax.patches, palette):\n", - " bar.set_color(color)\n", - "\n", - "# Show the plot\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "23e96d14-249c-4010-a346-7609a91be0c7", - "metadata": {}, - "source": [ - "## Histogram Visualization of Important Features\n", - "\n", - "This code snippet is designed to visualize the distribution of important features identified as having a significant correlation with wine quality, along with the distribution of the quality itself, using Python libraries `seaborn` and `matplotlib`. The snippet executes the following steps:\n", - "\n", - "1. **Loop Through Features**: Iterates over the keys of the `important_feats` dictionary (features with a strong correlation to 'quality') and includes the 'quality' column itself.\n", - "2. **Histogram Plotting**:\n", - " - For each feature in the loop, it initializes a new figure with a predefined size (8x5 inches).\n", - " - Sets a title specific to the feature being plotted.\n", - " - Uses `seaborn.histplot` to create a histogram with a kernel density estimate (KDE) overlay for each feature. This helps in visualizing the distribution and density of the data points.\n", - "\n", - "This method provides a detailed look at the distribution characteristics of each key feature, assisting in understanding the variability and distribution trends within the data set." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc88b823-03c7-413b-b194-4b60b95f4258", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "for i in list(important_feats.keys()) + ['DEFAULT']:\n", - " plt.figure(figsize=(8, 5))\n", - " plt.title(f'Histogram of {i}')\n", - " sns.histplot(df[i].dropna(), kde=True)" - ] - }, - { - "cell_type": "markdown", - "id": "e84b43d0-5fa3-410d-b919-3250b675a97c", - "metadata": {}, - "source": [ - "## Saving DataFrame to CSV in a Project-Specific Path\n", - "\n", - "This code snippet demonstrates how to save a pandas DataFrame to a CSV file in a project-specific directory within the JupyterLab environment. The snippet carries out the following operations:\n", - "\n", - "1. **Path Construction**: Constructs the file path using the environment variable `DOMINO_PROJECT_NAME` to dynamically create a directory path within `/mnt/data/`. This path points to where the 'WineQualityData.csv' will be saved, ensuring the file location is relative to the current Domino project.\n", - "2. **Save DataFrame**: Utilizes the `to_csv` method of the pandas DataFrame `df` to write the DataFrame to the constructed path without including the index column in the output file.\n", - "\n", - "This approach ensures that the output CSV file is easily accessible within the specific context of the current Domino project, promoting better organization and data management practices." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2f729616-c760-4d59-94bd-a83c35645d88", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import os\n", - "path = str('/mnt/data/mlops-best-practices/credit_card_default.csv')\n", - "df.to_csv(path, index = False)" - ] - } - ], - "metadata": { - "dca-init": "true", - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.18" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/.ipynb_checkpoints/Readme-checkpoint.md b/.ipynb_checkpoints/Readme-checkpoint.md deleted file mode 100644 index c9192cd..0000000 --- a/.ipynb_checkpoints/Readme-checkpoint.md +++ /dev/null @@ -1,14 +0,0 @@ -# Domino Hands-On Workshop: Predictions - -#### In this workshop, you will work through an end-to-end workflow broken into various labs to - - -* Read in data from a live datasource -* Prepare your data in an IDE of your choice, with an option to leverage distributed computing clusters -* Train several models in various frameworks -* Compare model performance across different frameworks and select the best-performing model -* Deploy model to a containerized endpoint and web-app frontend for consumption -* Leverage collaboration and documentation capabilities to make all work reproducible and sharable! - -You will find a full walkthrough of our Workshop here: [VERSION 6.x WORKSHOP LINK](https://docs.google.com/document/u/4/d/11eA3ney10KzX7GF9G7f5n72f4p7k7CHSpLxoUfbAGE8/pub) - -You will find a full walkthrough of our Workshop here: [VERSION 5.x WORKSHOP LINK](https://docs.google.com/document/d/e/2PACX-1vS9LKbBYYOrsDmshmKvEIUkDMYVMAivoodg1CTEgjZRPW_IJFV2Un4l5uaE2jI1BsbN3-tQ8IMSkGoL/pub) \ No newline at end of file diff --git a/.ipynb_checkpoints/app-checkpoint.sh b/.ipynb_checkpoints/app-checkpoint.sh deleted file mode 100755 index fea5999..0000000 --- a/.ipynb_checkpoints/app-checkpoint.sh +++ /dev/null @@ -1,15 +0,0 @@ -mkdir ~/.streamlit -echo "[browser]" > ~/.streamlit/config.toml -echo "gatherUsageStats = true" >> ~/.streamlit/config.toml -echo "serverAddress = \"0.0.0.0\"" >> ~/.streamlit/config.toml -echo "serverPort = 8888" >> ~/.streamlit/config.toml -echo "[server]" >> ~/.streamlit/config.toml -echo "port = 8888" >> ~/.streamlit/config.toml -echo "enableCORS = false" >> ~/.streamlit/config.toml -echo "enableXsrfProtection = false" >> ~/.streamlit/config.toml - -streamlit run apps/streamlit_app.py - -# python app/rai.py - -# R -e 'shiny::runApp("/mnt/scripts/shiny_app.R", port=8888, host="0.0.0.0")' \ No newline at end of file