fmind committed Mar 28, 2024
1 parent 375ead0 commit 4236337
# 3.3. Entrypoints

## What are package entrypoints?

Package entrypoints are mechanisms in Python packaging that facilitate the exposure of scripts and utilities to end users. Entrypoints streamline the process of integrating and utilizing the functionalities of a package, whether that be through command-line interfaces (CLI) or by other software packages.

To elaborate, entrypoints are specified in a package's setup configuration, marking certain functions or classes to be directly accessible. This setup benefits both developers and users by simplifying access to a package's capabilities, improving interoperability among different software components, and enhancing the user experience by providing straightforward commands to execute tasks.

## Why do I need to set up entrypoints?

Entrypoints are essential for making specific functionalities of your package directly accessible from the command-line interface (CLI) or to other software. By setting up entrypoints, you allow users to execute components of your package directly from the CLI, streamlining operations like script execution, service initiation, or utility invocation. Additionally, entrypoints facilitate dynamic discovery and utilization of your package's functionalities by other software and frameworks, such as Apache Airflow, without the need for hard-coded paths or module names. This flexibility is particularly beneficial in complex, interconnected systems where adaptability and ease of use are paramount.

## How do I create entrypoints with poetry?

Creating entrypoints with Poetry involves specifying them in the `pyproject.toml` file under the `[tool.poetry.scripts]` section. This section outlines the command-line scripts that your package will make available:

```toml
[tool.poetry.scripts]
bikes = 'bikes.scripts:main'
```

In this syntax, `bikes` represents the command users will enter in the CLI to activate your tool. The path `bikes.scripts:main` directs Poetry to execute the `main` function found in the `scripts` module of the `bikes` package. Upon installation, Poetry generates an executable script for this command, integrating your package's functionality seamlessly into the user's command-line environment, alongside other common utilities:

```bash
$ poetry run bikes one two three
```

This snippet runs the `bikes` entrypoint from the CLI and passes three positional arguments: `one`, `two`, and `three`.

## How can I use this entrypoint in other software?

Defining and installing a package with entrypoints enables other software to leverage them easily. For example, within Apache Airflow, you can add a task to a Directed Acyclic Graph (DAG) that executes one of your CLI tools as part of an automated workflow. Locally, Airflow's `BashOperator` or `PythonOperator` can invoke your package's CLI tool directly; on Databricks, the `DatabricksSubmitRunOperator` can run it as a Python wheel task:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# Define default arguments for your DAG
default_args = {...}

# Create a DAG instance
with DAG(
    'databricks_submit_run_example',
    default_args=default_args,
    description='An example DAG to submit a Databricks job',
    schedule_interval='@daily',
    catchup=False,
) as dag:
    # Define a task to submit a job to Databricks
    submit_databricks_job = DatabricksSubmitRunOperator(
        task_id='main',
        json={
            "python_wheel_task": {
                "package_name": "bikes",
                "entry_point": "bikes",
                "parameters": ["one", "two", "three"],
            },
        },
    )

    # Set task dependencies and order (if you have multiple tasks)
    # In this simple example, there's only one task
    submit_databricks_job
```

In this example, `submit_databricks_job` is a task that executes the `bikes` entrypoint.

## How can I use this entrypoint from the command-line (CLI)?

Once your Python package has been packaged with Poetry and a wheel file is generated, you can install and use the package directly from the command-line interface (CLI). Here are the steps to accomplish this:

1. **Build your package:** Use Poetry to compile your project into a distributable format, such as a wheel file. This is done with the `poetry build` command, which generates the package files in the `dist/` directory.

```bash
poetry build
```

2. **Install your package:** With the generated wheel file (`*.whl`), use `pip` to install your package into your Python environment. The `pip install` command looks for the wheel file in the `dist/` directory, matching the pattern `bikes*.whl`, which is the package file created by Poetry.

```bash
pip install dist/bikes*.whl
```

3. **Run your package from the CLI:** After installation, you can invoke the package's entrypoint—defined in your `pyproject.toml` file—directly from the command line. In this case, the `bikes` command followed by any necessary arguments. If your entrypoint is designed to accept arguments, they can be passed directly after the command. Ensure the arguments are separated by spaces unless specified otherwise in your documentation or help command.

```bash
bikes one two three
```

## What should be the inputs and outputs of my entrypoint?

**Inputs** for your entrypoint can vary based on the requirements and functionalities of your package but typically include:

- **Configuration files (e.g., JSON, YAML, TOML):** These files can define essential settings, parameters, and options required for your tool or package to function. Configuration files are suited for static settings that remain constant across executions, such as environment settings or predefined operational parameters.
- **Command-line arguments (e.g., --verbose, --account):** These arguments provide a dynamic way for users to specify options, flags, and parameters at runtime, offering adaptability for different operational scenarios.

**Outputs** from your entrypoint should be designed to provide valuable insights and effects, such as:

- **Side effects:** The primary purpose of your tool or package, which could include data processing, report generation, or initiating other software processes.
- **Logging:** Detailed logs are crucial for debugging, monitoring, and understanding how your tool or package operates within larger systems or workflows.
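
As a sketch of the logging side, an entrypoint can emit structured progress messages using only the standard library (the logger name and messages are illustrative):

```python
# Minimal logging setup for an entrypoint, using only the standard library.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s - %(message)s")
logger = logging.getLogger("bikes")

def run(job_name: str) -> str:
    """Run a job while logging start and completion markers."""
    logger.info("Starting job: %s", job_name)   # progress visible to schedulers
    logger.info("Finished job: %s", job_name)   # completion marker for monitoring
    return job_name

run("training")
```

Schedulers like Airflow capture this output, which makes execution traceable across workflows.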

Careful design of your entrypoints' inputs and outputs ensures your package can be integrated and used efficiently across a wide range of environments and applications, maximizing its utility and effectiveness.
# 3.4. Configurations

## What are configurations?

Configurations consist of parameters or constants for the operation of your program, externalized to allow flexibility and adaptability. They can be provided through various means, such as [environment variables](https://en.wikipedia.org/wiki/Environment_variable), [configuration files](https://en.wikipedia.org/wiki/Configuration_file), or [command-line interface (CLI) arguments](https://en.wikipedia.org/wiki/Command-line_interface#Arguments). For instance, a YAML configuration file might look like this:

```yaml
job:
  KIND: TrainingJob
  inputs:
    KIND: ParquetReader
    path: data/inputs.parquet
  targets:
    KIND: ParquetReader
    path: data/targets.parquet
```
This structure allows for easy adjustment of parameters like file paths or job kinds, facilitating the program's operation across diverse environments and use cases.

## Why do I need to write configurations?

Configurations enhance your code's flexibility, making it adaptable to different environments and scenarios without source code modifications. This separation of code from its execution environment boosts portability and simplifies updates or changes, much like adjusting settings in an application without altering its core functionality.

## Which file format should I use for configurations?

When choosing a format for configuration files, common options include [JSON](https://www.json.org/json-en.html), [TOML](https://toml.io/en/), and [YAML](https://yaml.org/). YAML is frequently preferred for its readability, ease of use, and support for comments, which can be particularly helpful for documentation and maintenance. However, be aware that YAML files can encode malicious content; always [load them safely](https://pyyaml.org/wiki/PyYAMLDocumentation).

## How should I pass configuration files to my program?

Passing configuration files to your program typically utilizes the CLI, offering a straightforward method to integrate configurations with additional command options or flags. For example:

```bash
$ bikes defaults.yaml training.yaml --verbose
```

This example combines configuration files with a verbosity flag for more detailed logging. The same approach extends to configurations stored on cloud services, provided your application supports such paths.

## Which toolkit should I use to parse and load configurations?

For handling configurations in Python, [OmegaConf](https://omegaconf.readthedocs.io/) offers a powerful solution with features like YAML loading, deep merging, variable interpolation, and read-only configurations. It's particularly suited for complex settings and hierarchical structures. Additionally, for applications involving cloud storage, [cloudpathlib](https://cloudpathlib.drivendata.org/stable/) facilitates direct loading from services like AWS, GCP, and Azure.

Utilizing Pydantic for configuration validation ensures that your application behaves as expected by catching mismatches or errors in configuration files early in the process, thereby avoiding potential failures after long-running jobs.

```python
import pydantic as pdt

class TrainTestSplitter(pdt.BaseModel):
    """Split a dataframe into a train and test set.

    Parameters:
        shuffle (bool): shuffle the dataset. Default is False.
        test_size (int | float): number/ratio for the test set.
        random_state (int): random state for the splitter object.
    """

    shuffle: bool = False
    test_size: int | float
    random_state: int = 42
```

## When should I use environment variables instead of configuration files?

Environment variables are more suitable for simple configurations or when dealing with sensitive information that shouldn't be stored in files, even though they lack the structure and type-safety of dedicated configuration files. They are universally supported and easily integrated but may become cumbersome for managing complex or numerous settings.

```bash
$ MLFLOW_TRACKING_URI=./mlruns bikes one two three
```

In this example, `MLFLOW_TRACKING_URI` is passed as an environment variable to the `bikes` program, while the command also receives three positional arguments: `one`, `two`, and `three`.
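
A sketch of how the program might read such a variable with a fallback, using only the standard library (the default value is illustrative):

```python
# Reading an optional setting from the environment with a fallback default.
# MLFLOW_TRACKING_URI matches the variable above; the default is illustrative.
import os

tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "./mlruns")
print(f"Tracking URI: {tracking_uri}")
```

Using `os.environ.get()` with a default keeps the program usable even when the variable is unset.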

## What are the best practices for writing and loading configurations?

To ensure effective configuration management:

- Always use `yaml.safe_load()` to prevent the execution of arbitrary code.
- Utilize context managers for handling files to ensure proper opening and closing.
- Implement robust error handling for I/O operations and parsing.
- Validate configurations against a schema to confirm correctness.
- Avoid storing sensitive information in plain text; instead, use secure mechanisms.
- Provide defaults for optional parameters to enhance usability.
- Document configurations with comments for clarity.
- Maintain a consistent format across configuration files for better readability.
- Consider versioning your configuration format to manage changes effectively in larger projects.
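
Several of these practices can be combined in one small loader, sketched here assuming PyYAML is installed (the structure check is illustrative):

```python
# A sketch combining safe loading, context management, error handling,
# and basic validation. Assumes PyYAML (`pip install pyyaml`).
import yaml

def load_config(path: str) -> dict:
    """Load a YAML configuration file safely and check its basic structure."""
    try:
        with open(path) as file:           # context manager closes the file
            config = yaml.safe_load(file)  # safe_load never runs arbitrary tags
    except (OSError, yaml.YAMLError) as error:
        raise RuntimeError(f"Cannot load configuration: {path}") from error
    if not isinstance(config, dict):
        raise ValueError(f"Expected a mapping at the top level of {path}")
    return config
```

Schema validation with a library like pydantic, as shown earlier, can then be layered on top of the returned dictionary.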