2.2 Controller architecture
The controller folder contains the controller component implementation. The controller is composed of the files listed below; each file name describes which functionality/class can be found within it.
The controller exposes a fixed port unique to it. The port is defined in its Dockerfile/launch.json and should not be changed, as the frontend has to know the port and cannot discover it dynamically.
On startup, the controller launches its GRPC server and waits for incoming connections. The controller GRPC interface provides the following functionality to the frontend:
- CreateNewUser: generates a new unique UUID to identify a user (user id), which is connected to an AspNetUser in the frontend MS SQL database
- GetHomeOverviewInformation: gets the collection of overview information displayed on the user's home overview page (# of datasets, # of trainings, ...)
- CreateDataset: creates a new dataset record in the MongoDB and persists the dataset file accordingly on the controller side
- GetDatasets: returns a list of all existing dataset records of a user from the MongoDB
- GetDataset: returns a specific dataset record by dataset id and the user id from the MongoDB
- GetTabularDatasetColumn: returns all column names and first entries of a specific tabular dataset (TODO: verify whether this is still in use)
- DeleteDataset: deletes a specific dataset record and its associated training, models, and prediction records in the MongoDB and all associated files from the disc
- SetDatasetFileConfiguration: updates the dataset's file configuration and performs a new dataset analysis
- CreateTraining: creates a new training record in the MongoDB and executes a new training session with the AutoML adapters
- GetTrainings: returns a list of all existing training records of a user from the MongoDB
- GetTraining: returns a specific training record by training id and the user id from the MongoDB
- DeleteTraining: deletes a specific training record and its associated models, and prediction records in the MongoDB and all associated files from the disc
- GetModels: returns a list of all existing model records by user id and training id from the MongoDB
- GetModel: returns a specific model record by model id and the user id from the MongoDB
- DeleteModel: deletes a specific model record and its associated prediction records in the MongoDB and all associated files from the disc
- CreatePrediction: creates a new prediction record in the MongoDB and executes the new prediction using the associated AutoML adapter
- GetPredictions: returns a list of all existing prediction records by user id and model id from the MongoDB
- GetPrediction: returns a specific prediction record by user id and prediction id from the MongoDB
- DeletePrediction: deletes a specific prediction record and all associated files from the disc
- GetAutoMlSolutionsForConfiguration: returns a list of all AutoML solutions from the Ontology that are compatible with the current training wizard configuration of the frontend
- GetAvailableStrategies: returns a list of all training strategies from the Ontology that are compatible with the current training task
- GetDatasetTypes: returns a list of all dataset types from the Ontology
- GetMlLibrariesForTask: returns a list of all ML libraries from the Ontology compatible with the ML task selected in the training wizard
- GetObjectInformation: returns the complete object information of a specific RDF key from the Ontology
- GetTasksForDatasetType: returns a list of all ML tasks from the Ontology that are compatible with the selected dataset type
From the GRPC interface functionality listed above, this section dives deeper into the functions that perform further actions beyond simply interacting with the Ontology or MongoDB.
After a new dataset record has been created in MongoDB and the dataset file has been saved to its correct disc path, the background task of analyzing the dataset is started. SetDatasetFileConfiguration also launches a new dataset analysis, since with a changed file configuration a previously unreadable dataset may become readable, for example when the uploaded file does not follow the OMA-ML default file configuration.
The following sequence diagram displays the logic executed when the controller receives a new training request. This workflow starts after a new training record has been inserted into MongoDB; the function calls that create or update the training and model records are not shown, as they would overload the sequence diagram.
After the new training record has been created in MongoDB, the injected AdapterRuntimeScheduler initiates a new training session by creating a new AdapterRuntimeManager. The AdapterRuntimeManager takes charge of creating the appropriate AdapterManagers (representing the required AutoML adapter connections). Furthermore, the AdapterRuntimeManager initiates a new Blackboard for this specific session (see Blackboard below for more information) and later the ExplainableAIManager. After the required training components are initialized, the AdapterRuntimeManager waits for the Blackboard to update the phase to running before starting the actual thread that connects to the AutoML adapters. When the running phase update occurs, a background thread is started for every AdapterManager of this training. Each AdapterManager sends the StartAutoML request to its respective adapter and then polls for new AutoMlStatus messages until the AutoML adapter finishes its search process (see the adapter architecture for more information about the search process). When the final AutoMlStatus message is received, the AdapterManager uses the callback function passed during its initialization to notify the AdapterRuntimeManager. The AdapterRuntimeManager persists the received model information and labels the training as finished for this adapter. It then initiates a new ExplainableAIManager for every AdapterManager to begin the model analysis. This task runs in the background because it takes several minutes, but the model can already be used or downloaded via the frontend. The ExplainableAIManager sends several requests with test data to the AutoML adapter. The found model predicts this test data to compute a probability metric, which is needed to compute the SHAP models explaining the found model (how features impact the model's predictions).
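The poll-then-callback interaction between an AdapterManager and the AdapterRuntimeManager can be sketched as follows. This is a minimal illustration only: the fake adapter and all names beyond those mentioned above are assumptions, not the controller's actual API.

```python
class FakeAdapter:
    """Stands in for a remote AutoML adapter (illustrative assumption).

    The real AdapterManager talks to an adapter over GRPC; here the
    adapter is simulated by a fixed sequence of status messages.
    """
    def __init__(self):
        self._statuses = iter(["running", "running", "completed"])

    def get_auto_ml_status(self):
        return next(self._statuses)


class AdapterManagerSketch:
    def __init__(self, adapter, on_finished):
        self._adapter = adapter
        # Callback passed at initialization, used to notify the
        # AdapterRuntimeManager once the search process finishes.
        self._on_finished = on_finished

    def run(self):
        # Poll for new status messages until the adapter finishes.
        while True:
            status = self._adapter.get_auto_ml_status()
            if status == "completed":
                self._on_finished(status)
                break


results = []
manager = AdapterManagerSketch(FakeAdapter(), results.append)
manager.run()
```

In the real controller this loop runs on a background thread per AdapterManager, so multiple adapters can search concurrently.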
When the ExplainableAIManager concludes, the entire model analysis is finished: all plots are persisted on the disc and the model record in MongoDB is updated with their paths.
The following sequence diagram displays the logic executed when the controller receives a new prediction request. This workflow starts after a new prediction record has been inserted into MongoDB; the function calls that create or update the prediction are not shown, as they would overload the sequence diagram.
After the new prediction record has been created in MongoDB, the injected AdapterRuntimeScheduler initiates a new prediction by creating a new AdapterRuntimePredictionManager. The AdapterRuntimePredictionManager takes charge of creating the appropriate AdapterPredictionManager (representing the required AutoML adapter connection). After initiating the appropriate components, the AdapterRuntimePredictionManager starts a new background task using the AdapterPredictionManager, as the prediction task may require several minutes to conclude. The AdapterPredictionManager connects to the requested adapter and starts the PredictModel request; the adapter loads the correct model and makes a prediction on the live dataset. The prediction result is persisted as a CSV file and its path is returned to the AdapterPredictionManager to be persisted in MongoDB. Finally, the prediction_completed_callbacks are called by the AdapterPredictionManager and AdapterRuntimePredictionManager to delete the objects.
The Blackboard and its associated StrategyController are the center of the training workflow and are created individually for every training session. Every component that influences or holds information about the training has a respective Agent class:
- AdapterRuntimeManagerAgent -> agent for AdapterRuntimeManager
- AdapterManagerAgent -> agent for AdapterManager
- DataAnalysisAgent -> agent for the data analysis (since this step is performed at dataset upload, this agent has instant access to the result and does not need a unique counterpart)
When a new agent instance is created, it holds a reference to its counterpart and registers itself with the blackboard. The blackboard therefore has access to and knowledge of all existing agents for its training session.
While the training is active, the StrategyController runs a background thread that routinely checks all agents for new information they can contribute. If new information is available, the current state is re-evaluated and the StrategyController takes new steps to guide the training session.
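The registration and polling mechanics described above can be sketched as follows. Only "Blackboard" and "Agent" come from the text; all other names and the shared-state structure are illustrative assumptions.

```python
class Blackboard:
    """Shared data structure for one training session."""
    def __init__(self):
        self.agents = []
        self.common_state = {}

    def register(self, agent):
        self.agents.append(agent)


class Agent:
    """Wraps one component (its counterpart) so the blackboard can poll it."""
    def __init__(self, blackboard, name, counterpart):
        self.name = name
        self.counterpart = counterpart  # e.g. an AdapterRuntimeManager
        blackboard.register(self)       # agents register themselves

    def has_contribution(self):
        # Simplified: the component "contributes" once it holds data.
        return self.counterpart is not None

    def contribute(self, blackboard):
        blackboard.common_state[self.name] = self.counterpart


board = Blackboard()
Agent(board, "data_analysis", {"rows": 100})  # analysis result available
Agent(board, "adapter_runtime", None)         # nothing to report yet

# One pass of the StrategyController's periodic check: poll every agent
# and collect new contributions into the shared state.
for agent in board.agents:
    if agent.has_contribution():
        agent.contribute(board)
```

In the controller this check runs repeatedly on a background thread, and a contribution triggers re-evaluation of the strategy state rather than a plain dictionary update.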
The strategies the controller supports are each defined in an individual strategy class. Currently the DataPreparationStrategy is supported; its strategies influence the dataset during the data preparation steps of the training process.
- Change the data type of all redundant features to IGNORE.
- TODO (missing logic): change the dataset to omit redundant samples.
- TODO (missing logic): split the dataset into a smaller subset used during the adapter training.
- Update the Blackboard phase to running, initiating the actual adapter training. This step is required because otherwise the StrategyController would not know when the preprocessing is completed.
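The implemented steps above can be sketched as follows; the schema dictionary layout and function name are assumptions for illustration, not the controller's actual data structures.

```python
def apply_data_preparation(schema):
    """Sketch of the DataPreparationStrategy steps.

    `schema` is a hypothetical mapping of column name -> column info.
    """
    # Step 1: change the data type of all redundant features to IGNORE.
    for column, info in schema.items():
        if info.get("redundant"):
            info["datatype"] = "IGNORE"
    # (Omitting redundant samples and splitting the dataset are marked
    # TODO in the controller, so they are not sketched here.)
    # Final step: move the blackboard phase to "running" so the
    # StrategyController knows preprocessing is complete.
    return "running"


schema = {
    "id": {"redundant": True, "datatype": "integer"},
    "price": {"redundant": False, "datatype": "float"},
}
phase = apply_data_preparation(schema)
```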
This strategy first runs all adapters with only a small part of the data, then trains the best half of the solutions again with more data, and so on, until one last adapter is trained with the full data.
Input by the user:
- time t in minutes
- amount of AutoML-solutions m
The following calculations distribute the given time and determine the dataset size of each iteration:
General calculations:
- number of pre-trainings: n = INT(log2(m))
- sum_s = SUM(i=0..n) s_i, with s_i = 0.5^n * 2^i
For each iteration i <= n, the following values are calculated:
- number of AutoML solutions: m_i = INT(m * 0.5^i)
- dataset size (fraction of the full dataset): s_i = 0.5^n * 2^i
- time budget: t_i = t * s_i / sum_s
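Interpreting these formulas literally, the whole schedule can be computed in a few lines; the function name and the tuple layout are illustrative:

```python
import math

def halving_schedule(t, m):
    """Compute (m_i, s_i, t_i) per iteration for time t and m solutions."""
    n = int(math.log2(m))                       # number of pre-trainings
    s = [0.5 ** n * 2 ** i for i in range(n + 1)]
    sum_s = sum(s)                              # normalization for time split
    schedule = []
    for i in range(n + 1):
        m_i = int(m * 0.5 ** i)                 # solutions kept in iteration i
        t_i = t * s[i] / sum_s                  # share of the total time budget
        schedule.append((m_i, s[i], t_i))
    return schedule


# Example: 60 minutes and 8 AutoML solutions give n = 3, i.e. four
# iterations; the last iteration trains one adapter on the full data.
plan = halving_schedule(60, 8)
```

Note that the dataset fractions double each iteration while the number of surviving solutions halves, and the time budgets t_i always sum to the total time t.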
Data sampling is a technique employed to ensure a well-balanced distribution of classes within a dataset. Most sampling techniques require additional information about the dataset itself, which is not available here, so the focus is reduced to two key strategies: upsampling and downsampling. Upsampling involves augmenting the number of data points in the minority class, addressing class imbalance by generating additional instances. Conversely, downsampling reduces the number of data points in the majority class to create a more equitable representation of classes. Combined, these approaches contribute to a balanced training set, thereby enhancing the overall performance of machine learning models.
- If the length of a class falls below the 25% quantile, the minority class is upsampled with replacement to match the size at the 25% quantile.
- If the length of a class is between the 25% and 50% quantiles, the minority class undergoes upsampling with replacement to achieve the size at the 50% quantile.
- If the length of a class is between the 50% quantile and the mean, the majority class is downsampled without replacement to match the size at the 50% quantile.
- If the length of a class is between the mean and the 75% quantile, the majority class is downsampled without replacement to reach the size at the mean.
- If the length of a class is at or exceeds the 75% quantile, the majority class undergoes downsampling without replacement to align with the size at the 75% quantile.
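These rules can be sketched as a function that maps each class to a target size; the function name, the input format, and the boundary handling are assumptions, not the controller's actual implementation.

```python
import statistics

def target_sizes(class_sizes):
    """Map each class label to its target size per the quantile rules.

    `class_sizes` is a dict of label -> number of samples. Boundary
    cases (a class exactly at a quantile) are handled approximately.
    """
    sizes = sorted(class_sizes.values())
    q25, q50, q75 = statistics.quantiles(sizes, n=4)
    mean = statistics.mean(sizes)
    targets = {}
    for label, size in class_sizes.items():
        if size < q25:
            targets[label] = int(q25)   # upsample with replacement
        elif size < q50:
            targets[label] = int(q50)   # upsample with replacement
        elif size < mean:
            targets[label] = int(q50)   # downsample without replacement
        elif size < q75:
            targets[label] = int(mean)  # downsample without replacement
        else:
            targets[label] = int(q75)   # downsample without replacement
    return targets


# Hypothetical class distribution: "a" is a clear minority class.
targets = target_sizes({"a": 10, "b": 100, "c": 50, "d": 60})
```

The elif chain assumes the usual ordering q25 <= q50 <= mean <= q75; heavily skewed distributions where the mean falls outside that range would need extra care.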
For effective data sampling, crucial information such as the dataset schema, file configuration, target column/feature, and dataset path is necessary. The process begins by reading the dataset using the read_dataset function of the CsvManager. If the dataset is available, data sampling is applied and the resulting sampled data is saved as a CSV file in the original dataset path with "sampled" in its name. If more information about the datasets becomes available in the future, data sampling will perform even better.
- Simple Implementation: The approach is relatively easy to implement and does not require complex calculations or intricate code.
- Flexibility in Class Adjustment: The use of quantiles provides a degree of flexibility in adjusting classes, especially when dealing with different data distributions.
- Discrete Decision Points: Fixing specific quantile values (e.g., 25%, 50%, 75%) results in discrete decision points, potentially overlooking subtle differences in the data.
- Sensitivity to Outliers: The approach may be susceptible to outliers in the data, as they can influence the quantiles and lead to non-representative samples.
In the future, subsequent teams may consider defining specific thresholds. This includes determining, for instance, the point at which the number of features becomes impractical, specifying the minimum required number of samples in a dataset, or deciding whether it is appropriate to remove classes entirely because they are significantly underrepresented. Currently, the strategy becomes available once a specific target column and the machine learning task "Classification" have been chosen.
The Optimum Strategy evaluates the training processes and aims to determine the runtime at which models achieve the best accuracy.
At the start, when a training session is initiated, it is counted as Training 1 (T1). The program then enters the Optimum Strategy phase, records the accuracy, and marks the training_id as parent_training_id in the database. The Controller Manager then initiates a new training session, marking its training_id as the child ID.
Training 2 is executed, and at the end its models' accuracies are compared with those of Training 1. If the accuracies improve or remain the same, another training session is initiated with the same data; this process continues until the accuracies no longer improve.
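The stopping condition of this loop can be sketched as follows. The training function is simulated and all names are illustrative; the real strategy compares model accuracies stored in MongoDB across parent and child trainings.

```python
def optimum_strategy(run_training, max_rounds=10):
    """Launch child trainings until accuracy stops improving.

    `run_training(training_id)` simulates one full training session
    and returns its best model accuracy (assumption for illustration).
    """
    history = []                        # (training_id, accuracy) pairs
    parent_id = 1                       # Training 1 = parent_training_id
    best = run_training(parent_id)
    history.append((parent_id, best))
    for child_id in range(2, max_rounds + 2):
        acc = run_training(child_id)    # child session, same data
        history.append((child_id, acc))
        if acc < best:                  # accuracy degraded -> stop
            break
        best = acc                      # improved or equal -> continue
    return best, history


# Simulated accuracies: improve, plateau, then degrade.
accuracies = {1: 0.70, 2: 0.75, 3: 0.75, 4: 0.73}
best, history = optimum_strategy(lambda tid: accuracies[tid])
```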
- ControllerServer: The application's main entry point, initiates the controller GRPC server.
- ControllerServiceManager: The server implementation of the compiled controller GRPC interface functionality, the entry point for all incoming GRPC requests is located here.
- Container: Collection of dependency injection containers, all dependency injected objects are initialized here.
- ControllerBGRPC: the compiled GRPC server and client interface definition for the controller, the server is implemented in ControllerServiceManager.
- AdapterBGRPC: the compiled GRPC server and client interface definition for the adapters, the server is implemented in the adapters.
All incoming GRPC requests from the controller server are routed to their respective manager instance.
- DatasetManager: the GRPC interface functionality to manipulate dataset records (Create, Get, Update, Delete) is located here
- UserManager: all functions used by a user to perform general actions (for example: create a user, or get home page overview information)
- PredictionManager: the GRPC interface functionality to manipulate prediction records (Create, Get, Update, Delete) is located here
- ModelManager: the GRPC interface functionality to manipulate model records (Get, Update, Delete) is located here
- TrainingManager: the GRPC interface functionality to manipulate training records (Create, Get, Update, Delete) is located here
- OntologyManager: the functionality for the GRPC interface to retrieve ontology information is located here.
- Queries: holds the RDF query strings used to retrieve information from the ontology
- ThreadLock: a multithreading lock providing thread safety, used by the ExplainableAIManager and DatasetAnalysisManager to avoid issues with Matplotlib
- LongitudinalDataManager: a static class implementing functionality to read longitudinal datasets (TS format) and retrieve information about its columns and rows.
- CsvManager: a static class that implements functionality for reading and writing CSV datasets, as well as creating a default dataset for a new user. To read the CSV dataset, information such as the dataset path, dataset schema, and dataset configuration is required. These details are extracted by the agent and runtime manager.
- AdapterRuntimeScheduler: main manager object responsible for creating new training and prediction sessions upon frontend requests.
- AdapterRuntimePredictionManager: an object instance of this class represents a single prediction request from the frontend; it creates the AdapterPredictionManager object to issue a new prediction request to an AutoML adapter.
- AdapterPredictionManager: Allows the connection to a single AutoML adapter to perform a new prediction and persists the result in MongoDB
- AdapterRuntimeManager: an object instance of this class represents a single training request from the frontend; it creates the AdapterManager objects to issue new training requests to the AutoML adapters.
- AdapterManager: Allows the connection to a single AutoML adapter to start a new training session and poll its status until the adapter finishes. It also provides the GRPC connection to the explainableAIManager to connect to the same adapter and perform the model explanations needed by it.
- MongoDbClient: MongoDB database interface API.
- DataStorage: Centralized Access to File System and Database by using the MongoDBClient
- DatasetAnalysisManager: generates the dataset analysis of a dataset; currently the disc size and creation date are saved for all types, but only CSV-based datasets are further explored to retrieve information about their composition (see dataset analysis in dataset schema)
- ExplainableAIManager: uses a successfully found model to generate probabilities, which in turn are used to generate graphs that explain the feature impact on a model's decisions.
- Blackboard: Central blackboard component which acts as the shared / common data structure for all training data.
- StrategyController: Strategy controller which supervises the blackboard.
- IAbstractStrategy: Interface representing the functionality any controller strategy must provide.
- DataPreparationStrategy: Collection of data preparation strategy functionality executed on blackboard phase data preprocessing.
- IAbstractBlackboardAgent: Interface representing the functionality any blackboard agent must provide.
- AdapterRuntimeManagerAgent: Agent implementation for the AdapterRuntimeManager, allowing the blackboard to poll information from the AdapterRuntimeManager of a training session
- DataAnalysisAgent: Agent implementation for the DataAnalysis, allowing the blackboard to poll information from the analysis of the dataset
- AdapterManagerAgent: Agent implementation for the AdapterManager, allowing the blackboard to poll information from an AdapterManager of a training session
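The CsvManager described above needs the file configuration to read a dataset correctly. How such a configuration might drive the reading can be sketched with the standard library; the configuration keys and function name are illustrative assumptions, not the controller's actual schema.

```python
import csv
import io

def read_dataset_sketch(text, configuration):
    """Read CSV text according to a hypothetical file configuration."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=configuration.get("delimiter", ","),
    )
    rows = list(reader)
    # Split off the header row if the configuration declares one.
    if configuration.get("has_header", True):
        header, data = rows[0], rows[1:]
    else:
        header, data = [], rows
    return header, data


# Example: a semicolon-delimited file with a header row.
header, data = read_dataset_sketch(
    "age;income\n23;1000\n45;2000\n",
    {"delimiter": ";", "has_header": True},
)
```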
This folder represents the point of exchange between the controller and the adapters. The datasets used by the AutoML training are saved here (in the /datasets subfolder) as well as the output files of the AutoML adapters (within the /output/ subfolders). This is done so that these (potentially large) files don't have to be transferred in a gRPC request.
In practice this folder is mounted as a volume for shared use between the controller and the adapters.
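Such a shared folder is typically declared as a named volume mounted into both containers. A docker-compose sketch might look as follows; the service names and mount paths are illustrative assumptions, not the project's actual configuration.

```yaml
# Illustrative only: service names and paths are assumptions.
services:
  controller:
    volumes:
      - training-data:/app/app-data   # datasets and adapter output
  adapter:
    volumes:
      - training-data:/app/app-data   # same shared folder
volumes:
  training-data:
```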