rumineykova edited this page Oct 23, 2023 · 22 revisions

FabGuard - FabFlee Input File Verification with Pandera

Introduction

FabGuard is a Python library that simplifies input file verification. It is based on the data validation library Pandera and adapted for FabFlee. This documentation will guide you through the steps to use FabGuard for input file verification.

Prerequisites

Before you get started with FabGuard, make sure you have the following prerequisites in place:

  • The Pandas library installed.
  • The Pandera library installed.

Project Structure

FabGuard is a plugin for FabFlee. The structure of the FabGuard folder is as follows:

  • tests Folder: Contains schemes (tests) for various input files. For example, the closure_scheme folder contains verification tests for the closure file.

  • config.py: Contains configuration information, including the names of test files.

  • error_messages.py: Contains error messages used in your verification checks.

  • fab_guard.py: The main wrapper for Pandera tests. It defines decorators, such as fg.log for functions that define error messages and fgcheck for functions that should be executed as part of the test suite. It also provides utility functions such as load_files, which reads a CSV file and returns a DataFrame, and transpose, which transposes a CSV file.

Each scheme file contains a class that inherits from pa.DataFrameModel.

Important Util Functions

To use resources efficiently, each test file is loaded into memory only once, so repeated checks do not re-read the same file. The singleton class FabGuard handles this. Load files through it as follows:

FabGuard.get_instance().load_file(config.routes)
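
The load-once behaviour can be illustrated with a minimal sketch. The FileCache class below is a hypothetical stand-in and not FabGuard's actual implementation; the CSV columns are made up for the example, and only the get_instance and load_file names mirror the call above.

```python
import os
import tempfile

import pandas as pd


class FileCache:
    """Hypothetical singleton that reads each CSV file at most once."""

    _instance = None

    def __init__(self):
        # Maps a file path to the DataFrame already loaded from it.
        self._frames = {}

    @classmethod
    def get_instance(cls):
        # Create the single shared instance lazily on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def load_file(self, path):
        # Only touch the disk the first time a path is requested.
        if path not in self._frames:
            self._frames[path] = pd.read_csv(path)
        return self._frames[path]


# Write a tiny illustrative CSV file and load it twice.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "routes.csv")
    with open(csv_path, "w") as handle:
        handle.write("name1,name2,distance\nA,B,10\n")

    first = FileCache.get_instance().load_file(csv_path)
    second = FileCache.get_instance().load_file(csv_path)
```

Both calls return the same DataFrame object, so every check that asks for the file shares one in-memory copy.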

How-To: Creating Tests for an Input File

In this guide, we will create tests for the locations.csv file as an example. Follow these steps to create your validation tests:

  1. Create a new Python file, location_test.py, in the FabGuard/tests folder.
  2. Create a Python class LocationsSchemeTest that inherits from pa.DataFrameModel.
class LocationsSchemeTest(pa.DataFrameModel):
  3. In this class, define constraints for each column as fields of the class. For example, if you have a location file with columns like region, country, lat, and lon, you can specify their data types as follows:
region: Series[pa.String] = pa.Field()
country: Series[pa.String] = pa.Field()
lat: Series[pa.Float] = pa.Field()
lon: Series[pa.Float] = pa.Field()

You can refine the data-type constraints further by passing parameters to the Field constructor. All built-in Pandera checks are available as keyword arguments to Field.

location_type: Series[pa.String] = pa.Field(
     isin = ["conflict_zone", "town","camp", "forwarding_hub", "marker", "idpcamp"])
conflict_date: Series[float] = pa.Field(nullable=True)
population: Series[float] = pa.Field(ge=0,nullable=True)
  • Above we use the built-in Pandera constraints ge (greater than or equal to) and isin (value is in a list). See the Pandera documentation for the full list of built-in checks.

  • To specify that a field cannot be null, set the nullable argument to False, which is also Pandera's default.

Finally, we specify the constraints for the column name:

    name: Series[pa.String] = pa.Field(nullable=False, alias='#"name"')

We use the alias parameter to give the real name of the column as it appears in the file. The header #"name" cannot be used as a Python attribute name because # starts a comment in Python.

  4. Add the file to tests/__init__.py.
  5. Finally, register your file for testing. In registry.py, in the test_all_files function, add self.register_for_test(<filename.classname>, <name of the file>). Your registry.py file should look like this:
@fgcheck
def test_all_files(self):
    # self.register_for_test(location_scheme.LocationsScheme, config.locations)
    self.register_for_test(location_test.LocationsSchemeTest, config.locations)
    # self.register_for_test(routes_scheme.RoutesScheme, config.routes)
    # self.register_for_test(closures_scheme.ClosuresScheme, config.closures) 

Notice that we have commented out all the other tests to avoid noise in your test output.

  6. Execute the test on the config files of a particular conflict. Below we run the test for the conflict car. The folder under test is config_files/car:
fabsim localhost flee_verify_input:car

How-To: More Interesting Constraints

In addition to the simple built-in checks and data-type constraints shown above, you can define two further types of checks: column-level checks and dataframe-level checks.

Column-level checks

Column-level checks define simple tests that apply to all values in a column. For example, to check if names are valid in the location file, use the decorator as follows:

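@pa.check(name1, element_wise=True)
def names_in_routes(cls, name1):
    # With element_wise=True, name1 is a single value from the column.
    # Define your test logic here and return True if the value is valid.
    ...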

Dataframe-level checks

For conditional tests or multi-column constraints, create custom test methods within the class. Use the @pa.dataframe_check decorator to mark these methods for testing. For example, to ensure that the country value in the locations.csv file matches the country in row 0 when the location type is "conflict_zone," use the following code:

@pa.dataframe_check()
def conflict_zone_country_should_be_0(cls, df: pd.DataFrame) -> Series[bool]:
    # Country of the first row, which every conflict zone must match.
    country = df["country"].iloc[0]
    # True for conflict-zone rows whose country differs from row 0.
    mask = (df["location_type"] == "conflict_zone") & (df["country"] != country)

    if mask.any():  # At least one row violates the constraint
        raise ValueError(Errors.location_country_err(df.index[mask], config.locations))
    return ~mask

Note that all error messages are stored in the error_messages.py file. Pass the index labels of the faulty rows (df.index[mask]) and the name of the input file under test (config.locations) so that the correct error message is logged.
