A project assigned to the FCI Suez University class of 2021-2025 to test their skills in the subject.
Subject: CS342 Automata and Language Theory
Detects bad words in a .csv file compressed as a .rar archive, using the provided Bad_Words.csv file, and outputs both Excel and .csv files containing analytics about the process.
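As a rough illustration of the detection step, here is a minimal sketch of a regex-based filter using only the standard library. It is not the project's implementation: the function names `build_pattern` and `split_rows`, the whole-word and case-insensitive matching, the zero-based column indexes, and the sample words are all assumptions made for this example.

```python
import csv
import io
import re

def build_pattern(bad_words):
    """Compile one pattern that matches any bad word as a whole word (assumed behavior)."""
    escaped = [re.escape(word) for word in bad_words]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def split_rows(rows, pattern, columns):
    """Split rows into (clean, flagged), searching only the given zero-based columns."""
    clean, flagged = [], []
    for row in rows:
        hit = any(pattern.search(row[i]) for i in columns if i < len(row))
        (flagged if hit else clean).append(row)
    return clean, flagged

# Tiny in-memory stand-in for the decompressed data file.
sample = io.StringIO("1,hello world,fine\n2,this is darn bad,ok\n3,nothing,DARN\n")
rows = list(csv.reader(sample))

pattern = build_pattern(["darn", "heck"])  # placeholder word list
clean, flagged = split_rows(rows, pattern, columns=[1, 2])
print(len(clean), len(flagged))  # prints: 1 2
```

A single compiled alternation keeps the sketch short. For large word lists, an Aho-Corasick automaton (the other filter mode the app offers) typically scales better than a long regex alternation.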
Python 3.10 or newer
pandas
pyahocorasick
rarfile
openpyxl
pytest
2- Open the args.json file, compress your .csv file into a .rar archive, and put the archive's path in the data_file section
git clone https://github.com/GreenVenom77/Bad_Words_Detector.git
cd Bad_Words_Detector
python main.py -h
python main.py -d './46,080,374Rows_365Columns.rar' -b './BadWords.csv' -s 150000 -f 'AhoCorasick' -p 'ProcessesPool' -c '1,2,3'
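The `-s` and `-p` options in the command above suggest that the data is read in chunks and handed to a pool of workers. The following is a minimal sketch of that idea under stated assumptions, not the project's code: `iter_chunks` and `count_flagged` are hypothetical helpers, the bad-word check is a naive substring test, and a standard-library thread pool stands in for the app's processing modes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def count_flagged(chunk, bad_words):
    """Count rows in one chunk containing any bad word (naive substring test)."""
    return sum(any(word in row.lower() for word in bad_words) for row in chunk)

rows = ["all good", "darn it", "fine", "what the heck", "ok"]  # placeholder data
bad_words = ["darn", "heck"]                                   # placeholder words

# Each chunk goes to a worker; map() returns results in chunk order.
with ThreadPoolExecutor(max_workers=2) as pool:
    counts = list(pool.map(lambda c: count_flagged(c, bad_words), iter_chunks(rows, 2)))

print(counts)  # prints: [1, 1, 0]
```

To try a process pool instead, `ProcessPoolExecutor` could be swapped in, but the lambda would have to become a module-level function so it can be pickled.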
usage: Bad Words Filter App [-h] -d DATA_FILE -b BAD_WORDS_FILE [-s CHUNK_SIZE] [-f {Regex,AhoCorasick}]
[-p {MultiThreading,MultiProcessing,ProcessesPool}] [-c COLUMNS]
filter the specified columns from a big compressed csv file the bad words rows.
options:
-h, --help show this help message and exit
-d DATA_FILE, --data_file DATA_FILE
The csv file that we will filter
-b BAD_WORDS_FILE, --bad_words_file BAD_WORDS_FILE
The name of bad words file
-s CHUNK_SIZE, --chunk_size CHUNK_SIZE
The chunk size will be processed
-f {Regex,AhoCorasick}, --filter_mode {Regex,AhoCorasick}
The mode of filtering.
-p {MultiThreading,MultiProcessing,ProcessesPool}, --processing_mode {MultiThreading,MultiProcessing,ProcessesPool}
the concurrent model that will work
-c COLUMNS, --columns COLUMNS
specified columns that will be filtered in format column1,column... like 1,2,3,4
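For readers who want to see how an interface like the one above can be declared, here is a sketch with `argparse` that mirrors the flags and choices in the help text. The default values, the `int` type for the chunk size, and the `parse_columns` helper are assumptions for illustration; the help output does not show them, and the project's real parser may differ.

```python
import argparse

def parse_columns(value):
    """Turn a string like '1,2,3' into a list of ints (hypothetical helper)."""
    try:
        return [int(part) for part in value.split(",")]
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected comma-separated integers, got {value!r}"
        )

def build_parser():
    parser = argparse.ArgumentParser(prog="Bad Words Filter App")
    # -d and -b appear without brackets in the usage line, so they are required.
    parser.add_argument("-d", "--data_file", required=True)
    parser.add_argument("-b", "--bad_words_file", required=True)
    # The defaults below are placeholders, not the app's actual defaults.
    parser.add_argument("-s", "--chunk_size", type=int, default=150000)
    parser.add_argument("-f", "--filter_mode",
                        choices=["Regex", "AhoCorasick"], default="AhoCorasick")
    parser.add_argument("-p", "--processing_mode",
                        choices=["MultiThreading", "MultiProcessing", "ProcessesPool"],
                        default="ProcessesPool")
    parser.add_argument("-c", "--columns", type=parse_columns, default=None)
    return parser

args = build_parser().parse_args(
    ["-d", "data.rar", "-b", "BadWords.csv", "-s", "150000", "-c", "1,2,3"]
)
print(args.chunk_size, args.columns)  # prints: 150000 [1, 2, 3]
```

Parsing the column list in a `type=` callable means a malformed value such as `1,a` is rejected by argparse with a usage error before any data is read.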
Any contribution is very welcome, even a small one.