This project aims to improve fraud detection systems in e-commerce, partnering with IEEE-CIS and Vesta Corporation to analyze a large dataset of real-world transactions.
- Size: 590,540 transactions, 434 features.
- Key Features: Transaction amount, transaction time, card details, address, distance, email domains, and various anonymized features.
- Transaction Amounts: Log transformation revealed clear distinctions between fraudulent and non-fraudulent transactions.
- Timing Patterns: Fraudulent transactions often occurred at irregular hours.
- Feature Importance: PCA reduced multicollinearity, highlighting key features like 'TransactionAmt' and 'TransactionDT'.
- Imbalanced Data: The dataset exhibited a significant imbalance between fraudulent and non-fraudulent transactions, posing challenges for effective model training. Due to the many anonymous or encoded features, techniques like SMOTE were not applicable since they require knowledge of the feature meanings. Consequently, we relied heavily on comprehensive feature selection techniques and specific models capable of handling imbalanced data.
- Feature Selection: Choosing the most relevant features from a large set of 434 features required extensive visual analysis and dimensionality reduction techniques like PCA and Random Forest feature importance.
- Objective: Balance recall and precision to minimize false positives and maximize fraud detection.
- Hyperparameter Tuning: Grid search for Random Forest, Logistic Regression, and XGBoost.
- Random Forest: ROC-AUC 0.89
- Logistic Regression: ROC-AUC 0.51
- XGBoost: ROC-AUC 0.88
- Threshold Experimentation: XGBoost maintained better stability and performance across different thresholds, chosen as the final model.
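The grid-search-plus-ROC-AUC workflow described above can be sketched roughly as follows. This is a minimal, self-contained illustration on synthetic imbalanced data, not the notebook's actual pipeline: the parameter grids, dataset shape, and class balance are assumptions chosen for brevity, and XGBoost is omitted here to keep the sketch dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the imbalanced fraud data (~5% positives).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Small illustrative grids; the real search space would be larger.
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [100, 200], "max_depth": [None, 10]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0]}),
}

scores = {}
for name, (model, grid) in candidates.items():
    # ROC-AUC as the selection metric, matching the comparison above.
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=3)
    search.fit(X_tr, y_tr)
    proba = search.predict_proba(X_te)[:, 1]
    scores[name] = roc_auc_score(y_te, proba)
print(scores)
```

On the real dataset the same loop would also include an XGBoost candidate, and the held-out ROC-AUC values are what the comparison table above reports.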
The XGBoost model was chosen for its robust performance and balanced precision-recall metrics, making it effective for fraud detection. The model demonstrated a strong F1-score of 0.76, with nearly equal precision and recall scores for both classes and an overall accuracy of 81%.
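Threshold experimentation of the kind used to select the final model can be reproduced with a simple sweep over decision thresholds, scoring each with F1. The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a dependency-free stand-in for XGBoost; the threshold range is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~10% positives) as a placeholder.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Sweep decision thresholds and keep the one with the best F1-score,
# trading off precision against recall.
thresholds = np.linspace(0.1, 0.9, 17)
f1_by_threshold = {t: f1_score(y_te, proba >= t) for t in thresholds}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)
print(best_threshold, f1_by_threshold[best_threshold])
```

A model whose F1 stays flat across a wide band of thresholds, as XGBoost did here, is easier to deploy because the operating point is less sensitive to drift.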
- Model Integration: Explore ensembling Random Forest and XGBoost.
- Real-Time Deployment: Evaluate the model in a live environment for real-time fraud detection.
This project enhances fraud detection accuracy, ensuring robust protection and minimizing disruptions for genuine users.
These instructions will guide you in getting a copy of the project up and running on your local machine for development and testing purposes.
Ensure you have the following prerequisites installed and set up:
- Python Version: Python 3.8 or higher (the project was developed with Python 3.10.12).
- Libraries and Dependencies: Ensure all necessary libraries and dependencies are installed as listed in `requirements.txt`.
- Google Account: Required for Google Colab.
- Local or Cloud Environment: Capable of running Jupyter notebooks if not using Google Colab.
Follow these steps to set up the project environment:
- Clone the repository:

```shell
git clone <repository-url>
cd <repository-name>
```

- Install required Python packages:

```shell
pip install -r requirements.txt
```
If you are using Google Colab, follow these instructions:
- Open Google Colab.
- Click on `File` > `Open notebook` > `GitHub` tab and paste the URL of your notebook.
- Alternatively, click on `File` > `Upload notebook` to upload the notebook file from your local machine.
- Follow the instructions within the notebook to mount your Google Drive if required for data access or file storage.
If you are running the notebook locally:
- Launch Jupyter Notebook in your environment:

```shell
jupyter notebook
```
- In the Jupyter interface that opens in your web browser, navigate to the notebook file and open it.
- Run the cells in the notebook sequentially to replicate the analysis.
To utilize datasets from Kaggle for your project, you'll need to configure the Kaggle API on your system. Follow these steps:
- Create or Log Into Your Kaggle Account:
  - New users can create an account.
  - Existing users can log in.
- API Token Generation:
  - Navigate to your account settings by clicking on your profile picture in the top right corner and selecting 'Account'.
  - Scroll to the 'API' section and click the 'Create New API Token' button. This downloads a file named `kaggle.json` containing your API credentials.
- Setup API Token on Your System:
  - Place the downloaded `kaggle.json` file into the directory specified in your notebook. The standard location is `~/.kaggle/` for Unix-like systems.
- Explore the Dataset:
  - Visit the IEEE Fraud Detection Kaggle Competition page to explore and understand the dataset you'll be working with.
- Dataset Download and Setup:
  - Execute the following commands in your notebook to set up the Kaggle API and download the dataset:
```shell
# Create the Kaggle directory if it does not already exist
!mkdir -p ~/.kaggle
# Copy the kaggle.json file into this directory
!cp /your/drive/kaggle.json ~/.kaggle/
# Secure the API token by restricting its permissions
!chmod 600 ~/.kaggle/kaggle.json
# Download the dataset from the Kaggle competition
!kaggle competitions download -c ieee-fraud-detection
# Unzip the downloaded dataset into the specified directory
!unzip /content/ieee-fraud-detection.zip
```
By following these steps, you'll be able to securely set up the Kaggle API on your system and access the datasets required for your project.
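Once unzipped, the competition data ships as separate transaction and identity tables joined on `TransactionID`. The sketch below shows the join with tiny inline stand-in frames so it runs anywhere; in practice you would `pd.read_csv("train_transaction.csv")` and `pd.read_csv("train_identity.csv")` from the unzipped download, and the column values here are made up for illustration.

```python
import pandas as pd

# Tiny stand-ins for the competition's two training tables.
train_transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [50.0, 120.5, 9.99],
    "isFraud": [0, 1, 0],
})
train_identity = pd.DataFrame({
    "TransactionID": [2],
    "DeviceType": ["mobile"],
})

# Identity information exists only for a subset of transactions, so a left
# join on TransactionID keeps every transaction row and fills the rest
# with NaN.
train = train_transaction.merge(train_identity, on="TransactionID", how="left")
print(train.shape)
```

The left join matters: an inner join would silently drop every transaction without identity data, shrinking the training set.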
Valentina Sanchez
[email protected]
Inspiration, datasets, and code snippets for this project were provided by:
- IEEE Computational Intelligence Society - Source of the Kaggle dataset.