Winter 2024 CS291K (Machine Learning and Data Mining) course project
In this project, we investigated whether different pre-training strategies improve landslide detection with deep learning. We used the Faster R-CNN framework with transfer learning on satellite imagery datasets and explored the following pre-training strategies:
- Image Classification
- Knowledge Distillation
- Masked Autoencoder (MAE)
By combining these pre-trained models and techniques, we improved landslide bounding-box detection on satellite images.
💡 For environment setup, please see: Implementation Details Section
🔎 For further reading, please see: our final report
Landslides have affected about 5 million people worldwide.
Existing deep-learning approaches either require training from scratch or use natural image pre-trained weights to initialize the model.
Although the use of different pre-training strategies (e.g., MAE) with satellite imagery has been extensively studied, their effectiveness in landslide detection remains unexplored.
- Objective: Pre-train image encoders using different strategies
- Datasets:
- Architecture (image encoder):
- CNN-based: ResNet-18 and EfficientNet-B0
- Transformer-based: ViT and Swin Transformer
- Image Encoder Pre-training Strategies
- Image Classification:
- train an image encoder with the ImageNet dataset to predict the image category
- Knowledge Distillation: distill a more complex model into a smaller, task-specific image encoder:
- load the ImageNet-pre-trained weights into a predetermined teacher model;
- fine-tune the teacher model on the Landslide4Sense dataset for binary landslide image classification;
- freeze the teacher model's weights;
- train a smaller student model to match the teacher model's soft target probabilities while also fitting the ground-truth hard labels.
- Masked Autoencoder:
- mask input image patches and train an encoder-decoder framework to reconstruct the original image
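The student objective in the knowledge distillation steps above can be sketched as follows. This is a minimal, framework-free illustration, not the project's training code: it assumes the common formulation that mixes a temperature-softened KL term against the teacher with cross-entropy against the hard label. The function names and the `temperature` and `alpha` defaults are hypothetical and are not taken from the project.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Loss for one sample: soft-target term plus hard-label term.

    temperature and alpha are illustrative defaults (assumptions).
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    soft_term = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )

    # Cross-entropy against the ground-truth label at temperature 1.
    hard_term = -log_softmax(student_logits)[label]

    # The T^2 factor keeps the soft-term gradient scale comparable
    # across temperatures.
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term


# Binary case (index 1 = landslide): a student that agrees with the
# teacher is penalized less than one that contradicts it.
teacher = [0.0, 2.0]
print(distillation_loss([0.0, 2.0], teacher, label=1))
print(distillation_loss([2.0, 0.0], teacher, label=1))
```

In an actual training loop the teacher logits would come from the frozen teacher's forward pass, and the loss would be averaged over a batch.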
- Objective: Landslide object bounding box detection
- Datasets: Landslide4Sense
- Architecture:
- Backbone: image encoder pretrained during Stage 1
- Head: Faster R-CNN
A detection is considered correct if its Intersection over Union (IoU) with a ground-truth bounding box is greater than or equal to a predefined threshold.
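As an illustration of this criterion, the sketch below computes the IoU of two axis-aligned boxes and applies a threshold. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples. The function names are hypothetical, and the threshold of 0.5 in the usage lines is only an example value, not necessarily the one used in this project.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x_min, y_min, x_max, y_max) form."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, gt_box, iou_threshold):
    """True if the predicted box overlaps the ground truth enough."""
    return box_iou(pred_box, gt_box) >= iou_threshold


# Example threshold of 0.5 (an assumption for illustration).
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(is_correct_detection((0, 0, 10, 10), (0, 0, 10, 5), 0.5))
```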
We first compared the performance of various backbone architectures on landslide object detection, with the pre-training strategy fixed to ImageNet-1K image classification for all architectures. Swin-Base outperformed the other architectures.
We then selected the three best-performing architectures (ViT-Large, Swin-Tiny, and Swin-Base) and investigated whether different pre-training strategies could further improve their performance on this landslide detection task. Across all dataset-strategy-model combinations, Swin-Base pre-trained using MAE yielded the best performance.
Sample prediction results from our best-performing model (Swin-Base pre-trained using MAE):
We presented a comprehensive approach to landslide detection using deep-learning techniques, focusing on using pre-trained image encoder architectures within the Faster R-CNN framework.
We found that the Swin-Base architecture, pre-trained using a Masked Autoencoder (MAE), yielded the best performance in detecting landslides within satellite imagery.
Our findings highlighted the importance of selecting appropriate pre-training strategies and backbone architectures for improving landslide detection performance.
This project was part of the CS291K: Machine Learning and Data Mining course. Study design and code implementation were done by me (Yuchen Hou) and Vihaan Akshaay Rajendiran.
cd 291k
conda env create -f environment.yml
conda activate 291k
- CLIP image encoder weights: https://github.com/wangzhecheng/SkyScript
- vitdet (MAE) backbone model weights: https://github.com/sustainlab-group/SatMAE
- Other weights (pretrained for this project): https://drive.google.com/drive/folders/1kzUATRd5Rzyav2yyE-Bcl7hGmHzPDLgL?usp=sharing
- Landslide4Sense: https://www.kaggle.com/datasets/tekbahadurkshetri/landslide4sense
- Dataset used in paper "A novel Dynahead-Yolo neural network for the detection of landslides with variable proportions using remote sensing images": https://github.com/Abbott-max/dataset/tree/main
- Swin-MAE: https://github.com/Zian-Xu/Swin-MAE
- General Faster R-CNN training pipeline: https://github.com/sovit-123/fasterrcnn-pytorch-training-pipeline
- Swin-Transformer FPN neck: https://github.com/oloooooo/faster_rcnn_swin_transformer_detection/tree/master
- Knowledge distillation: https://huggingface.co/docs/transformers/en/tasks/knowledge_distillation_for_image_classification
- CLIP image encoder: https://github.com/mlfoundations/open_clip
📧 Yuchen Hou | GitHub | LinkedIn | Webpage
☕ I'm always happy to chat about research ideas, potential collaborations, or anything you're passionate about!