An official PyTorch implementation of the paper "Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement" (IJCAI 2023).
Authors: Hang Guo, Tao Dai, Guanghao Meng, and Shu-Tao Xia
This work proposes the Location Enhanced Multi-ModAl network (LEMMA), which addresses the challenges posed by complex backgrounds in scene text images through explicit location enhancement. The overall architecture of LEMMA is shown below.
As the previous code was a bit of a mess, we have re-organized it and retrained LEMMA. The performance of the re-trained model is as follows (better than that reported in the paper).
| Text Recognizer | Easy | Medium | Hard | avgAcc |
|---|---|---|---|---|
| CRNN | 64.98% | 59.89% | 43.48% | 56.73% |
| MORAN | 76.90% | 64.28% | 46.84% | 63.60% |
| ASTER | 81.53% | 67.40% | 48.85% | 66.93% |
One can download this model via this link, which contains the parameters of both the super-resolution branch and the guidance generation branch.
The training log is also available via this link.
In this work, we use the STISR dataset TextZoom and four STR benchmarks, i.e., ICDAR2015, CUTE80, SVT, and SVTP, for model comparison. All datasets are in lmdb format, and one can download them from this link we have prepared for you. Please do not forget to set your own dataset paths in `./config.yaml`, such as the parameters `train_data_dir` and `val_data_dir` (a minimal sketch for inspecting such an lmdb dataset is given below).
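The following Python sketch shows how a TextZoom-style lmdb dataset can be inspected. The key layout (`num-samples`, `image_hr-%09d`, `image_lr-%09d`, `label-%09d`) follows the convention used by TextZoom and most STISR codebases, and the dataset path is a placeholder; please check both against the files you actually download.

```python
import io

import lmdb
from PIL import Image

# Open a TextZoom-style lmdb dataset read-only.
# NOTE: the path below is a placeholder; point it at your
# train_data_dir / val_data_dir from ./config.yaml.
env = lmdb.open('./dataset/TextZoom/test/easy',
                readonly=True, lock=False, readahead=False)

with env.begin(write=False) as txn:
    n_samples = int(txn.get(b'num-samples'))  # number of LR/HR pairs
    print(f'{n_samples} samples')

    idx = 1  # sample keys are 1-indexed by convention
    label = txn.get(f'label-{idx:09d}'.encode()).decode()
    img_hr = Image.open(io.BytesIO(txn.get(f'image_hr-{idx:09d}'.encode())))
    img_lr = Image.open(io.BytesIO(txn.get(f'image_lr-{idx:09d}'.encode())))
    print(label, img_hr.size, img_lr.size)
```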
Following previous STISR works, we use CRNN, MORAN, and ASTER as the downstream text recognizers.
Moreover, the code also supports some newer text recognizers, such as ABINet, MATRN, and PARSeq. You can find a detailed comparison using these three recognizers in the supplementary material we provided, and you can also test LEMMA with them by modifying the command (e.g., `--test_model='ABINet'`). Please download these pre-trained text recognition models from the corresponding repositories linked above.
You also need to modify the text recognizer model paths in the `./config.yaml` file. Moreover, we employ the text-focus loss proposed by STT during training. Since this loss relies on a pre-trained transformer-based text recognizer, please download this recognition model here and set its checkpoint path as well.
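As a quick sanity check after downloading, the sketch below simply confirms that a recognizer checkpoint file is readable with PyTorch. The file name is a placeholder, and the key structure of each checkpoint depends on the repository it comes from, so inspect it before calling `load_state_dict`.

```python
import torch

# Load a downloaded recognizer checkpoint on CPU just to confirm it is readable.
# NOTE: the file name is a placeholder; use the path you set in ./config.yaml.
ckpt = torch.load('./ckpt/recognizer.pth', map_location='cpu')

# Checkpoints may store weights directly or nest them under a key such as
# 'state_dict' or 'model'; print the top-level keys to find out.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```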
We have set reasonable default hyper-parameters in `config.yaml` and `main.py`, so you can directly run training and testing once you have modified the dataset and pre-trained model paths.
For training:

```bash
python main.py
```

For testing:

```bash
python main.py --test
```

To test with one of the newer recognizers, append the `--test_model` flag described above (e.g., `python main.py --test --test_model='ABINet'`).
NOTE: You can also customize other hyper-parameters in the `config.yaml` and `main.py` files, such as `n_gpu`.
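For instance, here is a minimal sketch of reading and overriding a `config.yaml` entry with PyYAML. Apart from `n_gpu`, `train_data_dir`, and `val_data_dir` mentioned above, any assumptions about the file's contents should be checked against the actual file.

```python
import yaml

# Read the training configuration shipped with the repository.
with open('./config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Inspect the dataset paths and override a hyper-parameter before training.
print(config.get('train_data_dir'), config.get('val_data_dir'))
config['n_gpu'] = 1  # e.g., fall back to a single GPU

with open('./config.yaml', 'w') as f:
    yaml.safe_dump(config, f)
```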
If you find our work helpful, please consider citing us.
```bibtex
@inproceedings{ijcai2023p87,
  title     = {Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement},
  author    = {Guo, Hang and Dai, Tao and Meng, Guanghao and Xia, Shu-Tao},
  booktitle = {Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, {IJCAI-23}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  pages     = {782--790},
  year      = {2023},
  month     = {8},
  note      = {Main Track},
  doi       = {10.24963/ijcai.2023/87},
  url       = {https://doi.org/10.24963/ijcai.2023/87},
}
```
The code of this work is based on TBSRN, TATT, and C3-STISR. Thanks for their contributions.