Text Role Classification in Scientific Charts Using Multimodal Transformers
Abstract: Text role classification assigns a semantic role to each textual element in a scientific chart. For this task, we propose to fine-tune two pretrained multimodal document layout analysis models, LayoutLMv3 and UDOP, on chart datasets. Both transformers take three modalities as input: text, image, and layout. We further investigate whether data augmentation and balancing methods improve model performance. The models are evaluated on various chart datasets, and the results show that LayoutLMv3 outperforms UDOP in all experiments. LayoutLMv3 achieves the highest F1-macro score of 82.87 on the ICPR22 test dataset, beating the best-performing model from the ICPR22 CHART-Infographics challenge. Moreover, the robustness of the models is tested on a synthetic noisy dataset, ICPR22-N. Finally, the generalizability of the models is evaluated on three chart datasets, CHIME-R, DeGruyter, and EconBiz, for which we added text role labels. The findings indicate that even with limited training data, transformers can be used with the help of data augmentation and balancing methods. The source code and datasets are available on GitHub at https://github.com/hjkimk/text-role-classification
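The abstract reports results as F1-macro, which averages per-class F1 scores without weighting by class frequency, so rare text roles count as much as frequent ones. The following is a minimal sketch of that metric in plain Python, not the evaluation code from the paper's repository; the function name `macro_f1` and the role labels in the example are hypothetical placeholders chosen for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced example: one frequent role and two rare ones.
y_true = ["tick_label"] * 8 + ["axis_title", "legend_label"]
y_pred = ["tick_label"] * 8 + ["axis_title", "tick_label"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy example nine of ten predictions are correct, yet the single missed rare class pulls the macro average well below the accuracy. This is one reason a macro-averaged score is sensitive to how well minority roles are handled, which is where balancing methods are expected to matter.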