
Text Role Classification in Scientific Charts Using Multimodal Transformers

Published 8 Feb 2024 in cs.CV, cs.CL, and cs.LG | (2402.14579v1)

Abstract: Text role classification involves classifying the semantic role of textual elements within scientific charts. For this task, we propose to finetune two pretrained multimodal document layout analysis models, LayoutLMv3 and UDOP, on chart datasets. The transformers utilize the three modalities of text, image, and layout as input. We further investigate whether data augmentation and balancing methods improve the performance of the models. The models are evaluated on various chart datasets, and results show that LayoutLMv3 outperforms UDOP in all experiments. LayoutLMv3 achieves the highest F1-macro score of 82.87 on the ICPR22 test dataset, beating the best-performing model from the ICPR22 CHART-Infographics challenge. Moreover, the robustness of the models is tested on a synthetic noisy dataset, ICPR22-N. Finally, the generalizability of the models is evaluated on three chart datasets, CHIME-R, DeGruyter, and EconBiz, for which we added labels for the text roles. Findings indicate that even when training data is limited, transformers can be used with the help of data augmentation and balancing methods. The source code and datasets are available on GitHub under https://github.com/hjkimk/text-role-classification
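
The abstract reports results as F1-macro scores. As a reading aid, the sketch below shows how a macro-averaged F1 is commonly computed: an F1 score per class, then an unweighted mean across classes, so rare text roles count as much as frequent ones. This is a generic illustration, not the authors' evaluation code; the function name `macro_f1` and the role labels in the example are assumptions made for this sketch and are not taken from the paper. The function returns a fraction in [0, 1], whereas the abstract quotes the score on a 0 to 100 scale.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical text-role labels, chosen only for illustration.
y_true = ["axis_title", "tick_label", "tick_label", "tick_label", "legend_label"]
y_pred = ["axis_title", "tick_label", "tick_label", "legend_label", "legend_label"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

In this toy example the per-class F1 values are 1.0, 0.8, and 2/3, so a single misclassified tick label lowers the macro average noticeably because each class carries equal weight.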

