LORE++: Logical Location Regression Network for Table Structure Recognition with Pre-training (2401.01522v1)
Abstract: Table structure recognition (TSR) aims at extracting tables in images into machine-understandable formats. Recent methods solve this problem by predicting the adjacency relations of detected cell boxes or by learning to directly generate the corresponding markup sequences from table images. However, existing approaches either rely on additional heuristic rules to recover the table structures, or face challenges in capturing long-range dependencies within tables, resulting in increased complexity. In this paper, we propose an alternative paradigm. We model TSR as a logical location regression problem and propose a new TSR framework called LORE, standing for LOgical location REgression network, which for the first time regresses the logical location as well as the spatial location of table cells in a unified network. Our proposed LORE is conceptually simpler, easier to train, and more accurate than other paradigms of TSR. Moreover, inspired by the remarkable success of pre-trained models on a number of computer vision and natural language processing tasks, we propose two pre-training tasks to enrich the spatial and logical representations at the feature level of LORE, resulting in an upgraded version called LORE++. The incorporation of pre-training in LORE++ brings substantial gains in accuracy, generalization, and few-shot capability over its predecessor. Experiments on standard benchmarks against methods of previous paradigms demonstrate the superiority of LORE++, highlighting the promise of the logical location regression paradigm for TSR.
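To make the paradigm concrete: the abstract describes a unified network with two regression targets per detected cell, its spatial location (a bounding box) and its logical location (row/column indices). The sketch below is a minimal, hypothetical illustration of that two-head design, not the paper's actual architecture; the feature dimension, the linear heads, and the `predict` function are all assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimension for each detected cell (not from the paper).
D = 32
# Two regression heads sharing the same cell features, mirroring the
# "unified network" idea: one for spatial, one for logical location.
W_spatial = rng.standard_normal((D, 4)) * 0.1  # box: (x1, y1, x2, y2)
W_logical = rng.standard_normal((D, 4)) * 0.1  # (row_start, row_end, col_start, col_end)

def predict(cell_features: np.ndarray):
    """Regress spatial and logical locations from shared cell features.

    cell_features: (N, D) array, one row per detected table cell.
    Returns (spatial, logical): continuous box coordinates and
    non-negative integer logical coordinates.
    """
    spatial = cell_features @ W_spatial
    # Logical locations are discrete grid indices, so round and clip at 0.
    logical = np.rint(np.maximum(cell_features @ W_logical, 0)).astype(int)
    return spatial, logical

feats = rng.standard_normal((5, D))
spatial, logical = predict(feats)
print(spatial.shape, logical.shape)  # (5, 4) (5, 4)
```

Framing both outputs as regression from shared features is what removes the need for heuristic structure-recovery rules: the logical grid coordinates are read off directly rather than reconstructed from pairwise adjacency predictions.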