Deep Learning for Unified Table and Caption Detection in Scientific Documents
Abstract
This study explores the extraction of tables and their corresponding captions from scientific documents, aiming to enhance the capabilities of information extraction and analysis. Although there have been significant advancements in table extraction and analysis, a gap remains in extracting tables together with their captions, which encapsulate the complete informational context. This research focuses on selecting and fine-tuning deep learning models and on creating task-specific datasets for this purpose.
Using a manually annotated subset of the TableBank dataset, the study employs Faster R-CNN and Mask R-CNN models hosted by LayoutParser and pre-trained on the PubLayNet dataset. The training pipeline is implemented with the Detectron2 framework. The experimental phases involve a systematic increase in dataset size and document layout complexity, along with extensive hyperparameter searches to improve overall detection accuracy.
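As a rough illustration of the setup described above, the sketch below shows how a PubLayNet-pre-trained Mask R-CNN with a ResNeXt-101 backbone can be loaded through LayoutParser for layout detection. The score threshold and label map are illustrative assumptions, not values reported in this study; fine-tuning itself would proceed through Detectron2's standard training loop on the annotated data.

```python
import layoutparser as lp

# Load a Mask R-CNN (ResNeXt-101 32x8d FPN backbone) pre-trained on
# PubLayNet via the LayoutParser model zoo. The 0.8 confidence threshold
# is an illustrative assumption, not a value tuned in this work.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

# Detection on a page image (loaded, e.g., with OpenCV):
#   import cv2
#   image = cv2.imread("page.png")
#   layout = model.detect(image)
#   tables = [block for block in layout if block.type == "Table"]
```

Fine-tuning for the table-plus-caption task would replace the PubLayNet label map with the task-specific classes and resume training from these weights under a Detectron2 config.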
The results indicate that the fine-tuned models achieve high accuracy and robustness in extracting document elements from scientific documents. Notably, Mask R-CNN with the ResNeXt-101 backbone achieved the best results, highlighting the importance of model and backbone architecture, dataset variability, and hyperparameter tuning. These findings open avenues for future research that exploits the combined context of document elements across diverse document scenarios, facilitating better knowledge extraction.