Deep Learning for Unified Table and Caption Detection in Scientific Documents
Abstract
This study addresses the extraction of tables and their corresponding captions from scientific documents, aiming to enhance information extraction and analysis. Although table extraction and analysis have advanced significantly, extracting tables together with their respective captions, which encapsulate the complete informational context, remains an open gap. This research focuses on selecting and fine-tuning deep learning models and on creating task-specific datasets.
Using a subset of the TableBank dataset augmented with manual annotation, the study employs Faster R-CNN and Mask R-CNN models provided through LayoutParser and pre-trained on the PubLayNet dataset. The training pipeline is implemented with the Detectron2 framework. The experimental phases systematically increase dataset size and document layout complexity, and include extensive hyperparameter searches to improve overall detection accuracy.
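For illustration, the sketch below shows how such a pre-trained model can be loaded and applied through LayoutParser's Detectron2 integration. The checkpoint path, score threshold, input file name, and label map are assumptions taken from LayoutParser's public PubLayNet model zoo, not the exact configuration used in this study.

    import layoutparser as lp
    import cv2

    # Assumed model zoo entry: Mask R-CNN with a ResNeXt-101 backbone,
    # pre-trained on PubLayNet (mirrors the best-performing setup reported here).
    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )

    image = cv2.imread("page.png")   # one rendered document page (hypothetical file)
    image = image[..., ::-1]         # convert OpenCV's BGR output to RGB
    layout = model.detect(image)     # returns a Layout of detected region blocks

    # Keep only table regions; detecting captions jointly would require a
    # fine-tuned model whose label map includes a dedicated caption class.
    tables = [block for block in layout if block.type == "Table"]

Fine-tuning such a model on table-plus-caption annotations, as done in this work, follows the standard Detectron2 training workflow with a custom dataset and label map.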
The results indicate that the fine-tuned models achieve high accuracy and robustness in extracting document elements from scientific documents. Notably, Mask R-CNN with a ResNeXt-101 backbone achieved the best results, highlighting the importance of model and backbone architecture, dataset variability, and hyperparameter tuning. These findings open avenues for future research that leverages the combined context of document elements across diverse document scenarios, facilitating better knowledge extraction.