Abstract
Tables are common in scientific publications, where they serve as a primary means of presenting findings in a structured way. This project concerns the extraction of tables from scientific papers published on arxiv.org. arXiv is an open archive for scholarly articles; articles are available in PDF format and, for most of them, the corresponding LaTeX sources are available as well.
The specific project objectives are:
(i) Developing a method for identifying and extracting tables from a LaTeX document; (ii) Enriching the extracted table data with metadata from the article; (iii) Creating a large-scale table corpus that can be distributed; (iv) Setting up batch processes to continuously update the table corpus.