A Novel Approach to Data Extraction on Hyperlinked Webpages

Shaukat, Kamran; Masood, Nayyer; Khushi, Matloob

dc.contributor.author	Shaukat, Kamran
dc.contributor.author	Masood, Nayyer
dc.contributor.author	Khushi, Matloob
dc.date.accessioned	2020-01-28T10:15:46Z
dc.date.available	2020-01-28T10:15:46Z
dc.date.created	2019-11-27T15:01:38Z
dc.date.issued	2019-11
dc.identifier.citation	Shaukat, K., Masood, N., Khushi, M. (2019) A Novel Approach to Data Extraction on Hyperlinked Webpages, Applied Sciences, 9(23)	nb_NO
dc.identifier.issn	2076-3417
dc.identifier.uri	http://hdl.handle.net/11250/2638286
dc.description.abstract	The World Wide Web holds an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further details about certain attribute values. Extracting the schema of such relational tables is challenging because there is no standard format and a lack of published algorithms. We downloaded 15,000 web pages from various websites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables; for linked tables, a relational schema in the form of primary and foreign keys (PK and FK) was developed. Child tables were concatenated with the parent table's attribute value (PK), which serves as the foreign key (FK). As a result, these tables can support more powerful queries using the join operation. Manual checking of the linked web table results showed 99% precision and 68% recall. Our downloadable corpus of 15,000 pages and the novel algorithm provide a basis for further research in this field.	nb_NO
dc.language.iso	eng	nb_NO
dc.publisher	MDPI	nb_NO
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/deed.no	*
dc.subject	IT	nb_NO
dc.subject	web tables	nb_NO
dc.subject	information technology	nb_NO
dc.title	A Novel Approach to Data Extraction on Hyperlinked Webpages	nb_NO
dc.type	Journal article	nb_NO
dc.type	Peer reviewed	nb_NO
dc.description.version	publishedVersion	nb_NO
dc.rights.holder	© 2019 by the authors	nb_NO
dc.subject.nsi	VDP::Technology: 500::Information and communication technology: 550	nb_NO
dc.source.pagenumber	1-14	nb_NO
dc.source.volume	9	nb_NO
dc.source.journal	Applied Sciences	nb_NO
dc.source.issue	23	nb_NO
dc.identifier.doi	10.3390/app9235102
dc.identifier.cristin	1753186
cristin.unitcode	217,8,4,0
cristin.unitname	Institutt for data- og elektroteknologi
cristin.ispublished	true
cristin.qualitycode	1

Associated file(s)

File name: applsci-09-05102-v2.pdf
Size: 1.652Mb
Format: PDF

This item appears in the following collection(s)

Publications from CRIStin [4268]
Scientific publications (TN-IDE) [243]

Unless otherwise stated, this item is licensed under Attribution 4.0 International