
dc.contributor.author: Shaukat, Kamran
dc.contributor.author: Masood, Nayyer
dc.contributor.author: Khushi, Matloob
dc.date.accessioned: 2020-01-28T10:15:46Z
dc.date.available: 2020-01-28T10:15:46Z
dc.date.created: 2019-11-27T15:01:38Z
dc.date.issued: 2019-11
dc.identifier.citation: Shaukat, K., Masood, N., Khushi, M. (2019) A Novel Approach to Data Extraction on Hyperlinked Webpages, Applied Sciences, 9(23) [nb_NO]
dc.identifier.issn: 2076-3417
dc.identifier.uri: http://hdl.handle.net/11250/2638286
dc.description.abstract: The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further details about certain attribute values. Extracting the schema of such relational tables is a challenge due to the non-existence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages from various websites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used for the classification of table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables, and for linked tables a relational schema in the form of primary and foreign keys (PK and FK) was developed. Child tables were concatenated with the parent table's attribute value (PK), which serves as the foreign key (FK). As a result, these tables can support more powerful queries using the join operation. Manual checking of the linked web table results showed 99% precision and 68% recall. Our 15,000-strong downloadable corpus and novel algorithm will provide the basis for further research in this field. [nb_NO]
dc.language.iso: eng [nb_NO]
dc.publisher: MDPI [nb_NO]
dc.rights: Attribution 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/deed.no [*]
dc.subject: IT [nb_NO]
dc.subject: nettabeller [nb_NO]
dc.subject: web tables [nb_NO]
dc.subject: informasjonsteknologi [nb_NO]
dc.title: A Novel Approach to Data Extraction on Hyperlinked Webpages [nb_NO]
dc.type: Journal article [nb_NO]
dc.type: Peer reviewed [nb_NO]
dc.description.version: publishedVersion [nb_NO]
dc.rights.holder: © 2019 by the authors [nb_NO]
dc.subject.nsi: VDP::Technology: 500::Information and communication technology: 550 [nb_NO]
dc.source.pagenumber: 1-14 [nb_NO]
dc.source.volume: 9 [nb_NO]
dc.source.journal: Applied Sciences [nb_NO]
dc.source.issue: 23 [nb_NO]
dc.identifier.doi: 10.3390/app9235102
dc.identifier.cristin: 1753186
cristin.unitcode: 217,8,4,0
cristin.unitname: Institutt for data- og elektroteknologi
cristin.ispublished: true
cristin.qualitycode: 1



Except where otherwise noted, this item is licensed under Attribution 4.0 International