Text Pattern Discovery and Extraction

Morten, Waersland

Master thesis

Open

Waersland_Morten.pdf (1.110Mb)

Permanent link

http://hdl.handle.net/11250/2413858

Publication date

2016-06-15

Collections

Student theses (TN-IDE) [901]

Abstract

This thesis presents a technique for discovering and extracting unknown patterns in structured data. No prior knowledge is needed to discover the patterns, but applying prior knowledge allows them to be classified. When merging information from structured data, it is important that the correct information is merged together. To achieve this, multiple techniques are needed to analyse the information, and this thesis provides one that can increase the accuracy. By collecting unique values in a trie structure, unknown patterns are discovered and extracted. These patterns are represented as regular expressions and classified using a decision tree. The technique produces regular expressions that are efficient and accurate, and the decision tree classifies correctly with a score greater than 80%. The technique can be used to improve accuracy when merging structured data, increase knowledge about a file, detect ID values, and calculate other measurements, including the consistency of a file and whether it contains typographical errors.
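To make the trie-to-regular-expression idea concrete, the following is a minimal sketch in Python. It is not the thesis's algorithm, which the abstract does not spell out. The generalisation of characters into `\d` and `[A-Za-z]`, the run-length compression, and the names `char_class`, `build_trie`, `extract_patterns`, and `compress` are all assumptions made for illustration. The decision-tree classification step is not covered.

```python
import itertools
import re
import string


def char_class(ch):
    """Map a character to a coarse regex token (an assumed generalisation)."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)  # keep punctuation and other characters as literals


class TrieNode:
    def __init__(self):
        self.children = {}  # regex token -> TrieNode
        self.count = 0      # number of unique values that end at this node


def build_trie(values):
    """Insert each unique value into a trie keyed by character class."""
    root = TrieNode()
    for value in set(values):
        node = root
        for ch in value:
            node = node.children.setdefault(char_class(ch), TrieNode())
        node.count += 1
    return root


def compress(tokens):
    """Collapse runs of the same token, e.g. three \\d tokens become \\d{3}."""
    parts = []
    for token, group in itertools.groupby(tokens):
        n = len(list(group))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


def extract_patterns(root):
    """Return {regex: number of unique values} for every terminal trie node."""
    patterns = {}
    stack = [(root, [])]
    while stack:
        node, tokens = stack.pop()
        if node.count:
            patterns[compress(tokens)] = node.count
        for token, child in node.children.items():
            stack.append((child, tokens + [token]))
    return patterns


if __name__ == "__main__":
    # Hypothetical column values; the duplicate is collapsed before insertion.
    column = ["2016-06-15", "2015-12-01", "AB123", "CD456", "2016-06-15"]
    for pattern, count in extract_patterns(build_trie(column)).items():
        print(pattern, count)
```

In this sketch, values that share a shape end at the same trie node, so each terminal node yields one regular expression together with the number of unique values it covers. A column dominated by a single pattern would then look consistent, while rare patterns could hint at typographical errors.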

Description

Master's thesis in Computer science

Publisher

University of Stavanger, Norway

Series

Masteroppgave/UIS-TN-IDE/2016;