Text Pattern Discovery and Extraction
Master thesis
Permanent link
http://hdl.handle.net/11250/2413858
Publication date
2016-06-15
Collections
- Student theses (TN-IDE) [901]
Abstract
This thesis presents a technique for discovering and extracting unknown patterns in structured data. No prior knowledge is needed to discover the patterns, but applying prior knowledge allows them to be classified. When merging information from structured data, it is important that the correct information is merged together. To achieve this, multiple techniques are needed to analyse the information, and this thesis provides one that can increase the accuracy. Unknown patterns are discovered and extracted by collecting unique values in a trie structure. These patterns are represented as regular expressions and classified using a decision tree. The technique produces regular expressions that are efficient and accurate, and the decision tree classifies correctly with a score greater than 80%. The technique can be used to improve accuracy when merging structured data, increase knowledge about a file, detect ID values, and calculate other measurements, including the consistency of a file and whether it contains typographical errors.
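As a rough illustration of the idea summarised in the abstract, the sketch below collects unique values in a trie and generalises each one into a regular expression. It is not the thesis's algorithm: the names TrieNode, insert, unique_values, to_regex and discover_patterns are hypothetical, and the generalisation rule (ASCII digits become \d, ASCII letters become [A-Za-z], everything else is kept literally) is an assumption made for this example. The decision-tree classification step is not shown.

```python
import re
import string
from collections import Counter


class TrieNode:
    """One trie node; `terminal` marks the end of a stored value."""

    def __init__(self):
        self.children = {}
        self.terminal = False


def insert(root, value):
    """Insert a value into the trie; duplicates collapse onto the same path."""
    node = root
    for ch in value:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True


def unique_values(node, prefix=""):
    """Yield every distinct value stored in the trie."""
    if node.terminal:
        yield prefix
    for ch, child in sorted(node.children.items()):
        yield from unique_values(child, prefix + ch)


def char_class(ch):
    """Assumed generalisation rule: digit, letter, or escaped literal."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)


def to_regex(value):
    """Turn one value into a regex by run-length encoding its character classes."""
    runs = []  # list of [class, repeat count]
    for ch in value:
        cls = char_class(ch)
        if runs and runs[-1][0] == cls:
            runs[-1][1] += 1
        else:
            runs.append([cls, 1])
    return "".join(c if n == 1 else f"{c}{{{n}}}" for c, n in runs)


def discover_patterns(values):
    """Count how many unique values share each generalised pattern."""
    root = TrieNode()
    for v in values:
        insert(root, v)
    return Counter(to_regex(v) for v in unique_values(root))


if __name__ == "__main__":
    column = ["AB-123", "CD-456", "AB-123", "2016-06-15"]
    for pattern, count in discover_patterns(column).items():
        print(count, pattern)
```

In this sketch, a pattern shared by most unique values in a column hints at a consistent format, while rare patterns may point to typographical errors; how the thesis actually measures consistency is not stated in the abstract.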
Description
Master's thesis in Computer science