Data Fingerprinting -- Identifying Files and Tables with Hashing Schemes

Tobiesen, Ole

dc.contributor.advisor	Davidrajuh, Reggie
dc.contributor.author	Tobiesen, Ole
dc.date.accessioned	2016-10-10T10:21:26Z
dc.date.available	2016-10-10T10:21:26Z
dc.date.issued	2016-06-15
dc.identifier.uri	http://hdl.handle.net/11250/2413856
dc.description	Master's thesis in Computer science	nb_NO
dc.description.abstract	INTRODUCTION: Hash functions are nothing new, and their use is not limited to cryptographic purposes. One important field is data fingerprinting, where the purpose is to generate a digest that serves as a fingerprint (or a license plate) identifying a file uniquely. More recently, fuzzy fingerprinting schemes, which scrap the avalanche effect in favour of detecting local changes, have come into the spotlight. The main purpose of this project is to find ways to classify text tables and to discover where potential changes or inconsistencies have occurred. METHODS: Large parts of this report can be considered applied discrete mathematics; finite fields and combinatorics have played an important part. Rabin’s fingerprinting scheme was tested extensively and compared against existing cryptographic algorithms, CRC and FNV. Moreover, a self-designed fuzzy hashing algorithm with the preliminary name No-Frills Hash (NFHash) has been created and tested against Nilsimsa and Spamsum. NFHash is based on Mersenne primes and uses a sliding window to create a fuzzy hash. Furthermore, the usefulness of lookup tables (with partial seeds) was explored. The fuzzy hashing algorithm has also been combined with a k-NN classifier to assess its ability to classify files. In addition to NFHash, Bloom filters combined with Merkle Trees have been the most important part of this report. This combination allows a user to see where a change was made, despite the fact that hash functions are one-way. Large parts of this project have dealt with the study of other open-source libraries and applications, such as Cassandra and SSDeep, as well as how Bitcoin works. Optimizations have played a crucial role as well; different approaches to a problem might lead to the same solution, but resource consumption can be very different.
RESULTS: The results show that the Merkle Tree-based approach can track changes to a table quickly and efficiently, because it is economical with CPU resources. Moreover, the self-designed algorithm NFHash also does well in terms of file classification when it is coupled with a k-NN classifier. CONCLUSION: The term hash function refers to a very diverse set of algorithms, not just algorithms that serve a limited purpose. Fuzzy fingerprinting schemes can still be considered to be in their infancy, but a lot has happened in the last ten years. This project has introduced two new ways to create hashes that can be compared across similar, yet not necessarily identical, files, or used to detect whether (and to what extent) a file was changed. Note that the algorithms presented here should be considered prototypes and might still need large-scale testing to sort out potential flaws.	nb_NO
dc.language.iso	eng	nb_NO
dc.publisher	University of Stavanger, Norway	nb_NO
dc.relation.ispartofseries	Masteroppgave/UIS-TN-IDE/2016;
dc.rights	Navngivelse 3.0 Norge	*
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/no/	*
dc.subject	Mersenne Primes	nb_NO
dc.subject	Merkle Trees	nb_NO
dc.subject	Damerau-Levenshtein	nb_NO
dc.subject	data fingerprinting	nb_NO
dc.subject	hash function	nb_NO
dc.subject	finite fields	nb_NO
dc.subject	k-nearest neighbor	nb_NO
dc.title	Data Fingerprinting -- Identifying Files and Tables with Hashing Schemes	nb_NO
dc.type	Master thesis	nb_NO
dc.subject.nsi	VDP::Technology: 500::Information and communication technology: 550::Computer technology: 551	nb_NO
dc.source.pagenumber	145	nb_NO
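
The abstract states that a Merkle Tree lets a user see where a table was changed even though hash functions are one-way. The following is a minimal sketch of that idea only, not the thesis's implementation: it omits the Bloom filter component, assumes SHA-256 as the node hash, assumes both versions of the table have the same number of rows, and uses hypothetical names (`build_levels`, `changed_rows`) chosen for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    # Node hash; SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(data).digest()


def build_levels(rows):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(r.encode("utf-8")) for r in rows]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: pair the last node with itself.
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def changed_rows(old, new):
    """Return indices of differing rows by descending from the root.

    Subtrees whose hashes match are skipped, so only the paths that
    lead to modified rows are visited.
    """
    if len(old[0]) != len(new[0]):
        raise ValueError("this sketch assumes equal row counts")
    stack = [(len(old) - 1, 0)]  # (level index, node index), starting at root
    out = []
    while stack:
        depth, i = stack.pop()
        if i >= len(old[depth]) or old[depth][i] == new[depth][i]:
            continue  # padding slot or unchanged subtree
        if depth == 0:
            out.append(i)
        else:
            stack.append((depth - 1, 2 * i))
            stack.append((depth - 1, 2 * i + 1))
    return sorted(out)


before = ["id,name", "1,Ann", "2,Bob", "3,Cy", "4,Di"]
after = ["id,name", "1,Ann", "2,Rob", "3,Cy", "4,Di"]
print(changed_rows(build_levels(before), build_levels(after)))
```

Comparing the two root hashes alone answers whether anything changed; the descent is only needed to locate the change, and it touches a number of nodes proportional to the tree height for each modified row.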

Associated file(s)

Filename: tobiesen_ole.pdf
Size: 2.195 MB
Format: PDF

This item appears in the following collection(s)

Student theses (TN-IDE) [866]
Student theses in information technology, computer engineering / cybernetics, signal processing

Unless otherwise stated, this item is licensed under Navngivelse 3.0 Norge (Attribution 3.0 Norway)