Using various Natural Language Processing Techniques to Automate Information Retrieval

Tesfay, Tsegazab

dc.contributor.advisor	Ferhat Özgur Catak
dc.contributor.author	Tesfay, Tsegazab
dc.date.accessioned	2022-09-29T15:51:19Z
dc.date.available	2022-09-29T15:51:19Z
dc.date.issued	2022
dc.identifier	no.uis:inspera:92613016:64152896
dc.identifier.uri	https://hdl.handle.net/11250/3022599
dc.description.abstract	Natural Language Processing (NLP) provides numerous benefits, including the understanding and analysis of unstructured data and the efficient, precise automation of real-time processes. Although NLP dates back to the 1940s, applications that exploit it have never been more important than in the last two decades: as more people gain access to the internet and digital devices, the volume of collected data grows. NLP and automated processes therefore play a significant role in the quality and performance of the services that users encounter. Datasets are not always structured, and their handling is not always automated, either because of the size of the data or because of how long the company has been collecting data. Several studies have shown that unstructured data contains useful information that, when managed properly, can point businesses in the right direction. To address these issues, it is critical to combine NLP with Machine Learning (ML) or Deep Learning (DL) algorithms, which can handle structured data, unstructured data, or both. The algorithms automatically learn the language patterns in a given text and use those patterns to classify unseen or validation data. Hyperparameter optimization is also performed for both supervised and unsupervised machine learning to make the algorithms as flexible as possible while achieving the desired results. The goal of this thesis is to develop an automated system that classifies files using various NLP techniques in conjunction with the ML/DL algorithm that produces the best performance. Autility AS is a young company focused on the digitalization of buildings. Autility holds thousands of structured and unstructured files and intends to use an automated system to extract information and classify files according to the system codes in "SYSTEMKODELISTE NS3451". 
The "SYSTEMKODELISTE NS3451" is the "backbone" of the entire system creation process. The first part of the main "SYSTEMKODELISTE NS3451" from Norway's Statsbygg is shown in Figure 1; only 12 rows of the standard list are displayed. The labeled dataset produces models with an average accuracy of roughly 85%. However, because the dataset contains far more unstructured files than structured files, research into algorithms that handle both structured and unstructured data is critical. Because many of the files contained building drawings and pictures, the results of the semi-supervised algorithms indicated the importance of formal language. To ensure consistent performance and reduce overfitting, text augmentation and hyperparameter tuning are used. The assumptions made and the challenges faced are documented throughout this project. A few algorithms are presented in detail, along with their theoretical and mathematical concepts.
dc.language	eng
dc.publisher	uis
dc.title	Using various Natural Language Processing Techniques to Automate Information Retrieval
dc.type	Master thesis
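
The abstract describes algorithms that learn language patterns from labeled text and apply them to unseen files. As a minimal sketch of that idea, and not the thesis's actual method (the abstract does not name the algorithms used), the following standard-library Python example trains a multinomial Naive Bayes classifier on bag-of-words counts. The class name `NaiveBayesTextClassifier`, the toy training sentences, and the labels `ventilation` and `electrical` are all hypothetical placeholders; they are not real NS3451 system codes or Autility data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-zæøå]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over bag-of-words counts (illustrative only)."""

    def __init__(self, alpha=1.0):
        # alpha is the additive smoothing constant, a tunable hyperparameter.
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen during training carry no evidence, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for t in tokens:
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data standing in for text extracted from building files.
train_texts = [
    "ventilation duct air handling unit filter",
    "air supply fan ventilation exhaust",
    "circuit breaker cable distribution board voltage",
    "lighting fixture cable power socket",
]
train_labels = ["ventilation", "ventilation", "electrical", "electrical"]

clf = NaiveBayesTextClassifier(alpha=1.0).fit(train_texts, train_labels)
print(clf.predict("replace the filter in the air handling unit"))  # ventilation
print(clf.predict("cable for the distribution board"))             # electrical
```

The smoothing constant `alpha` is the kind of hyperparameter that the optimization step mentioned in the abstract would tune, for example by comparing accuracy on held-out files.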

Associated file(s)

Filename: no.uis:inspera:92613016:641528 ...
Size: 3.251Mb
Format: PDF

This item appears in the following collection(s)

Student theses (TN-IDE) [866]
Student theses in information technology, computer engineering / cybernetics, signal processing