dc.description.abstract | This thesis describes a project that aims to develop a geoparser for Norwegian-language
text. A geoparser is a tool that reads a piece of text, extracts any potential location mentions,
and then resolves these location mentions to their real-world toponyms. At the time this
thesis was written, no known geoparser specialized exclusively in Norwegian text.
The solution produced here is therefore unique in this respect.
The task of geoparsing is non-trivial, as there are often many geographical locations that
share the same name. The geoparser must therefore disambiguate each location mention
using whatever clues are available. In this project, the geoparser tries to infer
geographical regions of relevance, and also try to identify potential geographical hierarchies
between the different location mentions in the text. It also relies on common
geoparsing heuristics, such as treating population size as a strong indicator of toponym
importance. To find potential candidates for a location mention, it uses GeoNames, a geographical
gazetteer containing entries for more than 11 million toponyms from all over the world. It also
uses Stedsnavn, a Norwegian dataset containing over 1 million Norwegian toponym entries.
Basic testing is done to check the viability of the solution, but evaluating the geoparser
more generally is difficult, as there are no suitable datasets with which to test it. | |