Visual Document Analysis: Towards a Semantic Analysis of Large Document Collections

2010
Dissertation

Abstract

Large amounts of data are available only in textual form. However, because text is semi-structured and natural language is highly flexible and complex, developing automatic methods for text analysis is a challenging task. This work is centered on a framework for analyzing documents and document collections that takes the whole document analysis process into account. Central to this framework is the idea that most analysis tasks do not require full text understanding. Instead, one or several semantic aspects of the text (called quasi-semantic properties) can be identified that are relevant for answering the analysis task. This makes it possible to search in a targeted way for combinations of measurable text features that approximate the specific semantic aspect. These approximations are then used to solve the analysis task computationally or to support the visual analysis of a document or document collection. The thesis discusses the framework theoretically and presents concrete application examples in four domains: literature analysis, readability analysis, the extraction of discriminating and overlap terms, and sentiment and opinion analysis. These examples demonstrate the advantages of working with the framework. A focus is placed on showing where and how visualization techniques can provide valuable support in the document analysis process. Novel visualizations are introduced, and common ones are evaluated for their suitability in this context. Furthermore, several examples show how good approximations of semantic aspects of a document can be found and how given measures can be evaluated and improved.