
Document Structure Analysis for Large Electronic Document Collections

A. Stoffel

2013
Dissertation

Besides document collections that contain a wide variety of documents, such as the web, there are collections that gather documents of a single type or of very few types. For example, the EDGAR database of the SEC collects filings from companies that must report regularly on different aspects of their business. Similar databases exist in medicine and in industry. Because such databases contain very similar documents, it is often desirable to provide analysis functionality that can automatically gain insight into the filed information. A major problem for automatic processing is the lack of structured information in electronic documents. Most electronic document formats used for archiving are based on visual representations intended for human readers. This makes automatic processing complex, because relevant and irrelevant content cannot easily be distinguished automatically. This thesis addresses this issue and describes and evaluates techniques for logical and functional structure analysis. The presented techniques are based on machine learning. Whereas the analysis of logical structure mainly uses geometric and formatting information, the analysis of functional structure inspects the textual content. The problem of identifying and analyzing errors in the structure analysis results is solved with visualization. The variable text scaling technique is designed to highlight interesting parts of logical and functional structures. It can also be used to visualize keyword search results in document viewers. Several examples using the presented techniques are then discussed. The thesis concludes with a summary of the results and a discussion of open research questions.
