The immense growth of digital text data creates demand for automatic text analysis tools for information retrieval. Plain text alone provides sufficient information for a heuristic approach to identify meaningful keywords. Texts, whether documents or text streams, also have an inherent structure that conveys information about their content. In this thesis, two approaches for retrieving meaningful information from single documents are developed: keyword extraction and the detection of structural changes in texts. Combining multiple heuristic keyword extraction algorithms is superior to using individual methods and can significantly improve the quality of the results. To further this idea in the first part of my thesis, I compare different combination methods and use PCA as a parameter-free and effective method to determine optimal combination candidates. I then demonstrate the success of these methods with an efficient and flexible keyword extraction approach that is language-independent, fast, and does not require a training phase. The results of this algorithm are deemed meaningful, and its performance is superior to that of the well-known TF-IDF. In the second part of my thesis, I analyze the structure of text documents and develop a novel algorithm that detects structural changes. This algorithm identifies fluctuations in the composition of a text. It is flexible, language-independent, and works on single documents as well as on text streams of indefinite length. I demonstrate the accuracy of my approach using real-world examples and evaluate its performance against a benchmark algorithm. As an application of my work, I integrate a keyword extraction approach into the CommunityMashup as part of a collaboration. The CommunityMashup is a data aggregation solution for different social networks. By extracting keywords in almost real time, we are able to identify new relations between contents and people and to visualize them with an interactive and platform-independent solution.