CUTER: An Efficient Useful Text Extraction Mechanism

Publication Type: Conference Paper
Year of Publication: 2009
Authors: Bouras, C, Poulopoulos, V, Adam, G
Conference Name: The 2009 IEEE International Symposium on Mining and Web (WAM09), Bradford, UK
Date Published: 26 - 29 May

In this paper we present CUTER, a system that processes HTML pages to extract the useful text from them. The mechanism focuses on HTML pages that contain news articles from major portals and blogs. We define useful text as the body of the article that contains the news report. To extract the body of the article, we decompose the HTML page into its DOM model and apply a set of algorithms that clean and correct the HTML code, locate and characterize each node of the DOM model, and finally store the text from the nodes characterized as useful text nodes. CUTER is a subsystem of peRSSonal, a web tool that obtains news articles from all over the world, processes them, and presents them back to end users in a personalized manner. The role of CUTER is to feed peRSSonal with the body of the articles collected from major news portals and blogs. In this paper we present the basic algorithms and experimental results on the efficiency of the CUTER text extractor.
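
The pipeline described in the abstract (parse the page, characterize each node, keep the text of the useful nodes) can be illustrated with a minimal Python sketch. This is not the CUTER algorithm, which the abstract does not specify; it swaps in a common text-length and link-density heuristic as a stand-in. The names `BlockCollector` and `extract_useful_text`, the tag sets, and the thresholds `min_chars` and `max_link_density` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is never article text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: these tags delimit candidate text blocks ("nodes").
BLOCK_TAGS = {"p", "div", "td", "article", "section"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks as (text, link_chars)
        self.buffer = []      # text pieces of the block being built
        self.link_chars = 0   # characters of the current block inside <a>
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0   # > 0 while inside an <a> element

    def flush_block(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append((text, self.link_chars))
        self.buffer = []
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.buffer.append(data)
        if self.link_depth:
            self.link_chars += len(data.strip())


def extract_useful_text(html, min_chars=40, max_link_density=0.3):
    """Keeps blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    useful = [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    return "\n".join(useful)
```

Calling `extract_useful_text(page_source)` on a news page returns the retained blocks joined by newlines. The heuristic rests on the observation that navigation menus tend to be short and link-heavy while article paragraphs are long and mostly plain text; a production extractor would also need the HTML cleaning and correction step the abstract mentions, which this sketch leaves to the parser's built-in tolerance.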