Volume 2, Issue 4 (12-2010)                   2010, 2(4): 19-31



1- Artificial Intelligence Department and Advance Research Center (ARC), Islamic Azad University, Mashhad Branch, Iran
2- Electrical Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
Abstract:

Text categorization is one of the well-studied problems in data mining and information retrieval: given a large collection of documents in which each document is associated with its category, the task is to assign categories to new documents. This research proposes a novel approach to classifying English and Persian documents that combines a competitive neural text categorizer with a new representation that we call string vectors. Traditional approaches to text categorization require encoding documents into numerical vectors, which leads to two main problems: huge dimensionality and sparse distribution. Although many feature selection methods have been developed to address the first problem, the reduced dimension still remains large, and if a feature selection method reduces the dimension excessively, the robustness of document categorization degrades. The idea of this research, as a solution to these problems, is to encode the documents into string vectors and feed them to the novel competitive neural text categorizer. Extensive experiments on several benchmarks were conducted. The results indicate that this method can improve document classification performance by up to 13.8% compared with the best traditional algorithm on the standard Reuters-21578 dataset.
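
The contrast between the two encodings can be illustrated with a small sketch. The abstract does not define string vectors precisely, so the code below assumes one plausible reading: a string vector is a fixed-length list of a document's most frequent words. The function names, the toy documents, and the tie-breaking rule are all hypothetical and are not taken from the paper; the competitive neural categorizer itself is not shown.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; keep alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def numerical_vector(tokens, vocabulary):
    # Traditional encoding: one count per vocabulary word, so the
    # vector length grows with the vocabulary of the whole corpus.
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def string_vector(tokens, size):
    # Assumed encoding: the `size` most frequent words of the document,
    # ordered by descending frequency (ties broken alphabetically).
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:size]


# Toy corpus, invented for illustration only.
docs = [
    "oil prices rise as oil supply falls",
    "bank raises interest rates as bank profits rise",
    "wheat harvest falls as rain damages wheat crops",
]

tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({w for toks in tokenized for w in toks})

# Numerical vectors: length equals vocabulary size, mostly zeros.
dense = [numerical_vector(t, vocabulary) for t in tokenized]
zero_share = sum(v.count(0) for v in dense) / (len(dense) * len(vocabulary))

# String vectors: fixed length, independent of vocabulary size.
strings = [string_vector(t, 3) for t in tokenized]

print(len(vocabulary), round(zero_share, 2))
print(strings)
```

Even on three short documents, more than half of the entries in the numerical vectors are zero, and their length grows with every new word in the corpus, while the string vectors keep a fixed length chosen in advance.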

Type of Study: Research | Subject: Information Technology

Rights and permissions
Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.