Text document pre-processing using the Bayes formula for classification based on the vector space model

Isa, Dino and Hong, Lee Lam and Kallimani, V.P. and Rajkumar, R. (2008) Text document pre-processing using the Bayes formula for classification based on the vector space model. Computer and Information Science, 1 (4). pp. 79-90. ISSN 1913-8989

Available under Licence Creative Commons Attribution.

Abstract

This work uses the Bayes formula to vectorize a document according to a probability distribution, based on keywords, over the categories that the document may belong to. For a predetermined set of topics (categories), the Bayes formula gives the probability that the document belongs to each one. With this probability distribution as the vector representing the document, text classification algorithms based on the vector space model, such as the Support Vector Machine (SVM) and the Self-Organizing Map (SOM), can then classify the documents on a multi-dimensional level. This improves on the results obtained when only the highest probability is used to classify the document, as happens when the naïve Bayes classifier is used by itself. The effects of an inadvertent dimensionality reduction can be overcome using these algorithms. We compare the performance of these classifiers for high dimensional data.
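
The vectorization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`train_keyword_model`, `bayes_vector`), the toy training data, the multinomial word model, and the Laplace smoothing are all assumptions made for illustration. It only shows how the Bayes formula can turn a document into a vector of per-category probabilities.

```python
import math
from collections import Counter


def train_keyword_model(labeled_docs):
    """Count keyword occurrences per category from (tokens, category) pairs."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for tokens, category in labeled_docs:
        word_counts.setdefault(category, Counter()).update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def bayes_vector(tokens, model, categories):
    """Return P(category | document) for each category, in the given order."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = []
    for category in categories:
        counts = word_counts[category]
        # Laplace (add-one) smoothing: an assumption of this sketch.
        denom = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for token in tokens:
            if token in vocab:  # words unseen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        log_scores.append(score)
    # Normalize in log space so the probabilities sum to 1.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy corpus with two categories.
training = [
    ("goal match team score".split(), "sport"),
    ("team win league match".split(), "sport"),
    ("stock market price share".split(), "finance"),
    ("bank market interest rate".split(), "finance"),
]
categories = ["finance", "sport"]
model = train_keyword_model(training)

# The document is represented by the whole probability vector,
# not only by its most probable category.
vector = bayes_vector("team share price match".split(), model, categories)
```

A plain naïve Bayes classifier would keep only the largest entry of `vector`. In the approach the abstract describes, the full vector (one dimension per category) is instead passed to a vector space classifier such as an SVM or SOM.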

Item Type: Article
Schools/Departments: University of Nottingham, Malaysia Campus > Faculty of Science > School of Computer Science
Depositing User: Davies, Mrs Sarah
Date Deposited: 29 Apr 2014 13:22
Last Modified: 14 Sep 2016 03:37
URI: http://eprints.nottingham.ac.uk/id/eprint/2995
