DORAS | DCU Research Repository

A study on mutual information-based feature selection for text categorization

Xu, Yang, Jones, Gareth J.F. (ORCID: 0000-0003-2923-8365), Li, Jintao, Wang, Bin and Sun, ChunMing (2007) A study on mutual information-based feature selection for text categorization. Journal of Computational Information Systems, 3 (3). pp. 1007-1012. ISSN 1553-9105

Abstract
Feature selection plays an important role in text categorization (TC). Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in TC. Many existing experiments show that IG is one of the most effective methods; by contrast, MI has been demonstrated to have relatively poor performance. According to one existing MI method, the mutual information of a category c and a term t can be negative, which conflicts with the definition of MI derived from information theory, where it is always non-negative. We show that the form of MI used in TC is not derived correctly from information theory. Two different MI-based feature selection criteria are referred to as MI in the TC literature; one of them should correctly be termed "pointwise mutual information" (PMI). In this paper, we clarify the terminological confusion surrounding the notion of "mutual information" in TC, and detail an MI method derived correctly from information theory. Experiments with the Reuters-21578 and OHSUMED collections show that the corrected MI method’s performance is similar to that of IG, and that it is considerably better than PMI.
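
The distinction drawn in the abstract can be illustrated with a minimal sketch based on the standard textbook definitions of PMI and MI for a term and a category; it is not taken from the paper, whose exact formulation may differ. The function names and the document-count parameters (`n_tc`, `n_t`, `n_c`, `n`) are assumptions introduced for illustration only.

```python
import math


def pmi(n_tc, n_t, n_c, n):
    """Pointwise mutual information log2(P(t,c) / (P(t) * P(c))).

    Hypothetical parameters (document counts):
      n_tc: documents in category c that contain term t (assumed > 0)
      n_t:  documents that contain term t
      n_c:  documents in category c
      n:    all documents
    """
    return math.log2((n_tc * n) / (n_t * n_c))


def mutual_information(n_tc, n_t, n_c, n):
    """MI between the binary variables 'term present' and 'in category'.

    Sums P(x,y) * log2(P(x,y) / (P(x) * P(y))) over all four cells of
    the 2x2 contingency table, so the result is never negative.
    """
    # Each entry: (joint count, term-side marginal, category-side marginal)
    cells = [
        (n_tc, n_t, n_c),                            # present, in c
        (n_t - n_tc, n_t, n - n_c),                  # present, not in c
        (n_c - n_tc, n - n_t, n_c),                  # absent, in c
        (n - n_t - n_c + n_tc, n - n_t, n - n_c),    # absent, not in c
    ]
    total = 0.0
    for joint, row, col in cells:
        if joint > 0:  # the term 0 * log 0 is taken to be 0
            total += (joint / n) * math.log2((joint * n) / (row * col))
    return total


# A term that occurs in category c less often than chance would predict:
# 5 co-occurrences observed versus 20 expected under independence.
negative_pmi = pmi(5, 100, 200, 1000)
positive_mi = mutual_information(5, 100, 200, 1000)

# A term that is statistically independent of the category.
independent_mi = mutual_information(20, 100, 200, 1000)
```

In this toy example the PMI score is negative because the term is under-represented in the category, whereas the MI score for the same counts is positive, since a term that signals absence from a category is still informative. When term and category are independent, MI is zero.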
Metadata
Item Type: Article (Published)
Refereed: Yes
Uncontrolled Keywords: feature selection; text categorization; text categorisation
Subjects: Computer Science > Information retrieval
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Publisher: Binary Information Press
Copyright Information: © 2007 Binary Information Press.
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
ID Code: 16194
Deposited On: 14 Jun 2011 13:47 by Shane Harper. Last Modified: 03 Feb 2023 16:21
Documents

Full text available as:

A_Study_on_Mutual_Information-based_Feature_Selection_for_Text_Categorization.pdf (PDF, 130kB)