A probabilistic framework for the hierarchic organisation and classification of document collections.
Journal of Intelligent Information Systems, pp. 153-172.
This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. The probabilistic corpus model that emerges from the automatic (unsupervised) hierarchical organisation of a document collection can be further exploited to create a kernel that boosts the performance of state-of-the-art support vector machine document classifiers. The performance of such a classifier is enhanced further when the kernel is derived from an appropriate hierarchic mixture model used to partition the document corpus, rather than from a flat, non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. The approach can be viewed as an effective combination of unlabeled documents (those with no topic or class labels), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure) to provide enhanced document classification performance.
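As a rough illustration of how a mixture model can give rise to a kernel, the sketch below builds a Gram matrix from per-document component posteriors. It is not the construction used in the paper, which this abstract does not specify. It assumes each document already has a vector of posteriors P(z|d) from some fitted mixture model, uses a plain inner product over those vectors, and ignores any hierarchy. The function name `posterior_kernel` and the toy posterior values are hypothetical.

```python
import numpy as np


def posterior_kernel(P, Q):
    """Inner-product kernel over mixture posteriors (illustrative assumption).

    P has shape (n_docs_a, n_components) and Q has shape
    (n_docs_b, n_components); each row holds P(z | d) for one document.
    Returns K with K[i, j] = sum_z P[i, z] * Q[j, z].
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return P @ Q.T


# Made-up posteriors for three documents over two mixture components.
posteriors = np.array([
    [0.9, 0.1],  # document 0: mostly component 0
    [0.8, 0.2],  # document 1: mostly component 0
    [0.1, 0.9],  # document 2: mostly component 1
])

# Gram matrix: documents with similar posteriors receive larger kernel values.
K = posterior_kernel(posteriors, posteriors)
```

Because `K` is a Gram matrix of inner products, it is symmetric and positive semi-definite, so it could in principle be supplied to an SVM implementation that accepts precomputed kernels, such as scikit-learn's `SVC(kernel="precomputed")`.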
|Keywords:||hierarchical probabilistic clustering, probabilistic latent semantic analysis, text categorization, support vector machines, networks, models|
|UCL classification:||UCL > School of BEAMS > Faculty of Maths and Physical Sciences > Statistical Science|