The organisation and visualisation of document corpora: A probabilistic approach

Girolami, M; Vinokourov, A; Kaban, A; (2000) The organisation and visualisation of document corpora: A probabilistic approach. In: Tjoa, AM and Wagner, RR and AlZobaidie, A, (eds.) 11TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATION, PROCEEDINGS. (pp. 558 - 564). IEEE COMPUTER SOC

Full text not available from this repository.

Abstract

In this paper a generic probabilistic framework for the unsupervised organisation and visualisation of document collections is presented. The probabilistic hierarchical clustering of large-scale, sparse and high-dimensional data collections is achieved by the development of a family of latent class models which are parameterised using the expectation maximisation algorithm. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis and these have been termed symmetric and asymmetric models. For text data specifically, both asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. The subsequent visualisation of document collections is achieved by exploiting the topographic relations between similar documents. A latent trait model is developed which provides the means of viewing vector space document representations on a 2-D grid and thereby visualising the inherent structure of the document collection. A number of experiments are provided to demonstrate the technique and a concluding discussion on the proposed models is given.
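
As a rough illustration of the kind of latent class model the abstract refers to, the sketch below fits a flat mixture of multinomials to term-count vectors with the expectation maximisation algorithm. It is not the authors' implementation: the paper's models are hierarchical and come in symmetric and asymmetric forms, whereas this sketch has a single level. The function name `fit_multinomial_mixture`, its parameters, the additive smoothing constant and the toy counts are all assumptions made for the example.

```python
import math
import random


def fit_multinomial_mixture(counts, n_classes, n_iter=50, seed=0, smoothing=1e-2):
    """EM for a flat mixture of multinomials (illustrative, not the paper's model).

    counts: list of documents, each a list of non-negative term counts.
    Returns (priors, term_probs, responsibilities).
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    priors = [1.0 / n_classes] * n_classes

    # Random, normalised starting term distribution for each latent class.
    term_probs = []
    for _ in range(n_classes):
        row = [rng.random() + 0.5 for _ in range(n_terms)]
        total = sum(row)
        term_probs.append([v / total for v in row])

    resp = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities per document, computed in log space.
        resp = []
        for doc in counts:
            log_post = [
                math.log(priors[k])
                + sum(c * math.log(term_probs[k][t]) for t, c in enumerate(doc) if c)
                for k in range(n_classes)
            ]
            peak = max(log_post)
            weights = [math.exp(v - peak) for v in log_post]
            total = sum(weights)
            resp.append([w / total for w in weights])

        # M-step: re-estimate smoothed class priors and term distributions.
        for k in range(n_classes):
            mass = sum(resp[d][k] for d in range(n_docs))
            priors[k] = (mass + smoothing) / (n_docs + n_classes * smoothing)
            row = [
                smoothing + sum(resp[d][k] * counts[d][t] for d in range(n_docs))
                for t in range(n_terms)
            ]
            total = sum(row)
            term_probs[k] = [v / total for v in row]

    return priors, term_probs, resp


# Toy corpus (hypothetical): three documents dominated by terms 0-1, three by terms 2-3.
docs = [[5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 1, 0],
        [0, 0, 5, 4], [0, 1, 4, 6], [0, 0, 6, 5]]
priors, term_probs, resp = fit_multinomial_mixture(docs, n_classes=2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
```

Each entry of `labels` is the most probable latent class for the corresponding document. A hierarchical variant along the lines the abstract describes would presumably apply a mixture of this kind recursively within each class, but that step is not shown here.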

Type: Proceedings paper
Title: The organisation and visualisation of document corpora: A probabilistic approach
Event: 11th International Workshop on Database and Expert Systems Applications
Location: LONDON, ENGLAND
Dates: 2000-09-04 - 2000-09-08
ISBN: 0-7695-0680-1
UCL classification: UCL > School of BEAMS > Faculty of Maths and Physical Sciences > Statistical Science
