The organisation and visualisation of document corpora: A probabilistic approach.
In: Tjoa, AM and Wagner, RR and AlZobaidie, A, (eds.)
11TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATION, PROCEEDINGS.
(pp. 558 - 564).
IEEE COMPUTER SOC
In this paper a generic probabilistic framework for the unsupervised organisation and visualisation of document collections is presented. The probabilistic hierarchical clustering of large-scale, sparse and high-dimensional data collections is achieved by the development of a family of latent class models which are parameterised using the expectation maximisation algorithm. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis, and these have been termed symmetric and asymmetric models. For text data specifically, both asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. The subsequent visualisation of document collections is achieved by exploiting the topographic relations between similar documents. A latent trait model is developed which provides the means of viewing vector space document representations on a 2-D grid and thereby visualising the inherent structure of the document collection. A number of experiments are provided to demonstrate the technique and a concluding discussion on the proposed models is given.
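To make the idea of a latent class model fitted by expectation maximisation concrete, the following is a minimal sketch of EM for a flat mixture of multinomials over term-count vectors. It is an illustration only, not the paper's hierarchical symmetric or asymmetric models: the function name `fit_multinomial_mixture`, the deterministic seeding of each class from one document, and the additive `smoothing` constant are all assumptions introduced here.

```python
import math


def fit_multinomial_mixture(counts, n_classes, n_iter=30, smoothing=0.01):
    """Fit a flat mixture of multinomials to term-count vectors with EM.

    counts: list of documents, each a list of non-negative term counts.
    Returns (priors, term_probs, responsibilities).
    """
    n_docs, n_terms = len(counts), len(counts[0])

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Assumption: each class is seeded from an evenly spaced document
    # (add-one smoothed) so the sketch is deterministic. The abstract does
    # not say how the paper initialises its models.
    priors = [1.0 / n_classes] * n_classes
    term_probs = [
        normalise([c + 1.0 for c in counts[(k * n_docs) // n_classes]])
        for k in range(n_classes)
    ]

    responsibilities = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities per document, computed in
        # log space and shifted by the maximum to avoid underflow.
        responsibilities = []
        for doc in counts:
            log_post = [
                math.log(priors[k])
                + sum(c * math.log(term_probs[k][t]) for t, c in enumerate(doc))
                for k in range(n_classes)
            ]
            peak = max(log_post)
            responsibilities.append(
                normalise([math.exp(v - peak) for v in log_post])
            )

        # M-step: re-estimate mixing proportions and term distributions from
        # responsibility-weighted counts. The smoothing term (an assumption)
        # keeps every probability strictly positive.
        for k in range(n_classes):
            weight = sum(r[k] for r in responsibilities)
            priors[k] = (weight + smoothing) / (n_docs + n_classes * smoothing)
            expected = [
                sum(r[k] * doc[t] for r, doc in zip(responsibilities, counts))
                + smoothing
                for t in range(n_terms)
            ]
            term_probs[k] = normalise(expected)

    return priors, term_probs, responsibilities


# Toy corpus (hypothetical): three documents use terms 0-1, three use terms 2-3.
docs = [[5, 4, 0, 0], [4, 6, 0, 0], [6, 5, 0, 0],
        [0, 0, 5, 5], [0, 0, 4, 6], [0, 0, 6, 4]]
priors, term_probs, resp = fit_multinomial_mixture(docs, n_classes=2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
print(labels)
```

Each row of `resp` is a soft assignment of one document to the latent classes. A hierarchical variant along the lines described above would presumably refit such mixtures within each class, but that step is not shown here.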
|Title:||The organisation and visualisation of document corpora: A probabilistic approach|
|Event:||11th International Workshop on Database and Expert Systems Applications|
|Dates:||2000-09-04 - 2000-09-08|
|UCL classification:||UCL > School of BEAMS > Faculty of Maths and Physical Sciences > Statistical Science|