Date of Award
Doctor of Philosophy
A general model of a bibliographic retrieval system is presented which has five main elements: the documents, the queries, the thesaurus of indexing terms, the search algorithms and the physical storage locations. This is adapted to produce a probabilistic model which is suitable for simulation purposes, concentrating on the assignment of index terms to documents. The probabilistic model uses the distribution of terms over documents and over queries, the distribution of exhaustivity over documents and over queries, the distribution of co-occurrences (occurrences of pairs of terms), and the distribution of relevant and non-relevant documents over the number of terms matching the query. Several theoretical distributions were tested against four databases to find the best-fitting distributions using the chi-square criterion. The distribution of terms over documents was split into two parts. The low-frequency terms were analyzed using the number of terms which occurred x times, called the frequency-size approach. The high-frequency terms were ranked by the number of occurrences in documents and analyzed using the rank versus the frequency of the term, called the frequency-rank approach. It was found that a generalized Zipf distribution fit the frequency-size portion and a generalized Bradford or log-rank distribution was best for the frequency-rank part.

These distributions were incorporated into a simulation program using a probabilistic model of term occurrences and co-occurrences. Simulation of the four databases was carried out under both the assumption that terms occur independently and the assumption that they are dependent. In most cases the dependence model gave an improvement over the independence model but did not fully reproduce the original distribution of co-occurrences.

A small experiment with the clustering of terms to incorporate term dependence was also carried out. A method of incorporating the clustered terms into a simulation model needs to be found.

More work needs to be done in incorporating dependence of index terms, especially of order higher than two, into a model of bibliographic retrieval systems. Goodness-of-fit tests and parameter estimation methods need to be devised for the type of long-tailed distributions encountered.
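The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's simulation program: the function names, the per-term probabilities and the seed are all hypothetical, and the chi-square helper is the ordinary Pearson statistic rather than whatever test procedure the thesis applied. The sketch assigns terms to documents under the independence assumption, tabulates the frequency-size distribution (how many terms occur x times), and counts co-occurrences of term pairs.

```python
import random
from collections import Counter
from itertools import combinations


def simulate_independent(term_probs, n_docs, seed=0):
    """Assign each term to each document independently (independence assumption).

    term_probs maps a term to its hypothetical probability of being
    assigned to any single document.
    """
    rng = random.Random(seed)
    return [
        {term for term, p in term_probs.items() if rng.random() < p}
        for _ in range(n_docs)
    ]


def frequency_size(docs):
    """Return a Counter mapping x to the number of terms occurring in exactly x documents."""
    term_freq = Counter(term for doc in docs for term in doc)
    return Counter(term_freq.values())


def cooccurrence_counts(docs):
    """Count the documents in which each unordered pair of terms appears together."""
    pairs = Counter()
    for doc in docs:
        pairs.update(combinations(sorted(doc), 2))
    return pairs


def chi_square(observed, expected):
    """Pearson chi-square statistic; cells with zero expectation are skipped."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # Hypothetical term probabilities, chosen only for illustration.
    probs = {"indexing": 0.30, "retrieval": 0.25, "thesaurus": 0.05, "zipf": 0.02}
    docs = simulate_independent(probs, n_docs=1000)
    print(frequency_size(docs))
    print(cooccurrence_counts(docs).most_common(3))
```

Comparing the simulated co-occurrence counts with those observed in a real database is one way to see how far the independence assumption departs from the data, which is the comparison the abstract describes.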
Nelson, Michael John, "Probabilistic Models For The Simulation Of Bibliographic Retrieval Systems" (1982). Digitized Theses. 1143.