Dalhousie Repository

An n-gram Based Approach to the Automatic Classification of Web Pages by Genre

dc.contributor.author Mason, Jane E.
dc.date.accessioned 2009-12-16T15:09:12Z
dc.date.available 2009-12-16T15:09:12Z
dc.date.issued 2009-12-16T15:09:12Z
dc.identifier.uri http://hdl.handle.net/10222/12351
dc.description.abstract The extraordinary growth in both the size and popularity of the World Wide Web has generated a growing interest in the identification of Web page genres, and in the use of these genres to classify Web pages. Web page genre classification is a potentially powerful tool for filtering the results of online searches. Although most information retrieval searches are topic-based, users are typically looking for a specific type of information with regard to a particular query, and genre can provide a complementary dimension along which to categorize Web pages. Web page genre classification could also aid in the automated summarization and indexing of Web pages, and in improving the automatic extraction of metadata. The hypothesis of this thesis is that a byte n-gram representation of a Web page can be used effectively to classify the Web page by its genre(s). The goal of this thesis is to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora, which better represent a real-world environment. This research develops n-gram representations for Web pages and Web page genres and, based on these representations, a new approach to the classification of Web pages by genre. The research includes an exhaustive examination of the questions associated with developing the new classification model, including the length, number, and type of the n-grams with which each Web page and Web page genre is represented, the method of computing the distance (dissimilarity) between two n-gram representations, and the feature selection method with which to choose these n-grams. The effect of preprocessing the data is also studied.
Techniques for setting genre thresholds in order to allow a Web page to belong to more than one genre, or to no genre at all, are also investigated, and the classification performance of the new model is compared with that of the popular support vector machine approach. Experiments are also conducted on highly unbalanced corpora, both with and without the inclusion of noise Web pages. en_US
dc.language.iso en en_US
dc.subject Web page classification, Web page genre, information retrieval en_US
dc.title An n-gram Based Approach to the Automatic Classification of Web Pages by Genre en_US
dc.date.defence 2009-12-10
dc.contributor.department Faculty of Computer Science en_US
dc.contributor.degree Doctor of Philosophy en_US
dc.contributor.external-examiner Dr. Stan Matwin en_US
dc.contributor.graduate-coordinator Dr. Malcolm Heywood en_US
dc.contributor.thesis-reader Dr. Jack Duffy en_US
dc.contributor.thesis-reader Dr. Vlado Keselj en_US
dc.contributor.thesis-supervisor Dr. Michael Shepherd, Dr. Evangelos Milios en_US
dc.contributor.ethics-approval Not Applicable en_US
dc.contributor.manuscripts Not Applicable en_US
dc.contributor.copyright-release Not Applicable en_US
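
The abstract above describes representing each Web page and each genre as a set of byte n-grams, computing a distance between two such representations, and applying per-genre thresholds so that a page can receive several genre labels or none. The following Python sketch illustrates that general idea only. The abstract does not state which distance measure, n-gram length, profile size, or threshold values the thesis uses, so the relative-difference distance, the defaults n=3 and size=500, and all function names here are assumptions chosen for illustration.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, n: int = 3, size: int = 500) -> dict:
    """Return the `size` most frequent byte n-grams with normalized frequencies.

    The defaults for `n` and `size` are illustrative assumptions.
    """
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(size)}


def profile_distance(p: dict, q: dict) -> float:
    """Dissimilarity between two profiles (assumed measure, not from the abstract).

    Sums the squared relative difference of frequencies over every n-gram
    that occurs in either profile; identical profiles give 0.0.
    """
    total = 0.0
    for gram in p.keys() | q.keys():
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        total += ((a - b) / ((a + b) / 2)) ** 2
    return total


def classify(page: bytes, genre_profiles: dict, thresholds: dict,
             n: int = 3, size: int = 500) -> list:
    """Return every genre whose profile lies within its threshold of the page.

    The result may hold several genres (multi-label) or be empty (no genre).
    """
    page_profile = byte_ngram_profile(page, n, size)
    return sorted(
        genre
        for genre, profile in genre_profiles.items()
        if profile_distance(page_profile, profile) <= thresholds[genre]
    )
```

In this sketch a genre profile would be built by calling `byte_ngram_profile` on the concatenated bytes of that genre's training pages, and each threshold would be tuned on held-out data. Both choices are assumptions, not details taken from the record.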
