An n-gram Based Approach to the Automatic Classification of Web Pages by Genre

Mason, Jane E.

dc.contributor.author	Mason, Jane E.
dc.date.accessioned	2009-12-16T15:09:12Z
dc.date.available	2009-12-16T15:09:12Z
dc.date.issued	2009-12-16T15:09:12Z
dc.identifier.uri	http://hdl.handle.net/10222/12351
dc.description.abstract	The extraordinary growth in both the size and popularity of the World Wide Web has generated a growing interest in the identification of Web page genres, and in the use of these genres to classify Web pages. Web page genre classification is a potentially powerful tool for filtering the results of online searches. Although most information retrieval searches are topic-based, users are typically looking for a specific type of information with regard to a particular query, and genre can provide a complementary dimension along which to categorize Web pages. Web page genre classification could also aid in the automated summarization and indexing of Web pages, and in improving the automatic extraction of metadata. The hypothesis of this thesis is that a byte n-gram representation of a Web page can be used effectively to classify the Web page by its genre(s). The goal of this thesis is to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora, which better represent a real-world environment. The research develops n-gram representations for Web pages and Web page genres and, based on these representations, a new approach to the classification of Web pages by genre. The research includes an exhaustive examination of the questions associated with developing the new classification model, including the length, number, and type of the n-grams with which each Web page and Web page genre is represented, the method of computing the distance (dissimilarity) between two n-gram representations, and the feature selection method with which to choose these n-grams. The effect of preprocessing the data is also studied.
Techniques for setting genre thresholds, which allow a Web page to belong to more than one genre or to no genre at all, are also investigated, and the classification performance of the new model is compared with that of the popular support vector machine approach. Experiments are also conducted on highly unbalanced corpora, both with and without the inclusion of noise Web pages.	en_US
dc.language.iso	en	en_US
dc.subject	Web page classification, Web page genre, information retrieval	en_US
dc.title	An n-gram Based Approach to the Automatic Classification of Web Pages by Genre	en_US
dc.date.defence	2009-12-10
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.degree	Doctor of Philosophy	en_US
dc.contributor.external-examiner	Dr. Stan Matwin	en_US
dc.contributor.graduate-coordinator	Dr. Malcolm Heywood	en_US
dc.contributor.thesis-reader	Dr. Jack Duffy	en_US
dc.contributor.thesis-reader	Dr. Vlado Keselj	en_US
dc.contributor.thesis-supervisor	Dr. Michael Shepherd, Dr. Evangelos Milios	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.manuscripts	Not Applicable	en_US
dc.contributor.copyright-release	Not Applicable	en_US
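
The abstract describes representing Web pages and genres by byte n-grams, computing a distance between two such representations, and using per-genre thresholds so that a page can receive several genre labels or none. The following is a minimal Python sketch of that general idea, not the thesis's implementation. The function names, the default n-gram length and profile size, and the squared relative-difference distance are all assumptions chosen for illustration; the record does not state which values or distance measure the thesis adopts.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, n: int = 3, size: int = 500) -> dict:
    """Return the `size` most frequent byte n-grams with normalized frequencies.

    The defaults for `n` and `size` are illustrative assumptions.
    """
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.most_common(size)}


def profile_distance(p: dict, q: dict) -> float:
    """Sum of squared relative frequency differences over the union of n-grams.

    This particular measure is an assumption; other distances could be used.
    """
    dist = 0.0
    for gram in p.keys() | q.keys():
        fp, fq = p.get(gram, 0.0), q.get(gram, 0.0)
        # fp + fq > 0 because the n-gram occurs in at least one profile.
        dist += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return dist


def assign_genres(page: dict, genres: dict, thresholds: dict) -> list:
    """Return every genre whose profile lies within its threshold (possibly none)."""
    return sorted(
        genre
        for genre, profile in genres.items()
        if profile_distance(page, profile) <= thresholds[genre]
    )
```

In this sketch a genre profile would be built by passing the concatenated bytes of that genre's training pages to `byte_ngram_profile`; because each genre is tested against its own threshold, the returned list may contain several genres or be empty.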

Files in this item

Name: thesis_16_12_09.pdf
Size: 568.0Kb
Format: PDF

This item appears in the following Collection(s)

Faculty of Graduate Studies Online Theses
