
dc.contributor.author: Mason, Jane E.
dc.date.accessioned: 2009-12-16T15:09:12Z
dc.date.available: 2009-12-16T15:09:12Z
dc.date.issued: 2009-12-16T15:09:12Z
dc.identifier.uri: http://hdl.handle.net/10222/12351
dc.description.abstract: The extraordinary growth in both the size and popularity of the World Wide Web has generated a growing interest in the identification of Web page genres, and in the use of these genres to classify Web pages. Web page genre classification is a potentially powerful tool for filtering the results of online searches. Although most information retrieval searches are topic-based, users are typically looking for a specific type of information with regard to a particular query, and genre can provide a complementary dimension along which to categorize Web pages. Web page genre classification could also aid in the automated summarization and indexing of Web pages, and in improving the automatic extraction of metadata. The hypothesis of this thesis is that a byte n-gram representation of a Web page can be used effectively to classify the Web page by its genre(s). The goal of this thesis is to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora, which better represent a real-world environment. This research develops n-gram representations for Web pages and Web page genres and, based on these representations, a new approach to the classification of Web pages by genre. The research includes an exhaustive examination of the questions associated with developing the new classification model, including the length, number, and type of the n-grams with which each Web page and Web page genre is represented, the method of computing the distance (dissimilarity) between two n-gram representations, and the feature selection method with which to choose these n-grams. The effect of preprocessing the data is also studied.
Techniques for setting genre thresholds, which allow a Web page to belong to more than one genre or to no genre at all, are also investigated, and the classification performance of the new model is compared with that of the popular support vector machine approach. Experiments are also conducted on highly unbalanced corpora, both with and without the inclusion of noise Web pages. [en_US]
dc.language.iso: en [en_US]
dc.subject: Web page classification, Web page genre, information retrieval [en_US]
dc.title: An n-gram Based Approach to the Automatic Classification of Web Pages by Genre [en_US]
dc.date.defence: 2009-12-10
dc.contributor.department: Faculty of Computer Science [en_US]
dc.contributor.degree: Doctor of Philosophy [en_US]
dc.contributor.external-examiner: Dr. Stan Matwin [en_US]
dc.contributor.graduate-coordinator: Dr. Malcolm Heywood [en_US]
dc.contributor.thesis-reader: Dr. Jack Duffy [en_US]
dc.contributor.thesis-reader: Dr. Vlado Keselj [en_US]
dc.contributor.thesis-supervisor: Dr. Michael Shepherd, Dr. Evangelos Milios [en_US]
dc.contributor.ethics-approval: Not Applicable [en_US]
dc.contributor.manuscripts: Not Applicable [en_US]
dc.contributor.copyright-release: Not Applicable [en_US]
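
The abstract describes representing each Web page and each genre by byte n-grams, computing a distance between two such representations, and using per-genre thresholds so that a page may receive several genre labels or none. The Python sketch below illustrates that general idea only. It is not taken from the thesis: the function names, the default n-gram length and profile size, the relative-difference distance formula, and the threshold values are all assumptions made for illustration, since the abstract does not state which choices the thesis settled on.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, n: int = 3, top_k: int = 500) -> dict[bytes, float]:
    """Map the top_k most frequent byte n-grams to their relative frequencies.

    n and top_k are illustrative defaults, not values from the thesis.
    """
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:  # input shorter than n bytes
        return {}
    return {gram: c / total for gram, c in counts.most_common(top_k)}


def profile_distance(p: dict[bytes, float], q: dict[bytes, float]) -> float:
    """Assumed dissimilarity: sum of squared relative frequency differences.

    An n-gram missing from one profile is treated as having frequency 0 there.
    """
    dist = 0.0
    for gram in p.keys() | q.keys():
        a = p.get(gram, 0.0)
        b = q.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one profile
        dist += ((a - b) / ((a + b) / 2)) ** 2
    return dist


def classify(page: bytes,
             genre_profiles: dict[str, dict[bytes, float]],
             thresholds: dict[str, float],
             n: int = 3,
             top_k: int = 500) -> list[str]:
    """Return every genre whose profile lies within its threshold of the page.

    The result may contain several genres (multi-label) or be empty (no genre).
    """
    page_profile = byte_ngram_profile(page, n, top_k)
    return [genre for genre, profile in genre_profiles.items()
            if profile_distance(page_profile, profile) <= thresholds[genre]]
```

In a sketch like this, a genre profile would typically be built from the concatenated bytes of that genre's training pages, and each threshold would be tuned on held-out data. Both of those steps are assumptions here, not a description of the thesis procedure.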