Large-Scale Web Page Classification

Marath, Sathi

dc.contributor.author	Marath, Sathi
dc.date.accessioned	2010-12-09T16:31:36Z
dc.date.available	2010-12-09T16:31:36Z
dc.date.issued	2010-12-09
dc.identifier.uri	http://hdl.handle.net/10222/13130
dc.description.abstract	Web page classification is the process of assigning predefined categories to web pages. Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest Neighbor (k-NN), and Naïve Bayes (NB) have shown that these algorithms are effective in classifying small segments of web directories. Their effectiveness, however, has not been thoroughly investigated on large-scale web page classification of such popular web directories as Yahoo! and LookSmart. Such web directories have hundreds of thousands of categories, deep hierarchies, spindle category and document distributions over the hierarchies, and skewed category distribution over the documents. These statistical properties indicate class imbalance and rarity within the dataset. In hierarchical datasets similar to web directories, expanding the content of each category using the web pages of its child categories helps to decrease the degree of rarity. This process, however, results in a localized overabundance of positive instances, especially in the upper-level categories of the hierarchy. The class imbalance, rarity, and localized overabundance of positive instances make applying classification algorithms to web directories very difficult, and the problem has not been thoroughly studied. To our knowledge, the largest number of categories previously classified in a web taxonomy is 246,279 categories of the Yahoo! directory, using hierarchical SVMs, which yielded a Macro-F1 of only 12%. We designed a unified framework for the content-based classification of imbalanced hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and 4,140,629 web pages is used to set up the experiments. In a hierarchical dataset, the prior probability distribution of the subcategories indicates the presence or absence of class imbalance, rarity, and overabundance of positive instances within the dataset.
Based on the prior probability distribution and the associated machine learning issues, we partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups. The effectiveness of different data-level, algorithmic, and architectural solutions to the associated machine learning issues is explored. The best-performing classification techniques for each prior probability distribution were then identified and integrated into the Yahoo! web directory classification model. The methodology was also evaluated on a DMOZ subset of 17,217 categories and 130,594 web pages, and statistical testing showed that it works equally well on large and small datasets. The average classifier performance, in terms of macro-averaged F1-measure, is 81.02% for the Yahoo! web directory and 84.85% for the DMOZ subset.	en_US
dc.language.iso	en	en_US
dc.subject	Web page classification, class imbalance, rarity	en_US
dc.title	Large-Scale Web Page Classification	en_US
dc.type	Thesis	en_US
dc.date.defence	2010-11-09
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.degree	Doctor of Philosophy	en_US
dc.contributor.external-examiner	n/a	en_US
dc.contributor.graduate-coordinator	Dr. Malcolm Heywood	en_US
dc.contributor.thesis-reader	Dr. Evangelos E. Milios	en_US
dc.contributor.thesis-reader	Dr. Malcolm Heywood	en_US
dc.contributor.thesis-supervisor	Dr. Michael Shepherd	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.manuscripts	Not Applicable	en_US
dc.contributor.copyright-release	Not Applicable	en_US

Files in this item

Name: Marath_Sathi_PhD_CSCI_December ...
Size: 1.505Mb
Format: PDF
Description: PhD Thesis

This item appears in the following Collection(s)

Faculty of Graduate Studies Online Theses