Large-Scale Web Page Classification

Marath, Sathi

Files

Marath_Sathi_PhD_CSCI_December_2010.pdf (1.51 MB)

Date

2010-12-09

Authors

Marath, Sathi

Abstract

Web page classification is the process of assigning predefined categories to web pages. Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest Neighbor (k-NN), and Naïve Bayes (NB) have shown that these algorithms are effective in classifying small segments of web directories. Their effectiveness, however, has not been thoroughly investigated for large-scale web page classification of popular web directories such as Yahoo! and LookSmart. Such web directories have hundreds of thousands of categories, deep hierarchies, spindle category and document distributions over the hierarchies, and a skewed category distribution over the documents. These statistical properties indicate class imbalance and rarity within the dataset. In hierarchical datasets similar to web directories, expanding the content of each category with the web pages of its child categories helps to decrease the degree of rarity. This process, however, results in a localized overabundance of positive instances, especially in the upper-level categories of the hierarchy. The class imbalance, rarity, and localized overabundance of positive instances make applying classification algorithms to web directories very difficult, and the problem has not been thoroughly studied. To our knowledge, the largest number of categories previously classified in a web taxonomy is 246,279 categories of the Yahoo! directory, using hierarchical SVMs, which yielded a Macro-F1 of only 12%. We designed a unified framework for the content-based classification of imbalanced hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and 4,140,629 web pages is used to set up the experiments. In a hierarchical dataset, the prior probability distribution of the subcategories indicates the presence or absence of class imbalance, rarity, and the overabundance of positive instances within the dataset.
Based on the prior probability distribution and the associated machine learning issues, we partitioned the subcategories of the Yahoo! web directory into five mutually exclusive groups. The effectiveness of different data-level, algorithmic, and architectural solutions to the associated machine learning issues is explored. The best-performing classification techniques for each prior probability distribution were then identified and integrated into the Yahoo! web directory classification model. The methodology is evaluated using a DMOZ subset of 17,217 categories and 130,594 web pages, and we statistically proved that the methodology works equally well on large and small datasets. The average classifier performance, in terms of macro-averaged F1-measure, achieved in this research for the Yahoo! web directory and the DMOZ subset is 81.02% and 84.85%, respectively.
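
The abstract reports results as macro-averaged F1, which weights every category equally regardless of how many pages it holds. The following is a minimal illustrative sketch of that metric, not the thesis's own evaluation code. It assumes each page has exactly one true and one predicted category (a simplification; the thesis may treat the problem differently), and the function name macro_f1 and the toy labels are hypothetical.

```python
from collections import defaultdict


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-category F1 scores (single-label assumption)."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    tp = defaultdict(int)  # true positives per category
    fp = defaultdict(int)  # false positives per category
    fn = defaultdict(int)  # false negatives per category
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # Every category that appears as a true or predicted label counts once.
    categories = set(true_labels) | set(predicted_labels)
    scores = []
    for c in categories:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); a category with no hits scores 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example: 8 pages in a common category, 2 in a rare one,
# and a classifier that always predicts the common category.
true = ["common"] * 8 + ["rare"] * 2
pred = ["common"] * 10
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
score = macro_f1(true, pred)
```

In this toy case accuracy is 0.8, while the rare category's F1 of zero pulls the macro average down to 4/9. This illustrates why class imbalance and rarity are hard to hide under a macro-averaged measure.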

Keywords

Web page classification, class imbalance, rarity

URI

http://hdl.handle.net/10222/13130

Collections

Faculty of Graduate Studies Online Theses
