Large-Scale Web Page Classification
Date
2010-12-09
Authors
Marath, Sathi
Abstract
Web page classification is the process of assigning predefined categories to web pages.
Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest
Neighbor (k-NN), and Naïve Bayes (NB) have shown that these algorithms are effective
in classifying small segments of web directories. Their effectiveness,
however, has not been thoroughly investigated for large-scale classification of
popular web directories such as Yahoo! and LookSmart. Such web directories have
hundreds of thousands of categories, deep hierarchies, spindle category and document
distributions over the hierarchies, and skewed category distribution over the documents.
These statistical properties indicate class imbalance and rarity within the dataset.
In hierarchical datasets similar to web directories, expanding the content of each category
using the web pages of the child categories helps to decrease the degree of rarity. This
process, however, results in a localized overabundance of positive instances, especially
in the upper-level categories of the hierarchy. Class imbalance, rarity, and the
localized overabundance of positive instances make applying classification algorithms to
web directories very difficult, and the problem has not been thoroughly studied. To our
knowledge, the largest number of categories previously classified on a web
taxonomy is 246,279 categories of the Yahoo! directory, using hierarchical SVMs, with
a Macro-F1 of only 12%.
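The effect of expanding each category with the pages of its child categories can be illustrated with a small sketch. The hierarchy, page identifiers, and the function name `expand_with_descendants` below are hypothetical and are not taken from the thesis; the sketch only shows how upper-level categories accumulate positive instances once descendant pages are rolled up.

```python
def expand_with_descendants(children, own_pages):
    """Return, for every category, its own pages plus those of all its descendants."""
    expanded = {}

    def collect(cat):
        # Reuse the result if this category was already expanded.
        if cat in expanded:
            return expanded[cat]
        pages = set(own_pages.get(cat, ()))
        for child in children.get(cat, ()):
            pages |= collect(child)
        expanded[cat] = pages
        return pages

    for cat in set(children) | set(own_pages):
        collect(cat)
    return expanded


# Hypothetical toy hierarchy (illustration only).
children = {"Root": ["Science", "Arts"], "Science": ["Physics", "Biology"]}
own_pages = {
    "Root": [],
    "Science": ["s1"],
    "Physics": ["p1", "p2"],
    "Biology": ["b1"],
    "Arts": ["a1"],
}

expanded = expand_with_descendants(children, own_pages)
for cat in ["Root", "Science", "Physics", "Biology", "Arts"]:
    print(cat, len(own_pages[cat]), "->", len(expanded[cat]))
```

In this toy case "Science" grows from one page to four and "Root" from none to five, while the leaf categories are unchanged: rarity decreases near the top, but the upper levels now hold most of the positive instances.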
We designed a unified framework for the content-based classification of imbalanced
hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and
4,140,629 web pages was used to set up the experiments. In a hierarchical dataset, the prior
probability distribution of the subcategories indicates the presence or absence of class
imbalance, rarity and the overabundance of positive instances within the dataset. Based
on the prior probability distribution and associated machine learning issues, we
partitioned the subcategories of the Yahoo! web directory into five mutually exclusive
groups. We explored the effectiveness of different data-level, algorithmic, and architectural
solutions to the associated machine learning issues. The best-performing
classification techniques for each prior probability distribution were then
identified and integrated into the Yahoo! web directory classification model. The
methodology was evaluated using a DMOZ subset of 17,217 categories and 130,594 web
pages, and we showed statistically that the methodology works equally
well on large and small datasets.
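As a rough illustration of what a prior probability distribution over subcategories can look like, the sketch below computes each subcategory's share of the pages under its parent. This definition of the prior, the function name `subcategory_priors`, and the page counts are assumptions made for illustration; the abstract does not state how the priors were computed or where the boundaries of the five groups lie.

```python
def subcategory_priors(page_counts):
    """Return each subcategory's share of the pages under one parent category."""
    total = sum(page_counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in page_counts.items()}


# Hypothetical page counts for the subcategories of a single parent.
page_counts = {"Physics": 90, "Biology": 9, "Chemistry": 1}

priors = subcategory_priors(page_counts)
for name, prior in sorted(priors.items(), key=lambda item: -item[1]):
    print(f"{name}: {prior:.2f}")
```

A subcategory with a prior close to 1 has an overabundance of positive instances relative to its siblings, while one with a prior close to 0 is a rare class; a heavily skewed distribution like this one signals class imbalance.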
The average classifier performance achieved in this research, in terms of
macro-averaged F1-measure, is 81.02% for the Yahoo! web directory and 84.85% for the
DMOZ subset.
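For reference, the macro-averaged F1-measure is the unweighted mean of the per-category F1 scores, so rare categories count as much as frequent ones. The sketch below is a minimal single-label implementation with hypothetical labels; it is not the evaluation code used in the thesis, which may have computed the measure over per-category binary classifiers.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical imbalanced example: the rare label "b" is never predicted.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]

print(round(macro_f1(y_true, y_pred), 4))
```

Here four of the five predictions are correct, yet the macro-averaged F1 is only 4/9 because the missed rare label contributes an F1 of zero. This is why the measure is a demanding one for imbalanced directories.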
Keywords
Web page classification, class imbalance, rarity