Large-Scale Web Page Classification
Authors
Marath, Sathi
Abstract
Web page classification is the process of assigning predefined categories to web pages.
Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest
Neighbor (k-NN), and Naïve Bayes (NB) have shown that these algorithms are effective
in classifying small segments of web directories. Their effectiveness, however, has not
been thoroughly investigated for large-scale classification of popular web directories
such as Yahoo! and LookSmart. Such web directories have hundreds of thousands of
categories, deep hierarchies, spindle category and document distributions over the
hierarchies, and a skewed category distribution over the documents. These statistical
properties indicate class imbalance and rarity within the dataset.
 
In hierarchical datasets similar to web directories, expanding the content of each category
with the web pages of its child categories helps to decrease the degree of rarity. This
process, however, results in a localized overabundance of positive instances, especially
in the upper-level categories of the hierarchy. Class imbalance, rarity, and the
localized overabundance of positive instances make it very difficult to apply
classification algorithms to web directories, and the problem has not been thoroughly
studied. To our knowledge, the largest number of categories previously classified on a
web taxonomy is 246,279 categories of the Yahoo! directory, using hierarchical SVMs and
reaching a Macro-F1 of only 12%.
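
The expansion step described above can be illustrated with a minimal sketch. The toy hierarchy, the page identifiers, and the function name expand_with_descendants are assumptions made for illustration only; they are not taken from the thesis or from the Yahoo! directory.

```python
def expand_with_descendants(children, pages):
    """Return, for each category, its own pages plus those of all descendants.

    children: dict mapping a category to a list of its child categories
    pages:    dict mapping a category to the pages filed directly under it
    Both structures are hypothetical stand-ins for a web directory.
    """
    expanded = {}

    def collect(category):
        # Reuse the result if this subtree was already processed.
        if category in expanded:
            return expanded[category]
        docs = set(pages.get(category, ()))
        for child in children.get(category, ()):
            docs |= collect(child)
        expanded[category] = docs
        return docs

    for category in set(children) | set(pages):
        collect(category)
    return expanded


# Toy hierarchy (illustrative only).
children = {"Root": ["Science", "Arts"], "Science": ["Physics", "Biology"]}
pages = {
    "Science": ["s1"],
    "Physics": ["p1", "p2"],
    "Biology": ["b1"],
    "Arts": ["a1"],
}

expanded = expand_with_descendants(children, pages)
total_pages = len(expanded["Root"])

# Share of all pages that count as positive instances for each category.
for name in ("Root", "Science", "Physics"):
    print(name, len(expanded[name]) / total_pages)
```

In this toy example, Physics gains no pages because it has no children, while Root ends up covering every page in the directory. That is the localized overabundance of positive instances at the upper levels that the abstract refers to.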
 
We designed a unified framework for the content-based classification of imbalanced
hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and
4,140,629 web pages is used to set up the experiments. In a hierarchical dataset, the prior
probability distribution of the subcategories indicates the presence or absence of class
imbalance, rarity, and the overabundance of positive instances within the dataset. Based
on the prior probability distribution and the associated machine learning issues, we
partitioned the subcategories of the Yahoo! web directory into five mutually exclusive
groups. We explored the effectiveness of different data-level, algorithmic, and
architectural solutions to the associated machine learning issues. The best-performing
classification technologies for each prior probability distribution were then
identified and integrated into the Yahoo! web directory classification model. The
methodology is evaluated using a DMOZ subset of 17,217 categories and 130,594 web
pages, and we demonstrated statistically that it works equally well on large and small
datasets.
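
As a rough illustration of the prior probability distribution mentioned above, the sketch below estimates the prior of each subcategory as its share of the documents under one parent category. The function name subcategory_priors and the counts are hypothetical, and the abstract does not state how the thesis defines the priors or where the boundaries of the five groups lie, so no grouping is attempted here.

```python
def subcategory_priors(doc_counts):
    """Estimate each subcategory's prior as its share of the parent's documents.

    doc_counts: dict mapping a subcategory name to its document count.
    This is an assumed, simplified definition used only for illustration.
    """
    total = sum(doc_counts.values())
    if total == 0:
        return {name: 0.0 for name in doc_counts}
    return {name: count / total for name, count in doc_counts.items()}


# Hypothetical document counts for the subcategories of one parent category.
counts = {"Physics": 900, "Biology": 95, "Alchemy": 5}
priors = subcategory_priors(counts)

for name, prior in sorted(priors.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {prior:.3f}")
```

A prior close to 1 suggests an overabundance of positive instances for that subcategory, while a prior close to 0 suggests rarity; either condition signals class imbalance for a classifier trained on that subcategory.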
 
The average classifier performance achieved in this research, in terms of the
macro-averaged F1-Measure, is 81.02% for the Yahoo! web directory and 84.85% for the
DMOZ subset.
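
For readers unfamiliar with the metric, the following is a minimal sketch of the macro-averaged F1-Measure in its standard form: F1 is computed separately for each category and the per-category scores are averaged with equal weight. The labels and predictions are made up, and the single-label setup is an assumption; the abstract does not describe the exact evaluation protocol used in the thesis.

```python
def macro_f1(y_true, y_pred):
    """Average the per-category F1 scores, giving every category equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the category never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced example: category "b" is rare and always missed.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro-F1:", macro_f1(y_true, y_pred))
```

Because every category counts equally, a classifier that ignores rare categories scores poorly on macro-F1 even when its accuracy looks high, which is why the metric suits heavily imbalanced directories.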
Keywords
Web page classification, class imbalance, rarity
