dc.contributor.author | Zhang, Yongzheng. | en_US |
dc.date.accessioned | 2014-10-21T12:35:35Z | |
dc.date.available | 2007 | |
dc.date.issued | 2007 | en_US |
dc.identifier.other | AAINR31513 | en_US |
dc.identifier.uri | http://hdl.handle.net/10222/54977 | |
dc.description | Web site summarization, which identifies the essential content covered in a given Web site, plays an important role in Web information management. However, straightforward summarization of an entire Web site, which is typically large and diverse in content, may lead to a summary heavily biased toward a subset of the main topics covered in the target Web site. In this thesis, we propose a two-stage framework for effective summarization of multi-topic Web sites. The first stage identifies the main topics covered in a Web site and the second stage summarizes each topic separately. | en_US |
dc.description | In order to identify the different topics covered in a Web site, we perform both text- and link-based clustering. In text-based clustering, we investigate the impact of document representation and feature selection on the clustering quality. In link-based clustering, we study co-citation and bibliographic coupling. We demonstrate that text-based clustering based on the selection of features with high variance over Web pages is reliable and that outgoing links can be used to improve the clustering quality if a rich set of cross links is available. | en_US |
dc.description | Each individual cluster computed above is summarized using an extraction-based summarization system, which extracts key phrases and key sentences from source documents to generate a summary. The performance of such an extraction-based Web site summarization system depends on its underlying key phrase extraction method. Hence, we conduct a user study to investigate five alternative key phrase extraction methods. Results show that the best method combines linguistic constraints with frequency over the corpus, adjusted to take into account the nesting of terms. Another important component of an extraction-based summarization system is key sentence extraction. To this end, we design and develop a classification approach in the cluster summarization stage. The classifier uses statistical and linguistic features to determine the topical significance of each sentence. | en_US |
dc.description | Finally, we evaluate the proposed system via a user study. We demonstrate that the proposed clustering-based summarization approach significantly outperforms the single-topic summarization approach for any given Web site summarization task. | en_US |
dc.description | Thesis (Ph.D.)--Dalhousie University (Canada), 2007. | en_US |
dc.language | eng | en_US |
dc.publisher | Dalhousie University | en_US |
dc.subject | Computer Science. | en_US |
dc.title | A framework for summarization of multi-topic Web sites. | en_US |
dc.type | text | en_US |
dc.contributor.degree | Ph.D. | en_US |