Fast calculation of n-gram-based phrase similarity

Date

2017-12-18T14:00:23Z

Authors

Ai, Zichu

Abstract

Text Relatedness Using Word and Phrase Relatedness Method (TrWP) is a text relatedness measure that computes semantic similarity between words and phrases using aggregated statistics from the Google Web 1T corpus. Phrase similarity computation in TrWP incurs significant time and memory overhead, which makes TrWP inefficient in practical scenarios with massive numbers of queries. This thesis presents an in-memory computational framework for TrWP that optimizes the calculation through efficient indexing and compact storage, using perfect hashing, parallelism, quantization, and variable-length encoding. Using the Google Web 1T 5-gram corpus, we demonstrate that our framework reaches a peak speed of 4098 queries per second.
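
To illustrate two of the compact-storage techniques named in the abstract, the following is a minimal Python sketch of logarithmic quantization of n-gram counts followed by variable-length (LEB128-style) byte encoding. The abstract does not specify the actual schemes used in the thesis, so the function names, the quantization base of 1.5, and the byte layout here are all assumptions made for illustration only.

```python
import math

# Hypothetical helpers: the thesis may use different schemes and parameters.

def quantize(count, base=1.5):
    """Map a raw count (>= 1) to a small integer bucket on a log scale."""
    return int(round(math.log(count, base)))

def dequantize(bucket, base=1.5):
    """Recover an approximate count from its bucket index."""
    return base ** bucket

def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte.

    The high bit of each byte is set when more bytes follow.
    """
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)         # last byte, continuation bit clear
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode one integer starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_counts(counts):
    """Quantize each count, then concatenate the varint encodings."""
    return b"".join(encode_varint(quantize(c)) for c in counts)

def decode_counts(data):
    """Decode a byte string produced by encode_counts into bucket indices."""
    buckets, pos = [], 0
    while pos < len(data):
        bucket, pos = decode_varint(data, pos)
        buckets.append(bucket)
    return buckets
```

In this sketch, quantization trades a bounded relative error (at most a factor of the square root of the base) for much smaller integers, and the variable-length encoding then stores most of those integers in a single byte.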

Keywords

Natural Language Processing, High Performance Computing
