Fast calculation of n-gram-based phrase similarity
Date
2017-12-18T14:00:23Z
Authors
Ai, Zichu
Abstract
Text Relatedness Using Word and Phrase Relatedness Method (TrWP) is a text relatedness measure that computes semantic similarity between words and phrases using aggregated statistics from the Google Web 1T corpus. The phrase similarity computation in TrWP incurs significant time and memory overhead, which makes TrWP inefficient in practical scenarios that involve massive numbers of queries. This thesis presents an in-memory computational framework for TrWP that optimizes the calculation through efficient indexing and compact storage, combining perfect hashing, parallelism, quantization, and variable-length encoding. Using the Google Web 1T 5-gram corpus, we demonstrate that the framework reaches a peak throughput of 4098 queries per second.
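As an illustration of two of the compact-storage techniques named in the abstract, the following is a minimal Python sketch of variable-length encoding and quantization applied to n-gram counts. It is not the thesis's implementation: the function names, the 7-bits-per-byte layout, the logarithmic quantization scale, and the `levels` and `max_count` parameters are all assumptions chosen for illustration.

```python
import math


def encode_varint(n):
    """Encode a non-negative integer with 7 payload bits per byte (assumed layout)."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one integer starting at buf[pos]; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def quantize_count(count, levels=256, max_count=10**12):
    """Map a count >= 1 to one of `levels` codes on a log scale (hypothetical scheme)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    scaled = math.log(count) / math.log(max_count)
    return min(levels - 1, round(scaled * (levels - 1)))


def dequantize_count(code, levels=256, max_count=10**12):
    """Approximate inverse of quantize_count; the result is lossy."""
    return math.exp(code / (levels - 1) * math.log(max_count))


# Small counts take one byte, larger counts take more.
packed = b"".join(encode_varint(c) for c in [5, 300, 1_000_000])
values, pos = [], 0
while pos < len(packed):
    value, pos = decode_varint(packed, pos)
    values.append(value)
```

In this sketch the variable-length encoding is lossless, so frequent small counts cost a single byte, while quantization trades precision for a fixed one-byte code per count. Which trade-off the thesis actually makes, and with what parameters, is not stated in the abstract.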
Keywords
Natural Language Processing, High Performance Computing