UNSUPERVISED PARAPHRASE GENERATION FROM HIERARCHICAL LANGUAGE MODELS
MetadataShow full item record
Paraphrase generation is a challenging problem that requires a semantic representation of language. Language models implemented with deep neural networks (DNN) have the ability to transform text to a real valued vector space that can capture useful semantic information. In light of this, this work employs hierarchical language modeling to produce semantic representations of sentences. An encoder-decoder model is employed that uses four components: a word encoder, sentence encoder, sentence decoder, and word decoder. These components hierarchically convert a sentence from characters through word representations to a fixed-size sentence representation, then back down through words to characters. Many types of neural network are suitable for each component, and a number of them are compared in this work, including a novel architecture, the Self Attentive Recurrent Array (SARAh). The SARAh is shown to perform at least as well as Gated Recurrent Units (GRU) and Transformers on language modeling tasks, and requires fewer parameters. These language models are trained on a large and diverse dataset, but this work also shows that it is possible to fine tune such models to a particular domain, such as the works of a single author. These fine tuned models are able to leverage information learned on the larger dataset in order to perform better on the target domain. Finally, a language model is trained to produce semantic representations of sentences that are subsequently used to produce paraphrases in a completely unsupervised setting. The language model, which is trained to predict the sentence most likely to follow the input sentence, is fine tuned to instead autoencode the input sentence. Given that the sentence encoder produces a semantic representation, it is possible to use a number of techniques to encourage the decoder to generate a paraphrase rather than reconstruct the exact input sentence. These techniques include adding noise to the sentence representation, and sampling characters from the model's output layer.