On the Use of Vector Representation for Improved Accuracy and Currency of Twitter POS Tagging
Abstract
The scarcity of labelled text corpora has inspired alternative methods of harnessing data for training and developing Natural Language Processing systems geared toward tasks such as Part-of-Speech (POS) tagging, chunking and semantic role labelling. Of particular interest is the performance of POS taggers on largely informal and unstructured corpora such as Twitter posts. In modern business, the expansion of social media networks has led to increased ‘lead generation’ activity, and POS taggers form a significant part of such activity. We have trained a neural-network-based POS tagger using commercially available, labelled Penn Treebank data together with Twitter word embeddings. The word embeddings (or vector representations) are generated from tweets and used in training the POS tagger. We illustrate the value of harnessing tweets as a freely available and effectively unlimited data source by demonstrating improved performance on the tagging of Twitter text.