Biological Information Extraction using Patterns of Characters, Tag Sequences and Subgraphs
The size of the biological literature drives demand for effective and efficient literature mining and knowledge discovery that help biologists gather and use the knowledge encoded in text documents. In this thesis, we present three pattern-based techniques that target two important tasks of biological information extraction: entity recognition and relation extraction. The first technique is an unsupervised method that automatically extracts domain-specific prefix and suffix characters from biological corpora. The extracted characters are integrated into the parametrization of an existing system for biological entity recognition to help it annotate biological entities. The second technique identifies sentences that describe interactions between co-occurring biological entities, using patterns defined as sequences of specialized Part-of-Speech (POS) tags that capture the structure of key sentences in the scientific literature. Each candidate sentence for the classification task is encoded as a POS array and then aligned to a collection of pre-extracted patterns. The quality of the alignment is expressed as a pairwise alignment score. The most innovative component of this work is the use of a Genetic Algorithm (GA) to maximize the classification performance of the alignment scoring scheme. The third technique is a graph matching-based approach to extracting complex biological events from the scientific literature. Sentences are represented as dependency graphs, and biological event rules are extracted from sentences as minimal dependency graphs that capture the typical contextual structures of biological events. We investigate whether subgraph matching can be used in the BioNLP field to extract biological events by searching the graphs of sentences for subgraphs isomorphic to the graphs of event rules.
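To make the second technique more concrete, the following is a minimal sketch of scoring a candidate sentence's tag sequence against a set of patterns by pairwise alignment. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not specify the alignment algorithm, the tag set, or the scoring parameters, so the sketch assumes a standard global (Needleman-Wunsch) alignment, and the tag names (`PROT`, `VB_INT`, `IN`, `DT`) and the `match`, `mismatch`, and `gap` weights are hypothetical. Weights of this kind are the sort of parameters an optimizer such as a GA could tune, but whether the thesis tunes exactly these is not stated here.

```python
def alignment_score(sentence_tags, pattern_tags, match=2, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score between two tag sequences.

    The weights are illustrative assumptions, not values from the thesis.
    """
    n, m = len(sentence_tags), len(pattern_tags)
    # score[i][j] = best score aligning the first i sentence tags
    # with the first j pattern tags.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # sentence prefix aligned entirely to gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # pattern prefix aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = sentence_tags[i - 1] == pattern_tags[j - 1]
            substitution = match if same else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align the two tags
                score[i - 1][j] + gap,               # skip a sentence tag
                score[i][j - 1] + gap,               # skip a pattern tag
            )
    return score[n][m]


def best_pattern_score(sentence_tags, patterns, **weights):
    """Highest alignment score of the sentence against any pattern."""
    return max(alignment_score(sentence_tags, p, **weights) for p in patterns)


if __name__ == "__main__":
    # Hypothetical specialized tags: PROT marks an entity mention,
    # VB_INT marks an interaction verb.
    patterns = [
        ["PROT", "VB_INT", "PROT"],
        ["PROT", "VB_INT", "IN", "PROT"],
    ]
    sentence = ["DT", "PROT", "VB_INT", "IN", "PROT"]
    print(best_pattern_score(sentence, patterns))
```

A classifier built on this sketch would compare the best score against a threshold to decide whether the sentence describes an interaction; the threshold, like the weights, would have to be chosen or learned.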