Show simple item record

dc.contributor.author: Ramirez-Orta, Juan Antonio
dc.date.accessioned: 2023-12-20T17:13:23Z
dc.date.available: 2023-12-20T17:13:23Z
dc.date.issued: 2023-12-14
dc.identifier.uri: http://hdl.handle.net/10222/83324
dc.description.abstract: Given the recent rise in popularity of methods based on Deep Neural Networks within Natural Language Processing (NLP), important progress has been made on a variety of tasks that were previously out of reach because of their complexity and the heavy feature engineering they required. However, applying Deep Language Models to texts longer than a few paragraphs on standard hardware remains an open challenge. In this thesis, we explore a novel set of techniques to process full documents that rely exclusively on local context. These techniques, called local methods, work by splitting the input into smaller pieces, processing them independently, and combining the partial results in a coherent way (a conceptual sketch of this pattern follows this record). Their main advantage over other current methods is their efficiency: since they only process small parts of the full document, they prevent the model from wasting resources extracting meaningless relationships. To test the effectiveness of local methods, we apply them to two tasks that require processing full documents: the correction of documents processed with Optical Character Recognition systems and the Summarization of Scientific Documents. First, we introduce a method to summarize scientific documents of any length, based on sentence embeddings and graphs, that is simple, fast and efficient. Second, we introduce a method to correct long strings of characters by splitting them into n-grams, correcting them using character sequence-to-sequence models and joining them coherently via a voting scheme. Third, we introduce a methodology for the Query-Focused Summarization of Scientific Documents based on splitting the input documents into sentences and training Machine Learning classifiers on the fly to determine their relevance to the query. Finally, we introduce a methodology to automatically obtain datasets for the tasks of Scientific Query-Focused Summarization and Citation Prediction by taking advantage of existing collections of academic documents. Together, the techniques introduced in this thesis provide evidence that local methods are a viable, resource- and sample-efficient alternative to the more complex, resource-hungry methods that currently represent the state of the art in NLP, paving the way for a new family of methods for document-level NLP. [en_US]
dc.language.iso: en [en_US]
dc.subject: Natural Language Processing [en_US]
dc.subject: Deep Learning [en_US]
dc.subject: Local Methods [en_US]
dc.title: Local Methods for Document-Level Natural Language Processing [en_US]
dc.date.defence: 2023-11-30
dc.contributor.department: Faculty of Computer Science [en_US]
dc.contributor.degree: Doctor of Philosophy [en_US]
dc.contributor.external-examiner: Aijun An [en_US]
dc.contributor.thesis-reader: Ana Maguitman [en_US]
dc.contributor.thesis-reader: Hassan Sajjad [en_US]
dc.contributor.thesis-supervisor: Evangelos Milios [en_US]
dc.contributor.thesis-supervisor: Axel Soto [en_US]
dc.contributor.ethics-approval: Received [en_US]
dc.contributor.manuscripts: Yes [en_US]
dc.contributor.copyright-release: Yes [en_US]
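
What follows is a minimal, hypothetical sketch of the split-process-combine pattern that the abstract calls local methods. The function name local_process, the chunk size, the overlap, and the per-chunk model call are illustrative assumptions for exposition, not the thesis implementation, which combines partial results in task-specific ways (for example, voting over overlapping n-grams for OCR correction).

# Sketch of the generic "local method" pattern described in the abstract:
# split a long document into small pieces, process each piece independently,
# then combine the partial results. All names and parameters here are
# illustrative placeholders, not taken from the thesis.

from typing import Callable, List


def local_process(document: str,
                  process_chunk: Callable[[str], str],
                  chunk_size: int = 500,
                  overlap: int = 50) -> str:
    """Apply process_chunk to overlapping character windows and rejoin."""
    chunks: List[str] = []
    step = chunk_size - overlap
    for start in range(0, max(len(document), 1), step):
        chunks.append(document[start:start + chunk_size])

    # Each chunk is processed independently, so only local context is used
    # and memory stays bounded regardless of document length.
    partial_results = [process_chunk(chunk) for chunk in chunks]

    # Naive combination step: concatenate the partial outputs. A real local
    # method would merge them coherently, e.g. by voting over the overlaps.
    return " ".join(partial_results)


# Example usage with a trivial stand-in for a character sequence-to-sequence model.
if __name__ == "__main__":
    noisy = "sOme n0isy OCR outpvt " * 100
    print(local_process(noisy, process_chunk=str.lower)[:80])

Because each chunk is processed on its own, the per-call memory cost depends only on chunk_size and not on the length of the full document, which is the efficiency argument the abstract makes for local methods.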