Mining the text biobibliome for experimental evidence of predicted gene regulation in E. coli

Postdoctoral fellow: Carlos Rodríguez Penagos, Ph.D.

The core of the project is to develop software tools that mine the E. coli literature for information useful to the curatorial efforts of the Regulon database maintained at CCG. A promising line of research is to use Language Engineering techniques to locate experimental evidence in the biobibliome (the vast full-text repositories of biomedical research articles) for entities in the database that have been inferred or predicted by analytical and computational methods.

In general, we will evaluate how Natural Language Processing tools and techniques can have an impact on the curatorial and knowledge-discovery efforts focused on the regulatory mechanisms of the E. coli model organism.

Future Work

The explosive growth of data available for computational approaches to biological research has led, over the last seven years, to the development of advanced methods for mining the vast literature known as the biobibliome in order to extract useful facts from free-form text. These Information Extraction and Retrieval methods go well beyond the usual keyword searching and abstract scanning: they perform semantic interpretation of full texts, identifying the bioentities involved as well as the relationships described between them. This project aims to develop Computational Linguistics tools that enrich and extend the curatorial efforts for the E. coli Regulon database, and to explore novel methodologies and algorithms for locating relevant information in textual sources. The techniques involved range from rule-based approaches to statistical and Machine Learning methods that have proven accurate and robust in other domains of application. We will explore, in particular, the possibility of discovering in the literature repositories experimental evidence for bioentities (mainly genes and/or proteins) that have been put forward by analytical or computational methods.
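As a purely illustrative sketch of the simplest, rule-based end of that range, the Python fragment below flags sentences that mention a predicted regulator and its target together with a cue word for an experimental method. It is not the project's tooling: the cue list, the function names and the naive sentence splitter are assumptions made for this example, and a real system would need proper tokenization, gene-name normalization and a curated vocabulary of evidence terms.

```python
import re

# Hypothetical cue phrases (lowercase) that often signal an experimental method.
EVIDENCE_CUES = ("footprinting", "gel shift", "northern blot", "microarray")


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'.

    This mis-splits abbreviations such as 'E. coli'; it is only a placeholder.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_evidence(text, regulator, target):
    """Return (sentence, cues) pairs for sentences that mention both
    entity names (case-sensitive) and at least one evidence cue."""
    hits = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[A-Za-z][A-Za-z0-9]*", sentence))
        if regulator in tokens and target in tokens:
            lowered = sentence.lower()
            cues = [cue for cue in EVIDENCE_CUES if cue in lowered]
            if cues:
                hits.append((sentence, cues))
    return hits
```

Calling `find_evidence(article_text, "AraC", "araB")` on the full text of an article would return the candidate sentences for a curator to review; statistical or Machine Learning classifiers could later replace the fixed cue list.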

Computational Genomics Program