
36 research outputs found

Identifying chemical entities on literature:a machine learning approach using dictionaries as domain knowledge

Author: Grego Tiago Daniel Pereira, 1983-
Publication date: 01/01/2013

Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2013

The volume of life science publications, and therefore of the underlying biomedical knowledge, is growing at a fast pace. However, manual literature analysis is a slow and painful task. Hence, text mining systems have been developed to automatically locate the relevant information contained in the literature. An essential step in text mining is named entity recognition, but the inherent complexity of biomedical entities, such as chemical compounds, makes it difficult to obtain good performance in this task. This thesis proposes methods capable of improving the current performance of chemical entity recognition from text. A case-based method for recognizing chemical entities is proposed, and the evaluation results obtained outperform the most widely used methods, which are based on dictionaries. A lexical-similarity-based chemical entity resolution method was also developed, which allows an efficient mapping of the recognized entities to the ChEBI database. To improve the chemical entity identification results, we developed a validation method that exploits the semantic relationships in ChEBI to measure the similarity between the entities found in the text, in order to discriminate between correctly identified entities that can be validated and identification errors that should be discarded. A machine learning method for detecting entity recognition errors is also proposed, which can effectively find recognition errors in rule-based systems. The methods were integrated into a system capable of recognizing chemical entities in texts, mapping them to the ChEBI database, and providing evidence of validation or recognition error for the recognized entities.

Abstract (translated from the Portuguese): The volume of scientific publications in the life sciences is increasing at a growing rate. However, manual analysis of the literature is an arduous and time-consuming process, so text mining systems have been developed to automatically identify the relevant information contained in the literature. An essential step in text mining is named entity identification, but the inherent complexity of biomedical entities, such as chemical compounds, makes it difficult to achieve good performance in this task. This thesis proposes methods to improve the current performance of chemical entity recognition in text. To that end, a machine-learning-based method for recognizing chemical entities is proposed, which obtained better results than the dictionary-based methods currently in use. A lexical-similarity-based method was also developed that maps entities to the ChEBI ontology. To improve the chemical entity identification results, a validation method was developed that exploits the semantic relationships in ChEBI to measure the similarity between the entities found in the text, in order to discriminate correctly identified entities from identification errors. A machine-learning-based error filtering method is also proposed, and was tested on a rule-based system. These methods were integrated into a system capable of recognizing chemical entities in text, mapping them to ChEBI, and providing evidence for validation or error detection of the recognized entities.

Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/36015/2007)

Universidade de Lisboa: Repositório.UL

InterPro in 2022.

Author: Bateman Alex
Bileschi Maxwell L
Blum Matthias
Bork Peer
Bridge Alan
Chuguransky Sara
Colwell Lucy
Gough Julian
Grego Tiago
Haft Daniel H
Letunić Ivica
Marchler-Bauer Aron
Mi Huaiyu
Natale Darren A
Orengo Christine A
Pandurangan Arun P
Paysan-Lafosse Typhaine
Pinto Beatriz Lázaro
Rivoire Catherine
Salazar Gustavo A
Sigrist Christian JA
Sillitoe Ian
Thanki Narmada
Thomas Paul D
Tosatto Silvio CE
Wu Cathy H
Publication venue: 'Oxford University Press (OUP)'
Publication date: 09/11/2022

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.

UCL Discovery

One-step generation of conditional and reversible gene knockouts

Author: A Abuin
Alessandra Merenda
Amanda Andersson-Rolf
AN Economides
Bon-Kyoung Koo
FD Urnov
H Gu
H te Riele
JD Sander
Jihoon Kim
José C R Silva
Juergen Fink
K Yusa
Katie Andrews
Katie Tremble
L Cong
M Burset
M Jinek
N Lyashenko
P Mali
PH Tate
R Anton
RM Mortensen
Roxana C Mustata
Sajith Perera
SW Cho
T Sato
Tiago Grego
VT Chu
William C Skarnes
Z Dominski
Publication venue: Nature Methods
Publication date: 30/01/2017

Loss-of-function studies are key for investigating gene function, and CRISPR technology has made genome editing widely accessible in model organisms and cells. However, conditional gene inactivation in diploid cells is still difficult to achieve. Here, we present CRISPR-FLIP, a strategy that provides an efficient, rapid and scalable method for biallelic conditional gene knockouts in diploid or aneuploid cells, such as pluripotent stem cells, 3D organoids and cell lines, by co-delivery of CRISPR-Cas9 and a universal conditional intronic cassette.

A.A.-R. and K.T. are supported by the Medical Research Council. A.M. is supported by Wntsapp, Marie Curie ITN. J.F. and J.C.R.S. are supported by the Wellcome Trust. W.C.S. received core grant support from the Wellcome Trust to the Wellcome Trust Sanger Institute. B.-K.K. and R.C.M. are supported by a Sir Henry Dale Fellowship from the Wellcome Trust and the Royal Society (101241/Z/13/Z) and receive a core support grant from the Wellcome Trust and MRC to the WT–MRC Cambridge Stem Cell Institute.

Crossref

Apollo (Cambridge)

GENCODE reference annotation for the human and mouse genomes

Author: 1000 Genomes Project Consortium
Abascal
Adam Frankish
Aken
Alexandra Bignell
Alexandre Reymond
Altschul
Andrew Berry
Andrew Yates
Anne Parker
Anne-Maud Ferreira
Baertsch
Baikang Pei
Barbara Uszczynska-Ratajczak
Benedict Paten
Bianca M Schmitt
Bronwen Aken
Carlos García Girón
Casper
Cristina Sisu
Daniel Zerbino
Derrien
Eddy
Eloise Stapleton
ENCODE Project Consortium
Ezkurdia
Fabio C P Navarro
Fergal J Martin
Fernando Pozo
Fiddes
Fiona Cunningham
Gordon
GTEx Consortium
Hardwick
Harrow
Howald
Ian T Fiddes
If Barnes
International Cancer Genome Consortium
Irimia
Irina Sycheva
Irwin Jungreis
Jacqueline Chrast
James Wright
Jane Loveland
Jinuri Xu
Joel Armstrong
Jonathan M Mudge
Jose Manuel Gonzalez
Julien Lagarde
Jyoti S Choudhary
Kalvari
Kozomara
Kronenberg
König
Lagarde
Laura Martínez
Lek
Lilue
Lin
Lowe
Magali Ruffier
Manolis Kellis
Marie-Marthe Suner
Mark Diekhans
Mark Gerstein
Matthew Hardy
Michael L Tress
Navarro
Osagie G Izuogu
Paten
Paul Flicek
Paul Muir
Pujar
Regev
Roderic Guigó
Rodriguez
Rory Johnson
Sarah Donaldson
Schneider
Shamika Mohanan
Silvia Carbonell Sala
Steijger
Stunnenberg
The UniProt Consortium
Thibaut Hourlier
Tiago Grego
Tilgner
Tim J P Hubbard
Toby Hunt
Tomás Di Domenico
Uszczynska-Ratajczak
Weisser
Wright
Yan Zhang
Zerbino
Zhang
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2018

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature, to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs, as well as https://www.gencodegenes.org.

National Human Genome Research Institute of the National Institutes of Health

Crossref

DSpace@MIT

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Serveur académique lausannois

REPISALUD

UPF Digital Repository

Bern Open Repository and Information System (BORIS)

King's Research Portal

Institute of Cancer Research Repository

Brunel University Research Archive

Enhancement of chemical entity identification in text using semantic similarity validation.

Author: Francisco M Couto
Tiago Grego
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

With the amount of chemical data being produced and reported in the literature growing at a fast pace, it is increasingly important to retrieve this information efficiently. To tackle this issue, text mining tools have been applied, but despite their good performance they still produce many errors that we believe can be filtered by using semantic similarity. Thus, this paper proposes a novel method that receives the results of chemical entity identification systems, such as Whatizit, and exploits the semantic relationships in ChEBI to measure the similarity between the entities found in the text. The method assigns a single validation score to each entity based on its similarities with the other entities also identified in the text. Then, by using a given threshold, the method selects a set of validated entities and a set of outlier entities. We evaluated our method using the results of two state-of-the-art chemical entity identification tools, three semantic similarity measures and two text window sizes. The method was able to increase precision without filtering out a significant number of correctly identified entities. This means that the method can effectively discriminate the correctly identified chemical entities, while discarding a significant number of identification errors. For example, selecting a validation set with 75% of all identified entities, we were able to increase the precision by 28% for one of the chemical entity identification tools (Whatizit), while retaining in that subset 97% of the correctly identified entities. Our method can be used directly as an add-on by any state-of-the-art entity identification tool that provides mappings to a database, in order to improve its results. The proposed method is included in a freely accessible web tool at www.lasige.di.fc.ul.pt/webtools/ice/
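The scoring-and-threshold idea described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy ancestor sets are invented and are not ChEBI data, the Jaccard overlap stands in for the semantic similarity measures used in the paper (which are not reproduced here), and taking the maximum similarity to any other entity is only one possible way to aggregate similarities into a single score. The function names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy ancestor sets for illustration only; NOT real ChEBI data.
TOY_ANCESTORS: Dict[str, Set[str]] = {
    "ethanol": {"alcohol", "organic compound", "chemical entity"},
    "methanol": {"alcohol", "organic compound", "chemical entity"},
    "glucose": {"carbohydrate", "organic compound", "chemical entity"},
    "lead": {"metal", "chemical entity"},
}


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in similarity: overlap of the toy ancestor sets (an assumption)."""
    anc_a, anc_b = TOY_ANCESTORS[a], TOY_ANCESTORS[b]
    return len(anc_a & anc_b) / len(anc_a | anc_b)


def validation_scores(
    entities: List[str], similarity: Callable[[str, str], float]
) -> Dict[str, float]:
    """Give each entity one score from its similarity to the other entities.

    Aggregation by maximum is an assumption of this sketch.
    """
    scores: Dict[str, float] = {}
    for entity in entities:
        others = [similarity(entity, o) for o in entities if o != entity]
        scores[entity] = max(others) if others else 0.0
    return scores


def split_by_threshold(
    scores: Dict[str, float], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split entities into a validated set and an outlier set."""
    validated = [e for e, s in scores.items() if s >= threshold]
    outliers = [e for e, s in scores.items() if s < threshold]
    return validated, outliers


if __name__ == "__main__":
    found = ["ethanol", "methanol", "glucose", "lead"]
    scores = validation_scores(found, jaccard_similarity)
    validated, outliers = split_by_threshold(scores, threshold=0.4)
    print("validated:", validated)
    print("outliers:", outliers)
```

In this toy setting, an entity that shares few ancestors with anything else in the same text window receives a low score and falls into the outlier set, which mirrors the intuition that an isolated annotation is more likely to be an identification error.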

CiteSeerX

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Identifying Gene Ontology Areas for Automated Enrichment

Author: Catia Pesquita
Francisco M. Couto
Tiago Grego
Publication venue: 3rd International Workshop on Practical Applications of Computational Biology and Bioinformatics (IWPACBB'09)
Publication date: 01/01/2009

Universidade de Lisboa: Repositório.UL

Validation of Whatizit annotation results.

Author: Francisco M. Couto (137581)
Tiago Grego (409637)

Shows the variation in precision and recall with the validation score threshold, using the Resnik measure with a document as text window. Straight dots represent the expected behavior of a random validation system.

FigShare

Comparison of the validation scores.

Author: Francisco M. Couto (137581)
Tiago Grego (409637)

Boxplot of the validation score obtained for the manual annotations in the gold standard, and the automatic annotations provided by the dictionary-based method and the CRF-based method.

FigShare

Validation of CRF-based annotation results.

Author: Francisco M. Couto (137581)
Tiago Grego (409637)

FigShare