14,494 research outputs found
Mapping Subsets of Scholarly Information
We illustrate the use of machine learning techniques to analyze, structure,
maintain, and evolve a large online corpus of academic literature. An emerging
field of research can be identified as part of an existing corpus, permitting
the implementation of a more coherent community structure for its
practitioners.Comment: 10 pages, 4 figures, presented at Arthur M. Sackler Colloquium on
"Mapping Knowledge Domains", 9--11 May 2003, Beckman Center, Irvine, CA,
proceedings to appear in PNA
Text-mining and ontologies: new approaches to knowledge discovery of microbial diversity
Microbiology research has access to a very large amount of public information
on the habitats of microorganisms. Many areas of microbiology research uses
this information, primarily in biodiversity studies. However the habitat
information is expressed in unstructured natural language form, which hinders
its exploitation at large-scale. It is very common for similar habitats to be
described by different terms, which makes them hard to compare automatically,
e.g. intestine and gut. The use of a common reference to standardize these
habitat descriptions as claimed by (Ivana et al., 2010) is a necessity. We
propose the ontology called OntoBiotope that we have been developing since
2010. The OntoBiotope ontology is in a formal machine-readable representation
that enables indexing of information as well as conceptualization and
reasoning.Comment: 5 page
This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News
The problem of fake news has gained a lot of attention as it is claimed to
have had a significant impact on 2016 US Presidential Elections. Fake news is
not a new problem and its spread in social networks is well-studied. Often an
underlying assumption in fake news discussion is that it is written to look
like real news, fooling the reader who does not check for reliability of the
sources or the arguments in its content. Through a unique study of three data
sets and features that capture the style and the language of articles, we show
that this assumption is not true. Fake news in most cases is more similar to
satire than to real news, leading us to conclude that persuasion in fake news
is achieved through heuristics rather than the strength of arguments. We show
overall title structure and the use of proper nouns in titles are very
significant in differentiating fake from real. This leads us to conclude that
fake news is targeted for audiences who are not likely to read beyond titles
and is aimed at creating mental associations between entities and claims.Comment: Published at The 2nd International Workshop on News and Public
Opinion at ICWS
A Linear Classifier Based on Entity Recognition Tools and a Statistical Approach to Method Extraction in the Protein-Protein Interaction Literature
We participated, in the Article Classification and the Interaction Method
subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of
the BioCreative III Challenge. For the ACT, we pursued an extensive testing of
available Named Entity Recognition and dictionary tools, and used the most
promising ones to extend our Variable Trigonometric Threshold linear
classifier. For the IMT, we experimented with a primarily statistical approach,
as opposed to employing a deeper natural language processing strategy. Finally,
we also studied the benefits of integrating the method extraction approach that
we have used for the IMT into the ACT pipeline. For the ACT, our linear article
classifier leads to a ranking and classification performance significantly
higher than all the reported submissions. For the IMT, our results are
comparable to those of other systems, which took very different approaches. For
the ACT, we show that the use of named entity recognition tools leads to a
substantial improvement in the ranking and classification of articles relevant
to protein-protein interaction. Thus, we show that our substantially expanded
linear classifier is a very competitive classifier in this domain. Moreover,
this classifier produces interpretable surfaces that can be understood as
"rules" for human understanding of the classification. In terms of the IMT
task, in contrast to other participants, our approach focused on identifying
sentences that are likely to bear evidence for the application of a PPI
detection method, rather than on classifying a document as relevant to a
method. As BioCreative III did not perform an evaluation of the evidence
provided by the system, we have conducted a separate assessment; the evaluators
agree that our tool is indeed effective in detecting relevant evidence for PPI
detection methods.Comment: BMC Bioinformatics. In Pres
- …