Information Extraction in Illicit Domains
Extracting useful entities and attribute values from illicit domains such as
human trafficking is a challenging problem with the potential for widespread
social impact. Such domains employ atypical language models, have 'long tails'
and suffer from the problem of concept drift. In this paper, we propose a
lightweight, feature-agnostic Information Extraction (IE) paradigm specifically
designed for such domains. Our approach uses raw, unlabeled text from an
initial corpus, and a few (12-120) seed annotations per domain-specific
attribute, to learn robust IE models for unobserved pages and websites.
Empirically, we demonstrate that our approach can outperform feature-centric
Conditional Random Field baselines by over 18% F-Measure on five annotated
sets of real-world human trafficking datasets in both low-supervision and
high-supervision settings. We also show that our approach is demonstrably
robust to concept drift, and can be efficiently bootstrapped even in a serial
computing environment.
Comment: 10 pages, ACM WWW 201
PeakNet: Bragg peak finding in X-ray crystallography experiments with U-Net
Serial crystallography at X-ray free electron laser (XFEL) sources has
experienced tremendous progress in achieving high data rate in recent times.
While this development offers potential to enable novel scientific
investigations, such as imaging molecular events at logarithmic timescales, it
also poses challenges with regard to real-time data analysis, which involves
some degree of data reduction to only save those features or images pertaining
to the science on disks. If data reduction is not effective, it could directly
result in a substantial increase in facility budgetary requirements, or even
hinder the utilization of ultra-high repetition imaging techniques by making data
analysis unwieldy. An additional challenge involves providing
real-time feedback to users derived from real-time data analysis. In the
context of serial crystallography, the initial and critical step in real-time
data analysis is finding X-ray Bragg peaks from diffraction images. To tackle
this challenge, we present PeakNet, a Bragg peak finder that utilizes neural
networks and runs about four times faster than the Psocake peak finder, while
delivering significantly better indexing rates and a comparable number of indexed
events. We formulated the task of peak finding as a semantic segmentation
problem, which is implemented with a classical U-Net architecture. A key
advantage of PeakNet is its ability to scale linearly with respect to data
volume, making it well-suited for real-time serial crystallography data
analysis at high data rates.
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
Neurocognitive Informatics Manifesto.
Informatics studies all aspects of the structure of natural and artificial information systems. Theoretical and abstract approaches to information have made great advances, but human information processing is still unmatched in many areas, including information management, representation and understanding. Neurocognitive informatics is a new, emerging field that should help to improve the matching of artificial and natural systems, and inspire better computational algorithms to solve problems that are still beyond the reach of machines. In this position paper, examples of neurocognitive inspirations and promising directions in this area are given.
- …