437 research outputs found
Fast Label Embeddings via Randomized Linear Algebra
Many modern multiclass and multilabel problems are characterized by
increasingly large output spaces. For these problems, label embeddings have
been shown to be a useful primitive that can improve computational and
statistical efficiency. In this work we utilize a correspondence between rank
constrained estimation and low dimensional label embeddings that uncovers a
fast label embedding algorithm which works in both the multiclass and
multilabel settings. The result is a randomized algorithm whose running time is
exponentially faster than naive algorithms. We demonstrate our techniques on
two large-scale public datasets, from the Large Scale Hierarchical Text
Challenge and the Open Directory Project, where we obtain state of the art
results.
Comment: To appear in the proceedings of the ECML/PKDD 2015 conference.
Reference implementation available at https://github.com/pmineiro/randembe
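The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the generic randomized range-finder technique from randomized linear algebra, not the authors' method or their reference implementation. The function name `randomized_low_rank`, its parameters, and the idea of applying it to an examples-by-labels matrix are assumptions made for illustration.

```python
import numpy as np


def randomized_low_rank(A, rank, oversample=10, n_iter=2, seed=0):
    """Approximate the top-`rank` SVD of A with a randomized range finder.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Sketch size: target rank plus a little oversampling, capped by A's size.
    k = min(rank + oversample, min(A.shape))
    # Random Gaussian test matrix that probes the column space of A.
    omega = rng.standard_normal((A.shape[1], k))
    Q, _ = np.linalg.qr(A @ omega)
    # A few power iterations sharpen the captured subspace.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Project A onto the small subspace and take an exact SVD there.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]


# Toy data (an assumption): a 200 x 50 matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, s, Vt = randomized_low_rank(A, rank=5)
rel_err = np.linalg.norm((U * s) @ Vt - A) / np.linalg.norm(A)
```

If `A` were an examples-by-labels matrix, the columns of `Vt` would give one 5-dimensional vector per label; whether that matches the embedding used in the paper is not something the abstract confirms.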
Knowledge-based Biomedical Data Science 2019
Knowledge-based biomedical data science (KBDS) involves the design and
implementation of computer systems that act as if they knew about biomedicine.
Such systems depend on formally represented knowledge in computer systems,
often in the form of knowledge graphs. Here we survey the progress in the last
year in systems that use formally represented knowledge to address data science
problems in both clinical and biological domains, as well as on approaches for
creating knowledge graphs. Major themes include the relationships between
knowledge graphs and machine learning, the use of natural language processing,
and the expansion of knowledge-based approaches to novel domains, such as
Chinese Traditional Medicine and biodiversity.
Comment: Manuscript 43 pages with 3 tables; Supplemental material 43 pages
with 3 table
Comparing different search methods for the open access journal recommendation tool B!SON
Finding a suitable open access journal to publish academic work is a complex task: Researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. A systematic requirements analysis was conducted in the form of a survey. The developed tool suggests open access journals based on title, abstract and references provided by the user. The recommendations are built on open data, are publisher-independent, and work across domains and languages. Transparency is provided by its open source nature, an open application programming interface (API) and by specifying which matches the shown recommendations are based on. The recommendation quality has been evaluated using two different evaluation techniques, including several new recommendation methods. We were able to improve the results from our previous paper with a pre-trained transformer model. The beta version of the tool received positive feedback from the community and in several test sessions. We developed a recommendation system for open access journals to help researchers find a suitable journal. The open tool has been extensively tested, and we found possible improvements for our current recommendation technique. Development by two German academic libraries ensures the longevity and sustainability of the system.
German Federal Ministry of Education and Research (BMBF)/Projekt DEAL/16TOA034A/E
Linking Datasets on Organizations Using Half A Billion Open Collaborated Records
Scholars studying organizations often work with multiple datasets lacking
shared unique identifiers or covariates. In such situations, researchers may
turn to approximate string matching methods to combine datasets. String
matching, although useful, faces fundamental challenges. Even when two strings
appear similar to humans, fuzzy matching often does not work because it fails
to adapt to the informativeness of the character combinations presented. Worse,
many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and
"Federal National Mortgage Association"), a case where string matching has
little hope of succeeding. This paper introduces data from a prominent
employment-related networking site (LinkedIn) as a tool to address these
problems. We propose interconnected approaches to leveraging the massive amount
of information from LinkedIn regarding organizational name-to-name links. The
first approach builds a machine learning model for predicting matches from
character strings, treating the trillions of user-contributed organizational
name pairs as a training corpus: this approach constructs a string matching
metric that explicitly maximizes match probabilities. A second approach
identifies relationships between organization names using network
representations of the LinkedIn data. A third approach combines the first and
second. We document substantial improvements over fuzzy matching in
applications, making all methods accessible in open-source software
("LinkOrgs").
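The weakness of fuzzy matching on dissimilar aliases can be seen in a small sketch. It uses Python's standard-library `difflib.SequenceMatcher` as a stand-in for a generic fuzzy matcher; this is an assumption for illustration and is unrelated to the LinkOrgs software. The helper name `similarity` and the misspelling "Fanny Mae" are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A near-duplicate spelling versus a true alias with little character overlap.
typo_score = similarity("Fannie Mae", "Fanny Mae")
alias_score = similarity("Fannie Mae", "Federal National Mortgage Association")

print(f"typo:  {typo_score:.3f}")
print(f"alias: {alias_score:.3f}")
```

The misspelling scores well above the true alias, so any similarity threshold loose enough to accept the alias would also accept many unrelated names. That is the gap the abstract says learned name-to-name links are meant to close.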