Improving average ranking precision in user searches for biomedical research datasets
Availability of research datasets is a keystone for health and life science
study reproducibility and scientific progress. Due to the heterogeneity and
complexity of these data, a main challenge to be overcome by research data
management systems is to provide users with the best answers for their search
queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we
investigate a novel ranking pipeline to improve the search of datasets used in
biomedical experiments. Our system comprises a query expansion model based on
word embeddings, a similarity measure algorithm that takes into consideration
the relevance of the query terms, and a dataset categorisation method that
boosts the rank of datasets matching query constraints. The system was
evaluated using a corpus with 800k datasets and 21 annotated user queries. Our
system provides competitive results when compared to the other challenge
participants. In the official run, it achieved the highest infAP among the
participants, being +22.3% higher than the median infAP of the participants'
best submissions. Overall, it ranks in the top 2 if an aggregated metric using
the best official measures per participant is considered. The query expansion
method had a positive impact on the system's performance, increasing our
baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively.
Our similarity measure algorithm seems to be robust, in particular compared to
the Divergence From Randomness framework, having smaller performance variations
under different training conditions. Finally, the result categorization did not
have significant impact on the system's performance. We believe that our
solution could be used to enhance biomedical dataset management systems. In
particular, the use of data driven query expansion methods could be an
alternative to the complexity of biomedical terminologies.
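The embedding-based query expansion mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the vectors, the `expand_query` helper, and the `k` and `min_sim` parameters are hypothetical, and a real system would load embeddings trained on a biomedical corpus.

```python
import math

# Toy, hand-made vectors (hypothetical); a real system would load trained embeddings.
EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "carcinoma": [0.8, 0.2, 0.1],
    "gene": [0.1, 0.9, 0.1],
    "protein": [0.15, 0.8, 0.3],
    "mouse": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, k=2, min_sim=0.95):
    """Append up to k nearest neighbours of each known term, if similar enough."""
    expanded = list(terms)
    for term in terms:
        if term not in EMBEDDINGS:
            continue  # unknown terms are kept but not expanded
        neighbours = sorted(
            ((cosine(EMBEDDINGS[term], vec), word)
             for word, vec in EMBEDDINGS.items() if word != term),
            reverse=True,
        )
        for sim, word in neighbours[:k]:
            if sim >= min_sim and word not in expanded:
                expanded.append(word)
    return expanded


print(expand_query(["cancer"]))
```

With these toy vectors, the query term "cancer" picks up "tumor" and "carcinoma", while an out-of-vocabulary term is passed through unchanged.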
Natural Language Processing for Cyberbullying Detection
With the development of digital technologies and the popularity of social media, cyberbullying has become a serious public health concern that can lead to an increased risk of mental and behavioral health issues or even suicide. Artificial intelligence techniques such as machine learning open up many possibilities for combating cyberbullying, e.g. automatic cyberbullying detection. Most recent research focuses on improving performance by developing complex models that demand more resources and time to run. This research uses publicly available datasets without carefully evaluating their feasibility and limitations. This study uses natural language processing (NLP) to evaluate model performance, examine the difference between fine-grained and binary classification, and assess the feasibility and quality of a publicly available dataset. The results show that a simple classifier can achieve performance similar to that of more complex models if appropriate preprocessing is used, and that the publicly available dataset may have limitations and quality issues that researchers should consider when using the data.
Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection
Many important forms of data are stored digitally in XML format. Errors can
occur in the textual content of the data in the fields of the XML. Fixing these
errors manually is time-consuming and expensive, especially for large amounts
of data. There is increasing interest in the research, development, and use of
automated techniques for assisting with data cleaning. Electronic dictionaries
are an important form of data that is frequently stored in XML format and often
contains errors introduced through a mixture of manual typographical entry errors
and optical character recognition errors. In this paper we describe methods for
flagging statistical anomalies as likely errors in electronic dictionaries
stored in XML format. We describe six systems based on different sources of
information. The systems detect errors using various signals in the data
including uncommon characters, text length, character-based language models,
word-based language models, tied-field length ratios, and tied-field
transliteration models. Four of the systems detect errors based on expectations
automatically inferred from content within elements of a single field type. We
call these single-field systems. Two of the systems detect errors based on
correspondence expectations automatically inferred from content within elements
of multiple related field types. We call these tied-field systems. For each
system, we provide an intuitive analysis of the type of error that it is
successful at detecting. Finally, we describe two larger-scale evaluations
using crowdsourcing with Amazon's Mechanical Turk platform and using the
annotations of a domain expert. The evaluations consistently show that the
systems are useful for improving the efficiency with which errors in XML
electronic dictionaries can be detected.
Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016
IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna
Hills, CA, USA, pages 79-86, February 201
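Two of the single-field signals named in this abstract, uncommon characters and text length, can be sketched as follows. This is an illustrative approximation rather than the paper's systems: the function names, the `min_count` and `z_threshold` parameters, and the sample field values are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def flag_uncommon_chars(values, min_count=2):
    """Flag values containing a character seen fewer than min_count times in the field."""
    counts = Counter(ch for v in values for ch in v)
    return [v for v in values if any(counts[ch] < min_count for ch in v)]


def flag_length_outliers(values, z_threshold=2.0):
    """Flag values whose length is more than z_threshold standard deviations from the mean."""
    lengths = [len(v) for v in values]
    mu, sigma = mean(lengths), pstdev(lengths)
    if sigma == 0:
        return []  # all values have the same length, nothing stands out
    return [v for v in values if abs(len(v) - mu) / sigma > z_threshold]


# Hypothetical contents of one field type (e.g. a part-of-speech element).
pos_field = ["noun", "verb", "noun", "verb", "n0un", "noun", "verb",
             "adjectiveadjectiveadjective"]

print(flag_uncommon_chars(pos_field))
print(flag_length_outliers(pos_field))
```

On this sample, the character check flags the entry with a digit in place of a letter, and the length check flags the entry where a value was apparently duplicated. Flagged entries are candidates for manual review, not confirmed errors.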
Exploring Topic-based Language Models for Effective Web Information Retrieval
The main obstacle to providing focused search is the relative opaqueness of search requests: searchers tend to express their complex information needs in only a couple of keywords. Our overall aim is to find out if, and how, topic-based language models can lead to more effective web information retrieval. In this paper we explore the retrieval performance of a topic-based model that combines topical models with other language models based on cross-entropy. We first define our topical categories and train our topical models on the .GOV2 corpus by building parsimonious language models. We then test the topic-based model on the TREC8 small Web data collection for ad-hoc search. Our experimental results show that the topic-based model outperforms the standard language model and the parsimonious model.
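As a rough illustration of ranking by cross-entropy against a mixture of language models, the sketch below interpolates a document model, a topic model, and a background model, then scores documents against a query model. The interpolation weights, helper names, and toy texts are assumptions; the paper's parsimonious estimation and its actual way of combining models are not reproduced here.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities for a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def mix(doc_model, topic_model, background, lam_topic=0.3, lam_bg=0.2):
    """Linear interpolation of document, topic, and background models (weights are assumptions)."""
    lam_doc = 1.0 - lam_topic - lam_bg
    vocab = set(doc_model) | set(topic_model) | set(background)
    return {
        w: lam_doc * doc_model.get(w, 0.0)
        + lam_topic * topic_model.get(w, 0.0)
        + lam_bg * background.get(w, 0.0)
        for w in vocab
    }


def cross_entropy(query_model, scored_model, floor=1e-9):
    """Cross-entropy of the query model under another model; lower means a better match."""
    return -sum(
        p * math.log(max(scored_model.get(w, 0.0), floor))
        for w, p in query_model.items()
    )


doc_a = "federal tax forms and tax filing".split()
doc_b = "national park camping permits".split()
topic_tax = unigram_model("tax income filing forms revenue".split())  # hypothetical topic
background = unigram_model(doc_a + doc_b)
query = unigram_model("tax forms".split())

ce_a = cross_entropy(query, mix(unigram_model(doc_a), topic_tax, background))
ce_b = cross_entropy(query, mix(unigram_model(doc_b), topic_tax, background))
print(ce_a < ce_b)  # the tax document should rank above the camping document
```

The topic and background components act as smoothing: a document that never mentions a query term still receives non-zero probability for it, so the score degrades gracefully instead of collapsing.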
Socioeconomic disparities in diet vary according to migration status among adolescents in Belgium
Little information concerning social disparities in adolescent dietary habits is currently available, especially regarding migration status. The aim of the present study was to estimate socioeconomic disparities in dietary habits of school adolescents from different migration backgrounds. In the 2014 cross-sectional Health Behavior in School-Aged Children survey in Belgium, food consumption was estimated using a self-administered short food frequency questionnaire. In total, 19,172 school adolescents aged 10-19 years were included in analyses. Multilevel multiple binary and multinomial logistic regressions were performed, stratified by migration status (natives, 2nd- and 1st-generation immigrants). Overall, immigrants more frequently consumed both healthy and unhealthy foods. Indeed, 32.4% of 1st-generation immigrants, 26.5% of 2nd-generation immigrants, and 16.7% of natives consumed fish two days a week. Compared to those having a high family affluence scale (FAS), adolescents with a low FAS were more likely to consume chips and fries once a day (vs. <once a day: Natives aRRR = 1.39 (95%CI: 1.12-1.73); NS in immigrants). Immigrants at schools in Flanders were less likely than those in Brussels to consume sugar-sweetened beverages 2-6 days a week (vs. once a week: Natives aRRR = 1.86 (95%CI: 1.32-2.62); 2nd-generation immigrants aRRR = 1.52 (1.11-2.09); NS in 1st-generation immigrants). The migration gradient observed here underlines a process of acculturation. Narrower socioeconomic disparities in immigrant dietary habits compared with natives suggest that such habits are primarily defined by culture of origin. Nutrition interventions should thus include cultural components of dietary habits.
Collaborative recommendations with content-based filters for cultural activities via a scalable event distribution platform
Nowadays, most people have limited leisure time, and the range of (cultural) activities on which to spend this time is enormous. Consequently, picking the most appropriate events becomes increasingly difficult for end-users. This complexity of choice reinforces the necessity of filtering systems that assist users in finding and selecting relevant events. Whereas traditional filtering tools enable e.g. the use of keyword-based or filtered searches, innovative recommender systems draw on user ratings, preferences, and metadata describing the events. Existing collaborative recommendation techniques, developed for suggesting web-shop products or audio-visual content, have difficulties with sparse rating data and cannot cope at all with event-specific restrictions like availability, time, and location. Moreover, aggregating, enriching, and distributing these events are additional requisites for an optimal communication channel. In this paper, we propose a highly scalable event recommendation platform which considers event-specific characteristics. Personal suggestions are generated by an advanced collaborative filtering algorithm, which is more robust on sparse data by extending user profiles with presumable future consumptions. The events, which are described using an RDF/OWL representation of the EventsML-G2 standard, are categorized and enriched via smart indexing and open linked data sets. This metadata model enables additional content-based filters, which consider event-specific characteristics, on the recommendation list. The integration of these different functionalities is realized by a scalable and extendable bus architecture. Finally, focus group conversations were organized with external experts, cultural mediators, and potential end-users to evaluate the event distribution platform and investigate the possible added value of recommendations for cultural participation.
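The general idea of collaborative scores followed by content-based filters on event metadata can be sketched as below. This is a plain user-based collaborative filter, not the paper's algorithm: the profile extension with presumable future consumptions is not implemented, and the ratings, event metadata, and `recommend` helper are hypothetical.

```python
import math

# Hypothetical ratings: user -> {event: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"opera": 5, "jazz": 4, "expo": 1},
    "bob": {"opera": 4, "jazz": 5, "theatre": 5},
    "cat": {"expo": 5, "theatre": 2, "film": 4},
}

# Hypothetical event metadata used by the content-based filters.
EVENTS = {
    "opera": {"city": "Ghent", "available": True},
    "jazz": {"city": "Ghent", "available": True},
    "expo": {"city": "Antwerp", "available": False},
    "theatre": {"city": "Ghent", "available": True},
    "film": {"city": "Antwerp", "available": True},
}


def similarity(a, b):
    """Cosine similarity between two users' rating vectors (missing ratings count as 0)."""
    dot = sum(r * b[e] for e, r in a.items() if e in b)
    norm = math.sqrt(sum(r * r for r in a.values())) * math.sqrt(sum(r * r for r in b.values()))
    return dot / norm if norm else 0.0


def recommend(user, city=None):
    """Rank unseen events by similarity-weighted ratings, then apply metadata filters."""
    seen = RATINGS[user]
    weighted, weights = {}, {}
    for other, ratings in RATINGS.items():
        if other == user:
            continue
        sim = similarity(seen, ratings)
        if sim <= 0:
            continue  # ignore users with no overlap
        for event, rating in ratings.items():
            if event not in seen:
                weighted[event] = weighted.get(event, 0.0) + sim * rating
                weights[event] = weights.get(event, 0.0) + sim
    scores = {e: weighted[e] / weights[e] for e in weighted}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Content-based filters: drop unavailable events and, optionally, other cities.
    return [
        e for e in ranked
        if EVENTS[e]["available"] and (city is None or EVENTS[e]["city"] == city)
    ]


print(recommend("ann"))
print(recommend("ann", city="Antwerp"))
```

Applying the filters after scoring keeps the collaborative step independent of event-specific constraints such as location and availability, which is one simple way to combine the two kinds of information.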