Overcoming data scarcity of Twitter: using tweets as bootstrap with application to autism-related topic content analysis
Notwithstanding recent work which has demonstrated the potential of using
Twitter messages for content-specific data mining and analysis, the depth of
such analysis is inherently limited by the scarcity of data imposed by the 140
character tweet limit. In this paper we describe a novel approach for targeted
knowledge exploration which uses tweet content analysis as a preliminary step.
This step is used to bootstrap more sophisticated data collection from directly
related but much richer content sources. In particular we demonstrate that
valuable information can be collected by following URLs included in tweets. We
automatically extract content from the corresponding web pages and treating
each web page as a document linked to the original tweet show how a temporal
topic model based on a hierarchical Dirichlet process can be used to track the
evolution of a complex topic structure of a Twitter community. Using
autism-related tweets we demonstrate that our method is capable of capturing a
much more meaningful picture of information exchange than user-chosen hashtags.
Comment: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 201
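The bootstrap step the abstract describes, collecting URLs embedded in tweets so the linked, richer web pages can be fetched and modelled as documents, can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the regex and the example tweets are assumptions.

```python
import re

# Hypothetical sketch of the bootstrap step: extract URLs embedded in
# tweets so that the linked, much richer web pages can later be fetched
# and treated as documents tied back to the originating tweet.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(tweets):
    """Return (tweet, url) pairs for every URL found in the tweet texts."""
    pairs = []
    for tweet in tweets:
        for url in URL_PATTERN.findall(tweet):
            # Strip trailing punctuation that the greedy match picks up;
            # each page is later modelled as a document linked to its tweet.
            pairs.append((tweet, url.rstrip(".,;)")))
    return pairs

tweets = [
    "New study on autism and early diagnosis http://example.org/study1",
    "No link in this tweet",
    "Two sources: http://example.org/a and https://example.org/b.",
]
print(extract_urls(tweets))
```

Fetching each extracted URL and running the extracted page text through the temporal topic model would follow this step; that part depends on an HTTP client and is omitted here.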
Heterogeneous data source integration for smart grid ecosystems based on metadata mining
The arrival of new technologies related to smart grids and the resulting ecosystem of applications and management systems pose many new problems. The databases of the traditional grid and the various initiatives related to new technologies have given rise to many different management systems with several formats and different architectures. A heterogeneous data source integration system is necessary to update these systems for the new smart grid reality. Additionally, it is necessary to take advantage of the information smart grids provide. In this paper, the authors propose a heterogeneous data source integration approach based on IEC standards and metadata mining. Additionally, an automatic data mining framework is applied to model the integrated information.
Ministerio de Economía y Competitividad TEC2013-40767-
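The core of metadata-driven integration is mapping each source's field names onto a shared schema. A minimal sketch, assuming invented source names, field names, and a loosely CIM-inspired common schema (none of these are taken from the paper):

```python
# A minimal sketch of metadata-driven integration: each source declares a
# mapping from its own field names to a shared schema. All names here are
# illustrative assumptions, not the paper's actual model.
FIELD_MAPS = {
    "legacy_scada": {"mtr_id": "meter_id", "kwh": "energy_kwh", "ts": "timestamp"},
    "smart_meter":  {"device": "meter_id", "consumption": "energy_kwh", "time": "timestamp"},
}

def to_common_schema(source, record):
    """Rename a record's fields according to the source's metadata map."""
    mapping = FIELD_MAPS[source]
    # Fields without a mapping entry pass through under their original name.
    return {mapping.get(k, k): v for k, v in record.items()}

rows = [
    to_common_schema("legacy_scada", {"mtr_id": "M1", "kwh": 3.2, "ts": "2024-01-01T00:00"}),
    to_common_schema("smart_meter", {"device": "M2", "consumption": 1.7, "time": "2024-01-01T00:00"}),
]
print(rows)
```

Once all sources emit records in the common schema, a single data mining framework can be applied to the integrated stream, which is the second step the abstract describes.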
Data Science, Machine Learning and Big Data in Digital Journalism: A survey of state-of-the-art, challenges and opportunities
Digital journalism has faced a dramatic change and media companies are challenged to use data science algorithms to be more competitive in a Big Data era. While this is a relatively new area of study in the media landscape, the use of machine learning and artificial intelligence has increased substantially over the last few years. In particular, the adoption of data science models for personalization and recommendation has attracted the attention of several media publishers. Following this trend, this paper presents a research literature analysis on the role of Data Science (DS) in Digital Journalism (DJ). Specifically, the aim is to present a critical literature review, synthesizing the main application areas of DS in DJ, highlighting research gaps, challenges, and opportunities for future studies. Through a systematic literature review integrating bibliometric search, text mining, and qualitative discussion, the relevant literature was identified and extensively analyzed. The review reveals an increasing use of DS methods in DJ, with almost 47% of the research being published in the last three years. A hierarchical clustering highlighted six main research domains focused on text mining, event extraction, online comment analysis, recommendation systems, automated journalism, and exploratory data analysis along with some machine learning approaches. Future research directions comprise developing models to improve personalization and engagement features, exploring recommendation algorithms, testing new automated journalism solutions, and improving paywall mechanisms.
Acknowledgements: This work was supported by the FCT-Fundação para a Ciência e Tecnologia, under the Projects: UIDB/04466/2020, UIDP/04466/2020, and UIDB/00319/2020
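The text-mining step behind such a clustering typically starts from term-frequency vectors and pairwise similarities between abstracts. A toy sketch of that preprocessing, with made-up example documents (the survey's actual pipeline is not specified at this level of detail):

```python
from collections import Counter
import math

# Illustrative only: term-frequency vectors and cosine similarity of the
# kind a literature-review text-mining pipeline might compute before
# feeding pairwise similarities into a hierarchical clustering.
def tf_vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "recommendation systems for news personalization",
    "news recommendation and personalization models",
    "automated journalism text generation",
]
vecs = [tf_vector(d) for d in docs]
# The first two documents share vocabulary; the third is unrelated.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```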
Econometrics meets sentiment: an overview of methodology and applications
The advent of massive amounts of textual, audio, and visual data has spurred the development of econometric methodology to transform qualitative sentiment data into quantitative sentiment variables, and to use those variables in an econometric analysis of the relationships between sentiment and other variables. We survey this emerging research field and refer to it as sentometrics, which is a portmanteau of sentiment and econometrics. We provide a synthesis of the relevant methodological approaches, illustrate with empirical results, and discuss useful software.
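The basic sentometrics step, turning qualitative texts into a quantitative sentiment variable, can be sketched as scoring each document against a lexicon and averaging the scores per period. The lexicon and texts below are invented for illustration; real sentometrics work uses far richer lexicons and weighting schemes.

```python
from collections import defaultdict

# Hedged sketch: convert dated texts into a daily sentiment index by
# averaging per-document lexicon scores. Lexicon entries are made up.
LEXICON = {"strong": 1.0, "growth": 1.0, "weak": -1.0, "decline": -1.0}

def score(text):
    """Mean lexicon score of the words in one document (0 if no hits)."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

def daily_index(dated_texts):
    """Average document scores per date -> a sentiment time series."""
    buckets = defaultdict(list)
    for date, text in dated_texts:
        buckets[date].append(score(text))
    return {d: sum(s) / len(s) for d, s in sorted(buckets.items())}

series = daily_index([
    ("2024-01-01", "strong growth expected"),
    ("2024-01-01", "weak demand"),
    ("2024-01-02", "decline continues"),
])
print(series)
```

The resulting time series is the kind of quantitative sentiment variable that can then enter a regression alongside other economic variables.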
Towards a Query Optimizer for Text-Centric Tasks
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for
a variety of tasks. As a notable example, information extraction applications derive structured
relations from unstructured text; as another example, focused crawlers explore the web to locate
pages about specific topics. Execution plans for text-centric tasks follow two general paradigms
for processing a text database: either we can scan, or "crawl," the text database or, alternatively,
we can exploit search engine indexes and retrieve the documents of interest via carefully crafted
queries constructed in task-specific ways. The choice between crawl- and query-based execution
plans can have a substantial impact on both execution time and output "completeness" (e.g.,
in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain
intuition. In this article, we present fundamental building blocks to make the choice of execution
plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to
analyze query- and crawl-based plans in terms of both execution time and output completeness.
We adapt results from random-graph theory and statistics to develop a rigorous cost model for
the execution plans. Our cost model reflects the fact that the performance of the plans depends
on fundamental task-specific properties of the underlying text databases. We identify these
properties and present efficient techniques for estimating the associated parameters of the cost
model. We also present two optimization approaches for text-centric tasks that rely on the cost-model
parameters and select efficient execution plans. Overall, our optimization approaches
help build efficient execution plans for a task, resulting in significant efficiency and output
completeness benefits. We complement our results with a large-scale experimental evaluation
for three important text-centric tasks and over multiple real-life data sets.
Information Systems Working Papers Serie
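The crawl-versus-query choice the abstract formalizes can be illustrated with a toy cost comparison: estimate the time each plan needs to reach a target recall and pick the cheaper one. All parameters below (database size, per-document and per-query costs) are invented for illustration and do not reproduce the article's actual cost model.

```python
import math

# Toy version of the crawl-vs-query decision. Parameters are assumptions.
def crawl_time(db_size, target_recall, time_per_doc):
    # Scanning processes documents uniformly, so reaching a fraction
    # `target_recall` of the useful documents costs that fraction of a scan.
    return db_size * target_recall * time_per_doc

def query_time(useful_docs, target_recall, docs_per_query, time_per_query):
    # Query-based plans retrieve documents in batches via a search index.
    needed = math.ceil(useful_docs * target_recall / docs_per_query)
    return needed * time_per_query

def choose_plan(db_size, useful_docs, target_recall,
                time_per_doc, docs_per_query, time_per_query):
    c = crawl_time(db_size, target_recall, time_per_doc)
    q = query_time(useful_docs, target_recall, docs_per_query, time_per_query)
    return ("crawl", c) if c <= q else ("query", q)

# With a large database and few useful documents, querying wins.
print(choose_plan(db_size=1_000_000, useful_docs=5_000, target_recall=0.8,
                  time_per_doc=0.01, docs_per_query=50, time_per_query=2.0))
```

The article's contribution is precisely to replace such guessed parameters with estimates of task-specific database properties, grounded in random-graph theory and statistics.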
News diversity and recommendation systems: setting the interdisciplinary scene
Concerns about selective exposure and filter bubbles in the digital news environment trigger questions regarding how news recommender systems can become more citizen-oriented and facilitate – rather than limit – normative aims of journalism. Accordingly, this chapter presents building blocks for the construction of such a news algorithm as they are being developed by the Ghent University interdisciplinary research project #NewsDNA, of which the primary aim is to actually build, evaluate and test a diversity-enhancing news recommender. As such, the deployment of artificial intelligence could support the media in providing people with information and stimulating public debate, rather than undermine their role in that respect. To do so, it combines insights from computer sciences (news recommender systems), law (right to receive information), communication sciences (conceptualisations of news diversity), and computational linguistics (automated content extraction from text). To gather feedback from scholars of different backgrounds, this research has been presented and discussed during the 2019 IFIP summer school workshop on ‘co-designing a personalised news diversity algorithmic model based on news consumers’ agency and fine-grained content modelling’. This contribution also reflects the results of that dialogue.
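One common way to make a recommender diversity-enhancing is to re-rank candidates so each pick balances relevance against similarity to items already selected, a maximal-marginal-relevance-style heuristic. The sketch below is a generic illustration of that idea; the #NewsDNA project's actual model is not described at this level of detail in the chapter, and all items and scores are invented.

```python
# Greedy diversity-aware re-ranking (MMR-style heuristic), illustrative only.
def rerank(items, k, trade_off=0.7):
    """items: list of (title, relevance, topic). Return k diverse titles."""
    selected = []
    pool = list(items)
    while pool and len(selected) < k:
        def marginal(item):
            # Penalize items whose topic is already represented.
            overlap = sum(1 for s in selected if s[2] == item[2])
            return trade_off * item[1] - (1 - trade_off) * overlap
        best = max(pool, key=marginal)
        selected.append(best)
        pool.remove(best)
    return [title for title, _, _ in selected]

items = [
    ("Election results", 0.9, "politics"),
    ("Coalition talks", 0.85, "politics"),
    ("Climate summit", 0.7, "environment"),
    ("Transfer news", 0.5, "sports"),
]
# Pure relevance ranking would pick both politics stories first; the
# diversity penalty pulls in other topics instead.
print(rerank(items, k=3))
```

The `trade_off` parameter makes the relevance/diversity balance explicit and tunable, which is the kind of knob a normatively motivated, citizen-oriented recommender needs to expose.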