Exploratory Analysis of Highly Heterogeneous Document Collections
We present an effective multifaceted system for exploratory analysis of
highly heterogeneous document collections. Our system is based on intelligently
tagging individual documents in a purely automated fashion and exploiting these
tags in a powerful faceted browsing framework. Tagging strategies employed
include both unsupervised and supervised approaches based on machine learning
and natural language processing. As one of our key tagging strategies, we
introduce the KERA algorithm (Keyword Extraction for Reports and Articles).
KERA extracts topic-representative terms from individual documents in a purely
unsupervised fashion and is revealed to be significantly more effective than
state-of-the-art methods. Finally, we evaluate our system in its ability to
help users locate documents pertaining to military critical technologies buried
deep in a large heterogeneous sea of information.
Comment: 9 pages; KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining
On Recognizing Transparent Objects in Domestic Environments Using Fusion of Multiple Sensor Modalities
Current object recognition methods fail on object sets that include
diffuse, reflective and transparent materials, although such objects are very common in
domestic scenarios. We show that a combination of cues from multiple sensor
modalities, including specular reflectance and unavailable depth information,
allows us to capture a larger subset of household objects by extending a state
of the art object recognition method. This leads to a significant increase in
robustness of recognition over a larger set of commonly used objects.
Comment: 12 pages
Statistical Mechanics of the Chinese Restaurant Process: lack of self-averaging, anomalous finite-size effects and condensation
The Pitman-Yor, or Chinese Restaurant Process, is a stochastic process that
generates distributions following a power-law with exponents lower than two, as
found in numerous physical, biological, technological and social systems. We
discuss its rich behavior with the tools and viewpoint of statistical
mechanics. We show that this process invariably gives rise to a condensation,
i.e. a distribution dominated by a finite number of classes. We also evaluate
thoroughly the finite-size effects, finding that the lack of stationary state
and self-averaging of the process creates realization-dependent cutoffs and
behavior of the distributions with no equivalent in other statistical
mechanical models.
Comment: (5 pages, 1 figure)
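As a rough illustration of the process named in this abstract, the following is a minimal sketch of the standard two-parameter (Pitman-Yor) Chinese Restaurant seating rule. It is not the analysis from the paper; the function name and the parameter names `discount` and `concentration` are assumptions chosen for this example.

```python
import random


def pitman_yor_crp(n_customers, discount=0.5, concentration=1.0, seed=0):
    """Seat customers one at a time and return the list of table sizes.

    Assumed parameterization: 0 <= discount < 1 and concentration > -discount.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")

    rng = random.Random(seed)
    tables = []  # tables[i] is the number of customers at table i
    for n in range(n_customers):
        if not tables:
            # The first customer always opens the first table.
            tables.append(1)
            continue
        # With n customers seated, a new table is opened with probability
        # (concentration + discount * number_of_tables) / (n + concentration).
        p_new = (concentration + discount * len(tables)) / (n + concentration)
        if rng.random() < p_new:
            tables.append(1)
        else:
            # Otherwise join an existing table with weight (size - discount).
            weights = [size - discount for size in tables]
            chosen = rng.choices(range(len(tables)), weights=weights)[0]
            tables[chosen] += 1
    return tables


if __name__ == "__main__":
    sizes = sorted(pitman_yor_crp(10_000), reverse=True)
    print("number of tables:", len(sizes))
    print("largest tables:", sizes[:5])
```

Running the sketch with different seeds gives noticeably different table-size profiles, which is one informal way to see the realization dependence the abstract mentions.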
Infinite factorization of multiple non-parametric views
Combined analysis of multiple data sources has increasing application interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting, by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet Processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views. Cluster analysis of co-expression is a standard simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation.
Meaning-focused and Quantum-inspired Information Retrieval
In recent years, quantum-based methods have promisingly integrated the
traditional procedures in information retrieval (IR) and natural language
processing (NLP). Inspired by our research on the identification and
application of quantum structures in cognition, more specifically our work on
the representation of concepts and their combinations, we put forward a
'quantum meaning based' framework for structured query retrieval in text
corpora and standardized testing corpora. This scheme for IR rests on
considering as basic notions, (i) 'entities of meaning', e.g., concepts and
their combinations and (ii) traces of such entities of meaning, which is how
documents are considered in this approach. The meaning content of these
'entities of meaning' is reconstructed by solving an 'inverse problem' in the
quantum formalism, consisting of reconstructing the full states of the entities
of meaning from their collapsed states identified as traces in relevant
documents. The advantages with respect to traditional approaches, such as
Latent Semantic Analysis (LSA), are discussed by means of concrete examples.
Comment: 11 pages
Using Social Media to Promote STEM Education: Matching College Students with Role Models
STEM (Science, Technology, Engineering, and Mathematics) fields have become
increasingly central to U.S. economic competitiveness and growth. The shortage
in the STEM workforce has brought the promotion of STEM education to the forefront. The rapid
growth of social media usage provides a unique opportunity to predict users'
real-life identities and interests from online texts and photos. In this paper,
we propose an innovative approach by leveraging social media to promote STEM
education: matching Twitter college student users with diverse LinkedIn STEM
professionals using a ranking algorithm based on the similarities of their
demographics and interests. We share the belief that increasing STEM presence
in the form of introducing career role models who share similar interests and
demographics will inspire students to develop interests in STEM related fields
and emulate their models. Our evaluation on 2,000 real college students
demonstrated the accuracy of our ranking algorithm. We also design a novel
implementation that recommends matched role models to the students.
Comment: 16 pages, 8 figures, accepted by ECML/PKDD 2016, Industrial Track
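The abstract describes ranking professionals by the similarity of their demographics and interests to a student's. The sketch below shows one plausible way to build such a ranking, using cosine similarity over interest keywords blended with the fraction of matching demographic fields. The scoring formula, the `demo_weight` parameter, and the record layout are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bags of interest keywords."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def demographic_match(demo_a, demo_b):
    """Fraction of the first person's demographic fields matched by the second."""
    if not demo_a:
        return 0.0
    hits = sum(1 for key, value in demo_a.items() if demo_b.get(key) == value)
    return hits / len(demo_a)


def rank_role_models(student, professionals, demo_weight=0.5):
    """Return (name, score) pairs sorted from best to worst match.

    The score is a weighted blend of interest and demographic similarity;
    demo_weight is a hypothetical tuning knob, not a value from the paper.
    """
    scored = []
    for pro in professionals:
        interest = cosine_similarity(student["interests"], pro["interests"])
        demo = demographic_match(student["demographics"], pro["demographics"])
        score = (1.0 - demo_weight) * interest + demo_weight * demo
        scored.append((pro["name"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    student = {
        "interests": ["robotics", "python", "ai"],
        "demographics": {"gender": "F", "region": "NY"},
    }
    professionals = [
        {"name": "A", "interests": ["robotics", "ai"],
         "demographics": {"gender": "F", "region": "NY"}},
        {"name": "B", "interests": ["cooking"],
         "demographics": {"gender": "M", "region": "CA"}},
    ]
    for name, score in rank_role_models(student, professionals):
        print(name, round(score, 3))
```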
Quantum Aspects of Semantic Analysis and Symbolic Artificial Intelligence
Modern approaches to semantic analysis, if reformulated as Hilbert-space
problems, reveal formal structures known from quantum mechanics. A similar
situation is found in distributed representations of cognitive structures
developed for the purposes of neural networks. We take a closer look at
similarities and differences between the above two fields and quantum
information theory.
Comment: version accepted in J. Phys. A (Letter to the Editor)
Seeing Tree Structure from Vibration
Humans recognize object structure from both their appearance and motion;
often, motion helps to resolve ambiguities in object structure that arise when
we observe object appearance only. There are particular scenarios, however,
where neither appearance nor spatial-temporal motion signals are informative:
occluding twigs may look connected and have almost identical movements, though
they belong to different, possibly disconnected branches. We propose to tackle
this problem through spectrum analysis of motion signals, because vibrations of
disconnected branches, though visually similar, often have distinctive natural
frequencies. We propose a novel formulation of tree structure based on a
physics-based link model, and validate its effectiveness by theoretical
analysis, numerical simulation, and empirical experiments. With this
formulation, we use nonparametric Bayesian inference to reconstruct tree
structure from both spectral vibration signals and appearance cues. Our model
performs well in recognizing hierarchical tree structure from real-world videos
of trees and vessels.
Comment: ECCV 2018. The first two authors contributed equally to this work.
Project page: http://tree.csail.mit.edu
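To make the idea of separating visually similar motions by their natural frequencies concrete, here is a small sketch that estimates the dominant frequency of a sampled signal with a naive discrete Fourier transform. It only illustrates the spectral cue; it does not implement the paper's link model or its Bayesian inference, and the synthetic sine signals and function names are assumptions for this example.

```python
import cmath
import math


def dominant_frequency(signal, sample_rate):
    """Return the frequency (in Hz) of the strongest non-DC DFT bin."""
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")
    best_bin, best_power = 0, -1.0
    # Scan bins 1..n/2 (positive frequencies up to Nyquist), skipping DC.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate / n


def sine(freq_hz, sample_rate, n_samples):
    """Synthetic stand-in for a tracked point's displacement over time."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate)
        for t in range(n_samples)
    ]


if __name__ == "__main__":
    rate, n = 64, 128
    twig_a = sine(3.0, rate, n)  # hypothetical twig on one branch
    twig_b = sine(5.0, rate, n)  # hypothetical twig on another branch
    print(dominant_frequency(twig_a, rate), dominant_frequency(twig_b, rate))
```

Two twigs whose displacement traces look alike in the time domain can still yield different dominant frequencies, which is the kind of cue the abstract proposes to exploit.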
Topic modeling applied to business research: A latent dirichlet allocation (LDA)-based classification for organization studies
More than 1.5 million academic documents are published each year, and this number is expected to keep growing in the coming years. One of the main challenges for the academic community is how to organize this huge volume of documentation to get a sense of the knowledge frontier. In this study we applied Latent Dirichlet Allocation (LDA) techniques to identify primary topics in organization studies, and analyzed the relationships between academic impact and membership in the topics detected by LDA.
Does ‘bigger’ mean ‘better’? Pitfalls and shortcuts associated with big data for social research
‘Big data is here to stay.’ This key statement has a double value: it is an assumption as well as the reason why a theoretical reflection is needed. Furthermore, Big data is gaining visibility and success in the social sciences, even overcoming the division between humanities and computer sciences. In this contribution, some considerations on the presence and the certain persistence of Big data as a socio-technical assemblage will be outlined. The intriguing opportunities for social research linked to this interaction between practices and technological development will then be developed. However, despite a promissory rhetoric, fostered by several scholars since the birth of Big data as a labelled concept, some risks are just around the corner. The claims for the methodological power of bigger and bigger datasets, as well as increasing speed in analysis and data collection, are creating a real hype in social research. Particular attention is needed in order to avoid some pitfalls. These risks will be analysed as far as they concern the validity of the research results obtained through Big data. After a pars destruens, this contribution will conclude with a pars construens; building on the previous critiques, a mixed methods research design approach will be described as a general proposal with the objective of stimulating a debate on the integration of Big data in complex research projecting.