Mapping Subsets of Scholarly Information
We illustrate the use of machine learning techniques to analyze, structure,
maintain, and evolve a large online corpus of academic literature. An emerging
field of research can be identified as part of an existing corpus, permitting
the implementation of a more coherent community structure for its
practitioners.
Comment: 10 pages, 4 figures, presented at Arthur M. Sackler Colloquium on
"Mapping Knowledge Domains", 9--11 May 2003, Beckman Center, Irvine, CA,
proceedings to appear in PNA
Ensemble Committees for Stock Return Classification and Prediction
This paper considers a portfolio trading strategy formulated by algorithms in
the field of machine learning. The profitability of the strategy is measured by
the algorithm's capability to consistently and accurately identify stock
indices with positive or negative returns, and to generate a preferred
portfolio allocation on the basis of a learned model. Stocks are characterized
by time series data sets consisting of technical variables that reflect market
conditions in a previous time interval, which are used to produce binary
classification decisions in subsequent intervals. The learned model is
constructed as a committee of random forest classifiers, a non-linear support
vector machine classifier, a relevance vector machine classifier, and a
constituent ensemble of k-nearest neighbors classifiers. The Global Industry
Classification Standard (GICS) is used to explore the ensemble model's efficacy
within the context of various fields of investment including Energy, Materials,
Financials, and Information Technology. Data from 2006 to 2012, inclusive, are
considered; this period was chosen because it provides a range of market
circumstances for evaluating the model. The model is observed to achieve an accuracy of
approximately 70% when predicting stock price returns three months in advance.
Comment: 15 pages, 4 figures, Neukom Institute Computational Undergraduate
Research prize - second plac
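The abstract above does not say how the committee combines its members' outputs. As a minimal sketch, assuming simple majority voting over binary labels (+1 for a positive return, -1 for a negative one), the idea might look like the following. The three rule functions are hypothetical stand-ins for trained classifiers, not the random forest, SVM, RVM, or k-NN models used in the paper.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier maps a feature vector to a label: +1 (positive return) or -1.
Classifier = Callable[[Sequence[float]], int]


def committee_vote(members: Sequence[Classifier], features: Sequence[float]) -> int:
    """Return the label predicted by the majority of committee members."""
    votes = Counter(member(features) for member in members)
    # Assumed convention: a tie resolves to -1.
    return 1 if votes[1] > votes[-1] else -1


# Hypothetical toy rules standing in for trained models.
def momentum_rule(x: Sequence[float]) -> int:
    return 1 if x[0] > 0 else -1


def volatility_rule(x: Sequence[float]) -> int:
    return 1 if x[1] < 0.5 else -1


def reversal_rule(x: Sequence[float]) -> int:
    return 1 if x[0] < 0.1 else -1


members = [momentum_rule, volatility_rule, reversal_rule]
up = committee_vote(members, [0.05, 0.2])     # all three rules vote +1
down = committee_vote(members, [-0.02, 0.9])  # two of three rules vote -1
```

Majority voting is only one way to combine members; weighted or probability-averaged schemes are equally plausible readings of "committee".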
Effective Unsupervised Author Disambiguation with Relative Frequencies
This work addresses the problem of author name homonymy in the Web of
Science. Aiming for an efficient, simple and straightforward solution, we
introduce a novel probabilistic similarity measure for author name
disambiguation based on feature overlap. Using the researcher-ID available for
a subset of the Web of Science, we evaluate the application of this measure in
the context of agglomeratively clustering author mentions. We focus on a
concise evaluation that shows clearly for which problem setups and at which
time during the clustering process our approach works best. In contrast to most
other works in this field, we are sceptical towards the performance of author
name disambiguation methods in general and compare our approach to the trivial
single-cluster baseline. Our results are presented separately for each correct
clustering size because, as we explain, when all cases are treated together, the
trivial baseline and more sophisticated approaches are hardly distinguishable
in terms of evaluation results. Our model shows state-of-the-art performance
for all correct clustering sizes without any discriminative training and with
tuning only one convergence parameter.
Comment: Proceedings of JCDL 201
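The general shape of agglomerative clustering over feature overlap, together with the trivial single-cluster baseline, can be sketched as follows. This is an illustration under stated assumptions only: Jaccard overlap is swapped in for the paper's probabilistic similarity measure (which the abstract does not define), and the `threshold` stopping rule and feature strings are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Feature overlap between two clusters (a stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(mentions: List[Set[str]], threshold: float) -> List[List[int]]:
    """Greedily merge the most similar pair until similarity drops below threshold."""
    clusters = [[i] for i in range(len(mentions))]
    feats = [set(m) for m in mentions]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest feature overlap.
        (i, j), best = max(
            (((i, j), jaccard(feats[i], feats[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i] += clusters[j]
        feats[i] |= feats[j]
        del clusters[j], feats[j]
    return clusters


# Hypothetical author mentions described by feature sets.
mentions = [
    {"coauthor:a", "kw:ml"},
    {"coauthor:a", "kw:ml", "kw:nlp"},
    {"coauthor:z", "kw:chem"},
]
clustered = agglomerate(mentions, threshold=0.5)
# Trivial baseline: every mention of the name belongs to one author.
single_cluster = [list(range(len(mentions)))]
```

Comparing `clustered` against `single_cluster` separately for each true number of authors per name mirrors the evaluation setup the abstract describes.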