Recommender Systems
The ongoing rapid expansion of the Internet greatly increases the necessity
of effective recommender systems for filtering the abundant information.
Extensive research on recommender systems is conducted by a broad range of
communities including social and computer scientists, physicists, and
interdisciplinary researchers. Despite substantial theoretical and practical
achievements, unification and comparison of different approaches are lacking,
which impedes further advances. In this article, we review recent developments
in recommender systems and discuss the major challenges. We compare and
evaluate available algorithms and examine their roles in future developments.
In addition to algorithms, physical aspects are described to illustrate the
macroscopic behavior of recommender systems. Potential impacts and future
directions are discussed. We emphasize that recommendation has great
scientific depth and combines diverse research fields, which makes it of
interest to physicists as well as interdisciplinary researchers.
Comment: 97 pages, 20 figures (to appear in Physics Reports)
ISTopic: Understanding Information Systems Research through Topic Models
What are the fundamental research questions in Information Systems? How do various research topics relate to one another to form the IS research landscape, and how do they evolve over time? This study is an initial attempt to answer these questions using topic models to investigate the topics examined by the premier IS journals from 1977 to 2014. We present an IS Topic Graph that contains 33 research areas, 31 of which are closely connected with one another. Further analyses of this graph reveal how different IS research areas are intertwined to the extent that they are almost inseparable. Looking into IS research at a finer level, we identify 300 research topics, and a chronological analysis reveals a trend of topic diversification and externalization. To guide future research, an intelligent literature search tool called ISTopic has been built and is available at http://www.istopic.org for public access.
An Enhanced Web Document Search Engine using a Semantic Network
With the rapid advancement of ICT, the World Wide Web (referred to as the Web) has become the
biggest information repository, whose volume keeps growing on a daily basis. The challenge is how to find the most wanted information from the Web with minimum effort. This paper presents a novel ontology-based framework for finding the web pages related to a given term within a few specific websites. With this framework, a web crawler first learns the content of web pages within the given websites; then the topic modeller finds the relations between web pages and topics via keywords found on the web pages, using the Latent Dirichlet Allocation (LDA) technique. After that, the ontology builder establishes an ontology, a semantic network of web pages based on the topic model. Finally, a reasoner can find the web pages related to a given term by making use of the ontology. The framework and related modelling techniques have been verified using a few test websites, and the results demonstrate its superiority over existing web search tools.
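The page-to-topic stage that the abstract describes can be sketched as follows. This is only an illustration of the general idea with scikit-learn's LDA, not the authors' implementation; the page names, texts, and the 0.3 relevance threshold are our assumptions.

```python
# Hedged sketch: fit LDA over crawled page text, then link each page to
# its dominant topics, as a stand-in for the paper's topic modeller stage.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical crawled pages (not from the paper).
pages = {
    "about.html":   "research lab machine learning models data",
    "courses.html": "course lecture student exam teaching",
    "ml.html":      "machine learning data models training",
}

vec = CountVectorizer()
X = vec.fit_transform(pages.values())

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a topic distribution summing to 1

# A minimal page->topic "semantic network": keep links whose topic weight
# exceeds an (assumed) threshold of 0.3.
links = {
    page: [t for t, w in enumerate(dist) if w > 0.3]
    for page, dist in zip(pages, doc_topics)
}
print(links)
```

With two topics, every page's largest weight is at least 0.5, so each page is linked to at least one topic; the full framework would then build the ontology over these links.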
Metric for selecting the number of topics in the LDA Model
The latest technological trends are driving a vast and growing amount of textual data. Topic modeling is a useful tool for extracting information from large corpora of text. A topic model takes a corpus of documents, discovers the topics that permeate the corpus, and assigns documents to those topics. The Latent Dirichlet Allocation (LDA) model is the main, or most popular, of the probabilistic topic models. The LDA model is conditioned by three parameters: two Dirichlet hyperparameters (α and β) and the number of topics (K). Determining the parameter K is extremely important yet not extensively explored in the literature, mainly due to the intensive computation and long processing time involved. Most topic modeling methods implicitly assume that the number of topics is known in advance, treating it as an exogenous parameter. This burdens the researcher and leaves the technique prone to subjectivity. The quality of the insights offered by LDA is quite sensitive to the value of the parameter K, and an excess of subjectivity in its choice might undermine the confidence managers place in the technique's results, thus hindering its adoption by firms. This dissertation's main objective is to develop a metric for identifying the ideal value of the parameter K of the LDA model that allows an adequate representation of the corpus within a tolerable processing time. We apply the proposed metric alongside existing metrics to two datasets. Experiments show that the proposed method selects a number of topics similar to that of other metrics, but with better performance in terms of processing time. Although each metric has its own method for determining the number of topics, some results are similar for the same database, as evidenced in the study. Our metric is superior when considering processing time. Experiments show this method is effective.
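The generic procedure this work improves on can be sketched as follows. The dissertation's own metric is not given in the abstract, so as a labeled stand-in we use scikit-learn's held-in perplexity (lower is better) to compare candidate values of K; the corpus and candidate range are our assumptions.

```python
# Hedged sketch: fit LDA for several candidate K and compare a quality
# score. Perplexity is a stand-in for the dissertation's (unstated) metric.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus with two clear themes.
corpus = [
    "stock market trading price investor",
    "market price stock shares fund",
    "soccer match goal team player",
    "team player match league goal",
]
X = CountVectorizer().fit_transform(corpus)

scores = {}
for k in (2, 3, 4):  # assumed candidate range for K
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)  # lower perplexity = better fit

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

The dissertation's contribution is precisely to make this search cheaper: each candidate K requires a full LDA fit, which is what drives the long processing times the abstract mentions.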
Multi-Industry Simplex: A Probabilistic Extension of GICS
Accurate industry classification is a critical tool for many asset management
applications. While the current industry gold-standard GICS (Global Industry
Classification Standard) has proven to be reliable and robust in many settings,
it has limitations that cannot be ignored. Fundamentally, GICS is a
single-industry model, in which every firm is assigned to exactly one group -
regardless of how diversified that firm may be. This approach breaks down for
large conglomerates like Amazon, which have risk exposure spread out across
multiple sectors. We attempt to overcome these limitations by developing MIS
(Multi-Industry Simplex), a probabilistic model that can flexibly assign a firm
to as many industries as can be supported by the data. In particular, we
apply topic modeling, a natural language processing approach, to business
descriptions in order to extract and identify corresponding industries. Each
identified industry comes with a relevance probability, allowing for high
interpretability and easy auditing, circumventing the black-box nature of
alternative machine learning approaches. We describe this model in detail and
provide two use-cases that are relevant to asset management - thematic
portfolios and nearest neighbor identification. While our approach has
limitations of its own, we demonstrate the viability of probabilistic industry
classification and hope to inspire future research in this field.
Comment: 17 pages, 10 figures
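The abstract does not give MIS's exact formulation, but the core idea — a probability vector over industries per firm — can be illustrated with plain LDA over business descriptions. The firm names, descriptions, and the 0.10 materiality threshold below are our assumptions, not part of the paper.

```python
# Hedged sketch: each firm's business description yields a probability
# vector over latent "industries", so a diversified firm can load on
# several of them. Plain LDA stands in for the paper's MIS model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical firms (not from the paper).
descriptions = {
    "RetailCo": "online retail shopping consumer goods delivery",
    "CloudCo":  "cloud computing data center software services",
    "MegaCorp": "online retail delivery cloud computing data services",
}
X = CountVectorizer().fit_transform(descriptions.values())
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
exposures = lda.transform(X)  # firm-by-industry relevance probabilities

for firm, probs in zip(descriptions, exposures):
    # Keep every industry whose relevance probability is material
    # (threshold of 0.10 is an assumed cutoff for illustration).
    held = [(i, round(float(p), 2)) for i, p in enumerate(probs) if p > 0.10]
    print(firm, held)
```

Because the rows are probability distributions, they are directly auditable — which is the interpretability advantage the abstract claims over black-box classifiers.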
The Identification of Noteworthy Hotel Reviews for Hotel Management
The rapid emergence of user-generated content (UGC) inspires knowledge sharing among Internet users. A good example is the well-known travel site TripAdvisor.com, which enables users to share their experiences and express their opinions on attractions, accommodations, restaurants, etc. Travel-related UGC provides precious information to users as well as practitioners in the travel industry. In particular, how to identify reviews that are noteworthy for hotel management is critical to the success of hotels in the competitive travel industry. We employed two hotel managers to examine Taiwan's hotel reviews on TripAdvisor.com and found that noteworthy reviews can be characterized by their content features, sentiments, and review qualities. Through experiments using TripAdvisor.com data, we find that all three types of features are important in identifying noteworthy hotel reviews. Specifically, content features are shown to have the most impact, followed by sentiments and review qualities. With respect to the various methods for representing content features, the LDA method achieves performance comparable to the TF-IDF method, with higher recall and far fewer features.
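The trade-off the last sentence describes — LDA matching TF-IDF with far fewer features — can be sketched by representing the same reviews both ways. The toy reviews and topic count below are our assumptions, not the paper's data or pipeline.

```python
# Hedged sketch: the same reviews as a TF-IDF matrix over the whole
# vocabulary vs. a compact LDA topic distribution. The LDA
# representation uses far fewer features (columns).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical hotel reviews (not from the paper's TripAdvisor data).
reviews = [
    "great room clean staff friendly breakfast",
    "dirty room rude staff noisy night",
    "friendly staff great breakfast clean lobby",
]

tfidf = TfidfVectorizer().fit_transform(reviews)           # one column per word
counts = CountVectorizer().fit_transform(reviews)
topics = LatentDirichletAllocation(
    n_components=2, random_state=0                         # assumed topic count
).fit_transform(counts)                                    # one column per topic

print(tfidf.shape[1], topics.shape[1])  # vocabulary size vs. number of topics
```

A downstream classifier for noteworthiness would then train on either matrix; the LDA one is denser and much lower-dimensional, which is what makes the comparable performance notable.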
Quantitative Perspectives on Fifty Years of the Journal of the History of Biology
The Journal of the History of Biology provides a fifty-year record for
examining the evolution of the history of biology as a scholarly discipline. In
this paper, we present a new dataset and preliminary quantitative analysis of
the thematic content of JHB from the perspectives of geography, organisms, and
thematic fields. The geographic diversity of authors whose work appears in JHB
has increased steadily since 1968, but the geographic coverage of the content
of JHB articles remains strongly lopsided toward the United States, United
Kingdom, and western Europe and has diversified much less dramatically over
time. The taxonomic diversity of organisms discussed in JHB increased steadily
between 1968 and the late 1990s but declined in later years, mirroring broader
patterns of diversification previously reported in the biomedical research
literature. Finally, we used a combination of topic modeling and nonlinear
dimensionality reduction techniques to develop a model of multi-article fields
within JHB. We found evidence for directional changes in the representation of
fields on multiple scales. The diversity of JHB with regard to the
representation of thematic fields has increased overall, with most of that
diversification occurring in recent years. Drawing on the dataset generated in
the course of this analysis, as well as web services in the emerging digital
history and philosophy of science ecosystem, we have developed an interactive
web platform for exploring the content of JHB, and we provide a brief overview
of the platform in this article. As a whole, the data and analyses presented
here provide a starting-place for further critical reflection on the evolution
of the history of biology over the past half-century.
Comment: 45 pages, 14 figures, 4 tables
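The combination of topic modeling and nonlinear dimensionality reduction mentioned above can be sketched generically: fit LDA over article texts, then project the article-topic distributions to two dimensions so that thematically related articles cluster together. The toy articles, topic count, and use of t-SNE specifically are our assumptions; the paper does not name its exact reduction technique in this abstract.

```python
# Hedged sketch: LDA article-topic distributions projected to 2-D with
# t-SNE, as a stand-in for the paper's field-modeling pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE

# Hypothetical article snippets (not from JHB).
articles = [
    "darwin evolution selection species",
    "evolution species selection variation",
    "genetics gene heredity mendel",
    "gene heredity genetics mutation",
    "ecology field population habitat",
    "population ecology habitat species",
]
X = CountVectorizer().fit_transform(articles)
theta = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(X)

# Project the 3-D topic mixtures to 2-D; perplexity must stay below the
# number of samples for this tiny corpus.
coords = TSNE(n_components=2, perplexity=2.0, init="random",
              random_state=0).fit_transform(theta)
print(coords.shape)
```

Groups of nearby points in such an embedding are then candidates for the "multi-article fields" whose representation over time the paper tracks.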