Named Entity Extraction and Disambiguation: The Reinforcement Effect.
Named entity extraction and disambiguation have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and the semantic web. Although these topics are highly dependent, almost no existing works examine this dependency. It is the aim of this paper to examine the dependency and show how one affects the other, and vice versa. We conducted experiments with a set of descriptions of holiday homes with the aim of extracting and disambiguating toponyms as a representative example of named entities. We experimented with three approaches for disambiguation with the purpose of inferring the country of the holiday home. We examined how the effectiveness of extraction influences the effectiveness of disambiguation, and reciprocally, how filtering out ambiguous names (an activity that depends on the disambiguation process) improves the effectiveness of extraction. Since this, in turn, may improve the effectiveness of disambiguation again, it shows that extraction and disambiguation may reinforce each other.
Spatio-textual indexing for geographical search on the web
Many web documents refer to specific geographic localities and many
people include geographic context in queries to web search engines. Standard
web search engines treat the geographical terms in the same way as other terms.
This can result in failure to find relevant documents that refer to the place of
interest using alternative related names, such as those of included or nearby
places. This can be overcome by associating text indexing with spatial indexing
methods that exploit geo-tagging procedures to categorise documents with
respect to geographic space. We describe three methods for spatio-textual
indexing based on multiple spatially indexed text indexes, attaching spatial
indexes to the document occurrences of a text index, and merging text index
access results with results of access to a spatial index of documents. These
schemes are compared experimentally with a conventional text index search
engine, using a collection of geo-tagged web documents, and are shown to be
able to compete in speed and storage performance with pure text indexing.
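The third scheme in the abstract above (merging text index results with the results of a spatial index of documents) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `SpatioTextualIndex`, the fixed one-degree grid, and the use of a single point per document as its geographic footprint are not taken from the paper.

```python
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees (assumed value)


def cell_of(lat, lon):
    """Map a coordinate to the grid cell that contains it."""
    return (int(lat // CELL), int(lon // CELL))


class SpatioTextualIndex:
    """Hypothetical index: an inverted text index plus a grid spatial index."""

    def __init__(self):
        self.text = defaultdict(set)    # term -> document ids
        self.space = defaultdict(set)   # grid cell -> document ids
        self.footprint = {}             # document id -> (lat, lon)

    def add(self, doc_id, text, lat, lon):
        for term in text.lower().split():
            self.text[term].add(doc_id)
        self.space[cell_of(lat, lon)].add(doc_id)
        self.footprint[doc_id] = (lat, lon)

    def search(self, terms, south, west, north, east):
        # Text side: documents that contain every query term.
        postings = [self.text.get(t.lower(), set()) for t in terms]
        text_hits = set.intersection(*postings) if postings else set()
        # Spatial side: scan cells overlapping the box, then check exactly.
        spatial_hits = set()
        lo_row, lo_col = cell_of(south, west)
        hi_row, hi_col = cell_of(north, east)
        for row in range(lo_row, hi_row + 1):
            for col in range(lo_col, hi_col + 1):
                for doc_id in self.space.get((row, col), ()):
                    lat, lon = self.footprint[doc_id]
                    if south <= lat <= north and west <= lon <= east:
                        spatial_hits.add(doc_id)
        # Merge step: keep documents returned by both indexes.
        return sorted(text_hits & spatial_hits)
```

With this layout, a query for "hotel" restricted to a bounding box around Cardiff also returns a document that only names a nearby place, because the place name in the query is resolved spatially and not matched as a term.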
Experiments in terabyte searching, genomic retrieval and novelty detection for TREC 2004
In TREC 2004, Dublin City University took part in three tracks: Terabyte (in collaboration with University College Dublin), Genomic, and Novelty. In this paper we discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we have developed in the last year to support our experiments in large-scale, distributed information retrieval, and which underlies all of the track experiments described in this document.
Quantitative Perspectives on Fifty Years of the Journal of the History of Biology
Journal of the History of Biology provides a fifty-year long record for
examining the evolution of the history of biology as a scholarly discipline. In
this paper, we present a new dataset and preliminary quantitative analysis of
the thematic content of JHB from the perspectives of geography, organisms, and
thematic fields. The geographic diversity of authors whose work appears in JHB
has increased steadily since 1968, but the geographic coverage of the content
of JHB articles remains strongly lopsided toward the United States, United
Kingdom, and western Europe and has diversified much less dramatically over
time. The taxonomic diversity of organisms discussed in JHB increased steadily
between 1968 and the late 1990s but declined in later years, mirroring broader
patterns of diversification previously reported in the biomedical research
literature. Finally, we used a combination of topic modeling and nonlinear
dimensionality reduction techniques to develop a model of multi-article fields
within JHB. We found evidence for directional changes in the representation of
fields on multiple scales. The diversity of JHB with regard to the
representation of thematic fields has increased overall, with most of that
diversification occurring in recent years. Drawing on the dataset generated in
the course of this analysis, as well as web services in the emerging digital
history and philosophy of science ecosystem, we have developed an interactive
web platform for exploring the content of JHB, and we provide a brief overview
of the platform in this article. As a whole, the data and analyses presented
here provide a starting-place for further critical reflection on the evolution
of the history of biology over the past half-century.
Comment: 45 pages, 14 figures, 4 tables
Knowledge Discovery in Online Repositories: A Text Mining Approach
Before the advent of the Internet, newspapers were the prominent instrument of
mobilization for independence and political struggles. Since independence in Nigeria, the
political class has adopted newspapers as a medium of political competition and
communication. Consequently, most political information exists in unstructured form,
hence the need to tap into it using a text mining algorithm.
This paper implements a text mining algorithm on unstructured data from newspapers. The algorithm involves the following natural language processing techniques: tokenization, text filtering, and refinement. Following these steps, association rule mining is used to extract knowledge, using the Modified Generating Association Rules based on Weighting scheme (GARW).
The main contribution of the technique is that it integrates an information retrieval scheme, Term Frequency Inverse Document Frequency (TF-IDF), with a data mining technique for association rule discovery. TF-IDF is used for keyword/feature selection: it automatically selects the most discriminative keywords for use in association rule generation. The program is applied to pre-election information obtained from the website of the Nigerian Guardian newspaper. The extracted association rules contained important features and described the informative news included in the document collection when related to the concluded 2007 presidential election. The system presented useful information that could help sanitize the polity as well as protect the nascent democracy.
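The pipeline described above (tokenization, TF-IDF keyword selection, then association rule discovery) can be sketched roughly as follows. This is not the GARW algorithm from the paper: the rule miner is a plain support/confidence counter over keyword pairs, and the function names `tokenize`, `top_keywords`, and `pair_rules` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase, split on whitespace, and keep alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def top_keywords(docs, k):
    """Return, per document, the set of its k highest-scoring TF-IDF terms."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(set(ranked[:k]))
    return result


def pair_rules(transactions, min_support, min_confidence):
    """Mine rules x -> y over keyword pairs (simplified, not GARW)."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in t)
    pair_count = Counter(
        p for t in transactions for p in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)
```

In use, the keyword sets returned by `top_keywords` would serve as the transactions passed to `pair_rules`. A term that occurs in every document gets a TF-IDF score of zero, so it is the last to be selected as a keyword.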
Explicit diversification of event aspects for temporal summarization
During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but are semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent work in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of event. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the number of redundant and off-topic snippets returned, while also increasing summary timeliness.
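The idea of selecting snippets to maximize coverage of explicit aspects can be illustrated with a greedy sketch. The objective below mixes relevance with weighted coverage of still-uncovered aspects, a common form in explicit search result diversification. Whether the article uses this exact objective is an assumption, and the names `diversify`, `aspect_weight`, and `lam` are hypothetical.

```python
def diversify(snippets, aspect_weight, k, lam=0.5):
    """Greedily pick k snippet ids, trading relevance against aspect coverage.

    snippets: dict mapping id -> (relevance, {aspect: coverage in [0, 1]})
    aspect_weight: dict mapping aspect -> importance weight
    lam: mixing weight; 0 ranks by relevance only, 1 by coverage only
    """
    selected = []
    # Degree to which each aspect is still uncovered by the selection so far.
    uncovered = dict.fromkeys(aspect_weight, 1.0)
    remaining = dict(snippets)
    while remaining and len(selected) < k:

        def score(sid):
            rel, cov = remaining[sid]
            gain = sum(
                aspect_weight[a] * cov.get(a, 0.0) * uncovered[a]
                for a in aspect_weight
            )
            return (1 - lam) * rel + lam * gain

        # Iterating over sorted ids makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        _, cov = remaining.pop(best)
        for a in aspect_weight:
            uncovered[a] *= 1.0 - cov.get(a, 0.0)
        selected.append(best)
    return selected
```

Once an aspect is fully covered by a selected snippet, further snippets about it gain nothing from the coverage term, so a less relevant snippet on a different aspect can be ranked ahead of them.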
The Extraction of Community Structures from Publication Networks to Support Ethnographic Observations of Field Differences in Scientific Communication
The scientific community of researchers in a research specialty is an
important unit of analysis for understanding the field specific shaping of
scientific communication practices. These scientific communities are, however,
a challenging unit of analysis to capture and compare because they overlap,
have fuzzy boundaries, and evolve over time. We describe a network analytic
approach that reveals the complexities of these communities through examination
of their publication networks in combination with insights from ethnographic
field studies. We suggest that the structures revealed indicate overlapping
sub-communities within a research specialty and we provide evidence that they
differ in disciplinary orientation and research practices. By mapping the
community structures of scientific fields we aim to increase confidence about
the domain of validity of ethnographic observations as well as of collaborative
patterns extracted from publication networks thereby enabling the systematic
study of field differences. The network analytic methods presented include
methods to optimize the delineation of a bibliographic data set in order to
adequately represent a research specialty, and methods to extract community
structures from this data. We demonstrate the application of these methods in a
case study of two research specialties in the physical and chemical sciences.
Comment: Accepted for publication in JASIS
What guidance are researchers given on how to present network meta-analyses to end-users such as policymakers and clinicians? A systematic review
© 2014 Sullivan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Introduction: Network meta-analyses (NMAs) are complex methodological approaches that may be challenging for non-technical end-users, such as policymakers and clinicians, to understand. Consideration should be given to identifying optimal approaches to presenting NMAs that help clarify analyses. It is unclear what guidance researchers currently have on how to present and tailor NMAs to different end-users.
Methods: A systematic review of NMA guidelines was conducted to identify guidance on how to present NMAs. Electronic databases and supplementary sources were searched for NMA guidelines. Presentation format details related to sample formats, target audiences, data sources, analysis methods and results were extracted and frequencies tabulated. Guideline quality was assessed following criteria developed for clinical practice guidelines.
Results: Seven guidelines were included. Current guidelines focus on how to conduct NMAs but provide limited guidance to researchers on how to best present analyses to different end-users. None of the guidelines provided reporting templates. Few guidelines provided advice on tailoring presentations to different end-users, such as policymakers. Available guidance on presentation formats focused on evidence networks, characteristics of individual trials, comparisons between direct and indirect estimates and assumptions of heterogeneity and/or inconsistency. Some guidelines also provided examples of figures and tables that could be used to present information.
Conclusions: Limited guidance exists for researchers on how best to present NMAs in an accessible format, especially for non-technical end-users such as policymakers and clinicians. NMA guidelines may require further integration with end-users' needs, when NMAs are used to support healthcare policy and practice decisions. Developing presentation formats that enhance understanding and accessibility of NMAs could also enhance the transparency and legitimacy of decisions informed by NMAs.
Funding: The Canadian Institute of Health Research (CIHR) Drug Safety and Effectiveness Network (Funding reference number – 116573)