Creating language resources for under-resourced languages: methodologies, and experiments with Arabic
Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
Knowledge representation and text mining in biomedical, healthcare, and political domains
Knowledge representation and text mining can be employed to discover new knowledge and develop services by using the massive amounts of text gathered by modern information systems. The applied methods should take into account the domain-specific nature of knowledge. This thesis explores knowledge representation and text mining in three application domains.
Biomolecular events can be described very precisely and concisely with appropriate representation schemes. Protein–protein interactions are commonly modelled in biological databases as binary relationships, whereas the complex relationships used in text mining are rich in information. The experimental results of this thesis show that complex relationships can be reduced to binary relationships and that it is possible to reconstruct complex relationships from mixtures of linguistically similar relationships. This encourages the extraction of complex relationships from the scientific literature even if binary relationships are required by the application at hand. The experimental results on cross-validation schemes for pair-input data help to understand how existing knowledge regarding dependent instances (such as those concerning protein–protein pairs) can be leveraged to improve the generalisation performance estimates of learned models.
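The pair-input caveat can be made concrete with a small sketch (hypothetical protein names, not the thesis's code): a test pair is easier the more of its members already appear in the training data, so generalisation estimates should be reported separately per overlap class.

```python
def pair_overlap_class(pair, train_proteins):
    """Count how many members of a test pair occur anywhere in training:
    2 = both seen, 1 = one seen, 0 = unseen. Performance estimates should
    be reported per class, since they typically degrade from 2 down to 0."""
    return sum(p in train_proteins for p in pair)

train_pairs = [("A", "B"), ("B", "C")]            # hypothetical protein pairs
train_proteins = {p for pair in train_pairs for p in pair}
test_pairs = [("A", "C"), ("A", "D"), ("D", "E")]
classes = [pair_overlap_class(p, train_proteins) for p in test_pairs]
# classes == [2, 1, 0]
```

Splitting evaluation results along these classes is what distinguishes pair-input cross-validation from a naive random split, which silently mixes the classes together.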
Healthcare documents and news articles contain knowledge that is more difficult to model than biomolecular events, and they tend to have larger vocabularies than biomedical scientific articles. This thesis describes an ontology that models patient education documents and their content in order to improve the availability and quality of such documents. The experimental results of this thesis also show that the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) measures are a viable option for the automatic evaluation of textual patient record summarisation methods, and that the area under the receiver operating characteristic curve can be used in large-scale sentiment analysis. The sentiment analysis of Reuters news corpora suggests that the Western mainstream media portrays China negatively in politics-related articles but not in general, which provides new evidence to consider in the debate over the image of China in the Western media.
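As a reminder of what a ROUGE measure computes, here is a minimal sketch of ROUGE-1 recall, the unigram variant (the official toolkit adds stemming, stop-word handling and other options; the sentences below are illustrative, not patient data):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams covered by the
    candidate summary, with clipped counts as in the ROUGE toolkit."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

score = rouge1_recall("the patient was discharged home",
                      "the patient was sent home")
# 4 of the 5 reference unigrams are covered -> 0.8
```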
Reading the news through its structure: new hybrid connectivity based approaches
In this thesis a solution for the problem of identifying the structure of news published
by online newspapers is presented. This problem requires new approaches and algorithms
that are capable of dealing with the massive number of online publications in existence
(and that will grow in the future). The fact that news documents present a high degree of
interconnection makes this an interesting and hard problem to solve. The identification
of the structure of the news is accomplished both by descriptive methods that expose the
dimensionality of the relations between different news, and by clustering the news into
topic groups. To achieve this, the system was studied as an integrated whole from different
perspectives and approaches.
To identify news clusters and structure, and after a preparatory data-collection
phase in which several online newspapers from different parts of the globe were
collected, two newspapers were chosen in particular: the Portuguese daily
Público and the British newspaper The Guardian. Newspapers in two different languages
were chosen deliberately, so that the analysis strategies would not depend on prior
knowledge of the systems under study.
In the first case, it was shown how information theory (namely variation of information)
combined with adaptive networks was able to identify topic clusters in the news published
by the Portuguese online newspaper Público.
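Variation of information is a standard information-theoretic distance between two clusterings; a minimal sketch of the quantity on toy partitions (not the Público data):

```python
from math import log

def variation_of_information(X, Y):
    """Variation of information between two partitions of the same n items,
    each given as a list of disjoint sets (Meila's VI metric):
    VI = H(X) + H(Y) - 2 I(X; Y); zero iff the partitions are identical."""
    n = float(sum(len(x) for x in X))
    vi = 0.0
    for x in X:
        for y in Y:
            r = len(x & y) / n
            if r > 0.0:
                vi -= r * (log(r / (len(x) / n)) + log(r / (len(y) / n)))
    return vi

A = [{1, 2, 3}, {4, 5}]
B = [{1, 2}, {3, 4, 5}]
same = variation_of_information(A, A)   # 0.0 for identical partitions
diff = variation_of_information(A, B)   # strictly positive
```

Because VI is a true metric on partitions, it can compare clusterings produced by different methods without privileging either one.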
In the second case, the structure of news published by the British newspaper The
Guardian is revealed through the construction of time series of news clustered by a
k-means process. Following this, an unsupervised algorithm was developed that filters
out irrelevant news published online by taking into consideration the connectivity of
the news labels entered by the journalists. This novel hybrid technique is based on
Q-analysis for the construction of the filtered network, followed by a clustering
technique to identify the topical clusters. Presently this work uses a modularity
optimisation clustering technique, but this step is general enough that other hybrid
approaches can be used without losing generality.
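The modularity-optimisation step maximises Newman's Q over candidate partitions; a sketch of the quantity itself, computed on an illustrative toy graph rather than the news network:

```python
def modularity(adj, communities):
    """Newman modularity Q of a partition of an undirected graph.
    adj: node -> set of neighbours; communities: list of node sets.
    Q = (1 / 2m) * sum_ij [A_ij - k_i k_j / 2m] * delta(c_i, c_j)."""
    m2 = sum(len(nbrs) for nbrs in adj.values())  # 2m: each edge counted twice
    q = 0.0
    for comm in communities:
        for i in comm:
            for j in comm:
                a_ij = 1.0 if j in adj[i] else 0.0
                q += a_ij - len(adj[i]) * len(adj[j]) / m2
    return q / m2

# two triangles joined by a single bridge edge (3-4)
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
good = modularity(adj, [{1, 2, 3}, {4, 5, 6}])   # the natural split
flat = modularity(adj, [{1, 2, 3, 4, 5, 6}])     # everything together -> 0
```

An optimiser simply searches for the partition with the highest Q; any community-detection method that scores partitions this way can be slotted into the hybrid pipeline.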
A novel second-order swarm intelligence algorithm based on Ant Colony Systems
was developed for the travelling salesman problem; it consistently outperforms the
traditional benchmarks. This algorithm is used to construct Hamiltonian paths over the
published news, using the eccentricity of the different documents as a measure of distance.
This approach allows for an easy navigation between published stories that is dependent
on the connectivity of the underlying structure.
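For orientation, a plain Dorigo-style Ant Colony System for the TSP is sketched below on a toy four-city instance (the classic baseline, not the thesis's second-order variant; parameter values are illustrative):

```python
import random

def ant_colony_tsp(dist, n_ants=10, n_iters=50, alpha=1.0, beta=2.0,
                   rho=0.1, q0=0.9, seed=0):
    """Classic Ant Colony System sketch for the TSP.
    dist: symmetric distance matrix (list of lists)."""
    rng = random.Random(seed)
    n = len(dist)
    tau0 = (n - 1) / (n * sum(dist[0][1:]))  # rough initial pheromone level
    tau = [[tau0] * n for _ in range(n)]
    best_tour, best_len = None, float("inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            start = rng.randrange(n)
            tour, unvisited = [start], set(range(n)) - {start}
            while unvisited:
                i = tour[-1]
                score = {j: (tau[i][j] ** alpha) / (dist[i][j] ** beta)
                         for j in unvisited}
                if rng.random() < q0:            # exploitation: best edge
                    j = max(score, key=score.get)
                else:                            # biased exploration
                    r = rng.random() * sum(score.values())
                    acc = 0.0
                    for j, s in score.items():
                        acc += s
                        if acc >= r:
                            break
                tour.append(j)
                unvisited.remove(j)
                # local pheromone update on the traversed edge
                tau[i][j] = tau[j][i] = (1 - rho) * tau[i][j] + rho * tau0
            length = sum(dist[tour[k - 1]][tour[k]] for k in range(n))
            if length < best_len:
                best_tour, best_len = tour, length
        # global update: reinforce the best tour found so far
        for k in range(n):
            i, j = best_tour[k - 1], best_tour[k]
            tau[i][j] = tau[j][i] = (1 - rho) * tau[i][j] + rho / best_len
    return best_tour, best_len

# toy instance: four cities on the unit square; the optimal tour has length 4
s2 = 2 ** 0.5
dist = [[0.0, 1.0, s2, 1.0],
        [1.0, 0.0, 1.0, s2],
        [s2, 1.0, 0.0, 1.0],
        [1.0, s2, 1.0, 0.0]]
tour, length = ant_colony_tsp(dist)
```

In the thesis's setting the same machinery applies with document eccentricity in place of geometric distance, so the "tour" becomes a navigable Hamiltonian path through the published stories.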
The results presented in this work show the importance of treating topic detection in
large corpora as a multitude of relations and connectivities that are not static.
They also change the way multi-dimensional ensembles are viewed, by showing that
including the high-dimensional connectivities yields better results for a particular
problem, as was the case in clustering the news published online.
Social impact retrieval: measuring author influence on information retrieval
The increased presence of technologies collectively referred to as Web 2.0 means the entire process of new media production and dissemination has moved away from an
author-centric approach. Casual web users and browsers are increasingly able to play a more active role in the information creation process. This means that the traditional ways in which information sources may be validated and scored must adapt accordingly.
In this thesis we propose a new way in which to look at a user's contributions to the network in which they are present, using these interactions to provide a measure of
authority and centrality to the user. This measure is then used to attribute a query-independent interest score to each of the contributions the author makes, enabling us
to provide other users with relevant information which has been of greatest interest to a community of like-minded users. This is done through the development of two
algorithms: AuthorRank and MessageRank.
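AuthorRank and MessageRank themselves are defined in the thesis; as a stand-in for the general idea of authority propagating along interaction links, a PageRank-style power iteration can be sketched (toy user names and assumed reply links, not the thesis's formulation):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over a directed interaction graph.
    links: node -> list of targets (e.g. who a user replies to or annotates);
    authority flows along these links toward frequently addressed users."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = d * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its mass uniformly
                for t in nodes:
                    new[t] += d * rank[v] / n
        rank = new
    return rank

# toy network: both other users interact with 'alice'
links = {"bob": ["alice"], "carol": ["alice"], "alice": ["bob"]}
r = pagerank(links)
# 'alice' accumulates the most authority
```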
We present two real-world user experiments focused on multimedia annotation and browsing systems that we built; these systems were novel in themselves, bringing together video and text browsing as well as free-text annotation. Using these systems as examples of real-world applications for our approaches, we then look at a
larger-scale experiment based on the author and citation networks of a ten-year period (1997-2007) of the ACM SIGIR conference on information retrieval. We use the
citation context of SIGIR publications as a proxy for annotations, constructing large social networks between authors. Against these networks we show the effectiveness of
incorporating user-generated content, or annotations, to improve information retrieval.
Computational Argumentation Approaches to Improve Sensemaking and Evidence-based Reasoning in Online Deliberation Systems
Deliberation is the process through which communities identify potential solutions for a problem and, through dialogic communication, select the solution that most effectively meets their diverse requirements. Online deliberation is nowadays implemented by means of social media and online discussion platforms; however, these media present significant challenges and issues that can be traced to inadequate support for Sensemaking processes and poor support for the quality characteristics of deliberation.
This thesis investigates integrating computational argumentation methods into online deliberation platforms as an effective way to improve participants' perception of the quality of the deliberation process, to help them make sense of the overall process, and to produce healthier social dynamics.
For that, two computational artefacts are proposed: (i) a Synoptical summariser of long discussions and (ii) a Scientific Argument Recommender System (SciArgRecSys).
The two artefacts are designed and developed with state-of-the-art methods (including the use of Large Language Models, LLMs) and evaluated both intrinsically and extrinsically when deployed on a real, live platform (BCause).
Through extensive evaluation, the positive effect of both artefacts on human Sensemaking and on essential quality characteristics of deliberation, such as reciprocal Engagement, Mutual Understanding, and Social dynamics, is illustrated. In addition, it has been demonstrated that these interventions effectively reduce polarisation and the formation of sub-communities, while significantly enhancing the quality of the discussion by making it more coherent and diverse.
Context-Aware Message-Level Rumour Detection with Weak Supervision
Beyond being a communication medium, social media has become the main source of all sorts of information. Its intrinsic nature allows a continuous and massive flow of misinformation to make a severe impact worldwide. In particular, rumours emerge unexpectedly and spread quickly, and it is challenging to track down their origins and stop their propagation. One of the most promising solutions is to identify rumour-mongering messages as early as possible, which is commonly referred to as "Early Rumour Detection (ERD)". This dissertation focuses on ERD on social media by exploiting weak supervision and contextual information. Weak supervision is a branch of ML in which noisy and less precise sources (e.g. data patterns) are leveraged to supplement limited high-quality labelled data (Ratner et al., 2017). This is intended to reduce the cost and increase the efficiency of hand-labelling large-scale data. This thesis aims to study whether identifying rumours before they go viral is possible and to develop an architecture for ERD at the individual post level. To this end, it first explores the major bottlenecks of current ERD. It also uncovers a research gap between system design and its applications in the real world, which has received less attention from the ERD research community. One bottleneck is limited labelled data. Weakly supervised methods to augment limited labelled training data for ERD are introduced. The other bottleneck is the enormous amount of noisy data. A framework unifying burst detection based on temporal signals and burst summarisation is investigated to identify potential rumours (i.e. input to rumour detection models) by filtering out uninformative messages. Finally, a novel method which jointly learns rumour sources and their contexts (i.e. conversational threads) for ERD is proposed. An extensive evaluation setting for ERD systems is also introduced.
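The burst-detection component can be illustrated by a simple trailing-window z-score detector over per-time-step message counts (an illustrative stand-in with made-up counts; the thesis's framework also summarises each detected burst):

```python
from statistics import mean, stdev

def detect_bursts(counts, window=5, z=2.0):
    """Flag time steps whose message count exceeds the trailing-window
    mean by z standard deviations -- a minimal temporal-signal burst
    detector for surfacing candidate rumour activity."""
    bursts = []
    for t in range(window, len(counts)):
        hist = counts[t - window:t]
        mu, sigma = mean(hist), stdev(hist)
        if sigma > 0 and counts[t] > mu + z * sigma:
            bursts.append(t)
    return bursts

counts = [3, 4, 3, 5, 4, 4, 3, 40, 5, 4]
bursts = detect_bursts(counts)
# the spike to 40 at index 7 is flagged
```

Only the messages inside flagged windows would then be passed on to the rumour detection model, filtering out the uninformative bulk of the stream.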
From Keyword Search to Exploration: How Result Visualization Aids Discovery on the Web
A key to the Web's success is the power of search. The elegant way in which search results are returned is usually remarkably effective. However, for exploratory search, in which users need to learn, discover, and understand novel or complex topics, there is substantial room for improvement. Human-computer interaction researchers and web browser designers have developed novel strategies to improve Web search by enabling users to conveniently visualize, manipulate, and organize their Web search results. This monograph offers fresh ways to think about search-related cognitive processes and describes innovative design approaches to browsers and related tools. For instance, while keyword search presents users with results for specific information (e.g., what is the capital of Peru), other methods may let users see and explore the contexts of their requests for information (related or previous work, conflicting information), or the properties that associate groups of information assets (group legal decisions by lead attorney). We also consider both the traditional and novel ways in which these strategies have been evaluated. From our review of cognitive processes, browser design, and evaluations, we reflect on the future opportunities and new paradigms for exploring and interacting with Web search results.
Concept-based Interactive Query Expansion Support Tool (CIQUEST)
This report describes a three-year project (2000-03) undertaken in the Information Studies
Department at The University of Sheffield and funded by Resource, The Council for
Museums, Archives and Libraries. The overall aim of the research was to provide user
support for query formulation and reformulation in searching large-scale textual resources
including those of the World Wide Web. More specifically the objectives were: to investigate
and evaluate methods for the automatic generation and organisation of concepts derived from
retrieved document sets, based on statistical methods for term weighting; and to conduct
user-based evaluations on the understanding, presentation and retrieval effectiveness of
concept structures in selecting candidate terms for interactive query expansion.
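One statistical term-weighting option of the kind investigated is tf-idf over the retrieved document set; a minimal sketch for ranking candidate expansion terms (toy documents, not the TREC data, and only one of several weighting schemes the project compared):

```python
from collections import Counter
from math import log

def tfidf_terms(docs, top_k=3):
    """Rank candidate query-expansion terms from a retrieved set by
    summed tf-idf weight. docs: list of token lists; terms that occur
    in every document score log(1) = 0 and are never suggested."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        for term, f in Counter(doc).items():
            scores[term] += f * log(n / df[term])
    return [t for t, _ in scores.most_common(top_k)]

docs = [
    "query expansion improves retrieval".split(),
    "concept hierarchies support query expansion".split(),
    "precision precision recall retrieval".split(),
]
top = tfidf_terms(docs)
# 'precision' (frequent in one document, absent elsewhere) ranks first
```

Terms weighted this way can then be grouped into the concept hierarchies that the searcher browses when selecting expansion terms.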
The TREC test collection formed the basis for the seven evaluative experiments conducted in
the course of the project. These formed four distinct phases in the project plan. In the first
phase, a series of experiments was conducted to investigate further techniques for concept
derivation and hierarchical organisation and structure. The second phase was concerned with
user-based validation of the concept structures. Results of phases 1 and 2 informed the
design of the test system, and the user interface was developed in phase 3. The final phase
entailed a user-based summative evaluation of the CiQuest system.
The main findings demonstrate that concept hierarchies can effectively be generated from
sets of retrieved documents and displayed to searchers in a meaningful way. The approach
provides the searcher with an overview of the contents of the retrieved documents, which in
turn facilitates the viewing of documents and selection of the most relevant ones. Concept
hierarchies are a good source of terms for query expansion and can improve precision. The
extraction of descriptive phrases as an alternative source of terms was also effective. With
respect to presentation, cascading menus were easy to browse for selecting terms and for
viewing documents. In conclusion, the project dissemination programme and future work are
outlined.