NLP Driven Models for Automatically Generating Survey Articles for Scientific Topics.
This thesis presents new methods that use natural language processing (NLP) driven models for summarizing research in scientific fields. Given a topic query in the form of a text string, we present methods for finding research articles relevant to the topic, as well as summarization algorithms that use lexical and discourse information present in the text of these articles to generate coherent and readable extractive summaries of past research on the topic. In addition to summarizing prior research, good survey articles should also forecast future trends. With this motivation, we present work on forecasting the future impact of scientific publications using NLP driven features.
PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/113407/1/rahuljha_1.pd
Proceedings of QG2010: The Third Workshop on Question Generation
These are the peer-reviewed proceedings of "QG2010, The Third Workshop on Question Generation". The workshop included a special track for "QGSTEC2010: The First Question Generation Shared Task and Evaluation Challenge".
QG2010 was held as part of The Tenth International Conference on Intelligent Tutoring Systems (ITS2010).
DISCOURSE ANALYSIS OF LYRIC AND LYRIC-BASED CLASSIFICATION OF MUSIC
Master's, Master of Science
The Enhancement of Arabic Information Retrieval Using Arabic Text Summarization
The massive volume of text uploaded to the internet makes text overhead one of the important challenges facing Information Retrieval (IR) systems. The purpose of this research is to maintain reasonable relevancy and increase the efficiency of the information retrieval system by creating a short and informative inverted index and by supporting the user query with a set of semantically related terms extracted automatically. To achieve this purpose, two new models for text mining were developed and implemented: the first, called the Multi-Layer Similarity (MLS) model, uses Latent Semantic Analysis (LSA) in an efficient framework; the second, called the Noun Based Distinctive Verbs (NBDV) model, investigates the semantic meanings of nouns by identifying the set of distinctive verbs that describe them.
The Arabic language was chosen as the language of the case study because one of the primary objectives of this research is to measure the effect of the MLS and NBDV models on the relevancy of Arabic IR (AIR) systems that use the Vector Space Model, and to measure the accuracy of applying the MLS model in terms of the recall and precision of Arabic text extraction systems.
Initiating this research required a deep reading of what has been achieved in the field of Arabic information retrieval. In this regard, a quantitative relevancy survey was conducted to measure the enhancements achieved. The survey reviewed the impact of statistical and morphological analysis of Arabic text on improving AIR relevancy, and measured the contributions of stemming, indexing, query expansion, automatic text summarization, text translation, part-of-speech tagging, and named entity recognition to enhancing the relevancy of AIR. Our survey emphasized the quantitative relevancy measurements provided in the surveyed publications. It showed that researchers have made significant progress, especially in building accurate stemmers, with precision rates approaching 97%, and in measuring the impact of different indexing strategies. Query expansion and text translation showed a positive effect on relevancy. However, other tasks such as named entity recognition and automatic text summarization still need more research to establish their impact on Arabic IR.
The use of LSA in text mining demands large space and time requirements. In the first part of this research, a new text extraction model was proposed, designed, implemented, and evaluated. The new method sets out a framework for efficiently employing statistical semantic analysis in automatic text extraction. It uses the centrality feature, which estimates the similarity of a sentence with respect to every other sentence found in the text. The new model omits segments of text that have significant verbatim, statistical, and semantic resemblance to previously processed texts. The identification of text resemblance is based on a new multi-layer process that estimates text similarity at three statistical layers: it employs the Jaccard similarity coefficient and the Vector Space Model (VSM) in the first and second layers respectively, and uses Latent Semantic Analysis in the third layer. Due to its high time complexity, the Multi-Layer model restricts the use of the LSA layer to those text segments whose similarity the Jaccard and VSM layers failed to estimate. The ROUGE tool was used in the evaluation, and because ROUGE does not consider the extract's size, it was supplemented with a new evaluation strategy based on the ratio of sentence intersections between the automatic and reference extracts and on the condensation rate. The MLS model was compared with classical LSA, which uses the traditional definition of the singular value decomposition, and with traditional Jaccard and VSM text extraction. The comparison showed that the number of runs of the LSA procedure in MLS-based extraction was reduced by 52%, and the original matrix dimensions shrank by 65%. The new method also achieved remarkable accuracy. We found that combining the centrality feature with the proposed multi-layer framework yields a significant solution in terms of efficiency and precision in the field of automatic text extraction.
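The multi-layer cascade described above, in which cheap lexical layers are consulted before the expensive LSA layer, can be sketched roughly as follows. All function names and thresholds are illustrative, and the LSA layer is abstracted as a caller-supplied similarity function, since the abstract does not give the thesis's actual parameters.

```python
# Sketch of a multi-layer similarity cascade: Jaccard, then VSM cosine,
# then (only if neither cheap layer confirms similarity) an LSA layer.
import math
from collections import Counter

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine_vsm(a, b):
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def multilayer_similar(a, b, lsa_sim, jac_hi=0.8, vsm_hi=0.8, lsa_thresh=0.7):
    """True if segments a and b are judged near-duplicates.

    The expensive lsa_sim callable is only consulted when the two
    cheap lexical layers fail to confirm similarity, mirroring the
    cascade idea (thresholds are illustrative, not from the thesis).
    """
    if jaccard(a, b) >= jac_hi:
        return True
    if cosine_vsm(a, b) >= vsm_hi:
        return True
    return lsa_sim(a, b) >= lsa_thresh
```

The point of the cascade is that verbatim or near-verbatim overlap is decided by the cheap layers, so the LSA layer only sees the residual pairs whose similarity is genuinely semantic.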
The automatic synonym extractor built in this research is based on statistical approaches. The traditional statistical approach to synonym extraction is time-consuming, especially in real applications such as query expansion and text mining, so it was necessary to develop a new model that improves the efficiency and accuracy of extraction. The research presents the NBDV model of synonym extraction, which replaces the traditional tf.idf weighting scheme with a new scheme called the Orbit Weighting Scheme (OWS). The OWS weights verbs based on their singularity to a group of nouns. The method was applied to the Arabic language because it has more variety in constructing verbal sentences than other languages. The results of the new method were compared with traditional models of automatic synonym extraction, such as Skip-Gram and Continuous Bag of Words. The NBDV method obtained significant accuracy results (47% recall and 51% precision in the dictionary-based evaluation, and 57.5% precision using human experts' assessment). On average, the synonym extraction for a single noun required processing 186 verbs, and in 63% of the runs the number of singular verbs was less than 200. We conclude that the new method is efficient and processes a single run in linear time (O(n)).
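A rough sketch of the noun/distinctive-verb idea: verbs that attach to only a few nouns are treated as more distinctive of them, and nouns with similar weighted verb profiles are synonym candidates. The weighting below is an illustrative idf-style stand-in, not the thesis's actual Orbit Weighting Scheme, whose formula is not given in the abstract.

```python
# Hypothetical noun-verb profile weighting and synonym scoring.
import math
from collections import defaultdict

def verb_weights(noun_verb_counts):
    """noun_verb_counts: {noun: {verb: count}} -> {noun: {verb: weight}}.

    A verb shared by few nouns ("singular" to them) gets a higher
    weight; the idf-style formula here is an illustrative stand-in.
    """
    verb_df = defaultdict(int)  # number of noun profiles each verb appears in
    for verbs in noun_verb_counts.values():
        for v in verbs:
            verb_df[v] += 1
    n_nouns = len(noun_verb_counts)
    return {noun: {v: c * math.log((n_nouns + 1) / verb_df[v])
                   for v, c in verbs.items()}
            for noun, verbs in noun_verb_counts.items()}

def synonym_score(w1, w2):
    """Cosine similarity between two nouns' weighted verb profiles."""
    dot = sum(w1[v] * w2[v] for v in set(w1) & set(w2))
    n1 = math.sqrt(sum(x * x for x in w1.values()))
    n2 = math.sqrt(sum(x * x for x in w2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Under this sketch, two nouns described by the same distinctive verbs score high, while nouns with disjoint verb profiles score zero.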
After implementing the text extractors and the synonym extractor, the VSM model was used to build the IR system. The inverted index was constructed from two sources of data: the original documents, taken from various Arabic-language datasets (and one English-language dataset for comparison purposes), and the automatic summaries of the same documents generated by the automatic extractors developed in this research.
A series of experiments was held to test the effect of the extraction methods developed in this research on the relevancy of the IR system. The experiments examined three groups of queries: 60 Arabic queries with manual relevancy assessment, 100 Arabic queries with automatic relevancy assessment, and 60 English queries with automatic relevancy assessment. The experiments were also performed with and without synonym expansion, using the synonyms generated by the synonym extractor developed in this research.
The positive influence of MLS text extraction was clear in the efficiency of the IR system, without noticeable loss in relevancy results. The intrinsic evaluation in our research showed that the bag-of-words models failed to reduce the text size, which appears clearly in their large condensation rate (68%). Compared with previous publications that addressed the use of summaries as an index source, the relevancy assessment of our work was higher than their relevancy results; moreover, our relevancy results were obtained at a 42% condensation rate, whereas the relevancy results in the previous publications were achieved at high condensation rates. The MLS-based retrieval also constructed an inverted index that was 58% smaller than the main corpus inverted index.
The NBDV synonym expansion had a slightly positive impact on IR relevancy (only a 1% improvement in both recall and precision), but no negative impact was recorded in any relevancy measure.
Information search and similarity based on Web 2.0 and semantic technologies
The World Wide Web puts a huge amount of information, described in natural language, at society's disposal. Web search engines were born from the need to find a particular piece of that information, and their ease of use and utility have turned them into one of the most heavily used web tools on a daily basis. To make a query, users simply enter a set of natural-language words (keywords) and the engine answers with an ordered list of resources that contain those words. The order is produced by ranking algorithms, which basically use two types of features: dynamic and static factors. A dynamic factor takes the query into account; that is, documents that contain the keywords used to describe the query are more relevant to that query. The hyperlink structure among documents is an example of a static factor in most current algorithms: if many documents link to a particular document, that document may be more relevant than others because it is more popular.
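The two factor types can be illustrated with a toy scoring function. All names, the mixing weight, and the normalisations below are hypothetical; real engines combine many more signals.

```python
# Toy ranking that mixes a dynamic (query-dependent) score with a
# static (query-independent, link-popularity) score.

def dynamic_score(query_terms, doc_terms):
    """Fraction of query terms that appear in the document."""
    return sum(t in doc_terms for t in query_terms) / len(query_terms)

def static_score(doc_id, inlinks):
    """Popularity proxy: normalised in-link count."""
    total = sum(len(v) for v in inlinks.values()) or 1
    return len(inlinks.get(doc_id, ())) / total

def rank(query_terms, docs, inlinks, alpha=0.7):
    """docs: {doc_id: text}; inlinks: {doc_id: [linking doc ids]}."""
    scored = [(alpha * dynamic_score(query_terms, set(text.split()))
               + (1 - alpha) * static_score(d, inlinks), d)
              for d, text in docs.items()]
    return [d for _, d in sorted(scored, reverse=True)]
```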
Even though there is currently wide consensus on the good results that most web search engines provide, these tools still suffer from some limitations, basically 1) the loneliness of the searching activity itself; and 2) the simple retrieval process, based mainly on returning the documents that contain the exact terms used to describe the query.
Regarding the first problem, searching for relevant information on the World Wide Web is undoubtedly a lonely and time-consuming process. Thousands of users out there repeat previously executed queries, spending time deciding which documents are relevant or not; decisions that may already have been taken and that could do the job for similar or identical queries from other users.
Regarding the second problem, the textual nature of the current Web keeps the reasoning capability of web search engines quite restricted; queries and web resources are described in natural language that, in some cases, can lead to ambiguity or other semantic difficulties. Computers do not understand text; however, if semantics is incorporated into the text, meaning and sense are incorporated too. In this way, queries and web resources are no longer mere sets of terms, but lists of well-defined concepts.
This thesis proposes a semantic layer, known as Itaca, which combines simplicity and effectiveness in order to endow with semantics both the resources stored in the World Wide Web and the queries users pose to find those resources. This is achieved through collaborative annotations and relevance feedback provided by the users themselves, which describe both the queries and the web resources by means of Wikipedia concepts. Itaca extends the functional capabilities of current web search engines, providing a new ranking algorithm without dispensing with traditional ranking models. Experiments show that this new architecture yields more precise final results while keeping the simplicity and usability of existing web search engines. Its particular design as a layer
makes it feasible to incorporate it into current engines in a simple way.
Programa Oficial de Posgrado en Ingeniería Telemática. Committee: President: Asunción Gómez Pérez; Secretary: Mario Muñoz Organero; Examiner: Anselmo Peñas Padill
Term selection in information retrieval
Systems trained on linguistically annotated data achieve strong performance for many
language processing tasks. This encourages the idea that annotations can improve any
language processing task if applied in the right way. However, despite widespread
acceptance and availability of highly accurate parsing software, it is not clear that ad
hoc information retrieval (IR) techniques using annotated documents and requests consistently
improve search performance compared to techniques that use no linguistic
knowledge. In many cases, retrieval gains made using language processing components,
such as part-of-speech tagging and head-dependent relations, are offset by significant
negative effects. This results in a minimal positive, or even negative, overall
impact for linguistically motivated approaches compared to approaches that do not use
any syntactic or domain knowledge.
In some cases, it may be that syntax does not reveal anything of practical importance
about document relevance. Yet without a convincing explanation for why linguistic
annotations fail in IR, the intuitive appeal of search systems that ‘understand’ text
can result in the repeated application, and mis-application, of language processing to
enhance search performance. This dissertation investigates whether linguistics can improve
the selection of query terms by better modelling the alignment process between
natural language requests and search queries. It is the most comprehensive work on
the utility of linguistic methods in IR to date.
Term selection in this work focuses on identification of informative query terms of
1-3 words that both represent the semantics of a request and discriminate between relevant
and non-relevant documents. Approaches to word association are discussed with
respect to linguistic principles, and evaluated with respect to semantic characterization
and discriminative ability. Analysis is organised around three theories of language that
emphasize different structures for the identification of terms: phrase structure theory,
dependency theory and lexicalism. The structures identified by these theories play
distinctive roles in the organisation of language. Evidence is presented regarding the
value of different methods of word association based on these structures, and the effect
of method and term combinations.
Two highly effective, novel methods for the selection of terms from verbose queries
are also proposed and evaluated. The first method focuses on the semantic phenomenon
of ellipsis with a discriminative filter that leverages diverse text features. The second
method exploits a term ranking algorithm, PhRank, that uses no linguistic information
and relies on a network model of query context. The latter focuses queries so that 1-5
terms in an unweighted model achieve better retrieval effectiveness than weighted IR
models that use up to 30 terms. In addition, unlike models that use a weighted distribution
of terms or subqueries, the concise terms identified by PhRank are interpretable by
users. Evaluation with newswire and web collections demonstrates that PhRank-based query reformulation significantly improves the performance of verbose queries by up to 14% compared to highly competitive IR models, and is at least as good for short, keyword queries with the same models.
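As a hedged illustration of ranking terms with a network model of query context, the sketch below runs a PageRank-style walk over a term co-occurrence graph and keeps the top-ranked terms. It is a generic stand-in for the idea, not the published PhRank algorithm, whose features and weighting are not detailed here.

```python
# Generic PageRank-style term ranking over a co-occurrence graph.
from collections import defaultdict
from itertools import combinations

def term_graph(sentences):
    """Undirected graph: terms sharing a sentence are linked."""
    g = defaultdict(set)
    for s in sentences:
        for a, b in combinations(set(s.split()), 2):
            g[a].add(b)
            g[b].add(a)
    return g

def pagerank(g, damping=0.85, iters=50):
    nodes = list(g)
    r = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            share = r[n] / len(g[n]) if g[n] else 0
            for m in g[n]:
                nxt[m] += damping * share
        r = nxt
    return r

def top_terms(sentences, k=3):
    """Return the k terms most central to the query context."""
    r = pagerank(term_graph(sentences))
    return sorted(r, key=r.get, reverse=True)[:k]
```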
Results illustrate that linguistic processing may help with the selection of word associations
but does not necessarily translate into improved IR performance. Statistical
methods are necessary to overcome the limits of syntactic parsing and word adjacency
measures for ad hoc IR. As a result, probabilistic frameworks that discover, and make
use of, many forms of linguistic evidence may deliver small improvements in IR effectiveness,
but methods that use simple features can be substantially more efficient
and equally, or more, effective. Various explanations for this finding are suggested,
including the probabilistic nature of grammatical categories, a lack of homomorphism
between syntax and semantics, the impact of lexical relations, variability in collection
data, and systemic effects in language systems.
Ontological Approach for Semantic Modelling of Malay Translated Qur’an
This thesis contributes to the areas of ontology development and analysis, natural language processing (NLP), Information Retrieval (IR), and Language Resource and Corpus Development. Research in Natural Language Processing and semantic search for English has shown successful results for more than a decade. However, it is difficult to adapt those techniques to the Malay language, because its complex morphology and orthographic forms are very different from English. Moreover, limited resources and tools for computational linguistic analysis are available for Malay. In this thesis, we address those issues and challenges by proposing MyQOS, the Malay Qur'an Ontology System, a prototype ontology-based IR system with semantics for representing and accessing a Malay translation of the Qur'an. This supports the development of a semantic search engine and a question answering system, and provides a framework for storing and accessing a Malay language corpus and computational linguistics resources. The primary use of MyQOS in the current research is for creating and improving the quality and accuracy of the query mechanism to retrieve information embedded in the Malay text of the Qur'an translation. To demonstrate the feasibility of this approach, we describe a new architecture of morphological analysis for MyQOS and query algorithms based on MyQOS. Data analysis consisted of two measures, precision and recall, with data obtained from the MyQOS corpus evaluated in three search engines. The precision and recall for semantic search are 0.8409 (84%) and 0.8043 (80%), double the results of the question-answer search, which are 0.4971 (50%) for precision and 0.6027 (60%) for recall. Semantic search gives high precision and high recall compared to the other two methods, indicating that it returns more relevant results than irrelevant ones.
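The precision and recall figures above follow the standard set-based definitions, which can be computed as:

```python
# Set-based precision and recall over retrieved vs. relevant items.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```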
To conclude, this research is among the work on retrieval of Qur'anic texts in the Malay language that outlines state-of-the-art information retrieval system models. The use of MyQOS will thus help Malay readers to understand the Qur'an in better ways. Furthermore, the creation of a Malay language corpus and computational linguistics resources will benefit other researchers, especially in religious texts, morphological analysis, and semantic modelling.
Approaches to Using Word Collocation in Information Retrieval
The thesis explores long-span collocation and its application in information retrieval. The basic research question of the thesis is whether the use of long-span collocates can improve the performance of a probabilistic model of IR. The model used in the project is the Robertson and Sparck Jones probabilistic model.
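For reference, the relevance weight at the heart of the Robertson and Sparck Jones probabilistic model takes the standard smoothed form below, with N the collection size, n the number of documents containing the term, R the number of known relevant documents, and r the number of relevant documents containing the term:

```python
# Robertson & Sparck Jones relevance weight (standard 0.5-smoothed form).
import math

def rsj_weight(N, n, R, r):
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))
```

With no relevance information (R = r = 0) the weight reduces to an idf-like quantity, which is why the model degrades gracefully before feedback is available.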
The basic research question was explored by investigating three different ways of integrating collocation information with the probabilistic model:
1. Global collocation analysis. This method consists in expanding the original query with long-span global collocates of the query terms. Global collocates of a query term are selected from large fixed-size windows around all occurrences of the term in the corpus and ranked by the statistical measures Mutual Information (MI) and Z score. A fixed number of top-ranked collocates is used in query expansion.
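Mutual Information scoring of window collocates, one of the two measures mentioned, might be sketched as follows over a toy token stream. The window size is illustrative, and the Z-score variant is omitted.

```python
# Pointwise mutual information for collocates found in fixed-size
# windows around occurrences of a target term.
import math
from collections import Counter

def window_collocates(tokens, target, window=5):
    """Counts of words co-occurring with target within +/- window tokens."""
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            cooc.update(t for t in tokens[lo:hi] if t != target)
    return cooc

def pmi(tokens, target, word, window=5):
    n = len(tokens)
    freq = Counter(tokens)
    joint = window_collocates(tokens, target, window)[word]
    if not joint:
        return float("-inf")
    return math.log2((joint / n) / ((freq[target] / n) * (freq[word] / n)))
```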
Query expansion with global collocates did not prove superior to the original queries; a possible reason is that query terms often have a fairly broad meaning and, hence, a rather semantically heterogeneous pattern of occurrence.
2. Local collocation analysis. This method is a form of iterative query expansion following relevance or pseudo-relevance (blind) feedback. The original query is expanded with the query terms' collocates, which are extracted from long-span windows around all occurrences of the query terms in the known relevant documents and selected using the statistical measures MI and Z. Parameters whose effects were systematically studied in this set of experiments include window size, the measure of collocation significance used for collocate ranking, the number of query expansion collocates, and the categories of terms in the expanded queries.
Some results showed a tendency towards performance gains over relevance feedback in the probabilistic model; however, the gains were not significant enough to conclude that this method is superior to the existing relevance feedback used in the model.
3. Lexical cohesion analysis using local collocations. This set of experiments aimed to explore whether the level of lexical cohesion between query terms in a document can be linked to the document's relevance and, if so, whether it can be used to predict documents' relevance to the query. Lexical cohesion between different query terms is estimated from the number of collocates they have in common.
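Estimating cohesion from shared collocates might look like the following sketch; the window size and the normalisation by the smaller collocate set are illustrative assumptions, not the thesis's exact formulation.

```python
# Lexical cohesion between two terms as the overlap of their
# fixed-window collocate sets within a document's token stream.

def collocates(tokens, target, window=5):
    found = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            found.update(t for t in tokens[lo:hi] if t != target)
    return found

def cohesion(tokens, term_a, term_b, window=5):
    """Shared-collocate count, normalised by the smaller collocate set."""
    ca = collocates(tokens, term_a, window)
    cb = collocates(tokens, term_b, window)
    if not ca or not cb:
        return 0.0
    return len(ca & cb) / min(len(ca), len(cb))
```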
The experiments showed that there is a statistically significant association between the level of lexical cohesion of the query terms in documents and relevance. Another set of experiments, aimed at using lexical cohesion to improve probabilistic document ranking, showed that result sets re-ranked by their lexical cohesion scores perform similarly to the original ranking.