40 research outputs found

    Using Text Segmentation to Enhance the Cluster Hypothesis

    An alternative approach to Information Retrieval, called Passage Retrieval, considers text fragments independently rather than assessing the global relevance of documents. In this context, a document is not penalized when its relevant information is surrounded by text that deviates from the topic of interest. In this paper, we study the impact of considering these text fragments on a document clustering process. The use of clustering in Information Retrieval is mainly supported by the cluster hypothesis, which states that relevant documents tend to be more similar to one another than to non-relevant documents, so that a clustering process is likely to group them together. Previous experiments have shown that clustering the top-ranked documents retrieved in response to a user’s query allows Information Retrieval systems to improve their effectiveness. In the clustering processes used in these studies, documents were considered globally. However, the fact that a document can address more than one topic or concept may also affect the document clustering process. Considering passages of the retrieved documents separately may make it possible to create clusters that are more representative of the topics addressed. Different approaches have been assessed, and the results show that using text fragments in the clustering process can indeed be beneficial.
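
    The cluster hypothesis stated in this abstract can be illustrated with a small check: compare the average similarity among relevant documents with the average similarity between relevant and non-relevant ones. The sketch below is only an illustration, not the method of the paper; the bag-of-words cosine similarity, the function names and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Term-frequency vector of a text (lower-cased, whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_hypothesis_gap(relevant, non_relevant):
    """Mean relevant/relevant similarity minus mean relevant/non-relevant similarity.

    Needs at least two relevant documents and one non-relevant document.
    A positive gap is consistent with the cluster hypothesis.
    """
    rel = [bag_of_words(d) for d in relevant]
    non = [bag_of_words(d) for d in non_relevant]
    within = [cosine(a, b) for a, b in combinations(rel, 2)]
    across = [cosine(a, b) for a in rel for b in non]
    return sum(within) / len(within) - sum(across) / len(across)


# Hypothetical toy data, made up for this illustration.
relevant_docs = [
    "genetic algorithm for text segmentation",
    "text segmentation with a genetic algorithm",
]
non_relevant_docs = ["pion charge form factor", "quark confinement model"]

gap = cluster_hypothesis_gap(relevant_docs, non_relevant_docs)
print(gap > 0)
```

    The same check could be run on passages instead of whole documents by passing text fragments in place of full texts, which is the kind of comparison the abstract describes.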

    Bir Procrutes Hikâyesi: Türkçe Fransızca Gibi İşlenirmi ? [A Procrustes Story: Can Turkish Be Processed Like French?]


    Managing Genetic Algorithm Parameters to Improve SegGen - A Thematic Segmentation Algorithm

    SegGen [1] is a linear thematic segmentation algorithm based on a variant of the Strength Pareto Evolutionary Algorithm [2]. It aims at optimizing the two criteria of Salton's [3] definition of segments: a segment is a part of text whose internal cohesion and dissimilarity with its adjacent segments are maximal. This paper describes improvements to SegGen obtained by tuning the genetic algorithm parameters according to the evolution of the quality of the generated populations. Two reasons motivate this tuning. First, as measured by global criteria of population quality, the overall quality of the generated populations increases as the process goes on, so it seems reasonable to set parameter values and define new operators that favor intensification and reduce diversification in the search process. Second, since the individuals in the populations are plausible segmentations, it seems reasonable, when computing the sentence similarities involved in the two criteria to be optimized, to weight sentences in the current segmentation according to their distance to the boundaries of the segment they belong to. Although this parameter tuning currently rests on estimates based on experiments, the first results are promising.

    RQM description of the charge form factor of the pion and its asymptotic behavior

    The pion charge and scalar form factors, $F_1(Q^2)$ and $F_0(Q^2)$, are first calculated in different forms of relativistic quantum mechanics. This is done using the solution of a mass operator that contains both confinement and one-gluon-exchange interactions. The results for the charge form factor, calculated with a one-body current, are compared to experiment. As could be expected, the point-form results, and the instant- and front-form results in a parallel momentum configuration, fail to reproduce experiment. The other results, corresponding to a perpendicular momentum configuration (instant form in the Breit frame and front form with $q^+=0$), do much better. The comparison of the charge and scalar form factors shows that the spin-1/2 nature of the constituents plays an important role. Since only the last set of results represents a reasonable basis for improving the description of the charge form factor, this set is then discussed with regard to the asymptotic QCD power-law behavior $Q^{-2}$. The contribution of two-body currents to achieving the right power law is considered, while the scalar form factor, $F_0(Q^2)$, is shown to have the right power-law behavior in any case. The low-$Q^2$ behavior of the charge form factor and the pion-decay constant are also discussed. Comment: 30 pages, 10 figures

    Traveling Among Clusters: A Way to Reconsider the Benefits of the Cluster Hypothesis

    Relying on the Cluster Hypothesis, which states that relevant documents tend to be more similar to one another than to non-relevant documents, most information retrieval systems that organize search results as a set of clusters seek to gather all relevant documents in the same cluster. We propose here to reconsider the benefits of the resulting concentration of relevant information. Contrary to what is commonly accepted, we believe that systems which aim to distribute the relevant documents across different clusters, being more likely to highlight different aspects of the subject, may be at least as useful for the user as systems that gather all relevant documents in a single group. Since existing evaluation measures tend to greatly favor the latter systems, we first investigate ways to assess more fairly the ability to reach the relevant information from the list of cluster descriptions. Finally, we show that systems distributing the relevant information across different clusters may actually provide better information access than classical systems.

    Toward a More Global and Coherent Segmentation of Texts

    The automatic text segmentation task consists of identifying the most important thematic breaks in a document in order to cut it into homogeneous passages. Text segmentation has motivated a large amount of research. We focus here on statistical approaches that rely on an analysis of the distribution of the words in the text. Usually, texts are segmented sequentially on the basis of very local clues. However, such an approach prevents the text from being considered globally, particularly with respect to the degree of granularity adopted for the expression of the different topics it addresses. We thus propose two new segmentation algorithms, ClassStruggle and SegGen, which use criteria that reflect global views of texts. ClassStruggle is based on an initial clustering of the sentences of the text, which allows similarities to be considered within a group rather than individually. It relies on the distribution of the occurrences of the members of each class to segment the texts. SegGen evaluates potential segmentations of the whole text by means of a genetic algorithm. It attempts to find a segmentation that optimizes two criteria: the maximization of the internal cohesion of the segments and the minimization of the similarity between adjacent ones. According to experimental results, both approaches appear to be very competitive with existing methods.
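
    The two criteria named in this abstract, internal cohesion of segments and similarity between adjacent segments, can be sketched for a single candidate segmentation as follows. This is a minimal sketch under stated assumptions and not SegGen's actual formulas: the bag-of-words cosine similarity, the convention that a one-sentence segment has cohesion 1.0, and the names `score_segmentation` and `dominates` are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_segments(items, boundaries):
    """Cut a list at the given indices; boundary i starts a new segment at item i.

    Boundaries are expected to lie strictly between 0 and len(items).
    """
    cuts = [0] + sorted(boundaries) + [len(items)]
    return [items[start:end] for start, end in zip(cuts, cuts[1:])]


def score_segmentation(sentences, boundaries):
    """Return (internal_cohesion, adjacent_similarity) for one candidate."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    segments = split_segments(vectors, boundaries)

    # Cohesion: mean pairwise sentence similarity inside each segment,
    # averaged over segments (assumed 1.0 for a one-sentence segment).
    cohesions = []
    for seg in segments:
        pairs = list(combinations(seg, 2))
        cohesions.append(sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0)
    cohesion = sum(cohesions) / len(cohesions)

    # Adjacent similarity: cosine between the merged vectors of neighbors.
    merged = [sum(seg, Counter()) for seg in segments]
    adjacent = [cosine(a, b) for a, b in zip(merged, merged[1:])]
    similarity = sum(adjacent) / len(adjacent) if adjacent else 0.0
    return cohesion, similarity


def dominates(a, b):
    """Pareto dominance: higher-or-equal cohesion, lower-or-equal similarity, not identical."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b


# Hypothetical toy text: two sentences about a cat, two about stocks.
sentences = ["the cat sat", "the cat slept", "stocks fell sharply", "stocks fell again"]
good = score_segmentation(sentences, [2])  # break between the two topics
bad = score_segmentation(sentences, [1])   # break inside the first topic
print(dominates(good, bad))
```

    A multi-objective genetic algorithm would keep a population of such boundary lists and prefer candidates that are not dominated by others; the evolutionary loop itself is omitted here.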

    Thematic Segment Retrieval Revisited

    Documents, especially long ones, may contain very diverse passages related to different topics. Passage Retrieval approaches have shown that, in most cases, there is great potential benefit in considering these passages independently when computing the similarity of a document to a user’s query. Experiments have been carried out to identify the kinds of passages that are best suited to such a process. Contrary to what might have been expected, working with thematic segments, which are likely to represent only one topic each, has led to much lower effectiveness than the use of arbitrary sequences of words. In this paper, we show that this paradoxical observation is mainly due to biases induced by the wide variation in length of the thematic passages. We therefore propose to cope with these biases by using a more powerful text length normalization technique. Experiments show that, when length biases are set aside, thematic passages are better suited than arbitrary sequences of words for retrieving relevant information in response to a user’s query.

    Taking Differences between Turkish and English Languages into account in Internal Representations

    It is generally assumed that the representation of the meaning of sentences in a knowledge representation language does not depend on the natural language in which this meaning is initially expressed. We argue here that, although the translation of a sentence from one language to another is always possible, this rests mainly on the fact that the two languages are natural languages. Using online translation systems (e.g. the Google and Yandex translators) makes it clear that structural differences between languages give rise to more or less faithful translations, depending on the proximity of the languages involved, and there is no doubt that the effect of such differences is more critical if one of the languages is a knowledge representation language. Our argument is illustrated through numerous examples of Turkish sentences and their English translations, emphasizing the differences between these languages, which belong to two different natural language families. As knowledge representation languages, we use first-order predicate logic (FOPP) and the conceptual graph (CG) language with its associated logical semantics. We show that important Turkish constructions such as gerunds, action names and differences in focus lead to representations corresponding to the reification of verbal predicates and favor CG as a semantic network representation language, whereas English seems better suited to the traditional predicate-centered representation schema. We conclude that this first study gives rise to ideas that may serve as new inspiration in the area of knowledge representation of linguistic data and its use in natural language translation systems.