21 research outputs found

    Cross-language Information Retrieval

    Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions. Comment: 49 pages, 0 figures

    LIA@CLEF 2018: Mining events opinion argumentation from raw unlabeled Twitter data using convolutional neural network

    Social networks on the Internet are becoming increasingly important in our society. In recent years, this type of media, through communication platforms such as Twitter, has raised new research issues because of the massive volume of data exchanged and the ever-increasing number of users. In this context, the CLEF 2018 Mining opinion argumentation task aims to retrieve, for a specific event (festival name or topic), the most diverse argumentative microblogs from a large collection of tweets about festivals in different languages. In this paper, we propose a four-step approach for extracting argumentative microblogs related to a specific query (or event) when no reference data is provided.

    Discovering Mathematical Objects of Interest -- A Study of Mathematical Notations

    Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use cases, for example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems. The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (e.g., linking $P_{n}^{(\alpha, \beta)}(x)$ with 'Jacobi polynomial'); (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we have made available our source code and data. Comment: Proceedings of The Web Conference 2020 (WWW'20), April 20-24, 2020, Taipei, Taiwan

    Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

    Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism. Comment: 10 pages, 4 figures

    Effective Math-Aware Ad-Hoc Retrieval based on Structure Search and Semantic Similarities

    Despite the prevalence of digital scientific and educational content on the Internet, only a few search engines are capable of retrieving it efficiently and effectively. The main challenge in freely searching scientific literature arises from the presence of structured math formulas and their heterogeneous and contextually important surrounding words. This thesis introduces an effective math-aware, ad-hoc retrieval model that incorporates structure search and semantic similarities. Transformer-based neural retrievers have been adopted to capture additional semantics using domain-adapted supervised retrieval. To enable structure search, I suggest an unsupervised retrieval model that can filter potential mathematical formulas based on structure similarity. This similarity is determined by measuring the largest common substructure(s) in a formula tree representation, known as the Operator Tree (OPT). The structure matching is approximated by employing maximum matching of path-based structure features. The proposed structure similarity measurement can be tailored based on the desired effectiveness and efficiency trade-offs. It may consider various node types, such as operators and operands, and accommodate different numbers of common subtrees with varying weights. In addition to structure similarity, this unsupervised model also captures symbol substitutions through a greedy matching algorithm applied to the matched substructure(s). To achieve efficient structure search, I introduce a dynamic pruning algorithm to the problem of structure retrieval. The proposed retrieval algorithm efficiently identifies the maximum common subtree among formula candidates and safely eliminates potential structure matches that exceed a dynamic threshold. To accomplish this, three rank-safe pruning strategies are suggested and compared against exhaustive search baselines. Additionally, more aggressive thresholding policies are proposed to balance effectiveness with further speed improvements. A novel hierarchical inverted index has been implemented. This index is designed to be compatible with traditional information retrieval (IR) infrastructure and optimization techniques. To capture other semantic similarities, I have incorporated neural retrievers into a hybrid setting with structure search. This approach has achieved state-of-the-art effectiveness in recent math information retrieval tasks. In comparison to strict and unsupervised matching, I have found that supervised neural retrievers are able to capture additional semantic similarities in a highly complementary manner. In order to learn effective representations of heterogeneous math content, I have proposed a novel pretraining architecture that can improve the contextual awareness between math and its surrounding text. This pretraining scheme generates effective downstream single-vector representations, eliminating the efficiency bottleneck from using multi-vector dense representations. Finally, the thesis examines future directions, specifically the integration of recent advancements in language modeling, including ongoing developments in large language models for improved math information retrieval. A preliminary evaluation has been conducted to assess the impact of these advancements.
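
    As a rough illustration of path-based structure features on an operator tree, the following Python sketch scores two formulas by the multiset overlap of their root-to-leaf paths. It is a simplified stand-in, not the thesis's maximum-matching or pruning algorithm; the tuple encoding of trees and the names `root_to_leaf_paths` and `path_overlap` are assumptions made for this example.

```python
from collections import Counter

# Hypothetical operator-tree encoding: (label, [children]); leaves have no children.
def root_to_leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf label path of the tree as a tuple."""
    label, children = tree
    path = prefix + (label,)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)

def path_overlap(tree_a, tree_b):
    """Fraction of root-to-leaf paths shared by both trees (multiset overlap)."""
    paths_a = Counter(root_to_leaf_paths(tree_a))
    paths_b = Counter(root_to_leaf_paths(tree_b))
    shared = sum((paths_a & paths_b).values())
    return shared / max(sum(paths_a.values()), sum(paths_b.values()))

# a + b*c compared with a + b*d: two of the three paths coincide.
f1 = ("add", [("a", []), ("times", [("b", []), ("c", [])])])
f2 = ("add", [("a", []), ("times", [("b", []), ("d", [])])])
print(path_overlap(f1, f2))
```

    Exact label matching means that a substituted symbol (here `c` versus `d`) simply counts as a mismatch; the abstract describes handling such substitutions with a separate greedy matching step, which this sketch does not attempt.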

    Making Presentation Math Computable

    This open-access book addresses the issue of translating mathematical expressions from LaTeX to the syntax of Computer Algebra Systems (CAS). Over the past decades, especially in the domain of Sciences, Technology, Engineering, and Mathematics (STEM), LaTeX has become the de facto standard to typeset mathematical formulae in publications. Since scientists are generally required to publish their work, LaTeX has become an integral part of today's publishing workflow. On the other hand, modern research increasingly relies on CAS to simplify, manipulate, compute, and visualize mathematics. However, existing LaTeX import functions in CAS are limited to simple arithmetic expressions and are, therefore, insufficient for most use cases. Consequently, the workflow of experimenting and publishing in the Sciences often includes time-consuming and error-prone manual conversions between presentational LaTeX and computational CAS formats. To address the lack of a reliable and comprehensive translation tool between LaTeX and CAS, this thesis makes the following three contributions. First, it provides an approach to semantically enhance LaTeX expressions with sufficient semantic information for translations into CAS syntaxes. Second, it demonstrates the first context-aware LaTeX-to-CAS translation framework, LaCASt. Third, the thesis provides a novel approach to evaluate the performance of LaTeX-to-CAS translations on large-scale datasets with an automatic verification of equations in digital mathematical libraries.


    Q(sqrt(-3))-Integral Points on a Mordell Curve

    We use an extension of quadratic Chabauty to number fields, recently developed by the author with Balakrishnan, Besser and Müller, combined with a sieving technique, to determine the integral points over $\mathbb{Q}(\sqrt{-3})$ on the Mordell curve $y^2 = x^3 - 4$.
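
    For orientation only, the sketch below brute-forces the ordinary integer points (over Z, not over the integers of Q(√−3)) on the same curve within a bounded range of x. It is a naive search and has nothing to do with the quadratic Chabauty method of the paper; the function name `integer_points` and the search bounds are assumptions chosen for this example.

```python
from math import isqrt

def integer_points(x_min, x_max):
    """Return all integer (x, y) with x_min <= x <= x_max and y^2 = x^3 - 4."""
    points = []
    for x in range(x_min, x_max + 1):
        rhs = x**3 - 4
        if rhs < 0:
            continue  # a negative right-hand side cannot be a square
        y = isqrt(rhs)
        if y * y == rhs:
            points.append((x, y))
            if y != 0:
                points.append((x, -y))
    return points

print(integer_points(-10, 100))
# → [(2, 2), (2, -2), (5, 11), (5, -11)]
```

    A bounded search like this can only exhibit points; it cannot prove that no further points exist, which is what methods such as the one in the paper are for.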

    Activité du trou noir supermassif au centre de la Galaxie

    Sagittarius A* is the supermassive black hole at the Galactic center. Due to its proximity, it is an excellent laboratory to study the accretion processes occurring around black holes and to constrain the duty cycle of these objects. Sgr A* is currently extremely faint and, despite the detection of daily flares, its luminosity remains at least eight orders of magnitude below its Eddington luminosity, making it one of the least luminous known supermassive black holes. The radiative processes responsible for the daily variations of its luminosity have not been clearly identified yet. We present the results of a multi-wavelength campaign observing Sgr A* simultaneously in X-rays and in the near-infrared, using the XMM-Newton observatory and the VLT/NACO instrument. We studied the spectral variability of Sgr A* using the infrared data we obtained through a broadband spectro-imaging technique. Uncertainties linked to the systematic errors are still large, but the first tests applied seem to show that the spectral index of Sgr A* could depend on the black hole luminosity. On longer timescales, we demonstrate that Sgr A* experienced a higher level of activity in the recent past. Indeed, echoes of its past activity can be detected in the molecular material surrounding the black hole. They are traced by a strong signal in the iron fluorescence line at 6.4 keV. We carried out a complete and systematic study of this variable emission detected from the central molecular zone, using the Chandra and XMM-Newton observatories. Our results confirm that Sgr A* experienced intense flares in the past few centuries, with a luminosity at least six orders of magnitude higher than its current one. In particular, we highlight for the first time the existence of two distinct transient events of relatively short duration, which are probably due to catastrophic events. These results are the first step needed to include Sgr A*'s activity into a broader understanding of galactic nuclei.