
    Forecasting Chemical Abundance Precision for Extragalactic Stellar Archaeology

    Increasingly powerful and multiplexed spectroscopic facilities promise detailed chemical abundance patterns for millions of resolved stars in galaxies beyond the Milky Way (MW). Here, we employ the Cramér-Rao Lower Bound (CRLB) to forecast the precision to which stellar abundances for metal-poor, low-mass stars outside the MW can be measured for 41 current (e.g., Keck, MMT, VLT, DESI) and planned (e.g., MSE, JWST, ELTs) spectrograph configurations. We show that moderate resolution ($R\lesssim5000$) spectroscopy at blue-optical wavelengths ($\lambda\lesssim4500$ Å) (i) enables the recovery of 2-4 times as many elements as red-optical spectroscopy ($5000\lesssim\lambda\lesssim10000$ Å) at similar or higher resolutions ($R\sim10000$) and (ii) can constrain the abundances of several neutron capture elements to $\lesssim0.3$ dex. We further show that high-resolution ($R\gtrsim20000$), low S/N ($\sim10$ pixel$^{-1}$) spectra contain rich abundance information when modeled with full spectral fitting techniques. We demonstrate that JWST/NIRSpec and ELTs can recover (i) $\sim10$ and 30 elements, respectively, for metal-poor red giants throughout the Local Group and (ii) [Fe/H] and [$\alpha$/Fe] for resolved stars in galaxies out to several Mpc with modest integration times. We show that select literature abundances are within a factor of $\sim2$ (or better) of our CRLBs. We suggest that, like ETCs, CRLBs should be used when planning stellar spectroscopic observations. We include an open source Python package, Chem-I-Calc, that allows users to compute CRLBs for spectrographs of their choosing. Comment: 60 pages, 24 figures, accepted for publication in ApJ.
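
    To make the CRLB idea concrete, the sketch below computes the bound for a toy spectrum. It assumes independent Gaussian noise per pixel, so the Fisher matrix is $F_{ab} = \sum_i (\partial f_i/\partial\theta_a)(\partial f_i/\partial\theta_b)/\sigma_i^2$ and the bound on label $a$ is $\sqrt{(F^{-1})_{aa}}$. The function names and the gradient values are hypothetical illustrations and are not taken from Chem-I-Calc.

```python
import math


def fisher_matrix(gradients, sigma):
    """Fisher matrix for independent Gaussian pixel noise.

    gradients[a][i] is the derivative of pixel i with respect to label a;
    sigma[i] is the 1-sigma noise of pixel i.
    """
    n = len(gradients)
    return [
        [
            sum(ga * gb / s ** 2
                for ga, gb, s in zip(gradients[a], gradients[b], sigma))
            for b in range(n)
        ]
        for a in range(n)
    ]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular Fisher matrix: labels are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def crlb(gradients, sigma):
    """Lower bound on the 1-sigma uncertainty of each label."""
    cov = invert(fisher_matrix(gradients, sigma))
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]


# Toy example (made-up numbers): two labels, three pixels.
toy_gradients = [[1.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0]]
toy_sigma = [0.1, 0.1, 0.1]
toy_bounds = crlb(toy_gradients, toy_sigma)
```

    In this toy case the two labels affect different pixels, so the bounds are simply the pixel noise divided by the gradient; when labels affect overlapping pixels, the off-diagonal Fisher terms inflate the bounds.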

    An investigation into weighted data fusion for content-based multimedia information retrieval

    Content Based Multimedia Information Retrieval (CBMIR) is characterised by the combination of noisy sources of information which, in unison, are able to achieve strong performance. In this thesis we focus on the combination of ranked results from the independent retrieval experts which comprise a CBMIR system through linearly weighted data fusion. The independent retrieval experts are low-level multimedia features, each of which contains an indexing function and ranking algorithm. This thesis comprises two halves. In the first half, we perform a rigorous empirical investigation into the factors which impact performance in linearly weighted data fusion. In the second half, we leverage these findings to create a new class of weight generation algorithms for data fusion which are capable of determining weights at query time, such that the weights are topic dependent.
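
    As an illustration of linearly weighted data fusion, the sketch below combines the scores of several retrieval experts into one ranking. The min-max normalisation step, the expert names, and the fixed weights are assumptions made for the example; the thesis generates its weights at query time with algorithms that are not reproduced here.

```python
def min_max_normalise(scores):
    """Rescale one expert's scores to [0, 1] so experts are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(expert_scores, weights):
    """Rank documents by the weighted sum of normalised expert scores.

    expert_scores maps expert name -> {doc_id: raw score};
    weights maps expert name -> linear weight.
    Documents missing from an expert's results contribute 0 for that expert.
    """
    fused = {}
    for name, scores in expert_scores.items():
        for doc, s in min_max_normalise(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * s
    # Highest fused score first; ties broken by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical experts and scores, for illustration only.
example_scores = {
    "colour": {"a": 0.9, "b": 0.1, "c": 0.5},
    "texture": {"a": 2.0, "b": 8.0},
}
example_weights = {"colour": 0.7, "texture": 0.3}
example_ranking = weighted_fusion(example_scores, example_weights)
```

    With these weights the colour expert dominates, so document "a" ranks first even though the texture expert prefers "b"; changing the weights per topic is what makes the fusion topic dependent.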

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective we take inventory of the impact and legal consequences of these technical advances and point out future directions of research.

    Yleiskäyttöinen tekstinluokittelija suomenkielisille potilaskertomusteksteille (A general-purpose text classifier for Finnish-language patient record texts)

    Medical texts are an underused source of data in clinical analytics. Extracting the relevant information from unstructured texts is difficult, and while some tools are available, they are often targeted at English texts. The situation is worse for smaller languages, such as Finnish. In this work, we reviewed the literature in the text mining and natural language processing fields, focusing on methods suited to analyzing medical texts. Using the results of our literature review, we created an algorithm for information extraction from patient record texts. During this thesis work we created a decent text mining tool for Finnish-language medical texts that works through text classification. The algorithm can be used to detect medical conditions solely from medical texts. Its use is limited by the need for a fairly large amount of training data.

    Automatic Identification of Online Predators in Chat Logs by Anomaly Detection and Deep Learning

    Providing a safe environment for juveniles and children in online social networks is considered a major factor in improving public safety. Due to the prevalence of online conversations, mitigating the undesirable effects of juvenile abuse in cyberspace has become a necessity. Using automatic ways to address this kind of crime is challenging and demands efficient and scalable data mining techniques. The problem can be cast as a combination of textual preprocessing in data/text mining and binary classification in machine learning. This thesis proposes two machine learning approaches to deal with the following two issues in the domain of online predator identification: 1) The first issue is that gathering a comprehensive set of negative training samples is unrealistic given the nature of the problem. This issue is addressed by applying an existing method for semi-supervised anomaly detection that allows training based on only one class label. The method was tested on two datasets; 2) The second issue is improving the performance of current binary classification methods in terms of classification accuracy and F1-score. In this regard, we have customized a deep learning approach, the Convolutional Neural Network, for use in this domain. Using this approach, we show that the classification performance (F1-score) is improved by almost 1.7% compared to the Support Vector Machine classification method. Two different datasets were used in the empirical experiments: PAN-2012 and SQ (Sûreté du Québec). The former is a large public dataset that has been used extensively in the literature and the latter is a small dataset collected from the Sûreté du Québec.

    On the Use of PU Learning for Quality Flaw Prediction in Wikipedia

    In this article we describe a new approach to assess Quality Flaw Prediction in Wikipedia. The partially supervised method studied, called PU Learning, has been successfully applied in classification tasks with traditional corpora like Reuters-21578 or 20-Newsgroups. To the best of our knowledge, this is the first time that it is applied in this domain. Throughout this paper, we describe how the original PU Learning approach was evaluated for assessing quality flaws and the modifications introduced to get a quality flaws predictor which obtained the best F1 scores in the task "Quality Flaw Prediction in Wikipedia" of the PAN challenge.
    Edgardo Ferretti and Marcelo Errecalde thank Universidad Nacional de San Luis (PROICO 30310). The collaboration of UNSL, INAOE and UPV has been funded by the European Commission as part of the WIQ-EI project (project no. 269180) within the FP7 People Programme. Manuel Montes is partially supported by CONACYT, No. 134186. The work of Paolo Rosso was carried out also in the framework of the MICINN Text-Enterprise (TIN2009-13391-C04-03) research project and the Microcluster VLC/Campus (International Campus of Excellence) on Multimodal Intelligent Systems.
    Ferretti, E.; Hernández Fusilier, D.; Guzmán Cabrera, R.; Montes Y Gómez, M.; Errecalde, M.; Rosso, P. (2012). On the Use of PU Learning for Quality Flaw Prediction in Wikipedia. CEUR Workshop Proceedings. 1178. http://hdl.handle.net/10251/46566

    Semantic Representation and Inference for NLP

    Semantic representation and inference are essential for Natural Language Processing (NLP). The state of the art for semantic representation and inference is deep learning, and particularly Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and transformer Self-Attention models. This thesis investigates the use of deep learning for novel semantic representation and inference, and makes contributions in the following three areas: creating training data, improving semantic representations and extending inference learning. In terms of creating training data, we contribute the largest publicly available dataset of real-life factual claims for the purpose of automatic claim verification (MultiFC), and we present a novel inference model composed of multi-scale CNNs with different kernel sizes that learn from external sources to infer fact checking labels. In terms of improving semantic representations, we contribute a novel model that captures non-compositional semantic indicators. By definition, the meaning of a non-compositional phrase cannot be inferred from the individual meanings of its composing words (e.g., hot dog). Motivated by this, we operationalize the compositionality of a phrase contextually by enriching the phrase representation with external word embeddings and knowledge graphs. Finally, in terms of inference learning, we propose a series of novel deep learning architectures that improve inference by using syntactic dependencies, by ensembling role-guided attention heads, incorporating gating layers, and concatenating multiple heads in novel and effective ways. This thesis consists of seven publications (five published and two under review). Comment: PhD thesis, the University of Copenhagen.

    A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning

    Thesis by compendium. Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario. In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics using a wide-coverage multilingual semantic network as knowledge base. This provides coverage of hundreds of languages and millions of general and specific human concepts. As the starting point of our research we employ knowledge graph-based features, along with other traditional ones and meta-learning, for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way. The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verb tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification. The contributions of this thesis demonstrate the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals.
    Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285

    Increasing the Efficiency of High-Recall Information Retrieval

    The goal of high-recall information retrieval (HRIR) is to find all, or nearly all, relevant documents while maintaining reasonable assessment effort. Achieving high recall is a key problem in the use of applications such as electronic discovery, systematic review, and construction of test collections for information retrieval tasks. State-of-the-art HRIR systems commonly rely on iterative relevance feedback in which human assessors continually assess machine learning-selected documents. The relevance of the assessed documents is then fed back to the machine learning model to improve its ability to select the next set of potentially relevant documents for assessment. In many instances, thousands of human assessments might be required to achieve high recall. These assessments represent the main cost of such HRIR applications. Therefore, their effectiveness in achieving high recall is limited by their reliance on human input when assessing the relevance of documents. In this thesis, we test different methods in order to improve the effectiveness and efficiency of finding relevant documents using a state-of-the-art HRIR system. With regard to effectiveness, we try to build a machine-learned model that retrieves relevant documents more accurately. For efficiency, we try to help human assessors make relevance assessments more easily and quickly via our HRIR system. Furthermore, we try to establish a stopping criterion for the assessment process so as to avoid excessive assessment. In particular, we hypothesize that the total assessment effort to achieve high recall can be reduced by using shorter document excerpts (e.g., extractive summaries) in place of full documents for the assessment of relevance and using a high-recall retrieval system based on continuous active learning (CAL). In order to test this hypothesis, we implemented a high-recall retrieval system based on a state-of-the-art implementation of CAL. This high-recall retrieval system could display either full documents or short document excerpts for relevance assessment. A search engine was also integrated into our system to provide assessors the option of conducting interactive search and judging. We conducted a simulation study, and separately, a 50-person controlled user study to test our hypothesis. The results of the simulation study show that judging even a single extracted sentence for relevance feedback may be adequate for CAL to achieve high recall. The results of the controlled user study confirmed that human assessors were able to find a significantly larger number of relevant documents within a limited time when they used the system with paragraph-length document excerpts as opposed to full documents. In addition, we found that allowing participants to compose and execute their own search queries did not improve their ability to find relevant documents and, by some measures, impaired performance. Moreover, integrating sampling methods with active learning can yield accurate estimates of the number of relevant documents, and thus avoid excessive assessments.
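
    The iterative relevance feedback loop described in this abstract can be sketched as follows. This is only a minimal illustration of the CAL control flow: the additive term-weight scorer is a deliberately simple stand-in for the machine learning model, and the function names, toy documents, and assessment budget are hypothetical rather than taken from the thesis.

```python
from collections import Counter


def score(doc_terms, weights):
    """Relevance score of a document: sum of its learned term weights."""
    return sum(weights.get(term, 0.0) for term in doc_terms)


def continuous_active_learning(docs, seed_terms, assess, batch_size=1, budget=None):
    """Repeatedly show the top-scoring unjudged documents to an assessor.

    docs maps doc_id -> set of terms; assess(doc_id) returns True if the
    assessor judges the document relevant. Each judgement is fed back into
    the term weights before the next batch is selected.
    Returns {doc_id: judgement} in the order documents were assessed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    weights = Counter({term: 1.0 for term in seed_terms})
    judged = {}
    limit = len(docs) if budget is None else min(budget, len(docs))
    while len(judged) < limit:
        unjudged = sorted(
            (d for d in docs if d not in judged),
            key=lambda d: (-score(docs[d], weights), d),
        )
        take = min(batch_size, limit - len(judged))
        for doc_id in unjudged[:take]:
            relevant = assess(doc_id)
            judged[doc_id] = relevant
            # Feedback step: reward terms of relevant documents,
            # penalise terms of non-relevant ones.
            delta = 1.0 if relevant else -1.0
            for term in docs[doc_id]:
                weights[term] += delta
    return judged


# Toy collection with a simulated assessor (illustrative data only).
toy_docs = {
    "d1": {"merger", "contract"},
    "d2": {"lunch", "menu"},
    "d3": {"contract", "breach"},
    "d4": {"menu", "breach"},
}
toy_relevant = {"d1", "d3"}
toy_judged = continuous_active_learning(
    toy_docs, seed_terms={"merger"}, assess=lambda d: d in toy_relevant, budget=2
)
```

    Here the fixed budget plays the role of a crude stopping rule; the abstract's point about sampling-based estimates is that a better rule can decide when enough relevant documents have been found.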