85 research outputs found

    Automatic Classification of Text Databases through Query Probing

    Many text databases on the web are "hidden" behind search interfaces, and their documents are only accessible through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier's rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report some initial exploratory experiments that show that our approach is a promising way to automatically characterize the contents of text databases accessible on the web. Comment: 7 pages, 1 figure
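
    The probing idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the rules, the `ToyDatabase` stand-in for a search interface, and the `min_share` threshold are all hypothetical. Each rule is treated as a conjunctive query, and a database is assigned every category whose probes account for a large enough share of all matches.

```python
from collections import Counter

# Hypothetical rules: a conjunction of query terms implies a category.
RULES = [
    ({"ibm", "computer"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"diabetes"}, "Health"),
    ({"cancer", "lung"}, "Health"),
]


class ToyDatabase:
    """Stand-in for a search-only database: it exposes only match counts."""

    def __init__(self, documents):
        self._docs = [set(doc.lower().split()) for doc in documents]

    def count_matches(self, terms):
        # A document matches when it contains every term of the probe.
        return sum(1 for doc in self._docs if terms <= doc)


def classify_by_probing(db, rules, min_share=0.4):
    """Return the categories whose probes produce at least `min_share`
    of all matches (the threshold is an illustrative assumption)."""
    counts = Counter()
    for terms, category in rules:
        counts[category] += db.count_matches(terms)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(c for c, n in counts.items() if n / total >= min_share)


db = ToyDatabase([
    "lung cancer screening study",
    "diabetes care guide",
    "ibm computer sales",
    "cancer of the lung",
])
print(classify_by_probing(db, RULES))
```

    The classifier never reads a document directly; it only sees how many documents each probe matches, which is the constraint a search-only interface imposes.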

    UKP-SQuARE v3: A Platform for Multi-Agent QA Research

    The continuous development of Question Answering (QA) datasets has drawn the research community's attention toward multi-domain models. A popular approach is to use multi-dataset models, which are models trained on multiple datasets to learn their regularities and prevent overfitting to a single dataset. However, with the proliferation of QA models in online repositories such as GitHub or Hugging Face, an alternative is becoming viable. Recent works have demonstrated that combining expert agents can yield large performance gains over multi-dataset models. To ease research in multi-agent models, we extend UKP-SQuARE, an online platform for QA research, to support three families of multi-agent systems: i) agent selection, ii) early-fusion of agents, and iii) late-fusion of agents. We conduct experiments to evaluate their inference speed and discuss the performance vs. speed trade-off compared to multi-dataset models. UKP-SQuARE is open-source and publicly available at http://square.ukp-lab.de
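
    Two of the three families named in this abstract, agent selection and late fusion, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the platform's implementation: the agents are hypothetical callables that return `(answer, score)` pairs, the router is a hypothetical keyword rule, and summing scores per normalized answer is only one possible fusion rule.

```python
from collections import defaultdict


def agent_selection(question, agents, router):
    """Route the question to one agent and return its name and predictions."""
    name = router(question)
    return name, agents[name](question)


def late_fusion(predictions):
    """Combine the outputs of several agents after each has answered.

    `predictions` holds one list of (answer, score) pairs per agent.
    Scores are summed per normalized answer string and the answer with
    the highest total is returned.
    """
    totals = defaultdict(float)
    for agent_predictions in predictions:
        for answer, score in agent_predictions:
            totals[answer.strip().lower()] += score
    return max(totals.items(), key=lambda item: item[1])[0]


# Hypothetical agents and router, for illustration only.
agents = {
    "bio": lambda q: [("cell", 0.9)],
    "geo": lambda q: [("Paris", 0.8)],
}
router = lambda q: "geo" if "capital" in q.lower() else "bio"

print(agent_selection("What is the capital of France?", agents, router))
print(late_fusion([[("Paris", 0.6), ("Lyon", 0.3)], [("paris", 0.5)], [("Lyon", 0.4)]]))
```

    Agent selection runs a single agent per question, while late fusion runs all of them and merges their outputs, which is one source of the performance versus speed trade-off the abstract mentions.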

    Application of Bayesian Networks in Data Analysis Systems (Застосування байєсівських мереж в системах аналізу даних)

    This paper reviews methods for learning the structure of Bayesian networks (BN). It shows that many structure learning methods and optimization criteria are now available for constructing BNs. The choice of a structure learning method should therefore be based on a thorough analysis of the problem the network is meant to solve and on the availability of reliable expert and statistical data. A practical example of a Bayesian network application is given.

    Fine-Grained Static Detection of Obfuscation Transforms Using Ensemble-Learning and Semantic Reasoning

    The ability to efficiently detect which software protections have been applied is key to selecting and applying adequate deobfuscation techniques. We present a novel approach that combines semantic reasoning techniques with ensemble learning classification to provide a static detection framework for obfuscation transformations. In contrast to existing work, we provide a methodology that can detect multiple layers of obfuscation without depending on knowledge of the underlying functionality of the training set used. We also extend our work to detect constructions of obfuscation transformations, thus providing a fine-grained methodology. To that end, we provide several studies of best practices for using machine learning techniques to obtain a scalable and efficient model. According to our experimental results and evaluations on obfuscators such as Tigress and OLLVM, our models achieve up to 91% accuracy on state-of-the-art obfuscation transformations. Our overall accuracies for their constructions are up to 100%.

    Composition Classification of Ultra-High Energy Cosmic Rays

    The study of cosmic rays remains one of the most challenging research fields in physics. Of the many questions still open in this area, identifying the type of primary for each event remains one of the most important. Cosmic ray observatories have been trying to answer this question for at least six decades, but have not yet succeeded. The main obstacle is that high-energy primary events cannot be detected directly, so Monte Carlo models and simulations are needed to characterize the generated particle cascades. This work presents results obtained with a simulated dataset produced by the Monte Carlo code CORSIKA, a simulator of high-energy particle interactions with the atmosphere, which result in a cascade of secondary particles extending for a few kilometers (in diameter) at ground level. Using this simulated data, a set of machine learning classifiers has been designed and trained, and their computational cost and effectiveness compared when classifying the type of primary under ideal measuring conditions. Additionally, a feature selection algorithm made it possible to identify the relevance of the features considered. The results confirm the importance, for this problem, of separating the electromagnetic and muonic components in the measured signal data. The results obtained are encouraging and open new lines of work for future, more restrictive simulations. Funding: Spanish Ministry of Science, Innovation and Universities FPA2017-85197-P RTI2018-101674-B-I00; European Union (EU); CENAPAD-SP (Centro Nacional de Processamento de Alto Desempenho em Sao Paulo) UNICAMP/FINEP - MCT; Fundacao de Amparo a Pesquisa do Estado de Sao Paulo (FAPESP); National Council for Scientific and Technological Development (CNPq) 2016/19764-9404993/2016-

    Thematic Indexing of Language Corpora (Die thematische Erschließung von Sprachkorpora)

    The goal of this subproject is to index the corpora by topic, so that topic-specific virtual subcorpora can be compiled and so that, for example, word senses can be disambiguated by analyzing domain-specific frequency distributions. The starting point is the construction of a taxonomy of subject-area topics. This is done in a semi-automatic procedure that combines text mining (document clustering) with the manual mapping of clusters to an external ontology. It is argued that the resulting taxonomy is both more intuitive and more objective than existing, purely manual approaches. It is also equally suitable for manual and automatic classification. For the latter, the naive Bayes text classifier is motivated and evaluated on a classified corpus of almost two billion words.
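
    As a rough illustration of the kind of classifier this abstract refers to, the sketch below implements a multinomial naive Bayes text classifier with add-one smoothing. The abstract does not say which variant or smoothing was used, so both are assumptions, and the topics and training texts are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect per-topic document and word counts from (topic, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for topic, text in labeled_docs:
        tokens = text.lower().split()
        doc_counts[topic] += 1
        word_counts[topic].update(tokens)
        vocab.update(tokens)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def classify(model, text):
    """Return the topic with the highest log posterior for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_topic, best_score = None, -math.inf
    for topic, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][topic]
        denom = sum(counts.values()) + vocab_size  # add-one smoothing
        score = math.log(n_docs / total_docs)      # log prior
        for token in text.lower().split():
            if token in model["vocab"]:            # skip unseen words
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Invented toy training data, for illustration only.
model = train_naive_bayes([
    ("sports", "goal match team striker"),
    ("sports", "team wins match"),
    ("politics", "election vote parliament"),
    ("politics", "parliament passes vote"),
])
print(classify(model, "the team scored a goal"))
```

    Working in log space avoids numerical underflow on long documents, which matters for corpora of the size mentioned in the abstract.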