85 research outputs found
Automatic Classification of Text Databases through Query Probing
Many text databases on the web are "hidden" behind search interfaces, and
their documents are only accessible through querying. Search engines typically
ignore the contents of such search-only databases. Recently, Yahoo-like
directories have started to manually organize these databases into categories
that users can browse to find these valuable resources. We propose a novel
strategy to automate the classification of search-only text databases. Our
technique starts by training a rule-based document classifier, and then uses
the classifier's rules to generate probing queries. The queries are sent to the
text databases, which are then classified based on the number of matches that
they produce for each query. We report some initial exploratory experiments
that show that our approach is promising for automatically characterizing the
contents of text databases accessible on the web. Comment: 7 pages, 1 figure
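The probing strategy described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the rules in `RULES`, the in-memory `count_matches` stand-in for a real search interface, and the `threshold` parameter are all hypothetical.

```python
from collections import Counter

# Hypothetical classifier rules: each probe query (a tuple of terms)
# implies one category. A real system would learn these from training data.
RULES = {
    ("ibm", "computer"): "Computers",
    ("jordan", "bulls"): "Sports",
    ("cancer",): "Health",
}


def count_matches(database, query_terms):
    """Stand-in for a search interface: count documents containing all terms."""
    return sum(
        all(term in doc.lower().split() for term in query_terms)
        for doc in database
    )


def classify_database(database, rules=RULES, threshold=0.4):
    """Return the categories whose share of probe matches reaches the threshold."""
    matches = Counter()
    for query_terms, category in rules.items():
        matches[category] += count_matches(database, query_terms)
    total = sum(matches.values())
    if total == 0:
        return []  # no probe matched, so nothing can be said about the database
    return sorted(c for c, n in matches.items() if n / total >= threshold)


toy_db = [
    "the bulls won again as jordan scored",
    "jordan leads the bulls to the finals",
    "ibm ships a new computer",
]
print(classify_database(toy_db))
```

Only match counts are used, never the documents themselves, which mirrors the constraint that a search-only database exposes nothing but its query interface.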
UKP-SQuARE v3: A Platform for Multi-Agent QA Research
The continuous development of Question Answering (QA) datasets has drawn the
research community's attention toward multi-domain models. A popular approach
is to use multi-dataset models, which are models trained on multiple datasets
to learn their regularities and prevent overfitting to a single dataset.
However, with the proliferation of QA models in online repositories such as
GitHub or Hugging Face, an alternative is becoming viable. Recent works have
demonstrated that combining expert agents can yield large performance gains
over multi-dataset models. To ease research in multi-agent models, we extend
UKP-SQuARE, an online platform for QA research, to support three families of
multi-agent systems: i) agent selection, ii) early-fusion of agents, and iii)
late-fusion of agents. We conduct experiments to evaluate their inference speed
and discuss the performance vs. speed trade-off compared to multi-dataset
models. UKP-SQuARE is open-source and publicly available at
http://square.ukp-lab.de
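As a rough illustration of the late-fusion family mentioned in this abstract, the sketch below merges the final answers of several agents by summing their confidence scores. It is not UKP-SQuARE's API or method: the function name `late_fusion`, the `(answer, confidence)` input shape, and the lowercase normalization are assumptions made for the example.

```python
from collections import defaultdict


def late_fusion(predictions):
    """Combine (answer, confidence) pairs from several agents.

    Confidence is summed per normalized answer and the answer with the
    highest total is returned. Raises ValueError on an empty input.
    """
    scores = defaultdict(float)
    for answer, confidence in predictions:
        scores[answer.strip().lower()] += confidence
    return max(scores.items(), key=lambda item: item[1])[0]


# Hypothetical outputs of three expert agents for the same question.
preds = [("Paris", 0.6), ("paris", 0.3), ("Lyon", 0.8)]
print(late_fusion(preds))
```

Agent selection would instead pick one agent before inference, and early fusion would combine the agents' internal representations, so both require more machinery than this sketch shows.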
Application of Bayesian Networks in Data Analysis Systems
A review of methods for constructing (learning) the structure of Bayesian networks (BN) is presented. It is shown that a wide range of structure learning methods and optimization criteria for BN exists today. The choice of a structure learning method should therefore be based on a thorough analysis of the problem to be solved with the network and on the availability of reliable expert and statistical data. A practical example of a Bayesian network application is given.
Fine-Grained Static Detection of Obfuscation Transforms Using Ensemble-Learning and Semantic Reasoning
The ability to efficiently detect the software protections in use is of prime importance for selecting and applying adequate deobfuscation techniques. We present a novel approach that combines semantic reasoning techniques with ensemble learning classification to provide a static detection framework for obfuscation transformations. In contrast to existing work, we provide a methodology that can detect multiple layers of obfuscation without depending on knowledge of the underlying functionality of the training set used. We also extend our work to detect constructions of obfuscation transformations, thus providing a fine-grained methodology. To that end, we provide several studies on best practices for using machine learning techniques to obtain a scalable and efficient model. According to our experimental results and evaluations on obfuscators such as Tigress and OLLVM, our models achieve up to 91% accuracy on state-of-the-art obfuscation transformations. Our overall accuracies for their constructions are up to 100%.
Composition Classification of Ultra-High Energy Cosmic Rays
The study of cosmic rays remains one of the most challenging research fields in physics.
Among the many questions still open in this area, knowledge of the type of primary for each event
remains one of the most important issues. All cosmic ray observatories have been trying
to solve this question for at least six decades, but have not yet succeeded. The main obstacle is the
impossibility of directly detecting high-energy primary events, which makes it necessary to use Monte Carlo
models and simulations to characterize the generated particle cascades. This work presents the results
attained using a simulated dataset provided by the Monte Carlo code CORSIKA, which is
a simulator of high-energy particle interactions with the atmosphere, resulting in a cascade of
secondary particles extending for a few kilometers (in diameter) at ground level. Using this simulated
data, a set of machine learning classifiers has been designed and trained, and their computational
cost and effectiveness compared, when classifying the type of primary under ideal measuring
conditions. Additionally, a feature selection algorithm has allowed the relevance of the
considered features to be identified. The results confirm the importance of separating the electromagnetic
and muonic components of the measured signal data for this problem. The obtained results are quite encouraging
and open new lines of work for future, more restrictive simulations.
Funding: Spanish Ministry of Science, Innovation and Universities (FPA2017-85197-P, RTI2018-101674-B-I00); European Union (EU); CENAPAD-SP (Centro Nacional de Processamento de Alto Desempenho em Sao Paulo), UNICAMP/FINEP - MCT; Fundacao de Amparo a Pesquisa do Estado de Sao Paulo (FAPESP); National Council for Scientific and Technological Development (CNPq); 2016/19764-9, 404993/2016-
Thematic Indexing of Language Corpora
The goal of the subproject is the thematic indexing of the corpora, so that topic-specific virtual subcorpora can be compiled and, for example, word senses can be disambiguated by analyzing domain-specific frequency distributions. The starting point is the creation of a taxonomy of subject-area topics. This is done in a semi-automatic procedure that involves applying text mining (document clustering) and manually mapping clusters to an external ontology. It is argued that the taxonomy obtained in this way is both more intuitive and more objective than existing, purely manual approaches. It is also equally suitable for manual and for machine classification. For the latter, the naive Bayes text classifier is motivated and evaluated on a classified corpus of just under two billion words.
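A naive Bayes text classifier of the kind named in this abstract can be sketched with the standard library alone. This is a generic multinomial variant with add-one smoothing, shown as an illustration only: the abstract does not specify the variant used, and the function names, whitespace tokenization, and toy training data are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for token in text.lower().split():
            if token in vocab:  # tokens unseen in training are skipped
                score += math.log(
                    (word_counts[label][token] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus with two subject areas.
model = train([
    ("goal match team", "sport"),
    ("team wins match", "sport"),
    ("election vote party", "politics"),
    ("party wins vote", "politics"),
])
print(predict(model, "team match"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, which matters for long documents.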