
    Adaptive scheduling for adaptive sampling in pos taggers construction

    We introduce adaptive scheduling for adaptive sampling as a novel machine learning approach to the construction of part-of-speech taggers. The goal is to speed up training on large data sets without significant loss of performance with respect to an optimal configuration. In contrast to previous methods, which use a random, fixed, or regularly rising spacing between instances, ours analyzes the shape of the learning curve geometrically, in conjunction with a functional model, to increase or decrease the spacing at any time. The algorithm proves to be formally correct with regard to our working hypotheses: given a case, the following one is the nearest that ensures a net gain of learning ability over the former, and the level of requirement for this condition can be modulated. We also improve the robustness of sampling by paying greater attention to those regions of the training data base subject to a temporary inflation in performance, thus preventing the learning process from stopping prematurely. The proposal has been evaluated on the basis of its reliability in identifying the convergence of models, corroborating our expectations. While a concrete halting condition is used for testing, users can choose any condition to suit their own specific needs.
    Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-1-R
    Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-2-R
    Xunta de Galicia | Ref. ED431C 2018/50
    Xunta de Galicia | Ref. ED431D 2017/1
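    The spacing rule described in this abstract can be sketched as follows. This is a minimal toy version under my own assumptions: the function name, the thresholds, and the simple two-point slope heuristic are illustrative, not the paper's actual functional model of the learning curve.

```python
def next_sample_size(sizes, scores, step, growth=1.5, shrink=0.5, tol=1e-4):
    """Pick the next training-set size from the recent learning-curve slope.

    sizes, scores: training-set sizes and matching accuracies observed so far.
    While the curve still rises faster than `tol` per instance, the spacing
    shrinks (even a small step yields a net gain of learning ability); once
    the curve flattens, the spacing stretches, since only a larger jump can
    still show a net gain.
    """
    if len(sizes) < 2:
        return sizes[-1] + step  # too few points to estimate a slope
    slope = (scores[-1] - scores[-2]) / (sizes[-1] - sizes[-2])
    if slope > tol:
        step = max(1, int(step * shrink))
    else:
        step = int(step * growth)
    return sizes[-1] + step
```

    Raising `tol` makes the condition for a "net gain" more demanding, which loosely corresponds to modulating the level of requirement mentioned above.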

    Surfing the modeling of pos taggers in low-resource scenarios

    The recent trend toward the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened interest in traditional machine learning algorithms, which have proved to remain competitive in certain contexts, particularly in low-resource settings. In parallel, model selection has become an essential task for boosting performance at reasonable cost, even more so in domains where training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as a case study the generation of pos taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations.
    Ministerio de Ciencia e Innovación | Ref. PID2020-113230RB-C21
    Ministerio de Ciencia e Innovación | Ref. PID2020-113230RB-C22
    Xunta de Galicia | Ref. ED431C 2020/1
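    Early estimation of learning curves is commonly done by fitting a parametric family to the first few observations and extrapolating. A minimal sketch, assuming the widely used inverse power-law family acc(n) ≈ a − b·n^(−c) (the paper's formal approximation model may differ; the grid-search fit below is my own simplification):

```python
def fit_learning_curve(sizes, scores):
    """Fit acc(n) ~ a - b * n**(-c): grid search over the exponent c,
    closed-form least squares for a and b at each candidate c.
    Returns a predictor usable to extrapolate to unseen training sizes."""
    best = None
    for k in range(1, 20):
        c = k / 10.0
        xs = [n ** (-c) for n in sizes]
        mx = sum(xs) / len(xs)
        my = sum(scores) / len(scores)
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, scores))
        b = -sxy / sxx if sxx else 0.0   # negated slope: accuracy rises with n
        a = my + b * mx                  # intercept, i.e. the asymptotic score
        err = sum((a - b * x - y) ** 2 for x, y in zip(xs, scores))
        if best is None or err < best[0]:
            best = (err, a, b, c)
    _, a, b, c = best
    return lambda n: a - b * n ** (-c)
```

    Comparing the extrapolated plateaus of several candidate models on a small prefix of the data is what makes this a cheap model-selection mechanism.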

    Creating a Korean Engineering Academic Vocabulary List (KEAVL): Computational Approach

    With a growing number of international students in South Korea, the need to develop materials for studying Korean for academic purposes is becoming increasingly pressing. According to statistics, engineering colleges in Korea attract the largest number of international students (Korean National Institute for International Education, 2018). However, despite the availability of technical vocabulary lists for some engineering sub-fields, a list of vocabulary common to the majority of the engineering sub-fields has not yet been built. Therefore, this study aimed to create a list of Korean academic engineering vocabulary for non-native Korean speakers that may help future or first-year engineering students and engineers working in Korea. In order to compile this list, a corpus of Korean textbooks and research articles from 12 major engineering sub-fields, named the Corpus of Korean Engineering Academic Texts (CKEAT), was compiled. Then, in order to analyze the corpus and compile the preliminary list, I designed a Python-based tool called KWordList. KWordList lemmatizes all words in the corpus while excluding general Korean vocabulary included in the Korean Learner's List (Jo, 2003). Then, for the remaining words, KWordList calculates the range, frequency, and dispersion (in this study, deviation of proportions, or DP (Gries, 2008)) and excludes words that do not pass the study's criteria (range ≥ 6, frequency ≥ 100, DP ≤ 0.5). The final version of the list, called the Korean Engineering Academic Vocabulary List, or KEAVL, includes 830 lemmas (318 of intermediate level and 512 of advanced level). For each word, the collocations that occur more than 30 times in the corpus are provided. A comparison of the coverage of the Korean Academic Vocabulary List (Shin, 2004) and KEAVL on the Corpus of Korean Engineering Academic Texts showed that KEAVL covers more lemmas in the corpus.
Moreover, only 313 lemmas from the Korean Academic Vocabulary List (Shin, 2004) passed the criteria of the study. Therefore, KEAVL may be more efficient for engineering students' vocabulary training than the Korean Academic Vocabulary List and may be used in the development of engineering Korean teaching materials and curricula. Moreover, the KWordList program written for the study can be used by other researchers, teachers, and even students, and is open access (https://github.com/HelgaKr/KWordList).
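    The filtering step can be sketched as follows. The DP formula is the standard deviation of proportions from Gries (2008); the function names and the shape of the inputs are my own illustration, not KWordList's actual API.

```python
def deviation_of_proportions(part_freqs, part_sizes):
    """Gries (2008) DP: 0 means a lemma is spread over the corpus parts
    exactly in proportion to their sizes; values near 1 mean it is
    confined to a single part (here, a single engineering sub-field)."""
    total_f = sum(part_freqs)
    total_s = sum(part_sizes)
    return 0.5 * sum(abs(f / total_f - s / total_s)
                     for f, s in zip(part_freqs, part_sizes))

def passes_criteria(part_freqs, part_sizes,
                    min_range=6, min_freq=100, max_dp=0.5):
    """Apply the study's thresholds: range >= 6 sub-fields,
    total frequency >= 100, DP <= 0.5."""
    rng = sum(1 for f in part_freqs if f > 0)
    freq = sum(part_freqs)
    return (rng >= min_range and freq >= min_freq
            and deviation_of_proportions(part_freqs, part_sizes) <= max_dp)
```

    A lemma concentrated in one sub-field fails on both range and DP, which is precisely why dispersion is used in addition to raw frequency.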

    Statistical language models for alternative sequence selection


    BNAIC 2008: Proceedings of BNAIC 2008, the Twentieth Belgian-Dutch Artificial Intelligence Conference


    A facility to Search for Hidden Particles (SHiP) at the CERN SPS

    A new general-purpose fixed-target facility is proposed at the CERN SPS accelerator, aimed at exploring the domain of hidden particles and making measurements with tau neutrinos. Hidden particles are predicted by a large number of models beyond the Standard Model. The high intensity of the SPS 400 GeV beam allows probing a wide variety of models containing light long-lived exotic particles with masses below O(10) GeV/c², including very weakly interacting low-energy SUSY states. The experimental programme of the proposed facility is capable of being extended in the future, e.g. to include direct searches for Dark Matter and Lepton Flavour Violation.
    Comment: Technical Proposal

    Scalable and Declarative Information Extraction in a Parallel Data Analytics System

    Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analytics platform that is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web Analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) in dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and highlight that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of dataflows. SOFA is able to logically optimize arbitrary dataflows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated: we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
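    The core idea behind property-based UDF optimization can be illustrated with a toy reordering check. The operators, field names, and the conflict rule below are hypothetical stand-ins, not the actual SOFA property model or implementation:

```python
from dataclasses import dataclass

@dataclass
class Operator:
    """A dataflow UDF annotated with the record fields it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset

def can_swap(first, second):
    """Two adjacent operators may be reordered when neither one writes a
    field the other reads or writes (no read/write or write/write conflict).
    An optimizer can use this to push cheap, selective filters below
    expensive extraction operators."""
    return (not first.writes & (second.reads | second.writes)
            and not second.writes & first.reads)

# Hypothetical IE pipeline: sentence splitting, entity tagging, language filter.
split   = Operator("split_sentences",  frozenset({"text"}),     frozenset({"sentence"}))
tag     = Operator("tag_entities",     frozenset({"sentence"}), frozenset({"entities"}))
keep_en = Operator("filter_language",  frozenset({"lang"}),     frozenset())
```

    Here the language filter touches neither `sentence` nor `entities`, so it can move ahead of the expensive entity tagger, whereas the tagger cannot move before sentence splitting, whose output it reads. Deriving such annotations automatically where possible, and accepting manual ones otherwise, is the combination the abstract describes.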

    Advanced Knowledge Application in Practice

    The integration and interdependency of the world economy lead toward the creation of a global market that offers more opportunities, but is also more complex and competitive than ever before. Therefore, widespread research activity is necessary for anyone wishing to remain successful in the market. This book is the result of research and development activities by a number of researchers worldwide, covering concrete fields of research.

    Weighted Networks: Applications from Power grid construction to crowd control

    Since its beginnings in the 1950s with Erdős and Rényi, network theory (the study of objects and their associations) has blossomed into a full-fledged branch of mathematics. Due to the network's flexibility, diverse scientific problems can be reformulated as networks and studied using a common set of tools. I define a network G = (V,E) composed of two parts: (i) the set of objects V, called nodes, and (ii) the set of relationships (associations) E, called links, that connect objects in V. We can extend the classic network of nodes and links by describing the intensity of these associations with weights. More formally, weighted networks augment the classic network with a function f(e) from links to the real line, uncovering powerful ways to model real-world applications. This thesis studies new ways to construct robust micro powergrids, mines people's perceptions of causality on a social network, and proposes a new way to analyze crowdsourcing, all in the context of the weighted network model. The current state of Earth's ecosystem and the intensifying climate call on scientists to find new ways to harvest clean, affordable energy. A microgrid, or neighborhood-scale powergrid built using renewable energy sources attached to personal homes, suggests one way to ameliorate this energy crisis. We can study the stability (robustness) of such a small-scale system with weighted networks. A novel use of weighted networks and percolation theory guides the safe and efficient construction of power lines (links, E) connecting a small set of houses (nodes, V) to one another, weighting each power line by the distance between houses. This new look at the robustness of microgrid structures calls into question the efficacy of the traditional utility. The next study uses the Twitter social network to compare and contrast causal language with everyday conversation.
Collecting a set of 1 million tweets, we find that a set of words (unigrams), parts of speech, named entities, and sentiment signal the use of informal causal language. Breaking a problem that is difficult for a computer to solve into many parts and distributing these tasks to a group of humans is called crowdsourcing. My final project asks volunteers to 'reply' to questions asked of them and 'supply' novel questions for others to answer. I model this 'reply and supply' framework as a dynamic weighted network, proposing new theories about this network's behavior and how to steer it toward worthy goals. This thesis demonstrates novel uses of, enhances the current scientific literature on, and presents novel methodology for, weighted networks.
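    The microgrid model above can be sketched directly from the definitions G = (V,E) and f(e). This is a toy illustration under my own assumptions (the connection radius, the use of a union-find component count as the robustness check, and the example coordinates are all hypothetical, not the thesis's percolation analysis):

```python
import math

def build_microgrid(houses, radius):
    """Weighted network: nodes are houses (points in the plane); a power line
    joins two houses when they lie within `radius`, and each link carries a
    weight f(e) equal to the Euclidean distance between its endpoints."""
    links = {}
    for i, (xi, yi) in enumerate(houses):
        for j, (xj, yj) in enumerate(houses):
            if i < j:
                d = math.hypot(xi - xj, yi - yj)
                if d <= radius:
                    links[(i, j)] = d
    return links

def components(n, links):
    """Count connected components via union-find -- a crude robustness check:
    a single component means every house can draw power from every other."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i, j in links:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})
```

    Varying `radius` and watching when the network snaps into one giant component is the basic percolation question the thesis poses about safe and efficient line construction.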