
    Pytrec_eval: An Extremely Fast Python Interface to trec_eval

    We introduce pytrec_eval, a Python interface to the trec_eval information retrieval evaluation toolkit. pytrec_eval exposes the reference implementations of trec_eval within Python as a native extension. We show that pytrec_eval is around one order of magnitude faster than invoking trec_eval as a subprocess from within Python. Compared to a native Python implementation of NDCG, pytrec_eval is twice as fast for practically-sized rankings. Finally, we demonstrate its effectiveness in an application where pytrec_eval is combined with Pyndri and the OpenAI Gym, where query expansion is learned using Q-learning. Comment: SIGIR '18, The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
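
    To make the interface concrete, here is a minimal usage sketch of pytrec_eval; the toy qrels and run below are invented for illustration and are not from the paper:

        import pytrec_eval

        # Toy relevance judgments (qrels) and a toy system run, keyed by query id.
        qrel = {'q1': {'d1': 1, 'd2': 0, 'd3': 2}}
        run = {'q1': {'d1': 1.2, 'd2': 0.4, 'd3': 0.9}}

        # The evaluator wraps trec_eval's reference measure implementations.
        evaluator = pytrec_eval.RelevanceEvaluator(qrel, {'map', 'ndcg'})

        # Per-query results, e.g. {'q1': {'map': ..., 'ndcg': ...}}, computed
        # natively without spawning trec_eval as a subprocess.
        print(evaluator.evaluate(run))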

    Corpus Annotation for Parser Evaluation

    We describe a recently developed corpus annotation scheme for evaluating parsers that avoids shortcomings of current methods. The scheme encodes grammatical relations between heads and dependents, and has been used to mark up a new public-domain corpus of naturally occurring English text. We show how the corpus can be used to evaluate the accuracy of a robust parser, and relate the corpus to extant resources. Comment: 7 pages, LaTeX (uses eaclap.sty)
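
    As a rough illustration of the general idea (not the paper's exact notation), head/dependent annotations can be represented as relation tuples per sentence, and parser evaluation then reduces to comparing predicted tuples against the gold standard; the relation labels below are assumptions made for the example:

        # Hypothetical grammatical-relation tuples for "The company bought the plant":
        # (relation, head, dependent)
        gold = {
            ("det", "company", "The"),
            ("subj", "bought", "company"),
            ("dobj", "bought", "plant"),
            ("det", "plant", "the"),
        }

        def precision_recall(gold, predicted):
            # Score a parser by overlap between its relations and the gold annotation.
            tp = len(gold & predicted)
            precision = tp / len(predicted) if predicted else 0.0
            recall = tp / len(gold) if gold else 0.0
            return precision, recall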

    Automatic Extraction of Subcategorization from Corpora

    We describe a novel technique and implemented system for constructing a subcategorization dictionary from textual corpora. Each dictionary entry encodes the relative frequency of occurrence of a comprehensive set of subcategorization classes for English. An initial experiment, on a sample of 14 verbs which exhibit multiple complementation patterns, demonstrates that the technique achieves accuracy comparable to previous approaches, which are all limited to a highly restricted set of subcategorization classes. We also demonstrate that a subcategorization dictionary built with the system improves the accuracy of a parser by an appreciable amount. Comment: 8 pages; requires aclap.sty. To appear in ANLP-9
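
    To picture the kind of resource produced, a dictionary entry can be thought of as a verb mapped to relative frequencies over its subcategorization classes; the frame labels and numbers below are invented for illustration:

        # Hypothetical entry: relative frequencies of subcategorization frames for "believe".
        subcat_entry = {
            "believe": {
                "NP": 0.42,          # believe the story
                "S-comp": 0.35,      # believe (that) she left
                "NP-to-inf": 0.13,   # believe him to be honest
                "PP-in": 0.10,       # believe in ghosts
            },
        }

        # Frequencies are estimated from corpus counts of hypothesized frames per verb,
        # so each verb's distribution sums to one.
        assert abs(sum(subcat_entry["believe"].values()) - 1.0) < 1e-9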

    Can Subcategorisation Probabilities Help a Statistical Parser?

    Research into the automatic acquisition of lexical information from corpora is starting to produce large-scale computational lexicons containing data on the relative frequencies of subcategorisation alternatives for individual verbal predicates. However, the empirical question of whether this type of frequency information can in practice improve the accuracy of a statistical parser has not yet been answered. In this paper we describe an experiment with a wide-coverage statistical grammar and parser for English and subcategorisation frequencies acquired from ten million words of text which shows that this information can significantly improve parse accuracy. Comment: 9 pages, uses colacl.sty
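
    One simple way to picture how such frequencies can enter a parser (a sketch under assumed names, not the paper's actual probabilistic model) is to rescore candidate parses by the relative frequency of the subcategorisation frame each parse assigns to the verb:

        # Hypothetical subcategorisation frequencies for "give".
        subcat_freq = {"give": {"NP-NP": 0.55, "NP-PP-to": 0.40, "NP": 0.05}}

        candidate_parses = [
            {"base_prob": 0.30, "verb": "give", "frame": "NP"},        # less plausible frame
            {"base_prob": 0.25, "verb": "give", "frame": "NP-PP-to"},  # more plausible frame
        ]

        def rescore(parse):
            # Weight the parser's own score by the frame's relative frequency.
            return parse["base_prob"] * subcat_freq[parse["verb"]].get(parse["frame"], 1e-6)

        best = max(candidate_parses, key=rescore)
        print(best["frame"])  # the frequency term prefers the NP-PP-to analysis here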

    Saggitarius: A DSL for Specifying Grammatical Domains

    Common data types like dates, addresses, phone numbers and tables can have multiple textual representations, and many heavily-used languages, such as SQL, come in several dialects. These variations can cause data to be misinterpreted, leading to silent data corruption, failure of data processing systems, or even security vulnerabilities. Saggitarius is a new language and system designed to help programmers reason about the format of data, by describing grammatical domains -- that is, sets of context-free grammars that describe the many possible representations of a datatype. We describe the design of Saggitarius via example and provide a relational semantics. We show how Saggitarius may be used to analyze a data set: given example data, it uses an algorithm based on semi-ring parsing and MaxSAT to infer which grammar in a given domain best matches that data. We evaluate the effectiveness of the algorithm on a benchmark suite of 110 example problems, and we demonstrate that our system typically returns a satisfying grammar within a few seconds with only a small number of examples. We also delve deeper into a more extensive case study on using Saggitarius for CSV dialect detection. Despite being general-purpose, we find that Saggitarius offers comparable results to hand-tuned, specialized tools; in the case of CSV, it infers grammars for 84% of benchmarks within 60 seconds, and has comparable accuracy to custom-built dialect detection tools. Comment: OOPSLA 202
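
    Abstracting away from the DSL itself, the inference step can be pictured as scoring every grammar in a domain against the example strings and keeping the best match; the regular expressions below stand in for context-free grammars purely to keep the sketch short, and the date formats are assumptions:

        import re

        # A toy "grammatical domain" for dates; each entry stands in for one candidate grammar.
        domain = {
            "iso":  re.compile(r"\d{4}-\d{2}-\d{2}$"),
            "us":   re.compile(r"\d{2}/\d{2}/\d{4}$"),
            "euro": re.compile(r"\d{2}\.\d{2}\.\d{4}$"),
        }

        examples = ["03/14/2015", "12/25/2020", "01/01/1999"]

        # Keep the candidate that accepts the most examples. Saggitarius itself ranks
        # full context-free grammars using semi-ring parsing and MaxSAT instead.
        best = max(domain, key=lambda name: sum(bool(domain[name].match(e)) for e in examples))
        print(best)  # "us"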

    HUDDL for description and archive of hydrographic binary data

    Many of the attempts to introduce a universal hydrographic binary data format have failed or have been only partially successful. In essence, this is because such formats either have to simplify the data to such an extent that they only support the lowest common subset of all the formats covered, or they attempt to be a superset of all formats and quickly become cumbersome. Neither choice works well in practice. This paper presents a different approach: a standardized description of (past, present, and future) data formats using the Hydrographic Universal Data Description Language (HUDDL), a descriptive language implemented using the Extensible Markup Language (XML). That is, XML is used to provide a structural and physical description of a data format, rather than the content of a particular file. Done correctly, this opens the possibility of automatically generating both multi-language data parsers and documentation for format specifications from their HUDDL descriptions, as well as providing easy version control of them. This solution also provides a powerful approach for archiving a structural description of data along with the data, so that binary data will remain easy to access in the future. Intending to provide a relatively low-effort solution for indexing the wide range of existing formats, we suggest the creation of a catalogue of format descriptions, each of them capturing the logical and physical specifications for a given data format (with its subsequent upgrades). A C/C++ parser code generator is used as an example prototype of one of the possible advantages of adopting such a hydrographic data format catalogue.
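
    As a language-agnostic sketch of the underlying idea (this is not HUDDL's actual XML schema), a declarative record description can drive a generic parser instead of a hand-written one per format; the field names and layout below are hypothetical:

        import struct

        # Hypothetical description of one binary record: (field name, struct format code).
        record_description = [
            ("timestamp", "I"),   # unsigned 32-bit seconds
            ("latitude",  "d"),   # IEEE-754 double
            ("longitude", "d"),
            ("depth_m",   "f"),   # 32-bit float
        ]

        def parse_record(buf: bytes) -> dict:
            # The parser is driven entirely by the description, mirroring how a
            # format catalogue could generate readers for many formats at once.
            fmt = "<" + "".join(code for _, code in record_description)
            values = struct.unpack_from(fmt, buf)
            return dict(zip((name for name, _ in record_description), values))

        raw = struct.pack("<Iddf", 1700000000, 43.07, -70.71, 123.4)
        print(parse_record(raw))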

    User interface enhancement report

    The existing user interfaces to TEMPUS, Plaid, and other systems in the OSDS are fundamentally based on only two modes of communication: alphanumeric commands or data input, and graphical interaction. The latter is especially suited to the types of interaction necessary for creating workstation objects with BUILD and performing body positioning in TEMPUS. Looking toward the future application of TEMPUS, however, the long-term goals of OSDS will include the analysis of extensive tasks in space involving one or more individuals working in concert over a period of time. In this context, the TEMPUS body positioning capability, though extremely useful in creating and validating a small number of particular body positions, will become somewhat tedious to use. The macro facility helps somewhat, since frequently used positions may be easily applied by executing a stored macro. The difference between body positioning and task execution, though subtle, is important. In the case of task execution, the important information at the user's level is what actions are to be performed rather than how the actions are performed. Viewed slightly differently, the what is constant over a set of individuals though the how may vary.

    The Construction of a Static Source Code Scanner Focused on SQL Injection Vulnerabilities in Java

    SQL injection attacks are a significant threat to web application security, allowing attackers to execute arbitrary SQL commands and gain unauthorized access to sensitive data. Static source code analysis is a widely used technique for identifying security vulnerabilities in software, including SQL injection flaws. However, existing static source code scanners often produce false positives and require a high level of expertise to use effectively. This thesis presents the design and implementation of a static source code scanner for SQL injection vulnerabilities in Java queries. The scanner uses a combination of pattern matching and data flow analysis, examining method calls, expressions, and variable declarations to detect potentially vulnerable code. To evaluate the scanner, malicious SQL code is manually injected into queries to test the scanner's ability to detect vulnerabilities. The results showed that the scanner could identify a high percentage of SQL injection vulnerabilities. The limitations of the scanner include the inability to detect runtime user input validation and the reliance on predefined patterns and heuristics to identify vulnerabilities. Despite these limitations, the scanner provides a useful tool for junior developers to identify and address SQL injection vulnerabilities in their code. This thesis presents a static source code scanner that can effectively detect SQL injection vulnerabilities in Java web applications. The scanner's design and implementation provide a useful contribution to the field of software security, and future work could focus on improving the scanner's precision and addressing its limitations.
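
    A minimal sketch of the pattern-matching half of such a scanner (an illustration, not the thesis's actual implementation) might flag Java statements that build SQL by string concatenation and pass it to a JDBC execute call:

        import re

        # Flags lines where a JDBC execute* call receives a string built by concatenation,
        # a common indicator of SQL injection risk. Data-flow analysis is omitted here.
        EXECUTE_CALL = re.compile(r"\.execute(Query|Update)?\s*\(")
        CONCAT_SQL = re.compile(r"\"(SELECT|INSERT|UPDATE|DELETE)[^\"]*\"\s*\+", re.IGNORECASE)

        def scan_java_source(source: str):
            findings = []
            for lineno, line in enumerate(source.splitlines(), start=1):
                if EXECUTE_CALL.search(line) and CONCAT_SQL.search(line):
                    findings.append((lineno, line.strip()))
            return findings

        java_line = 'stmt.executeQuery("SELECT * FROM users WHERE name = \'" + userInput + "\'");'
        print(scan_java_source(java_line))  # flags line 1 as a potential injection point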