Generalizing input-driven languages: theoretical and practical benefits
Regular languages (RL) are the simplest family in Chomsky's hierarchy. Thanks
to their simplicity they enjoy various nice algebraic and logic properties that
have been successfully exploited in many application fields. Practically all of
their related problems are decidable, so that they support automatic
verification algorithms. Also, they can be recognized in real-time.
Context-free languages (CFL) are another major family well-suited to
formalize programming, natural, and many other classes of languages; their
increased generative power w.r.t. RL, however, causes the loss of several
closure properties and of the decidability of important problems; furthermore
they need complex parsing algorithms. Thus, various subclasses thereof have
been defined with different goals, spanning from efficient, deterministic
parsing to closure properties, logic characterization and automatic
verification techniques.
Among CFL subclasses, so-called structured ones, i.e., those where the
typical tree-structure is visible in the sentences, exhibit many of the
algebraic and logic properties of RL, whereas deterministic CFL have been
thoroughly exploited in compiler construction and other application fields.
After surveying and comparing the main properties of those various language
families, we go back to operator precedence languages (OPL), an old family
through which R. Floyd pioneered deterministic parsing, and we show that they
offer unexpected properties in two fields so far investigated in totally
independent ways: they enable parsing parallelization in a more effective way
than traditional sequential parsers, and exhibit the same algebraic and logic
properties so far obtained only for less expressive language families.
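To make the parsing side concrete, the sketch below shows a Floyd-style operator precedence recognizer for a toy grammar; the grammar, precedence matrix, and driver are illustrative assumptions, not constructions taken from the paper. Every decision depends only on the precedence relation between adjacent terminals, which is the locality that parallel parsing schemes for OPL build on.

```python
# A minimal sketch of Floyd-style operator precedence parsing for the toy grammar
# E -> E+E | E*E | (E) | n.  Precedence matrix and grammar are illustrative only.
PREC = {  # relations between terminals: '<' yields, '=' equal, '>' takes precedence
    ('$', '+'): '<', ('$', '*'): '<', ('$', 'n'): '<', ('$', '('): '<',
    ('+', '+'): '>', ('+', '*'): '<', ('+', 'n'): '<', ('+', '('): '<', ('+', ')'): '>', ('+', '$'): '>',
    ('*', '+'): '>', ('*', '*'): '>', ('*', 'n'): '<', ('*', '('): '<', ('*', ')'): '>', ('*', '$'): '>',
    ('n', '+'): '>', ('n', '*'): '>', ('n', ')'): '>', ('n', '$'): '>',
    ('(', '+'): '<', ('(', '*'): '<', ('(', 'n'): '<', ('(', '('): '<', ('(', ')'): '=',
    (')', '+'): '>', (')', '*'): '>', (')', ')'): '>', (')', '$'): '>',
}
RHS = {('E', '+', 'E'), ('E', '*', 'E'), ('(', 'E', ')'), ('n',)}  # known right-hand sides

def top_terminal(stack):
    return next(s for s in reversed(stack) if s != 'E')

def parse(tokens):
    """Return True iff `tokens` (e.g. ['n', '+', 'n']) is a sentence of the toy grammar."""
    stack, toks, i = ['$'], list(tokens) + ['$'], 0
    while True:
        a, b = top_terminal(stack), toks[i]
        if a == '$' and b == '$':
            return stack == ['$', 'E']
        rel = PREC.get((a, b))
        if rel in ('<', '='):                 # shift
            stack.append(b)
            i += 1
        elif rel == '>':                      # reduce: pop the handle, check it is a known RHS
            handle = []
            while True:
                sym = stack.pop()
                handle.append(sym)
                if sym != 'E' and PREC.get((top_terminal(stack), sym)) == '<':
                    break
            if stack[-1] == 'E':              # a nonterminal just left of the handle belongs to it
                handle.append(stack.pop())
            if tuple(reversed(handle)) not in RHS:
                return False
            stack.append('E')
        else:
            return False                      # no precedence relation defined: syntax error

print(parse(['n', '+', 'n', '*', 'n']))   # True
print(parse(['n', '+', '*', 'n']))        # False
```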
Trustworthy Formal Natural Language Specifications
Interactive proof assistants are computer programs carefully constructed to
check a human-designed proof of a mathematical claim with high confidence in
the implementation. However, this only validates truth of a formal claim, which
may have been mistranslated from a claim made in natural language. This is
especially problematic when using proof assistants to formally verify the
correctness of software with respect to a natural language specification. The
translation from informal to formal remains a challenging, time-consuming
process that is difficult to audit for correctness.
This paper shows that it is possible to build support for specifications
written in expressive subsets of natural language, within existing proof
assistants, consistent with the principles used to establish trust and
auditability in proof assistants themselves. We implement a means to provide
specifications in a modularly extensible formal subset of English, and have
them automatically translated into formal claims, entirely within the Lean
proof assistant. Our approach is extensible (placing no permanent restrictions
on grammatical structure), modular (allowing information about new words to be
distributed alongside libraries), and produces proof certificates explaining
how each word was interpreted and how the sentence's structure was used to
compute the meaning.
We apply our prototype to the translation of various English descriptions of
formal specifications from a popular textbook into Lean formalizations; all can
be translated correctly with a modest lexicon, with only minor modifications related to lexicon size.
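As a purely hypothetical illustration (the sentence, lexicon entries, and output below are invented here, not taken from the paper or its textbook examples), an English specification such as "reversing a list twice yields the original list" might be rendered by such a pipeline as a formal Lean claim along these lines:

```lean
-- Hypothetical translation target; the prototype's actual grammar, lexicon, and
-- generated statements may look quite different.
theorem reverse_twice (l : List Nat) : l.reverse.reverse = l := by
  simp   -- List.reverse_reverse is a simp lemma in Lean's standard library
```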
Stroke order normalization for improving recognition of online handwritten mathematical expressions
We present a technique based on stroke order normalization for improving recognition of online handwritten mathematical expressions (ME). The stroke-order-dependent system has lower time complexity than the stroke-order-free system, but it must incorporate special grammar rules to cope with stroke order variations. The stroke order normalization technique solves this problem, as well as the problem of unexpected stroke order variations, without increasing the time complexity of ME recognition.
In order to normalize stroke order, the X-Y cut method is modified, since its original form causes problems when structural components in an ME overlap. First, vertically ordered strokes are located by detecting vertical symbols and their upper/lower components, which are treated as MEs and reordered recursively. Second, unordered strokes on the left side of the vertical symbols are reordered as horizontally ordered strokes. Third, the remaining strokes are reordered recursively. The horizontally ordered strokes are reordered from left to right, and the vertically ordered strokes are reordered from top to bottom. Finally, the proposed stroke order normalization is combined with the stroke-order-dependent ME recognition system. Evaluations on the CROHME 2014 database show that the ME recognition system incorporating the stroke order normalization outperforms all other systems that use only CROHME 2014 for training, while the processing time is kept low.
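The sketch below is a heavily simplified illustration of this recursive reordering, not the authors' exact procedure: strokes are reduced to bounding boxes, a detector for vertical symbols (such as fraction bars) is assumed to run upstream, and the recursion orders horizontal material left to right and vertical groups top to bottom.

```python
# Simplified sketch of stroke order normalization by a modified X-Y cut.
# Strokes are bounding boxes; `is_vertical_symbol` stands in for the detector
# of fraction bars, sums, etc. mentioned in the abstract.
from dataclasses import dataclass

@dataclass
class Stroke:
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    is_vertical_symbol: bool = False   # e.g. a fraction bar detected upstream

def normalize(strokes):
    """Return strokes in a normalized order: left to right, vertical groups top to bottom."""
    if len(strokes) <= 1:
        return list(strokes)
    bars = [s for s in strokes if s.is_vertical_symbol]
    if not bars:
        return sorted(strokes, key=lambda s: s.xmin)        # plain horizontal order
    bar = min(bars, key=lambda s: s.xmin)                   # leftmost vertical symbol
    others = [s for s in strokes if s is not bar]
    left = [s for s in others if s.xmax < bar.xmin]         # strokes entirely left of the symbol
    above = [s for s in others if s not in left and s.ymax <= bar.ymin]   # upper component
    below = [s for s in others if s not in left and s.ymin >= bar.ymax]   # lower component
    rest = [s for s in others if s not in left and s not in above and s not in below]
    # Left part keeps horizontal order; the vertical group is emitted top to bottom,
    # each component normalized recursively; remaining strokes follow.
    return (sorted(left, key=lambda s: s.xmin)
            + normalize(above) + [bar] + normalize(below)
            + normalize(rest))
```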
Gradual computerisation and verification of mathematics: MathLang's path into Mizar
There are many proof checking tools that allow capturing mathematical knowledge in a formal representation. Those proof systems then allow automatic verification of the logical correctness of the captured knowledge. However, the process of encoding common mathematical documents in a chosen proof system is still labour-intensive and requires comprehensive knowledge of that system. This makes proof checking tools inaccessible to ordinary mathematicians. This thesis provides a solution for the computerisation of mathematical documents via a number of gradual steps using the MathLang framework. We express the full process of formalisation into the Mizar proof checker.
The first levels of this gradual computerisation path had been developed well before this PhD started.
The whole project, called MathLang, dates back to 2000, when F. Kamareddine and J.B. Wells started expressing their ideas for a novel approach to computerising mathematical texts. They mainly aimed at developing a mathematical framework flexible enough to connect existing, in many cases different, approaches to the computerisation of mathematics; to allow various degrees of formalisation (e.g., partial formalisation, full formalisation of chosen parts, or full formalisation of the entire document); and to be compatible with different mathematical foundations (e.g., type theory, set theory, category theory) and proof systems (e.g., Mizar, Isar, Coq, HOL, Vampire). The first two steps of the gradual formalisation were developed by F. Kamareddine, J.B. Wells and M. Maarek, with a small contribution by R. Lamar to the second step. In this thesis we develop the third level of the gradual path, which aims at capturing the rhetorical structure of mathematical documents. We have also integrated further steps of the gradual formalisation, whose final goal is the Mizar system.
We present in this thesis a full path of computerisation and formalisation of mathematical documents into the Mizar proof checker using the MathLang framework. The development of this method was driven by the experience of computerising a number of mathematical documents (covering different authoring styles).
Wittgenstein and the Grammar of Physics: A Study of Ludwig Wittgenstein's 1929-1930 Manuscripts and the Roots of His Later Philosophy
In 1929 Wittgenstein began to work on the first philosophical manuscripts he had kept since completing the Tractatus Logico-Philosophicus (TLP) in 1918. The impetus for this was his conviction that the logic of the TLP was flawed: it was unable to account for the fact that a proposition that assigns a single value on a continuum to a simple object thereby excludes all assignments of different values to the object (the color exclusion problem). Consequently, Wittgenstein's atomic propositions could not be logically independent of one another.
Initially he thought he could replace the logically perfect language of the TLP with a phenomenological language in which experiential propositions about various spaces (e.g., visual space) would form systems. The system described by a phenomenological language would be independent of the one described by ordinary physicalistic language; the physical world was known only by inference from the phenomenological. But he soon realized there was a fundamental error in this conception: phenomenology has to describe the same world as physics or it fails to provide a foundation for it. This suggests that there is only one world, and one language with two different modes of expression. Wittgenstein now comes to his fundamental insight: ordinary language is biased towards the description of physical objects and their relations, but it is our only method of expressing phenomenological or abstract concepts. Failure to recognize this difficulty leads to the misapplication of physicalistic concepts, i.e., to grammatical errors.
This insight had the following impact: (a) the task of philosophy is not to invent logical or phenomenological languages, but to understand the grammar of ordinary language; (b) the notion of a language of pure experience was itself connected with a false, physicalistic idea of the soul as an observer of a private world; (c) the conception of analysis that had guided philosophy since Frege was based on a misused metaphor taken from the grammar of physics: the analysis of an object into its parts or chemical constituents. These ideas reach their ultimate expression in Wittgenstein's Philosophical Investigations. Thus his later work actually begins with these 1929–30 manuscripts.
Mathematical Formula Recognition and Automatic Detection and Translation of Algorithmic Components into Stochastic Petri Nets in Scientific Documents
A great percentage of documents in scientific and engineering disciplines include mathematical formulas and/or algorithms. Exploring the mathematical formulas in technical documents, we focus on the associations among mathematical operations, their syntactical correctness, and the mapping of these components into attributed graphs and Stochastic Petri Nets (SPN). We also introduce a formal language to generate mathematical formulas and evaluate their syntactical correctness. The main contribution of this work is the automatic segmentation of mathematical documents for the parsing and analysis of detected algorithmic components. To achieve this, we present a synergy of methods, such as string parsing according to mathematical rules, formal language modeling, optical analysis of technical documents in the form of images, structural analysis of text in images, and graph and Stochastic Petri Net mapping. Finally, for the recognition of the algorithms, we enrich our rule-based model with machine learning techniques to obtain better results.
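As an illustration of checking the syntactic correctness of a linearised formula, here is a minimal recursive-descent checker over a toy grammar; the grammar and tokenizer are stand-ins for illustration, not the formal language introduced in the paper.

```python
# Minimal sketch: syntactic well-formedness of a linearised formula, checked by
# recursive descent over the toy grammar
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NAME | NUMBER | '(' expr ')'
import re

def tokenize(s):
    return re.findall(r"\d+|[A-Za-z]\w*|\S", s)   # numbers, identifiers, single symbols

def is_wellformed(formula):
    toks, pos = tokenize(formula), 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def eat(t):
        nonlocal pos
        if peek() == t:
            pos += 1
            return True
        return False

    def factor():
        nonlocal pos
        tok = peek()
        if tok == '(':
            pos += 1
            return expr() and eat(')')
        if tok is not None and re.fullmatch(r"\d+|[A-Za-z]\w*", tok):
            pos += 1
            return True
        return False

    def term():
        if not factor():
            return False
        while peek() in ('*', '/'):
            eat(peek())
            if not factor():
                return False
        return True

    def expr():
        if not term():
            return False
        while peek() in ('+', '-'):
            eat(peek())
            if not term():
                return False
        return True

    return expr() and pos == len(toks)

print(is_wellformed("a + b*(c - 2)"))   # True
print(is_wellformed("a + * b"))         # False
```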
Programming language complexity analysis and its impact on Checkmarx activities
Integrated Master's dissertation in Informatics Engineering.
Tools for programming language processing, like static analysers (for instance, a Static
Application Security Testing (SAST) tool, one of Checkmarx’s main products), must be
adapted to cope with a given input when the source programming language changes.
Complexity of the programming language is one of the key factors that deeply impacts the time needed to support it.
This Master's project proposes an approach for assessing language complexity by measuring, in a first stage, the complexity of the language's underlying context-free grammar (CFG).
From the analysis of concrete case studies, factors have been identified that make the support process more time-consuming, in particular in the stages of language recognition and of transformation to an abstract syntax tree (AST). Accordingly, in a second stage, a set of language features is analysed in order to take those factors, which also impact language processing, into account.
The main objective of the work reported here is to help development teams improve the estimation of the time and effort needed to adapt the SAST tool to a new programming language.
In this dissertation a tool is proposed that evaluates the complexity of a language based on a set of metrics classifying the complexity of its grammar, along with a set of language properties. The tool compares the complexity of the new language with that of previously supported languages to predict the effort required to process it.
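As a rough illustration of the first stage, the sketch below computes a few simple size metrics over a context-free grammar; the grammar representation and the metric set are assumptions made here for illustration, not the metrics actually used by the dissertation's tool.

```python
# Minimal sketch of grammar-size metrics over a CFG. A grammar maps each
# nonterminal to a list of alternative right-hand sides (lists of symbols).
def grammar_metrics(grammar):
    nonterminals = set(grammar)
    rhss = [rhs for alts in grammar.values() for rhs in alts]
    terminals = {sym for rhs in rhss for sym in rhs if sym not in nonterminals}
    return {
        "nonterminals": len(nonterminals),
        "terminals": len(terminals),
        "productions": len(rhss),
        "avg_rhs_length": sum(len(r) for r in rhss) / len(rhss),
        "max_alternatives": max(len(alts) for alts in grammar.values()),
    }

# Toy expression grammar, used only to exercise the metrics.
toy = {
    "Expr": [["Expr", "+", "Term"], ["Term"]],
    "Term": [["Term", "*", "Factor"], ["Factor"]],
    "Factor": [["(", "Expr", ")"], ["id"]],
}
print(grammar_metrics(toy))
```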
Do Neural Nets Learn Statistical Laws behind Natural Language?
The performance of deep learning in natural language processing has been
spectacular, but the reasons for this success remain unclear because of the
inherent complexity of deep learning. This paper provides empirical evidence of
its effectiveness and of a limitation of neural networks for language
engineering. Precisely, we demonstrate that a neural language model based on
long short-term memory (LSTM) effectively reproduces Zipf's law and Heaps' law,
two representative statistical properties underlying natural language. We
discuss the quality of reproducibility and the emergence of Zipf's law and
Heaps' law as training progresses. We also point out that the neural language
model has a limitation in reproducing long-range correlation, another
statistical property of natural language. This understanding could provide a
direction for improving the architectures of neural networks.
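For concreteness, the sketch below shows one standard way (assumed here, not necessarily the paper's exact methodology) to estimate the Zipf and Heaps exponents from a token sequence, so that a training corpus can be compared with text sampled from a language model.

```python
# Minimal sketch: Zipf's law (rank-frequency) and Heaps' law (vocabulary growth)
# exponents estimated by least-squares fits in log-log space.
from collections import Counter
import math

def _slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

def zipf_exponent(tokens):
    """Slope of log(frequency) vs log(rank); Zipf's law predicts a value near -1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    return _slope(xs, ys)

def heaps_exponent(tokens):
    """Slope of log(vocabulary size) vs log(text length); typically around 0.5-0.8."""
    seen, xs, ys = set(), [], []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        xs.append(math.log(n))
        ys.append(math.log(len(seen)))
    return _slope(xs, ys)

tokens = "the cat sat on the mat and the dog sat on the log".split()
print(zipf_exponent(tokens), heaps_exponent(tokens))
```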