Towards General-Purpose Representation Learning of Polygonal Geometries
Neural network representation learning for spatial data is a common need for
geographic artificial intelligence (GeoAI) problems. In recent years, many
advancements have been made in representation learning for points, polylines,
and networks, whereas little progress has been made for polygons, especially
complex polygonal geometries. In this work, we focus on developing a
general-purpose polygon encoding model, which can encode a polygonal geometry
(with or without holes, single or multipolygons) into an embedding space. The
resulting embeddings can be leveraged directly (or fine-tuned) for downstream tasks
such as shape classification, spatial relation prediction, and so on. To
achieve model generalizability guarantees, we identify a few desirable
properties: loop origin invariance, trivial vertex invariance, part permutation
invariance, and topology awareness. We explore two different designs for the
encoder: one derives all representations in the spatial domain; the other
leverages spectral domain representations. For the spatial domain approach, we
propose ResNet1D, a 1D CNN-based polygon encoder, which uses circular padding
to achieve loop origin invariance on simple polygons. For the spectral domain
approach, we develop NUFTspec based on Non-Uniform Fourier Transformation
(NUFT), which naturally satisfies all the desired properties. We conduct
experiments on two tasks: 1) shape classification based on MNIST; 2) spatial
relation prediction based on two new datasets - DBSR-46K and DBSR-cplx46K. Our
results show that NUFTspec and ResNet1D outperform multiple existing baselines
by significant margins. While ResNet1D suffers from performance degradation
after shape-invariant geometry modifications, NUFTspec is very robust to these
modifications due to the nature of the NUFT.
Comment: 58 pages, 20 figures; accepted to GeoInformatica
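The circular-padding idea behind the spatial-domain encoder can be sketched in a few lines. This is a minimal illustration assuming a PyTorch-style 1D CNN, not the paper's exact ResNet1D architecture:

```python
# Minimal sketch of loop-origin invariance via circular padding, assuming
# a PyTorch-style 1D CNN; not the paper's exact ResNet1D architecture.
import torch
import torch.nn as nn

class PolygonEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # padding_mode="circular" wraps the convolution around the vertex
        # loop, so changing the starting vertex merely rotates the features.
        self.net = nn.Sequential(
            nn.Conv1d(2, 32, 3, padding=1, padding_mode="circular"),
            nn.ReLU(),
            nn.Conv1d(32, dim, 3, padding=1, padding_mode="circular"),
            nn.ReLU(),
        )

    def forward(self, verts):          # verts: (batch, 2, num_vertices)
        feats = self.net(verts)        # (batch, dim, num_vertices)
        # Global average pooling removes the remaining dependence on where
        # the loop "starts", yielding a loop-origin-invariant embedding.
        return feats.mean(dim=-1)      # (batch, dim)

enc = PolygonEncoder()
poly = torch.rand(1, 2, 100)                    # a 100-vertex simple polygon
rolled = torch.roll(poly, shifts=17, dims=-1)   # same loop, new origin
print(torch.allclose(enc(poly), enc(rolled), atol=1e-5))  # True
```

Because convolution with circular padding is equivariant to cyclic shifts of the vertex sequence and the final pooling is shift-invariant, relabeling the loop origin leaves the embedding unchanged.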
fr2sql: Querying Databases in French
Databases are increasingly common and play a growing role in today's applications and Web sites. They are often used by people who have little expertise in the domain and who do not know the database structure precisely. This is why translators from natural language to SQL queries are being developed. Unfortunately, most of these translators are confined to a single database because of the specificity of its architecture. In this paper, we propose a method for querying any database in French. We evaluate our application on two databases with different structures, and we also show that it supports more operations than most other translators.
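As an illustration of the task only (the abstract does not detail fr2sql's algorithm, so the lexicon and schema below are hypothetical), a toy keyword-based translation might look like this:

```python
# Toy illustration of French-question-to-SQL translation; NOT fr2sql's
# actual algorithm. The schema and lexicon below are hypothetical.
import re

# Minimal lexicon mapping a French keyword to a (table, column) pair.
LEXICON = {"salaire": ("employe", "salaire")}

def translate(question: str) -> str:
    q = question.lower()
    for keyword, (table, column) in LEXICON.items():
        m = re.search(rf"{keyword} de (\w+)", q)
        if m:
            name = m.group(1).capitalize()
            return f"SELECT {column} FROM {table} WHERE nom = '{name}';"
    return "SELECT 1;"  # fallback when no pattern matches

print(translate("Quel est le salaire de Marie ?"))
# -> SELECT salaire FROM employe WHERE nom = 'Marie';
```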
Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey
Large language models (LLMs) have significantly advanced the field of natural
language processing (NLP), providing a highly useful, task-agnostic foundation
for a wide range of applications. However, directly applying LLMs to solve
sophisticated problems in specific domains faces many hurdles, caused by the
heterogeneity of domain data, the sophistication of domain knowledge, the
uniqueness of domain objectives, and the diversity of the constraints (e.g.,
various social norms, cultural conformity, religious beliefs, and ethical
standards in the domain applications). Domain specialization techniques are key
to making large language models disruptive in many applications. To overcome
these hurdles, recent years have seen a notable increase in research and
practice on the domain specialization of LLMs. This
emerging field of study, with its substantial potential for impact,
necessitates a comprehensive and systematic review to better summarize and
guide ongoing work in this area. In this article, we present a comprehensive
survey on domain specialization techniques for large language models, an
emerging direction critical for large language model applications. First, we
propose a systematic taxonomy that categorizes the LLM domain-specialization
techniques based on the level of access to the LLM, and summarizes the
framework for each subcategory as well as the relations and differences among them.
Second, we present an extensive taxonomy of critical application domains that
can benefit dramatically from specialized LLMs, discussing their practical
significance and open challenges. Lastly, we offer our insights into the
current research status and future trends in this area.
Dwelling on ontology - semantic reasoning over topographic maps
The thesis builds upon the hypothesis that the spatial arrangement of topographic
features, such as buildings, roads and other land cover parcels, indicates how land is
used. The aim is to make this kind of high-level semantic information explicit within
topographic data. There is an increasing need to share and use data for a wider range of
purposes, and to make data more definitive, intelligent and accessible. Unfortunately,
we still encounter a gap between low-level data representations and high-level concepts
that typify human qualitative spatial reasoning. The thesis adopts an ontological
approach to bridge this gap and to derive functional information by using standard
reasoning mechanisms offered by logic-based knowledge representation formalisms. It
formulates a framework for the processes involved in interpreting land use information
from topographic maps. Land use is a high-level abstract concept, but it is also an
observable fact intimately tied to geography. By decomposing this relationship, the
thesis draws a one-to-one mapping between high-level conceptualisations
established from human knowledge and real world entities represented in the data.
Based on a middle-out approach, it develops a conceptual model that incrementally
links different levels of detail, and thereby derives coarser, more meaningful
descriptions from more detailed ones. The thesis verifies its proposed ideas by
implementing an ontology describing the land use 'residential area' in the ontology
editor Protégé. By asserting knowledge about high-level concepts such as types of
dwellings, urban blocks and residential districts as well as individuals that link directly
to topographic features stored in the database, the reasoner successfully infers instances
of the defined classes. Despite current technological limitations, ontologies are a
promising way forward in the manner we handle and integrate geographic data,
especially with respect to how humans conceptualise geographic space.
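The inference step described above can be reproduced outside Protégé. The sketch below uses the owlready2 Python library, with hypothetical stand-in names for the thesis's land-use classes, to show how a reasoner infers membership in a defined class rather than having it asserted:

```python
# Hypothetical sketch of defined-class inference over land-use concepts,
# using owlready2 instead of Protégé; all names are illustrative stand-ins.
from owlready2 import Thing, ObjectProperty, get_ontology, sync_reasoner

onto = get_ontology("http://example.org/landuse.owl")

with onto:
    class TopographicFeature(Thing): pass
    class Dwelling(TopographicFeature): pass
    class UrbanBlock(TopographicFeature): pass
    class contains(ObjectProperty): pass

    # A defined class: any urban block that contains a dwelling.
    # Membership is inferred by the reasoner, never asserted directly.
    class ResidentialBlock(Thing):
        equivalent_to = [UrbanBlock & contains.some(Dwelling)]

block = UrbanBlock("block_42")
block.contains = [Dwelling("dwelling_1")]

sync_reasoner()                          # runs the HermiT reasoner (needs Java)
print(ResidentialBlock in block.is_a)    # True: inferred, not asserted
```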
Improving Model Finding for Integrated Quantitative-qualitative Spatial Reasoning With First-order Logic Ontologies
Many spatial standards are developed to harmonize the semantics and specifications of GIS data and to support sophisticated reasoning. All of these standards include some types of simple and complex geometric features, and some incorporate simple mereotopological relations. But the relations as used in these standards only allow the extraction of qualitative information from geometric data; they lack formal semantics linking geometric representations with mereotopological or other qualitative relations. This impedes integrated reasoning over qualitative data obtained from geometric sources and 'native' topological information, for example as provided by textual sources where precise locations or spatial extents are unknown or unknowable. To address this issue, the first contribution of this dissertation is a first-order logic (FOL) ontology that treats geometric features (e.g. polylines, polygons) and relations between them as specializations of more general types of features (e.g. any kind of 2D or 1D features) and mereotopological relations between them. Key to this endeavor is the use of a multidimensional theory of space wherein, unlike in traditional logical theories of mereotopology (such as RCC), spatial entities of different dimensions can co-exist and be related.

However, terminating or tractable reasoning with such an expressive ontology and potentially large amounts of data is a challenging AI problem. Model finding tools used to verify FOL ontologies with data usually employ a SAT solver to determine the satisfiability of the propositional instantiations (SAT problems) of the ontology. These solvers often experience scalability issues as the number of objects and the size and complexity of the ontology increase, limiting their use to ontologies with small signatures and to building small models with fewer than 20 objects. To investigate how an ontology influences the size of its SAT translation, and consequently the model finder's performance, we develop a formalization of FOL ontologies with data. We theoretically identify parameters of an ontology that significantly contribute to the dramatic growth in the size of the SAT problem: the search space is exponential in the signature of the ontology (the number of predicates in the axiomatization plus any additional predicates from skolemization) and in the number of distinct objects in the model. Axiomatizations that contain many definitions lead to a large number of propositional clauses, owing to the conversion of biconditionals to clausal form. We therefore postulate that optional definitions are ideal sentences to eliminate from an ontology to boost model-finding performance.

We then formalize optional definition elimination (ODE) as an FOL ontology preprocessing step and test the simplification on a set of spatial benchmark problems, generating smaller SAT problems (with fewer clauses and variables) without changing the satisfiability or semantic meaning of the problem. We experimentally demonstrate that the reduction in SAT problem size also leads to improved model finding with state-of-the-art model finders, with speedups of 10-99%. Altogether, this dissertation improves spatial reasoning capabilities with FOL ontologies, in terms of both a formal framework for integrated qualitative-geometric reasoning and specific ontology preprocessing steps that can be built into automated reasoners to achieve better speedups in model-finding times and scalability to moderately sized datasets.
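The clause-growth argument can be made concrete with a back-of-the-envelope count (a sketch of the idea only, not the dissertation's formalization):

```python
# Back-of-the-envelope illustration of why definitions inflate the SAT
# translation (a sketch of the idea, not the dissertation's formalization).
# A definition  P(x) <-> Q(x) & R(x)  clausifies into three clauses:
#   ~P(x) | Q(x),   ~P(x) | R(x),   P(x) | ~Q(x) | ~R(x)
# Grounding over a domain of n objects multiplies each clause by n and adds
# n fresh propositional variables for the defined predicate P.

def ground_cost(n_objects: int, arity: int = 1) -> tuple:
    insts = n_objects ** arity
    extra_clauses = 3 * insts      # ground clauses from the biconditional
    extra_vars = insts             # propositional variables for P
    return extra_clauses, extra_vars

for n in (10, 20, 50):
    c, v = ground_cost(n)
    print(f"n={n}: +{c} clauses, +{v} variables per unary definition")

# Eliminating an optional definition removes all of these by rewriting each
# occurrence of P(x) as Q(x) & R(x) before clausification.
```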
Dagstuhl News January - December 2008
"Dagstuhl News" is a publication edited especially for the members of the Foundation "Informatikzentrum Schloss Dagstuhl" to thank them for their support. The News give a summary of the scientific work being done in Dagstuhl. Each Dagstuhl Seminar is presented by a small abstract describing the contents and scientific highlights of the seminar as well as the perspectives or challenges of the research topic
Ranked Similarity Search of Scientific Datasets: An Information Retrieval Approach
In the past decade, the amount of scientific data collected and generated by scientists has grown dramatically. This growth has intensified an existing problem: in large archives consisting of datasets stored in many files, formats and locations, how can scientists find data relevant to their research interests? We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and curated methods to extract metadata from large repositories of scientific data, and then performs searches over this metadata, returning results ranked by similarity to the search criteria. We present a model of this approach and describe a specific implementation performed at an ocean-observatory data archive and now running in production. Our prototype implements scanners that extract metadata from datasets containing different kinds of environmental observations, and a search engine with a candidate similarity measure for comparing a set of search terms to the extracted metadata. We evaluate the utility of the prototype through two user studies; these studies show that the approach resonates with users and that our proposed similarity measure performs well when analyzed using standard Information Retrieval evaluation methods. We performed performance tests to explore how continued archive growth will affect our goal of interactive response, developed and applied techniques that mitigate the effects of that growth, and showed that these techniques are effective. Lastly, we describe some of the research needed to extend this initial work into a true "Google for data."
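The abstract leaves the candidate similarity measure unspecified; a generic term-vector sketch (all dataset names and metadata below are hypothetical) conveys the ranking idea:

```python
# Generic sketch of ranking datasets by similarity between search terms and
# extracted metadata; illustrative only, not the prototype's actual measure.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query: str, metadata: dict) -> list:
    q = Counter(query.lower().split())
    scored = [(name, cosine(q, Counter(text.lower().split())))
              for name, text in metadata.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

datasets = {  # hypothetical extracted metadata
    "ctd_cast_017": "temperature salinity depth ocean profile cast",
    "met_buoy_2011": "wind speed air temperature surface buoy",
}
print(rank("ocean temperature profile", datasets))
# ctd_cast_017 ranks first: it shares more terms with the query.
```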
Effective information integration and reutilization : solutions to technological deficiency and legal uncertainty
Thesis (Ph.D.), Massachusetts Institute of Technology, Engineering Systems Division, Technology, Management, and Policy Program, February 2006 ("September 2005"). By Hongwei Zhu. Includes bibliographical references (p. 141-148).

The amount of electronically accessible information has been growing exponentially, and how to use this information effectively has become a significant challenge. A post-9/11 study indicated that deficient semantic interoperability technology hindered the ability to integrate information from disparate sources in a meaningful and timely fashion to allow for preventive precautions. Meanwhile, organizations that provided useful services by combining and reusing information from publicly accessible sources have been legally challenged. The Database Directive has been introduced in the European Union, and six legislative proposals have been made in the U.S., to provide legal protection for non-copyrightable database contents; but the Directive and the proposals have differing and sometimes conflicting scope and strength, which creates legal uncertainty for value-added data reuse practices. The need for a clearer data reuse policy will become more acute as information integration technology improves and makes integration much easier. This thesis takes an interdisciplinary approach to addressing both the technology and the policy challenges identified above in the effective use and reuse of information from disparate sources.

The technology component builds upon the existing Context Interchange (COIN) framework for large-scale semantic interoperability. We focus on the problem of temporal semantic heterogeneity, where data sources and receivers make time-varying assumptions about data semantics; a collection of such time-varying assumptions is called a temporal context. We extend the COIN representation formalism to explicitly represent temporal contexts, and the COIN reasoning mechanism to reconcile temporal semantic heterogeneity in the presence of semantic heterogeneity of time. We also perform a systematic and analytic evaluation of the flexibility and scalability of the COIN approach; compared with several traditional approaches, COIN offers much greater flexibility and scalability.

For the policy component, we develop an economic model that formalizes the policy instruments in one of the latest legislative proposals in the U.S. The model allows us to identify the circumstances under which legal protection for non-copyrightable content is needed, the different conditions, and the corresponding policy choices. Our analysis indicates that depending on the cost of database creation, the degree of differentiation of the reuser's database, and the efficiency of policy administration, the optimal policy choice can be protecting a legal monopoly, encouraging competition via compulsory licensing, discouraging voluntary licensing, or even allowing free riding. The results provide useful insights for the formulation of a socially beneficial database protection policy.
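Temporal semantic heterogeneity is easy to illustrate with a hypothetical case: a source that reports prices in French francs before 1999 and in euros afterwards, while the receiver always expects euros. This is a sketch of the problem being mediated, not of the COIN machinery itself:

```python
# Hypothetical illustration of temporal context reconciliation: the source's
# currency assumption varies with time, and mediation must detect and
# reconcile it. A sketch of the problem only; COIN itself is a logic-based
# mediation framework, not this function.
from datetime import date

FRF_PER_EUR = 6.55957  # fixed franc/euro conversion rate set in 1999

def to_receiver_context(value: float, observed: date) -> float:
    # Before 1999 the source reported francs; afterwards, euros.
    if observed < date(1999, 1, 1):
        return value / FRF_PER_EUR   # francs -> euros
    return value                     # already in euros

print(to_receiver_context(655.957, date(1998, 6, 1)))  # 100.0
print(to_receiver_context(100.0, date(2003, 6, 1)))    # 100.0
```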