
    A portable natural language interface from Arabic to SQL.

    In recent years, natural language interface systems have been built on the Front End / Back End architecture, which guarantees modularity and portability for the system as a whole. An Arabic Front End has been built that takes an input sentence and produces syntactic and semantic representations, which it maps into First Order Logic. Expressing the meaning of the user's question in terms of high-level world concepts makes the natural language interface independent of the database structure, so it is easier to port the Front End to a database for a different domain. The syntactic treatments are based on Generalised Phrase Structure Grammar (GPSG), whereas the semantics are expressed in formal semantics theory. The focus is mainly on providing syntactic and semantic analyses for Arabic queries based on correct Arabic linguistic principles. The proposed treatments are tested and validated by building a prototype system, implemented using one of the existing systems, called Squirrel. An Arabic morphological analyser is also proposed and implemented to distinguish between two types of morphemes: internal morphemes, which are part of the word's pattern, and external morphemes, which are independent words attached to the word but not part of its pattern. The system thus focuses on extracting morphemes from the various inflexions or forms of any Arabic word.
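A minimal sketch of the internal/external morpheme distinction described above, assuming invented clitic inventories (the thesis's actual clitic lists, and the pattern analysis applied to the remaining stem, are not shown here):

```python
# Hedged sketch: separating "external morphemes" (clitics written as part of
# the word) from the stem. The clitic inventories below are illustrative
# examples only; a real analyser would also validate the stem's pattern.

PREFIX_CLITICS = ["ال", "و", "ف", "ب"]   # article, conjunctions, preposition
SUFFIX_CLITICS = ["ها", "هم", "ه"]       # attached pronouns

def strip_external_morphemes(word):
    """Peel clitic prefixes/suffixes from a word; the remainder approximates
    the stem whose internal morphemes (its pattern) are analysed separately."""
    prefixes, suffixes = [], []
    changed = True
    while changed:
        changed = False
        for p in PREFIX_CLITICS:
            if word.startswith(p) and len(word) - len(p) >= 2:
                prefixes.append(p)
                word = word[len(p):]
                changed = True
                break
        for s in SUFFIX_CLITICS:
            if word.endswith(s) and len(word) - len(s) >= 2:
                suffixes.append(s)
                word = word[:-len(s)]
                changed = True
                break
    return prefixes, word, suffixes

print(strip_external_morphemes("والكتاب"))  # (['و', 'ال'], 'كتاب', [])
```

Stripping the conjunction و and the article ال from والكتاب ("and the book") leaves the stem كتاب, whose internal pattern would then be analysed by the morphological component.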

    An online syntactic-semantic framework for extracting lexical relations with a deterministic natural language model

    Given the extraordinary growth in online documents, methods for the automated extraction of semantic relations became popular, and shortly afterwards, necessary. This thesis proposes a new deterministic language model, with an associated artifact, which acts as an online Syntactic and Semantic Framework (SSF) for the extraction of morphosyntactic and semantic relations. The model covers all fundamental linguistic fields: morphology (formation, composition, and word paradigms), lexicography (storing words and their features in network lexicons), syntax (the composition of words into meaningful parts: phrases, sentences, and pragmatics), and semantics (determining the meaning of phrases). To achieve this, a new tagging system with more complex structures was developed. Instead of the commonly used vectored systems, this tagging system uses tree-like T-structures with hierarchical grammatical Word of Speech (WOS) and Semantic of Word (SOW) tags. For relation extraction, it was necessary to develop a syntactic (sub)model of the language, which is ultimately the foundation for performing semantic analysis. This was achieved by introducing a new 'O-structure', which represents the union of WOS/SOW features from the T-structures of words and enables the creation of syntagmatic patterns. Such patterns are a powerful mechanism for the extraction of conceptual structures (e.g., metonymies, similes, or metaphors), breaking sentences into main and subordinate clauses, or detecting a sentence's main construction parts (subject, predicate, and object). Since all program modules are developed as general and generative entities, the SSF can be used for any of the Indo-European languages, although validation and network lexicons have been developed for Croatian only.
The SSF has three types of lexicons (morphs/syllables, words, and multi-word expressions), and the main word lexicon is included in the global Linguistic Linked Open Data (LLOD) Cloud, allowing interoperability with all other world languages. The SSF model and its artifact represent a complete natural language model which can be used to extract lexical relations from single sentences, paragraphs, and large collections of documents.
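The T-structure/O-structure idea above can be sketched roughly as follows; the tag names, the flat feature-set encoding of hierarchical tags, and the subset test for syntagmatic patterns are illustrative simplifications, not the SSF's actual representations:

```python
# Hedged sketch: each word carries a hierarchical T-structure of grammatical
# (WOS) and semantic (SOW) tags; an O-structure is the union of those
# features over a phrase, against which syntagmatic patterns are matched.

def t_structure(wos_path, sow_path):
    """Encode hierarchical tag paths as 'parent/child' feature strings."""
    feats = set()
    for path in (wos_path, sow_path):
        for i in range(1, len(path) + 1):
            feats.add("/".join(path[:i]))
    return feats

def o_structure(words):
    """Union of WOS/SOW features over the words of a phrase."""
    out = set()
    for w in words:
        out |= w
    return out

def matches(o_struct, pattern):
    """A syntagmatic pattern matches when all its features are present."""
    return pattern <= o_struct

noun = t_structure(["WOS", "noun", "common"], ["SOW", "concrete"])
verb = t_structure(["WOS", "verb", "finite"], ["SOW", "action"])
phrase = o_structure([noun, verb])

# A pattern requiring a finite verb together with a concrete noun:
pattern = {"WOS/verb/finite", "SOW/concrete"}
print(matches(phrase, pattern))  # True
```

The hierarchical encoding means a pattern can match at any level of specificity: requiring `"WOS/verb"` matches any verb, while `"WOS/verb/finite"` matches only finite ones.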

    Algorithmic translation of syntactically parsed possessive constructions into a given formal language

    Numerous international studies [5; 7; 10; 17; 20] have addressed the semantic modelling of possessive constructions and the presentation of their semantic characteristics; however, the models developed so far do not provide for the automated generation of a formal sentence that corresponds exactly to a given possessive construction. This paper shows how the problem can be solved in a general form, and points out where the limits of algorithm-supported processing lie and which tasks remain to be solved.
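How an algorithmic rendering of a parsed possessive construction into a formal sentence might look can be sketched as follows; the first-order notation and the `poss` predicate are illustrative assumptions, not the paper's actual target formal language:

```python
# Hedged sketch: mapping a parsed possessive ("possessor's possessed") onto
# a first-order formula. Predicate names and the poss/2 relation are invented
# for illustration; nested possessives recurse on the possessor slot.

def possessive_to_fol(possessor, possessed, var="x"):
    """'possessor's possessed' -> exists var (possessed(var) & poss(possessor, var))"""
    return f"exists {var} ({possessed}({var}) & poss({possessor}, {var}))"

print(possessive_to_fol("john", "dog"))
# exists x (dog(x) & poss(john, x))
```

A general solution must also handle constructions where the possessor is itself complex ("John's dog's collar"), which is where the limits discussed in the paper begin to matter.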

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf
    This paper presents the new TXM software platform giving online access to Old French manuscript images and tagged transcriptions for concordancing and text mining. The platform can import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions, and can encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during import. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern searches for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the TIGERSearch engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, Ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotation).
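At its core, a concordance over word-indexed tokens, of the kind the CQP engine serves, is keyword-in-context retrieval; this toy sketch (with made-up tokens, not TXM's actual API) illustrates the idea:

```python
# Hedged sketch of keyword-in-context (KWIC) concordancing over a token
# stream. Real CQP queries match on word properties (lemma, POS, ...) via
# indexes; this toy version matches surface forms only.

def kwic(tokens, target, width=2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

text = "li rois fist li rois dire".split()  # invented Old-French-like tokens
for left, kw, right in kwic(text, "rois"):
    print(f"{left:>12} [{kw}] {right}")
```

Frequency lists fall out of the same index: counting hits per matched form, instead of printing contexts, yields the lists mentioned above.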

    Corpus-consulting probabilistic approach to parsing: the CCPX parser and its complementary components

    Corpus linguistics is now a major field in the study of language. In recent years, corpora that are syntactically analysed have become available to researchers, and these clearly have great potential for use in parsing natural language. This thesis describes a project that exploits this possibility. It makes four distinct contributions to these two fields. The first is an updated version of a corpus that is (a) analysed in terms of the rich syntax of Systemic Functional Grammar (SFG) and (b) annotated using the Extensible Markup Language (XML). The second contribution is a native XML corpus database, and the third is a sophisticated corpus query tool for accessing it. The fourth contribution is a new type of parser that is both corpus-consulting and probabilistic. It draws its knowledge of syntactic probabilities from the corpus database, and it stores its working data within the database, so that it is strongly database-oriented. SFG has been widely used in natural language generation for nearly two decades, but it has been used far less frequently in parsing (the first stage in natural language understanding). Previous SFG corpus-based parsers have used traditional parsing algorithms, but they have experienced problems of efficiency and coverage, due to (a) the richness of the syntax and (b) the challenge of parsing unrestricted spoken and written texts. The present research overcomes these problems by introducing a new type of parsing algorithm that is 'semi-deterministic' (as human readers are) and utilises its knowledge of the rules of English syntax, including their probabilities. A language, however, is constantly evolving: new words and uses are added, while others become less frequent and drop out altogether. The new parsing system seeks to replicate this. As new sentences are parsed, they are added to the corpus, and this slowly changes the frequencies of the words and the syntactic patterns. The corpus is in this sense dynamic, and so simulates a human's changing knowledge of words and syntax.
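The corpus-consulting, dynamic behaviour described above can be sketched as follows; the `DynamicCorpus` class and the word/category counts are invented for illustration and do not reflect the CCPX parser's actual data structures:

```python
# Hedged sketch: parse decisions consult frequencies drawn from the corpus,
# and each newly parsed sentence is fed back in, slowly shifting those
# frequencies, so the knowledge base is dynamic rather than fixed.
from collections import Counter

class DynamicCorpus:
    def __init__(self, initial_counts):
        # keys are (word, category) pairs; values are occurrence counts
        self.counts = Counter(initial_counts)

    def best_category(self, word):
        """Pick the currently most probable category for a word."""
        cands = {k: v for k, v in self.counts.items() if k[0] == word}
        if not cands:
            return None
        return max(cands, key=cands.get)[1]

    def observe(self, word, category):
        """Record a newly parsed example, nudging future decisions."""
        self.counts[(word, category)] += 1

corpus = DynamicCorpus({("run", "verb"): 8, ("run", "noun"): 3})
print(corpus.best_category("run"))   # verb
for _ in range(6):                   # six new nominal uses get parsed...
    corpus.observe("run", "noun")
print(corpus.best_category("run"))   # noun
```

The point of the design is visible in the last two lines: the same query gives different answers before and after new text is absorbed, mimicking a reader whose expectations track usage.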