    Key steps for the construction of a glossary based on FunGramKB Term Extractor and referred to international cooperation against organised crime and terrorism

    New technological tools for processing natural language are crucial to improving the way humans interact with machines. The Functional Grammar Knowledge Base (FunGramKB henceforth) has been designed to cover Natural Language Processing (NLP henceforth) tasks in the area of Artificial Intelligence. As a multipurpose lexico-conceptual knowledge base, FunGramKB combines linguistic knowledge and human cognitive abilities within a single system. Its conceptual module contains common-sense knowledge (Ontology), procedural knowledge (Cognicon) and knowledge about named entities representing people, places, organisations or other entities (Onomasticon). The Onomasticon is used to process information from the perspective of specialised discourse. Defining in natural language a consistent list of encyclopaedic terms from the GCTC, covering both the legal instruments and the organisations that fight organised crime and terrorism, is a stepping stone for the future development of the Onomasticon. The FunGramKB Term Extractor (FGKBTE henceforth) is used to process the information. To include the terms in the Onomasticon according to the Conceptual Representation Language (COREL henceforth) schemata, the DBpedia project has been of paramount importance in developing specific patterns for the structure of the definitions.
    Universidad de Granada. Department of English and German Philology. Master's in English Linguistics and Literature, academic year 2013-201

    LiDom builder: Automatising the construction of multilingual domain modules

    136 p. Abstract. This work presents the analysis, design and evaluation of the LiDOM Builder tool. LiDOM Builder makes it possible to extract automatically, from electronic textbooks, the Multilingual Domain Modules of technology-supported learning tools. To acquire knowledge it uses several multilingual resources, among them Wikipedia and WordNet, together with Natural Language Processing and Machine Learning techniques.
    On the way from a monolingual Domain Module to a Multilingual Domain Module, LiDOM Builder can be regarded as an evolution of the DOM-Sortze framework (Larrañaga, 2012; Larrañaga et al., 2014). To that end, LiDOM Builder provides a mechanism for representing the domain from a multilingual perspective. The Multilingual Domain Module captures knowledge at two levels: the Learning Domain Ontology (LDO), which holds the topics labelled in different languages and the pedagogical relationships among them, and the Learning Objects (LOs), that is, a collection of didactic resources annotated with metadata in those languages. LiDOM Builder allows the domain topics to be represented in all the supported languages. Each topic is linked to its equivalent label in each language. In addition, enriched metadata are used to describe the LOs so that equivalent didactic resources in different languages can be linked.
    In LiDOM Builder, the domain module is first extracted from a document written in one particular language, and multilingual resources are then used to obtain both the topics and the LOs in other languages. In this work, books written in English are the main source of information in both the tuning process and the evaluation process. Specifically, the following textbooks were used: Principles of Object Oriented Programming (Wong and Nguyen, 2010), Introduction to Astronomy (Morison, 2008) and Introduction to Molecular Biology (Raineri, 2010). As for multilingual resources, Wikipedia, WordNet and several other knowledge bases extracted from Wikipedia were used. To build Multilingual Domain Modules from textbooks, LiDOM Builder relies on three main modules: LiTeWi and LiReWi build the multilingual LDO, while LiLoWi builds the multilingual LOs. Each module is described in more detail below.
    LiTeWi (Conde et al., 2015) starts from a textbook in any learning domain and identifies multilingual terms for an Educational Ontology. To do so it relies on unsupervised extraction techniques, such as TF-IDF, KP-Miner, CValue and Shallow Parsing Grammar, and on Wikipedia. Extracting the ontology topics in LiTeWi takes three steps: first, candidate terms are extracted; second, the terms obtained are combined and refined into the final term list; and finally, the terms in the list are mapped to other languages using Wikipedia.
    LiReWi (Conde et al., pending acceptance) enriches the Educational Ontology with pedagogical relationships, again using the textbook as the starting point. It extracts four types of pedagogical relationships (isA, partOf, prerequisite and pedagogicallyClose) by combining several techniques and knowledge bases. The knowledge bases include Wikipedia, WordNet, WikiTaxonomy, WibiTaxonomy and WikiRelations. LiReWi also takes three steps to obtain the relationships: first, it maps the ontology topics to the knowledge bases that will be used for relationship extraction; next, it concurrently runs several relationship extractors, each based on a different technique, to extract candidate relationships; and finally, it combines and filters all the results to obtain the final set of pedagogical relationships. In addition, in the transition from DOM-Sortze to LiDOM Builder, this thesis improved the isA and partOf relationships extracted from document indexes, using Wikipedia as an additional resource (Conde et al., 2014).
    LiLoWi extracts LOs, some of them multilingual, not only from the source textbook but also from knowledge bases such as Wikipedia or WordNet. After mapping each LDO topic to Wikipedia and WordNet, LiLoWi extracts didactic resources using several LO extractors. In the LO extraction process, on the way from DOM-Sortze to LiDOM Builder and before Wikipedia and WordNet were used, support for English was also added and evaluated (Conde et al., 2012).
    Regarding the evaluation of LiDOM Builder, each module was tested and evaluated separately using both a gold-standard technique and expert evaluation. The improvement that integrating the Wikipedia and WordNet knowledge bases brings to LO extraction was also evaluated. The results obtained can be said to be very good in all cases.
    Finally, in summary, LiDOM Builder makes four main contributions to the field of Multilingual Domain Modules: (1) a suitable mechanism for representing Multilingual Domain Modules; (2) the development of LiTeWi, which enables the extraction of multilingual terminology for Educational Ontologies from textbooks (the term extractor for English and Spanish is available at https://github.com/Neuw84/LiTe); (3) the development of LiReWi, which enables the extraction of pedagogical relationships for Educational Ontologies from textbooks (the Wikipedia/WordNet mapper it uses is available at https://github.com/Neuw84/Wikipedia2WordNet); and (4) the development of LiLoWi, which enables the extraction of multilingual LOs using the textbook and the Wikipedia and WordNet knowledge bases
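
The first LiTeWi step, candidate-term extraction, names TF-IDF among its techniques. The following is only a minimal sketch of that one idea, not the LiTeWi implementation: it assumes each textbook section is treated as a document, scores single words with the plain `tf * log(N / df)` variant, and uses a hypothetical function name `tfidf_candidates`. The abstract does not say which TF-IDF variant or tokenisation the tool uses.

```python
import math
from collections import Counter


def tfidf_candidates(sections, top_k=3):
    """Score words per section with tf * log(N / df) and return the top_k words overall."""
    # Naive whitespace tokenisation (an assumption; real term extractors handle multiword terms).
    tokenized = [section.lower().split() for section in sections]
    n = len(tokenized)
    # Document frequency: in how many sections each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    best = {}
    for tokens in tokenized:
        counts = Counter(tokens)
        for word, count in counts.items():
            score = (count / len(tokens)) * math.log(n / df[word])
            # Keep the highest score a word reaches in any section.
            best[word] = max(best.get(word, 0.0), score)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(best, key=lambda w: (-best[w], w))[:top_k]
```

Words that occur in every section get a score of zero, which is why function words such as "the" drop out of the candidate list without a stop-word list.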

    Vigorous Module Based Data Management

    Data plays a central role in everyday life and should be stored using as little memory as possible. Government bodies, hospitals, schools and other organisations all need their own databases. Data therefore has to be saved to the database in response to users' queries with low memory consumption. We have developed a novel technique for saving data to a database using a file similarity algorithm. The technique splits a text file into paragraphs and saves each paragraph under a reference number. These reference numbers are stored in the database; when a paragraph that is already stored appears in another text file, the system finds it in the database, reuses its existing reference, and saves only the paragraphs that are new for that file. This technique requires less memory and stores the data in an organised manner
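
The paragraph-reference idea described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: paragraphs are assumed to be separated by blank lines, identical paragraphs are detected with a SHA-256 digest (the abstract only speaks of a "file similarity algorithm"), an in-memory dictionary stands in for the database, and the names `ParagraphStore`, `add_file` and `read_file` are hypothetical.

```python
import hashlib


def split_paragraphs(text):
    # Assumption: paragraphs are separated by blank lines.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


class ParagraphStore:
    """Store each distinct paragraph once; describe files as lists of reference numbers."""

    def __init__(self):
        self.ref_by_digest = {}  # paragraph digest -> reference number
        self.text_by_ref = {}    # reference number -> paragraph text
        self.files = {}          # file name -> ordered list of reference numbers

    def add_file(self, name, text):
        refs = []
        for paragraph in split_paragraphs(text):
            digest = hashlib.sha256(paragraph.encode("utf-8")).hexdigest()
            ref = self.ref_by_digest.get(digest)
            if ref is None:
                # New paragraph: assign the next reference number and store the text.
                ref = len(self.text_by_ref) + 1
                self.ref_by_digest[digest] = ref
                self.text_by_ref[ref] = paragraph
            refs.append(ref)
        self.files[name] = refs
        return refs

    def read_file(self, name):
        # Rebuild the file from its stored reference numbers.
        return "\n\n".join(self.text_by_ref[ref] for ref in self.files[name])
```

Adding two files that share one paragraph stores that paragraph once; the second file records only the existing reference plus references for its new paragraphs. Exact hashing catches identical paragraphs only, so near-duplicates would need a real similarity measure.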

    Probabilistic Reference to Suspect or Victim in Nationality Extraction from Unstructured Crime News Documents

    Unstructured crime news documents contain valuable information that crime analysts must search for manually. Several information extraction models have been implemented to address this, but all of them leave room for improvement. This gap motivates an enhanced information extraction model that uses named entity recognition to extract the nationality from crime news documents and coreference resolution to associate the nationality with either the suspect or the victim. After extracting the nationality, the model links it to the suspect or the victim by locating all victim-related and suspect-related keywords in the text and measuring their distances from the position of the nationality keyword. Based on the total distances, a probability score algorithm decides whether the nationality more likely belongs to the victim or to the suspect. Two experiments were conducted to evaluate the nationality extractor component and the reference identification component of the model. The former achieved 90%, 94%, and 91% for precision, recall, and F-measure respectively. The latter achieved 65%, 68%, and 66% for precision, recall, and F-measure respectively. The evaluation results are promising. Keywords: information extraction, named entity recognition, coreference resolution, crime domain
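
The distance-based attribution step can be illustrated with a short sketch. The abstract does not give the scoring formula, so everything specific here is an assumption: the keyword lists are hypothetical, distance is measured in tokens, and each keyword contributes the inverse of its distance, so that closer keywords weigh more. The function name `attribute_nationality` is also hypothetical.

```python
# Hypothetical keyword lists; the paper's actual lists are not given in the abstract.
SUSPECT_KEYWORDS = {"suspect", "accused", "arrested"}
VICTIM_KEYWORDS = {"victim", "injured", "killed"}


def attribute_nationality(tokens, nationality_index):
    """Return (label, probability) for the nationality token at nationality_index."""
    closeness = {"suspect": 0.0, "victim": 0.0}
    for i, token in enumerate(tokens):
        word = token.lower().strip(".,;:")
        distance = abs(i - nationality_index)
        if distance == 0:
            continue  # skip the nationality token itself
        # Assumed scoring: inverse token distance, summed per role.
        if word in SUSPECT_KEYWORDS:
            closeness["suspect"] += 1.0 / distance
        elif word in VICTIM_KEYWORDS:
            closeness["victim"] += 1.0 / distance
    total = closeness["suspect"] + closeness["victim"]
    if total == 0:
        return None, 0.0  # no role keyword found in the text
    label = max(closeness, key=closeness.get)
    return label, closeness[label] / total
```

For the tokens of "Police arrested a Malaysian man after the victim was found" with the nationality at index 3, "arrested" lies 2 tokens away and "victim" 4 tokens away, so the sketch attributes the nationality to the suspect with probability 2/3.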

    Interactive Malayalam Question Answering System: A Neural Word Embedding And Similarity Measure Based Approach.

    This system is an automated, domain-specific knowledge repository designed to give reliable answers in Malayalam to questions about COVID-19. Both the Malayalam documents and the questions are processed with Natural Language Processing (NLP) algorithms. The semantic modelling and document conversion stages use word embeddings, specifically the Continuous Bag of Words (CBOW) model, to capture the semantics of the language. The results retrieved for a query are then ranked by cosine similarity so that the most relevant information is presented to the user first. The system relies on our own Malayalam question-answering dataset, curated from reliable, publicly accessible sources on COVID-19, which serves as the basis for the experiments. Performance is measured with the F1 score, a metric that combines precision and recall. In our experiments, the Semantic Malayalam Question-Answering System achieves an F1 score of 76%
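
The ranking step can be sketched as below. This is a toy illustration, not the system itself: the three-dimensional vectors are invented and use English words for readability, whereas a real system would learn Malayalam vectors with a CBOW model. Averaging word vectors to represent a document is an assumption, since the abstract does not say how document vectors are built. The names `embed`, `cosine` and `rank` are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors; a real system would learn these with CBOW.
EMBEDDINGS = {
    "vaccine": [0.9, 0.1, 0.0],
    "dose": [0.8, 0.2, 0.1],
    "mask": [0.1, 0.9, 0.0],
    "wear": [0.2, 0.8, 0.1],
    "fever": [0.0, 0.1, 0.9],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(question, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = embed(question)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With these toy vectors, the question "vaccine" ranks the document "vaccine dose" above "wear mask" and "fever", because its averaged vector points in nearly the same direction as the question vector.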

    Combination of web usage, content and structure information for diverse web mining applications in the tourism context and the context of users with disabilities

    188 p. This PhD focuses on the application of machine learning techniques for behaviour modelling in different types of websites. Using data mining techniques, two aspects which are problematic and difficult to solve have been addressed: getting the system to dynamically adapt to possible changes in user preferences, and extracting the information necessary for the adaptation in a way that is transparent to users, without infringing on their privacy. The work combines information of different kinds, such as usage information, content information and website structure, and uses appropriate web mining techniques to extract as much knowledge as possible from the websites. The extracted knowledge is used for different purposes: adapting websites to users by proposing interesting links, so that users can reach relevant information more easily and comfortably; discovering the interests or needs of users accessing the website and informing the service providers about them; and detecting problems during navigation.
    Systems have been successfully generated for two completely different fields: tourism, working with the website of bidasoa turismo (www.bidasoaturismo.com), and disability, working with the discapnet website (www.discapnet.com) from the ONCE/Tecnosite foundation