199 research outputs found
Supporting the Maintenance of Identifier Names: A Holistic Approach to High-Quality Automated Identifier Naming
A considerable part of the source code is identifier names-- unique lexical tokens that provide information about entities, and entity interactions, within the code. Identifier names provide human-readable descriptions of classes, functions, variables, etc. Poor or ambiguous identifier names (i.e., names that do not correctly describe the code behavior they are associated with) will lead developers to spend more time working towards understanding the code\u27s behavior. Bad naming can also have detrimental effects on tools that rely on natural language clues; degrading the quality of their output and making them unreliable. Additionally, misinterpretations of the code, caused by poor names, can result in the injection of quality issues into the system under maintenance. Thus, improved identifier naming increases developer effectiveness, higher-quality software, and higher-quality software analysis tools. In this dissertation, I establish several novel concepts that help measure and improve the quality of identifiers. The output of this dissertation work is a set of identifier name appraisal and quality tools that integrate into the developer workflow. Through a sequence of empirical studies, I have formulated a series of heuristics and linguistic patterns to evaluate the quality of identifier names in the code and provide naming structure recommendations. I envision and working towards supporting developers in integrating my contributions, discussed in this dissertation, into their development workflow to significantly improve the process of crafting and maintaining high-quality identifier names in the source code
Scalable and Declarative Information Extraction in a Parallel Data Analytics System
Informationsextraktions (IE) auf sehr großen Datenmengen erfordert hochkomplexe, skalierbare und anpassungsfähige Systeme. Obwohl zahlreiche IE-Algorithmen existieren, ist die nahtlose und erweiterbare Kombination dieser Werkzeuge in einem skalierbaren System immer noch eine große Herausforderung. In dieser Arbeit wird ein anfragebasiertes IE-System für eine parallelen Datenanalyseplattform vorgestellt, das für konkrete Anwendungsdomänen konfigurierbar ist und für Textsammlungen im Terabyte-Bereich skaliert. Zunächst werden konfigurierbare Operatoren für grundlegende IE- und Web-Analytics-Aufgaben definiert, mit denen komplexe IE-Aufgaben in Form von deklarativen Anfragen ausgedrückt werden können. Alle Operatoren werden hinsichtlich ihrer Eigenschaften charakterisiert um das Potenzial und die Bedeutung der Optimierung nicht-relationaler, benutzerdefinierter Operatoren (UDFs) für Data Flows hervorzuheben. Anschließend wird der Stand der Technik in der Optimierung nicht-relationaler Data Flows untersucht und herausgearbeitet, dass eine umfassende Optimierung von UDFs immer noch eine Herausforderung ist. Darauf aufbauend wird ein erweiterbarer, logischer Optimierer (SOFA) vorgestellt, der die Semantik von UDFs mit in die Optimierung mit einbezieht. SOFA analysiert eine kompakte Menge von Operator-Eigenschaften und kombiniert eine automatisierte Analyse mit manuellen UDF-Annotationen, um die umfassende Optimierung von Data Flows zu ermöglichen. SOFA ist in der Lage, beliebige Data Flows aus unterschiedlichen Anwendungsbereichen logisch zu optimieren, was zu erheblichen Laufzeitverbesserungen im Vergleich mit anderen Techniken führt. Als Viertes wird die Anwendbarkeit des vorgestellten Systems auf Korpora im Terabyte-Bereich untersucht und systematisch die Skalierbarkeit und Robustheit der eingesetzten Methoden und Werkzeuge beurteilt um schließlich die kritischsten Herausforderungen beim Aufbau eines IE-Systems für sehr große Datenmenge zu charakterisieren.Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system still is a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales for terabyte-sized text collections. First, configurable operators are defined for basic IE and Web Analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) for dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and highlight that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of data flows. SOFA is able to logically optimize arbitrary data flows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated. Hereby, we systematically evaluate scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets
Semantic networks
AbstractA semantic network is a graph of the structure of meaning. This article introduces semantic network systems and their importance in Artificial Intelligence, followed by I. the early background; II. a summary of the basic ideas and issues including link types, frame systems, case relations, link valence, abstraction, inheritance hierarchies and logic extensions; and III. a survey of ‘world-structuring’ systems including ontologies, causal link models, continuous models, relevance, formal dictionaries, semantic primitives and intersecting inference hierarchies. Speed and practical implementation are briefly discussed. The conclusion argues for a synthesis of relational graph theory, graph-grammar theory and order theory based on semantic primitives and multiple intersecting inference hierarchies
A Named Entity Recognition System Applied to Arabic Text in the Medical Domain
Currently, 30-35% of the global population uses the Internet. Furthermore, there is a rapidly increasing number of non-English language internet users, accompanied by an also increasing amount of unstructured text online. One area replete with underexploited online text is the Arabic medical domain, and one method that can be used to extract valuable data from Arabic medical texts is Named Entity Recognition (NER). NER is the process by which a system can automatically detect and categorise Named Entities (NE). NER has numerous applications in many domains, and medical texts are no exception. NER applied to the medical domain could assist in detection of patterns in medical records, allowing doctors to make better diagnoses and treatment decisions, enabling medical staff to quickly assess a patient's records and ensuring that patients are informed about their data, as just a few examples. However, all these applications would require a very high level of accuracy. To improve the accuracy of NER in this domain, new approaches need to be developed that are tailored to the types of named entities to be extracted and categorised. In an effort to solve this problem, this research applied Bayesian Belief Networks (BBN) to the process. BBN, a probabilistic model for prediction of random variables and their dependencies, can be used to detect and predict entities. The aim of this research is to apply BBN to the NER task to extract relevant medical entities such as disease names, symptoms, treatment methods, and diagnosis methods from modern Arabic texts in the medical domain. To achieve this aim, a new corpus related to the medical domain has been built and annotated. Our BBN approach achieved a 96.60% precision, 90.79% recall, and 93.60% F-measure for the disease entity, while for the treatment method entity, it achieved 69.33%, 70.99%, and 70.15% for precision, recall, and F-measure, respectively. For the diagnosis method and symptom categories, our system achieved 84.91% and 71.34%, respectively, for precision, 53.36% and 49.34%, respectively, for recall, and 65.53% and 58.33%, for F-measure, respectively. Our BBN strategy achieved good accuracy for NEs in the categories of disease and treatment method. However, the average word length of the other two NE categories observed, diagnosis method and symptom, may have had a negative effect on their accuracy. Overall, the application of BBN to Arabic medical NER is successful, but more development is needed to improve accuracy to a standard at which the results can be applied to real medical systems
Classifying the suras by their lexical semantics :an exploratory multivariate analysis approach to understanding the Qur'an
PhD ThesisThe Qur'an is at the heart of Islamic culture. Careful, well-informed interpretation of
it is fundamental both to the faith of millions of Muslims throughout the world, and
also to the non-Islamic world's understanding of their religion. There is a long and
venerable tradition of Qur'anic interpretation, and it has necessarily been based on
literary-historical methods for exegesis of hand-written and printed text.
Developments in electronic text representation and analysis since the second half of
the twentieth century now offer the opportunity to supplement traditional techniques
by applying the newly-emergent computational technology of exploratory
multivariate analysis to interpretation of the Qur'an. The general aim of the present
discussion is to take up that opportunity.
Specifically, the discussion develops and applies a methodology for discovering the
thematic structure of the Qur'an based on a fundamental idea in a range of
computationally oriented disciplines: that, with respect to some collection of texts, the
lexical frequency profiles of the individual texts are a good indicator of their semantic
content, and thus provide a reliable criterion for their conceptual categorization
relative to one another. This idea is applied to the discovery of thematic
interrelationships among the suras that constitute the Qur'an by abstracting lexical
frequency data from them and then analyzing that data using exploratory multivariate
methods in the hope that this will generate hypotheses about the thematic structure of
the Qur'an.
The discussion is in eight main parts. The first part introduces the discussion. The
second gives an overview of the structure and thematic content of the Qur'an and of
the tradition of Qur'anic scholarship devoted to its interpretation. The third part
xvi
defines the research question to be addressed together with a methodology for doing
so. The fourth reviews the existing literature on the research question. The fifth
outlines general principles of data creation and applies them to creation of the data on
which the analysis of the Qur'an in this study is based. The sixth outlines general
principles of exploratory multivariate analysis, describes in detail the analytical
methods selected for use, and applies them to the data created in part five. The
seventh part interprets the results of the analyses conducted in part six with reference
to the existing results in Qur'anic interpretation described in part two. And, finally, the
eighth part draws conclusions relative to the research question and identifies
directions along which the work presented in this study can be developed
Anaphora resolution for Arabic machine translation :a case study of nafs
PhD ThesisIn the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing.
This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria.Egyptian Government
Contributions to privacy in web search engines
Els motors de cerca d’Internet recullen i emmagatzemen informació sobre els seus usuaris per tal d’oferir-los millors serveis. A canvi de rebre un servei personalitzat, els usuaris perden el control de les seves pròpies dades. Els registres de cerca poden revelar informació sensible de l’usuari, o fins i tot revelar la seva identitat. En aquesta tesis tractem com limitar aquests problemes de privadesa mentre mantenim suficient informació a les dades.
La primera part d’aquesta tesis tracta els mètodes per prevenir la recollida d’informació per part dels motores de cerca. Ja que aquesta informació es requerida per oferir un servei precís, l’objectiu es proporcionar registres de cerca que siguin adequats per proporcionar personalització. Amb aquesta finalitat, proposem un protocol que empra una xarxa social per tal d’ofuscar els perfils dels usuaris.
La segona part tracta la disseminació de registres de cerca. Proposem tècniques que la permeten, proporcionant k-anonimat i minimitzant la pèrdua d’informació.Web Search Engines collects and stores information about their users in order to tailor their services better to their users' needs. Nevertheless, while receiving a personalized attention, the users lose the control over their own data. Search logs can disclose sensitive information and the identities of the users, creating risks of privacy breaches. In this thesis we discuss the problem of limiting the disclosure risks while minimizing the information loss.
The first part of this thesis focuses on the methods to prevent the gathering of information by WSEs. Since search logs are needed in order to receive an accurate service, the aim is to provide logs that are still suitable to provide personalization. We propose a protocol which uses a social network to obfuscate users' profiles.
The second part deals with the dissemination of search logs. We propose microaggregation techniques which allow the publication of search logs, providing -anonymity while minimizing the information loss
Coordinate structures: constraint-based syntax-semantics processing
Tese de doutoramento em Línguística (Linguística Computacional), apresentada à Universidade de Lisboa através da Faculdade de Letras, 200
- …