    Does Enrichment of Clinical Texts by Ontology Concepts Increases Classification Accuracy?

    In the medical domain, multiple ontologies and terminology systems are available. However, existing classification and prediction algorithms in the clinical domain often ignore or insufficiently utilize the semantic information provided in those ontologies. To address this issue, we introduce a concept for augmenting embeddings, the input to deep neural networks, with semantic information retrieved from ontologies. To do this, words and phrases of sentences are mapped to concepts of a medical ontology, aggregating synonyms in the same concept. A semantically enriched vector is generated and used for sentence classification. We study our approach on a sentence classification task using a real-world dataset which comprises 640 sentences belonging to 22 categories. A deep neural network model is defined with an embedding layer followed by two LSTM layers and two dense layers. Our experiments show that classification accuracy with content-enriched embeddings is higher for some categories than without enrichment. We conclude that semantic information from ontologies has the potential to provide a useful enrichment of text. Future research will assess to what extent semantic relationships from the ontology can be used for enrichment.
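    The mapping step described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the placeholder concept identifiers (`C_MI`, `C_HTN`), the greedy longest-match strategy, and the choice to replace matched phrases with their concept identifier. The abstract does not specify the ontology, the matching procedure, or how the enriched vector is built.

```python
# Hypothetical toy lexicon: surface forms -> placeholder concept identifiers.
# These are NOT real ontology codes; synonyms share one identifier.
CONCEPTS = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
    "high blood pressure": "C_HTN",
    "hypertension": "C_HTN",
}
MAX_PHRASE_LEN = 3  # longest phrase (in tokens) present in the toy lexicon


def enrich(sentence):
    """Replace known words/phrases with their concept identifier.

    Uses greedy longest-match over whitespace tokens, so synonyms end up
    as the same token and would share one embedding index downstream.
    """
    tokens = sentence.lower().split()
    enriched = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CONCEPTS:
                enriched.append(CONCEPTS[phrase])
                i += n
                break
        else:
            # No concept matched at this position: keep the original token.
            enriched.append(tokens[i])
            i += 1
    return enriched


if __name__ == "__main__":
    print(enrich("Patient had a heart attack"))
    print(enrich("History of myocardial infarction and hypertension"))
```

    In this sketch "heart attack" and "myocardial infarction" collapse to the same token, which is one simple way synonyms could be aggregated before an embedding layer. Other designs, such as appending the concept identifier next to the original words, are equally compatible with the abstract.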

    Privacy Enhancing Technologies for solving the privacy-personalization paradox : taxonomy and survey

    Personal data are often collected and processed in a decentralized fashion, within different contexts. For instance, with the emergence of distributed applications, several providers usually correlate their records and provide personalized services to their clients. Collected data include geographical and indoor positions of users, their movement patterns, and sensor-acquired data that may reveal users’ physical conditions, habits and interests. This may lead to undesired consequences such as unsolicited advertisement, and even to discrimination and stalking. To mitigate privacy threats, several techniques have emerged, referred to as Privacy Enhancing Technologies, PETs for short. On the one hand, the increasing pressure on service providers to protect users’ privacy has resulted in PETs being adopted. On the other hand, service providers have built their business models on personalized services, e.g. targeted ads and news. The objective of the paper is therefore to identify which of the PETs have the potential to satisfy both usually divergent - economical and ethical - purposes. This paper presents a taxonomy classifying eight categories of PETs into three groups, and for better clarity, it considers three categories of personalized services. After defining and presenting the main features of PETs with illustrative examples, the paper points out which PETs best fit each personalized service category. Then, it discusses some of the inter-disciplinary privacy challenges that may slow down the adoption of these techniques, namely technical, social, legal and economic concerns. Finally, it provides recommendations and highlights several research directions.

    On the linguistic constitution of research practices

    This thesis explores sociologists' routine research activities, including observation, participant observation, interviewing, and transcription. It suggests that the constitutive activities of sociological research methods - writing field-notes, doing looking and categorising, and the endogenous structure of members' ordinary language transactions - are suffused with culturally methodic, i.e. ordinary language, activities. "Membership categories" are the ordinary organising practices of description that society-members - including sociologists - routinely use in assembling sense of settings. This thesis addresses the procedural bases of activities which are constituent features of the research: disguising identities of informants, reviewing literature, writing-up research outcomes, and compiling bibliographies. These activities are themselves loci of practical reasoning. Whilst these activities are assemblages of members' cultural methods, they have not been recognised as "research practices" by methodologically ironic sociology. The thesis presents a series of studies in Membership Categorisation Analysis. Using both sequential and membership categorisational aspects of Conversation Analysis, as well as textual analysis of published research, this thesis examines how members' cultural practices coincide with research practices. Data are derived from a period of participant observation in an organisation, video-recordings of the organisation's work, and interviews following the 1996 bombing in Manchester. A major, cumulative theme within this thesis is confidentiality - within an organisation, within a research project and within sociology itself. Features of confidentiality are explored through ethnographic observation, textual analysis and Membership Categorisation Analysis. Membership Categorisation Analysis brings seen-but-unnoticed features of confidentiality into relief. 
Central to the thesis are the works of Edward Rose, particularly his ethnographic inquiries of Skid Row, and Harvey Sacks, on the cultural logic shared by society-members. Rose and Sacks explicate the visibility and recognition of members' activities to other members, and research activities as linguistic activities.

    Whistleblowing for Change

    The courageous acts of whistleblowing that inspired the world over the past few years have changed our perception of surveillance and control in today's information society. But what are the wider effects of whistleblowing as an act of dissent on politics, society, and the arts? How does it contribute to new courses of action, digital tools, and contents? This urgent intervention based on the work of Berlin's Disruption Network Lab examines this growing phenomenon, offering interdisciplinary pathways to empower the public by investigating whistleblowing as a developing political practice that has the ability to provoke change from within.

    Differential Privacy - A Balancing Act

    Data privacy is an ever important aspect of data analyses. Historically, a plethora of privacy techniques have been introduced to protect data, but few have stood the test of time. From investigating the overlap between big data research, and security and privacy research, I have found that differential privacy presents itself as a promising defender of data privacy. Differential privacy is a rigorous, mathematical notion of privacy. Nevertheless, privacy comes at a cost. In order to achieve differential privacy, we need to introduce some form of inaccuracy (i.e. error) to our analyses. Hence, practitioners need to engage in a balancing act between accuracy and privacy when adopting differential privacy. As a consequence, understanding this accuracy/privacy trade-off is vital to being able to use differential privacy in real data analyses. In this thesis, I aim to bridge the gap between differential privacy in theory, and differential privacy in practice. Most notably, I aim to convey a better understanding of the accuracy/privacy trade-off, by 1) implementing tools to tweak accuracy/privacy in a real use case, 2) presenting a methodology for empirically predicting error, and 3) systematizing and analyzing known accuracy improvement techniques for differentially private algorithms. Additionally, I also put differential privacy into context by investigating how it can be applied in the automotive domain. Using the automotive domain as an example, I introduce the main challenges that constitute the balancing act, and provide advice for moving forward.
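    The accuracy/privacy trade-off mentioned in this abstract can be made concrete with a small sketch. It uses the standard Laplace mechanism on a counting query, which is a textbook technique and not necessarily one of the tools developed in the thesis. The function names, the example count of 100, and the trial count are illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count under epsilon-differential privacy.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon, true_count=100, trials=10000, seed=0):
    """Estimate the average absolute error empirically for a given epsilon."""
    rng = random.Random(seed)
    total = sum(
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # Smaller epsilon means stronger privacy and a larger expected error.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean |error| ~ {mean_abs_error(eps):.3f}")
```

    The expected absolute error of Laplace noise equals its scale, here 1/epsilon, so dividing epsilon by ten multiplies the typical error by roughly ten. This is the kind of balancing act the abstract refers to.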

    Contributions to Lifelogging Protection In Streaming Environments

    Every day, more than five billion people generate some kind of data over the Internet. To access that information, we need to use search services, either in the form of Web Search Engines or through Personal Assistants. On each interaction with them, our record of actions, kept in logs, is used to offer a more useful experience. For companies, logs are also very valuable since they offer a way to monetize the service. Monetization is achieved by selling data to third parties. However, query logs could potentially expose sensitive user information (identifiers, diseases, sexual tendencies, religious beliefs) or be used for what is called “life-logging”: a continuous record of one’s daily activities. Current regulations oblige companies to protect this personal information. Protection systems for closed data sets have previously been proposed, most of them working with atomic files or structured data. Unfortunately, those systems do not fit when used in the growing real-time unstructured data environment posed by Internet services. This thesis aims to design techniques to protect the user’s sensitive information in a non-structured real-time streaming environment, guaranteeing a trade-off between data utility and protection. In this regard, three proposals have been made for efficient log protection. The first is a new method to anonymize query logs, based on probabilistic k-anonymity and some de-anonymization tools to determine possible data leaks. The second method has been improved by adding a configurable trade-off between privacy and usability, achieving a great improvement in terms of data utility. The final contribution concerns Internet-based Personal Assistants. The information generated by these devices is likely to be considered life-logging, and it can increase the user’s privacy risks. The proposal is a protection scheme that combines log anonymization and sanitizable signatures.

    Measuring Productive Depth of Vocabulary Knowledge of the Most Frequent Words

    Productive depth of vocabulary knowledge (PDVK) is associated with writing and speaking skills (Laufer & Goldstein, 2004). These skills are essential for English for Academic Purposes (EAP) students, who have difficulties with expressing themselves in oral presentations or written assignments (Evans & Green, 2007). As a result, diagnostic measurement of PDVK is of vital importance, especially in regard to the most frequent 1,000 word families because these word families cover 81% of written text and 85% of spoken text (Nation, 2006). Depth of vocabulary knowledge has been investigated and measured in various studies (see Chen & Truscott, 2010; Pigada & Schmitt, 2006; Schmitt & Meara, 1997; Schmitt, 1998, 1999; Webb, 2005, 2007a, 2007b, 2007c, 2009a, 2009b), leading to successful multi-dimensional batteries of tests for its measurement. However, no study to date has productively measured the depth (and strength) of knowledge of the most frequent words. Nation’s (2013) conception of vocabulary knowledge (the proposition that vocabulary knowledge has three main aspects: Form, Meaning, and Use) structured the current study. Considering that the development of a test battery to measure all aspects of vocabulary knowledge outlined by Nation (2013) was impractical (Ishii & Schmitt, 2009), the current Ph.D. project focused on four aspects of vocabulary knowledge: (a) word parts, (b) associations, (c) collocations, and (d) form and meaning. The study measured 46 Iranian university EAP students’ productive vocabulary knowledge of the words at the 1,000 word frequency level. One productive test of word parts, two productive tests of semantic associations (synonym & antonym, and superordination & subordination tests), one productive test of collocation, and four corresponding productive tests of form-meaning connection for the aforementioned tests were developed for the present research. 
The results showed that while the participants had a strong performance on form-meaning connection and on superordination and subordination, their knowledge of collocations was considerably lower. The results also showed that the participants’ performance on synonymy and antonymy, on association as a general term (synonym and antonym, superordination and subordination, and collocation altogether), and on word parts was not as strong as expected and was considerably lower than the maximum possible performance. Together, the findings indicate that while Iranian university students had productive Meaning knowledge of the words at the 1,000 level, they did not seem to have extensive Form knowledge of the same words, and their Use knowledge was limited. This assists in diagnosing areas of weakness and the degree to which instructional emphasis on high frequency words might improve their knowledge.

    Financial reporting fraud : a practical guide to detection and internal control

