Semantic metadata for supporting exploratory OLAP

Author: Varga Jovan
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2016

Joint supervision (cotutelle) between Universitat Politècnica de Catalunya and Aalborg Universitet. On-Line Analytical Processing (OLAP) is an approach widely used for data analysis. OLAP is based on the multidimensional (MD) data model, where factual data are related to their analytical perspectives, called dimensions, and together they form an n-dimensional data space referred to as a data cube. MD data are typically stored in a data warehouse, which integrates data from in-house data sources, and are then analyzed by means of OLAP operations, e.g., sales data can be (dis)aggregated along the location dimension. As OLAP proved to be quite intuitive, it became broadly accepted by non-technical and business users. However, as users still encountered difficulties in their analysis, different approaches focused on providing user assistance. These approaches collect situational metadata about users and their actions and provide suggestions and recommendations that can help users in their analysis. Although such metadata are extensively exploited and evidently needed, little attention has been paid to them in this context. Furthermore, emerging trends call for expanding the use of OLAP to consider external data sources and heterogeneous settings. This leads to the Exploratory OLAP approach, which argues in particular for the use of Semantic Web (SW) technologies to facilitate the description and integration of external sources. With data becoming publicly available on the (Semantic) Web, the number and diversity of non-technical users are also increasing significantly. Thus, the metadata that support their analysis become even more relevant. This PhD thesis focuses on metadata for supporting Exploratory OLAP. The study explores the kinds of metadata artifacts used for user assistance purposes and how they are exploited to provide assistance. Based on these findings, the study then aims to provide theoretical and practical means, such as models, algorithms, and tools, to address the gaps and challenges identified.
First, based on a survey of existing user assistance approaches related to OLAP, the thesis proposes the analytical metadata (AM) framework. The framework includes the definition of the assistance process, the AM artifacts classified in a taxonomy, and the organization of the artifacts together with the related types of processing that support user assistance. Second, the thesis proposes a semantic metamodel for AM. To this end, the Resource Description Framework (RDF) is used to represent the AM artifacts in a flexible and reusable manner, while the metamodeling abstraction level is used to overcome the heterogeneity of (meta)data models in the Exploratory OLAP context. Third, focusing on the schema as a fundamental metadata artifact for enabling OLAP, the thesis addresses some important challenges in constructing an MD schema on the SW using RDF. It provides the algorithms, method, and tool to construct an MD schema over statistical linked open data sets. In particular, the focus is on enabling even non-technical users to perform this task. Lastly, the thesis deals with queries as the second most relevant artifact for user assistance. In the spirit of Exploratory OLAP, the thesis proposes an RDF-based model for OLAP queries, created by instantiating the previously proposed metamodel. This model supports the sharing and reuse of queries across the SW and facilitates the preparation of metadata for assistance purposes. Finally, the results of this thesis provide metadata foundations for supporting Exploratory OLAP and advocate for greater attention to the modeling and use of semantics related to metadata.

Postprint (published version)
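
As a rough illustration of the OLAP operation mentioned in the abstract (sales data aggregated along the location dimension), the following minimal Python sketch rolls up a toy fact table from the city level to the country level. The data, the `CITY_TO_COUNTRY` hierarchy, and the `roll_up` helper are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> country (illustrative only).
CITY_TO_COUNTRY = {
    "Barcelona": "Spain",
    "Madrid": "Spain",
    "Aalborg": "Denmark",
}

# Hypothetical fact table: (city, product, sales amount).
FACTS = [
    ("Barcelona", "laptop", 10),
    ("Madrid", "laptop", 5),
    ("Aalborg", "laptop", 7),
    ("Barcelona", "phone", 3),
]


def roll_up(facts, hierarchy):
    """Aggregate sales from the city level to the country level."""
    totals = defaultdict(int)
    for city, product, amount in facts:
        # Map each city to its parent level and sum the measure.
        totals[(hierarchy[city], product)] += amount
    return dict(totals)


country_sales = roll_up(FACTS, CITY_TO_COUNTRY)
```

Drilling down would be the inverse step: returning to the finer-grained city-level facts from which the totals were computed.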

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

VBN

Diversité et recommandation : une investigation sur l’apport de la fouille d’opinions pour la distinction d’articles d’opinion dans une controverse médiatique

Author: Carvalho Baiocchi Marcela
Publication date: 01/08/2019

Web-based reading services such as Google News and Yahoo! News have become increasingly popular with the growth of online news consumption.
To help users cope with information overload on these search engines, recommender systems and personalization techniques are utilized. These services help users find content that matches their personal interests and tastes, using their browser history and past behavior as a basis for recommendations. However, recommender systems can limit diversity of thought and the range of political perspectives that circulate within the informational environment. In consequence, relevant ideas and questions may not be seen, debatable assumptions may be taken as facts, and overspecialized recommendations may reinforce confirmation bias, special interests, tribalism, and extremist opinions. When the informational environment is insufficiently diverse, there is a loss of open inquiry, dialogue and constructive disagreement—and, as a result, an overall degradation of public discourse. Studies within the artificial intelligence field that try to solve the diversity problem for news recommender systems are confronted by many questions, including the vector model representation of digital texts and the development of a statistical measure that maximizes the difference between similar articles that are proposed to the user by the recommendation process. Studies based on opinion mining techniques propose to tackle the diversity problem in a different manner, by automatically detecting the difference of perspectives between news articles that are related by content in the recommendation process. In this latter approach, the representation of digital texts in the vector model considers a set of words that are associated with opinion expressions, such as adjectives or emotions. However, those techniques are less effective in detecting differences of opinion in a publicly argued debate, because journalistic opinions are not necessarily linked with the journalist’s subjectivity or emotions. 
The aims of our research are (1) to systematize and validate an opinion mining method that can classify divergent opinions within a controversial debate in the press and (2) to explore the applicability of this method in a news recommender system. We equate controversy to an opinion debate in the press in which at least two camps are explicitly opposed in their understanding of a question of importance to their community. Our research raises questions about how to define opinion in this context and discusses the relevance of using discursive and enunciative theories in opinion mining. The experimental corpus comprises 495 opinion articles about the 2012 student protest in Quebec against the tuition fee increase announced by the government of Liberal Premier Jean Charest. Articles were classified into two categories, ETUD and GOUV, representing the two types of opinion that dominated the debate: those that favored the students and the continuation of the strike, and those that favored the government and criticized the student movement. Methodologically, our research is based on the approach of previous studies that explore techniques from corpus linguistics in the context of opinion mining, as well as on theoretical concepts from François Rastier's Interpretative Semantics. Our research systematizes the steps of this approach, advocating a contrastive and interpretative description of the corpus, with the aim of discovering linguistic features that better describe the types of opinion to be classified. This approach allows us to select textual features that are interpretable and descriptive of the enunciative phenomena in the corpus; these features are then used to represent the digital texts in the vector model. The approach of these previous works has been validated by our analysis of the corpus.
The results show that selecting 447 textual features through an interpretative approach to the corpus performs better for the automatic classification of the opinion articles than a selection process in which the set of words is chosen without regard to linguistic factors related to the corpus. Our research also evaluated the possibility of applying this approach to the development of a news recommender system, by studying the chronological evolution of the vocabulary in the corpus. We show that features selected at the beginning of the controversy effectively predict the opinion of the articles published later, suggesting that the selection of interpretable features can benefit the development of a news recommender system that proposes opinion articles from a controversial debate.
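
The vector-model representation described in the abstract can be sketched roughly as follows: each article becomes a vector of counts over a fixed list of selected textual features. The feature list and the sample sentence below are invented for illustration; the 447 features selected in the thesis and its classifier are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical selected features (the thesis selects 447; these are made up).
FEATURES = ["strike", "tuition", "government", "students"]


def vectorize(text, features):
    """Return the count of each selected feature in the text, in feature order."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for features that do not occur in the text.
    return [counts[feature] for feature in features]


article = "Students continue the strike against the tuition increase. The strike goes on."
vector = vectorize(article, FEATURES)
```

Vectors built this way could then be passed to any standard classifier to separate the ETUD and GOUV categories; which classifier the thesis uses is not stated in the abstract.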

Dépôt Institutionnel Numérique