9 research outputs found

    Estimating The Quality Of Data Using Provenance: A Case Study In Escience

    Data quality assessment is a key factor in data-intensive domains. The data deluge is aggravated by an increasing need for interoperability and cooperation across groups and organizations. New alternatives must be found to select the data that best satisfy users' needs in a given context. This paper presents a strategy to provide information to support the evaluation of the quality of data sets. This strategy is based on combining metadata on the provenance of a data set (derived from workflows that generate it) and quality dimensions defined by the set's users, based on the desired context of use. Our solution, validated via a case study, takes advantage of a semantic model to preserve data provenance related to applications in a specific domain. © (2013) by the AIS/ICIS Administrative Office. All rights reserved.
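
    As an illustration only, the idea of combining provenance metadata with user-defined quality dimensions can be sketched in Python. The provenance fields (`source`, `method`, `age_days`), the two dimensions, the trusted-source table, and the weights are all hypothetical placeholders; the abstract does not specify them, and the paper's semantic model is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical provenance metadata for one data set (field names are assumptions).
@dataclass
class Provenance:
    source: str      # who produced the data set
    method: str      # workflow step that generated it
    age_days: int    # days since the generating workflow ran

# A quality dimension maps provenance metadata to a score in [0, 1].
Dimension = Callable[[Provenance], float]

def reputation(p: Provenance) -> float:
    # Assumed lookup table; in practice the user group would define it.
    trusted = {"field-station": 1.0, "volunteer": 0.5}
    return trusted.get(p.source, 0.0)

def timeliness(p: Provenance) -> float:
    # Linear decay to zero over one year (an arbitrary illustrative choice).
    return max(0.0, 1.0 - p.age_days / 365)

def assess(p: Provenance,
           dims: Dict[str, Dimension],
           weights: Dict[str, float]) -> Dict[str, float]:
    """Score each user-defined dimension and add a weighted overall value."""
    scores = {name: fn(p) for name, fn in dims.items()}
    total = sum(weights[name] for name in dims)
    scores["overall"] = sum(weights[n] * scores[n] for n in dims) / total
    return scores

# The context of use decides which dimensions apply and how much each counts.
dims = {"reputation": reputation, "timeliness": timeliness}
weights = {"reputation": 2.0, "timeliness": 1.0}
report = assess(Provenance("volunteer", "manual-entry", 73), dims, weights)
```

    Keeping the dimensions as plain functions over the provenance record lets a different user group swap in its own dimensions and weights without touching the provenance data.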

    Shadows: A New Means of Representing Documents (Shadows: uma nova forma de representar documentos)

    Advisor: Claudia Maria Bauzer Medeiros. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Document production tools are increasingly accessible and sophisticated, resulting in an exponential growth of increasingly complex, distributed and heterogeneous documents. This hampers document exchange, as well as annotation and retrieval. While information retrieval mechanisms concentrate on textual features (corpus analysis), annotation approaches either target specific formats or require that a document follow interoperability standards defined via schemas. This work presents our effort to handle these problems, providing a more flexible solution. Rather than trying to modify or convert the document itself, or to target only textual characteristics, the strategy described in this work is based on an intermediate descriptor, the document shadow. A shadow represents and summarizes domain-relevant aspects and elements of both the structure and the content of a given document. Shadows are not restricted to the description of textual features: they preserve, for example, the hierarchy among elements and describe other kinds of artifacts, such as multimedia artifacts. Furthermore, shadows can be annotated and stored in a database, thereby supporting queries on document structure and content, regardless of document format.
    Degree: Master's in Computer Science (Mestre em Ciência da Computação).
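
    A minimal sketch of the shadow idea, assuming HTML as the input format and a user group that cares only about headings and images. The `ShadowBuilder` class, the row layout, and the table schema are hypothetical and are not taken from the dissertation, which targets documents regardless of format.

```python
import sqlite3
from html.parser import HTMLParser

class ShadowBuilder(HTMLParser):
    """Collects (id, parent, kind, value) rows for the elements of interest."""
    KINDS = {"h1": "section", "h2": "subsection"}

    def __init__(self):
        super().__init__()
        self.rows = []         # the shadow: a flat list of descriptor rows
        self._open = None      # heading tag currently being read, if any
        self._section = None   # id of the enclosing h1 row (one-level hierarchy)

    def handle_starttag(self, tag, attrs):
        if tag in self.KINDS:
            self._open = tag
        elif tag == "img":
            self._add("image", dict(attrs).get("src", ""))

    def handle_data(self, data):
        if self._open and data.strip():
            self._add(self.KINDS[self._open], data.strip())

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def _add(self, kind, value):
        idx = len(self.rows)
        parent = None if kind == "section" else self._section
        self.rows.append((idx, parent, kind, value))
        if kind == "section":
            self._section = idx

doc = "<h1>Results</h1><p>text</p><h2>Maps</h2><img src='map.png'><h1>Notes</h1>"
builder = ShadowBuilder()
builder.feed(doc)
builder.close()

# Store the shadow, not the document, in a database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shadow (id INTEGER, parent INTEGER, kind TEXT, value TEXT)")
db.executemany("INSERT INTO shadow VALUES (?, ?, ?, ?)", builder.rows)

# Query on structure: which images sit under the section titled "Results"?
images = [row[0] for row in db.execute(
    "SELECT c.value FROM shadow c JOIN shadow p ON c.parent = p.id "
    "WHERE c.kind = 'image' AND p.value = 'Results'")]
```

    Because the query runs against the descriptor rows, a builder for another format could feed the same table without changing the query.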

    Supporting Data Quality Assessment in eScience: A Provenance-Based Approach (Apoio à avaliação da qualidade de dados em eScience: uma abordagem baseada em proveniência)

    Advisor: Claudia Maria Bauzer Medeiros. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Data quality is a recurrent concern in all scientific domains. Experiments analyze and manipulate many kinds of datasets, and generate data to be (re)used by other experiments. Obtaining good scientific results depends strongly on the quality of such datasets. However, the data involved in experiments are manipulated by a wide range of users, with distinct research interests, using their own vocabularies, work methodologies, models, and sampling needs. Given this scenario, a challenge in computer science is to come up with solutions that help scientists assess the quality of their data. Several efforts have addressed the estimation of quality. Some of them point out that data provenance attributes could be used to evaluate quality. However, most of these initiatives address the evaluation of a single quality attribute, frequently focusing on atomic data values, which reduces their applicability. There is therefore a need for new solutions that scientists can adopt to assess how good their data are. In this PhD research, we present an approach to this problem based on the notion of data provenance. Unlike other approaches, our proposal combines quality attributes specified within a context by specialists and metadata on the provenance of a data set. The main contributions of this work are: (i) the specification of a framework that takes advantage of data provenance to derive quality information; (ii) a methodology associated with this framework that outlines the procedures to support the assessment of quality; (iii) the proposal of two different provenance models to capture provenance information, for fixed and extensible scenarios; and (iv) validation of items (i) through (iii), with their discussion via case studies in agriculture and biodiversity.
    Degree: Doctorate in Computer Science (Doutora em Ciência da Computação).

    Linked data wrapper curation: A platform perspective

    131 p. Linked Data Wrappers (LDWs) turn Web APIs into RDF end-points, enriching the LOD cloud with current data. This potential is frequently undervalued, with LDWs regarded as mere by-products of larger endeavors, e.g. developing mashup applications. However, LDWs are mainly data-driven and not contaminated by application semantics, and hence have an important potential for reuse. If LDWs could be decoupled from their breakout projects, this would increase the chances of LDWs becoming true RDF end-points. But this vision is still threatened by LDW fragility upon API upgrades and by the risk of unmaintained LDWs. LDW curation might help. Similar to dataset curation, LDW curation aims to clean up datasets but, in this case, the dataset is implicitly described by the LDW definition, and "stains" are not limited to those related to dataset quality but also include those related to the underlying API. This requires the existence of LDW Platforms that extend existing code repositories with additional functionalities that cater for LDW definition, deployment and curation. This dissertation contributes to this vision through: (1) identifying a set of requirements for LDW Platforms; (2) instantiating these requirements in SYQL, a platform built upon Yahoo's YQL; (3) evaluating SYQL through a fully-developed proof of concept; and (4) validating the extent to which this approach facilitates LDW curation.
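
    To make the wrapper idea concrete, the sketch below lifts a JSON payload, standing in for a Web API response, into N-Triples lines. It is not SYQL or YQL code: the `wrap` function, the payload shape, and the `example.org` URIs are assumptions chosen for illustration, and no network call is made.

```python
import json

# All URIs below are made-up placeholders for illustration.
BASE = "http://example.org/photo/"
VOCAB = "http://example.org/vocab#"

def escape(text: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))

def wrap(api_response: str) -> list:
    """Lift a JSON API response into N-Triples lines, one per field."""
    triples = []
    for item in json.loads(api_response)["items"]:
        subject = f"<{BASE}{item['id']}>"
        for key, value in item.items():
            if key == "id":
                continue  # the id is already encoded in the subject URI
            triples.append(f'{subject} <{VOCAB}{key}> "{escape(str(value))}" .')
    return triples

# Stand-in for the body a Web API call would return.
response = '{"items": [{"id": "42", "title": "Sunset", "owner": "ana"}]}'
lines = wrap(response)
```

    The mapping depends only on the field names of the payload, which is also why such a wrapper breaks when an API upgrade renames or restructures those fields.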

    Introducing Shadows: Flexible Document Representation And Annotation On The Web

    The Web is witnessing an exponential growth of increasingly complex, distributed and heterogeneous documents. This hampers document exchange, as well as their annotation and retrieval. While information retrieval mechanisms concentrate on textual features (corpus analysis), annotation approaches either target specific formats or require that a document follows interoperable standards. This work presents our effort to handle these problems, providing a more flexible solution. Rather than trying to modify or convert the document itself, or to target only textual characteristics, the strategy described in this work is based on an intermediate descriptor - the document shadow. A shadow represents domain-relevant aspects and elements of both structure and content of a given document, as defined by a user group. Rather than annotating documents themselves, it is the shadows that are annotated, thereby providing independence between annotations and document formats. Our annotations take advantage of the LOD initiative. Via annotations users can derive correlations across shadows, in a flexible way. Moreover, shadows and annotations are stored in databases, therefore allowing uniform database treatments of heterogeneous documents. © 2013 IEEE.