
    Scalable aggregation predictive analytics: a query-driven machine learning approach

    We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data, allowing only aggregation operators such as COUNT to be executed over them. In this context, our methodology uses historical queries and their answers to accurately predict the answers to ad-hoc queries. We focus on the widely used set-cardinality (COUNT) aggregation query, as COUNT is a fundamental operator both for internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms to ensure high-quality prediction results. The significance of the contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, including restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity, scalability, and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark's COUNT method.
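
    As a rough illustration of the query-driven idea (a sketch of the general technique, not the authors' implementation; the query encoding, cluster count, and synthetic data below are assumptions), the Python fragment clusters historical range queries, fits a local linear regression per cluster, and answers an unseen query from the most similar cluster's estimator:

```python
# Hypothetical sketch of query-driven set-cardinality prediction (SCP):
# partition past queries, fit a local linear regression per partition, and
# answer a new query from its nearest partition's estimator.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Historical range queries encoded as [lo, hi] vectors with observed COUNT answers.
queries = rng.uniform(0, 100, size=(500, 2))
queries.sort(axis=1)
counts = (queries[:, 1] - queries[:, 0]) * 10 + rng.normal(0, 5, 500)  # synthetic answers

# (i) learn the query-answer space: group past issued queries by similarity.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(queries)

# (ii) associate each group with a local linear regression estimator.
local_models = {
    k: LinearRegression().fit(queries[kmeans.labels_ == k], counts[kmeans.labels_ == k])
    for k in range(kmeans.n_clusters)
}

def predict_count(query):
    """(iii)-(iv) find the most similar group and predict the COUNT of the answer set."""
    q = np.asarray(query, dtype=float).reshape(1, -1)
    k = int(kmeans.predict(q)[0])
    return float(local_models[k].predict(q)[0])

print(predict_count([20.0, 55.0]))  # predicted cardinality for an unseen query
```

    In the incremental setting the abstract describes, the partitioning and the local estimators would be updated as new query-answer pairs arrive rather than refit from scratch.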

    Addressing the Inadequacies of Information Available on the Internet: The Prospect for a Technical Solution

    In the past ten years the Internet has been the carrier and transmitter of vast amounts of information. Most of it has never been subjected to peer review, or even casual review, and has therefore been a source of misinformation. Additionally, there is a need for more researchers to utilize critical thinking techniques for evaluating the credibility of sources. This paper chronicles my critical and creative thinking processes and results regarding three areas of the information problems that are prevalent on the Internet. The first area is the problem of bad, biased, or incorrect information, including hoaxes and scams. I used critical thinking techniques to analyze these areas to provide a basis for defining a problem to be solved. The second area of concern is the critical thinking process that should be used to evaluate the reliability of resources and the credibility of information. This process can help prevent the Internet user from being a victim of bad or biased information. The third area deals with similar information problems that were solved both inside and outside the Internet and that could provide bases for solutions. Here, I used critical thinking in regard to other possible outcomes. I discuss what other industries, such as consumer product review companies and academia, have done to deal with similar problems. I look at Underwriters Laboratories and others who devised systems that verify the quality of products, as well as research methods that assure the quality of information. I developed a conceptual framework for a software-based solution that can help assure that high-quality information is presented on the Internet, using a process of divergent and convergent thinking to arrive at a best solution. The solution allows those who use Internet data to leave feedback, with or without evaluation comments, describing the quality and usefulness of what was presented. The results of this user feedback are not only available to others who search for this information, but can also be presented in prioritized form, from most reviewed to least reviewed, saving researchers time and effort while assuring better quality of information.
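
    As a toy illustration of the prioritization idea only (my own sketch, not the author's framework; the resource fields and ranking key are assumptions), results could be ordered from most reviewed to least reviewed, with quality ratings breaking ties:

```python
# Toy illustration of ranking Internet resources by user feedback:
# most reviewed first, so heavily evaluated items surface before unreviewed ones.
from dataclasses import dataclass

@dataclass
class WebResource:
    url: str
    reviews: int          # number of user evaluations left for this item
    avg_quality: float    # mean quality rating, e.g. on a 1-5 scale

results = [
    WebResource("http://example.org/a", reviews=12, avg_quality=4.1),
    WebResource("http://example.org/b", reviews=0,  avg_quality=0.0),
    WebResource("http://example.org/c", reviews=57, avg_quality=3.8),
]

# Most reviewed first; average quality breaks ties between equally reviewed items.
ranked = sorted(results, key=lambda r: (r.reviews, r.avg_quality), reverse=True)
for r in ranked:
    print(r.url, r.reviews, r.avg_quality)
```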

    Trusting (and Verifying) Online Intermediaries' Policing

    All is not well in the land of online self-regulation. However competently internet intermediaries police their sites, nagging questions will remain about their fairness and objectivity in doing so. Is Comcast blocking BitTorrent to stop infringement, to manage traffic, or to decrease access to content that competes with its own for viewers? How much digital due process does Google need to give a site it accuses of harboring malware? If Facebook censors a video of war carnage, is that a token of respect for the wounded or one more reflexive effort of a major company to ingratiate itself with the Washington establishment? Questions like these will persist, and erode the legitimacy of intermediary self-policing, as long as key operations of leading companies are shrouded in secrecy. Administrators must develop an institutional competence for continually monitoring rapidly changing business practices. A trusted advisory council charged with assisting the Federal Trade Commission (FTC) and Federal Communications Commission (FCC) could help courts and agencies adjudicate controversies concerning intermediary practices. An Internet Intermediary Regulatory Council (IIRC) would spur the development of expertise necessary to understand whether companies’ controversial decisions are socially responsible or purely self-interested. Monitoring is a prerequisite for assuring a level playing field online.

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions, namely technological issues, user-centred issues and use-cases, and socio-economic and legal aspects. These were assessed through two central studies: firstly, a concerted vision of the functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with a related discussion of the requirements they pose for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as coordinators of national initiatives. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    Development of a web application for the optimization of administrative processes: application of the lean methodology for priority classification

    This study is part of the context of the research project signed between the Ministry of Agriculture, Livestock and Supply (MAPA) and the Federal Institute of Education, Science and Technology Goiano (IF Goiano) for the development of methodologies with a view to implementing and operating an agreement monitoring center. Its objective is to present the web-based tool (software) developed to organize the structure of the value chain of the Ministry of Agriculture's system of agreements, map performance indicators, and classify process optimization priorities, based on parameters of efficiency and effectiveness of the reengineered processes. This is applied research in the empirical scenario of the Agreements Sector of MAPA, which uses the case study procedure to achieve its objective. The software development is based on the theoretical framework of Business Process Management (BPM) and Lean Six Sigma, with the application of the GUT Matrix and the Eisenhower Matrix for decision making. The primary data were collected through unstructured interviews with nine key informants working in the macro-processes of formalization, execution and monitoring, and rendering of accounts of agreements. The results contain the characterization of the MAPA agreements area, highlighting the main activities carried out in the three macro-processes, the description of the modelling characteristics and functionalities of the developed software, and the discussion of benefits arising from the application of information technologies to Business Process Management (BPM). It can be inferred that the Lean methodology is plausible as a logical "production line", since it is adaptable to a structured algorithm, which provides an orderly approach to problem solving focused on continuous improvement.
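
    As an illustrative sketch of the priority-classification step (the scoring scales, threshold, and example processes are assumptions, not the project's actual software), a process can be scored with the GUT Matrix (Gravity x Urgency x Tendency, each rated 1-5 and multiplied) and then placed in an Eisenhower quadrant:

```python
# Hypothetical sketch: score administrative processes with the GUT Matrix and
# classify each one into an Eisenhower quadrant for decision making.
from dataclasses import dataclass

@dataclass
class Process:
    name: str
    gravity: int     # 1-5: impact if nothing is done
    urgency: int     # 1-5: how soon action is needed
    tendency: int    # 1-5: how fast the problem worsens if ignored
    important: bool

    @property
    def gut_score(self) -> int:
        return self.gravity * self.urgency * self.tendency  # ranges from 1 to 125

def eisenhower_quadrant(p: Process, urgent_threshold: int = 4) -> str:
    urgent = p.urgency >= urgent_threshold
    if p.important and urgent:
        return "do first"
    if p.important:
        return "schedule"
    if urgent:
        return "delegate"
    return "eliminate"

processes = [
    Process("Formalization of agreements", gravity=5, urgency=4, tendency=4, important=True),
    Process("Archive legacy records", gravity=2, urgency=2, tendency=3, important=False),
]

# Highest GUT score first; the quadrant then guides what to do with each process.
for p in sorted(processes, key=lambda p: p.gut_score, reverse=True):
    print(p.name, p.gut_score, eisenhower_quadrant(p))
```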

    Critique of Architectures for Long-Term Digital Preservation

    Evolving technology and fading human memory threaten the long-term intelligibility of many kinds of documents. Furthermore, some records are susceptible to improper alterations that make them untrustworthy. Trusted Digital Repositories (TDRs) and Trustworthy Digital Objects (TDOs) seem to be the only broadly applicable digital preservation methodologies proposed. We argue that the TDR approach has shortfalls as a method for long-term digital preservation of sensitive information. Comparison of the TDR and TDO methodologies suggests differentiating near-term preservation measures from what is needed for the long term. TDO methodology addresses these needs, providing for making digital documents durably intelligible. It uses EDP standards for a few file formats and XML structures for text documents. For other information formats, intelligibility is assured by using a virtual computer. To protect sensitive information (content whose inappropriate alteration might mislead its readers), the integrity and authenticity of each TDO is made testable by embedded public-key cryptographic message digests and signatures. Key authenticity is protected recursively in a social hierarchy. The proper focus for long-term preservation technology is signed packages that each combine a record collection with its metadata and that also bind context: Trustworthy Digital Objects.
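
    A minimal sketch of the packaging idea (assuming Python with the third-party cryptography package and Ed25519 keys; the abstract does not prescribe these choices) shows how a record, its metadata, and its context can be bound into a package whose integrity and authenticity are testable via an embedded digest and signature:

```python
# Minimal sketch, not the TDO specification: bind content, metadata, and context
# into one payload whose integrity and authenticity can be re-verified later.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

record = b"scanned report, audio, or any preserved bitstream"
metadata = {"title": "Example record", "format": "application/pdf", "context": "case file"}

payload = json.dumps(
    {"metadata": metadata, "content_sha256": hashlib.sha256(record).hexdigest()},
    sort_keys=True,
).encode()

signer_key = Ed25519PrivateKey.generate()          # in practice, an institutional signing key
package = {"payload": payload, "signature": signer_key.sign(payload)}

# A future reader re-verifies authenticity without trusting the hosting repository:
signer_key.public_key().verify(package["signature"], package["payload"])  # raises if tampered
print("package verified")
```

    In the TDO approach the verifying public key would not be trusted directly; its authenticity would itself be certified recursively within a social hierarchy, as the abstract notes.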

    The development and deployment of Common Data Elements for tissue banks for translational research in cancer – An emerging standard based approach for the Mesothelioma Virtual Tissue Bank

    Background: Recent advances in genomics and proteomics and the increasing demands for biomarker validation studies have catalyzed changes in the landscape of cancer research, fueling the development of tissue banks for translational research. A result of this transformation is the need for sufficient quantities of clinically annotated and well-characterized biospecimens to support the growing needs of the cancer research community. Clinical annotation allows samples to be better matched to the research question at hand and ensures that experimental results are better understood and can be verified. To facilitate and standardize such annotation in bio-repositories, we have combined three accepted and complementary sets of data standards: the College of American Pathologists (CAP) Cancer Checklists, the protocols recommended by the Association of Directors of Anatomic and Surgical Pathology (ADASP) for pathology data, and the North American Association of Central Cancer Registries (NAACCR) elements for epidemiology, therapy, and follow-up data. Combining these approaches creates a set of International Organization for Standardization (ISO)-compliant Common Data Elements (CDEs) for the mesothelioma tissue banking initiative supported by the National Institute for Occupational Safety and Health (NIOSH) of the Centers for Disease Control and Prevention (CDC). Methods: The purpose of the project is to develop a core set of data elements for annotating mesothelioma specimens, following the standards established by the CAP checklist, the ADASP cancer protocols, and the NAACCR elements. We have associated these elements with a modeling architecture to enhance both syntactic and semantic interoperability. The system has a Java-based multi-tiered architecture based on the Unified Modeling Language (UML). Results: Common Data Elements were developed using controlled vocabulary, ontology, and semantic modeling methodology. The CDEs for each case are of different types: demographic data, epidemiologic data, clinical history, pathology data including block-level annotation, and follow-up data including treatment, recurrence, and vital status. The end result of such an effort would eventually provide an increased sample set to researchers and make the system interoperable between institutions. Conclusion: The CAP, ADASP, and NAACCR elements represent widely established data elements that are utilized in many cancer centers. Herein, we have shown that these representations can be combined and formalized to create a core set of annotations for banked mesothelioma specimens. Because these data elements are collected as part of the normal workflow of a medical center, data sets developed on the basis of these elements can be easily implemented and maintained.
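
    Purely to illustrate how block-level annotations might be grouped by the element categories listed above (the field names here are hypothetical, not the published CDE set or its UML model), a minimal data-structure sketch could look like this:

```python
# Illustrative only: grouping mesothelioma specimen annotations into the
# categories named in the abstract: demographics, epidemiology, clinical
# history, block-level pathology, and follow-up.
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class PathologyBlock:
    block_id: str
    histologic_type: str          # ideally a controlled-vocabulary term (CAP checklist style)
    tumor_present: bool

@dataclass
class MesotheliomaCase:
    case_id: str
    demographics: dict            # e.g. age at diagnosis, sex (NAACCR-style elements)
    epidemiology: dict            # e.g. asbestos exposure history
    clinical_history: dict
    pathology_blocks: List[PathologyBlock] = field(default_factory=list)
    follow_up: Optional[dict] = None  # treatment, recurrence, vital status

case = MesotheliomaCase(
    case_id="MVB-0001",
    demographics={"age_at_diagnosis": 67, "sex": "M"},
    epidemiology={"asbestos_exposure": "occupational"},
    clinical_history={"presenting_symptom": "dyspnea"},
    pathology_blocks=[PathologyBlock("B1", "epithelioid mesothelioma", True)],
    follow_up={"vital_status": "alive"},
)
print(asdict(case))  # serializable view that could be exchanged between institutions
```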

    A systematic approach to searching: an efficient and complete method to develop literature searches

    Creating search strategies for systematic reviews, finding the best balance between sensitivity and specificity, and translating search strategies between databases is challenging. Several methods describe standards for systematic search strategies, but a consistent approach for creating an exhaustive search strategy has not yet been described in enough detail to be fully replicable. The authors have established a method that describes, step by step, the process of developing a systematic search strategy as needed in a systematic review. This method describes how single-line search strategies can be prepared in a text document by typing the search syntax (such as field codes, parentheses, and Boolean operators) before copying and pasting in the search terms (keywords and free-text synonyms) found in the thesaurus. To help ensure term completeness, we developed a novel optimization technique that is mainly based on comparing the results retrieved by thesaurus terms with those retrieved by the free-text search words to identify potentially relevant candidate search terms. Macros in Microsoft Word have been developed to convert syntaxes between databases and interfaces almost automatically. This method helps information specialists in developing librarian-mediated searches for systematic reviews, as well as medical and health care practitioners who are searching for evidence to answer clinical questions. The described method can be used to create complex and comprehensive search strategies for different databases and interfaces, such as those needed when searching for relevant references for systematic reviews, and will assist both information specialists and practitioners when they are searching the biomedical literature.
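
    The syntax-translation step could be sketched roughly as follows (an approximation of what the authors implement as Microsoft Word macros; the field-code mapping shown is simplified and assumed, not their actual rule set):

```python
# Rough sketch: rewrite one interface's field codes into another's with simple
# substitutions, leaving Boolean operators and parentheses untouched.
import re

# Hypothetical, simplified field-code mapping; real translations also involve
# proximity operators, truncation symbols, and thesaurus term differences.
PUBMED_TO_EMBASE = {
    r"\[tiab\]": ":ti,ab",   # title/abstract free-text terms
    r"\[mh\]": "/exp",       # thesaurus terms (MeSH -> exploded Emtree, roughly)
}

def translate(strategy, mapping):
    """Apply each field-code substitution to a single-line search strategy."""
    for pattern, replacement in mapping.items():
        strategy = re.sub(pattern, replacement, strategy, flags=re.IGNORECASE)
    return strategy

pubmed_line = '("neoplasms"[mh] OR cancer[tiab] OR tumour[tiab]) AND screening[tiab]'
print(translate(pubmed_line, PUBMED_TO_EMBASE))
# -> ("neoplasms"/exp OR cancer:ti,ab OR tumour:ti,ab) AND screening:ti,ab
```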

    Better Evidence Syntheses: Improving literature retrieval in systematic reviews
