102 research outputs found

    The Feasibility of Brute Force Scans for Real-Time Tweet Search

    The real-time search problem requires making ingested documents immediately searchable, which presents architectural challenges for systems built around inverted indexing. In this paper, we explore a radical proposition: What if we abandon document inversion and instead adopt an architecture based on brute force scans of document representations? In such a design, “indexing” simply involves appending the parsed representation of an ingested document to an existing buffer, which is simple and fast. Quite surprisingly, experiments with TREC Microblog test collections show that query evaluation with brute force scans is feasible and performance compares favorably to a traditional search architecture based on an inverted index, especially if we take advantage of vectorized SIMD instructions and multiple cores in modern processor architectures. We believe that such a novel design is worth further exploration by IR researchers and practitioners.
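The append-and-scan design described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the class name `ScanIndex`, the whitespace tokenizer, and the term-frequency score are assumptions made for illustration, and the SIMD and multi-core optimizations the abstract mentions are left out.

```python
from collections import Counter


class ScanIndex:
    """Append-only buffer of parsed documents, searched by a linear scan."""

    def __init__(self):
        # Each entry is (doc_id, term-frequency Counter); no inverted index.
        self.buffer = []

    def add(self, doc_id, text):
        # "Indexing" is just parsing and appending to the buffer.
        self.buffer.append((doc_id, Counter(text.lower().split())))

    def search(self, query, k=10):
        terms = query.lower().split()
        hits = []
        # Scan newest-first so that ties favour recent documents.
        for doc_id, tf in reversed(self.buffer):
            score = sum(tf[t] for t in terms)  # assumed toy scoring function
            if score > 0:
                hits.append((score, doc_id))
        # The sort is stable, so equal scores keep the newest-first order.
        hits.sort(key=lambda h: h[0], reverse=True)
        return [doc_id for _, doc_id in hits[:k]]
```

A document becomes searchable as soon as `add` returns, which is the property the abstract highlights; the cost is that every query touches the whole buffer.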

    Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data

    Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015.
    As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required. To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems. In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries. The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms.
    In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, the traditional synchronization method becomes expensive and severely impacts the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms. Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
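The idea of broadcasting incremental changes instead of whole centroids can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's algorithm: centroids are kept as sparse running sums in dictionaries, and the function names `local_deltas` and `apply_deltas` are hypothetical.

```python
from collections import defaultdict


def local_deltas(assignments):
    """Collect one worker's sparse changes per cluster.

    assignments: iterable of (cluster_id, sparse_point), where a sparse
    point is a {dimension: value} dict. Only dimensions that actually
    appear in the batch end up in the result.
    """
    deltas = defaultdict(lambda: defaultdict(float))
    for cluster_id, point in assignments:
        for dim, value in point.items():
            deltas[cluster_id][dim] += value
    return {cid: dict(change) for cid, change in deltas.items()}


def apply_deltas(centroid_sums, all_deltas):
    """Merge the deltas broadcast by every worker into the shared sums."""
    for deltas in all_deltas:
        for cluster_id, change in deltas.items():
            target = centroid_sums.setdefault(cluster_id, {})
            for dim, value in change.items():
                target[dim] = target.get(dim, 0.0) + value
    return centroid_sums
```

With sparse, high-dimensional points, each delta holds only the few dimensions touched in a batch, so the message is much smaller than a full centroid vector.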

    WRITE-INTENSIVE DATA MANAGEMENT IN LOG-STRUCTURED STORAGE

    Ph.D. (Doctor of Philosophy)

    Data Science in Healthcare

    Data science is an interdisciplinary field that applies numerous techniques, such as machine learning, neural networks, and deep learning, to create value based on extracting knowledge and insights from available data. Advances in data science have a significant impact on healthcare. While advances in the sharing of medical information result in better and earlier diagnoses as well as more patient-tailored treatments, information management is also affected by trends such as increased patient centricity (with shared decision making), self-care (e.g., using wearables), and integrated care delivery. The delivery of health services is being revolutionized through the sharing and integration of health data across organizational boundaries. Via data science, researchers can deliver new approaches to merge, analyze, and process complex data and gain more actionable insights, understanding, and knowledge at the individual and population levels. This Special Issue focuses on how data science is used in healthcare (e.g., through predictive modeling) and on related topics, such as data sharing and data management.

    Cyber Attack Surface Mapping For Offensive Security Testing

    Security testing consists of automated processes, like Dynamic Application Security Testing (DAST) and Static Application Security Testing (SAST), as well as manual offensive security testing, like Penetration Testing and Red Teaming. This nonautomated testing is frequently time-constrained and difficult to scale. Previous literature suggests that most research effort is spent on improving fully automated processes or on finding specific vulnerabilities, with little spent on improving the interpretation of the scanned attack surface, which is critical to nonautomated testing. In this work, agglomerative hierarchical clustering is used to compress the Internet-facing hosts of 13 representative companies as collected by the Shodan search engine, resulting in an average 89% reduction in attack surface complexity. The work is then extended to map network services and also analyze the characteristics of the Log4Shell security vulnerability and its impact on attack surface mapping. The results highlighted outliers indicative of possible anti-patterns as well as opportunities to improve how testers and tools map the web attack surface. Ultimately the work is extended to compress web attack surfaces based on security-relevant features, demonstrating via accuracy measurements that this compression is not only feasible but can also be automated. In the process a framework is created which could be extended in future work to compress other attack surfaces, including physical structures/campuses for physical security testing and even humans for social engineering tests.

    Supporting Large Scale Communication Systems on Infrastructureless Networks Composed of Commodity Mobile Devices: Practicality, Scalability, and Security.

    Infrastructureless Delay Tolerant Networks (DTNs) composed of commodity mobile devices have the potential to support communication applications resistant to blocking and censorship, as well as certain types of surveillance. In this thesis we study the utility, practicality, robustness, and security of these networks. We collected two sets of wireless connectivity traces of commodity mobile devices with different granularity and scales. The first dataset is collected through active installation of measurement software on volunteer users' own smartphones, involving 111 users of a DTN microblogging application that we developed. The second dataset is collected through passive observation of WiFi association events on a university campus, involving 119,055 mobile devices. Simulation results show consistent message delivery performance across the two datasets. Using an epidemic flooding protocol, the large network achieves an average delivery rate of 0.71 in 24 hours and a median delivery delay of 10.9 hours. We show that this performance is appropriate for sharing information that is not time sensitive, e.g., blogs and photos. We also show that using an energy-efficient variant of the epidemic flooding protocol, even the large network can support text messages while only consuming 13.7% of a typical smartphone battery in 14 hours. We found that the network delivery rate and delay are robust to denial-of-service and censorship attacks. Attacks that randomly remove 90% of the network participants reduce delivery rates by less than 10%. Even when subjected to targeted attacks, the network suffered a less than 10% decrease in delivery rate when 40% of its participants were removed. Although structurally robust, the openness of the proposed network introduces numerous security concerns.
    The Sybil attack, in which a malicious node poses as many identities in order to gain disproportionate influence, is especially dangerous as it breaks the assumption underlying majority voting. Many defenses based on spatial variability of wireless channels exist, and we extend them to be practical for ad hoc networks of commodity 802.11 devices without mutual trust. We present the Mason test, which uses two efficient methods for separating valid channel measurement results of behaving nodes from those falsified by malicious participants.
    PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120779/1/liuyue_1.pd
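The epidemic flooding protocol evaluated in this abstract can be illustrated with a small trace-driven sketch. The contact-trace format `(time, node_a, node_b)` and the function name `epidemic_flood` are assumptions for illustration; the thesis's simulator, its energy-efficient variant, and its datasets are not reproduced here.

```python
def epidemic_flood(contacts, source, start_time=0):
    """Return {node: earliest time the message reaches that node}.

    contacts: iterable of (time, node_a, node_b) pairwise encounters.
    Under epidemic flooding, every node that holds the message copies
    it to every node it meets that does not hold it yet.
    """
    delivered = {source: start_time}
    # Process encounters in chronological order.
    for t, a, b in sorted(contacts):
        if t < start_time:
            continue  # encounter happened before the message existed
        if a in delivered and b not in delivered:
            delivered[b] = t
        elif b in delivered and a not in delivered:
            delivered[a] = t
    return delivered
```

Running this once per message over a trace gives a delivery time for each reached node, from which delivery rate and delay figures of the kind quoted above can be computed.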

    Shopping and Guns: an analysis of public discourses in social media about mall robberies in South Africa

    A research report submitted to the Faculty of Humanities, University of the Witwatersrand, Johannesburg, in partial fulfilment of the requirements for the degree of Master of Arts in International Relations, 2017.
    This research project investigates public opinions about South African mall robberies discussed on Twitter. Using the principles of discourse and multimodal analysis, it provides critical insights constructed from the represented narratives of select, proposed middle-class consumers illustrating distinct sentiments about malls, crime and shopping. Malls are empirical objects that have been trivialised as ordinary and mundane consumer sites, devoid of any sociological significance embedded within the daily practices of shopping. This paper makes the argument that when contested by criminal activity, malls become valuable sites for critical enquiry towards gaining a deeper understanding of what these shopping attitudes mean within a post-apartheid, South African consumer landscape. The central issue of crime threatening public safety at malls diverges into an array of thematic discussions, revealing distinct indoctrinations surrounding apartheid’s iniquitous system of racial and social engineering. This study’s principal argument is that anxieties concerning public safety are only the tip of the iceberg, and this serves as an entry point into a discourse contesting exclusive shopping rights above constitutional equality for all. The test tube of mall robberies mixes desirable pleasures and humanitarian moralities together and creates a volatile cocktail of conflicting consumer aspirations. In short, the public discourse of mall crimes is about maintaining self-entitled spaces of exclusivity within a desperate socioeconomic climate. This study concludes with questions and considerations raised by these authors which could springboard into opportunities for future inquiry. XL201

    24th International Conference on Information Modelling and Knowledge Bases

    In the last three decades information modelling and knowledge bases have become essentially important subjects not only in academic communities related to information systems and computer science but also in the business area where information technology is applied. The European-Japanese Conference on Information Modelling and Knowledge Bases (EJC) series originally started as a co-operation initiative between Japan and Finland in 1982. The practical operations were then organised by professor Ohsuga in Japan and professors Hannu Kangassalo and Hannu Jaakkola in Finland (Nordic countries). The geographical scope has expanded to cover Europe and also other countries. A workshop character (discussion, ample time for presentations, and a limited number of participants (50) and papers (30)) is typical of the conference. Suggested topics include, but are not limited to: 1. Conceptual modelling: Modelling and specification languages; Domain-specific conceptual modelling; Concepts, concept theories and ontologies; Conceptual modelling of large and heterogeneous systems; Conceptual modelling of spatial, temporal and biological data; Methods for developing, validating and communicating conceptual models. 2. Knowledge and information modelling and discovery: Knowledge discovery, knowledge representation and knowledge management; Advanced data mining and analysis methods; Conceptions of knowledge and information; Modelling information requirements; Intelligent information systems; Information recognition and information modelling. 3. Linguistic modelling: Models of HCI; Information delivery to users; Intelligent informal querying; Linguistic foundation of information and knowledge; Fuzzy linguistic models; Philosophical and linguistic foundations of conceptual models. 4. Cross-cultural communication and social computing: Cross-cultural support systems; Integration, evolution and migration of systems; Collaborative societies; Multicultural web-based software systems; Intercultural collaboration and support systems; Social computing, behavioral modeling and prediction. 5. Environmental modelling and engineering: Environmental information systems (architecture); Spatial, temporal and observational information systems; Large-scale environmental systems; Collaborative knowledge base systems; Agent concepts and conceptualisation; Hazard prediction, prevention and steering systems. 6. Multimedia data modelling and systems: Modelling multimedia information and knowledge; Content-based multimedia data management; Content-based multimedia retrieval; Privacy and context enhancing technologies; Semantics and pragmatics of multimedia data; Metadata for multimedia information systems. Overall we received 56 submissions. After careful evaluation, 16 papers have been selected as long papers, 17 papers as short papers, 5 papers as position papers, and 3 papers for presentation of perspective challenges. We thank all colleagues for their support of this issue of the EJC conference, especially the program committee, the organising committee, and the programme coordination team. The long and the short papers presented in the conference are revised after the conference and published in the Series of “Frontiers in Artificial Intelligence” by IOS Press (Amsterdam). The books “Information Modelling and Knowledge Bases” are edited by the Editing Committee of the conference. We believe that the conference will be productive and fruitful in the advance of research and application of information modelling and knowledge bases. Bernhard Thalheim, Hannu Jaakkola, Yasushi Kiyok