
    Scaling out for extreme scale corpus data

    Much of the previous work in Big Data has focussed on numerical sources of information. However, with the 'narrative turn' in many disciplines gathering pace and commercial organisations beginning to realise the value of their textual assets, natural language data is fast catching up as an exploitable source of information for decision making. With vast quantities of unstructured textual data on the web, in social media, and in newly digitised historical document archives, the 5Vs (Volume, Velocity, Variety, Value and Veracity) apply equally well, if not more so, to big textual data. Corpus linguistics, the computer-aided study of large collections of naturally occurring language data, has been dealing with big data for fifty years. Corpus linguistics methods impose complex requirements on the retrieval, annotation and analysis of text: displaying narrow contexts for each occurrence of a word or linguistic feature being studied, and counting co-occurrences with other words or features to determine significant patterns in language. This, coupled with the distribution of language features in accordance with Zipf's Law, poses complex challenges for data models and corpus software dealing with extreme scale language data. A related issue is the non-random nature of language and the 'burstiness' of word occurrences, or what we might put in Big Data terms as a sixth 'V' called Viscosity. We report experiments to examine and compare the capabilities of two NoSQL databases in clustered configurations for the indexing, retrieval and analysis of billion-word corpora, since this size is the current state-of-the-art in corpus linguistics. We find that modern DBMSs (Database Management Systems) are capable of handling this extreme scale corpus data set for simple queries, but are limited when querying for more frequent words or more complex queries.
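The two retrieval requirements named above — narrow contexts for each occurrence (concordance lines) and co-occurrence counts (collocations) — can be illustrated with a minimal Python sketch. The tokenisation, window size, and example sentence are illustrative choices only, not part of any system evaluated in the paper.

```python
from collections import Counter

def kwic(tokens, node, window=4):
    """Keyword-in-context: show `window` tokens either side of each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{node}] {right}")
    return lines

def collocates(tokens, node, window=4):
    """Count tokens co-occurring with `node` within +/- `window` positions."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

text = "the cat sat on the mat and the cat saw the dog".split()
print(kwic(text, "cat", window=2))
print(collocates(text, "cat", window=2).most_common(3))
```

At extreme scale the challenge the abstract describes is exactly this workload: positional lookups and windowed co-occurrence counts over billions of tokens, where Zipf's Law means a few node words generate enormous hit lists.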

    lexiDB: a scalable corpus database management system

    lexiDB is a scalable corpus database management system designed to fulfill corpus linguistics retrieval queries on multi-billion-word multiply-annotated corpora. It is based on a distributed architecture that allows the system to scale out to support ever larger text collections. This paper presents an overview of the architecture behind lexiDB as well as a demonstration of its functionality. We present lexiDB's performance metrics based on the AWS (Amazon Web Services) infrastructure with two part-of-speech and semantically tagged billion-word corpora: Historical Hansard and EEBO (Early English Books Online).

    Towards Interactive Multidimensional Visualisations for Corpus Linguistics

    We propose the novel application of dynamic and interactive visualisation techniques to support the iterative and exploratory investigations typical of the corpus linguistics methodology. Very large scale text analysis is already carried out in corpus-based language analysis by employing methods such as frequency profiling, keywords, concordancing, collocations and n-grams. However, at present only basic visualisation methods are utilised. In this paper, we describe case studies of multiple types of key word clouds and explorer tools for collocation networks, and compare network and language distance visualisations for online social networks. These are shown to fit better with the iterative data-driven corpus methodology, and permit some level of scalability to cope with ever increasing corpus size and complexity. In addition, they will allow corpus linguistic methods to be used more widely in the digital humanities and social sciences, since the learning curve with visualisations is shallower for non-experts.
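The frequency profiles and n-gram counts that feed such visualisations (word clouds, collocation networks) reduce to a simple counting operation; a minimal sketch, with an illustrative example sentence of my own choosing:

```python
from collections import Counter

def ngrams(tokens, n):
    """Contiguous n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequency_profile(tokens, n=1):
    """Rank n-grams by raw frequency -- the basic profile behind word clouds."""
    return Counter(ngrams(tokens, n)).most_common()

tokens = "to be or not to be that is the question".split()
print(frequency_profile(tokens, 2)[:3])
```

Interactive visualisation layers then map these frequencies to visual variables (font size in a cloud, edge weight in a collocation network) rather than changing the underlying counts.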

    Measurement of the Splitting Function in pp and Pb-Pb Collisions at √sNN = 5.02 TeV

    Data from heavy ion collisions suggest that the evolution of a parton shower is modified by interactions with the color charges in the dense partonic medium created in these collisions, but it is not known where in the shower evolution the modifications occur. The momentum ratio of the two leading partons, resolved as subjets, provides information about the parton shower evolution. This substructure observable, known as the splitting function, reflects the process of a parton splitting into two other partons and has been measured for jets with transverse momentum between 140 and 500 GeV, in pp and PbPb collisions at a center-of-mass energy of 5.02 TeV per nucleon pair. In central PbPb collisions, the splitting function indicates a more unbalanced momentum ratio, compared to peripheral PbPb and pp collisions. The measurements are compared to various predictions from event generators and analytical calculations.
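For orientation, the momentum-sharing observable described here is conventionally written as a momentum fraction of the two leading subjets; the definition below is the standard z_g convention (an assumption on my part — the abstract itself does not quote the formula):

```latex
z_g = \frac{\min(p_{T,1},\, p_{T,2})}{p_{T,1} + p_{T,2}},
\qquad 0 < z_g \le \tfrac{1}{2},
```

where p_{T,1} and p_{T,2} are the transverse momenta of the two leading subjets; a more unbalanced splitting, as reported for central PbPb collisions, corresponds to smaller z_g.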

    Measurement of nuclear modification factors of ϒ(1S), ϒ(2S), and ϒ(3S) mesons in PbPb collisions at √sNN = 5.02 TeV

    The cross sections for ϒ(1S), ϒ(2S), and ϒ(3S) production in lead-lead (PbPb) and proton-proton (pp) collisions at √sNN = 5.02 TeV have been measured using the CMS detector at the LHC. The nuclear modification factors, RAA, derived from the PbPb-to-pp ratio of yields for each state, are studied as functions of meson rapidity and transverse momentum, as well as PbPb collision centrality. The yields of all three states are found to be significantly suppressed, and compatible with a sequential ordering of the suppression, RAA(ϒ(1S)) > RAA(ϒ(2S)) > RAA(ϒ(3S)). The suppression of ϒ(1S) is larger than that seen at √sNN = 2.76 TeV, although the two are compatible within uncertainties. The upper limit on the RAA of ϒ(3S) integrated over pT, rapidity and centrality is 0.096 at 95% confidence level, which is the strongest suppression observed for a quarkonium state in heavy ion collisions to date. © 2019 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). Funded by SCOAP3.
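As a reminder, the nuclear modification factor RAA used throughout this abstract is conventionally defined as (standard definition, assumed rather than quoted from the text):

```latex
R_{AA}(p_T) = \frac{1}{\langle T_{AA} \rangle}\,
\frac{\mathrm{d}N_{AA}/\mathrm{d}p_T}{\mathrm{d}\sigma_{pp}/\mathrm{d}p_T},
```

where ⟨T_AA⟩ is the average nuclear overlap function for the centrality class. R_AA = 1 would indicate no medium modification; values well below 1, as measured here for all three ϒ states, indicate suppression.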

    Electroweak production of two jets in association with a Z boson in proton-proton collisions at √s = 13 TeV

    A measurement of the electroweak (EW) production of two jets in association with a Z boson in proton-proton collisions at √s = 13 TeV is presented, based on data recorded in 2016 by the CMS experiment at the LHC corresponding to an integrated luminosity of 35.9 fb⁻¹. The measurement is performed in the ℓℓjj final state, with ℓ including electrons and muons, and the jets j corresponding to the quarks produced in the hard interaction. The measured cross section in a kinematic region defined by invariant masses mℓℓ > 50 GeV, mjj > 120 GeV, and transverse momenta pT,j > 25 GeV is σEW(ℓℓjj) = 534 ± 20 (stat) (syst) fb, in agreement with leading-order standard model predictions. The final state is also used to perform a search for anomalous trilinear gauge couplings. No evidence is found, and limits on anomalous trilinear gauge couplings associated with dimension-six operators are given in the framework of an effective field theory. The corresponding 95% confidence level intervals are −2.6 < cWWW/Λ² < 2.6 TeV⁻² and −8.4 < cW/Λ² < 10.1 TeV⁻². The additional jet activity of events in a signal-enriched region is also studied, and the measurements are in agreement with predictions.

    Search for anomalous couplings in boosted WW/WZ → ℓνqq̄ production in proton-proton collisions at √s = 8 TeV


    Bose-Einstein correlations of charged hadrons in proton-proton collisions at √s = 13 TeV

    Bose-Einstein correlations of charged hadrons are measured over a broad multiplicity range, from a few particles up to about 250 reconstructed charged hadrons, in proton-proton collisions at √s = 13 TeV. The results are based on data collected using the CMS detector at the LHC during runs with a special low-pileup configuration. Three analysis techniques with different degrees of dependence on simulations are used to remove the non-Bose-Einstein background from the correlation functions. All three methods give consistent results. The measured lengths of homogeneity are studied as functions of particle multiplicity as well as average pair transverse momentum and mass. The results are compared with data from both CMS and ATLAS at √s = 7 TeV, as well as with theoretical predictions.
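The correlation functions referred to here are conventionally built as a ratio of a same-event pair distribution to a reference sample free of Bose-Einstein enhancement, then fitted with a parameterisation such as (a standard form, assumed rather than quoted from the abstract):

```latex
C_2(q) = \frac{A(q)}{B(q)}
\;\simeq\; C \left[ 1 + \lambda\, e^{-(q R_{\mathrm{inv}})^{a}} \right],
```

where q is the relative momentum of the pair, λ the correlation strength, a = 1 (exponential) or a = 2 (Gaussian) the shape parameter, and R_inv the fitted length of homogeneity studied in the abstract.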

    An embedding technique to determine ττ backgrounds in proton-proton collision data

    An embedding technique is presented to estimate standard model ττ backgrounds from data with minimal simulation input. In the data, the muons are removed from reconstructed μμ events and replaced with simulated τ leptons with the same kinematic properties. In this way, a set of hybrid events is obtained that does not rely on simulation except for the decay of the τ leptons. The challenges in describing the underlying event or the production of associated jets in the simulation are avoided. The technique described in this paper was developed for CMS. Its validation and the inherent uncertainties are also discussed. The demonstration of the performance of the technique is based on a sample of proton-proton collisions collected by CMS in 2017 at √s = 13 TeV, corresponding to an integrated luminosity of 41.5 fb⁻¹.

    Measurement of tt̄ normalised multi-differential cross sections in pp collisions at √s = 13 TeV, and simultaneous determination of the strong coupling strength, top quark pole mass, and parton distribution functions

