
    Evaluating holistic aggregators efficiently for very large datasets

    In data warehousing applications, numerous OLAP queries involve processing holistic aggregators such as the "top n," median, and quantiles. In this paper, we present a novel approach called dynamic bucketing to evaluate these aggregators efficiently. We partition data into equi-width buckets and further partition dense buckets into sub-buckets as needed by allocating and reclaiming memory space. The bucketing process dynamically adapts to the order and distribution of the input datasets. The histograms of the buckets and sub-buckets are stored in our new data structure called structure trees. A recent selection algorithm based on regular sampling is generalized and its analysis extended. We also compare our new algorithms with this generalized algorithm and several other recent algorithms. Experimental results show that our new algorithms significantly outperform prior ones not only in runtime but also in accuracy.
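
The bucketing idea can be sketched in miniature. The toy routine below is illustrative only, not the paper's dynamic-bucketing algorithm (it omits the memory allocation/reclamation and structure trees): it finds a median by histogramming values into equi-width buckets and repeatedly refining only the single bucket that must contain the target rank.

```python
def bucket_median(data, num_buckets=16, passes=3):
    """Approximate-then-exact median via equi-width bucket refinement."""
    lo, hi = min(data), max(data)
    target = (len(data) - 1) // 2          # rank of the lower median
    for _ in range(passes):
        if lo == hi:
            return lo
        width = (hi - lo) / num_buckets
        counts = [0] * num_buckets
        for x in data:                     # histogram the current range
            if lo <= x <= hi:
                idx = min(int((x - lo) / width), num_buckets - 1)
                counts[idx] += 1
        below = sum(1 for x in data if x < lo)
        # Walk buckets until the cumulative count passes the target rank,
        # then narrow the range to that single bucket.
        for idx, c in enumerate(counts):
            if below + c > target:
                lo, hi = lo + idx * width, lo + (idx + 1) * width
                break
            below += c
    # Exact scan inside the final narrow bucket.
    inside = sorted(x for x in data if lo <= x <= hi)
    before = sum(1 for x in data if x < lo)
    return inside[target - before]
```

Each pass needs only one scan of the data plus O(num_buckets) memory, which is the appeal of bucketing over fully sorting the input.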

    CMFRI Annual Report 2018-19

    CMFRI had 37 in-house research projects, 34 externally funded projects and 12 consultancy projects in operation in the year 2018-19. Total marine fish landings along the mainland coast of India for 2018 are estimated at 3.49 million tonnes, a decline of about 3.47 lakh tonnes (9%) from the 3.83 million tonnes landed in 2017. Among the nine maritime states, Gujarat remained in the first position with landings of 7.80 lakh tonnes, followed by Tamil Nadu with 7.02 lakh tonnes. Indian oil sardine, the topmost contributor to the Indian marine fish basket, recorded the sharpest fall (54%), plummeting from its first position in 2017 to ninth. Indian mackerel became the topmost resource, contributing 2.84 lakh tonnes (8.1%) to the total landings. Sustained bumper landings of red-toothed triggerfish (Odonus niger) were observed along the west coast since August 2018. There was a considerable reduction in the number of fishing days in West Bengal, Odisha, Andhra Pradesh, Tamil Nadu and Puducherry due to the cyclonic storms Titli, Gaja and Phethai. The assemblage-wise marine fish landings of Gujarat for 2018 showed the predominance of pelagic finfish resources (38%), followed by demersal resources (30%), crustaceans (25%) and molluscan resources (7%). The marine fish landings in Maharashtra during 2018 were 2.95 lakh t, a 22.5% decrease from the previous year (3.81 lakh t in 2017). The prominent species/groups that contributed to the fishery of the state were non-penaeid shrimps (12.6%), penaeid shrimps (11.4%), croakers (10.2%), threadfin breams (8.4%), Indian mackerel (7.1%), Bombay duck (5.6%) and squids (5.2%). Marine fish landings in Kerala during 2018 were 6.42 lakh t, 9.8% higher than in the previous year (2017). The major resource in the catch was Indian mackerel (12.6%), followed by oil sardine (12%), threadfin breams (8.3%), Stolephorus (8%) and penaeid shrimps (7.9%).
Pelagic finfishes dominated the landings with a share of 62%, which was 6.1% higher than the previous year's estimated pelagic catch. The total marine landings in Tamil Nadu in 2018 were 7.02 lakh t, an increase of 7% over the previous year. Pelagic finfishes formed 52.1%, demersal finfishes 33%, and crustaceans and cephalopods 7.5% each. The total landings in Puducherry were 45406 t, an increase of 68% over the previous year. Pelagic resources formed 30.5%, demersal 27.2%, crustaceans 17.7% and cephalopods 22.2%. Marine landings of Andhra Pradesh were 1.92 lakh t in 2018, a decline of 3.6% from 2017. The marine landings of the state have been in constant decline since the peak landings of 2014. Pelagic fishes were the dominant resource, followed by demersal, crustacean and molluscan resources. Lesser sardines dominated by weight, accounting for 17.8% of the total fish landed. Among pelagics, the major resources landed were clupeids (47.7%), mackerel (13.84%), carangids (12.4%), ribbonfish (7.25%), tunas (6.3%) and seerfish (3.15%). Barracuda and billfish contributed 2.49% and 1.6%, respectively. The major demersal resources were croakers (17.8%), other perches (10.2%), goatfish (9.9%), threadfin breams (8.9%) and catfish (8.6%). Crustacean landings were contributed by penaeid shrimps (68.9%), non-penaeid shrimps (2.8%), crabs (27.4%), lobsters (0.2%) and stomatopods (0.7%). The major molluscan resources were the cephalopods, comprising cuttlefishes (76.44%) and squids (23.56%). Marine landings in West Bengal during 2018 were 1.6 lakh t, a decrease of about 56% from the previous year (3.6 lakh t). The total marine landings of the Odisha coast during 2018 were estimated at 89178 t, a decline of about 30% from the previous year (126958 t). Large pelagic fish landings during 2018 were 249,876 t, an improvement of about 22% over the previous year's landings.
Tunas constituted the major share of these landings, followed by barracudas, seerfishes and billfishes. Among the maritime states, Tamil Nadu was the major contributor, followed by Kerala, Gujarat and Karnataka. Elasmobranch landings in India during 2018 were 42,117 t, a marginal increase of 2% from the previous year. Tamil Nadu and Gujarat were the major contributors. The west coast accounted for 50.5% of the landings and the east coast for 49.5%. Tamil Nadu, Puducherry, Gujarat and Daman and Diu together accounted for 68.4% of the total elasmobranch landings in the country. Bivalve production in the country in 2018 was estimated at 1,32,531 tonnes. Clams dominated the fishery, contributing 76.3% to the annual bivalve production, followed by mussels (15.3%) and oysters (8.4%). Assessment of gastropod fisheries and developments in the shell-craft industry were also part of the molluscan research.

    Relative error streaming quantiles

    Approximating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe U equipped with a total order, the task is to compute a sketch (data structure) of size poly(log n, 1/ε). Given the sketch and a query item y ∈ U, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y. Most works to date focused on additive εn error approximation, culminating in the KLL sketch that achieved optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient prior algorithms store either O(log(ε^2 n)/ε^2) or O(log^3(εn)/ε) universe items. This paper presents a randomized algorithm storing O(log^1.5(εn)/ε) items, which is within an O(√log(εn)) factor of optimal. The algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments.
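
To make the rank query concrete, the toy sketch below contrasts the exact rank with an estimate from plain reservoir sampling. This is a baseline only, not the paper's algorithm: uniform sampling gives additive error, which is precisely the behavior the paper argues is too coarse near the tails of the distribution.

```python
import random

def true_rank(stream, y):
    """Rank of y: the number of stream elements <= y."""
    return sum(1 for x in stream if x <= y)

def sample_rank(stream, y, k=2000, seed=0):
    """Estimate the rank of y from a uniform sample of k items,
    kept with classic reservoir sampling over one pass.
    Error is additive (roughly n / sqrt(k)), independent of the
    rank itself -- so relative error blows up for extreme ranks."""
    rng = random.Random(seed)
    reservoir, n = [], 0
    for x in stream:
        n += 1
        if len(reservoir) < k:
            reservoir.append(x)
        else:
            j = rng.randrange(n)           # replace with prob. k/n
            if j < k:
                reservoir[j] = x
    return n * true_rank(reservoir, y) / len(reservoir)

stream = list(range(100_000))
est = sample_rank(stream, 50_000)          # near the median: fine
```

A multiplicative-error sketch instead keeps the estimate within (1 ± ε) of the true rank everywhere, including queries deep in the tails where this baseline's fixed additive error dominates.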

    IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

    India has a rich linguistic landscape, with languages from 4 major language families spoken by over a billion people. The 22 of these languages that are listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpus for Indic languages. BPCC contains a total of 230M bitext pairs, of which 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2

    Big Data Analytics in Static and Streaming Provenance

    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2016. With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces relationships of entities over time, thus providing a unique view of the over-time behavior under study. However, provenance can be overwhelming in both volume and complexity, and the forecasting potential of provenance creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance. It presents a stream processing framework for online processing of provenance data at high arrival rates. While the former is sufficient for answering queries that are given prior to the application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that reduces the high dimensionality while effectively supporting mining tasks such as clustering, classification and association rule mining; the temporal representation can be further applied to streaming provenance as well. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.
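
The backward-query idea can be illustrated on a tiny provenance graph. The entities and derivation edges below are hypothetical, not from the dissertation; a backward (lineage) query is then simply a reverse reachability traversal over "derived-from" edges.

```python
from collections import deque

# Hypothetical provenance: each entity maps to the entities it was
# derived from (its direct ancestors).
edges = {
    "forecast": ["model_run"],
    "model_run": ["weather_obs", "ocean_obs"],
    "weather_obs": ["sensor_a"],
    "ocean_obs": ["sensor_b"],
}

def backward_query(target):
    """Return every ancestor of `target`, i.e. its full lineage."""
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

The difficulty the dissertation addresses is that in a streaming setting such a query arrives *after* much of the graph has streamed past, so the ancestors cannot simply be looked up in a dictionary built in advance.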

    Representation Learning for Words and Entities

    This thesis presents new methods for unsupervised learning of distributed representations of words and entities from text and knowledge bases. The first algorithm presented in the thesis is a multi-view algorithm for learning representations of words called Multiview Latent Semantic Analysis (MVLSA). By incorporating up to 46 different types of co-occurrence statistics for the same vocabulary of English words, I show that MVLSA outperforms other state-of-the-art word embedding models. Next, I focus on learning entity representations for search and recommendation and present the second method of this thesis, Neural Variational Set Expansion (NVSE). NVSE is also an unsupervised learning method, but it is based on the Variational Autoencoder framework. Evaluations with human annotators show that NVSE can facilitate better search and recommendation of information gathered from noisy, automatic annotation of unstructured natural language corpora. Finally, I move from unstructured data and focus on structured knowledge graphs. I present novel approaches for learning embeddings of vertices and edges in a knowledge graph that obey logical constraints.
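
The distributional intuition behind co-occurrence-based representations can be shown with a stdlib-only toy: a single view of co-occurrence counts, without MVLSA's multi-view fusion or any SVD step. The corpus here is invented for illustration.

```python
from collections import Counter
from math import sqrt

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rose as stocks rallied",
]

def cooc_vectors(sentences):
    """Represent each word by its within-sentence co-occurrence counts."""
    vecs = {}
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            ctx = words[:i] + words[i + 1:]   # all other words in sentence
            vecs.setdefault(w, Counter()).update(ctx)
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv)

vecs = cooc_vectors(corpus)
# "cat" and "dog" share contexts ("the", "chased"), so they come out
# more similar to each other than to "stocks".
```

Models like MVLSA start from many such count matrices (views) and compress them into dense low-dimensional vectors, but the similarity signal being exploited is the same.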

    Recovery of Missing Values using Matrix Decomposition Techniques

    Time series data is prominent in many real-world applications, e.g., hydrology or financial stock markets. In many of these applications, time series data is missing in blocks, i.e., multiple consecutive values are missing; in the hydrology field, for example, around 20% of the data is missing in blocks. However, many time series analysis tasks, such as prediction, require complete data. The recovery of blocks of missing values in a time series is challenging if the missing block is a peak or a valley, and more challenging still in real-world time series because of the irregularity of the data. State-of-the-art recovery techniques are suitable either for the recovery of single missing values or for the recovery of blocks of missing values in regular time series. The goal of this thesis is to propose an accurate recovery of blocks of missing values in irregular time series. The recovery solution we propose is based on matrix decomposition techniques. The main idea is to represent correlated time series as columns of an input matrix in which the missing values have been initialized, and to iteratively apply a matrix decomposition technique to refine the initialized missing values. A key property of our recovery solution is that it learns the shape, the width and the amplitude of the missing blocks from the history of the time series that contains the missing blocks and the history of its correlated time series. Our experiments on real-world hydrological time series show that our approach outperforms state-of-the-art recovery techniques for the recovery of missing blocks in irregular time series. The recovery solution is implemented as a graphical tool that displays, browses and accurately recovers missing blocks in irregular time series. The proposed approach supports learning from both highly and lowly correlated time series. This is important since lowly correlated time series, e.g., shifted time series that exhibit shape and/or trend similarities, are beneficial for the recovery process. We reduce the space complexity of the proposed solution from quadratic to linear, which allows the use of time series with long histories without prior segmentation. We prove the scalability and the correctness of the solution.
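
A minimal sketch of the iterative idea, using hypothetical data and a plain power-iteration rank-1 decomposition rather than the thesis's actual technique: initialize the missing cells, decompose the matrix, rewrite only the missing cells from the decomposition, and repeat until they settle on values consistent with the correlated columns.

```python
from math import sqrt

def rank1(X, iters=50):
    """Rank-1 approximation u * v^T of X via power iteration
    (u is unit-norm; v carries the scale)."""
    m, n = len(X), len(X[0])
    v = [1.0] * n
    for _ in range(iters):
        u = [sum(X[i][j] * v[j] for j in range(n)) for i in range(m)]
        nu = sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
        v = [sum(X[i][j] * u[i] for i in range(m)) for j in range(n)]
    return u, v

def recover(X, iters=100):
    """Fill None entries: initialize each with its column mean, then
    repeatedly decompose and refresh only the missing cells."""
    m, n = len(X), len(X[0])
    missing = [(i, j) for i in range(m) for j in range(n) if X[i][j] is None]
    Y = [row[:] for row in X]
    for j in range(n):                       # column-mean initialization
        col = [Y[i][j] for i in range(m) if Y[i][j] is not None]
        mean = sum(col) / len(col)
        for i in range(m):
            if Y[i][j] is None:
                Y[i][j] = mean
    for _ in range(iters):                   # iterative refinement
        u, v = rank1(Y)
        for i, j in missing:
            Y[i][j] = u[i] * v[j]
    return Y

# Two perfectly correlated columns (second = 2 x first), one missing
# value: the loop refines the cell toward 6.0, the rank-1-consistent value.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]
completed = recover(X)
```

Observed cells are never overwritten; only the initialized missing cells move, so the decomposition gradually transfers the shape of the correlated columns into the gap, which is the essence of the recovery scheme described above.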

    Event-Log Analysis using Clustering and Pattern Recognition

    The analysis of log files, a special case of text mining, typically serves to trace runtime errors or attacks on a system. Once error states are recognized, countermeasures can be taken to avoid them. Recognizing patterns in semi-structured log files from dynamic environments is complex and requires a multi-stage process. For analysis, the log files are converted into a structured event log. This work provides the user with a tool to detect frequent or rare events, as well as temporal patterns, in the data. To this end, several data mining techniques are combined. The central element of this work is clustering. It is investigated whether neural networks trained with unsupervised learning (autoencoders) can produce suitable representations (embeddings) of events in order to group syntactically and semantically similar instances. This serves the classification of events, outlier detection, and the inference of a comprehensible visual representation (regular expressions; pattern expressions). To find hidden patterns in the data, a second analysis step examines them using sequential pattern mining and episode mining. Pattern mining can find all patterns contained in an event log, but the enormous search space requires efficient algorithms to obtain results in reasonable time. Clustering therefore also serves to prune the search space for pattern mining. To restrict the set of results, various strategies are examined for their practical suitability in gaining new insights: on the one hand, pattern mining under various criteria (constrained pattern mining), and on the other, mining by usefulness (high-utility pattern mining). Interesting temporal patterns can then be applied to other log files to check them for occurrences of these patterns.
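
A minimal sketch of such a mining step, using a hypothetical event log and a deliberately simple notion of pattern (ordered event pairs within a sliding window) rather than a full sequential-pattern or episode-mining algorithm with constraints or utilities:

```python
from collections import Counter

# Hypothetical structured event log (one event type per entry).
log = ["login", "read", "write", "logout",
       "login", "read", "logout",
       "login", "write", "logout"]

def frequent_pairs(events, window=3, min_support=2):
    """Count ordered pairs (a, b) where b follows a within `window`
    subsequent events; keep pairs meeting the support threshold."""
    counts = Counter()
    for i, a in enumerate(events):
        for b in events[i + 1:i + 1 + window]:
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```

Even this toy shows why pruning matters: the candidate space grows with the square of the event-type vocabulary before the support threshold is applied, and real episode mining over longer patterns grows far faster, which is what motivates using the clustering step to shrink the search space first.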
