    CMFRI Annual Report 2018-19

    CMFRI had 37 in-house research projects, 34 externally funded projects and 12 consultancy projects in operation in the year 2018-19. Total marine fish landings along the mainland coast of India for the year 2018 were estimated at 3.49 million tonnes, a decline of about 3.47 lakh tonnes (9%) compared to 3.83 million tonnes in 2017. Among the nine maritime states, Gujarat remained in first position with landings of 7.80 lakh tonnes, followed by Tamil Nadu with 7.02 lakh tonnes. Indian oil sardine, previously the topmost contributor to the Indian marine fish basket, recorded the sharpest fall of 54%, plummeting to ninth position from its first position in 2017. Indian mackerel became the topmost resource with a contribution of 2.84 lakh tonnes towards the total landings (8.1%). Sustained bumper landings of red toothed triggerfish (Odonus niger) were observed along the west coast from August 2018. There was a considerable reduction in the number of fishing days in West Bengal, Odisha, Andhra Pradesh, Tamil Nadu and Puducherry due to cyclonic storms Titli, Gaja and Phethai. The assemblage-wise marine fish landings of Gujarat for the year 2018 showed the predominance of pelagic finfish resources (38%), followed by demersal (30%), crustaceans (25%) and molluscan resources (7%). The marine fish landings in Maharashtra during 2018 were 2.95 lakh t, a 22.5% decrease from the previous year (3.81 lakh t in 2017). The prominent species/groups that contributed to the fishery of the state were non-penaeid shrimps (12.6%), penaeid shrimps (11.4%), croakers (10.2%), threadfin breams (8.4%), Indian mackerel (7.1%), Bombay duck (5.6%) and squids (5.2%). Marine fish landings in Kerala during 2018 were 6.42 lakh t, which was 9.8% higher than that of the previous year (2017). The major resources in the catch were Indian mackerel (12.6%), followed by oil sardine (12%), threadfin breams (8.3%), Stolephorus (8%) and penaeid shrimps (7.9%).
Pelagic finfishes dominated the landings with a share of 62%, which was 6.1% higher than the previous year's estimated pelagic catch. The total marine landings in Tamil Nadu in 2018 were 7.02 lakh t, an increase of 7% compared to the previous year. Pelagic finfishes formed 52.1%, demersal finfishes 33%, and crustaceans and cephalopods 7.5% each. The total landings in Puducherry were 45406 t, an increase of 68% compared to the previous year. Pelagic resources formed 30.5%, demersal 27.2%, crustaceans 17.7% and cephalopods 22.2%. Marine landings of Andhra Pradesh were 1.92 lakh t in 2018. There was a decline of 3.6% in marine landings of the state from 2017 to 2018. The marine landings of the state have been in constant decline since the peak landings of 2014. Pelagic fishes were the dominant resource, followed by demersal fishes, crustaceans and molluscs. Lesser sardines dominated by weight, accounting for 17.8% of the total fish landed. Among pelagics, the major resources landed were clupeids (47.7%), mackerel (13.84%), carangids (12.4%), ribbonfish (7.25%), tunas (6.3%) and seerfish (3.15%). Barracuda and billfish contributed 2.49% and 1.6%, respectively. The major demersal resources were croakers (17.8%), other perches (10.2%), goatfish (9.9%), threadfin breams (8.9%) and catfish (8.6%). Crustacean landings comprised penaeid shrimps (68.9%), non-penaeid shrimps (2.8%), crabs (27.4%), lobsters (0.2%) and stomatopods (0.7%). The major molluscan resources were the cephalopods, which comprised cuttlefishes (76.44%) and squids (23.56%). Marine fish landings in West Bengal during 2018 were 1.6 lakh t, a decrease of about 56% compared to the previous year (3.6 lakh t). The total marine landings of the Odisha coast during 2018 were estimated at 89178 t, a decline of about 30% compared to the previous year (126958 t). Large pelagic fish landings during 2018 were 249,876 t, an improvement of about 22% over the previous year's landings.
Tunas constituted the major share of the landings, followed by barracudas, seerfishes and billfishes. Among the maritime states, Tamil Nadu was the major contributor, followed by Kerala, Gujarat and Karnataka. Elasmobranch landings in India during 2018 were 42,117 t, increasing marginally by 2% from the previous year. Tamil Nadu and Gujarat were the major contributors. The west coast accounted for 50.5% of the landings and the east coast for 49.5%. Tamil Nadu, Puducherry, Gujarat and Daman and Diu together accounted for 68.4% of the total elasmobranch landings in the country. Bivalve production in 2018 in the country was estimated at 1,32,531 tonnes. Clams dominated the fishery, contributing 76.3% of the annual bivalve production, followed by mussels (15.3%) and oysters (8.4%). Assessment of gastropod fisheries and developments in the shell craft industry were also part of the molluscan research.

    IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

    India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. The 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2

    Big Data Analytics in Static and Streaming Provenance

    Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2016

    With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces relationships of entities over time, thus providing a unique view on the over-time behavior under study. However, provenance can be overwhelming in both volume and complexity; the now forecasting potential of provenance creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance. It presents a stream processing framework for online processing of provenance data at a high receiving rate. While the former is sufficient for answering queries that are given prior to the application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that can reduce the high dimensionality while effectively supporting mining tasks like clustering, classification and association rule mining; the temporal representation can also be applied to streaming provenance. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.

    Recovery of Missing Values using Matrix Decomposition Techniques

    Time series data is prominent in many real-world applications, e.g., hydrology or financial stock markets. In many of these applications, time series data is missing in blocks, i.e., multiple consecutive values are missing. For example, in the hydrology field around 20% of the data is missing in blocks. However, many time series analysis tasks, such as prediction, require complete data. The recovery of blocks of missing values in time series is challenging if the missing block is a peak or a valley. The problem is more challenging in real-world time series because of the irregularity in the data. The state-of-the-art recovery techniques are suitable either for the recovery of single missing values or for the recovery of blocks of missing values in regular time series. The goal of this thesis is to propose an accurate recovery of blocks of missing values in irregular time series. The recovery solution we propose is based on matrix decomposition techniques. The main idea of the recovery is to represent correlated time series as columns of an input matrix in which the missing values have been initialized, and to iteratively apply a matrix decomposition technique to refine the initialized missing values. A key property of our recovery solution is that it learns the shape, the width and the amplitude of the missing blocks from the history of the time series that contains the missing blocks and the history of its correlated time series. Our experiments on real-world hydrological time series show that our approach outperforms the state-of-the-art recovery techniques for the recovery of missing blocks in irregular time series. The recovery solution is implemented as a graphical tool that displays, browses and accurately recovers missing blocks in irregular time series. The proposed approach supports learning from highly and lowly correlated time series.
This is important since lowly correlated time series, e.g., shifted time series, that exhibit shape and/or trend similarities are beneficial for the recovery process. We reduce the space complexity of the proposed solution from quadratic to linear. This allows the use of time series with long histories without prior segmentation. We prove the scalability and the correctness of the solution.
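
The refinement loop described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's algorithm: the abstract does not name the decomposition, so a truncated SVD from NumPy is used here as a stand-in, and the function name `recover_missing`, the column-mean initialization, and the `rank`, `max_iter` and `tol` parameters are all assumptions made for the example.

```python
import numpy as np

def recover_missing(X, rank=1, max_iter=500, tol=1e-8):
    """Refine NaN entries of X (columns = correlated time series).

    Assumption: a truncated SVD stands in for the unspecified
    matrix decomposition technique.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Initialize each missing value with the mean of its column's observed values.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = col_means[np.nonzero(missing)[1]]
    for _ in range(max_iter):
        # Low-rank approximation of the currently filled matrix.
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Only the missing entries are updated; observed values stay fixed.
        change = np.linalg.norm(approx[missing] - filled[missing])
        filled[missing] = approx[missing]
        if change < tol:
            break
    return filled

# Two perfectly correlated series; a block of the first one is missing.
t = np.arange(1, 21, dtype=float)
base = 2.0 + np.sin(t / 3.0)
series = np.column_stack([base, 2.0 * base])
damaged = series.copy()
damaged[8:12, 0] = np.nan
recovered = recover_missing(damaged, rank=1)
```

In this toy setting the second series carries the shape of the missing block, which is why a rank-1 approximation suffices; real, irregular series would need a more careful choice of rank and initialization.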

    Event Log Analysis Using Clustering and Pattern Recognition

    The analysis of log files, a special case of text mining, usually serves to trace runtime errors or attacks on a system. Measures can then be taken against detected error states in order to avoid them. Recognizing patterns in semi-structured log files from dynamic environments is complex and requires a multi-stage process. For analysis, the log files are converted into a structured event log. This thesis provides the user with a tool for detecting frequent or rare events as well as temporal patterns in the data. To this end, various data mining techniques are combined. The central element of this thesis is clustering. It investigates whether neural networks trained with unsupervised learning (autoencoders) can produce suitable representations (embeddings) of events in order to group syntactically and semantically similar instances. This serves to classify events, to detect outliers, and to infer a comprehensible visual representation (regular expressions; pattern expressions). To find hidden patterns in the data, a second analysis step examines them using sequential pattern mining and episode mining. Pattern mining can find all patterns contained in an event log. The enormous search space requires efficient algorithms to obtain results in reasonable time. Clustering therefore also serves to prune the search space for pattern mining. To limit the number of results, various strategies are examined for their practical suitability in order to gain new insights:
on the one hand, pattern mining under various constraints (constrained pattern mining), and on the other hand, mining by the utility of patterns (high utility pattern mining). Interesting temporal patterns can be applied to other log files to check them for occurrences of these patterns.
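
As a rough illustration of the sequential pattern mining step, the sketch below counts the support of ordered event pairs across several event sequences and keeps those that reach a minimum support. It is a deliberately tiny example restricted to patterns of length two; the function name `frequent_pairs`, the sample event names, and the support threshold are assumptions for the example and do not come from the thesis.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(sequences, min_support):
    """Return ordered event pairs (a, b), with a occurring before b,
    that appear in at least `min_support` sequences."""
    support = Counter()
    for seq in sequences:
        # A set ensures each pair is counted at most once per sequence.
        pairs = {(seq[i], seq[j]) for i, j in combinations(range(len(seq)), 2)}
        support.update(pairs)
    return {pair: count for pair, count in support.items() if count >= min_support}

# Hypothetical event sequences, e.g. one per session in a structured event log.
sessions = [
    ["login", "read", "logout"],
    ["login", "write", "logout"],
    ["login", "error"],
]
patterns = frequent_pairs(sessions, min_support=2)
```

Enumerating all pairs is quadratic in the sequence length, which hints at why pruning the search space matters once longer patterns are considered.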

    Query estimation techniques in database systems

    The effectiveness of query optimization in database systems critically depends on the system's ability to assess the execution costs of different query execution plans. For this purpose, the sizes and data distributions of the intermediate results generated during plan execution need to be estimated as accurately as possible. This estimation requires the maintenance of statistics on the data stored in the database, which are referred to as data synopses. While the problem of query cost estimation has received significant attention for over a decade, it has remained an open issue in practice, because most previous techniques have focused on singular aspects of the problem such as minimizing the estimation error of a single type of query and a single data distribution, whereas database management systems generally need to support a wide range of queries over a number of datasets. In this thesis I introduce a new technique for query result estimation, which extends existing techniques in that it offers estimation for all combinations of the three major database operators: selection, projection, and join. The approach is based on separate and independent approximations of the attribute values contained in a dataset and their frequencies. Through the use of space-filling curves, the approach extends to multi-dimensional data, while maintaining its accuracy and computational properties. The resulting estimation accuracy is competitive with specialized techniques and superior to the histogram techniques currently implemented in commercial database management systems. Because data synopses reside in main memory, they compete for available space with the database cache and query execution buffers. Consequently, the memory available to data synopses needs to be used efficiently. This results in a physical design problem for data synopses, which is to determine the best set of synopses for a given combination of datasets, queries, and available memory.
This thesis introduces a formalization of the problem and efficient algorithmic solutions. All discussed techniques are evaluated with regard to their overhead and resulting estimation accuracy on a variety of synthetic and real-life datasets.
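
To make the role of space-filling curves more concrete, the sketch below maps two-dimensional integer points to a single dimension using a Z-order (Morton) curve, so that a one-dimensional synopsis could then be built over the resulting codes. The abstract does not say which curve the thesis uses; the Z-order curve, the function names `morton_encode` and `morton_decode`, and the 16-bit coordinate width are assumptions chosen for illustration.

```python
def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x occupies the even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bit positions
    return code

def morton_decode(code: int, bits: int = 16) -> tuple[int, int]:
    """Invert morton_encode: split a Z-order code back into (x, y)."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y

# Hypothetical 2-D attribute values linearized into a 1-D domain.
points = [(3, 5), (0, 0), (7, 2)]
codes = sorted(morton_encode(x, y) for x, y in points)
```

Because the mapping is invertible and tends to keep nearby points close in the linear order, techniques designed for one-dimensional value distributions can be applied to the codes.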


    Efficient Maximum A-Posteriori Inference in Markov Logic and Application in Description Logics

    A maximum a-posteriori (MAP) query in statistical relational models computes the most probable world given evidence and further knowledge about the domain. It is arguably one of the most important types of computational problems, since it is also used as a subroutine in weight learning algorithms. In this thesis, we discuss an improved inference algorithm and an application for MAP queries. We focus on Markov logic (ML) as the statistical relational formalism. Markov logic combines Markov networks with first-order logic by attaching weights to first-order formulas. For inference, we improve existing work which translates MAP queries to integer linear programs (ILP). The motivation is that existing ILP solvers are very stable and fast and are able to precisely estimate the quality of an intermediate solution. In our work, we focus on improving the translation process so that the resulting ILPs have fewer variables and fewer constraints. Our main contribution is the Cutting Plane Aggregation (CPA) approach, which leverages symmetries in ML networks and parallelizes MAP inference. Additionally, we integrate the cutting plane inference (Riedel 2008) algorithm, which significantly reduces the number of groundings by solving multiple smaller ILPs instead of one large ILP. We present the new Markov logic engine RockIt, which outperforms state-of-the-art engines in standard Markov logic benchmarks. Afterwards, we apply the MAP query to description logics. Description logics (DL) are knowledge representation formalisms whose expressivity is higher than propositional logic but lower than first-order logic. The most popular DLs have been standardized in the ontology language OWL and are an elementary component of the Semantic Web. We combine Markov logic, which essentially follows the semantics of a log-linear model, with description logics to obtain log-linear description logics. In log-linear description logics, weights can be attached to any description logic axiom.
Furthermore, we introduce a new query type which computes the most probable 'coherent' world. Possible applications of log-linear description logics lie mainly in the areas of ontology learning and data integration. With our novel log-linear description logic reasoner ELog, we experimentally show that more expressivity increases quality and that the solutions of optimal solving strategies have higher quality than the solutions of approximate solving strategies.
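
The objective behind a MAP query in Markov logic can be illustrated on a tiny ground network: among all worlds consistent with the evidence, pick the one that maximizes the sum of weights of the satisfied ground formulas. The sketch below does this by brute-force enumeration, which is only meant to show the objective; it is not the ILP translation, CPA, or cutting plane inference used in the thesis. The function name `map_state`, the atoms, the formulas and the weights are all hypothetical.

```python
from itertools import product

def map_state(atoms, weighted_formulas, evidence):
    """Brute-force MAP: enumerate every assignment of the non-evidence atoms
    and return the world with the highest total weight of satisfied formulas."""
    free = [a for a in atoms if a not in evidence]
    best_world, best_score = None, float("-inf")
    for values in product([False, True], repeat=len(free)):
        world = dict(evidence)
        world.update(zip(free, values))
        score = sum(w for w, formula in weighted_formulas if formula(world))
        if score > best_score:
            best_world, best_score = world, score
    return best_world, best_score

# Hypothetical ground atoms and weighted ground formulas.
atoms = ["Smokes(A)", "Friends(A,B)", "Cancer(A)", "Smokes(B)"]
formulas = [
    (1.5, lambda w: (not w["Smokes(A)"]) or w["Cancer(A)"]),  # Smokes(A) => Cancer(A)
    (1.1, lambda w: (not (w["Friends(A,B)"] and w["Smokes(A)"])) or w["Smokes(B)"]),
    (0.4, lambda w: not w["Cancer(A)"]),                      # weak prior against cancer
    (0.2, lambda w: not w["Smokes(B)"]),                      # weak prior against smoking
]
evidence = {"Smokes(A)": True, "Friends(A,B)": True}
world, score = map_state(atoms, formulas, evidence)
```

Enumeration is exponential in the number of free atoms, which is the reason practical engines translate the problem into an ILP and try to keep the number of variables and constraints small.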