21 research outputs found

    Artificial intelligence and visual analytics in geographical space and cyberspace: Research opportunities and challenges

    In recent decades, we have witnessed great advances in the Internet of Things, mobile devices, sensor-based systems, and the resulting big data infrastructures, which have gradually yet fundamentally influenced the way people interact with and in the digital and physical world. Many human activities now operate not only in geographical (physical) space but also in cyberspace. Such changes have triggered a paradigm shift in geographic information science (GIScience), as cyberspace brings new perspectives on the roles played by spatial and temporal dimensions, e.g., the dilemma of placelessness and possible timelessness. With GIScience at the brink of even bigger changes made possible by machine learning and artificial intelligence, this paper highlights the challenges and opportunities associated with geographical space in relation to cyberspace, with a particular focus on data analytics and visualization, including extended AI capabilities and virtual reality representations. Consequently, we encourage the creation of synergies between the processing and analysis of geographical and cyber data to improve sustainability and solve complex problems with geospatial applications and other digital advancements in urban and environmental sciences.

    Dynamic label propagation in social networks


    The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs, in order to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. Comment: SIGIR 2023 resource paper, 13 pages

    A Comprehensive Bibliometric Analysis on Social Network Anonymization: Current Approaches and Future Directions

    In recent decades, social network anonymization has become a crucial research field due to its pivotal role in preserving users' privacy. However, the high diversity of approaches introduced in relevant studies makes it difficult to gain a profound understanding of the field. In response, the current study presents an exhaustive and well-structured bibliometric analysis of the social network anonymization field. To begin, related studies from the period 2007-2022 were collected from the Scopus database and then pre-processed. Following this, VOSviewer was used to visualize the network of authors' keywords. Subsequently, extensive statistical and network analyses were performed to identify the most prominent keywords and trending topics. Additionally, the application of co-word analysis through SciMAT and the Alluvial diagram allowed us to explore the themes of social network anonymization and scrutinize their evolution over time. These analyses culminated in an innovative taxonomy of the existing approaches and an anticipation of potential trends in this domain. To the best of our knowledge, this is the first bibliometric analysis in the social network anonymization field, and it offers a deeper understanding of the current state and an insightful roadmap for future research in this domain. Comment: 73 pages, 28 figures
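To make the co-word step concrete, the following is a minimal sketch of counting keyword co-occurrences across papers, which is the raw input that tools such as VOSviewer or SciMAT visualize. It is not the study's pipeline: the function name `coword_counts` and the sample keywords are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def coword_counts(keyword_lists):
    """Count how often each unordered keyword pair appears in the same paper."""
    counts = Counter()
    for keywords in keyword_lists:
        # Normalise case and drop duplicates within one paper.
        unique = sorted({k.strip().lower() for k in keywords})
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical author keywords for three papers (illustrative data only).
papers = [
    ["k-anonymity", "social network", "privacy"],
    ["differential privacy", "social network", "privacy"],
    ["Privacy", "graph anonymization", "social network"],
]
pairs = coword_counts(papers)
print(pairs.most_common(3))
```

Pairs are stored as alphabetically sorted tuples, so ("privacy", "social network") and ("social network", "privacy") are counted as the same edge of the keyword network.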

    m5U-GEPred: prediction of RNA 5-methyluridine sites based on sequence-derived and graph embedding features.

    5-Methyluridine (m5U) is one of the most common post-transcriptional RNA modifications and is involved in a variety of important biological processes and in disease development. The precise identification of m5U sites allows for a better understanding of the biological processes of RNA and contributes to the discovery of new RNA functional and therapeutic targets. Here, we present m5U-GEPred, a prediction framework that combines sequence characteristics and graph embedding-based information for m5U identification. The graph embedding approach was introduced to extract the global information of the training data, complementing the local information represented by conventional sequence features and thereby enhancing the prediction performance of m5U identification. m5U-GEPred outperformed the state-of-the-art m5U predictors built on two independent species, with average AUROCs of 0.984 and 0.985 when tested on human and yeast transcriptomes, respectively. To further validate the performance of our newly proposed framework, the experimentally validated m5U sites identified from Oxford Nanopore Technology (ONT) were collected as independent testing data, on which m5U-GEPred achieved reasonable prediction performance with an accuracy (ACC) of 91.84%. We hope that m5U-GEPred will serve as a useful computational alternative for m5U identification.

    Multilingual Lexicography with a Focus on Less-Resourced Languages: Data Mining, Expert Input, Crowdsourcing, and Gamification

    This paper looks at the challenges that the Kamusi Project faces in acquiring open lexical data for less-resourced languages (LRLs), of a range, depth, and quality that can be useful within Human Language Technology (HLT). These challenges include accessing and reforming existing lexicons into interoperable data, recruiting language specialists and citizen linguists, and obtaining large volumes of quality input from the crowd. We introduce our crowdsourcing model, specifically (1) motivating participation using a “play to pay” system, games, social rewards, and material prizes; (2) steering the crowd to contribute structured and reliable data via targeted questions; and (3) evaluating participants’ input through crowd validation and statistical analysis to ensure that only trustworthy material is incorporated into Kamusi’s master database. We discuss the mobile application Kamusi has developed for crowd participation, which elicits high-quality structured data directly from each language’s speakers through narrow questions that can be answered with a minimum of time and effort. Through the integration of existing lexicons, expert input, and innovative methods of acquiring knowledge from the crowd, an accurate and reliable multilingual dictionary with a focus on LRLs will grow and become available as a free public resource.

    Approach for the Development of a Framework for the Identification of Activities of Daily Living Using Sensors in Mobile Devices

    Sensors available on mobile devices allow the automatic identification of Activities of Daily Living (ADL). This paper describes an approach for the creation of a framework for the identification of ADL, taking into account several concepts, including data acquisition, data processing, data fusion, and pattern recognition. These concepts can be mapped onto different modules of the framework. The proposed framework should identify ADL without an Internet connection, performing these tasks locally on the mobile device and taking into account the hardware and software limitations of these devices. The main purpose of this paper is to present a new approach for the creation of a framework for the recognition of ADL, analyzing the sensors available in mobile devices and the existing methods available in the literature. This work was supported by FCT project UID/EEA/50008/2013. The authors would also like to acknowledge the contribution of the COST Action IC1303–AAPELE–Architectures, Algorithms and Protocols for Enhanced Living Environments.

    Multi-sensor data fusion in mobile devices for the identification of Activities of Daily Living

    Following the recent advances in technology and the growing use of mobile devices such as smartphones, several solutions may be developed to improve the quality of life of users in the context of Ambient Assisted Living (AAL). Mobile devices have different available sensors, e.g., accelerometer, gyroscope, magnetometer, microphone and Global Positioning System (GPS) receiver, which allow the acquisition of physical and physiological parameters for the recognition of different Activities of Daily Living (ADL) and the environments in which they are performed. The definition of ADL includes a well-known set of tasks, among them basic self-care tasks based on the types of skills that people usually learn in early childhood, including feeding, bathing, dressing, grooming, walking, running, jumping, climbing stairs, sleeping, watching TV, working, listening to music, cooking, eating and others. In the context of AAL, some individuals (henceforth called user or users) need particular assistance, either because the user has some sort of impairment, or because the user is old, or simply because users need or want to monitor their lifestyle. The research and development of systems that provide particular assistance to people is increasing in many areas of application. In particular, in the future, the recognition of ADL will be an important element for the development of a personal digital life coach, providing assistance to different types of users. To support the recognition of ADL, the surrounding environments should also be recognized to increase the reliability of these systems. The main focus of this Thesis is the research on methods for the fusion and classification of the data acquired by the sensors available in off-the-shelf mobile devices in order to recognize ADL in almost real-time, taking into account the large diversity of the capabilities and characteristics of the mobile devices available in the market. 
In order to achieve this objective, this Thesis started with a review of the existing methods and technologies to define the architecture and modules of the method for the identification of ADL. With this review and based on the knowledge acquired about the sensors available in off-the-shelf mobile devices, a set of tasks that may be reliably identified was defined as a basis for the remaining research and development to be carried out in this Thesis. This review also identified the main stages for the development of a new method for the identification of ADL using the sensors available in off-the-shelf mobile devices; these stages are data acquisition, data processing, data cleaning, data imputation, feature extraction, data fusion and artificial intelligence. One of the challenges is related to the different types of data acquired from the different sensors, but other challenges were found, including the presence of environmental noise, the positioning of the mobile device during the daily activities, the limited capabilities of the mobile devices and others. Based on the acquired data, the processing was performed, implementing data cleaning and feature extraction methods, in order to define a new framework for the recognition of ADL. The data imputation methods were not applied, because at this stage of the research their implementation does not influence the results of the identification of the ADL and environments, as the features are extracted from a set of data acquired during a defined time interval and there are no missing values during this stage. 
The joint selection of the set of usable sensors and the identifiable set of tasks will then allow the development of a framework that, considering multi-sensor data fusion technologies and context awareness, in coordination with other information available from the user context, such as his/her agenda and the time of the day, will make it possible to establish a profile of the tasks that the user performs on a regular day. The classification method and the algorithm for the fusion of the features for the recognition of ADL and its environments need to be deployed on a machine with some computational power, while the mobile device that will use the created framework can perform the identification of the ADL using much less computational power. Based on the results reported in the literature, the method chosen for the recognition of the ADL is composed of three variants of Artificial Neural Networks (ANN), including simple Multilayer Perceptron (MLP) networks, Feedforward Neural Networks (FNN) with Backpropagation, and Deep Neural Networks (DNN). Data acquisition can be performed with standard methods. After the acquisition, the data must be processed at the data processing stage, which includes data cleaning and feature extraction methods. The data cleaning method used for motion and magnetic sensors is the low-pass filter, in order to reduce the noise acquired; for the acoustic data, the Fast Fourier Transform (FFT) was applied to extract the different frequencies. 
When the data is clean, several features are then extracted based on the types of sensors used, including the mean, standard deviation, variance, maximum value, minimum value and median of raw data acquired from the motion and magnetic sensors; the mean, standard deviation, variance and median of the maximum peaks calculated with the raw data acquired from the motion and magnetic sensors; the five greatest distances between the maximum peaks calculated with the raw data acquired from the motion and magnetic sensors; the mean, standard deviation, variance, median and 26 Mel-Frequency Cepstral Coefficients (MFCC) of the frequencies obtained with FFT based on the raw data acquired from the microphone; and the distance travelled calculated with the data acquired from the GPS receiver. After the extraction of the features, these are grouped in different datasets for the application of the ANN methods, in order to discover the method and dataset that report the best results. The classification stage was incrementally developed, starting with the identification of the most common ADL (i.e., walking, running, going upstairs, going downstairs and standing activities) with motion and magnetic sensors. Next, the environments were identified with acoustic data, i.e., bedroom, bar, classroom, gym, kitchen, living room, hall, street and library. After the environments are recognized, and based on the different sets of sensors commonly available in the mobile devices, the data acquired from the motion and magnetic sensors were combined with the recognized environment in order to differentiate some activities without motion, i.e., sleeping and watching TV. The number of recognized activities in this stage was increased with the use of the distance travelled, extracted from the GPS receiver data, which also allows the driving activity to be recognized. 
After the implementation of the three classification methods with different numbers of iterations, datasets and remaining configurations on a machine with high processing capabilities, the reported results showed that the best method for the recognition of the most common ADL and activities without motion is the DNN method, but the best method for the recognition of environments is the FNN method with Backpropagation. Depending on the number of sensors used, this implementation reports a mean accuracy between 85.89% and 89.51% for the recognition of the most common ADL, equal to 86.50% for the recognition of environments, and equal to 100% for the recognition of activities without motion, reporting an overall accuracy between 85.89% and 92.00%. The last stage of this research work was the implementation of the structured framework for the mobile devices, which verified that the FNN method requires high processing power for the recognition of environments and that the results reported with the mobile application are lower than the results reported with the machine with high processing capabilities. Thus, the DNN method was also implemented for the recognition of the environments with the mobile devices. Finally, the results reported with the mobile devices show an accuracy between 86.39% and 89.15% for the recognition of the most common ADL, equal to 45.68% for the recognition of environments, and equal to 100% for the recognition of activities without motion, reporting an overall accuracy between 58.02% and 89.15%. Compared with the literature, the results returned by the implemented framework show only a residual improvement. However, the results reported in this research work cover the identification of more ADL than the ones described in other studies. 
The improvement in the recognition of ADL based on the mean of the accuracies is equal to 2.93%, but the maximum number of ADL and environments previously recognized was 13, while the number of ADL and environments recognized with the framework resulting from this research is 16. In conclusion, the framework developed has a mean improvement of 2.93% in the accuracy of the recognition for a larger number of ADL and environments than previously reported. In the future, the achievements reported by this PhD research may be considered as a starting point for the development of a personal digital life coach, but the number of ADL and environments recognized by the framework should be increased and the experiments should be performed with different types of devices (i.e., smartphones and smartwatches), and the data imputation and other machine learning methods should be explored in order to attempt to increase the reliability of the framework for the recognition of ADL and its environments.
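The data cleaning and feature extraction stages described for the motion and magnetic sensors can be sketched as follows. This is a minimal illustration under stated assumptions, not the Thesis implementation: the exponential-smoothing filter, its `alpha` value, the strict local-maximum peak rule, and the names `low_pass`, `peak_indices` and `window_features` are all hypothetical, and only a subset of the listed features is computed.

```python
import statistics


def low_pass(samples, alpha=0.5):
    """Exponential smoothing as a simple low-pass filter (alpha is an assumed value)."""
    filtered = [samples[0]]
    for x in samples[1:]:
        filtered.append(alpha * x + (1 - alpha) * filtered[-1])
    return filtered


def peak_indices(samples):
    """Indices of local maxima: samples strictly greater than both neighbours."""
    return [i for i in range(1, len(samples) - 1)
            if samples[i - 1] < samples[i] > samples[i + 1]]


def window_features(samples, n_distances=5):
    """Statistical features of one time window, plus the largest gaps between peaks."""
    peaks = peak_indices(samples)
    # Distances (in samples) between consecutive peaks, largest first.
    gaps = sorted((b - a for a, b in zip(peaks, peaks[1:])), reverse=True)
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "variance": statistics.pvariance(samples),
        "max": max(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "peak_gaps": gaps[:n_distances],
    }


# Hypothetical accelerometer magnitudes for one window (illustrative data only).
raw = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0]
features = window_features(low_pass(raw))
print(features)
```

The resulting dictionary corresponds to one row of a dataset that would then be passed to a classifier such as the neural networks named in the abstract.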

    Schema-aware keyword search on linked data

    Keyword search is a popular technique for querying the ever-growing repositories of RDF graph data on the Web. This is because users do not need to master complex query languages (e.g., SQL, SPARQL) or know the underlying structure of the data on the Web to compose their queries. Keyword search is simple and flexible. However, it is at the same time ambiguous, since a keyword query can be interpreted in different ways. This feature of keyword search poses at least two challenges: (a) identifying relevant results among a multitude of candidate results, and (b) dealing with the performance scalability issue of the query evaluation algorithms. In the literature, multiple schema-unaware approaches are proposed to cope with the above challenges. Some of them identify as relevant results only those candidate results which maintain the keyword instances in close proximity. Other approaches filter out irrelevant results using their structural characteristics, or rank and top-k process the retrieved results based on statistical information about the data. In any case, these approaches cannot disambiguate the query to identify the intent of the user, and they cannot scale satisfactorily when the size of the data and the number of query keywords grow. In recent years, different approaches have tried to exploit the schema (structural summary) of the RDF (Resource Description Framework) data graph to address the problems above. In this context, an original hierarchical clustering technique is introduced in this dissertation. This approach clusters the results based on a semantic interpretation of the keyword instances and takes advantage of relevance feedback from the user. The clustering hierarchy uses pattern graphs, which are structured queries, and clusters together result graphs with the same structure. Pattern graphs represent possible interpretations for the keyword query. 
By navigating through the hierarchy, the user can select the pattern graph which is relevant to their intent. Nevertheless, structural summaries are approximate representations of the data and, therefore, might return empty answers or miss results which are relevant to the user intent. To address this issue, a novel approach is presented which combines the use of the structural summary and the user feedback with a relaxation technique for pattern graphs to extract additional results potentially of interest to the user. Query caching and multi-query optimization techniques are leveraged for the efficient evaluation of relaxed pattern graphs. Although the approaches which consider the structural summary of the data graph are promising, they require interaction with the user. It is claimed in this dissertation that without additional information from the user, it is not possible to produce results of high quality from keyword search on RDF data with the existing techniques. In this regard, an original keyword query language on RDF data is introduced which allows the user to convey their intent flexibly and effortlessly by specifying cohesive keyword groups. A cohesive group of keywords in a query indicates that its keywords should form a cohesive unit in the query results. It is experimentally demonstrated that cohesive keyword queries improve the result quality effectively and prune the search space of the pattern graphs efficiently compared to traditional keyword queries. Most importantly, these benefits are achieved while retaining the simplicity and the convenience of traditional keyword search. The last issue addressed in this dissertation is the diversification problem for keyword search on RDF data. The goal of diversification is to trade off relevance and diversity in the result set of a keyword query in order to minimize the dissatisfaction of the average user. 
Novel metrics are developed for assessing relevance and diversity, along with techniques for the generation of a relevant and diversified set of query interpretations for a keyword query on an RDF data graph. Experimental results show the effectiveness of the metrics and the efficiency of the approach.
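The relevance and diversity trade-off can be illustrated with a generic greedy selection in the style of maximal marginal relevance. This is a swapped-in textbook technique, not the dissertation's metrics or algorithm; the function `diversify`, the `trade_off` weight, the Jaccard similarity and the sample interpretations are all hypothetical.

```python
def diversify(candidates, relevance, similarity, k, trade_off=0.5):
    """Greedily pick k items, rewarding relevance and penalising similarity to prior picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            # Similarity to the closest already selected item (0 if none selected yet).
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return trade_off * relevance[item] - (1 - trade_off) * max_sim
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical query interpretations, each described by the schema terms it uses.
terms = {
    "A": {"author", "paper"},
    "B": {"author", "paper", "year"},
    "C": {"venue", "city"},
}
relevance = {"A": 0.9, "B": 0.85, "C": 0.6}


def jaccard(x, y):
    """Overlap of the two interpretations' term sets."""
    return len(terms[x] & terms[y]) / len(terms[x] | terms[y])


print(diversify(terms, relevance, jaccard, k=2))                  # balanced
print(diversify(terms, relevance, jaccard, k=2, trade_off=1.0))   # relevance only
```

With the balanced weight, the second pick favours the interpretation that shares no terms with the first, even though a near-duplicate interpretation has higher relevance.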

    A compact and scalable encoding for updating XML based on node labeling schemes

    The eXtensible Markup Language (XML) has been adopted as the new standard for data exchange on the World Wide Web. As the rate of adoption increases, there is an ever more pressing need to store, query and update XML in its native format, thereby eliminating the overhead of parsing and transforming XML in and out of various data formats. However, the hierarchical, ordered and semi-structured properties of the tree structure underlying the XML data model present many challenges to updating XML. In particular, many of the tree labeling schemes were designed to solve a particular problem or provide a particular feature, often at the expense of other important features. In this dissertation, we identify the core properties that are representative of the desirable characteristics of a good dynamic labeling scheme for XML. We focus on four features central to the outstanding problems in existing dynamic labeling schemes, namely a compact label encoding, scalability, deleted node label reuse and a label storage scheme for binary-encoded bit-string node labels. At present, there is no dynamic labeling scheme that integrates support for all four features. We present a novel compact and scalable adaptive encoding method that facilitates a highly constrained growth rate of label size under arbitrary node insertion and deletion scenarios and that scales efficiently. We deploy our encoding method in two novel dynamic labeling schemes for XML that can completely avoid node relabeling, process frequently skewed insertions gracefully and reuse deleted node labels.
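To show what avoiding node relabeling means in practice, here is a minimal sketch of a lexicographically ordered bit-string labeling rule, in the spirit of simple binary-string schemes from the literature. It is not the encoding proposed in the dissertation; the function name `label_between` and the sample labels are hypothetical.

```python
def label_between(left, right):
    """Return a bit string that sorts strictly between left and right.

    Both labels must end in '1' and satisfy left < right lexicographically,
    so a new sibling can be inserted without changing any existing label.
    """
    if not (left < right and left.endswith("1") and right.endswith("1")):
        raise ValueError("expected left < right, both ending in '1'")
    if len(left) >= len(right):
        # Appending '1' keeps left as a prefix, so the result sorts after left.
        return left + "1"
    # Replace the final '1' of right with '01', which sorts just before right.
    return right[:-1] + "01"


# Repeatedly insert a new sibling directly after the first node (a skewed insertion).
labels = ["01", "1"]
for _ in range(3):
    labels.insert(1, label_between(labels[0], labels[1]))
print(labels)
```

Each skewed insertion in this naive rule lengthens the new label by one bit, which illustrates the label growth problem that a compact, adaptive encoding is meant to constrain.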