The INCF Digital Atlasing Program: Report on Digital Atlasing Standards in the Rodent Brain
The goal of the INCF Digital Atlasing Program is to provide the vision and direction necessary to make the rapidly growing collection of multidimensional data of the rodent brain (images, gene expression, etc.) widely accessible and usable to the international research community. This Digital Brain Atlasing Standards Task Force was formed in May 2008 to investigate the state of rodent brain digital atlasing, and formulate standards, guidelines, and policy recommendations.

Our first objective has been the preparation of a detailed document that includes the vision and specific description of an infrastructure, systems and methods capable of serving the scientific goals of the community, as well as practical issues for achieving the goals. This report builds on the 1st INCF Workshop on Mouse and Rat Brain Digital Atlasing Systems (Boline et al., 2007, _Nature Precedings_, doi:10.1038/npre.2007.1046.1) and includes a more detailed analysis of both the current state and desired state of digital atlasing along with specific recommendations for achieving these goals.
The Curious Case of the PDF Converter that Likes Mozart: Dissecting and Mitigating the Privacy Risk of Personal Cloud Apps
Third-party apps that work on top of personal cloud services such as Google
Drive and Dropbox require access to the user's data in order to provide some
functionality. Through detailed analysis of a hundred popular Google Drive apps
from Google's Chrome store, we discover that the existing permission model is
quite often misused: around two thirds of analyzed apps are over-privileged,
i.e., they access more data than is needed for them to function. In this work,
we analyze three different permission models that aim to discourage users from
installing over-privileged apps. In experiments with 210 real users, we
discover that the most successful permission model is our novel ensemble method
that we call Far-reaching Insights. Far-reaching Insights inform the users
about the data-driven insights that apps can make about them (e.g., their
topics of interest, collaboration and activity patterns, etc.). Thus, they seek
to bridge the gap between what third parties can actually know about users and
users' perception of their privacy leakage. The efficacy of Far-reaching
Insights in bridging this gap is demonstrated by our results, as Far-reaching
Insights prove to be, on average, twice as effective as the current model in
discouraging users from installing over-privileged apps. In an effort to
promote general privacy awareness, we deploy a publicly available
privacy-oriented app store that uses Far-reaching Insights. Based on the knowledge
extracted from data of the store's users (over 115 gigabytes of Google Drive
data from 1440 users with 662 installed apps), we also delineate the ecosystem
for third-party cloud apps from the standpoint of developers and cloud
providers. Finally, we present several general recommendations that can guide
future work in the area of privacy for the cloud.
Relevance Analysis for Document Retrieval
Document retrieval systems recover documents from a dataset and order them according to their perceived relevance to a user's search query. This is a difficult task for machines to accomplish because there exists a semantic gap between the meaning of the terms in a user's literal query and a user's true intentions. Even with this ambiguity that arises with a lack of context, users still expect that the set of documents returned by a search engine is both highly relevant to their query and properly ordered. The focus of this thesis is on document retrieval systems that explore methods of ordering documents from unstructured, textual corpora using text queries. The main goal of this study is to enhance the Okapi BM25 document retrieval model. In doing so, this research hypothesizes that the structure of text inside documents and queries holds valuable semantic information that can be incorporated into the Okapi BM25 model to increase its performance. Modifications that account for a term's part of speech, the proximity between a pair of related terms, the proximity of a term with respect to its location in a document, and query expansion are used to augment Okapi BM25 to increase the model's performance. The study resulted in 87 modifications which were all validated using open source corpora. The top scoring modification from the validation phase was then tested under the Lisa corpus and the model performed 10.25% better than Okapi BM25 when evaluated under mean average precision. When compared against two industry standard search engines, Lucene and Solr, the top scoring modification largely outperforms these systems by up to 21.78% and 23.01%, respectively.
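For readers unfamiliar with the baseline that the thesis above sets out to enhance, the following is a minimal sketch of plain Okapi BM25 scoring. It is not the thesis's modified model: the function name `bm25_scores`, the toy documents, the whitespace tokenization, and the parameter defaults `k1=1.2` and `b=0.75` are all assumptions chosen for illustration, and the IDF uses the common "+1 inside the logarithm" variant.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; k1 and b are common defaults,
    not values taken from the thesis.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this sketch), tokenized by whitespace.
docs = [
    "okapi bm25 ranks documents by term frequency".split(),
    "lucene and solr are search engines".split(),
    "query expansion adds related terms to a query".split(),
]
scores = bm25_scores("bm25 term frequency".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

Only the first toy document contains any query term, so it is the only one with a non-zero score. Modifications such as part-of-speech weighting or term proximity, as described in the abstract, would presumably adjust the per-term contribution inside the inner loop.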
Knowledge extraction from unstructured data
Data availability is becoming more essential, considering the current growth of web-based data. The data available on the web are represented as unstructured, semi-structured, or structured data. In order to make the web-based data available for several Natural Language Processing or Data Mining tasks, the data needs to be presented as machine-readable data in a structured format. Thus, techniques for addressing the problem of capturing knowledge from unstructured data sources are needed. Knowledge extraction methods are used by the research communities to address this problem; methods that are able to capture knowledge in a natural language text and map the extracted knowledge to existing knowledge presented in knowledge graphs (KGs). These knowledge extraction methods include Named-entity recognition, Named-entity Disambiguation, Relation Recognition, and Relation Linking. This thesis addresses the problem of extracting knowledge over unstructured data and discovering patterns in the extracted knowledge. We devise a rule-based approach for entity and relation recognition and linking. The defined approach effectively maps entities and relations within a text to their resources in a target KG. Additionally, it overcomes the challenges of recognizing and linking entities and relations to a specific KG by employing devised catalogs of linguistic and domain-specific rules that state the criteria to recognize entities in a sentence of a particular language, and a deductive database that encodes knowledge in community-maintained KGs. Moreover, we define a Neuro-symbolic approach for the tasks of knowledge extraction in encyclopedic and domain-specific domains; it combines symbolic and sub-symbolic components to overcome the challenges of entity recognition and linking and the limitation of the availability of training data while maintaining the accuracy of recognizing and linking entities. 
Additionally, we present a context-aware framework for unveiling semantically related posts in a corpus; it is a knowledge-driven framework that retrieves associated posts effectively. We cast the problem of unveiling semantically related posts in a corpus into the Vertex Coloring Problem. We evaluate the performance of our techniques on several benchmarks related to various domains for knowledge extraction tasks. Furthermore, we apply these methods in real-world scenarios from national and international projects. The outcomes show that our techniques are able to effectively extract knowledge encoded in unstructured data and discover patterns over the extracted knowledge presented as machine-readable data. More importantly, the evaluation results provide evidence of the effectiveness of combining the reasoning capacity of the symbolic frameworks with the power of pattern recognition and classification of sub-symbolic models.
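The abstract above does not say how posts are mapped onto a graph, so the following is only a generic sketch of greedy vertex coloring, not the thesis's formulation. It assumes, purely for illustration, that each vertex is a post and that an edge joins two posts that must not land in the same group; the names `greedy_coloring` and `conflicts` and the toy graph are hypothetical.

```python
def greedy_coloring(graph):
    """Assign each vertex the smallest color not used by its colored neighbors.

    `graph` maps a vertex to the set of its neighbors. Vertices are visited
    in order of decreasing degree, a common heuristic; the result is a valid
    coloring but not necessarily one with the minimum number of colors.
    """
    colors = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Hypothetical conflict graph: an edge means "these posts must be separated".
conflicts = {
    "post1": {"post2", "post3"},
    "post2": {"post1", "post3"},
    "post3": {"post1", "post2", "post4"},
    "post4": {"post3"},
}
coloring = greedy_coloring(conflicts)
```

Posts that receive the same color form one group. The triangle among `post1`, `post2`, and `post3` forces at least three colors in this toy graph.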
A Survey of Source Code Search: A 3-Dimensional Perspective
(Source) code search has attracted wide attention from software engineering researchers
because it can improve the productivity and quality of software development.
Given a functionality requirement usually described in a natural language
sentence, a code search system can retrieve code snippets that satisfy the
requirement from a large-scale code corpus, e.g., GitHub. To realize effective
and efficient code search, many techniques have been proposed successively.
These techniques improve code search performance mainly by optimizing three
core components, including query understanding component, code understanding
component, and query-code matching component. In this paper, we provide a
3-dimensional perspective survey for code search. Specifically, we categorize
existing code search studies into query-end optimization techniques, code-end
optimization techniques, and match-end optimization techniques according to the
specific components they optimize. Considering that each end can be optimized
independently and contributes to the code search performance, we treat each end
as a dimension. Therefore, this survey is 3-dimensional in nature, and it
provides a comprehensive summary of each dimension in detail. To understand the
research trends of the three dimensions in existing code search studies, we
systematically review 68 relevant studies. Different from existing code
search surveys that only focus on the query end or code end or introduce
various aspects shallowly (including codebase, evaluation metrics, modeling
technique, etc.), our survey provides a more nuanced analysis and review of the
evolution and development of the underlying techniques used in the three ends.
Based on a systematic review and summary of existing work, we outline several
open challenges and opportunities at the three ends that remain to be addressed
in future work. Comment: submitted to ACM Transactions on Software Engineering and Methodology
Let's Chat to Find the APIs: Connecting Human, LLM and Knowledge Graph through AI Chain
API recommendation methods have evolved from literal and semantic keyword
matching to query expansion and query clarification. The latest query
clarification method is knowledge graph (KG)-based, but limitations include
out-of-vocabulary (OOV) failures and rigid question templates. To address these
limitations, we propose a novel knowledge-guided query clarification approach
for API recommendation that leverages a large language model (LLM) guided by
KG. We utilize the LLM as a neural knowledge base to overcome OOV failures,
generating fluent and appropriate clarification questions and options. We also
leverage the structured API knowledge and entity relationships stored in the KG
to filter out noise, and transfer the optimal clarification path from KG to the
LLM, increasing the efficiency of the clarification process. Our approach is
designed as an AI chain that consists of five steps, each handled by a separate
LLM call, to improve accuracy, efficiency, and fluency for query clarification
in API recommendation. We verify the usefulness of each unit in our AI chain,
which all received high scores close to a perfect 5. When compared to the
baselines, our approach shows a significant improvement in MRR, with a maximum
increase of 63.9% when the query statement is covered in KG and 37.2%
when it is not. Ablation experiments reveal that the guidance of knowledge in
the KG and the knowledge-guided pathfinding strategy are crucial for our
approach's performance, resulting in a 19.0% and 22.2% increase in MAP,
respectively. Our approach demonstrates a way to bridge the gap between KG and
LLM, effectively compensating for the strengths and weaknesses of both. Comment: Accepted on ASE'202
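The abstract above reports its gains in MRR and MAP. As a reminder of what these metrics measure, here is a minimal sketch using their standard definitions; the function names and the toy rankings are assumptions for illustration and are unrelated to the paper's data.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    total = 0.0
    for results, gold in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(results, start=1):
            if item in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def average_precision(results, gold):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(results, start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold) if gold else 0.0


def mean_average_precision(ranked_lists, relevant_sets):
    """Average of per-query average precision."""
    aps = [average_precision(r, g) for r, g in zip(ranked_lists, relevant_sets)]
    return sum(aps) / len(aps)


# Two toy queries with made-up recommendation lists and relevant items.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x", "z"}]
mrr = mean_reciprocal_rank(ranked, relevant)
map_score = mean_average_precision(ranked, relevant)
```

In this toy case the first relevant items sit at ranks 2 and 1, so the MRR is (1/2 + 1) / 2 = 0.75. The per-query average precisions are 1/2 and (1 + 2/3) / 2 = 5/6, so the MAP is 2/3.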
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform