3,325 research outputs found
Self-disclosure model for classifying & predicting text-based online disclosure
Social media and social networking sites have evolved into digital billboards for internet users due to their rapid expansion. As these sites encourage consumers to expose personal information via profiles and postings, increased use of social media has generated privacy concerns. There have been notable efforts from researchers to detect self-disclosure using information extraction (IE) techniques. Recent research on machine learning and natural language processing methods shows that understanding the contextual meaning of words can result in better accuracy than traditional data extraction methods.
As noted above, users are often unaware of how much personal information they publish in online forums; there is therefore a need to detect such disclosures in natural language and give users the option to check for possible disclosures before posting.
For this purpose, this work proposes "SD_ELECTRA," a context-specific language model to detect Interest, Personal, Education and Work, Relationship, Personality, Residence, Travel plan, and Hospitality disclosures in social media data. The goal is to create a context-specific language model on a social media platform that performs better than the general language models.
Moreover, recent advancements in transformer models paved the way to train language models from scratch and achieve higher scores. Experimental results show that SD_ELECTRA has outperformed the base model in all considered metrics for the standard text classification method. In addition, the results also show that training a language model with a smaller pre-training context-specific corpus on a single GPU can improve its performance.
An illustrative web application was designed to let users test the disclosure possibilities in their social media posts. By leveraging the efficiency of the suggested model, users can thus obtain real-time feedback on self-disclosure.
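The classification task described here is multi-label: each post receives one binary decision per disclosure category. As a rough illustration of that input/output shape only (not the SD_ELECTRA model, which is a fine-tuned transformer), a keyword-lexicon stand-in with invented keywords might look like:

```python
# Toy multi-label self-disclosure detector. The real SD_ELECTRA model is a
# fine-tuned transformer; this keyword-lexicon stand-in (all keywords invented)
# only illustrates the task's shape: one binary decision per category.

DISCLOSURE_LEXICONS = {
    "Interest": {"my hobby", "favorite", "big fan of"},
    "Education and Work": {"graduated", "my job", "my employer"},
    "Relationship": {"my wife", "my husband", "my girlfriend", "my boyfriend"},
    "Residence": {"i live in", "my apartment", "moved to"},
    "Travel plan": {"flying to", "trip to", "vacation in"},
}

def detect_disclosures(post: str) -> dict[str, bool]:
    """Map one social media post to a category -> disclosed? dictionary."""
    text = post.lower()
    return {category: any(keyword in text for keyword in keywords)
            for category, keywords in DISCLOSURE_LEXICONS.items()}

labels = detect_disclosures("I just graduated and I'm flying to Lisbon next week!")
print(labels["Education and Work"], labels["Travel plan"])  # True True
```

A deployed detector would replace the lexicon lookup with the model's per-category sigmoid outputs, but the post-in, category-flags-out contract stays the same.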
AGI for Agriculture
Artificial General Intelligence (AGI) is poised to revolutionize a variety of
sectors, including healthcare, finance, transportation, and education. Within
healthcare, AGI is being utilized to analyze clinical medical notes, recognize
patterns in patient data, and aid in patient management. Agriculture is another
critical sector that impacts the lives of individuals worldwide. It serves as a
foundation for providing food, fiber, and fuel, yet faces several challenges,
such as climate change, soil degradation, water scarcity, and food security.
AGI has the potential to tackle these issues by enhancing crop yields, reducing
waste, and promoting sustainable farming practices. It can also help farmers
make informed decisions by leveraging real-time data, leading to more efficient
and effective farm management. This paper delves into the potential future
applications of AGI in agriculture, such as agriculture image processing,
natural language processing (NLP), robotics, knowledge graphs, and
infrastructure, and their impact on precision livestock and precision crops. By
leveraging the power of AGI, these emerging technologies can provide farmers
with actionable insights, allowing for optimized decision-making and increased
productivity. The transformative potential of AGI in agriculture is vast, and
this paper aims to highlight its potential to revolutionize the industry.
BEKG: A Built Environment Knowledge Graph
Practices in the built environment have become more digitalized with the
rapid development of modern design and construction technologies. However, the
need of practitioners and scholars to gather complex professional knowledge
about the built environment has not yet been met. In this paper,
more than 80,000 paper abstracts in the built environment field were obtained
to build a knowledge graph, a knowledge base storing entities and their
connective relations in a graph-structured data model. To ensure the retrieval
accuracy of the entities and relations in the knowledge graph, two
well-annotated datasets have been created, containing 2,000 instances and 1,450
instances each in 29 relations for the named entity recognition task and
relation extraction task respectively. These two tasks were solved by two
BERT-based models trained on the proposed dataset. Both models attained an
accuracy above 85% on these two tasks. More than 200,000 high-quality entities
and relations were then obtained by applying these models to all of the abstract data.
Finally, this knowledge graph is presented as a self-developed visualization
system to reveal relations between various entities in the domain. Both the
source code and the annotated dataset can be found here:
https://github.com/HKUST-KnowComp/BEKG
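At its core, a knowledge graph of this kind stores the (head, relation, tail) triples produced by the NER and relation-extraction models as labeled edges between entity nodes. The minimal sketch below (with invented example triples, not actual BEKG data) shows how such triples can be stored and queried:

```python
from collections import defaultdict

# Minimal triple store for a knowledge graph like BEKG: entities are nodes,
# and each (head, relation, tail) triple produced by the NER and relation-
# extraction models becomes a labeled edge. The triples below are invented
# examples for illustration, not actual BEKG data.

class KnowledgeGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # head entity -> [(relation, tail), ...]

    def add(self, head: str, relation: str, tail: str) -> None:
        self.edges[head].append((relation, tail))

    def neighbors(self, entity: str) -> list:
        """Return all (relation, tail) pairs leaving the given entity."""
        return self.edges.get(entity, [])

kg = KnowledgeGraph()
kg.add("BIM", "used-for", "clash detection")
kg.add("BIM", "part-of", "digital construction")
print(kg.neighbors("BIM"))
# [('used-for', 'clash detection'), ('part-of', 'digital construction')]
```

The paper's visualization system presumably renders such adjacency data as an interactive graph; the storage model itself is this simple.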
Cyber Security
This open access book constitutes the refereed proceedings of the 18th China Annual Conference on Cyber Security, CNCERT 2022, held in Beijing, China, in August 2022. The 17 papers presented were carefully reviewed and selected from 64 submissions. The papers are organized according to the following topical sections: data security; anomaly detection; cryptocurrency; information security; vulnerabilities; mobile internet; threat intelligence; text recognition.
LeanContext: Cost-Efficient Domain-Specific Question Answering Using LLMs
Question-answering (QA) is a significant application of Large Language Models
(LLMs), shaping chatbot capabilities across healthcare, education, and customer
service. However, widespread LLM integration presents a challenge for small
businesses due to the high expenses of LLM API usage. Costs rise rapidly when
domain-specific data (context) is used alongside queries for accurate
domain-specific LLM responses. One option is to summarize the context by using
LLMs and reduce the context. However, this can also filter out useful
information that is necessary to answer some domain-specific queries. In this
paper, we shift from human-oriented summarizers to AI model-friendly summaries.
Our approach, LeanContext, efficiently extracts k key sentences from the
context that are closely aligned with the query. The choice of k is neither
static nor random; we introduce a reinforcement learning technique that
dynamically determines k based on the query and context. The rest of the less
important sentences are reduced using a free open-source text reduction method.
We evaluate LeanContext against several recent query-aware and query-unaware
context reduction approaches on prominent datasets (arxiv papers and BBC news
articles). Despite substantial cost reductions, LeanContext's ROUGE-1 score
decreases only marginally compared to a baseline that retains the entire
context (no summarization). Additionally, if free pretrained LLM-based
summarizers are used to reduce the context (into human-consumable summaries),
LeanContext can further modify the reduced context to enhance the accuracy
(ROUGE-1 score).
Comment: The paper is under review
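The core idea, ranking context sentences by their relevance to the query and keeping only the top k, can be sketched with a simple word-overlap scorer. LeanContext itself uses embeddings and a reinforcement-learning policy to choose k; the fixed k and overlap metric below are deliberate simplifications for illustration:

```python
# Query-aware context reduction in the spirit of LeanContext: score each
# context sentence against the query and keep the top-k, preserving the
# original sentence order. LeanContext uses embeddings and a reinforcement-
# learning policy to pick k; the word-overlap score and fixed k here are
# simplifications for illustration.

def reduce_context(query: str, sentences: list[str], k: int = 2) -> list[str]:
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(s.lower().split())), i, s)
              for i, s in enumerate(sentences)]
    top_k = sorted(scored, reverse=True)[:k]        # k best-matching sentences
    return [s for _, i, s in sorted(top_k, key=lambda t: t[1])]  # original order

context = ["The model uses GPT-4.", "Pricing is per token.", "The weather was nice."]
print(reduce_context("How is pricing of the model determined", context))
# ['The model uses GPT-4.', 'Pricing is per token.']
```

Preserving the original sentence order matters because the reduced context is passed back to an LLM, which reads it as running prose rather than a ranked list.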
Natural Language Interfaces to Data
Recent advances in NLU and NLP have resulted in renewed interest in natural
language interfaces to data, which provide an easy mechanism for non-technical
users to access and query the data. While early systems evolved from keyword
search and focused on simple factual queries, the complexity of both the input
sentences as well as the generated SQL queries has evolved over time. More
recently, there has also been a lot of focus on using conversational interfaces
for data analytics, empowering a line of non-technical users with quick
insights into the data. There are three main challenges in natural language
querying (NLQ): (1) identifying the entities involved in the user utterance,
(2) connecting the different entities in a meaningful way over the underlying
data source to interpret user intents, and (3) generating a structured query in
the form of SQL or SPARQL.
There are two main approaches for interpreting a user's NLQ. Rule-based
systems make use of semantic indices, ontologies, and KGs to identify the
entities in the query, understand the intended relationships between those
entities, and utilize grammars to generate the target queries. With the
advances in deep learning (DL)-based language models, there have been many
text-to-SQL approaches that try to interpret the query holistically using DL
models. Hybrid approaches that utilize both rule-based techniques as well as DL
models are also emerging by combining the strengths of both approaches.
Conversational interfaces are the next natural step to one-shot NLQ by
exploiting query context between multiple turns of conversation for
disambiguation. In this article, we review the background technologies that are
used in natural language interfaces, and survey the different approaches to
NLQ. We also describe conversational interfaces for data analytics and discuss
several benchmarks used for NLQ research and evaluation.Comment: The full version of this manuscript, as published by Foundations and
Trends in Databases, is available at http://dx.doi.org/10.1561/190000007
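The three NLQ steps named in the abstract (identifying entities, connecting them over the data source, and generating a structured query) can be illustrated with a deliberately tiny rule-based translator. The schema, patterns, and filter handling below are all invented for illustration and are far simpler than the systems surveyed:

```python
import re

# Toy rule-based NLQ-to-SQL translator illustrating the three steps named
# above: (1) identify entities in the utterance, (2) connect them to a known
# schema, (3) generate a structured query. Schema and patterns are invented.

SCHEMA = {"employees": {"name", "salary", "department"}}

def nlq_to_sql(utterance: str) -> str:
    text = utterance.lower()
    # (1) entity identification: find a known table and any mentioned columns
    table = next(t for t in SCHEMA if t in text)
    columns = sorted(c for c in SCHEMA[table] if c in text) or ["*"]
    # (2) + (3) interpret a simple equality filter and emit SQL
    match = re.search(r"department is (\w+)", text)
    where = f" WHERE department = '{match.group(1)}'" if match else ""
    return f"SELECT {', '.join(columns)} FROM {table}{where}"

print(nlq_to_sql("show the name and salary of employees whose department is sales"))
# SELECT department, name, salary FROM employees WHERE department = 'sales'
```

Rule-based systems generalize this pattern with semantic indices, ontologies, and grammars; DL-based text-to-SQL models instead learn the utterance-to-query mapping end to end.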
A Contextual Topic Modeling and Content Analysis of Iranian laws and Regulations
A constitution is the highest legal document of a country and serves as a
guide for the establishment of other laws. The constitution defines the
political principles, structure, hierarchy, position, and limits of the
political power of a country's government. It determines and guarantees the
rights of citizens. This study aimed at topic modeling of Iranian laws. As part
of this research, 11,760 laws were collected from the Dotic website. Then, topic
modeling was conducted on the titles and content of the regulations using
LDA. Data analysis with topic modeling led to the identification of 10 topics
including Economic, Customs, Housing and Urban Development, Agriculture,
Insurance, Legal and judicial, Cultural, Information Technology, Political, and
Government. The largest topic, Economic, accounts for 29% of regulations, while
the smallest are Political and Government, each accounting for 2%. This
research applies topic modeling to explore legal texts and identify trends in
regulations from 2016 to 2023. The study found that regulations constitute a
significant share of the body of law, most of which relates to economics and
customs. Cultural regulations increased in 2023. It can be concluded that the
laws enacted each year reflect society's conditions and legislators' top
concerns.
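The topic-share figures reported here (e.g. Economic at 29%, Political at 2%) come from counting the dominant LDA topic assigned to each regulation. A minimal sketch of that final counting step, run on toy labels rather than the actual corpus:

```python
from collections import Counter

# After LDA assigns each regulation a dominant topic, the share of each topic
# can be reported (the paper finds Economic at 29% and Political at 2%). This
# sketch computes such shares from a list of per-document dominant-topic
# labels; the four labels below are toy data, not the paper's corpus.

def topic_shares(dominant_topics: list[str]) -> dict[str, float]:
    counts = Counter(dominant_topics)
    total = len(dominant_topics)
    return {topic: round(100 * n / total, 1) for topic, n in counts.most_common()}

print(topic_shares(["Economic", "Economic", "Customs", "Political"]))
# {'Economic': 50.0, 'Customs': 25.0, 'Political': 25.0}
```

Tracking these shares per enactment year is what lets the study report trends such as the 2023 increase in cultural regulations.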