CAPTCHA Types and Breaking Techniques: Design Issues, Challenges, and Future Research Directions
The proliferation of the Internet and mobile devices has given
malicious bots access to genuine resources and data. Bots may instigate
phishing, unauthorized access, denial-of-service, and spoofing attacks, to
name a few. Authentication and testing mechanisms that verify end users
and keep malicious programs from infiltrating services and data are
strong defenses against malicious bots. Completely Automated Public
Turing test to tell Computers and Humans Apart (CAPTCHA) is an authentication
process that confirms the user is human before access is granted. This
paper provides an in-depth survey on CAPTCHAs and focuses on two main things:
(1) a detailed discussion on various CAPTCHA types along with their advantages,
disadvantages, and design recommendations, and (2) an in-depth analysis of
different CAPTCHA breaking techniques. The survey is based on over two hundred
studies on the subject matter conducted from 2003 to date. The analysis
reinforces the need to design more attack-resistant CAPTCHAs while keeping
their usability intact. The paper also highlights the design challenges and
open issues related to CAPTCHAs. It also provides useful
recommendations for breaking CAPTCHAs.
Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods
Machine generated text is increasingly difficult to distinguish from human
authored text. Powerful open-source models are freely available, and
user-friendly tools that democratize access to generative models are
proliferating. ChatGPT, which was released shortly after the first preprint of
this survey, epitomizes these trends. The great potential of state-of-the-art
natural language generation (NLG) systems is tempered by the multitude of
avenues for abuse. Detection of machine generated text is a key countermeasure
for reducing abuse of NLG models, with significant technical challenges and
numerous open problems. We provide a survey that includes both 1) an extensive
analysis of threat models posed by contemporary NLG systems, and 2) the most
complete review of machine generated text detection methods to date. This
survey places machine generated text within its cybersecurity and social
context, and provides strong guidance for future work addressing the most
critical threat models, and ensuring detection systems themselves demonstrate
trustworthiness through fairness, robustness, and accountability.
Comment: Manuscript submitted to ACM Special Session on Trustworthy AI.
2022/11/19 - Updated reference
POTATO: The Portable Text Annotation Tool
We present POTATO, the Portable Text Annotation Tool, a free, fully
open-sourced annotation system that 1) supports labeling many types of text and
multimodal data; 2) offers easy-to-configure features to maximize the
productivity of both deployers and annotators (convenient templates for common
ML/NLP tasks, active learning, keypress shortcuts, keyword highlights,
tooltips); and 3) supports a high degree of customization (editable UI,
inserting pre-screening questions, attention and qualification tests).
Experiments over two annotation tasks suggest that POTATO improves labeling
speed through its specially designed productivity features, especially for long
documents and complex tasks. POTATO is available at
https://github.com/davidjurgens/potato and will continue to be updated.
Comment: EMNLP 2022 DEM
Avatar captcha: telling computers and humans apart via face classification and mouse dynamics
Bots are malicious automated computer programs that execute scripts and predefined functions on an affected computer. They pose cybersecurity threats and are among the most sophisticated and common cybercrime tools today. They spread viruses, generate spam, steal sensitive personal information, rig online polls, and commit other types of online crime and fraud. They sneak into unprotected systems through the Internet by seeking vulnerable entry points, and they access a system’s resources as a human user does. The question is how to counter this: how do we block bots while still allowing human users to access system resources? One solution is a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), a program that can generate and grade tests that most humans can pass but computers cannot. It is used as a tool to distinguish humans from malicious bots. CAPTCHAs are a class of Human Interactive Proofs (HIPs) meant to be easily solvable by humans and economically infeasible for computers. Text CAPTCHAs are very popular and commonly used. For each challenge, they generate a sequence of characters by distorting standard fonts and ask users to identify the characters and type them out. However, they are vulnerable to character segmentation attacks by bots, depend on the English language, and are becoming too complex for people to solve. An alternative is the image CAPTCHA, which uses images instead of text and requires users to identify certain images to solve the challenge. Image CAPTCHAs are user-friendly and convenient for human users and pose a much harder problem for bots. User profiling, or user identification, has gained considerable significance on today’s Internet. Identity theft and similar crimes can be prevented by granting access to resources only to authorized users. A timely response to a security breach requires frequent user verification.
However, this process must be passive, transparent, and non-obtrusive. For such a system to be practical, it must be accurate, efficient, and difficult to forge. Behavioral biometric systems are usually less prominent; however, they provide numerous significant advantages over traditional biometric systems. Collecting behavioral data is non-obtrusive and cost-effective because it requires no special hardware. While these systems are not unique enough to provide reliable human identification, they have been shown to be highly accurate in identity verification. In accomplishing everyday tasks, people use different styles and strategies and apply unique skills and knowledge. These define the behavioral traits of the user. Behavioral biometrics attempts to quantify these traits to profile users and establish their identity. Human-computer interaction (HCI)-based biometrics comprise the interaction strategies and styles between a human and a computer. These unique user traits are quantified to build profiles for identification. A specific category of HCI-based biometrics, known as Mouse Dynamics, is based on recording human interactions with the mouse as the input device. By monitoring the mouse activity a user produces while interacting with the GUI, a unique profile can be created that helps identify that user. Mouse-based verification approaches do not record sensitive user credentials such as usernames and passwords, so they avoid privacy issues. An image CAPTCHA is proposed that incorporates Mouse Dynamics to help fortify it. It displays random images obtained from Yahoo’s Flickr. To solve the challenge, the user must identify and select a certain class of images. Two theme-based challenges have been designed: Avatar CAPTCHA and Zoo CAPTCHA. The former displays human and avatar faces, whereas the latter displays different animal species.
While the user attempts to solve the CAPTCHA with its dynamically selected images, the user's mouse interactions (mouse clicks, mouse movements, cursor screen coordinates, etc.) are recorded non-obtrusively at regular time intervals. These recorded mouse movements constitute the Mouse Dynamics Signature (MDS) of the user. The MDS provides an additional security mechanism for separating humans from bots. The security of the CAPTCHA is tested by an adversary running a mouse bot that attempts to solve the CAPTCHA challenges.
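As a rough illustration of how periodically sampled cursor data could be condensed into a signature, the sketch below computes a few summary features from a list of samples. The abstract does not state which features the thesis actually uses, so the `MouseSample` record, the function name, and the chosen features (path length, straightness, speed statistics, click count) are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class MouseSample:
    """One periodic reading of the cursor (hypothetical record layout)."""
    t: float        # timestamp in seconds
    x: float        # cursor x coordinate in pixels
    y: float        # cursor y coordinate in pixels
    clicked: bool   # whether a click was registered at this sample


def mouse_dynamics_features(samples):
    """Summarize a sequence of samples into a small feature dictionary.

    Illustrative features only; not the feature set of the thesis.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    path_length = 0.0
    speeds = []
    for a, b in zip(samples, samples[1:]):
        step = math.hypot(b.x - a.x, b.y - a.y)
        dt = b.t - a.t
        path_length += step
        if dt > 0:
            speeds.append(step / dt)

    # Ratio of straight-line distance to travelled distance (1.0 = perfectly straight).
    direct = math.hypot(samples[-1].x - samples[0].x,
                        samples[-1].y - samples[0].y)
    straightness = direct / path_length if path_length > 0 else 1.0

    mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
    variance = (sum((s - mean_speed) ** 2 for s in speeds) / len(speeds)
                if speeds else 0.0)

    return {
        "path_length": path_length,
        "straightness": straightness,
        "mean_speed": mean_speed,
        "speed_std": math.sqrt(variance),
        "clicks": sum(1 for s in samples if s.clicked),
    }
```

A scripted bot that moves in perfectly straight lines at constant speed would tend to show a straightness near 1 and a speed deviation near 0, which is the kind of regularity a classifier over such features might pick up; whether the thesis relies on this particular cue is not stated in the abstract.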
Anglicisms in the Field of IT (GitHub and 3D Slicer): Multilingual Evidence from European Languages (French, German, Italian, Portuguese and Spanish)
This paper provides evidence of the noticeable adoption of Anglicisms in the professional field of IT by different European languages (French, German, Italian, Portuguese and Spanish). Two different domains, GitHub and 3D Slicer, have been examined, and a multilingual glossary has been created with the contributions of European and African engineers and technicians cooperating in the European project MACbioIDi. This multilingual glossary is a useful tool for engineers, as it provides equivalent terminology in these five languages. The oral use of the studied Anglicisms is verified through interviews with engineers, and the written use is recorded with examples in context taken from Internet websites and forums. This interdisciplinary research involves people from different areas of knowledge (linguists, engineers and technicians) and from different continents (Africa, America and Europe).
Mathematical Expression Recognition based on Probabilistic Grammars
[EN] Mathematical notation is well known and used all over the
world. Humankind has evolved from simple methods for representing
counts to today's well-defined math notation, able to account for
complex problems. Furthermore, mathematical expressions constitute a
universal language in scientific fields, and many information
resources containing mathematics have been created in recent
decades. However, in order to efficiently access all that information,
scientific documents have to be digitized or produced directly in
electronic formats.
Although most people are able to understand and produce mathematical
information, introducing math expressions into electronic devices
requires learning specific notations or using editors. Automatic
recognition of mathematical expressions aims at filling this gap
between the knowledge of a person and the input accepted by
computers. This way, printed documents containing math expressions
could be automatically digitized, and handwriting could be used for
direct input of math notation into electronic devices.
This thesis is devoted to developing an approach for mathematical
expression recognition. In this document we propose an approach for
recognizing any type of mathematical expression (printed or
handwritten) based on probabilistic grammars. To do so, we
develop a formal statistical framework from which several
probability distributions are derived. Throughout the document, we deal with the
definition and estimation of all these probabilistic sources of
information. Finally, we define the parsing algorithm that globally
computes the most probable mathematical expression for a given input
according to the statistical framework.
An important point in this study is to provide objective performance
evaluation and report results using public data and standard
metrics. We inspected the problems of automatic evaluation in this
field and looked for the best solutions. We also report several
experiments using public databases and we participated in several
international competitions. Furthermore, we have released most of the
software developed in this thesis as open source.
We also explore some of the applications of mathematical expression
recognition. In addition to the direct applications of transcription
and digitization, we report two important proposals. First, we
developed mucaptcha, a method to tell humans and computers apart by
means of math handwriting input, which represents a novel application
of math expression recognition. Second, we tackled the problem of
layout analysis of structured documents using the statistical
framework developed in this thesis, because both are two-dimensional
problems that can be modeled with probabilistic grammars.
The approach developed in this thesis for mathematical expression
recognition has obtained good results at different levels. It has
produced several scientific publications in international conferences
and journals, and has won awards in international competitions.
Álvaro Muñoz, F. (2015). Mathematical Expression Recognition based on Probabilistic Grammars [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/51665
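The parsing step described in the thesis abstract, computing the most probable expression for a given input under a probabilistic grammar, can be illustrated with a much simpler one-dimensional analogue. The sketch below is a Viterbi-style CYK parser over a token sequence. It is not the thesis's algorithm, which works on two-dimensional input; the toy grammar, its probabilities, and the names `LEXICAL`, `BINARY`, and `viterbi_cyk` are assumptions made for this example.

```python
import math
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (probabilities are made up).
# Lexical rules: (nonterminal, token) -> probability
LEXICAL = {
    ("Exp", "x"): 0.3,
    ("Exp", "1"): 0.3,
    ("Op", "+"): 0.5,
    ("Op", "-"): 0.5,
}
# Binary rules: (parent, left child, right child) -> probability
BINARY = {
    ("Exp", "Exp", "OpExp"): 0.4,
    ("OpExp", "Op", "Exp"): 1.0,
}


def viterbi_cyk(tokens, lexical, binary, start="Exp"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(tokens)
    # best[(i, j)][symbol] = (log probability, backpointer) for tokens[i:j]
    best = defaultdict(dict)

    # Spans of length 1: apply lexical rules.
    for i, tok in enumerate(tokens):
        for (lhs, word), p in lexical.items():
            if word == tok:
                cell = best[(i, i + 1)]
                lp = math.log(p)
                if lhs not in cell or lp > cell[lhs][0]:
                    cell[lhs] = (lp, tok)

    # Longer spans: combine two adjacent sub-spans with a binary rule,
    # keeping only the highest-scoring derivation per symbol.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                left = best.get((i, k), {})
                right = best.get((k, j), {})
                for (lhs, b, c), p in binary.items():
                    if b in left and c in right:
                        lp = math.log(p) + left[b][0] + right[c][0]
                        cell = best[(i, j)]
                        if lhs not in cell or lp > cell[lhs][0]:
                            cell[lhs] = (lp, (k, b, c))

    if start not in best.get((0, n), {}):
        return None

    def build(i, j, sym):
        # Follow backpointers to reconstruct the parse tree as nested tuples.
        _, bp = best[(i, j)][sym]
        if isinstance(bp, str):
            return (sym, bp)
        k, b, c = bp
        return (sym, build(i, k, b), build(k, j, c))

    return math.exp(best[(0, n)][start][0]), build(0, n, start)
```

Calling `viterbi_cyk(["x", "+", "1"], LEXICAL, BINARY)` yields the single parse that this toy grammar allows for the sequence, together with its probability. A recognizer for real expressions would additionally have to model symbol segmentation and the spatial relations between symbols, which this sketch leaves out entirely.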
Duolingo: An Useful Complementary Mobile Tool to Improve English as a Foreign Language Learning and Teaching.
This monograph reviews the Duolingo application and how it could be used as a complementary mobile tool to improve the teaching and learning of English as a Foreign Language. The goal is to show the possible effectiveness of using Duolingo as a complement to English lessons. This has been done by examining literature reviews on mobile technology and gamification, mobile technology and m-learning, language learning, translation and crowdsourcing in Duolingo, and some results on the effectiveness of Duolingo in the learning and teaching process. The review makes clear that mobile technology could offer both students and teachers an excellent opportunity to practice and improve language skills anywhere, given the ready access to a smartphone that most people have. This monograph highlights the positive effect of using this application, how easily it can be obtained for free, and the enthusiasm it could bring to some students.
Crowdsourcing for Speech: Economic, Legal and Ethical analysis
With respect to spoken language resource production, crowdsourcing (the process of distributing tasks to an open, unspecified population via the Internet) offers a wide range of opportunities: populations with specific skills are potentially instantaneously accessible somewhere on the globe for any spoken language. As is the case for most newly introduced high-tech services, crowdsourcing raises both hopes and doubts, certainties and questions. A general analysis of crowdsourcing for speech processing can be found in (Eskenazi et al., 2013). This article focuses on ethical, legal and economic issues of crowdsourcing in general (Zittrain, 2008a) and of crowdsourcing services such as Amazon Mechanical Turk (Fort et al., 2011; Adda et al., 2011), a major platform for multilingual language resources (LR) production.
Modeling and estimating the economic and social impact of the results of the project Re-search Alps
The idea behind the Re-search Alps project was conceived within the EUSALP Action Group 1, “to develop an effective research and innovation ecosystem” (AG1). EUSALP is the EU-Strategy for the Alpine Region, which comprises seven countries: Austria, France, Germany, Italy, Liechtenstein, Slovenia and Switzerland. The strategy aims at ensuring mutually beneficial interactions between the mountain regions at its core and the surrounding lowlands and urban areas. The goal of the Re-search Alps project is the publication on the web of an open dataset describing the private and public laboratories and research and innovation centers (hereinafter referred to as “labs”, in short) existing in the seven aforementioned countries, with particular reference to the 48 Regions constituting the Alpine Area.