961 research outputs found

    Browse-to-search

    This demonstration presents a novel interactive online shopping application based on visual search technologies. When users want to buy something on a shopping site, they usually need to look for related information on other websites, so they have to switch between the web page being browsed and other websites that provide search results. The proposed application enables users to naturally search for products of interest while they browse a web page, so that even a casual purchase intent is easily satisfied. The interactive shopping experience is characterized by: 1) in session - it allows users to specify the purchase intent in the browsing session, instead of leaving the current page and navigating to other websites; 2) in context - the browsed web page provides implicit context information which helps infer user purchase preferences; 3) in focus - users easily specify their search interest using gestures on touch devices and do not need to formulate queries in a search box; 4) natural - gesture inputs and visual search provide users with a natural shopping experience. The system is evaluated against a data set consisting of several million commercial product images. © 2012 Authors

    CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement

    Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images, taken at some unknown location, to a set of geo-tagged reference images. Existing methods, devoted to semantic feature representation, are evolving towards robustness to a wide variety of differences between query and reference images, including illumination and viewpoint changes as well as scale and seasonal variations. However, practical visual geolocalization approaches need to be robust to appearance changes and extreme viewpoint variations while providing accurate global location estimates. Inspired by curriculum design, in which humans learn general knowledge first and then delve into professional expertise, we first recognize the semantic scene and then measure its geometric structure. Our approach, termed CurriculumLoc, involves a carefully designed multi-stage refinement pipeline and a novel keypoint detection and description method with global semantic awareness and local geometric verification. We rerank candidates and solve a particular cross-domain perspective-n-point (PnP) problem based on these keypoints and their corresponding descriptors; position refinement occurs incrementally. Extensive experimental results on our collected dataset, TerraTrack, and on a benchmark dataset, ALTO, demonstrate that our approach delivers the aforementioned desirable characteristics of a practical visual geolocalization solution. Additionally, we achieve new high recall@1 scores of 62.6% and 94.5% on ALTO with two different distance metrics, respectively. Dataset, code and trained models are publicly available at https://github.com/npupilab/CurriculumLoc. Comment: 14 pages, 15 figures
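
For readers unfamiliar with the recall@1 figure quoted in this abstract, the following is a minimal sketch of how such a score is commonly computed. It is not the CurriculumLoc evaluation code: the function name `recall_at_1`, the planar (x, y) coordinates in metres, and the distance threshold are all assumptions made for illustration, and the abstract does not say which two distance metrics the authors used.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # hypothetical local planar coordinates, in metres


def recall_at_1(top1_positions: Sequence[Point],
                true_positions: Sequence[Point],
                threshold_m: float) -> float:
    """Fraction of queries whose top-ranked match lies within threshold_m
    of the ground-truth position (assumed definition, for illustration)."""
    if len(top1_positions) != len(true_positions):
        raise ValueError("one top-1 position is required per query")
    if not true_positions:
        raise ValueError("at least one query is required")
    hits = 0
    for (px, py), (tx, ty) in zip(top1_positions, true_positions):
        # Euclidean distance between the predicted and true positions
        if math.hypot(px - tx, py - ty) <= threshold_m:
            hits += 1
    return hits / len(true_positions)


# Three toy queries: the first and third land within 25 m, the second does not.
predicted = [(0.0, 0.0), (10.0, 0.0), (100.0, 100.0)]
truth = [(0.0, 3.0), (10.0, 40.0), (100.0, 101.0)]
score = recall_at_1(predicted, truth, threshold_m=25.0)
```

Changing the threshold (or the distance function) changes the score, which is one reason a single method can report two different recall@1 values on the same benchmark.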

    Visual Analysis of Large, Time-Dependent, Multi-Dimensional Smart Sensor Tracking Data

    Technological advancements over the past decade have increased our ability to collect data to previously unimaginable volumes [Kei02]. Understanding temporal patterns is key to gaining knowledge and insight. However, our capacity to store data now far exceeds the rate at which we are able to understand it [KKEM10]. This phenomenon has led to a growing need for advanced solutions to make sense and use of an ever-increasing data space. Abstract temporal data provides additional challenges in its representation, size and scalability, high dimensionality, and unique structure. One instance of such temporal data is acquired from smart sensor tags attached to freely roaming animals, recording multiple parameters at infra-second rates; these tags are becoming commonplace and are transforming biologists' understanding of the way wild animals behave. The excitement at the potential inherent in sophisticated tracking devices has, however, been limited by a lack of available software to advance research in the field. This thesis introduces methodologies to deal with the problem of analysing the large, multi-dimensional, time-dependent data acquired. Interpretation of such data is complex and currently limits the ability of biologists to realise the value of their recorded information. We present several contributions to the field of time-series visualisation, that is, the visualisation of ordered collections of real-valued data attributes at successive points in time sampled at uniform time intervals. Traditionally, time-series graphs have been used for temporal data. However, screen resolution is small in comparison to the large information space commonplace today. In such cases, we can only render a proportion of the data. It is widely accepted that the effective interpretation of large temporal data sets requires advanced methods and interaction techniques.
In this thesis, we address these issues to enhance the exploration, analysis, and presentation of time-series data for movement ecologists in their smart sensor data analysis.

    Confluence of Vision and Natural Language Processing for Cross-media Semantic Relations Extraction

    In this dissertation, we focus on extracting and understanding semantically meaningful relationships between data items of various modalities, especially relations between images and natural language. We explore ideas and techniques to integrate such cross-media semantic relations for machine understanding of large heterogeneous datasets, made available through the expansion of the World Wide Web. The datasets collected from social media websites, news media outlets and blogging platforms usually contain multiple modalities of data. Intelligent systems are needed to automatically make sense of these datasets and present them in such a way that humans can find the relevant pieces of information or get a summary of the available material. Such systems have to process multiple modalities of data such as images, text, linguistic features, and structured data in reference to each other. For example, image and video search and retrieval engines are required to understand the relations between visual and textual data so that they can provide relevant answers in the form of images and videos to users' queries presented in the form of text. We emphasize the automatic extraction of semantic topics or concepts from the data available in any form, such as images, free-flowing text or metadata. These semantic concepts/topics become the basis of semantic relations across heterogeneous data types, e.g., visual and textual data. A classic problem involving image-text relations is the automatic generation of textual descriptions of images. This problem is the main focus of our work. In many cases, a large amount of text is associated with images. Deep exploration of the linguistic features of such text is required to fully utilize the semantic information encoded in it. A news dataset involving images and news articles is an example of this scenario.
We devise frameworks for automatic news image description generation based on the semantic relations of images, as well as semantic understanding of linguistic features of the news articles.

    Challenges and Opportunities of End-to-End Learning in Medical Image Classification

    The end-to-end learning paradigm has revolutionized image recognition in recent years, but clinical application lags behind. Image-based computer-aided diagnosis systems still rely largely on highly engineered, domain-specific pipelines consisting of independent rule-based models that mirror the subtasks of image classification: localization of suspicious regions, feature extraction, and decision making. The promise of superior decision making in end-to-end learning stems from removing domain-specific constraints of limited complexity and instead optimizing all system components simultaneously, directly from the raw data, and with respect to the ultimate task. The reasons why these advantages have not yet found their way into the clinic, i.e. the challenges that arise in developing deep learning-based diagnosis systems, are manifold. The fact that the generalization ability of learning algorithms depends on how well the available training data represent the true underlying data distribution proves to be a profound problem in medical applications. Annotated datasets in this field are notoriously small, because annotation requires costly expert assessment and the pooling of smaller datasets is often hampered by data protection regulations and patient rights. Moreover, medical datasets exhibit drastically different properties with respect to imaging modalities, imaging protocols, or anisotropies, and the often ambiguous evidence in medical images can carry over into inconsistent or erroneous training annotations.
While the shift of data distributions between research environment and reality leads to reduced model robustness and is therefore currently regarded as the main obstacle to the clinical application of learning algorithms, this gap is often widened further by confounding factors such as hardware limitations or the granularity of the given annotations, which lead to discrepancies between the modeled task and the underlying clinical question. This thesis investigates the potential of end-to-end learning in clinical diagnosis systems and presents contributions to some of the key challenges that currently prevent broad clinical application. First, the last part of the classification pipeline is examined: the categorization into clinical pathologies. We demonstrate how replacing the current clinical standard of rule-based decisions with large-scale feature extraction followed by learning-based classifiers significantly improves breast cancer classification in MRI and achieves human-level performance. This approach is further demonstrated on cardiac diagnosis. Second, following the end-to-end learning paradigm, we replace the biophysical model applied for image normalization in MRI, as well as the extraction of handcrafted features, with a dedicated CNN architecture, and provide an in-depth analysis that reveals the hidden potential of learned image normalization and a complementary value of the learned features over the handcrafted ones. While this approach operates on annotated regions and therefore relies on manual annotation, in the third part we incorporate the task of localizing these regions into the learning process to enable true end-to-end diagnosis based on the raw images.
In doing so, we identify a largely neglected dilemma between the pursuit of evaluating models at clinically relevant scales on the one hand and optimizing for efficient training under data scarcity on the other. We present a deep learning model that helps resolve this trade-off, provide extensive experiments on three medical datasets as well as a series of toy experiments that examine the behavior under limited training data in detail, and publish a comprehensive framework that includes, among other things, the first 3D implementations of common object detection models. We identify further leverage points in existing end-to-end learning systems where domain knowledge can serve as a constraint to increase the robustness of models in medical image analysis, which should ultimately help pave the way for application in clinical practice. To this end, we tackle the challenge of erroneous training annotations by replacing the classification component in end-to-end object detection with regression, which makes it possible to train models directly on the continuous scale of the underlying pathological processes and thus increases the robustness of models to erroneous training annotations. We further address the challenge of input heterogeneities that trained models face when deployed at different clinical sites by proposing a model-based domain adaptation that allows the original training domain to be recovered from altered inputs and thereby ensures robust generalization.
Finally, we address the highly unsystematic, laborious, and subjective trial-and-error process of finding robust hyperparameters for a given task by translating domain knowledge into a set of systematic rules that enable automated and robust configuration of deep learning models on a variety of medical datasets. In summary, the work presented here demonstrates the enormous potential of end-to-end learning algorithms compared with the clinical standard of multi-part, highly engineered diagnosis pipelines, and presents approaches to some of the key challenges for broad application under real-world conditions, such as data scarcity, discrepancy between the task handled by the model and the underlying clinical question, ambiguities in training annotations, or shifts of data domains between clinical sites. These contributions can be seen as part of the overarching goal of automating medical image classification, an integral part of the transformation required to shape the future of healthcare.

    Proceedings of the Eighth Italian Conference on Computational Linguistics CliC-it 2021

    The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the 2020 edition, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 was the first opportunity for the Italian Computational Linguistics research community to meet in person after more than one year of full or partial lockdown.

    Multi-view representation learning for natural language processing applications

    The pervasion of machine learning in a vast number of applications has given rise to an increasing demand for the effective processing of complex, diverse and variable datasets. One representative case of data diversity can be found in multi-view datasets, which contain input originating from more than one source or having multiple aspects or facets. Examples include, but are not restricted to, multimodal datasets, where data may consist of audio, image and/or text. The nature of multi-view datasets calls for special treatment in terms of representation. A subsequent fundamental problem is that of combining information from potentially incoherent sources, a problem commonly referred to as view fusion. Quite often, the heuristic solution of early fusion is applied to this problem: aggregating representations from different views using a simple function (concatenation, summation or mean pooling). However, early fusion can cause overfitting in the case of small training samples, and it may also result in specific statistical properties of each view being lost in the learning process. Representation learning, the set of ideas and algorithms devised to learn meaningful representations for machine learning problems, has recently grown into a vibrant research field that encompasses multiple-view setups. A plethora of multi-view representation learning methods has been proposed in the literature, with a large portion of them being based on the idea of maximising the correlation between available views. Commonly, such techniques are evaluated on synthetic datasets or strictly defined benchmark setups, a role that, within Natural Language Processing, is often assumed by the multimodal sentiment analysis problem.
This thesis argues that more complex downstream applications could benefit from such representations and describes a multi-view contemplation of a range of tasks, from static, two-view, unimodal to dynamic, three-view, trimodal applications, setting out to explore the limits of the seeming applicability of multi-view representation learning. More specifically, we experiment with document summarisation, framing it as a multi-view problem where documents and summaries are considered two separate, textual views. Moreover, we present a multi-view inference algorithm for the bimodal problem of image captioning. Delving more into multimodal setups, we develop a set of multi-view models for applications pertaining to videos, including tagging and text generation tasks. Finally, we introduce narration generation, a new text generation task from movie videos that requires inference on the storyline level and temporal context-based reasoning. The main argument of the thesis is that, due to their performance, multi-view representation learning tools warrant serious consideration by the researchers and practitioners of the Natural Language Processing community. Exploring the limits of multi-view representations, we investigate their fitness for Natural Language Processing tasks and show that they are able to hold information required for complex problems, while being a good alternative to the early fusion paradigm.
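
The early fusion baseline mentioned in this abstract (concatenation, summation or mean pooling of per-view representations) can be sketched in a few lines. This is only an illustration of the three aggregation functions, not code from the thesis; the function names and the toy `text_view` and `image_view` vectors are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def _check_same_length(views: Sequence[Vector]) -> None:
    """Summation and mean pooling need every view to have the same dimension."""
    if not views:
        raise ValueError("at least one view is required")
    if len({len(v) for v in views}) != 1:
        raise ValueError("all views must have the same dimension")


def fuse_concat(views: Sequence[Vector]) -> Vector:
    """Concatenation: the fused dimension is the sum of the view dimensions."""
    return [x for view in views for x in view]


def fuse_sum(views: Sequence[Vector]) -> Vector:
    """Element-wise summation across views."""
    _check_same_length(views)
    return [sum(components) for components in zip(*views)]


def fuse_mean(views: Sequence[Vector]) -> Vector:
    """Element-wise mean pooling across views."""
    _check_same_length(views)
    return [sum(components) / len(views) for components in zip(*views)]


# Hypothetical 3-dimensional representations of the same item from two views.
text_view = [0.5, 1.0, 1.5]
image_view = [1.5, 0.0, 2.5]

concatenated = fuse_concat([text_view, image_view])  # 6 dimensions
summed = fuse_sum([text_view, image_view])           # 3 dimensions
averaged = fuse_mean([text_view, image_view])        # 3 dimensions
```

Summation and mean pooling collapse the views into a single vector of the original size, which illustrates the abstract's point that view-specific statistical properties can be lost; concatenation keeps them apart but enlarges the input dimension.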

    Recuperação multimodal e interativa de informação orientada por diversidade (Multimodal and interactive diversity-oriented information retrieval)

    Advisor: Ricardo da Silva Torres. Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação. Abstract: Information retrieval methods, especially considering multimedia data, have evolved towards the integration of multiple sources of evidence in the analysis of the relevance of items for a given user search task.
In this context, to attenuate the semantic gap between low-level features extracted from the content of digital objects and high-level semantic concepts (objects, categories, etc.), and to make these systems adaptive to different user needs, interactive models have brought the user closer to the retrieval loop, allowing user-system interaction mainly through implicit or explicit relevance feedback. Analogously, diversity promotion has emerged as an alternative for tackling ambiguous or underspecified queries. Additionally, several works have addressed the issue of minimizing the user effort required in providing relevance assessments while keeping an acceptable overall effectiveness. This thesis discusses, proposes, and experimentally analyzes multimodal and interactive diversity-oriented information retrieval methods. This work comprehensively covers the interactive information retrieval literature and also discusses recent advances, the great research challenges, and promising research opportunities. We have proposed and evaluated two relevance-diversity trade-off enhancement workflows, which integrate multiple sources of information from images, such as visual features, textual metadata, geographic information, and user credibility descriptors. In turn, as an integration of interactive retrieval and diversity promotion techniques, for maximizing the coverage of multiple query interpretations/aspects and speeding up the information transfer between the user and the system, we have proposed and evaluated a multimodal learning-to-rank method trained with relevance feedback over diversified results. Our experimental analysis shows that the joint usage of multiple information sources positively impacted the relevance-diversity balancing algorithms. Our results also suggest that the integration of multimodal-relevance-based filtering and reranking was effective in improving result relevance and also boosted diversity promotion methods.
Beyond that, with a thorough experimental analysis we have investigated several research questions related to the possibility of improving result diversity while keeping or even improving relevance in interactive search sessions. Moreover, we analyze how much the diversification effort affects overall search session results and how different diversification approaches behave for the different data modalities. By analyzing the overall and per-feedback-iteration effectiveness, we show that introducing diversity may harm initial results, whereas it significantly enhances the overall session effectiveness, considering not only relevance and diversity but also how early the user is exposed to the same amount of relevant items and diversity. Degree: Doctorate, Computer Science (Doutor em Ciência da Computação). Grant numbers: P-4388/2010, 140977/2012-0. Funding: CAPES, CNP