A best view selection in meetings through attention analysis using a multi-camera network
Human activity analysis is an essential task in ambient intelligence and computer vision. The main focus lies in the automatic analysis of ongoing activities from a multi-camera network. One possible application is meeting analysis, which explores the dynamics in meetings using low-level data and inferring high-level activities. However, the detection of such activities is still very challenging because the low-level data is often corrupted or imprecise. In this paper, we present an approach to understand the dynamics in meetings using a multi-camera network consisting of fixed ambient and portable close-up cameras. As a particular application, we aim to find the most informative video stream, for example as a representative view for a remote participant. Our contribution is threefold. First, we estimate the extrinsic parameters of the portable close-up cameras based on head positions. Second, we find common overlapping areas based on the consensus of people’s orientation. Third, the most informative view for a remote participant is estimated using the common overlapping areas. We evaluated our proposed approach and compared it to a motion estimation method. Experimental results show that we can reach an accuracy of 74% compared to manually selected views.
Tracking and modeling focus of attention in meetings [online]
Abstract
This thesis addresses the problem of tracking the focus of
attention of people. In particular, a system to track the focus
of attention of participants in meetings is developed. Obtaining
knowledge about a person's focus of attention is an important
step towards a better understanding of what people do, how and
with what or whom they interact or to what they refer. In
meetings, focus of attention can be used to disambiguate the
addressees of speech acts, to analyze interaction and for
indexing of meeting transcripts. Tracking a user's focus of
attention also greatly contributes to the improvement of
human-computer interfaces since it can be used to build interfaces
and environments that become aware of what the user is paying
attention to or with what or whom he is interacting.
The direction in which people look, i.e., their gaze, is closely
related to their focus of attention. In this thesis, we estimate
a subject's focus of attention based on his or her head
orientation. While the direction in which someone looks is
determined by head orientation and eye gaze, relevant literature
suggests that head orientation alone is a sufficient cue for the
detection of someone's direction of attention during social
interaction. We present experimental results from a user study
and from several recorded meetings that support this hypothesis.
We have developed a Bayesian approach to model at whom or what
someone is looking based on his or her head orientation. To
estimate head orientations in meetings, the participants' faces
are automatically tracked in the view of a panoramic camera and
neural networks are used to estimate their head orientations
from pre-processed images of their faces. Using this approach,
the focus of attention target of subjects could be correctly
identified during 73% of the time in a number of evaluation
meetings with four participants.
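As a rough illustration of the Bayesian step described above, the following minimal sketch computes a posterior over focus targets from a single head pan angle. The abstract does not specify the likelihood model, so the Gaussian class-conditional densities, the target names, and all numeric values here are assumptions chosen for illustration only.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def focus_posterior(pan_deg, targets, priors):
    """Return P(target | pan) for each target via Bayes' rule.

    targets maps a target name to (mean_pan_deg, std_deg), an assumed
    Gaussian model of the head pan observed when looking at that target.
    priors maps a target name to P(target).
    """
    joint = {name: gaussian_pdf(pan_deg, mean, std) * priors[name]
             for name, (mean, std) in targets.items()}
    evidence = sum(joint.values())  # P(pan), the normalizing constant
    return {name: value / evidence for name, value in joint.items()}

# Hypothetical seating layout as seen from one participant.
targets = {"left": (-45.0, 15.0), "center": (0.0, 15.0), "right": (45.0, 15.0)}
priors = {"left": 1 / 3, "center": 1 / 3, "right": 1 / 3}

posterior = focus_posterior(-40.0, targets, priors)
best = max(posterior, key=posterior.get)
```

With these assumed parameters, a pan of -40 degrees is assigned to the target whose mean pan is closest, and the priors could be replaced by values that depend on who is speaking.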
In addition, we have investigated whether a person's focus of
attention can be predicted from other cues. Our results show
that focus of attention is correlated to who is speaking in a
meeting and that it is possible to predict a person's focus of
attention based on the information of who is talking or was
talking before a given moment.
We have trained neural networks to predict at whom a person is
looking, based on information about who was speaking. Using this
approach we were able to predict who is looking at whom with 63%
accuracy on the evaluation meetings using only information about
who was speaking. We show that by using both head orientation
and speaker information to estimate a person's focus, the
accuracy of focus detection can be improved compared to just
using one of the modalities for focus estimation.
To demonstrate the generality of our approach, we have built a
prototype system that shows focus-aware interaction with a
household robot and other smart appliances in a room using the
developed components for focus of attention tracking. In the
demonstration environment, a subject could interact with a
simulated household robot, a speech-enabled VCR or with other
people in the room, and the recipient of the subject's speech
was disambiguated based on the user's direction of attention.
Summary
This thesis deals with automatically determining and tracking
the focus of attention of people in meetings.
Determining people's focus of attention is very important for
understanding and automatically analyzing meeting records. It
can be used, for example, to find out who addressed whom at a
given point in time, or who was listening to whom. Automatic
determination of the focus of attention can furthermore be used
to improve human-machine interfaces.
An important cue for the direction in which a person directs
their attention is the person's head orientation. A method for
estimating people's head orientation was therefore developed. It
uses artificial neural networks that take pre-processed images
of a person's head as input and output an estimate of the head
orientation. On image data of new people, that is, people whose
images were not in the training set, the trained networks
achieved a mean error of nine to ten degrees for estimating
horizontal and vertical head orientation.
In addition, a probabilistic approach for determining attention
targets is presented. A Bayesian approach is used to determine
the a posteriori probabilities of different attention targets
given a person's observed head orientations. The developed
approaches were evaluated on several meetings with four to five
participants.
A further contribution of this thesis is an investigation of the
extent to which the gaze direction of meeting participants can
be predicted from who is currently speaking. A method was
developed that uses neural networks to estimate a person's focus
from a short history of speaker constellations.
We show that combining the image-based and the speaker-based
estimates of the focus of attention yields a clearly improved
estimate.
Overall, this thesis presents for the first time a system for
automatically tracking the attention of people in a meeting
room.
The developed approaches and methods can also be used to
determine people's attention in other domains, in particular for
controlling computerized, interactive environments. This is
shown with an example application.
A Geometric Approach to Sound Source Localization from Time-Delay Estimates
This paper addresses the problem of sound-source localization from time-delay
estimates using arbitrarily-shaped non-coplanar microphone arrays. A novel
geometric formulation is proposed, together with a thorough algebraic analysis
and a global optimization solver. The proposed model is thoroughly described
and evaluated. The geometric analysis, stemming from the direct acoustic
propagation model, leads to necessary and sufficient conditions for a set of
time delays to correspond to a unique position in the source space. Such sets
of time delays are referred to as feasible sets. We formally prove that every
feasible set corresponds to exactly one position in the source space, whose
value can be recovered using a closed-form localization mapping. We therefore
seek the optimal feasible set of time delays given, as input, the received
microphone signals. This time delay estimation problem is naturally cast into a
programming task, constrained by the feasibility conditions derived from the
geometric analysis. A global branch-and-bound optimization technique is
proposed to solve the problem at hand, hence estimating the best set of
feasible time delays and, subsequently, localizing the sound source. Extensive
experiments with both simulated and real data are reported; we compare our
methodology to four state-of-the-art techniques. This comparison clearly shows
that the proposed method combined with the branch-and-bound algorithm
outperforms existing methods. This in-depth geometric understanding, together
with the practical algorithms and encouraging results, opens several
opportunities for future work.
Comment: 13 pages, 2 figures, 3 table, journa
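To make the direct propagation model concrete, the sketch below computes the time delays that a known source position would produce at a small non-coplanar array, using tau_ij = (|s - m_i| - |s - m_j|) / c. This is only the forward model, not the paper's feasibility conditions, closed-form mapping, or branch-and-bound solver. The microphone layout, the source position, and the speed of sound are assumed values.

```python
import math
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def time_delays(source, mics, c=SPEED_OF_SOUND):
    """Return {(i, j): tau_ij} with tau_ij = (|s - m_i| - |s - m_j|) / c."""
    dists = [math.dist(source, m) for m in mics]
    return {(i, j): (dists[i] - dists[j]) / c
            for i, j in combinations(range(len(mics)), 2)}

# Hypothetical four-microphone array (non-coplanar) and source, in meters.
mics = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.2, 0.0), (0.0, 0.0, 0.2)]
source = (1.0, 2.0, 0.5)

delays = time_delays(source, mics)
```

Delays generated this way are consistent around any microphone triple, for example tau_01 + tau_12 = tau_02, and each delay is bounded by the microphone spacing divided by c. Measured delays generally violate such relations, which is why a constrained estimation is needed.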
Audiovisual head orientation estimation with particle filtering in multisensor scenarios
This article presents a multimodal approach to head pose estimation of individuals in environments equipped with multiple cameras and microphones, such as SmartRooms or automatic video conferencing. Determining an individual's head orientation is the basis for many forms of more sophisticated interaction between humans and technical devices and can also be used for automatic sensor selection (camera, microphone) in communications or video surveillance systems. The use of particle filters as a unified framework for the estimation of the head orientation for both monomodal and multimodal cases is proposed. In video, we estimate head orientation from color information by exploiting spatial redundancy among cameras. Audio information is processed to estimate the direction of the voice produced by a speaker, making use of the directivity characteristics of the head radiation pattern. Furthermore, two different particle filter multimodal information fusion schemes for combining the audio and video streams are analyzed in terms of accuracy and robustness. In the first one, fusion is performed at the decision level by combining the monomodal head pose estimates, while the second one uses a joint estimation system combining information at the data level. Experiments on the CLEAR 2006 evaluation database are reported, and a comparison of the proposed multimodal head pose estimation algorithms with the reference monomodal approaches demonstrates the effectiveness of the proposed approach.
GCC-PHAT based head orientation estimation
This work presents a novel two-step algorithm to estimate the
orientation of speakers in a smart-room environment equipped
with microphone arrays. First the position of the speaker is
estimated by the SRP-PHAT algorithm, and the time delay of
arrival for each microphone pair with respect to the detected
position is computed. In the second step, the value of the
cross-correlation at the estimated time delay is used as the
fundamental characteristic from which to derive the speaker orientation. The proposed method performs consistently better than other state-of-the-art acoustic techniques on a purpose-recorded database and the CLEAR head pose database.
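The quantity used in the second step can be illustrated with a minimal GCC-PHAT sketch for a single microphone pair. It is not the authors' implementation. It uses a naive O(N^2) DFT only to stay dependency-free (an FFT library would be the practical choice), treats the correlation as circular, and runs on a made-up click signal. The function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT (unnormalized), used only to stay dependency-free."""
    n = len(x)
    sign = 1 if inverse else -1
    return [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def gcc_phat(sig, ref):
    """Return the PHAT-weighted circular cross-correlation of sig against ref."""
    n = len(sig)
    cross = [s * r.conjugate() for s, r in zip(dft(sig), dft(ref))]
    # PHAT weighting: discard magnitude, keep only the phase of each bin.
    weighted = [c / (abs(c) + 1e-12) for c in cross]
    return [v.real / n for v in dft(weighted, inverse=True)]

def delay_and_peak(sig, ref):
    """Return (delay in samples, correlation value at that delay)."""
    cc = gcc_phat(sig, ref)
    n = len(cc)
    peak = max(range(n), key=lambda k: cc[k])
    delay = peak if peak <= n // 2 else peak - n  # map to a signed lag
    return delay, cc[peak]

# Hypothetical test signal: a short click, and the same click 5 samples later.
n = 32
ref = [0.0] * n
ref[3], ref[4] = 1.0, 0.5
sig = ref[-5:] + ref[:-5]  # circular shift by 5 samples

delay, peak_value = delay_and_peak(sig, ref)
```

For a clean delayed copy the peak value is close to 1. In the approach described above, the correlation would instead be read at the delay implied by the SRP-PHAT position estimate, and its value per microphone pair would serve as the orientation cue.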
Radar and RGB-depth sensors for fall detection: a review
This paper reviews recent works in the literature on the use of systems based on radar and RGB-Depth (RGB-D) sensors for fall detection, and discusses outstanding research challenges and trends related to this research field. Systems that reliably detect fall events and promptly alert carers and first responders have gained significant interest in the past few years in order to address the societal issue of an increasing number of elderly people living alone, with the associated risk of them falling and the consequences in terms of health treatments, reduced well-being, and costs. The interest in radar and RGB-D sensors is related to their capability to enable contactless and non-intrusive monitoring, which is an advantage for practical deployment and users’ acceptance and compliance, compared with other sensor technologies, such as video-cameras, or wearables. Furthermore, the possibility of combining and fusing information from heterogeneous types of sensors is expected to improve the overall performance of practical fall detection systems. Researchers from different fields can benefit from multidisciplinary knowledge and awareness of the latest developments in radar and RGB-D sensors that this paper discusses.
Real-time human ambulation, activity, and physiological monitoring: taxonomy of issues, techniques, applications, challenges and limitations
Automated methods of real-time, unobtrusive, human ambulation, activity, and wellness monitoring and data analysis using various algorithmic techniques have been subjects of intense research. The general aim is to devise effective means of addressing the demands of assisted living, rehabilitation, and clinical observation and assessment through sensor-based monitoring. The research studies have resulted in a large amount of literature. This paper presents a holistic articulation of the research studies and offers comprehensive insights along four main axes: distribution of existing studies; monitoring device framework and sensor types; data collection, processing and analysis; and applications, limitations and challenges. The aim is to present a systematic and most complete study of literature in the area in order to identify research gaps and prioritize future research directions
Examining the role of smart TVs and VR HMDs in synchronous at-a-distance media consumption
This article examines synchronous at-a-distance media consumption from two perspectives: How it can be facilitated using existing consumer displays (through TVs combined with smartphones), and imminently available consumer displays (through virtual reality (VR) HMDs combined with RGBD sensing). First, we discuss results from an initial evaluation of a synchronous shared at-a-distance smart TV system, CastAway. Through week-long in-home deployments with five couples, we gain formative insights into the adoption and usage of at-a-distance media consumption and how couples communicated during said consumption. We then examine how the imminent availability and potential adoption of consumer VR HMDs could affect preferences toward how synchronous at-a-distance media consumption is conducted, in a laboratory study of 12 pairs, by enhancing media immersion and supporting embodied telepresence for communication. Finally, we discuss the implications these studies have for the near-future of consumer synchronous at-a-distance media consumption. When combined, these studies begin to explore a design space regarding the varying ways in which at-a-distance media consumption can be supported and experienced (through music, TV content, augmenting existing TV content for immersion, and immersive VR content), what factors might influence usage and adoption and the implications for supporting communication and telepresence during media consumption