The proliferation of consumer recording devices and video sharing websites makes the possibility of having access to multiple recordings of the same occurrence increasingly likely. These co-synchronous recordings can be identified via their audio tracks, despite local noise and channel variations. We explore a robust fingerprinting strategy to do this. Matching pursuit is used to obtain a sparse set of the most prominent elements in a video soundtrack. Pairs of these elements are hashed and stored, to be efficiently compared with one another. This fingerprinting is tested on a corpus of over 700 YouTube videos related to the 2009 U.S. presidential inauguration. Reliable matching of identical events in different recordings is demonstrated, even under difficult conditions.

Cotton, Courtenay Valentine

Ellis, Daniel P. W.

English

The principle of proportionality, as a legal principle, binds as a rule all authorities in the performance of their official duties. It has established itself as a general principle of administrative law, and it gained statutory status with the new Administrative Procedure Act (434/2003). Under section 6 of that Act, "the actions of authorities must be in proportion to the objective pursued". This thesis focuses on the national principle of proportionality, but I begin by also examining legal norms and legal principles in general. I seek to answer, among others, the following questions: what different functions does the principle of proportionality have, and what status does it acquire in the various activities of authorities? How does the principle affect the operation and decision-making of bodies exercising administrative and judicial power, and what elements does the principle actually contain? The principle of proportionality is not an unambiguous, precise, "easily" applied principle; rather, it leaves to the person applying it the power to decide which interest weighs most heavily in the decision at hand. Its practical value is high, but its deductive value is low. It is therefore difficult to derive from the principle clear instructions or the like for use in individual decisions. The principle mainly steers the decision in the right direction. Each individual case is thus different, and whether a given decision complies with the principle of proportionality is the responsibility of the official making it. The official must adopt the view generally prevailing in society of what is and what is not consistent with the requirement of proportionality. The source material consists mainly of Finnish literature, numerous court decisions, and decisions of the Parliamentary Ombudsman and the Chancellor of Justice of the Government in which the principle of proportionality has been applied.
In addition, I have used as sources the Administrative Procedure Act and the Police Act, which are essential to the principle of proportionality, together with their preparatory works, as well as several other acts that contain a proportionality requirement. The thesis is legal-dogmatic. I examine the principle of proportionality mainly through decisions in case law and in oversight practice. In terms of fields of law, the thesis is weighted towards public law and especially administrative law, but it also has points of contact with constitutional law through legislative work and the treatment of the fundamental rights in the Constitution. Examination of the principle of proportionality in relation to EC law and the European Convention on Human Rights falls outside the scope of this thesis.

RIEKKI, MILA

Trepo - Institutional Repository of Tampere University

Suhteellisuusperiaate kansallisena oikeusperiaatteena (The principle of proportionality as a national legal principle)

Columbia University Academic Commons

AUDIO FINGERPRINTING TO IDENTIFY MULTIPLE VIDEOS OF AN EVENT

Courtenay V. Cotton and Daniel P. W. Ellis∗
LabROSA, Dept. of Electrical Engineering
Columbia University, New York NY 10027 USA
{cvcotton,dpwe}@ee.columbia.edu

ABSTRACT

The proliferation of consumer recording devices and video sharing websites makes the possibility of having access to multiple recordings of the same occurrence increasingly likely. These co-synchronous recordings can be identified via their audio tracks, despite local noise and channel variations. We explore a robust fingerprinting strategy to do this. Matching pursuit is used to obtain a sparse set of the most prominent elements in a video soundtrack. Pairs of these elements are hashed and stored, to be efficiently compared with one another. This fingerprinting is tested on a corpus of over 700 YouTube videos related to the 2009 U.S. presidential inauguration. Reliable matching of identical events in different recordings is demonstrated, even under difficult conditions.

Index Terms: Acoustic signal analysis, Multimedia databases, Database searching

1. INTRODUCTION

Any notable current public event is very likely to have been captured on the personal video recorders (cameras, cellphones, etc.) of some of the people present, and many of these recordings will subsequently be published on the internet through video sharing sites. We are interested in automatically discovering these multiple recordings of the same event. It would be extremely difficult to identify these conclusively using visual information, since different recordings may be taken from entirely different viewpoints. It is, however, possible to consider doing this with the audio, since the same basic acoustic event sequence should be captured consistently by any recording made in the same vicinity.

This soundtrack matching problem has similarities with that of identifying identical musical recordings in the presence of noise and channel variations.
In both cases, we expect to see a lot of invariant underlying structure (e.g. spectral peaks) in the same relative time locations, but possibly corrupted with different channel effects and mixed with varying levels and types of noise. This problem is addressed by a number of prior works in audio fingerprinting [1]; our work is based on the approach of [2], which uses the locations of pairs of spectrogram peaks as robust features for matching. A similar approach was used to identify repeated events in environmental audio in [3], and a variant based on matching pursuit (MP) was presented in [4] to group similar but non-identical audio events. This work is closely related in spirit to the use of audio fingerprints to synchronize multiple cameras in [5] and amateur rock concert videos in [6].

In section 2 we present a strategy for using MP to select salient elements of a signal, pairing these elements to create distinguishing landmarks, and efficiently searching for matching landmarks. In section 3 we describe the video data used to test this strategy, and in section 4 we examine the precision of our search results.

∗ This work was supported by the NSF (grant IIS-0716203) and the Eastman Kodak Company.

2. ALGORITHM

2.1. Matching Pursuit

MP [7] is an algorithm for sparse signal decomposition into an over-complete dictionary of basis functions. MP basis functions, called atoms, correspond to concentrated bursts of energy localized in time and frequency, but spanning a range of time-frequency tradeoffs. The MP algorithm iteratively selects atoms corresponding to the most energetic points in the signal, as long as they can be approximated by a basis function in the dictionary. In contrast to selecting peaks with a fixed-window spectrogram representation, MP can capture salient features in the signal at varying time-frequency scales.

In our fingerprinting, each video soundtrack is decomposed into an MP representation in order to identify a sparse set of the most salient elements it contains.
It is most straightforward to decompose the entire length of a video at once, in order to avoid issues with windowing and boundary overlaps. A variable number of atoms are extracted from each video, roughly a few hundred atoms per second, although these are not uniformly distributed throughout the video. Selecting a larger number of elements than will actually be used from the signal as a whole will tend to sufficiently cover both louder and quieter portions of the signal; then a smaller number of atoms in each local area can be selected from these.

We use the efficient implementation of MP from [8]. Our dictionary contains Gabor atoms (Gaussian-windowed sinusoids) at nine length scales, incremented by powers of two. For data sampled at 22.05 kHz, this corresponds to window lengths ranging from 1.5 to 372 ms. These are each translated in time by increments of one eighth of the atom length over the duration of the signal.

2.2. Landmark Formation and Hashing

A landmark consists of a pair of atoms, and is defined only by their two center frequencies and the time difference between their temporal centers. These values are quantized to allow efficient matching between landmarks. The time resolution is 32 ms. The frequency resolution is 21.5 Hz, with only frequencies up to 5.5 kHz considered; this results in 256 discrete frequencies.

For every block of 32 time steps (around 1 second), the 15 highest energy atoms are selected. Each of these is paired with other atoms only in a local target area of the frequency-time plane. Here, each atom is paired with up to 3 others; if there are more than 3 atoms in the target area, the closest 3 in time are selected. This leads to approximately 45 landmarks per second.
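The figures quoted so far can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original system; it assumes that the shortest Gabor atom is 32 samples long, which the text does not state but which is consistent with the quoted 1.5 to 372 ms range at 22.05 kHz. All variable names are illustrative.

```python
# Sanity check of the analysis parameters quoted in sections 2.1 and 2.2.
SAMPLE_RATE_HZ = 22050
NUM_SCALES = 9  # nine length scales, each double the previous one

# Assumption (not stated in the text): the shortest atom is 32 samples long.
SHORTEST_ATOM_SAMPLES = 32

atom_lengths_samples = [SHORTEST_ATOM_SAMPLES * 2**k for k in range(NUM_SCALES)]
atom_lengths_ms = [1000.0 * n / SAMPLE_RATE_HZ for n in atom_lengths_samples]

# Atoms are translated in steps of one eighth of their own length.
hop_samples = [n // 8 for n in atom_lengths_samples]

# Landmark quantization grid.
TIME_STEP_MS = 32
FREQ_STEP_HZ = 21.5
NUM_FREQ_BINS = 256

max_freq_hz = NUM_FREQ_BINS * FREQ_STEP_HZ  # upper frequency limit, about 5.5 kHz
block_ms = 32 * TIME_STEP_MS                # one selection block, about 1 second
landmarks_per_block = 15 * 3                # 15 atoms, each paired with up to 3 others

print(f"atom lengths: {atom_lengths_ms[0]:.2f} ms to {atom_lengths_ms[-1]:.2f} ms")
print(f"frequency range covered: {max_freq_hz} Hz in {NUM_FREQ_BINS} bins")
print(f"block length: {block_ms} ms, up to {landmarks_per_block} landmarks per block")
```

Under this assumption the nine scales run from 32 to 8192 samples, and the upper bound of 45 landmarks per block follows directly from 15 atoms with at most 3 partners each.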
The target area is defined as the frequency of the initial atom, plus or minus 667 Hz, and up to 64 time steps after the initial atom.

The landmark values as quantized above can be described as a unique hash of 20 bits: 8 bits for the frequency of the first atom, 6 bits for the frequency difference between them, and 6 bits for the time difference. A hash table is constructed to store all the locations of each landmark hash value. Landmark locations are stored in the table with an identification number from the originating video and a time offset value, which is the time location of the earlier atom relative to the start of the video.

2.3. Query Matching

To find instances of the same events as in a query video, the query is decomposed with MP as described above. The video is then divided into five-second (non-overlapping) clips, and landmarks are formed from the atoms in each clip and hashed, as described in section 2.2. Each clip will contain an average of 225 landmarks. We break the query into these shorter pieces to improve the opportunity for matching subportions of videos, as well as to provide independent tests of matches between longer videos, as described below. The hash table is queried for each of the landmarks found in the five-second clip. The start time of each query landmark is treated as a reference time; this is subtracted from the offset times for landmarks returned from the table. A likely match will therefore return multiple landmarks from the same video, all reporting the same relative offset time from their corresponding query landmarks.

3. VIDEO DATABASE

We wanted to test this algorithm on a set of videos likely to contain multiple versions of the same sequence of acoustic events. We chose to consider videos taken during the 2009 American presidential inauguration. We assumed there were likely to be many different professional and personal recordings of the ceremony available, given the massive public attendance and news coverage.
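Returning briefly to the landmark hashing of section 2.2 and the query matching of section 2.3, the two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the bit layout, the offset of 32 used to store the signed frequency difference, the restriction of the time difference to 0..63 steps, and all function names are hypothetical choices made here so that the three fields fit the 8 + 6 + 6 bits described in the text.

```python
from collections import Counter, defaultdict

# Assumption: the signed frequency-bin difference (about +/-31 bins for
# +/-667 Hz at 21.5 Hz per bin) is stored with an offset of 32 in 6 bits.
DF_OFFSET = 32


def landmark_hash(f1, f2, dt):
    """Pack a quantized landmark into 20 bits: 8 (f1) + 6 (f2 - f1) + 6 (dt)."""
    df = f2 - f1 + DF_OFFSET
    if not (0 <= f1 < 256 and 0 <= df < 64 and 0 <= dt < 64):
        raise ValueError("landmark outside the quantization range")
    return (f1 << 12) | (df << 6) | dt


def build_table(videos):
    """videos: {video_id: [(t1, f1, f2, dt), ...]} -> hash table of locations."""
    table = defaultdict(list)
    for video_id, landmarks in videos.items():
        for t1, f1, f2, dt in landmarks:
            table[landmark_hash(f1, f2, dt)].append((video_id, t1))
    return table


def query_clip(table, landmarks):
    """Count landmarks that agree on the same (video, relative offset)."""
    votes = Counter()
    for t1, f1, f2, dt in landmarks:
        for video_id, t_ref in table.get(landmark_hash(f1, f2, dt), []):
            # Subtract the query landmark's start time from the stored offset.
            votes[(video_id, t_ref - t1)] += 1
    return votes


# Toy data: the query clip contains the same three landmarks as video "a",
# shifted by 70 time steps, plus one unrelated landmark.
database = {"a": [(100, 40, 50, 5), (110, 60, 55, 10), (120, 200, 190, 3)]}
clip = [(30, 40, 50, 5), (40, 60, 55, 10), (50, 200, 190, 3), (35, 10, 12, 7)]

table = build_table(database)
votes = query_clip(table, clip)
print(votes.most_common(1))
```

In this toy example all three shared landmarks vote for video "a" at a relative offset of 70 time steps, which is the pattern the text describes as a likely match.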
We obtained a set of videos from YouTube using the query “inauguration obama”. YouTube query results are limited to 1000 items; this and other complications (e.g. videos with no soundtrack) limited our actual database set to 733 videos. Other than this, no hand selection or filtering was done on the video set. The set comprises 56.2 hours of video. All audio is sampled at 22.05 kHz.

4. RESULTS

4.1. Match Evaluation

Each video was processed as above and stored in the hash table. Then each video was divided into five-second (non-overlapping) segments, and used as a query to the database. Matches to the query video itself were discounted. Matches were returned based on the proportion of identical landmarks matched to the total number of landmarks in the query segment. A lower threshold proportion will result in more matches returned. In this experiment, all matches containing at least 5% of the query landmarks and at least 10 actual landmarks were considered.

Two videos with a long stretch of matching audio will result in a number of sequential query segments matching the same video, with the same time offset. For the purpose of simplifying evaluation, all matches occurring between the same two videos at the same offset are collapsed into a single match, spanning the time from the start of the first matching segment to the end of the last matching segment. This is a reasonable assumption, since it is unlikely for two videos to match at multiple points with the same offset unless they are truly part of the same long matching segment.

4.2. Estimating Precision

The procedure described above produced 34,247 individual matches. Fig. 1 shows a histogram of the number of matches found by proportion of matching landmarks to total query landmarks, with 5% being the minimum considered.
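The acceptance rule and the collapsing step of section 4.1 can be illustrated with a short sketch. It is a hypothetical rendering of the description above, not the code used to produce the reported counts; the tuple layout of a segment match and the function names are assumptions introduced here.

```python
MIN_PROPORTION = 0.05   # at least 5% of the query landmarks
MIN_LANDMARKS = 10      # and at least 10 actual matching landmarks
SEGMENT_SECONDS = 5.0   # length of one query segment


def accept(num_matched, num_query_landmarks):
    """Apply both thresholds from section 4.1 to one segment-level match."""
    if num_query_landmarks == 0:
        return False
    proportion = num_matched / num_query_landmarks
    return num_matched >= MIN_LANDMARKS and proportion >= MIN_PROPORTION


def collapse(segment_matches):
    """Merge segment matches that share (query video, matched video, offset).

    segment_matches: iterable of (query_id, match_id, offset, segment_start_s).
    Returns {(query_id, match_id, offset): (span_start_s, span_end_s)}.
    """
    spans = {}
    for query_id, match_id, offset, start in segment_matches:
        key = (query_id, match_id, offset)
        end = start + SEGMENT_SECONDS
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


# Three consecutive segments at the same offset become one 15-second match;
# a match at a different offset stays separate.
matches = [
    ("q", "m", 70, 0.0),
    ("q", "m", 70, 5.0),
    ("q", "m", 70, 10.0),
    ("q", "m", 12, 20.0),
]
spans = collapse(matches)
print(spans)
```

Keying on the offset as well as on the pair of videos is what keeps unrelated coincidental matches between the same two videos from being merged into one long span.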
There were 91 matches which occurred above the level of 40%; manual examination of these results revealed them to be largely matches between videos in several different ‘series’, each with a signature introduction sound or music at the beginning of the video. There were also six pairs of identical or nearly identical videos in this set. All these matches are accurate, but not particularly interesting. The ‘series’ videos in general did not contain footage of the inauguration or related events. For simplicity, they and (one copy of) the six exact duplicate videos were all removed from the database. This left a set of 31,756 matches. Of these, 8186 (27%) matched over a longer time period than a single five-second clip. Given the small probability of two videos matching with the same time offset in multiple places by chance, it is reasonable to assume that most of these longer matches are accurate. This was confirmed by random checking of longer matches. Similarly, even short matches with a relatively high proportion (over 15%) of matching landmarks seem generally accurate on the basis of casual spot-checks.

We therefore wanted to examine the precision of short matches (a single five-second clip) with low percentages of matching landmarks. In order to estimate the precision of these, we randomly sampled 1.5% of them to hand check; at least 20 samples were taken at each match percentage level. Fig. 2 shows the level of precision observed in this set of matches versus the proportion of landmarks matched. A large number of the incorrect matches were between clips which both contained either music or crowd noise. Further examination revealed that a large number of these spurious matches contain a long chain of landmarks in a single frequency bin. It seems likely that the large majority of these could be automatically identified and removed in future experiments, but this work has not been completed yet.

Fig. 1. Histogram of matches found, by proportion of landmarks matched. (Horizontal axis: proportion of identical landmarks; vertical axis: number of matches found.)

4.3. Identifying Unique Recordings

In the process of examining matches above, a number of different types of accurate matches were observed. The most common at high landmark proportion levels were between videos of the same events taken by different news organizations. Another set were between videos which were obviously derived from the same original news recording, but with various levels of additional processing. Some of these were rebroadcasts by a news organization in another country, with additional narration or translation over the original footage. Others had been remixed with music. A surprisingly large number seemed to be videos taken of television screens. A very small number were discovered which had been taken by amateurs in attendance at the actual event.

Fig. 2. Precision of five-second matches, based on manual examination of random samples. (Horizontal axis: proportion of identical landmarks; vertical axis: precision of matches.)

An interesting question is how many of these independent recordings exist in the database. We observed that the various professional news recordings represented in the database tend to match each other well, since they are all very clean long-duration recordings of exactly the same chain of events. We attempted to estimate this subset of professional recordings by selecting any videos that match each other in at least 15% of the total landmarks, contain at least 25 actual matching landmarks, and are at least 20 seconds long. This described 691 matches, between 118 separate videos.

We expect amateur recordings to also match one or more of these professional videos, but likely for a shorter duration and/or at a smaller landmark percentage level. We therefore looked at the set of videos which match any of the presumed professional set described above, in at least 10% of the landmarks, with at least 20 actual landmarks, and with no minimum duration.
This yielded a set of 2130 matches, between 189 videos (in addition to the 118 above). For each of these videos, the top (highest proportion of landmarks) match was returned for examination. Many of these videos turned out to be heavily processed or remixed versions of a professional recording. A few were actually incorrect matches of the commonly mistaken kind, between videos containing either music or crowd noise. A number of them (14) were actually discovered to be independently recorded videos of the inauguration ceremony or related events that were reliably matched with professional footage of the same events. Fig. 3 demonstrates the matching landmarks in one of these amateur video results; Fig. 4 shows frames from each video. These and other examples of the matches described here can be viewed at http://www.ee.columbia.edu/~cvcotton/vidMatch.htm.

Fig. 3. Top: a clip of an amateur video of the inauguration speech; Bottom: a CNN broadcast. The two share 59 common landmarks over a 10 second period (only five seconds shown). The matching landmarks are drawn in white. (Spectrogram panels labeled Query and Match; axes: time / s and freq / kHz; color scale: level / dB.)

Fig. 4. Frames from each of the two matching videos.

5. CONCLUSIONS

The fingerprinting procedure outlined here was demonstrated to be robust to high levels of noise and channel differences. The system as demonstrated reliably returns accurate matches with very few false positives at a match threshold of around 15% of landmarks. The main shortcoming is the number of false positives that occur at lower match levels. We are, however, confident that many of these can be reliably removed in future experiments by filtering matches with landmarks occurring all or mostly in a single frequency bin.

6. REFERENCES

[1] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of audio fingerprinting,” J. VLSI Sig. Proc., vol. 41, no. 3, pp. 271–284, 2005.
[2] A. Wang, “The Shazam music recognition service,” Comm. ACM, vol. 49, no. 8, pp. 44–48, Aug. 2006.
[3] J. Ogle and D.P.W. Ellis, “Fingerprinting to identify repeated sound events in long-duration personal audio recordings,” in Proc. ICASSP, 2007, vol. I, pp. 233–236.
[4] C. Cotton and D.P.W. Ellis, “Finding similar acoustic events using matching pursuit and locality-sensitive hashing,” in Proc. WASPAA, 2009, pp. 125–128.
[5] P. Shrestha, M. Barbieri, and H. Weda, “Synchronization of multi-camera video recordings based on audio,” in Proc. 15th Int. Conf. on Multimedia. ACM, 2007, pp. 545–548.
[6] L. Kennedy and M. Naaman, “Less talk, more rock: automated organization of community-contributed collections of concert videos,” in Proc. 18th Int. Conf. on World Wide Web. ACM, 2009, pp. 311–320.
[7] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Tr. Sig. Proc., vol. 41, no. 12, Dec. 1993.
[8] S. Krstulovic and R. Gribonval, “MPTK: Matching Pursuit made tractable,” in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP’06), Toulouse, France, May 2006, vol. 3, pp. III-496–III-499.

Audio Fingerprinting to Identify Multiple Videos of an Event

TamPub Julkaisuarkisto - TamPub Institutional Repository

https://academiccommons.columbia.edu/doi/10.7916/D8DJ5QXZ/download
