
    Constructing fading histograms from data streams

    The ability to collect data is changing drastically. Nowadays, data are gathered in the form of transient and finite data streams. Memory restrictions preclude keeping all received data in memory. When dealing with massive data streams, it is mandatory to create compact representations of data, also known as synopsis structures or summaries. Reducing memory occupancy is of utmost importance when handling a huge amount of data. This paper addresses the problem of constructing histograms from data streams under error constraints. When constructing online histograms from data streams, there are two main characteristics to consider: how easily the histogram can be updated and the error of the histogram. Moreover, in dynamic environments, besides the need for compact summaries that capture the most important properties of data, it is also essential to forget old data. Therefore, this paper presents sliding histograms and fading histograms, an abrupt and a smooth strategy, respectively, for forgetting outdated data.
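The smooth forgetting strategy described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes fixed equal-width bins and an exponential fading factor `alpha` applied to all counts before each update. The class name, the parameter names, and the clamping of out-of-range values are hypothetical choices made for the example.

```python
class FadingHistogram:
    """Equal-width histogram whose counts decay as new data arrive (illustrative)."""

    def __init__(self, low, high, n_bins, alpha=0.99):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        if high <= low or n_bins < 1:
            raise ValueError("need high > low and n_bins >= 1")
        self.low = low
        self.width = (high - low) / n_bins
        self.n_bins = n_bins
        self.alpha = alpha  # assumed fading factor; alpha = 1 means no forgetting
        self.counts = [0.0] * n_bins

    def _bin(self, x):
        # Values outside [low, high) are clamped to the first or last bin.
        i = int((x - self.low) / self.width)
        return min(max(i, 0), self.n_bins - 1)

    def update(self, x):
        # Fade every existing count, then add the new observation with weight 1.
        self.counts = [c * self.alpha for c in self.counts]
        self.counts[self._bin(x)] += 1.0

    def frequencies(self):
        total = sum(self.counts)
        if total == 0:
            return [0.0] * self.n_bins
        return [c / total for c in self.counts]
```

With `alpha` close to 1 old observations fade slowly; a sliding histogram would instead drop observations abruptly once they leave a fixed-size window.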

    Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality

    The final publication is available at Springer via http://dx.doi.org/10.1007/s10618-014-0378-6. Published online.
    Knowledge discovery on biomedical data can be based on on-line, data-stream analyses, or on retrospective, timestamped, off-line datasets. In both cases, changes in the processes that generate data or in their quality features through time may hinder either the knowledge discovery process or the generalization of past knowledge. These problems can be seen as a lack of data temporal stability. This work establishes temporal stability as a data quality dimension and proposes new methods for its assessment based on a probabilistic framework. Concretely, methods are proposed for (1) monitoring changes, and (2) characterizing changes and trends and detecting temporal subgroups. First, a probabilistic change detection algorithm is proposed based on the Statistical Process Control of the posterior Beta distribution of the Jensen–Shannon distance, with a memoryless forgetting mechanism. This algorithm (PDF-SPC) classifies the degree of current change into three states: In-Control, Warning, and Out-of-Control. Second, a novel method is proposed to visualize and characterize the temporal changes of data based on the projection of a non-parametric information-geometric statistical manifold of time windows. This projection facilitates the exploration of temporal trends using the proposed IGT-plot and, by means of unsupervised learning methods, the discovery of conceptually related temporal subgroups. Methods are evaluated using real and simulated data based on the National Hospital Discharge Survey (NHDS) dataset.
    The work by C Saez has been supported by an Erasmus Lifelong Learning Programme 2013 Grant. This work has been supported by own IBIME funds. The authors thank Dr. Gregor Stiglic, from the University of Maribor, Slovenia, for his support on the NHDS data.
    Sáez Silvestre, C.; Pereira Rodrigues, P.; Gama, J.; Robles Viejo, M.; García Gómez, JM. (2014). Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality. Data Mining and Knowledge Discovery. 28:1-1. doi:10.1007/s10618-014-0378-6
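The Jensen–Shannon distance named in this abstract can be computed for two discrete distributions as sketched below. This is only the distance itself, using base-2 logarithms so that the result lies in [0, 1]. The Beta posterior, the control limits, and the forgetting mechanism of PDF-SPC are not reproduced here, and the function names are hypothetical.

```python
import math


def kl_divergence_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    # The mixture m is positive wherever p or q is positive, so no division by zero occurs.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence_bits(p, m) + 0.5 * kl_divergence_bits(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))
```

Identical distributions give a distance of 0 and distributions with disjoint support give 1, which makes the value convenient to monitor between a reference window and the current window.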

    Fading histograms in detecting distribution and concept changes

    The remarkable number of real applications under dynamic scenarios is driving a novel ability to generate and gather information. Nowadays, a massive amount of information is generated at a high-speed rate, known as data streams. Moreover, data are collected under evolving environments. Due to memory restrictions, data must be promptly processed and discarded immediately. Therefore, dealing with evolving data streams raises two main questions: (i) how to remember discarded data? and (ii) how to forget outdated data? To maintain an updated representation of the time-evolving data, this paper proposes fading histograms. Regarding the dynamics of nature, changes in data are detected through a windowing scheme that compares data distributions computed by the fading histograms: the adaptive cumulative windows model (ACWM). The online monitoring of the distance between data distributions is evaluated using a dissimilarity measure based on the asymmetry of the Kullback–Leibler divergence. The experimental results support the ability of fading histograms to provide an updated representation of data. This property favors detecting distribution changes with a smaller detection delay than standard histograms. With respect to the detection of concept changes, the ACWM is compared with three known algorithms taken from the literature, using artificial data and public data sets, and presents better results. Furthermore, the proposed method was extended to multidimensional data, and the experiments performed show the ability of the ACWM to detect distribution changes in these settings.
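The abstract does not give the formula for its dissimilarity measure. One plausible reading of "based on the asymmetry of the Kullback–Leibler divergence" is the absolute difference between the two directions of the divergence, and the sketch below implements that reading as an assumption. The function names and the small `eps` smoothing term (added so empty bins do not produce a division by zero) are hypothetical and not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with eps smoothing so that empty bins in q are tolerated."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def kl_asymmetry(p, q, eps=1e-12):
    """Assumed dissimilarity: |KL(p || q) - KL(q || p)| between two bin-frequency vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    return abs(kl_divergence(p, q, eps) - kl_divergence(q, p, eps))
```

In a windowing scheme, `p` and `q` would be the normalized frequencies of a reference window and a current window, and a change would be signalled when the value exceeds a chosen threshold. Note that this quantity is 0 for identical distributions but can also be 0 for some differing ones, so it is a dissimilarity rather than a metric.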

    A study on the evaluation of error-correcting coding schemes for 5G wireless communication

    Waseda University degree number: Shin 8267. Waseda University

    Adaptive real-time detection of multidimensional anomalies (Mukautuva moniulotteisten poikkeavuuksien tunnistaminen reaaliaikaisesti)

    Data volumes are growing at a high speed as data emerges from millions of devices. This brings an increasing need for streaming analytics, that is, processing and analysing the data in a record-by-record manner. In this work a comprehensive literature review on streaming analytics is presented, focusing on detecting anomalous behaviour. Challenges and approaches for streaming analytics are discussed. Different ways of determining and identifying anomalies are shown and a large number of anomaly detection methods for streaming data are presented. Existing software platforms and solutions for streaming analytics are also presented. Based on the literature survey I chose one method for further investigation, namely Lightweight on-line detector of anomalies (LODA). LODA is designed to detect anomalies in real time, even from high-dimensional data. In addition, it is an adaptive method and updates the model on-line. LODA was tested on both synthetic and real data sets. This work shows how to define the parameters used with LODA. I present several improvement ideas for LODA and show that three of them bring important benefits. First, I show a simple addition for handling special cases that allows computing an anomaly score for all data points. Second, I show cases where LODA fails due to lack of data preprocessing. I suggest preprocessing schemes for streaming data and show that using them improves the results significantly, while requiring only a small subset of the data for determining preprocessing parameters. Third, since LODA only gives anomaly scores, I suggest thresholding techniques to define anomalies. This work shows that the suggested techniques work fairly well compared to the theoretical best performance. This makes it possible to use LODA in real streaming analytics situations.
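To illustrate the kind of detector this abstract discusses, the following is a simplified batch sketch of the general LODA idea: an ensemble of sparse random projections, each with a one-dimensional histogram, scored by the average negative log density. It is not the thesis's implementation and omits the on-line model update. The class name, the defaults, and the `floor` density used for empty or out-of-range bins are assumptions made for the example.

```python
import math
import random


class LodaSketch:
    """Simplified batch version of a LODA-style anomaly scorer (illustrative only)."""

    def __init__(self, dim, n_projections=50, n_bins=20, seed=0):
        rng = random.Random(seed)
        n_nonzero = max(1, round(math.sqrt(dim)))  # sparse projection vectors
        self.n_bins = n_bins
        self.weights = []
        for _ in range(n_projections):
            w = [0.0] * dim
            for j in rng.sample(range(dim), n_nonzero):
                w[j] = rng.gauss(0.0, 1.0)
            self.weights.append(w)
        self.hists = []  # one (low, high, width, densities) tuple per projection

    @staticmethod
    def _project(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def fit(self, data):
        self.hists = []
        for w in self.weights:
            z = [self._project(w, x) for x in data]
            low, high = min(z), max(z)
            width = (high - low) / self.n_bins or 1.0  # avoid zero width
            counts = [0] * self.n_bins
            for v in z:
                counts[min(int((v - low) / width), self.n_bins - 1)] += 1
            densities = [c / (len(z) * width) for c in counts]
            self.hists.append((low, high, width, densities))
        return self

    def score(self, x, floor=1e-9):
        """Average negative log density over all projections; higher means more anomalous."""
        total = 0.0
        for w, (low, high, width, densities) in zip(self.weights, self.hists):
            v = self._project(w, x)
            if low <= v <= high:
                p = densities[min(int((v - low) / width), self.n_bins - 1)]
            else:
                p = 0.0  # outside the range seen during fitting
            total += -math.log(max(p, floor))
        return total / len(self.weights)
```

Because the score is only a number, a threshold still has to be chosen to turn it into an anomaly decision, and features on very different scales would need preprocessing before the projections are meaningful.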

    Catalogs of Hot White Dwarfs in the Milky Way from GALEX's Ultraviolet Sky Surveys. Constraining Stellar Evolution

    We present comprehensive catalogs of hot star candidates in the Milky Way, selected from GALEX far-UV (FUV, 1344-1786 Å) and near-UV (NUV, 1771-2831 Å) imaging. The FUV and NUV photometry allows us to extract the hottest stellar objects, in particular hot white dwarfs (WD), which are elusive at other wavelengths because of their high temperatures and faint optical luminosities. We generated catalogs of UV sources from two GALEX surveys: AIS (All-Sky Imaging Survey, depth ABmag~19.9/20.8 in FUV/NUV) and MIS (Medium-depth Imaging Survey, depth ~22.6/22.7 mag). The two catalogs (from the GALEX fifth data release) contain 65.3/12.6 million (AIS/MIS) unique UV sources with error(NUV)<0.5 mag, over 21,435/1,579 square degrees. We also constructed subcatalogs of the UV sources with matched optical photometry from SDSS (7th data release): these contain 0.6/0.9 million (AIS/MIS) sources with errors <0.3 mag in both FUV and NUV, excluding sources with multiple optical counterparts, over an area of 7,325/1,103 square degrees. All catalogs are available online. We then selected 28,319 (AIS)/9,028 (MIS) matched sources with FUV-NUV<-0.13; this color cut corresponds to stellar Teff hotter than ~18,000 K. An additional color cut of NUV-r>0.1 isolates binaries with largely differing Teff's, and some intruding QSOs. Available spectroscopy for a subsample indicates that hot-star candidates with NUV-r<0.1 have negligible contamination by non-stellar objects. We discuss the distribution of sources in the catalogs, and the effects of error and color cuts on the samples. The density of hot-star candidates increases from high to low Galactic latitudes, but drops on the MW plane due to dust extinction. Our hot-star counts at all latitudes are better matched by Milky Way models computed with an initial-final mass relation that favours lower final masses. (ABRIDGED) Comment: To appear in MNRAS. Better printed in colou
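The two color cuts quoted in this abstract can be written as a small selection function. This is a hedged sketch: the function name, the argument names, and the returned labels are hypothetical, magnitudes are assumed to be in the AB system used by the catalogs, and the treatment of a source with NUV-r exactly equal to 0.1 is an assumption because the abstract does not specify it.

```python
def classify_uv_source(fuv, nuv, r):
    """Apply the FUV-NUV and NUV-r color cuts quoted in the abstract to one source."""
    if fuv - nuv >= -0.13:
        # Fails the hot-star cut (FUV-NUV < -0.13, roughly Teff hotter than ~18,000 K).
        return "not_selected"
    if nuv - r > 0.1:
        # Red in NUV-r: binaries with largely differing Teff, plus some QSOs.
        return "binary_or_qso_candidate"
    # NUV-r <= 0.1 (boundary handling assumed): hot-star candidate.
    return "hot_star_candidate"
```

For example, a source with FUV = 18.0, NUV = 18.5 and r = 19.0 passes the first cut and is blue in NUV-r, so it is labelled a hot-star candidate.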

    Enhancement of Historical Printed Document Images by Combining Total Variation Regularization and Non-Local Means Filtering

    This paper proposes a novel method for document enhancement which combines two recent powerful noise-reduction steps. The first step is based on the total variation framework. It flattens background grey-levels and produces an intermediate image where background noise is considerably reduced. This image is used as a mask to produce an image with a cleaner background while keeping character details. The second step is applied to the cleaner image and consists of a filter based on non-local means: character edges are smoothed by searching for similar image patches in pixel neighborhoods. The document images to be enhanced are real historical printed documents from several periods which include several defects in their background and on character edges. These defects result from scanning, paper aging and bleed-through. The proposed method enhances document images by combining the total variation and the non-local means techniques in order to improve OCR results. The method is shown to be more effective than either technique used alone and than other enhancement methods.