
    Universal Linear Fit Identification: A Method Independent of Data, Outliers and Noise Distribution Model and Free of Missing or Removed Data Imputation

<div><p>Data processing requires a robust linear fit identification method. In this paper, we introduce a non-parametric robust linear fit identification method for time series. The method uses the indicator <i>2/n</i> to identify a linear fit, where <i>n</i> is the number of terms in a series. For a series that agrees with a linear fit, the ratio <i>R</i><sub><i>max</i></sub> of <i>a</i><sub><i>max</i></sub><i> − a</i><sub><i>min</i></sub> to <i>S</i><sub><i>n</i></sub><i> − a</i><sub><i>min</i></sub><i>*n</i> and the ratio <i>R</i><sub><i>min</i></sub> of <i>a</i><sub><i>max</i></sub><i> − a</i><sub><i>min</i></sub> to <i>a</i><sub><i>max</i></sub><i>*n − S</i><sub><i>n</i></sub> are always equal to <i>2/n</i>, where <i>a</i><sub><i>max</i></sub> is the maximum element, <i>a</i><sub><i>min</i></sub> is the minimum element and <i>S</i><sub><i>n</i></sub> is the sum of all elements. If a series expected to follow <i>y = c</i> contains data that do not agree with the <i>y = c</i> form, then <i>R</i><sub><i>max</i></sub><i> > 2/n</i> and <i>R</i><sub><i>min</i></sub><i> > 2/n</i> imply that the maximum and the minimum elements, respectively, do not agree with the linear fit. We define the threshold values for outlier and noise detection as <i>2/n</i> * (1 + <i>k</i><sub><i>1</i></sub>) and <i>2/n</i> * (1 + <i>k</i><sub><i>2</i></sub>), respectively, where <i>k</i><sub><i>1</i></sub> > <i>k</i><sub><i>2</i></sub> and <i>0 ≤ k</i><sub><i>1</i></sub><i> ≤ n/2 − 1</i>. Given this relation and a transformation technique that transforms the data into the <i>y = c</i> form, we show that it is possible to remove all data that do not agree with the linear fit. Furthermore, the method is independent of the number of data points, missing data, removed data points and the nature of the distribution (Gaussian or non-Gaussian) of the outliers, the noise and the clean data. These are major advantages over existing linear fit methods.
Since a perfect linear relation between two variables is impossible in the real world, we used artificial data sets with extreme conditions to verify the method. The method detects the correct linear fit even when less than 50% of the data agree with the linear fit and the data that do not agree with it deviate by as little as ±10<sup>−4</sup>%. The method produces incorrect detections only when the numerical accuracy in the calculation process is insufficient.</p></div>
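As a quick numerical illustration of the <i>2/n</i> relation described in the abstract (a sketch of ours, not the authors' implementation; the helper name <code>mms_ratios</code> and the value of <i>k</i> are hypothetical), a series that is exactly linear in its index yields both ratios equal to <i>2/n</i>, while an injected outlier pushes <i>R</i><sub><i>max</i></sub> past the threshold:

```python
# Numerical check of the 2/n identity (our own sketch, not the paper's code).
# For a series that is exactly linear in its index (an arithmetic
# progression), both ratios equal 2/n; an injected outlier pushes R_max
# beyond the outlier threshold 2/n * (1 + k).

def mms_ratios(series):
    """Return (R_max, R_min) for a series expected to be linear in its index."""
    n = len(series)
    a_max, a_min, s_n = max(series), min(series), sum(series)
    r_max = (a_max - a_min) / (s_n - a_min * n)   # tests the maximum element
    r_min = (a_max - a_min) / (a_max * n - s_n)   # tests the minimum element
    return r_max, r_min

series = [3 + 2 * i for i in range(10)]           # perfectly linear, n = 10
print(mms_ratios(series))                         # (0.2, 0.2), i.e. (2/n, 2/n)

series[4] = 100                                   # inject one outlier
k = 0.5                                           # hypothetical threshold factor
r_max, r_min = mms_ratios(series)
print(r_max > 2 / len(series) * (1 + k))          # True: maximum is flagged
print(r_min > 2 / len(series) * (1 + k))          # False: minimum still agrees
```

Note that only the element responsible for the disagreement trips its ratio: the untouched minimum stays below the threshold.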

    Plots (a) and (b) show the first and second data sets of Anscombe’s quartet and use the same value of <i>k</i>. Plots (c) and (d) represent the third data set of Anscombe’s quartet and use different <i>k</i> values.

    <p>In all detections, ENNOL was set to five. When the <i>k</i> value changes, the reference point and the number of points in the linear fit are not the same for the same ENNOL (plots (c) and (d)). In all plots, the reference point (the first term of the linear fit) was automatically detected during the detection process (every point was considered as a candidate reference point). For the data sets plotted in this figure, see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.s003" target="_blank">S3 File</a>.</p>

    Improved version of the first method shown in Fig 1 for grouping outliers or noise into several groups based on different <i>k</i> values.

    <p>The method shown in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.g002" target="_blank">Fig 2</a> can also be improved for grouping outliers or noise into several groups in the same manner.</p>

    Plots show four selected windows of data captured automatically from a biogas plant in a three-minute interval (each window consists of 1,000 data points).

    <p>The left side shows the linear fit detection in relation to a particular criterion, while the right side shows the linear fit detection of the same data set in relation to narrower criteria than the left-side plot. In all cases, the method identified the most suitable linear fit in relation to the selected window. When the criteria are narrowed, the detection is sharper, and the result is a subset of the linear fit identified with the wider criteria. In all plots, the reference point (the first term of the linear fit) was automatically detected during the detection process (every point was considered as a candidate reference point). For the data sets plotted in this figure, see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.s005" target="_blank">S5 File</a>.</p>

    The first method of applying the new multiple reference point linear fit algorithm.

    <p>When the terminating conditions are fulfilled with reference to a particular reference point, outlier detection for that reference point is terminated. The process then continues with the next reference point until all reference points have been processed. Among the candidate linear fits obtained from the different successful reference points, the best linear fit is determined by considering the linear correlation coefficient and the number of data points.</p>
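The selection loop above can be sketched as follows. This is our simplification, not the paper's procedure: per-reference detection here drops the worst least-squares residual until a tolerance is met, standing in for the paper's <i>2/n</i>-based test, and all names and data are hypothetical.

```python
# Sketch of the multiple-reference-point strategy: every point is tried as a
# reference point, each run yields a candidate linear fit, and the best
# candidate is chosen by point count and linear correlation coefficient.

def ols(pts):
    """Slope and intercept of a least-squares line through (x, y) pairs."""
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return b, (sy - b * sx) / n

def pearson(pts):
    """Linear correlation coefficient of (x, y) pairs."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n; my = sum(y for _, y in pts) / n
    cov = sum((x - mx) * (y - my) for x, y in pts)
    vx = sum((x - mx) ** 2 for x, _ in pts); vy = sum((y - my) ** 2 for _, y in pts)
    return cov / (vx * vy) ** 0.5

def detect_from(pts, tol=1e-6, min_pts=5):
    """Candidate linear fit starting at a given reference point (simplified)."""
    pts = list(pts)
    while len(pts) >= min_pts:
        b, c = ols(pts)
        worst = max(pts, key=lambda p: abs(p[1] - (b * p[0] + c)))
        if abs(worst[1] - (b * worst[0] + c)) <= tol:
            return pts                            # terminating condition met
        pts.remove(worst)                         # drop one outlier, iterate
    return None                                   # no candidate from this point

data = [(x, 2 * x + 1) for x in range(20)]
data[3] = (3, 40.0); data[11] = (11, -5.0)        # two hypothetical outliers

candidates = [c for r in range(len(data)) if (c := detect_from(data[r:]))]
best = max(candidates, key=lambda c: (len(c), abs(pearson(c))))
print(len(best))                                  # 18: both outliers excluded
```

Ranking by point count first and correlation second mirrors the caption: among equally well-correlated candidates, the fit that retains the most data wins.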

    A complete process cycle for achieving a candidate data set for linear fit with reference to the second item (30) of the data set.

    <p>Legend:</p><p>**: Reference data point.</p><p><sup>‡</sup>: Term identified as the outlier in the relevant iteration.</p><p>*<sup>x</sup>: Removed in the relevant iteration and not considered for the next iteration.</p><p>The detection process must be conducted considering each term as a reference point; however, this example shows the calculations only with reference to the second item. In the first iteration, <i>MMS(a</i><sup><i>TT</i></sup><i>)</i><sub><i>max|</i>2</sub> > <i>2/n</i> fulfils the detection condition. Thus, in the first iteration, <math><mrow><msubsup><mi>a</mi><mrow>max<mo>|</mo><mn>2</mn></mrow><mrow><mi>T</mi><mi>T</mi></mrow></msubsup></mrow></math> is the term that does not agree with the linear fit. Therefore, (8, 41.81) was removed and excluded from the calculations in the second iteration. This process continued until the termination condition (<math><mrow><msubsup><mi>a</mi><mrow>max<mo>|</mo><mo> </mo><mn>2</mn></mrow><mrow><mi>T</mi><mi>T</mi></mrow></msubsup><mo> </mo><mo>=</mo><mo> </mo><mn>0</mn></mrow></math> and <math><mrow><msubsup><mi>a</mi><mrow>min<mo>|</mo><mo> </mo><mn>2</mn></mrow><mrow><mi>T</mi><mi>T</mi></mrow></msubsup><mo> </mo><mo>=</mo><mo> </mo><mn>0</mn></mrow></math>) was reached in the fourth iteration. Note that in this example, <i>k = 0</i> and <i>r = 2</i>. Also, see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.s001" target="_blank">S1 File</a> for a better understanding of the calculation process.</p>
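A hedged sketch of one such single-reference-point iteration: here we assume the transformation to the <i>y = c</i> form is the slope of each point relative to the reference point (the paper's exact <i>TT</i> transformation may differ in detail), and the data set and the outlier (5, 41.81) are made up for illustration, echoing the figure's (8, 41.81).

```python
# Sketch of a single-reference-point detection loop (our reading, not the
# paper's exact TT transformation): slopes relative to the reference point
# form a constant series y = c when the data are linear, so the 2/n test can
# flag and remove one disagreeing term per iteration.

def mms_detect(values, k=0.0, eps=1e-12):
    """Remove terms from a series expected to follow y = c until it does."""
    vals, removed = list(values), []
    while len(vals) > 2:
        n = len(vals)
        a_max, a_min, s = max(vals), min(vals), sum(vals)
        if a_max - a_min <= eps:                  # termination: pure y = c
            break
        thresh = 2 / n * (1 + k)
        r_max = (a_max - a_min) / (s - a_min * n)
        r_min = (a_max - a_min) / (a_max * n - s)
        if r_max >= r_min and r_max > thresh:
            removed.append(a_max); vals.remove(a_max)
        elif r_min > thresh:
            removed.append(a_min); vals.remove(a_min)
        else:
            break                                 # nothing exceeds the threshold
    return vals, removed

points = [(x, 3 * x + 2) for x in range(10)]
points[5] = (5, 41.81)                            # hypothetical outlier
xr, yr = points[1]                                # second item as reference point
slopes = [(y - yr) / (x - xr) for i, (x, y) in enumerate(points) if i != 1]
clean, removed = mms_detect(slopes)
print(clean)                                      # eight terms, all 3.0
print(removed)                                    # [9.2025], from the outlier
```

In a full run, this loop would be repeated with every point as the reference, as the caption notes.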

    The gradients of the linear fits shown in (a1) and (a2), (b1) and (b2), and (c1) and (c2) are ascending, descending and constant, respectively.

    <p>In data sets (a1), (b1) and (c1), all data points that do not agree with the linear fit are located on one side (non-Gaussian) of the linear fit. In data sets (a2), (b2) and (c2), all data points that do not agree with the linear fit are located on both sides of the linear fit. In all data sets, fewer than 50% of the data points agree with the linear fit. Some of the data not agreeing with the linear fit deviate by more than ±10<sup>4</sup> from the correct value, while others deviate by as little as ±10<sup>−4</sup>. Whatever the conditions, the new method was capable of identifying a robust linear fit. In all plots, the reference point is the first data point of the linear fit, which was automatically detected during the detection process (every point was considered as a candidate reference point). For the data sets plotted in this figure, see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.s002" target="_blank">S2 File</a>. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.g005" target="_blank">Fig 5</a> consists of three data sets of Anscombe’s quartet [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.ref001" target="_blank">1</a>], which can be considered as APs. As shown in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.g005" target="_blank">Fig 5</a>, the new method was capable of identifying the nearest data set that agrees with a linear fit. We set the minimum number of data points to five for all examples in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.g005" target="_blank">Fig 5</a>. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.g005" target="_blank">Fig 5(c) and 5(d)</a> represent the third data set of Anscombe’s quartet and use different <i>k</i> values.
When the <i>k</i> value changes, the reference point and the number of non-outliers are not the same for the same ENNOL. Furthermore, no masking or swamping occurred for any <i>k</i> value we used for linear fit identification.</p>

    The second method of applying the new multiple reference point linear fit algorithm.

    <p>In this method, the expected number of non-outliers (ENNOL) is used as a termination condition. When the terminating conditions are fulfilled with reference to a particular reference point, outlier detection for that reference point is terminated. The process then continues with the next reference point until all reference points have been processed. Among the candidate linear fits obtained from the different successful reference points, the best linear fit is determined by considering the linear correlation coefficient.</p>
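The ENNOL-terminated variant can be sketched as below. Again this is our simplification: the per-reference removal step drops the worst least-squares residual as a stand-in for the paper's <i>2/n</i>-based test, and the data and names are hypothetical.

```python
# Sketch of the ENNOL-terminated variant: removal stops once exactly ENNOL
# points remain for each reference point, and the best reference point is the
# one whose surviving points have the highest |correlation coefficient|.

def line(pts):
    """Least-squares slope and intercept through (x, y) pairs."""
    n = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return b, (sy - b * sx) / n

def pearson(pts):
    """Linear correlation coefficient of (x, y) pairs."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n; my = sum(y for _, y in pts) / n
    cov = sum((x - mx) * (y - my) for x, y in pts)
    vx = sum((x - mx) ** 2 for x, _ in pts); vy = sum((y - my) ** 2 for _, y in pts)
    return cov / (vx * vy) ** 0.5

def fit_ennol(pts, ennol):
    """Drop worst residuals until exactly ENNOL points remain (simplified)."""
    pts = list(pts)
    while len(pts) > ennol:                       # ENNOL as termination condition
        b, c = line(pts)
        pts.remove(max(pts, key=lambda p: abs(p[1] - (b * p[0] + c))))
    return pts

data = [(float(x), float(x)) for x in range(8)]
data[2] = (2.0, 9.0)                              # hypothetical outlier
ennol = 5
candidates = [fit_ennol(data[r:], ennol) for r in range(len(data) - ennol + 1)]
best = max(candidates, key=lambda c: abs(pearson(c)))
print(len(best), abs(pearson(best)))              # 5 points, correlation ~1.0
```

Unlike the first method, candidates here all have the same size, so the correlation coefficient alone ranks them.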

    Plots (a) and (b) show two artificial data sets, each containing a subset of data in 100% agreement with an unknown linear fit.

    <p>The number of data points agreeing with the linear fit is less than 50% of the total existing data points. In plot (a), the data points that do not agree with the linear fit lie on both sides of the linear fit, and there are four initial missing data regions of 50, 100, 100 and 50 data points (300 initial missing data in total). In plot (b), the data points that do not agree with the linear fit are located on one side of the linear fit, and there are two initial missing data regions of 100 and 150 data points (250 initial missing data in total). In both plots, the data points that do not agree with the linear fit are in the range of ±10<sup>−2</sup> to ±10<sup>4</sup>. Though both data sets represent very extreme conditions, the method was capable of locating all data points that agreed with the linear fit without swamping or masking. Zoomed views of selected regions containing values very near the linear fit demonstrate the ability of the proposed method. In plots (a) and (b), the reference points (the first term of the linear fit) were automatically detected during the detection process as 20 and 26, respectively (every point was considered as a candidate reference point). For the data sets plotted in this figure, see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0141486#pone.0141486.s004" target="_blank">S4 File</a>.</p>