%% Covariance Demonstration
% In this example we show how covariance (or lack of independence) in data
% is revealed by the residual, and how it affects error calculations. We
% have a vector of random values and we want to calculate the mean and the
% error in the mean. We do this directly (which yields the correct
% estimates) and then after smoothing the data (which results in an
% underestimate of the error). Such covariance issues often arise in time
% series data.

%% The data set
% We generate the random data. To get good statistics we use a long time
% series, but for clarity we plot only a fraction of the values.
clear
n = 10000;
x = 5 + randn(n,1);
xbar = mean(x)
sigx = std(x)
sigxbar = sigx/n^.5
figure(1)
plot(x(10:60),'o-')
xlabel('time')
ylabel('x')

%% The smoothed data set
% One way of introducing covariance into the data is to use the "smooth"
% command. This (with the default option, anyway) does a 5-point running
% average. You will see that the means of the smoothed and unsmoothed data
% match, but the standard deviation of the smoothed data is smaller by a
% factor of 5^.5, as would be expected for an average of 5 values. This
% leads to a reduction of the calculated standard deviation of the mean by
% the same factor. Thus:
xs = smooth(x);
xsbar = mean(xs)
sigxs = std(xs)
sigxsbar = sigxs/n^.5
figure(1)
hold on
plot(xs(10:60),'*-r')
legend('original independent data','smoothed data')
hold off

%% Analysis
% As you can see, the smoothed data has significant covariance. The
% standard deviation of xs is reduced (no surprise: each point is now a
% running average, so it is smaller by 5^.5), but more significantly the
% error in the mean is now calculated to be too small. This is because the
% number of independent data points in the smoothed data set isn't n, but
% rather about n/5 (not exactly).
% Thus, in calculating the "correct" error for xbar for a data set with
% covariance we should reduce n by this loss of independence (in this case
% a factor of 5). We can examine this via a shifted correlation of the
% residuals: we normalize the residuals by the standard deviation and see
% how the correlation decays in time by taking the inner product of offset
% vectors. We do this for both the original and the smoothed data sets.
% As you can see from the plot, the correlation of the original random
% data dies off immediately, while that of the smoothed data decreases
% linearly over a span corresponding to the smoothing window. The decay is
% linear because of the uniform weighting of the "moving average" smoothing
% used to generate the data; for typical time series data the correlation
% decay is usually exponential. In any event, the number of degrees of
% freedom used in calculating the error in the mean needs to be reduced,
% as the true number of independent measurements is lower than n - in this
% case by a factor of about 5. The key conclusion is that if the residuals
% are not independent, then the errors in calculated quantities such as
% the mean (or regression fitting parameters) are underestimated by a
% factor reflecting the overestimation of the number of degrees of
% freedom.
r = x - xbar;
rs = xs - xsbar;
cor = zeros(1,20);   % preallocate rather than growing in the loop
cors = zeros(1,20);
for is = 1:20
    cor(is) = r(1:end-is+1)'*r(is:end)/std(r(1:end-is+1))/std(r(is:end))/(n-is+1);
    cors(is) = rs(1:end-is+1)'*rs(is:end)/std(rs(1:end-is+1))/std(rs(is:end))/(n-is+1);
end
shift = 0:length(cor)-1;
figure(2)
plot(shift,cor,'o-',shift,cors,'*-')
xlabel('shift')
ylabel('normalized correlation')
legend('random data','smoothed data')
% Ratio of the two calculated variances of the mean - should be about 5,
% the width of the smoothing window:
variance_ratio = sigxbar^2/sigxsbar^2
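
%% Corrected error estimate
% (A minimal sketch, not part of the original demo.) We can estimate the
% effective number of independent samples directly from the measured
% correlations by dividing n by (1 + 2*sum of the positive-lag
% correlations), i.e. by the integrated autocorrelation, here truncated at
% the 20 lags computed above. For the 5-point moving average that sum is
% about 2, giving neff of roughly n/5, and the corrected error in the mean
% of the smoothed data should then agree with sigxbar.
neff = n/(1 + 2*sum(cors(2:end)))
sigxsbar_corrected = sigxs/neff^.5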