%% Covariance Demonstration
% In this example we show how covariance (or lack of independence) in data
% is revealed by the residual, and how it affects error calculations. We
% have a vector of random values and we want to calculate the mean and the
% error in the mean. We do this directly (which yields the correct
% estimates) and then after smoothing the data (which results in an
% underestimate of the error). Such covariance issues often arise in time
% series data.

%% The data set
% We generate the random data. To get good statistics we use a long time
% series, but for clarity we plot only a fraction of the values.
clear
n = 10000;
x = 5 + randn(n,1);
xbar = mean(x)
sigx = std(x)
sigxbar = sigx/n^.5
figure(1)
plot(x(10:60),'o-')
xlabel('time')
ylabel('x')

%% The smoothed data set
% One way of introducing covariance into the data is to use the "smooth"
% command. This (with the default option, anyway) does a 5-point running
% average. You will see that the means of the smoothed and unsmoothed data
% match, but the standard deviation of the smoothed data is smaller by a
% factor of 5^.5, as would be expected for an average of 5 values. This
% leads to a reduction of the calculated standard deviation of the mean by
% the same factor. Thus:
xs = smooth(x);
xsbar = mean(xs)
sigxs = std(xs)
sigxsbar = sigxs/n^.5
figure(1)
hold on
plot(xs(10:60),'*-r')
legend('original independent data','smoothed data')
hold off

%% Analysis
% As you can see, the smoothed data has significant covariance. The
% standard deviation of xs is reduced (no surprise: each point is now a
% running average, so it is smaller by 5^.5), but more significantly the
% error in the mean is now calculated to be too small. This is because the
% number of independent data points in the smoothed data set isn't n, but
% rather about n/5 (not exactly).
% Thus, in calculating the "correct" error for xbar for a data set with
% covariance we should reduce n by this loss of independence (in this case
% a factor of 5). We can examine this via a shifted correlation of the
% residuals: we normalize the residuals by the standard deviation and see
% how the correlation decays in time by taking the inner product of offset
% vectors. We do this for both the original and the smoothed data sets.
% As you can see from the plot, the correlation of the original random
% data dies off immediately, while that of the smoothed data decreases
% linearly over a span corresponding to the smoothing window. The decay is
% linear because of the uniform weighting of the "moving average" smoothing
% used to generate the data; for typical time series data the correlation
% decay is usually exponential. In any event, the number of degrees of
% freedom used in calculating the error in the mean needs to be reduced,
% as the true number of independent measurements is lower than n - in this
% case by a factor of about 5. The key conclusion is that if the residuals
% are not independent, then the errors in calculated quantities such as
% the mean (or regression fitting parameters) are underestimated by a
% factor reflecting the overestimation of the number of degrees of
% freedom.
r = x - xbar;
rs = xs - xsbar;
cor = zeros(1,20);   % preallocate rather than growing in the loop
cors = zeros(1,20);
for is = 1:20
    cor(is) = r(1:end-is+1)'*r(is:end)/std(r(1:end-is+1))/std(r(is:end))/(n-is+1);
    cors(is) = rs(1:end-is+1)'*rs(is:end)/std(rs(1:end-is+1))/std(rs(is:end))/(n-is+1);
end
shift = 0:length(cor)-1;
figure(2)
plot(shift,cor,'o-',shift,cors,'*-')
xlabel('shift')
ylabel('normalized correlation')
legend('random data','smoothed data')
% Ratio of the two calculated variances of the mean - should be about 5,
% the width of the smoothing window:
variance_ratio = sigxbar^2/sigxsbar^2
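
%% Corrected error estimate
% (A minimal sketch, not part of the original demo.) We can estimate the
% effective number of independent samples directly from the measured
% correlations by dividing n by (1 + 2*sum of the positive-lag
% correlations), i.e. by the integrated autocorrelation, here truncated at
% the 20 lags computed above. For the 5-point moving average that sum is
% about 2, giving neff of roughly n/5, and the corrected error in the mean
% of the smoothed data should then agree with sigxbar.
neff = n/(1 + 2*sum(cors(2:end)))
sigxsbar_corrected = sigxs/neff^.5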