clear
clf
echo on
%This example illustrates the difference between the "Student t
%distribution" and the Gaussian normal distribution.  The key reason that
%they are different is that the sample standard deviation (that calculated
%from a sample of n measurements) is itself a random variable.  Thus, there
%is a probability that the standard deviation estimate is smaller than the
%true standard deviation.  This results in the estimated "t-statistic"
%being larger than the true "z-statistic", and thus the outlier
%probabilities (the tails of the distribution) are larger.  The t-statistic
%probability distribution collapses to the normal distribution if n is
%large enough.

%The randn command produces random numbers drawn from a normal
%distribution with mean zero and standard deviation one.  Now suppose we
%extract n measurements from this distribution.  We can calculate the mean
%and standard deviation of this sample.  These have expectation values of
%zero and one, respectively.  We can also calculate the standard deviation
%of the -average-, which has an expectation value of 1/n^.5.

pause
%Let's look at this:
n=[2,3,4,5,10,100,1000]; %We choose different sample sizes.

m=10000; %We do m realizations to look at the probability distribution.

sigx=zeros(size(n));
xbar=zeros(size(n));

for i=1:length(n)
    x=randn(n(i),m); %We do the "experiments" where the columns are a realization of x
    xbardist=sum(x)/n(i); %The distribution of xbars
    sigxdist=((sum(x.^2)-n(i)*xbardist.^2)/(n(i)-1)).^.5; %The distribution of sigma x
    xbar(i)=sum(xbardist)/m;
    sigx(i)=(sum(sigxdist.^2)/m)^.5; %We get the average variance and take square root.
    echo off
end
echo on
%We can look at these values:
n
xbar
sigx

%So all the sample means have an expectation value of zero, and the
%standard deviations have an expectation value of one.
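
%As a quick sanity check of the 1/n^.5 scaling, we can compare the
%empirical spread of the sample means from the last pass of the loop
%(n = 1000) against the prediction.  (This reuses xbardist from the loop
%above; the check is illustrative only.)
std(xbardist) %empirical standard deviation of the mean
1/n(end)^.5 %predicted value, 1/n^.5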
pause

%Now let's do it again, only this time we will calculate the "t-statistic",
%i.e. the probability distribution of (xbar-mu)/sxbar, where mu is the true
%population mean (zero here) and sxbar is the estimated standard deviation
%of the mean.  If n is large, this will just be the normal distribution -
%but if n is small, it will be significantly different.
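
%For a single sample the calculation looks like this (a standalone
%illustration; xs is a hypothetical sample of 5 measurements):
xs=randn(5,1);
t1=(mean(xs)-0)/(std(xs)/sqrt(5)) %(sample mean - true mean)/(estimated std of the mean)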

colors='rgbycmk'; %A string of plotting colors
colors=[colors,colors,colors]; %an even longer string

for i=1:length(n)
    x=randn(n(i),m); %We do the "experiments" where the columns are a realization of x
    xbardist=sum(x)/n(i); %The distribution of xbars
    sigxdist=((sum(x.^2)-n(i)*xbardist.^2)/(n(i)-1)).^.5; %The distribution of sigma x
    sigxbar=sigxdist/n(i)^.5; %The standard deviation of xbar
    
    cumprob=([1:m]-.5)/m; %The cumulative probability (note: don't name this cumsum, which shadows the built-in function)
    tdist=sort(xbardist./sigxbar); %This is the t value about a true mean of zero
    figure(1)
    plot(tdist,cumprob,colors(i))
    hold on
    echo off
end
echo on
axis([-3 3 0 1])
grid on
xlabel('t-statistic')
ylabel('cumulative probability')
title('Cumulative probability of t-statistic as a function of sample size')
labels=['n = ',num2str(n(1))];
for i=2:length(n)
    labels=char(labels,['n = ',num2str(n(i))]); %char pads the rows to equal length (str2mat is deprecated)
    echo off
end
echo on
legend(labels,'Location','SouthEast')
hold off
pause
%Examination of these results shows that the tails are much larger for
%small sample sizes.  This is because there are a number of samples for
%which the calculated standard deviation is too low.  Thus, the t-statistic
%is much larger than would be expected from a Gaussian distribution.  For
%small t-values and n>4 there isn't too much difference (say at the 1 sigma
%level), but for larger t-values there is still a large difference.  Use
%the zoom command to look at the graph in more detail.
zoom on
pause
%What does this mean in practice?  Suppose we take a sample of 3
%measurements and calculate a mean and standard deviation.  We also compute
%the standard deviation of the mean.  From the graph, there is only an 82%
%probability that the true population mean lies within 2 standard
%deviations of the sample mean.  This is much less than the roughly 95%
%probability of lying within 2 standard deviations that holds for large
%samples (where the standard deviation of the mean is itself much smaller).
%Looking at it another way, if the mean and standard deviation are
%calculated from 3 measurements, you have to go out to about +-4.3 sigma
%to get to the 95% confidence interval.  Note that the 1 sigma confidence
%level is much less sensitive than the tails: for a sample size of 3, the
%68% confidence limit is reached at +-1.3 sigma, not much different from
%+-1.0 for a large sample.  By the time you get to n=10 there isn't much
%difference at all - at the 1 sigma level, anyway.
pause
%As usual, MATLAB provides many useful statistical functions (in the
%Statistics Toolbox) to help do your analysis.  For the t-distribution the
%most useful are tcdf, its inverse tinv, and tpdf.  All of these require
%knowing the number of degrees of freedom.  We can see what tcdf does by
%adding it to the plot:
figure(1)
hold on
plot([-3:.1:3],tcdf([-3:.1:3],2),'*g')
hold off
%which matches the simulations for the two degrees of freedom (sample
%size n = 3) case.
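
%We can also check the numbers quoted earlier directly (again assuming the
%Statistics Toolbox is available):
tinv(.975,2) %about 4.30: the +-t range for 95% confidence with 2 degrees of freedom (n = 3)
tcdf(2,2)-tcdf(-2,2) %about 0.82: probability of falling within +-2 sigma for n = 3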
echo off