clear
format compact
echo on
%%Undersampling
%In this example we look at the technique of undersampling. This is
%really only appropriate for very large data sets. The idea is to
%break your data set into many subsets, calculate the fitting parameters
%for each, and then take the mean and standard deviation of these
%calculated values. Because you need a significant number of data points in
%each subset (for any reasonable statistics) and a significant number of
%subsets, the total number of data points has to be really large!

%We use the "ball in air" problem again:
t=[0:.01:1]';
%and we generate some "data":
xexact=[0,2,-2]'; %The exact values
a=[ones(size(t)),t,t.^2];
noise=0.05; %The amplitude of the noise
b=a*xexact+noise*randn(size(t)); %Our artificial data set.
pause

%%Basic Regression:
%We solve this using the usual regression formula:
k=inv(a'*a)*a';
x=k*b
%We calculate the error in the usual way:
r=a*x-b;
varb=r'*r/(length(r)-3); %Three degrees of freedom are lost
sigb=varb^.5 %Which is (usually) close to the noise amplitude we put in
varx=k*varb*k'
%In particular, we have the 2-sigma confidence intervals of the x values:
xinterval=[x-2*diag(varx).^.5,x+2*diag(varx).^.5]
%which usually contain the exact values:
xexact
%The probabilities are governed by the t-distribution:
prob=tcdf(2,length(b)-3)-tcdf(-2,length(b)-3)
%So the 2-sigma interval has a probability of about 95% (depending on n-m).

%Now let's plot this up. We want some smooth plotting:
tp=[0:.01:1]';
ap=[ones(size(tp)),tp,tp.^2];
bp=ap*x;
sigbp=diag(ap*varx*ap').^.5;
figure(1)
plot(t,b,'o',tp,ap*xexact,'k',tp,bp,tp,bp+sigbp,':r',tp,bp-sigbp,':r')
set(gca,'FontSize',14)
xlabel('time')
ylabel('position')
legend('data','exact','model','1sig confidence interval','Location','South')
pause

%%Undersampling
%Now we estimate our fitting parameters by undersampling.
[n m]=size(a);
xussqall=zeros(m,m); %We initialize the accumulator arrays.
xusall=zeros(m,1);
p=10
for i=1:p
    ikeep=[i:p:n]; %The indices we keep this time around
    a_us=a(ikeep,:);
    b_us=b(ikeep);
    xus=a_us\b_us;
    xusall=xusall+xus;
    xussqall=xussqall+xus*xus';
echo off
end
echo on
%Now we calculate the mean and matrix of covariance of these values:
xus=xusall/p
%And the covariance:
varxus=(xussqall-p*xus*xus')/(p-1)/p
%And that generates the matrix, without having to assume anything about the
%matrix of covariance of b - other than assuming that the data points are
%independent, of course. Note that we are getting the matrix of covariance
%of the -mean- of the p subsample estimates.
pause

%Let's compare this to the values we obtained using the error propagation
%formula:
var_ratio=varxus./varx
%Which is (usually) close to one - it will change every time you run it. If
%you have a large number of data points it will converge to one, as
%expected, but it takes a fair number of subsets (each with a fair number
%of points) to do it!
pause

%We can also add this to our plot:
bpus=ap*xus;
sigbpus=diag(ap*varxus*ap').^.5;
figure(2)
plot(t,b,'o',tp,ap*xexact,'k',tp,bp,tp,bp+sigbp,':r',tp,bpus+sigbpus,'--g',tp,bp-sigbp,':r',tp,bpus-sigbpus,'--g')
set(gca,'FontSize',14)
xlabel('time')
ylabel('position')
legend('data','exact','model','normal error','undersampling error','Location','South')
pause

%Undersampling is a very simple way of getting at the variance of fitted
%parameters, but it does require a very large number of data points. You
%certainly can't do it with just a few! Like the other resampling methods,
%it works for both linear and non-linear regression problems, and will also
%behave well for non-normal residuals (although it is questionable whether
%unweighted regression is appropriate if your residuals are not uniform).
%You can also use these techniques to estimate the -distribution- of the
%fitting parameters - statistics beyond the mean and variance - since
%fitting parameters are generally not normally distributed.
echo off
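
%As a minimal sketch of that last point - estimating the distribution of
%the fitting parameters rather than just their mean and variance - we can
%store every subsample estimate instead of only accumulating sums, and then
%look at empirical percentiles. The variable names below (psamp, xdist,
%xdist_interval) are new, illustrative names not used above, and prctile
%requires the Statistics Toolbox. With only 10 subsets the percentiles are
%crude; in practice you would want many more subsets:
psamp=10;
xdist=zeros(m,psamp); %One column per subsample estimate
for i=1:psamp
    ikeep=[i:psamp:n];
    xdist(:,i)=a(ikeep,:)\b(ikeep);
end
%Empirical 5th and 95th percentiles of each fitting parameter, taken
%across the subsample estimates (dimension 2):
xdist_interval=prctile(xdist,[5 95],2)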