clear
format compact
echo on
%%The Jackknife
%In this example, we demonstrate the jackknife - a way of estimating the
%matrix of covariance of a set of fitted parameters without requiring
%modeling of the residuals directly.  We do this by solving the problem
%over and over, leaving out one data point each time.  We then calculate
%the matrix of covariance of the fitted values.  This has the advantage of
%not relying on the residuals being normally distributed, although if the
%residuals do not have uniform variance the same issues with bias of the
%regression will occur.

%OK, we shall take as our example the "ball in air" problem we've looked at
%before:
t=[0:.1:1]';
%and we generate some "data":
xexact=[0,2,-2]'; %The exact values
a=[ones(size(t)),t,t.^2];
noise=0.05; %The amplitude of the noise
b=a*xexact+noise*randn(size(t)); %Our artificial data set.
pause

%%Basic Regression:
%We solve this using the usual regression formula:
k=inv(a'*a)*a';
x=k*b
%We calculate the error in the usual way:
r=a*x-b;
varb=r'*r/(length(r)-3); %Three degrees of freedom lost
sigb=varb^.5 %Which is (usually) close to the value of noise we put in
varx=k*varb*k'
%and in particular we have the 2-sigma confidence intervals of the x
%values:
xinterval=[x-2*diag(varx).^.5,x+2*diag(varx).^.5]
%which usually contains the exact values:
xexact
%where the probabilities are governed by the t-distribution:
prob=tcdf(2,length(b)-3)-tcdf(-2,length(b)-3)
%So the 2-sigma probability is in the 90% range (depending on n-m).
pause

%Now let's plot this up.  We want some smooth plotting:
tp=[0:.01:1]';
ap=[ones(size(tp)),tp,tp.^2];
bp=ap*x;
sigbp=diag(ap*varx*ap').^.5;
figure(1)
plot(t,b,'o',tp,ap*xexact,'k',tp,bp,tp,bp+sigbp,':r',tp,bp-sigbp,':r')
set(gca,'FontSize',14)
xlabel('time')
ylabel('position')
legend('data','exact','model','1sig confidence interval','Location','South')
pause

%%The Jackknife
%Now we look at how to estimate the matrix of covariance using the jackknife
%approach.
%We simply leave off one of the data points and re-solve for x.  We then
%subtract the average of these estimates from each one, and use the spread
%to determine the matrix of covariance:
[n m]=size(a);
xjksqall=zeros(m,m); %We initialize the array.
xjkall=zeros(m,1);
for i=1:n
iall=[1:n];
ikeep=find(iall~=i); %We keep all but the ith data point!
ajk=a(ikeep,:);
bjk=b(ikeep);
xjk=ajk\bjk;
xjksqall=xjksqall+xjk*xjk';
xjkall=xjkall+xjk;
echo off
end
echo on
%And we get the final result:
xjk=xjkall/n
%with the matrix of covariance:
varxjk=(xjksqall-n*xjk*xjk')*(n-1)/n
%And that generates the matrix, without having to assume anything about the
%matrix of covariance of b - other than assuming that the data points are
%independent, of course.
pause

%Let's compare this to the values we obtained using the error propagation
%formula:
var_ratio=varxjk./varx
%Which is (usually) close to one - it will change every time you run it.  If
%you have a large number of data points it will converge to one, as
%expected, but it takes about a thousand or so.
pause

%We can also add this to our plot:
bpjk=ap*xjk;
sigbpjk=diag(ap*varxjk*ap').^.5;
figure(2)
plot(t,b,'o',tp,ap*xexact,'k',tp,bpjk,tp,bp+sigbp,':r',tp,bpjk+sigbpjk,'--g',tp,bp-sigbp,':r',tp,bpjk-sigbpjk,'--g')
set(gca,'FontSize',14)
xlabel('time')
ylabel('position')
legend('data','exact','model','normal error','jackknife error','Location','South')
pause

%As a final note, the jackknife will work both for linear and non-linear
%regression, although it does involve solving the problem n times.  It does
%not require determining the residuals (other than assuming independence),
%but you can't use it if you -require- one of the data points in the model
%fitting (such as cr0 in Tuesday's reaction engineering problem).
%It also may be more prone to "strange results" if the number of data points
%is small, whereas the exact expressions (if the number of data points is
%still sufficient to estimate the variance in the data, or if you can get
%it in other ways) are less so.  For example, if you have a small number of
%data points and "leave off the one on the end", you will tend to get a
%much different value for the fitting parameters, leading to (on average) an
%overestimate of the variance.
echo off
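
echo on
%%Appendix: The Jackknife with Few Data Points
%As a quick sketch of that last point (an added aside - tsmall, asmall,
%bsmall, ns, and xall are new names, not part of the demo above), we can
%repeat the fit with only five data points and look at the spread of the
%leave-one-out estimates:
tsmall=[0:.25:1]';
asmall=[ones(size(tsmall)),tsmall,tsmall.^2];
bsmall=asmall*xexact+noise*randn(size(tsmall));
ns=length(bsmall);
xall=zeros(3,ns); %Each column is one leave-one-out estimate.
for i=1:ns
ikeep=find([1:ns]~=i);
xall(:,i)=asmall(ikeep,:)\bsmall(ikeep);
echo off
end
echo on
xall
%The columns for i=1 and i=ns (the end points) will tend to differ most
%from the rest, which is why the jackknife variance runs high for small n.
echo off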