clear
echo on
%This script illustrates the difficulties with using linearization to analyze
%data. We look at the exponential approach to equilibrium discussed in class.
%We shall demonstrate the behavior by generating an artificial data set. We
%are measuring a series of concentrations decaying to some equilibrium
%concentration ceq, which is measured as well. The model is:
%
% c = ceq + (c0 - ceq) * exp(k * t)
%
pause
%We have the series of times at which measurements are made:
t=[0:.25:12]';
%Now we suppose that the "exact" values of c0, ceq, and k are 2, 1, and -0.2,
%respectively. Thus, the concentration measurements -should- be:
c0ex=2;
ceqex=1;
kex=-0.2;
cex=ceqex+(c0ex-ceqex)*exp(kex*t);
pause
%Now we recognize that there is some measurement error. Suppose we measure all
%the concentrations to +- 0.05 (standard deviation). Thus, we have "random" data:
sigc=0.05;
st=5; %We pick a particular "state" for the random number generator
randn('state',st);
c=cex+sigc*randn(size(cex));
%and we have the error in ceq:
ceq=ceqex+sigc*randn;
pause
%Let's plot the data up:
figure(1)
plot(t,c,'*',t,ceq*ones(size(t)),t,cex)
legend('data','measured equilibrium','original expression')
xlabel('time')
ylabel('concentration')
drawnow
%As you can see, the data looks to be randomly scattered about the "true" curve,
%as should occur!
pause
%OK, now let's see if we can recover the original parameters used in generating
%the artificial data. We linearize the model as was done in class:
b=log(c-ceq);
a=[ones(size(t)),t];
x=a\b
%And that's it! But how did we do?
pause
%The calculated value of k is just x(2):
k=x(2)
%The initial concentration fit is:
c0=ceq+exp(x(1))
%These (sometimes) aren't very close to the correct values of -0.2 and 2.0, and
%occasionally (depending on the "seed"), the fit even yields complex values!
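%To see where the linearization comes from: rearranging the model and taking
%the logarithm of both sides gives
%
% log(c - ceq) = log(c0 - ceq) + k*t
%
%which is linear in t. A least-squares fit of b = log(c-ceq) against [1, t]
%therefore has slope k and intercept log(c0-ceq), which is why k=x(2) and
%c0=ceq+exp(x(1)) above.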
pause
%We can also calculate the uncertainty in the values, using the usual assumptions
%of independence, and that error is reflected by deviation from the model:
resid=b-a*x;
varb=norm(resid)^2/(length(c)-2);
kmat=inv(a'*a)*a';
varx=kmat*varb*kmat';
%And in particular, we have the 95% error interval for k:
[k-2*varx(2,2)^.5,k+2*varx(2,2)^.5]
%Which doesn't even contain the "correct" value!
pause
%What happened? The problem is that the linearization process places excessive
%weight on certain measured values. Values close to equilibrium (large times)
%get a disproportionate weighting, and the value of ceq is weighted especially
%heavily, as it appears in every (c-ceq) value. Sometimes c-ceq is even measured
%to be negative, which yields complex values when you take the logarithm!
%This is seen most easily if we plot up the "b" values:
figure(2)
plot(t,b,'*',t,a*x)
xlabel('time')
ylabel('log(c-ceq)')
legend('data','fit')
drawnow
pause
%Examination of the graph shows that the scatter in the large-time points is
%much greater than at short times, but since all points are weighted equally in
%the linear regression formula, the fitted relationship is biased. In addition,
%the assumptions of uniform, independent error in b are incorrect: all the
%values of b contain ceq (thus there is covariance), and the use of a single
%value for ceq introduces a systematic error in b which doesn't show up in the
%scatter. Also, short-time values have a smaller uncertainty than the long-time
%values, which is not reflected in the analysis. This means that not only will
%the parameters be estimated incorrectly, but the error in the parameters will
%also be underestimated. In other words, you get a poor answer and think it is
%better than it really is, a dangerous combination...
pause
%The error could be estimated more accurately if we account for the systematic
%error induced by ceq through error propagation. Let's do this.
%First, we need an estimate of the magnitude of the error in ceq,
%which we get from the random scatter in the data:
sigcest=norm(c-(ceq+exp(a*x)))/(length(c)-2)^.5
%Which is reasonably close to the correct value of sigc
pause
%Next we take the derivative of the fitted parameters with respect to ceq:
ep=1e-8;
bt=log(c-ceq+ep);
xt=a\bt;
gradx=(xt-x)/ep;
varkceq=gradx(2)^2*sigcest^2
%which is actually much greater than the error estimate from the random
%scatter in the data alone:
vark=varx(2,2)
%Thus, the error due to the measurement of ceq totally dominates the overall
%error of the calculation! This is because the random error from the
%independent measurements at different times tends to cancel out, while the
%systematic error due to the ceq measurement doesn't!
pause
%The total error would be the sum of these two variances (more or less -
%we're ignoring some covariance here, which may reduce the error),
%which yields the new confidence interval:
[k-2*(vark+varkceq)^.5,k+2*(vark+varkceq)^.5]
%This new (more conservative) error estimate contains the correct value, but
%it's really large. Let's look at an alternative approach which gives us a
%better estimate of k, as well as tighter error bounds!
pause
%OK, so how do we do better? We can use non-linear regression! We choose
%ceq, c0, and k as our three unknown parameters (rather than fixing ceq at its
%measured value, as the linearized version requires). We define the function
%"miss.m" which calculates the 2-norm of the residual from the data. We pass in
%the data using global variables:
global tpass cpass ceqpass
tpass=t;
cpass=c;
ceqpass=ceq;
guess=[1 2 -.2]; %The initial guess for the fitting parameters
xfit=fminsearch('miss',guess)
%And we have the value of k:
knonlin=xfit(3)
%Which gets quite a bit closer to the correct value!
pause
%We can also calculate the error. First, we look at the residual of the data to
%get the error in the concentration measurements. Since miss returns the 2-norm
%of the residual, we square it to get the sum of squared residuals:
varc=miss(xfit)^2/(length(t)+1-3); %length(t)+1 data points (including ceq), three fitting parameters!
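%For reference, miss.m is not listed here; a minimal sketch consistent with its
%usage above (parameters p=[ceq,c0,k], and treating the measured ceq as one
%additional data point, which is why there are length(t)+1 points) might be:
%
% function f=miss(p)
% global tpass cpass ceqpass
% %residuals of the concentration data and of the measured equilibrium value
% r=[cpass-(p(1)+(p(2)-p(1))*exp(p(3)*tpass)); ceqpass-p(1)];
% f=norm(r); %2-norm of the residual vector
%
%The actual miss.m used in class may differ in detail.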
%Now we calculate the gradient (dependence on each measured concentration):
n=length(t);
gradf=zeros(3,n);
dc=varc^.5;
for i=1:n
  cpass=c;
  cpass(i)=cpass(i)+dc;
  txfit=fminsearch('miss',xfit);
  gradf(:,i)=(txfit-xfit)'/dc;
  echo off
end
%We also append the gradient due to the ceq measurement:
cpass=c; %restore the unperturbed concentrations first
ceqpass=ceqpass+dc;
txfit=fminsearch('miss',xfit);
gradf=[gradf,(txfit-xfit)'/dc];
ceqpass=ceq;
echo on
pause
%And we use this to calculate the error:
varxnl=gradf*varc*gradf'
%with the uncertainty interval in k:
[knonlin-2*varxnl(3,3)^.5,knonlin+2*varxnl(3,3)^.5]
%Which contains the correct value, this time!
pause
%We can also put the new curve on the plot:
figure(1)
hold on
plot(t,ceq+exp(a*x),'b',t,xfit(1)+(xfit(2)-xfit(1))*exp(xfit(3)*t),'g')
legend('data','measured equilibrium','original expression','linear fit','non-linear fit')
hold off
%For this set of data, at least, the non-linear fitted curve matches up very
%well with the "true" values. Note that the non-linear approach also doesn't
%run into the "imaginary" values that you get from the linearization if c-ceq is
%measured to be negative. You can see this most clearly if you change the "seed"
%st for the random number generator near the top of the script. The non-linear
%approach always works fairly well, while the linearized fit is often toast!
pause
%Finally, note that for most problems the linearized model will work just fine
%IF the data is sufficiently accurate. A big part of the problem here was that
%the standard deviation in the concentration measurements was fairly large -
%although not implausibly so. If you decrease sigc to 0.01 and refine the
%discretization in time (i.e., use more data points), the linearized version
%does better - but still not as well as the non-linear version.
echo off
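%A quick way to check that closing claim yourself (a suggested re-run, with
%hypothetical values, not part of the original demonstration): set
%
% sigc=0.01; t=[0:.05:12]';
%
%near the top of the script and run it again. The linearized fit should then
%come much closer to the non-linear one, though (as noted above) still not
%match it.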