clear
echo on
%This script illustrates the difficulties with using linearization to analyze
%data. We look at the exponential approach to equilibrium discussed in class.
%We shall demonstrate the behavior by generating an artificial data set. We
%are measuring a series of concentrations decaying to some equilibrium
%concentration ceq, which is measured as well. The model is:
%
% c = ceq + (c0 - ceq) * exp(k * t)
%
pause
%We have the series of times at which measurements are made:
t=[0:.25:12]';
%Now we suppose that the "exact" values of c0, ceq, and k are 2, 1, and -0.2,
%respectively. Thus, the concentration measurements -should- be:
c0ex=2;
ceqex=1;
kex=-0.2;
cex=ceqex+(c0ex-ceqex)*exp(kex*t);
pause
%Now we recognize that there is some measurement error. Suppose we measure all
%the concentrations to +- 0.05 (standard deviation). Thus, we have "random" data:
sigc=0.05;
st=5; %We pick a particular "state" for the random number generator
randn('state',st);
c=cex+sigc*randn(size(cex));
%and we have the error in ceq:
ceq=ceqex+sigc*randn;
pause
%Let's plot the data up:
figure(1)
plot(t,c,'*',t,ceq*ones(size(t)),t,cex)
legend('data','measured equilibrium','original expression')
xlabel('time')
ylabel('concentration')
drawnow
%As you can see, the data looks to be randomly scattered about the "true" curve,
%as should occur!
pause
%OK, now let's see if we can recover the original parameters used in generating
%the artificial data. We linearize the model as was done in class:
b=log(c-ceq);
a=[ones(size(t)),t];
x=a\b
%And that's it! But how did we do?
pause
%The calculated value of k is just x(2):
k=x(2)
%The initial concentration fit is:
c0=ceq+exp(x(1))
%These (sometimes) aren't very close to the correct values of -0.2 and 2.0, and
%occasionally (depending on the "seed"), the fit even yields complex values!
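%To see where the linearization comes from: rearranging the model and taking
%the logarithm of both sides gives
%
% log(c - ceq) = log(c0 - ceq) + k*t
%
%which is linear in t. A least-squares fit of b = log(c-ceq) against [1, t]
%therefore has slope k and intercept log(c0-ceq), which is why k=x(2) and
%c0=ceq+exp(x(1)) above.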
pause
%We can also calculate the uncertainty in the values, using the usual assumptions
%of independence, and that error is reflected by deviation from the model:
resid=b-a*x;
varb=norm(resid)^2/(length(c)-2);
kmat=inv(a'*a)*a';
varx=kmat*varb*kmat';
%And in particular, we have the 95% error interval for k:
[k-2*varx(2,2)^.5,k+2*varx(2,2)^.5]
%Which doesn't even contain the "correct" value!
pause
%What happened? The problem is that the linearization process places excessive
%weight on certain measured values. Values close to equilibrium (large times)
%get a disproportionate weighting, and the value of ceq is weighted especially
%heavily, as it appears in every (c-ceq) value. Sometimes c-ceq is even measured
%to be negative, which yields complex values when you take the logarithm!
%This is seen most easily if we plot up the "b" values:
figure(2)
plot(t,b,'*',t,a*x)
xlabel('time')
ylabel('log(c-ceq)')
legend('data','fit')
drawnow
pause
%Examination of the graph shows that the scatter in the large-time points is
%much greater than at short times, but since all points are weighted equally in
%the linear regression formula, the fitted relationship is biased. In addition,
%the assumptions of uniform, independent error in b are incorrect: all the
%values of b contain ceq (thus there is covariance), and the use of a single
%value for ceq introduces a systematic error in b which doesn't show up in the
%scatter. Also, short-time values have a smaller uncertainty than the long-time
%values, which is not reflected in the analysis. This means that not only will
%the parameters be estimated incorrectly, but the error in the parameters will
%also be underestimated. In other words, you get a poor answer and think it is
%better than it really is, a dangerous combination...
pause
%The error could be estimated more accurately if we account for the systematic
%error induced by ceq through error propagation. Let's do this.
%First, we need an estimate of the magnitude of the error in ceq,
%which we get from the random scatter in the data:
sigcest=norm(c-(ceq+exp(a*x)))/(length(c)-2)^.5
%Which is reasonably close to the correct value of sigc
pause
%Next we take the derivative of the fitted parameters with respect to ceq:
ep=1e-8;
bt=log(c-ceq+ep);
xt=a\bt;
gradx=(xt-x)/ep;
varkceq=gradx(2)^2*sigcest^2
%which is actually much greater than the error estimate from the random
%scatter in the data alone:
vark=varx(2,2)
%Thus, the error due to the measurement of ceq totally dominates the overall
%error of the calculation! This is because the random error from the
%independent measurements at different times tends to cancel out, while the
%systematic error due to the ceq measurement doesn't!
pause
%The total error would be the sum of these two variances (more or less -
%we're ignoring some covariance here, which may reduce the error),
%which yields the new confidence interval:
[k-2*(vark+varkceq)^.5,k+2*(vark+varkceq)^.5]
%This new (more conservative) error estimate contains the correct value, but
%it's really large. Let's look at an alternative approach which gives us a
%better estimate of k, as well as tighter error bounds!
pause
%OK, so how do we do better? We can use non-linear regression! We choose
%ceq, c0, and k as our three unknown parameters (rather than fixing ceq at its
%measured value, as the linearized version requires). We define the function
%"miss.m" which calculates the 2-norm of the residual from the data. We pass in
%the data using global variables:
global tpass cpass ceqpass
tpass=t;
cpass=c;
ceqpass=ceq;
guess=[1 2 -.2]; %The initial guess for the fitting parameters
xfit=fminsearch('miss',guess)
%And we have the value of k:
knonlin=xfit(3)
%Which gets quite a bit closer to the correct value!
pause
%We can also calculate the error. First, we look at the residual of the data to
%get the error in the concentration measurements. Since miss returns the 2-norm
%of the residual, we square it to get the sum of squared residuals:
varc=miss(xfit)^2/(length(t)+1-3); %length(t)+1 data points (including ceq), three fitting parameters!
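%For reference, miss.m is not listed here; a minimal sketch consistent with its
%usage above (parameters p=[ceq,c0,k], and treating the measured ceq as one
%additional data point, which is why there are length(t)+1 points) might be:
%
% function f=miss(p)
% global tpass cpass ceqpass
% %residuals of the concentration data and of the measured equilibrium value
% r=[cpass-(p(1)+(p(2)-p(1))*exp(p(3)*tpass)); ceqpass-p(1)];
% f=norm(r); %2-norm of the residual vector
%
%The actual miss.m used in class may differ in detail.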
%Now we calculate the gradient (dependence on each measured concentration):
n=length(t);
gradf=zeros(3,n);
dc=varc^.5;
for i=1:n
  cpass=c;
  cpass(i)=cpass(i)+dc;
  txfit=fminsearch('miss',xfit);
  gradf(:,i)=(txfit-xfit)'/dc;
  echo off
end
%We also append the gradient due to the ceq measurement:
cpass=c; %restore the unperturbed concentrations first
ceqpass=ceqpass+dc;
txfit=fminsearch('miss',xfit);
gradf=[gradf,(txfit-xfit)'/dc];
ceqpass=ceq;
echo on
pause
%And we use this to calculate the error:
varxnl=gradf*varc*gradf'
%with the uncertainty interval in k:
[knonlin-2*varxnl(3,3)^.5,knonlin+2*varxnl(3,3)^.5]
%Which contains the correct value, this time!
pause
%We can also put the new curve on the plot:
figure(1)
hold on
plot(t,ceq+exp(a*x),'b',t,xfit(1)+(xfit(2)-xfit(1))*exp(xfit(3)*t),'g')
legend('data','measured equilibrium','original expression','linear fit','non-linear fit')
hold off
%For this set of data, at least, the non-linear fitted curve matches up very
%well with the "true" values. Note that the non-linear approach also doesn't
%run into the "imaginary" values that you get from the linearization if c-ceq is
%measured to be negative. You can see this most clearly if you change the "seed"
%st for the random number generator near the top of the script. The non-linear
%approach always works fairly well, while the linearized fit is often toast!
pause
%Finally, note that for most problems the linearized model will work just fine
%IF the data is sufficiently accurate. A big part of the problem here was that
%the standard deviation in the concentration measurements was fairly large -
%although not implausibly so. If you decrease sigc to 0.01 and refine the
%discretization in time (i.e., use more data points), the linearized version
%does better - but still not as well as the non-linear version.
echo off
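%A quick way to check that closing claim yourself (a suggested re-run, with
%hypothetical values, not part of the original demonstration): set
%
% sigc=0.01; t=[0:.05:12]';
%
%near the top of the script and run it again. The linearized fit should then
%come much closer to the non-linear one, though (as noted above) still not
%match it.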