Intro to Scientific Computing

PHYS 27/193

Physics Department
University of the Pacific

Fitting Data II

In the previous lecture, we did a very simple fit of a straight line to some data, extracting the value of slope of the data.

But how does gnuplot define the "Best" fit?

This goes under a very old branch of mathematics called "Linear Regression" (when fitting linear functions), or "Least Squares" fitting, or sometimes "Data Modelling".
In it's simplest form, the procedure works like this. We'll consider a very simple linear fit to three data points.

Suppose your experiment gives you these three data points:

1.0   0.7    # x1, y1
2.0   0.8    # x2, y2
3.0   1.7    # x3, y3
You define a function f(x) = ax + b which has some parameters that you can change, a and b in this case.
Each different value of a and b gives a different line; three different examples are shown below.

But again, how do you know which line is the best fit (comes closest to the data)?

For each data point (labeled by i = 1,2,3), we compute the difference between our guess function f(xi) = axi + b (at xi), and the yi value of the real data.

This difference is called the Residual at xi.   The residuals of each data point are shown as the little vertical lines ( | ) in the figure above.
Since these can be either positive or negative, we square them to get a positive number irregardless of whether the data is above or below the line.
Then we sum all the squares of residuals.

The fit program changes the parameters ( a and b in our case, but there could be more) to get the lowest Sum of the Squares of the Residuals, R, hence the name "Least Squares Fit".

For those who know calculus, what we are really doing is taking the derivative of R with respect to a and b and set those expressions to zero; this finds where R has a minimum as we vary a and b.
This gives two equations that we have to solve, sumultaneously.

These are actually easy to solve (email me if you need help).

However, gnuplot doesn't actually solve them!
It just varies a and b and sees how much R changes. If increasing a decreases R, then gnuplot increases a a little more.

Once changing the parameters by a little amount always increases R, fit stops and prints out the resulting parameters:

After 4 iterations the fit converged.
final sum of squares of residuals : 0.106667
rel. change during last iteration : -3.71111e-09

degrees of freedom    (FIT_NDF)                        : 1
rms of residuals      (FIT_STDFIT) = sqrt(WSSR/ndf)    : 0.326599
variance of residuals (reduced chisquare) = WSSR/ndf   : 0.106667

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

a               = 0.5              +/- 0.2309       (46.19%)
b               = 0.0666667        +/- 0.4989       (748.3%)

This gives the paramerters of the line that is the "Best"   fit to the data,
as well as some information about how good the fit is (how big was the sum of the squares of the residuals).

This Best line looks like this. It's the one that's closest to the data.

For more information on the Least Squares method can be found here:

as well as any good math book on statistics.


As you can see from the figure below, we need not limit ourselves to linear functions--you could do the same with almost any function and data.
Your job as a scientist is to have insight (usually from theory) as to which function is correct.

In this case, you want a quadratic function.
You would do the following

gnuplot> f(x) = a*x**2 + b*x + c
gnuplot> fit f(x) "data2.dat" via a ,b, c
gnuplot> plot "data2.dat", f(x)

Do you see that we've defined a quadratic function: ax2 + bx + c above?

Try this yourself. Here's the data: data2.dat

Here's my fit.


Hint: Operator Binding or Precedence

Above I wrote
f(x) = a*x**2 + b*x + c
without putting parentheses in. This probably bothers several of you, who would rather see the function written as:
f(x) = a*(x**2) + b*x + c
which is "safer". As with most computer math operators (like * and /), the "exponent" operator ** acts only on the previous thing. Clearly
(a*x)**2
is much different than
a*(x**2).
I have been a bit lazy and used the fact that the exponent operator "binds tighter" than the multiplication operator. When gnuplot reads the line
a*x**2 + b*x + c
tt first squares x before anything else, because ** has the highest "precedence" or "tightest binding". Then it does the multiplications a*x2 and b*x since * has next highest precedence. Finally, it does the additions because + has lowest precedence.
But we could also use:
a*(x**2) + b*x + c


Using the "Using" command

A very VERY important feature of gunplot's plot and fit functions is the using argument. It has MANY uses, some of which we'll look at below.

Plotting data from different columns

Suppose your data file has separate columns of data that are all measured at the same time. You have an example of this in your "YOURNAMEcalpop.dat" file.
Go ahead and find it now.

Before we look at it in gnuplot however, we will have to edit it slightly. It will look something like this (right?):

Year       Total US            Total California
1900       76212168          1485053
1910       92228496          2377549
...

Start emacs and Open "YOURcalpop.dat".

We have to make the first line (or lines) into a comment, ortherwise gnuplot will bail trying to plot the number "Year".
Remember how to make a line into a comment? If not see here: page 4. (Simply put a # at the beginning of the line.)

Now our file looks like this:

#Year       Total US            Total California
1900       76212168          1485053
1910       92228496          2377549
...

We can do a normal plot:

plot "calpop.dat"

which plots column 1 and 2 on the x and y axes.

However look at what this command does
gnuplot> plot "calpop.dat", "calpop.dat" using 1:3

While the first plot is the normal default one--columns 1:2 on x:y, the second part

gnuplot> plot "calpop.dat", "calpop.dat" using 1:3

plots the data in "calpop.dat" again, but this time using columns 1 and 3. You can simplify this with the abreviated version:

gnuplot> plot "calpop.dat", "" u 1:3
The empty "" means use the same datafile as before (with older versions of gnuplot you will have to supply the datafile name again),
and u is short for using.

Of course you could do more if you had more columns, such as

plot "datafile" u 1:4, "" u 1:3, "" u 3:2

etc. Notice that the last instance, u 3:2, plots the 3rd column as the x values and 2nd column as y values of the points.

Very nice, huh?

But now--Check this out!


You can modify the data on the fly with using!

Try these examples:

gnuplot> plot "calpop.dat", "" u 1:(10*$3)
gnuplot> plot "calpop.dat" u 1:(1.0/$2)
gnuplot> plot "calpop.dat" u 1:($2/$3)
In the above examples we did

u 1:(10*$3)  -> plots 10 times the value in column 3
u 1:(1/$2)   -> plots 1/y for column 2 values
u 1:($2/$3)  -> plots the ratio of the values in columns 2 and 3

Here's the syntax for modifying data with using:

If you want to refer to the value in a particular column, so that you can multiply or act with other math functions on it, etc.

  • instead of a column number, you put something in ()s.
  • inside the ()s, you must refer to data in column n as $n (ie. starting with a $).
  • then you can treat $n as the value of the number for that point.

You can make complicated functions. Suppose for example, the data in your file was

       1   1
       2   10
       3   100
       

the using command

u 1:(2*log10($2)+1)

would plot your data as if it were

       1   2*0+1    =  1
       2   2*1+1    =  3
       3   2*2+1    =  5
       
This is a really nice feature.

Here is another of my favorite handy using examples. Make sure you understand what it does.

gnuplot> plot "calpop.dat", "" u 1:($1<1950 ? $2 : 1/0)
This uses the fact that 1/0, division by zero, evaluates to undefined, so gnuplot ignores that point.

The IF/THEN (actually called the ternary operator) checks if the x value is less than 1950; if so, it returns the normal value, $2 of column 2. If x is equal to or greater than 1950, it returns 1/0, so that point is ignored.

This is a handy way to suppress plotting or fitting of some part of your data.