
Geostatistics



Geostatistics is the application of statistics to the characterization and modeling of geographically distributed data. It is based on the observation that locations that are close to each other will generally have characteristics that are more similar than locations that are far apart, and as such the variation of these characteristics is distinguishable from random positioning. Geostatistics therefore accounts for distance itself as a characteristic of data that is potentially correlated. ArcGIS provides a number of tools that help test geospatial data for statistically significant distributions and compare the accuracy of distance-based models.


Procedures

  1. Determining the Descriptive Statistics of an Attribute

  2. Symbolizing an Attribute by a Defined Interval

  3. Symbolizing an Attribute by Its Standard Deviation

  4. Calculating an Attribute’s Standard Score

  5. Visualizing Attribute Relationships with Scatterplots

  6. Multiple Linear Regression


Getting Started

Since this tutorial will be using specific maps and data, the first step is to make your own copy of the tutorial data.

Set Up: Getting the Tutorial Data

  1. In the Windows Explorer, navigate to the network drive  K: (aka \\Software\Winsoft), open the folder  Maps, and then open the folder  Introduction to GIS.
  2. Drag the folder  Geostatistics and its contents to either:
    1. your network drive  U:, e.g. into the folder  My Documents; or
    2. the local hard drive  C:, e.g. onto your Desktop.
  3. The folder  Geostatistics contains the following layer, which comes from MassGIS:

      MASSCENSUS2010BLOCKGROUPS.shp

Since some — but not all — of the ArcGIS components have trouble handling names with spaces or special symbols, do not rename the folders or files.

Set Up: Initializing ArcMap and Adding Data

  1. Start up the  ArcMap software (see Procedure 1 in Constructing and Sharing Maps for details).
  2. Click on the button Add Data Icon Add Data.
  3. In the dialog Add Data, navigate into the folder  Geostatistics; if necessary, make a new connection to it (see Procedure 2 in Constructing and Sharing Maps for details).
  4. Double-click on the file  MASSCENSUS2010BLOCKGROUPS.shp to add it to the map.
  5. ArcMap will now display the census block groups in Massachusetts:



    Each block group is an aggregate of several census blocks, with anywhere from 500 to 2500 people living within one, and there are roughly three block groups per census tract.

    Census block groups are the smallest region available for some sensitive attributes such as income.

Basic Descriptive Statistics

Before discussing geostatistics, a review of basic descriptive statistics of data sets will be helpful.


Extremes and Middles

In this course we’ve seen many geographic objects with different characteristics, e.g. states with different populations, schools with different ethnicities, and waterfalls with different heights.

We can review such data in ArcGIS attribute tables, and at first glance they may appear to be random sets of numbers, e.g. the population figures in column POP_2010 in this census-block-group data set:

Census Block Attribute Table with Population

There are, however, several characteristics that we can extract from this set of values using the ArcGIS skills we have developed. The number of records or count is given at the bottom of the attribute table, and is symbolized by N:

Count of block groups: N = 4979

Let’s describe the data set by the symbols A = {a1, a2, a3, …, aN }. The minimum and maximum values, which together define the range, can be quickly determined by sorting the population field, as shown in the image at the right (see Constructing and Sharing Maps for details):

Minimum population: amin = 0
Maximum population: amax= 6288

In addition, we can locate a value that describes the “middle” of a data set by dividing sorted records into two equal halves. If there are an odd number of records, there is one value in the middle (highlighted in the image at the right); otherwise a value halfway between the two middle values is used. This value is called the median, and is symbolized by m:

Median of population: m = 1170

“Median” can also be used as an adjective, such as the semiredundant “median value”, or the “median census block group”, in this case 250277102004 (though the two others with a population of 1170 could also be used, since the block group number is a semiarbitrary assignment).

Another way to describe the middle of the data set might be to find the midpoint μ0, halfway between the minimum and maximum values:

μ0 − amin = amax − μ0

and therefore:

μ0 = (amin + amax)/2

Midpoint of population: μ0 = 3144

This formula can be generalized to include all of the values, resulting in a more representative middle value known as the average or mean value:

μ = ( a1 + a2 + … + aN ) / N = (1/N) Σi ai

The symbols in this expression are:

μ: the mean value (this is the Greek letter “mu”)
ai: one of the values in the data set A = {a1, a2, a3, …, aN }
i: an integer that indexes the values from 1 to N.
Σ: the summation symbol, representing the sum of all of the values (this is the Greek capital letter “Sigma”)

The mean has the property that its separation from all of the data values is, in one way, as small as possible (see the next section).

For a large data set programs like Excel make it relatively easy to calculate the mean, once you have entered all of the data. ArcGIS will also calculate the mean and other characteristics of any numerical attribute.
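For example, here is a minimal Python sketch of these statistics; the short list of values is hypothetical, standing in for the POP_2010 column (e.g. exported from the attribute table):

    import statistics

    populations = [0, 6, 9, 1170, 1171, 1172, 6288]   # hypothetical values

    N = len(populations)                        # count
    a_min, a_max = min(populations), max(populations)
    m = statistics.median(populations)          # middle of the sorted values
    mu_0 = (a_min + a_max) / 2                  # midpoint of the extremes
    mu = statistics.mean(populations)           # the mean

    print(N, a_min, a_max, m, mu_0, mu)

Procedure 1 below shows how to read the same numbers directly from ArcGIS.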

Procedure 1: Determining the Basic Descriptive Statistics of an Attribute

  1. In ArcMap, right-click on the layer of interest, e.g.  MASSCENSUS2010BLOCKGROUPS.shp, and in the contextual menu that appears, select the menu item  Open Attribute Table.
  2. In the dialog Table, select just the records within the attribute table that you want to summarize by click-dragging, shift-clicking, and/or shift-control-clicking on them; if you want the entire data set, make sure no records are selected, e.g. by clicking on the button  Clear Selection.
  3. In the attribute table, right-click on the name of the attribute in its field header, e.g. POP_2010, and in the contextual menu that appears, select the menu item  Statistics….
  4. In the window Statistics of …, you can read off the basic descriptive statistics of the attribute: Count, Minimum, Maximum, Sum, Mean, and Standard Deviation.

    Basic Statistics of an Attribute

Using the previous procedure, we see that for the census block groups

Mean of population: μ = 1315


Frequency Distributions

To further characterize a data set, consider the distribution of its values, where you might observe that some show up more often than others. For example, in the census block group table, the record count or frequency of particular population values is:

Population Frequency
0 10
6 2
9 1
… …
1170 3
1171 4
1172 5
… …
6288 1

These numbers don’t appear to change in a consistent way as population increases, but they can be smoothed out by counting the number within specific ranges of values, or bins. The result is a frequency distribution, a graph of which is called a histogram.

In ArcGIS’ summary of the statistics of an attribute, it provides a histogram; for the census block group population, it uses 30 bins of width 215:

Histogram

Bin Population Range Frequency Included Parameter
1 0 … 214 25 Minimum =      0
2 215 … 429 29  
3 430 … 644 245  
4 645 … 859 861 Mean − 1 SD =  714
5 860 … 1074 943 Mode ≈  967
6 1075 … 1289 794 Median m = 1170
7 1290 … 1504 617 Mean μ = 1315
8 1505 … 1719 395  
9 1720 … 1934 325 Mean + 1 SD = 1916
10 1935 … 2149 245  
… … … …
30 6235 … 6288 1 Maximum = 6288

This histogram shape, with a peak in the middle and tails dropping to zero on either side, is very common for data sets. The clustering of the data around the peak is known as central tendency.

We can represent the peak by the midpoint of its bin b5, (860 + 1074) / 2 = 967. This “most common” value or mode is another way to characterize the middle of the distribution.

The median value m = 1170 and mean value μ = 1315 are also both relatively near the peak, in bins b6 and b7, respectively, though pulled to the higher side by the long tail of large values. The latter are known as outliers and such an asymmetric distribution is said to be skewed.

An important way to characterize a frequency distribution is by the width of its peak, known as its dispersion; if it’s large the distribution is relatively flat, and if it’s small the distribution has a narrow peak and short tails.

A common method to estimate the dispersion is by calculating the separation of all population values from the mean, { ai − μ }, squaring them to remove negative values, averaging them, and finally taking the square root (to compensate for the squares). The result is called the root-mean-square deviation or standard deviation.

σ = √( (1/N) Σi ( ai − μ )² )

(σ is the Greek letter “sigma”.) It can be shown with a bit of calculus that using the separation from the mean (as opposed to, say, the median) minimizes the deviation.

The square of the standard deviation σ2 is known as the variance.

The previous procedure for obtaining the statistics of an attribute in ArcGIS will include the standard deviation, which for the population of the census block groups is

Standard deviation of population: σ = 601

The distribution of values in the figure above can be seen to occur primarily within the range from μ − σ to μ + σ, which can be written μ ± σ; for the census block groups this is a population range from 714 to 1916, which includes 3706 out of 4979 records, or 74%.
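Continuing the Python sketch from earlier, the standard deviation and the fraction of values within μ ± σ can be checked directly (again with the hypothetical populations list standing in for POP_2010):

    import statistics

    mu = statistics.mean(populations)
    sigma = statistics.pstdev(populations)    # divides by N, the 100%-data form

    # fraction of the values within one standard deviation of the mean
    within = [a for a in populations if mu - sigma <= a <= mu + sigma]
    print(sigma, len(within) / len(populations))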

While census block groups are constructed with the goal of having about 1200 people each, there is necessarily some variation since they are bounded by features like streets, while population can vary dramatically from block to block. There can also be a substantial influx or egress of people over time as residences are constructed or torn down.

Amherst Elevation

Natural phenomena will have similar variability, and the processes that produce them are understood to have factors that disperse values away from a central value. Consider, for example, elevation: the Earth’s surface was originally flat, being formed by molten material pulled downward by gravity. Over time geologic processes have produced hills and valleys, mountains and trenches, and therefore a wide range of elevations.

Elevation in the Amherst area was discussed earlier and is shown again to the right. The histogram of this elevation raster is shown below, and has a similar appearance to the census-block-group data, though it has a much more significant tail at high elevations.

Amherst DEM Histogram

Using statistics to visualize a data set is the default behavior for image data, where raster values are mapped to gray values between 0 and 255, but they can also be used for feature data, such as the census block groups.

Procedure 2: Symbolizing an Attribute by a Defined Interval

The basic descriptive statistics of the attribute of a feature data set are provided when symbolizing it quantitatively, and the bins used in frequency distributions can be visualized using a defined interval.

  1. In ArcMap, in the Table of Contents, double-click on the layer to be symbolized, e.g.  MASSCENSUS2010BLOCKGROUPS.shp.
  2. In the dialog Layer Properties, click on the tab Symbology.
  3. In the list Show:, click on the symbolization Quantities; the specific symbolization Graduated Colors should be selected by default.
  4. In the section Fields, in the list Value:, select the attribute you wish to symbolize, e.g. POP_2010.
  5. In the section Classification, click on the button Classify…. In the dialog Classification that appears, you will see the basic descriptive statistics of the data (including the median this time), and a histogram:

    Bin Symbology

  6. By checking the boxes Show Std. Dev. and Show Mean, the histogram can show the location of the mean value μ (the dash-dotted line) as well as multiples of the standard deviation to either side, μ ± nσ, where n = 1, 2, and 3 (the dashed lines).
  7. In the section Classification, in the menu Method:, select the item Defined Interval.
  8. In the text box Interval Size:, type in the desired size of each bin, e.g. 215. The number of Classes will change in response, here corresponding to the number of bins.
  9. Click the button OK.
  10. Back in the dialog Layer Properties, if you want you can select a different Color Ramp:.
  11. Click the button Apply (not OK) and move the dialog window to see the effect.
  12. With a large number of polygons, such as the 4979 census block groups in Massachusetts, the finite width of the boundaries of the polygons can get in the way of seeing the smaller block groups, as in the image to the right. In this case removing the outlines is advisable:
    1. Right-click on the first color patch in the list of symbols, and in the contextual menu that appears, select the menu item Properties for All Symbol(s)…;
    2. In the dialog Symbol Selector, in the text field Outline Width:, type 0, or click on the down arrow in the number selector;
    3. Click the button OK;
    4. Back in the dialog Layer Properties, click the button Apply to see the effect.
  13. You might want to highlight the modal bin (or the mean, etc.) by changing its color to something more distinctive.
    1. Double-click on the color patch corresponding to the mode in the list of symbols, e.g. 860 … 1074 for the census block groups;
    2. In the dialog Symbol Selector, in the menu Fill Color:, choose a color that will stand out from the others, e.g. red.
    3. Click the button OK.
    4. Back in the dialog Layer Properties, click the button Apply to see the effect.

Massachusetts census block group population symbolized by defined interval

This is, in a sense, a geographic histogram, with the count of blocks in a particular “bin” represented by a visual area.

Question: In the image above showing the census block groups, can you see any patterns in the geographic distributions of the least populated block groups (yellow), the most populated block groups (dark brown), or the block groups with the most common or modal populations (red)?

Procedure 3: Symbolizing an Attribute by Its Standard Deviation

An alternative to symbolizing data by a defined interval is to use its standard deviation as the bin size, which generally produces fewer gradations but can also sometimes provide a clear picture as to its behavior.

  1. In ArcMap, in the Table of Contents, double-click on the layer to be symbolized, e.g.  MASSCENSUS2010BLOCKGROUPS.shp.
  2. In the dialog Layer Properties, click on the tab Symbology.
  3. In the list Show:, click on the symbolization Quantities; the specific symbolization Graduated Colors should be selected by default.
  4. In the section Fields, in the list Value:, select the attribute you wish to symbolize, e.g. POP_2010.
  5. In the section Classification, click on the button Classify…. In the dialog Classification that appears, you will see the basic descriptive statistics of the data (including the standard deviation this time), and a histogram:

    Standard Deviation Symbology

  6. In the section Classification, in the menu Method:, select the item Standard Deviation.
  7. The default Interval Size is 1 Std Dev, centered around the mean value, but a half or smaller fraction of a standard deviation can be chosen instead to show more gradation.
  8. Click the button OK.
  9. Back in the dialog Layer Properties, you'll see that each standard-deviation-sized bin has been assigned a different color, with appropriate labeling generated and a particular Color Ramp: chosen, but you can select a different ramp if you want.
  10. Click the button Apply (not OK) and move the dialog window to see the effect.
  11. With a large number of polygons, such as the 4979 census block groups in Massachusetts, the finite width of the boundaries of the polygons can get in the way of seeing the smaller block groups, as in the image to the right. In this case removing the outlines is advisable:
    1. Right-click on the first color patch in the list of symbols, and in the contextual menu that appears, select the menu item Properties for All Symbol(s)…;
    2. In the dialog Symbol Selector, in the text field Outline Width:, type 0, or click on the down arrow in the number selector;
    3. Click the button OK;
    4. Back in the dialog Layer Properties, click the button Apply to see the effect.

Massachusetts census block group population symbolized by standard deviation

Question: Are the patterns in the geographic distribution of the census block groups any clearer now?


Standardization

It is often useful to define standard scores or z-scores for any given data set:

zi = ( ai − μ ) / σ

These values are sometimes also called z-values, but that term is also used more specifically for height values, as in (x, y, z) spatial coordinates.

The effect of this transformation is three-fold:

  • The mean value of the standardized data set is 0;
  • The standard deviation of the standardized data set is 1;
  • Standard values are dimensionless, since ai, μ, and σ are all the same type of quantity (e.g. length or mass).

In other words, standard scores are a purely numeric multiple of a standard deviation away from the mean.

Histogram

The frequency distribution of the census-block-group data above is displayed in standardized form in the figure to the right.

A primary result of standardizing is that data values within σ of the mean will have standard scores −1 < zi < 1, which are mathematically small (in particular their powers zi2, zi3, … decrease and vanish).

Viewing a data set as a dispersal of values around a best estimate therefore works best when most of the dispersion is small; note that most of the values in the figure to the right lie between −1 and 1.

Standardized data sets are thus used when modeling data, and should be calculated before performing any advanced statistical analysis.
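As a sketch of that calculation in Python, standardization is a one-line transformation of the values:

    import statistics

    def z_scores(values):
        """Standard scores: dimensionless multiples of sigma from the mean."""
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        return [(a - mu) / sigma for a in values]

    # The standardized set always has mean 0 and standard deviation 1:
    z = z_scores(populations)                 # the hypothetical list from above
    print(statistics.mean(z), statistics.pstdev(z))   # roughly 0.0 and 1.0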

Procedure 4: Calculating an Attribute’s Standard Score

To calculate an attribute’s standard score one must first create a new field to hold it. First, though, make a note of the field’s mean value and standard deviation, as described previously. (Hint: you can copy both values at once from the statistics dialog and in the calculation below delete the intervening text.)

  1. In ArcMap, right-click on the layer of interest, e.g.  MASSCENSUS2010BLOCKGROUPS.shp, and in the contextual menu that appears, select the menu item  Open Attribute Table.
  2. In the dialog Table, click on the button  Table Options, and in its menu click on the item Add Field….
  3. In the dialog Add Field, in the text field Name:, type in a name for the standard-score field, e.g. POP_2010_Z.

    Remember the limitations on field names described in the procedure Making an Excel File Compatible with ArcGIS, Step #1.
  4. In the menu Type:, select the data type Double.
  5. Click the button OK.
  6. Back in the dialog Table, make sure that no records are selected by clicking on the button  Clear Selection (otherwise only the selected records will be calculated).
  7. Right-click on the header of the field you want to calculate, e.g. POP_2010_Z, and in its contextual menu click on the item  Field Calculator.
  8. In the dialog Field Calculator, type the expression for the standard score in the text field field_name =

          ( [A] – μ ) / σ

    Here A refers to the field name of the attribute whose standard score you are calculating, and it can be entered into the text field by finding it in the list Fields: and double-clicking on it; in place of μ and σ, type the numeric values you noted earlier.
  9. Click the button OK, and the z-field should be filled with numbers largely between –1 and 1.
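The same procedure can also be scripted. Here is a sketch using arcpy from the ArcMap Python window; the path is hypothetical, and 1315 and 601 are the mean and standard deviation found in Procedure 1:

    import arcpy

    # Hypothetical location of your copy of the tutorial data:
    table = r"C:\Data\Geostatistics\MASSCENSUS2010BLOCKGROUPS.shp"

    # Add the double-precision field and fill it with ( POP_2010 - mu ) / sigma:
    arcpy.AddField_management(table, "POP_2010_Z", "DOUBLE")
    arcpy.CalculateField_management(table, "POP_2010_Z",
                                    "(!POP_2010! - 1315) / 601", "PYTHON_9.3")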


Question: In the image above showing the census block groups, can you see any patterns in the geographic distributions of the different z-scores? Assuming that the census block groups haven’t changed their geography for a long time and at one time had roughly similar values, what can you hypothesize about the outlying features?


Probability Distributions and Sample Data

Frequency distributions provide a measure of probability, and the laws of probability allow us to approximate them with small samples.


Frequency as Probability

The frequency distributions above reflect the probability of certain values, with values in the peak being the most likely, while values in the tails are the least likely.

Because of the way that bins are constructed, each bin bk contains some portion of all of the values:

0 ≤ Count(bk) ≤ N

Therefore, we can define the probability of values being in a particular bin bk as the fraction of the total count N:

pk = Count(bk) / N

and it must be the case that

0 ≤ pk ≤ 1

Zero probability means a bin has no values in it, while a probability of 1 or 100% means that a single bin contains all of the values.

It must also be the case that, because of the way that bins are constructed, when we add up the count of values in each of the n bins, we must get the total count:

Count(b1) + Count(b2) + … + Count(bn) = N

Therefore, when we sum the probabilities of each bin, the total must be 1 or 100%:

p1 + p2 + … + pn = 1

Histogram

That is, we have accounted for 100% of the probability by adding together the probability of each of the bins.

As examples, in the census block groups above, the modal population bin b5, with a population range of 860 … 1074, contains 940 block groups, out of a total count of 4979, for a probability of 940/4979 = 0.189 = 18.9%. Similarly, the region −1 < z < 1 has a probability of 3881/4979 = 77.9%.

Because the probability of a bin depends on the size of the bin, it is often better to give the probability per unit value, or probability density. In this case the peak bin has a probability per unit value of 18.9%/215 = 0.0879%, while the range −1 < z < 1 spreads the probability per unit value out to 77.9%/(2σ) = 0.0648%.

In the standardized histogram to the right, the probability per unit value is graphed on the right axis, allowing the frequency distribution to also be interpreted as a probability distribution.

The bins within one standard deviation of the mean are highlighted in yellow. This is known as a confidence interval of one standard deviation, which has a probability of 77.9% that data will fall within it.
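A short Python sketch of these definitions, using numbers from the histogram table above:

    # Probabilities p_k = Count(b_k) / N; by construction they sum to 1.
    def bin_probabilities(counts, N):
        return [c / N for c in counts]

    # The modal bin b5 (population 860 ... 1074, width 215):
    p5 = 940 / 4979        # probability: about 0.189, or 18.9%
    density5 = p5 / 215    # probability per unit value: about 0.0879% per person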


Comparing Probability Distributions

The shape of the frequency distributions seen previously, with a high-probability peak near the middle and low-probability tails to either side, is very common, despite the diversity of processes that produce population clusters, elevation variations, traffic patterns, temperature fluctuations, and so on.

One might therefore speculate that there are some simple features of these processes that one could understand before diving into more complicated details. And, to the extent we can compare with these simple models and find a significant difference, we can say something about the importance of the data we have collected.

As a simple example, suppose that the population in each census block group was chosen completely randomly from the range of data 0 … 6288. Another way of saying this is that every value is equally likely, with a uniform probability of 1/6288 = 0.0159% per unit value. In this case the histogram would be flat, with every bin having the same frequency, the red line in the histogram above. This is clearly different from the measured modal probability of 0.0879% per unit value, which tells us something about how the block groups were constructed, in particular that they are not purely random or “uniformly distributed”.

To facilitate comparisons of probability distributions, we can simulate a set of data using its measured probability distribution:

  1. Let’s start with all of the n bins B = {b1, b2, b3, …, bn } being empty and increase their count in a probabilistic fashion.
  2. By construction the probabilities of the bins add up to one, so if we define the cumulative probability as the total probability through each bin,


    q1 = p1
    q2 = q1 + p2
    q3 = q2 + p3
    …
    qn = qn−1 + pn = 1


    then these increase in value from 0 to 1.
  3. We generate a uniformly distributed random number in the range 0 ≤ r < 1 using any number of computer programs. That value will correspond to the first bin bk whose cumulative probability qk > r, and we increase the count of that bin by 1.
  4. We then repeat step 3 until we’ve increased the count in the bins to the total count N.

Animated Probability Accumulation
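A minimal Python sketch of this recipe, building the cumulative probabilities qk exactly as defined above and drawing uniform random numbers:

    import random
    from bisect import bisect_right

    def simulate(probabilities, total):
        # cumulative probabilities q_1 ... q_n (q_n is 1 by construction)
        q, running = [], 0.0
        for p in probabilities:
            running += p
            q.append(running)
        q[-1] = 1.0                       # guard against floating-point rounding
        counts = [0] * len(probabilities)
        for _ in range(total):
            r = random.random()           # uniform 0 <= r < 1
            k = bisect_right(q, r)        # first bin b_k with q_k > r
            counts[k] += 1
        return counts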

While this probabilistic approach to reconstructing the data is clearly not the same process by which a real set of data is created, it allows us to compare with other probability distributions. Note that it also allows for the selection of values that are not in the original data set.

In addition, it allows us to see how we might treat some less important features as arising from random processes. In particular, the processes that spread out a single value into a wide distribution might be modeled as a random perturbation!


The Normal Distribution

One of the most common probability distributions we will compare to is the normal distribution:

p(a) = exp( −(a − μ)² / (2σ²) ) / ( σ √(2π) )

where p(a) is the probability per unit value of a, and μ and σ are the calculated values of its mean and standard deviation. In terms of the standard score z = (a − μ) / σ, the probability per unit value is p(z) = exp( −z²/2 ) / √(2π).

The function exp(x) is the exponential function, and is often written as ex where e = 2.71828… is the exponential constant.

The normal distribution is symmetric around its mean value, which is also its mode and median:

Normal Distribution Graph

As is required for a probability distribution, the total probability — the area under the curve — is equal to 1 (100%).

Because of the normal distribution’s symmetry, because it is easy to calculate, because it can represent a “typical” distribution, and for reasons that we will see later, the normal distribution has great importance in statistics. Because of its shape it is also commonly called a bell curve.

A normal distribution is often characterized by its confidence intervals: within one standard deviation of the mean the probability is 68.3%, within two 95.5%, and within three 99.7%.
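These intervals are easy to verify numerically; a small Python sketch, using the standard identity that the probability within k standard deviations of a normal distribution's mean is erf(k/√2):

    import math

    def normal_pdf(a, mu, sigma):
        """Probability per unit value of a normal distribution."""
        z = (a - mu) / sigma
        return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))

    def within_k_sigma(k):
        """Probability of falling within k standard deviations of the mean."""
        return math.erf(k / math.sqrt(2))

    print(within_k_sigma(1), within_k_sigma(2), within_k_sigma(3))
    # roughly 0.683, 0.955, and 0.997, the confidence intervals quoted above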

You can see a similarity of the normal distribution to the probability distribution of the census block group population above. However, the latter has a skew of the mode away from the mean of –0.58σ. In addition, it is narrower, with a confidence interval of one standard deviation having a probability of 77.9%. We can therefore characterize the census block group population as not being “normally distributed”.


Sample Data and the Central Limit Theorem

The population data in the census block groups is what is referred to as 100% data, as the Census Bureau is required by the U.S. Constitution to “enumerate” the country’s entire population. This is generally a formidable task, and more often than not researchers will instead select a sample of all of the available data to save time and other resources, and use it to represent the entire population.

Generally sample data is chosen randomly, avoiding as much as possible any influence from other factors. This may seem intuitively obvious, but it is sometimes hard to achieve and must be documented.

When you do have randomly selected data, you can see how similar these representations at 5% N, 10% N, and 100% N are to the original probability distribution in the following charts, where the samples are blue, the original is red, and the overlap is purple:

Sample Distribution 0249Sample Distribution 0498Sample Distribution 4979

This is not just coincidence; there is a mathematical reason, known as the Central Limit Theorem. Suppose we have a data set A of size N, mean μ, and standard deviation σ, and take a random sample Â1 of size N̂ (usually N̂ < N). We can calculate the sample’s mean value, which we will label μ̂1, and which in general will be different from μ. If we then repeat this for a total of n sampling sets Â1, Â2, …, Ân, the theorem states that the set of measured means μ̂1, μ̂2, …, μ̂n will form its own sampling distribution that is a normal distribution and has the following characteristics as the sample size N̂ increases:

mean of the sample means: μμ̂ → μ        standard deviation of the sample means: σμ̂ = σ / √N̂

In other words, as the size of the sample sets increases, the measured “mean of the means” approaches that of the entire data set, which is probably no surprise.

Perhaps more important, however, is that the width of the sampling distribution of means around the actual mean is very narrow in comparison to that of the complete data set. As a result, any one of the μ̂k ≈ μ, and we can often get by with a relatively small sample size to calculate an estimate of the mean. On the other hand, if we want to improve an estimate, it will take a relatively large increase in N̂ for a significant effect, e.g. four times to halve the estimate error.

Note that, because some values might be selected more than once in a random sample, even when N̂ = N as in the third picture above, there will be some variability in the sample means and σμ̂ will not be zero.

This result is independent of whether the original distribution is normal, and it also applies to estimates of the median and mode (the latter being the most obvious in the above pictures).
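A quick Python experiment in this spirit; the data set is a hypothetical stand-in, and the samples are drawn with replacement, as noted above:

    import random
    import statistics

    data = [random.uniform(0, 6288) for _ in range(4979)]    # stand-in data set
    sigma = statistics.pstdev(data)

    for n_hat in (50, 200, 800):
        means = [statistics.mean(random.choices(data, k=n_hat))
                 for _ in range(1000)]
        # spread of the sample means vs. the predicted sigma / sqrt(n_hat)
        print(n_hat, statistics.pstdev(means), sigma / n_hat ** 0.5)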

One other result that follows is that any sample data set  = {â1, â2, â3, …, âN̂} will have a smaller standard deviation σ̂ around its own mean μ̂ than around the mean of the full data set — essentially, μ̂ is offset from the actual mean and doesn’t include a contribution from the sampling distribution of the mean, σμ̂.

A better estimate of the standard deviation is therefore given by the standard error s, which can be found by adding together the variance of the sample data around its mean and the variance of the mean itself:

s2 = σ̂2 + σμ̂2

resulting in

s = √( Σi ( âi − μ̂ )² / ( N̂ − 1 ) )

The difference from the sample standard deviation σ̂ is the denominator N̂ − 1 as opposed to just N̂, which slightly increases the value of s towards the actual standard deviation.

Warning: ArcGIS doesn’t know if attribute data is 100% data or sample data, and it always assumes the former in its statistical summary. For large N̂ this doesn’t make much difference, but be careful for smaller values!
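In Python's statistics module the two conventions are separate functions, which makes the distinction easy to check (the sample values here are hypothetical):

    import statistics

    sample = [1170, 1171, 1172, 860, 1504]    # hypothetical sample values

    # pstdev divides by N-hat (what ArcGIS assumes); stdev divides by
    # N-hat - 1, giving the standard error s appropriate for sample data.
    print(statistics.pstdev(sample), statistics.stdev(sample))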

Another warning: Most statistics texts refer to 100% data as “population” data when distinguishing it from sample data, whether or not it actually represents population values!


Statistics of Multiple Attributes

A data set often has multiple attributes that may or may not depend on each other.


Dependence and Independence

Quite often two sets of data may be related to each other, at the very least because their values are measured at the same time or location, or both. For example, a weather station might make hourly measurements of temperature, humidity, wind speed, etc.

Census data is another common example, such as the layer  MASSCENSUS2010BLOCKGROUPS.shp, whose attribute table includes information not only about total population but also the white population, black population, hispanic population, housing units, etc. in particular locations in a particular year:

Population Data: Multiple Attributes

Beyond the basic connection they have due to their location-based collection, these different sets of data might have other relationships, e.g. there can be simple constraints of definition such as:

POP_2010 = POP_WHITE + POP_BLACK + POP_NATV + POP_ASN + POP_ISLND + POP_OTHER + POP_MULTI

See the U.S. Census Bureau's document “About Race” to learn how they define these categories.

The Census Bureau also allows for the possibility that a person of Hispanic or Latinx ethnicity could be in any one of these categories. See the U.S. Census Bureau Guidance on the Presentation and Comparison of Race and Hispanic Origin Data for more information.

Importantly, there can also be more complicated relationships resulting from societal factors. For example, the white and black populations are not uniformly mixed but tend to be inversely related, as whites and blacks cluster together in different locations.

The relationship between different attributes can be visualized, to some extent, by plotting each pair within a record on a two-dimension graph of their values, which is known as a scatterplot.
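Outside ArcGIS, a scatterplot matrix can be sketched in a few lines of Python, assuming the attribute table has been exported to a (hypothetical) file blockgroups.csv; Procedure 5 below builds the same matrix in ArcMap:

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    df = pd.read_csv("blockgroups.csv")       # hypothetical export of the table
    fields = ["POP_2010", "POP_WHITE", "POP_BLACK", "POP_OTHER", "HISP"]
    scatter_matrix(df[fields])                # one scatterplot per pair of fields
    plt.show()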

Procedure 5: Visualizing Attribute Relationships with Scatterplots

  1. In ArcMap, open the menu View, select the menu item Graphs, and then select the menu item  Create Scatterplot Matrix….
  2. In the dialog Create Scatterplot Matrix Wizard, in the menu Layer/Table, select the layer or table of interest, e.g.  MASSCENSUS2010BLOCKGROUPS.shp.

    Create Scatterplot Matrix Wizard

  3. In the section Fields:, click in each row of the column Field name and in the pop-up menu select the attributes that you want to graph, e.g. POP_2010 and POP_WHITE. At least three fields must be selected. If necessary, drag the bottom of the window down to give you access to additional rows.
  4. If you want, you can give each attribute a clearer label by clicking in the column Label and typing something else, e.g. White Population.
  5. To see a preview of the scatterplot matrix, click the button Apply.
  6. By default the current symbology of the layer is used to color the different data points. Therefore, in the example above, the data points become darker as POP_2010 increases; you can also see the block groups in the modal bin in red. To customize this, click on the menu Color and select the item Custom.
  7. If you want histograms of each attribute to appear underneath the scatterplot matrix, click the button Show Histograms.
  8. Click the button Next>.
  9. In the next screen, you can choose what to do with features that you previously selected in your layer/table:
    1. Show all features/records with selected items highlighted (the default);
    2. Show all features/records with selected items appearing the same as others;
    3. Show only selected records.
  10. In the area General Graph Properties, type in a Title for the scatterplot matrix, and if you want type in a Footer, too.
  11. Click the button Finish.

    Scatterplot Matrix

  12. The scatterplot matrix now appears, with every possible pair-wise combination of the attributes plotted against each other, in the current color scheme of the layer (in the example above, referring back to the symbology used in Procedure 2). If you click on any one of these graphs, it will appear expanded in the upper right corner, showing the actual values along the axes, as for the POP_OTHER vs. HISP graph.
  13. If you want to make changes to the graph, double-click on it; if you want to save or print it, right-click on it and select those items.

Scatterplots often reveal several types of relationships between attributes:

  • A linear relationship, clearly visible in the POP_OTHER vs. HISP graph expanded above:
  • HISP = α + β × POP_OTHER

    Recall that α (the Greek letter “alpha”) is the y-intercept and β (the Greek letter “beta”) is the slope of the line.

  • If the slope β = 0, then we have a constant relationship, such as the POP_NATV (Native American) vs. every other population above:
  • POP_NATV = α

  • An inverse relationship, somewhat visible in the POP_WHITE vs. POP_BLACK graph above:
  • POP_BLACK = β / POP_WHITE

    In other words, where there are more whites there are fewer blacks, and where there are fewer whites, there are more blacks.

    Inverse relationships can often be approximated by linear relationships with negative slopes.

Some pairs of attributes may have no obvious relationship, such as POP_OTHER vs. POP_MULTI, perhaps indicating an overlap in meaning or a more complicated relationship involving other attributes. Relationships between z-scores can sometimes be clearer, because these values are mostly smaller than 1 (mathematically speaking, nonlinear terms will be less important).

When an attribute remains constant relative to another attribute, or if they have a purely random relationship, we can say that they are independent of each other; if, on the other hand, the attribute has a clear mathematical relationship to another attribute, we can say they are dependent on each other.

Somewhat confusingly, when expressed as a mathematical relationship such as the above, the attribute on the left of the equal sign is called the dependent variable or the response variable, and the attribute in the expression on the right is called the explanatory variable, which implies an asymmetric relationship that requires qualification or justification.

An important aphorism to remember when considering dependent relationships is that correlation does not imply causation, i.e. two attributes may be dependent upon each other not because one causes the other, but because they both arise from a third attribute. For example, black households are more likely to have lower incomes than white households, not because being black causes lower incomes but because of their historical origins and ongoing discrimination.


Correlation

The degree to which the two sets of data have a linear relationship can be described by calculating their correlation, defined by Pearson as

ρ(A, B) = (1/N) Σi [ ( ai − μa ) / σa ] × [ ( bi − μb ) / σb ]

This expression multiplies two attributes’ z-scores feature-by-feature, sums the result, and divides by the total number N (replaced by N̂ − 1 for sample data sets).

The correlation of two attributes will vary between −1 and +1, with +1 occurring when the two sets of z-scores are identical (the sum is then the same as that for the standard deviation squared), and −1 when they differ only by a minus sign.

If two attributes are independent of each other, the correlation will be close to zero. This is obviously true when one of the attributes is constant, since that value will equal its mean and its z-score will always be zero. More generally, since z-scores are distributed around zero, there will be roughly the same number of positive and negative terms, which will tend to cancel each other out.

In ArcGIS, you can calculate the correlation of two attributes by calculating their z-scores, then calculating a third attribute that is the product of their z-scores, then summarizing the latter to find its mean value. (You can also calculate a linear regression; see the next section.) Excel provides a function CORREL which is somewhat easier to use to calculate correlations.
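A direct Python sketch of this definition, in the 100%-data form that divides by N:

    import statistics

    def correlation(a, b):
        """Pearson correlation: the mean product of the two sets' z-scores."""
        mu_a, s_a = statistics.mean(a), statistics.pstdev(a)
        mu_b, s_b = statistics.mean(b), statistics.pstdev(b)
        return sum((ai - mu_a) / s_a * (bi - mu_b) / s_b
                   for ai, bi in zip(a, b)) / len(a)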

For the Massachusetts data above, we can create a correlation matrix with the same form as the scatterplot matrix:

            POP_2010  POP_WHITE  POP_BLACK  POP_NATV  POP_ASN  POP_ISLND  POP_OTHER  POP_MULTI
POP_WHITE   0.88
POP_BLACK   0.13      -0.27
POP_NATV    0.16      -0.02      0.24
POP_ASN     0.33      0.10       0.10       0.00
POP_ISLND   0.13      0.03       0.12       0.15      0.06
POP_OTHER   0.14      -0.21      0.40       0.46      0.03     0.20
POP_MULTI   0.42      0.04       0.52       0.41      0.27     0.24       0.70
HISP        0.17      -0.16      0.38       0.48      0.05     0.20       0.95       0.67

The color codes indicate the strength and sign of the correlation (similar to the standardized map above). From this we see that the POP_OTHER and HISP attributes have the strongest correlation at 0.95, while for POP_WHITE and POP_BLACK there is a weak negative correlation of −0.27, both matching our visual characterization.

Question: The second strongest correlation is between the white and total populations at 0.88; why do you think that would be?


Linear Regression

Least Squares Regression: HISP vs. POP_OTHER

An attribute such as the Hispanic population can be characterized by its mean value and standard deviation, but consider the graph at the right, which plots HISP on the y axis vs. POP_OTHER on the x axis.

The mean value of HISP, μ = 126 (the solid green line), is also plotted, along with the confidence interval 3σ = 636 (the dashed green line).

Clearly a significant fraction of the HISP data is quite far from the mean and even outside of the 3σ confidence interval — but it’s much closer to the blue line, which varies with POP_OTHER.

If we want to model this relationship, the simplest type of relationship between two attributes A and B is a linear one, viz.

A = α + β × B

The coefficients α and β are called the intercept and slope, respectively. Note that if the slope β is zero, then A will be represented by the constant value α, which we might expect to be the mean μ.

In general there will be a dispersion of data that prevents a perfect representation by such a line, as in the graph at the right.

The difference between a dependent value and the corresponding calculated value of a representational line is known as a residual:

εi = ai − (α + β × bi)

(ε is the Greek letter “epsilon”).

We’d like to calculate best-fit values for the coefficients α and β, a process known as linear regression. The most common procedure, ordinary least-squares regression, is based on the idea that the line that fits the data best is the one that minimizes the sum of the squared residuals:

Σi εi² = Σi ( ai − (α + β × bi) )²

As with the standard deviation, squaring the residuals puts values above and below the regression line on an even footing. Also note that, if the slope β is zero, the sum is the same as that in the expression for σa, since the mean μ is the value of α that minimizes the sum.

Question: Where have you seen a least-squares fit previously? (Hint: the residuals were represented by blue lines between two geographic locations.)

Multiple linear regression is also possible when there is more than one explanatory variable:

A = α + β1 × B1 + β2 × B2 + … + βn−1 × Bn−1

In this case, with n coefficients and n – 1 different explanatory variables, it’s helpful to express the latter as z-scores in order to compare their relative importance to the dependent variable. Then the slopes {βk} will represent the effect of a one-standard-deviation change in the corresponding variables.

The derived expressions for the intercept α and slopes {βk} are unenlightening and won’t be listed here. But they can be calculated with a number of tools, including Excel and ArcGIS (see below).
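For a single explanatory variable, though, the closed forms are compact; a minimal Python sketch:

    import statistics

    def least_squares(b, a):
        """Fit a = alpha + beta * b by ordinary least squares."""
        mu_a, mu_b = statistics.mean(a), statistics.mean(b)
        beta = (sum((bi - mu_b) * (ai - mu_a) for bi, ai in zip(b, a))
                / sum((bi - mu_b) ** 2 for bi in b))
        alpha = mu_a - beta * mu_b            # the line passes through the means
        return alpha, beta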

As an example, consider the relationship discussed earlier,

HISP = α + β × POP_OTHER

which was notable because these two attributes appear to be strongly correlated. It has least-squares intercept and slope of

α = 16.5

β = 1.788

resulting in the equation

HISP = 16.5 + 1.788 × POP_OTHER

and the solid blue regression line that is plotted in the graph above.

Question: How might you interpret a slope of 1.788 in this case?


Goodness of Fit

Least Squares Regression: HISP vs. POP_OTHER

How well a linear regression equation fits the data is an important consideration, and a number of statistical measures have been devised to test its goodness-of-fit.

The standard error of the equation describes the distribution of the dependent values around the best-fit line, and is similar to the standard deviation around the mean value:

Σ = √( Σi εi² / ( N − n ) )

Again the εi are the residuals of the dependent values, and a smaller value represents a smaller spread from the regression line, as seen in the graph to the right.

As before N is the number of data points, so if more of them fit within a given spread of residuals, that will reduce the standard error.

Finally, n is the number of coefficients; the more of them there are the greater the standard error, because they add degrees of freedom to the equation and make it easier to fit more precisely, even though the data hasn’t changed. It is therefore subtracted from the total number of data points N, which decreases the denominator and increases the standard error.

Remember “n equations for n unknowns”? That means that one data point is required for each coefficient to determine them exactly, and the remaining Nn data points are responsible for the variation around the line (the residuals).

The standard error of the HISP(POP_OTHER) regression is

Σ = 67

Note that in the above graph, almost all of the data lies close to the regression line, falling within the confidence interval ±3Σ = ±201, denoted by the dashed blue lines. This is much better than simply describing the dependent variable by its mean value, since ±3σa = ±636. This model therefore accounts for a large fraction of the variation in the HISP data, leaving a much smaller set of residuals that must be accounted for by other factors. We can say that we have partitioned the variation between the model and the remaining residuals.

The coefficient of determination is a convenient and accepted way to compare the standard error of the equation Σ and the dependent variable’s standard deviation σa, and thereby describe the overall goodness-of-fit of the equation:

R² = 1 − ( Σi εi² ) / ( Σi ( ai − μ )² )

If the regression line perfectly fits the data, the residuals εi will all be zero and R2 will be one; when the residuals approach the standard deviation of the dependent variable, the second term will be one and R2 will be zero.

One way to interpret the coefficient of determination is as a generalization of correlation to a set of explanatory variables. It can be shown that, when there is only one explanatory variable, R2 will equal the square of the correlation ρ with the dependent variable. For the HISP(POP_OTHER) regression,

R2 = 0.90

which matches the correlation calculated above, since 0.95² = 0.90.

Because the coefficient of determination can improve simply by adding more explanatory variables, i.e. by increasing n, a related quantity that provides a better estimate of significance is the adjusted coefficient of determination:

R̂² = 1 − ( 1 − R² ) ( N − 1 ) / ( N − n )

R̂² will always be less than or equal to R², and unlike R² it can be negative. The significance of your equation will be greatest when R̂² is maximized.

For the example regression,

R² ≈ R̂²

since N (4979) is much larger than n (2).

The F statistic is another common way to analyze the dependence of your model on the number of explanatory variables you’ve chosen. It compares the “explained” variance R2 that follows from these n – 1 variables to the “unexplained” variance 1 – R2 remaining in the Nn unfitted data points:

F = [ R² / ( n − 1 ) ] / [ ( 1 − R² ) / ( N − n ) ]

F Distribution

F can be as small as 0, when the numerator R²/(n − 1) is 0: none of the variance in the dependent variable is explained.

F can be as large as ∞, when the denominator (1 – R2)/(Nn) is 0: all of the variance in the dependent variable is explained.

So the regression is better when F >> 1; for the HISP(POP_OTHER) regression, F = 45,000.

But could a different set of coefficient values be substituted and produce a better result? When coefficient values are selected with random probability and their F values are calculated, an F distribution results, such as the graph of p(F) versus F shown at the right; clearly some F values are more likely than others.

Generally speaking, values of F >> 1 have a low probability per unit value p(F), and the total probability p that random coefficient values will have F > FRegression is very small, as suggested by the red portion of the F distribution graph.

Is there a significant probability p that random coefficient values could produce better results than the regression best-fit? This question is an example of a null hypothesis.

A significance level is a value of p below which you may decide to reject the null hypothesis, i.e. decide that FRegression is significant. Commonly these are stated in the form p < 0.1 or p < 0.05. The former represents a less-than-1-in-10 chance and the latter a less-than-1-in-20 chance that a random result will produce a better F.

For the HISP(POP_OTHER) regression, p ≈ 0, so FRegression is clearly significant and we can reject the null hypothesis.
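All of these goodness-of-fit measures follow directly from the residuals; a Python sketch collecting them, where a holds the dependent values, residuals the εi, and n is the number of coefficients (2 for a single explanatory variable):

    import statistics

    def goodness_of_fit(a, residuals, n):
        N = len(a)
        mu = statistics.mean(a)
        ss_res = sum(e * e for e in residuals)
        ss_tot = sum((ai - mu) ** 2 for ai in a)
        Sigma = (ss_res / (N - n)) ** 0.5           # standard error of the equation
        r2 = 1 - ss_res / ss_tot                    # coefficient of determination
        r2_adj = 1 - (1 - r2) * (N - 1) / (N - n)   # adjusted for n coefficients
        F = (r2 / (n - 1)) / ((1 - r2) / (N - n))   # F statistic
        return Sigma, r2, r2_adj, F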


Standard Errors of the Coefficients

Least Squares Regression: HISP vs. POP_OTHER

Once the overall goodness-of-fit has been established, the individual coefficients should come under scrutiny.

Because the best-fit regression line is only one of many that could pass through the data, the coefficients also clearly have a range of values, e.g. tilting the line upward for a larger slope or downward for a smaller slope. These values therefore have their own distributions whose widths are described by standard errors of the coefficients, which for the HISP(POP_OTHER) regression are:

sα = 1.1

sβ = 0.008

You will commonly see coefficient errors expressed together with the coefficients in the form β ± sβ, e.g.

HISP = (16.5 ± 1.1) + (1.788 ± 0.008) × POP_OTHER

Note that this is an expression of just one possible confidence interval; to claim more certainty, a multiple of this value is generally necessary.

In addition, we can set up another null hypothesis: can these values be left out of the model with little effect, i.e. are they significantly different than zero? A simple test for their significance is based on the t-statistic:

tα = (α − 0) / sα = 15

tβ = (β − 0) / sβ ≈ 200

t Distribution

Like the F-statistic, we can test these values with the t distribution, which, like the F distribution, charts the probability that a random set of values could produce the observed coefficient.

When these values are greater than two, i.e. the coefficients ± their standard errors are significantly different from zero, the values are considered good estimates. More precisely, suppose the data was completely random, e.g. HISP showed no dependence on POP_OTHER; then we would expect the slope to be zero and the intercept α = μ.
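For the single-variable case, the standard errors have the usual ordinary-least-squares closed forms, sketched here in Python (Sigma is the standard error of the equation from the previous section):

    import statistics

    def coefficient_errors(b, alpha, beta, Sigma):
        """Standard errors and t-statistics of the intercept and slope."""
        N = len(b)
        mu_b = statistics.mean(b)
        ss_b = sum((bi - mu_b) ** 2 for bi in b)
        s_beta = Sigma / ss_b ** 0.5
        s_alpha = Sigma * (1.0 / N + mu_b ** 2 / ss_b) ** 0.5
        return (s_alpha, alpha / s_alpha), (s_beta, beta / s_beta)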

The coefficient of determination for the dependence of the HISP attribute on the POP_OTHER attribute is good, but looking at the scatterplot matrix there appears to be correlation not just with POP_OTHER but also with POP_MULTI and, to a lesser extent, with POP_BLACK and POP_NATV. In general, we also know that people of Hispanic origin can be of any racial background. We may therefore be able to produce a better fit by including them in the analysis with a multiple linear regression.

Procedure 6: Multiple Linear Regression

ArcGIS provides a tool for calculating the ordinary least squares fit to a multiple linear regression of an attribute dependent on multiple other attributes, providing detailed statistical characteristics of a fit described by the equation

A = α + β1 × B1 + β2 × B2 + … + βn–1 × Bn–1

This includes the coefficient of determination R2, so it can also be used to calculate the correlation between any pair of attributes.

  1. The Ordinary Least Squares tool requires that the input feature class have an integer attribute with unique values for every feature; if your layer doesn’t already have one, open its attribute table, add a new field, e.g. UniqueID, and use the field calculator as described above to copy the attribute FID into it (FID itself unfortunately doesn’t work for this purpose).
  2. In ArcMap, open  ArcToolbox (see Constructing and Sharing Maps for details).
  3. Double-click on  Spatial Statistics Tools, then on  Modeling Spatial Relationships, and finally on Ordinary Least Squares.
  4. In the dialog  Ordinary Least Squares, in the menu Input Feature Class, select the data layer to be symbolized, e.g.  MASSCENSUS2010BLOCKGROUPS. If the layer is not already added to ArcGIS, you can click instead on the button Browse to select one.
  5. In the menu Unique ID Field, choose an integer field with unique values, e.g. UniqueID.
  6. In the text field Output Feature Class, choose a location and name for the output layer file, e.g. Geostatistics.gdb\HISP_OLS, by typing it or by clicking on the button Browse to select it. You will probably want to put it in the same location as the data layer it’s modeling.
  7. In the menu Dependent Variable, choose the attribute you would like to explain, e.g. HISP.
  8. In the menu Explanatory Variables, click on the attribute(s) that you think will explain the dependent variable, e.g. POP_OTHER_Z.
  9. In the text field Output Report File, choose a location and name for an output report in PDF format, e.g. HISP_OLS_Report.pdf, by typing it or by clicking on the button Browse to select it. You will probably want to put it in the same location as the data layer it’s modeling.
  10. Optionally, you can request a Coefficient Output Table and a Diagnostic Output Table; these have almost the same information as the PDF report, but in a table format that will be loaded into ArcGIS. One statistic the former provides that isn’t in the PDF report is the standard error of the equation S.
  11. Click on the button OK.
  12. If you have turned off background processing (see Constructing and Sharing Maps for details), the dialog Ordinary Least Squares will appear, describing the process, and eventually displaying the Completed results (you may need to enlarge the window and scroll up to see everything):

    Z-Score Rendering Completed

    Quite a few statistical characteristics are included here, including the ones we have already described. In particular, this model of the Hispanic population includes:

    Report Label Value Symbol Description
    Coefficient: Intercept 126.1 α Intercept in the multiple linear regression equation.
    Coefficient: POP_OTHER 1.788 β1 Slope in the multiple linear regression equation.
    StdError: Intercept 0.9 sα Standard error of the intercept.
    StdError: POP_OTHER 0.008 sβ1 Standard error of the first slope.
    t-Statistic: Intercept 134 tα t-statistic of the intercept.
    t-Statistic: POP_OTHER 143 tβ1 t-statistic of the first slope.
    Multiple R-Squared 0.90030 R2 Coefficient of determination.
    Adjusted R-Squared 0.90028 R̂2 Adjusted coefficient of determination.
    Sigma-Squared 0.0953 S2 Standard error of the equation squared; S = 0.316, compared to σa = 1.
  13. Click on the button Close.
  14. The new symbology layer will automatically be added to your map. The default coloring has yellow for polygons within one standard deviation of the mean, cyan and orange for those between one and two standard deviations, and blue and red for those that are more than two standard deviations.

    Again, if there are a large number of polygons you may want to turn off the polygon outlines as described in step 12 of Procedure 2.

    Hispanic Ordinary Least Squares Model

Excel provides a function LINEST that can also be used to calculate regression coefficients and standard errors, but it’s a bit cumbersome to use.


Descriptive Statistics of Spatial Data

Spatial data can be described not only by the statistics of its attribute values but also by statistics that take the locations of its features into account.




Spatial Autocorrelation

The correlation of one attribute from a dataset with itself will always be equal to one, but the correlation across all of its features, known as autocorrelation, may have a more complicated value:

ρ(A) = (1/N²) Σi Σj za,i × za,j

This expression multiplies a single attribute’s z-scores for every pair of its features, sums the result, and divides by the total number N² (replaced by ( N̂ − 1 )² for sample data sets).

Because the second sum can be distributed through to just the second z-score, and by design the sum of a dataset’s z-scores will always be zero, ρ(A) will always be equal to zero, so it’s not particularly useful.

But a more interesting quantity can be devised that depends on the features’ spatial relationships. A feature of geographic data observed by Waldo Tobler, which he called the First Law of Geography, is that “everything is related to everything else, but near things are more related than distant things”. By taking into account the distance between two features, a quantity called Moran’s I can be defined:
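I = ( N / Σi Σj wij ) × ( Σi Σj wij × za,i × za,j ) / ( Σi za,i² )

Here the wij are spatial weights that are largest for nearby pairs of features (and zero for a feature paired with itself), e.g. inverse distances. In this standard form, each product of z-scores is weighted by the proximity of the pair, so Moran’s I is positive when nearby features have similar values, negative when nearby features have dissimilar values, and close to zero when values are positioned randomly.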


