Geostatistics is the application of statistics to the characterization and modeling of geographically distributed data. It is based on the observation that locations that are close to each other will generally have characteristics that are more similar than locations that are far apart, and as such the variation of these characteristics is distinguishable from random positioning. Geostatistics therefore accounts for distance itself as a characteristic of data that is potentially correlated. ArcGIS provides a number of tools that help test geospatial data for statistically significant distributions and compare the accuracy of distance-based models.
Getting Started

Since this tutorial will be using specific maps and data, the first step is to make your own copy of the tutorial data.

Set Up: Getting the Tutorial Data
Since some but not all of the ArcGIS components have trouble handling names with spaces or special symbols, do not rename the folders or files.

Set Up: Initializing ArcMap and Adding Data
Basic Descriptive Statistics

Before discussing geostatistics, a review of basic descriptive statistics of data sets will be helpful.

Extremes and Middles

In this course we’ve seen many geographic objects with different characteristics, e.g. states with different populations, schools with different ethnicities, and waterfalls with different heights. We can review such data in ArcGIS attribute tables, and at first glance they may appear to be random sets of numbers, e.g. the population figures in column POP_2010 in this census-block-group data set:

There are, however, several characteristics that we can extract from this set of values using the ArcGIS skills we have developed. The number of records or count is given at the bottom of the attribute table, and is symbolized by N:

Count of block groups: N = 4979

Let’s describe the data set by the symbols A = {a1, a2, a3, …, aN}. The minimum and maximum values, which together define the range, can be quickly determined by sorting the population field, as shown in the image at the right (see Constructing and Sharing Maps for details):

Minimum population: amin = 0

In addition, we can locate a value that describes the “middle” of a data set by dividing sorted records into two equal halves. If there are an odd number of records, there is one value in the middle (highlighted in the image at the right); otherwise a value halfway between the two middle values is used. This value is called the median, and is symbolized by m:

Median of population: m = 1170

“Median” can also be used as an adjective, such as the semiredundant “median value”, or the “median census block group”, in this case 250277102004 (though the two others with a population of 1170 could also be used, since the block group number is a semiarbitrary assignment).
Another way to describe the middle of the data set might be to find the midpoint μ0, halfway between the minimum and maximum values:

μ0 − amin = amax − μ0

and therefore:

μ0 = (amin + amax)/2

Midpoint of population: μ0 = 3144

This formula can be generalized to include all of the values, resulting in a more representative middle value known as the average or mean value:

μ = (a1 + a2 + a3 + … + aN)/N

The symbols in this expression are: μ (the Greek letter “mu”), the mean value; N, the count of records; and a1 … aN, the individual values of the data set.
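To make these definitions concrete, here is a minimal Python sketch. The values are a small made-up list (not the tutorial’s census data), chosen only so that the extremes and median echo the figures quoted above.

```python
import statistics

# A small hypothetical set of block-group populations (not the tutorial data).
populations = [0, 540, 980, 1170, 1350, 2200, 6288]

N = len(populations)                       # count of records
a_min, a_max = min(populations), max(populations)
midpoint = (a_min + a_max) / 2             # halfway between the extremes
mean = sum(populations) / N                # generalizes the midpoint to all values
median = statistics.median(populations)    # middle value of the sorted list

print(N, a_min, a_max, midpoint, mean, median)
```

Note that the midpoint (3144.0 here) can differ substantially from the mean and the median when the values are not spread evenly between the extremes.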
The mean has the property that its separation from all of the data values is, in one way, as small as possible (see the next section). For a large data set, programs like Excel make it relatively easy to calculate the mean, once you have entered all of the data. ArcGIS will also calculate the mean and other characteristics of any numerical attribute.

Procedure 1: Determining the Basic Descriptive Statistics of an Attribute
Using the previous procedure, we see that for the census block groups

Mean of population: μ = 1315

Frequency Distributions

To further characterize a data set, consider the distribution of its values, where you might observe that some show up more often than others. For example, in the census block group table, the record count or frequency of particular population values is:
These numbers don’t appear to change in a consistent way as population increases, but they can be smoothed out by counting the number within specific ranges of values, or bins. The result is a frequency distribution, a graph of which is called a histogram. ArcGIS provides a histogram in its summary of the statistics of an attribute; for the census block group population, it uses 30 bins of width 215:
This histogram shape, with a peak in the middle and tails dropping to zero on either side, is very common for data sets. The clustering of the data around the peak is known as central tendency. We can represent the peak by the midpoint of its bin b5, (860 + 1074)/2 = 967. This “most common” value or mode is another way to characterize the middle of the distribution. The median value m = 1170 and mean value μ = 1315 are also both relatively near the peak, in bins b6 and b7, respectively, though pulled to the higher side by the long tail of large values. The latter are known as outliers, and such an asymmetric distribution is said to be skewed.

An important way to characterize a frequency distribution is by the width of its peak, known as its dispersion; if it’s large the distribution is relatively flat, and if it’s small the distribution has a narrow peak and short tails. A common method to estimate the dispersion is by calculating the separations of all population values from the mean, {ai − μ}, squaring them to remove negative values, averaging them, and finally taking the square root (to compensate for the squares):

σ = √( ((a1 − μ)2 + (a2 − μ)2 + … + (aN − μ)2)/N )

The result is called the root-mean-square deviation or standard deviation. (σ is the Greek letter “sigma”.) It can be shown with a bit of calculus that using the separation from the mean (as opposed to, say, the median) minimizes the deviation. The square of the standard deviation σ2 is known as the variance.

The previous procedure for obtaining the statistics of an attribute in ArcGIS will include the standard deviation, which for the population of the census block groups is

Standard deviation of population: σ = 601

The distribution of values in the figure above can be seen to occur primarily within the range from μ − σ to μ + σ, which can be written μ ± σ; for the census block groups this is a population range from 714 to 1916, which includes 3706 out of 4979 records, or 74%.
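The root-mean-square recipe above can be sketched directly in Python; the values here are the same small made-up list as before, not the census data.

```python
import statistics

# Hypothetical values standing in for an attribute column (not the tutorial data).
values = [0, 540, 980, 1170, 1350, 2200, 6288]

mu = statistics.fmean(values)
# Population standard deviation: root of the mean squared deviation from the mean.
sigma = statistics.pstdev(values)
variance = statistics.pvariance(values)   # the square of the standard deviation

# Count the records that fall in the range mu - sigma ... mu + sigma.
inside = [v for v in values if mu - sigma <= v <= mu + sigma]
fraction = len(inside) / len(values)
```

For this skewed toy data set, six of the seven values fall within μ ± σ; the single large outlier both pulls up the mean and widens σ, much as the long tail does in the census histogram.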
While census block groups are constructed with the goal of having about 1200 people each, there is necessarily some variation since they are bounded by features like streets, while population can vary dramatically from block to block. There can also be a substantial influx or egress of people over time as residences are constructed or torn down. Natural phenomena will have similar variability, and the processes that produce them are understood to have factors that disperse values away from a central value. Consider, for example, elevation: the Earth’s surface was originally flat, being formed by molten material pulled downward by gravity. Over time geologic processes have produced hills and valleys, mountains and trenches, and therefore a wide range of elevations. Elevation in the Amherst area was discussed in , and is shown again to the right. The histogram of this elevation raster is shown below, and has a similar appearance to the census-block-group data, though it has a much more significant tail at high elevations.

Using statistics to visualize a data set is the default behavior for image data, where raster values are mapped to gray values between 0 and 255, but they can also be used for feature data, such as the census block groups.

Procedure 2: Symbolizing an Attribute by a Defined Interval

The basic descriptive statistics of the attribute of a feature data set are provided when symbolizing it quantitatively, and the bins used in frequency distributions can be visualized using a defined interval.
This is, in a sense, a geographic histogram, with the count of blocks in a particular “bin” represented by a visual area.

Question: In the image above showing the census block groups, can you see any patterns in the geographic distributions of the least populated block groups (yellow), the most populated block groups (dark brown), or the block groups with the most common or modal populations (red)?

Procedure 3: Symbolizing an Attribute by Its Standard Deviation

An alternative to symbolizing data by a defined interval is to use its standard deviation as the bin size, which generally produces fewer gradations but can also sometimes provide a clear picture as to its behavior.
Question: Are the patterns in the geographic distribution of the census block groups any clearer now?

Standardization

It is often useful to define standard scores or z-scores for any given data set:

zi = (ai − μ)/σ

These values are sometimes also called z-values, but that term is also used more specifically for height values, as in (x, y, z) spatial coordinates. The effect of this transformation is three-fold: the transformed values are centered on a mean of zero, their standard deviation is one, and they are pure numbers, independent of the original units of measurement.
In other words, standard scores are a purely numeric multiple of a standard deviation away from the mean. The frequency distribution of the census-block-group data above is displayed in standardized form in the figure to the right. A primary result of standardizing is that data values within σ of the mean will have standard scores −1 < zi < 1, which are mathematically small (in particular their powers zi2, zi3, … decrease and vanish). The conception of a data set as a dispersal of values around a best estimate is thus advanced when most of this dispersion is small (note that most of the values are between −1 and 1 in the figure to the right). Standardized data sets are thus used when modeling data, and should be calculated before performing any advanced statistical analysis.

Procedure 4: Calculating an Attribute’s Standard Score

To calculate an attribute’s standard score one must first create a new field to hold it. First, though, make a note of the field’s mean value and standard deviation, as described previously. (Hint: you can copy both values at once from the statistics dialog and in the calculation below delete the intervening text.)
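The same calculation that the procedure performs field-by-field can be sketched in a few lines of Python, again with made-up values rather than the census attribute; the two defining properties of standardized data (mean zero, standard deviation one) then follow automatically.

```python
import statistics

# Hypothetical attribute values (not the tutorial data).
values = [0, 540, 980, 1170, 1350, 2200, 6288]

mu = statistics.fmean(values)
sigma = statistics.pstdev(values)

# Standard score: distance from the mean in units of the standard deviation.
z = [(v - mu) / sigma for v in values]
```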
Question: In the image above showing the census block groups, can you see any patterns in the geographic distributions of the different z-scores? Assuming that the census block groups haven’t changed their geography for a long time and at one time had roughly similar values, what can you hypothesize about the outlying features?

Probability Distributions and Sample Data

Frequency distributions provide a measure of probability, and the laws of probability allow us to approximate them with small samples.

Frequency as Probability

The frequency distributions above reflect the probability of certain values, with values in the peak being the most likely, while values in the tails are the least likely. Because of the way that bins are constructed, each bin bk contains some portion of all of the values:

0 ≤ Count(bk) ≤ N

Therefore, we can define the probability of values being in a particular bin bk as the fraction of the total count N:

pk = Count(bk)/N

and it must be the case that

0 ≤ pk ≤ 1

Zero probability means a bin has no values in it, while a probability of 1 or 100% means that a single bin contains all of the values. It must also be the case that, because of the way that bins are constructed, when we add up the count of values in each of the n bins, we must get the total count:

Count(b1) + Count(b2) + … + Count(bn) = N

Therefore, when we sum the probabilities of each bin, the total must be 1 or 100%:

p1 + p2 + … + pn = 1

That is, we have accounted for 100% of the probability by adding together the probability of each of the bins. As examples, in the census block groups above, the modal population bin b5, with a population range of 860 … 1074, contains 940 block groups, out of a total count of 4979, for a probability of 940/4979 = 0.189 = 18.9%. Similarly, the region between –1 < z < 1 has a probability of 3881/4979 = 77.9%.

Because the probability of a bin depends on the size of the bin, it is often better to give the probability per unit value, or probability density.
In this case the peak bin has a probability per unit value of 18.9%/215 = 0.0879%, while the range –1 < z < 1 spreads the probability per unit value out to 77.9%/(2σ) = 0.0648%. In the standardized histogram to the right, the probability per unit value is graphed on the right axis, allowing the frequency distribution to also be interpreted as a probability distribution. The bins within one standard deviation of the mean are highlighted in yellow. This is known as a confidence interval of one standard deviation, which has a probability of 77.9% that data will fall within it.

Comparing Probability Distributions

The shape of the frequency distributions seen previously, with a high-probability peak near the middle and low-probability tails to either side, is very common, despite the diversity of processes that produce population clusters, elevation variations, traffic patterns, temperature fluctuations, et al. One might therefore speculate that there are some simple features of these processes that one could understand before diving into more complicated details. And, to the extent we can compare with these simple models and find a significant difference, we can say something about the importance of the data we have collected.

As a simple example, suppose that the population in each census block group was chosen completely randomly from the range of data 0 … 6288. Another way of saying this is that every value is equally likely, with a uniform probability of 1/6288 = 0.0159% per unit value. In this case the histogram would be flat, with every bin having the same frequency, the red line in the histogram above. This is clearly different than the measured modal probability of 0.0879% per unit value, which tells us something about how the block groups were constructed, in particular that they are not purely random or “uniformly distributed”.

To facilitate comparisons of probability distributions, we can simulate a set of data using its measured probability distribution:
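One way to sketch such a simulation in Python: draw a bin with its measured probability, then draw a uniform value within that bin. The bin edges and probabilities here are invented for illustration (they are not the measured census distribution).

```python
import random

random.seed(7)

# Hypothetical measured bins and their probabilities (they sum to 1).
bin_edges = [0, 500, 1000, 1500, 2000, 2500, 3000]
bin_probs = [0.05, 0.20, 0.35, 0.25, 0.10, 0.05]

def simulate(n):
    """Simulate n values from the measured probability distribution."""
    values = []
    for _ in range(n):
        # Choose a bin index with probability proportional to its weight...
        k = random.choices(range(len(bin_probs)), weights=bin_probs)[0]
        # ...then a uniform value inside that bin (it need not occur in the data).
        values.append(random.uniform(bin_edges[k], bin_edges[k + 1]))
    return values

sample = simulate(10_000)
```

A histogram of `sample` reproduces the shape of the measured distribution, with the modal bin collecting close to its measured share of the simulated values.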
While this probabilistic approach to reconstructing the data is clearly not the same process by which a real set of data is created, it allows us to compare with other probability distributions. Note that it also allows for the selection of values that are not in the original data set. In addition, it allows us to see how we might treat some less important features as arising from random processes. In particular, the processes that spread out a single value into a wide distribution might be modeled as a random perturbation!

The Normal Distribution

One of the most common probability distributions we will compare to is the normal distribution:

∂a p = (1/(σ√(2π))) exp(−(a − μ)2/(2σ2))

where ∂a p is the probability per unit value of a, and μ and σ are the calculated values of its mean and standard deviation. In terms of the standard score z = (a − μ)/σ, ∂z p, the probability per unit value of z, is

∂z p = (1/√(2π)) exp(−z2/2)

The normal distribution is symmetric around its mean value, which is also its mode and median: As is required for a probability distribution, the total probability — the area under the curve — is equal to 1 (100%). Because of the normal distribution’s symmetry, because it is easy to calculate, because it can represent a “typical” distribution, and for reasons that we will see later, the normal distribution has great importance in statistics. Because of its shape it is also commonly called a bell curve. A normal distribution is often characterized by its confidence intervals: within one standard deviation of the mean the probability is 68.3%, within two 95.5%, and within three 99.7%.

You can see a similarity of the normal distribution to the probability distribution of the census block group population above. However, the latter has a skew of the mode away from the mean of –0.58σ. In addition, it is narrower, with a confidence interval of one standard deviation having a probability of 77.9%. We can therefore characterize the census block group population as not being “normally distributed”.
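The confidence-interval percentages quoted above can be checked numerically. A short Python sketch using the standard library’s error function, which gives the cumulative probability of the standard normal distribution:

```python
import math

def normal_cdf(z):
    # Cumulative probability of the standard normal distribution up to z.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def within(k):
    # Probability of falling within k standard deviations of the mean.
    return normal_cdf(k) - normal_cdf(-k)

# within(1) ≈ 0.683, within(2) ≈ 0.955, within(3) ≈ 0.997
```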
Sample Data and the Central Limit Theorem

The population data in the census block groups is what is referred to as 100% data, as the Census Bureau is required by the U.S. Constitution to “enumerate” the country’s entire population. This is generally a formidable task, and more often than not researchers will instead select a sample of all of the available data to save time and other resources, and use it to represent the entire population. Generally sample data is chosen randomly, avoiding as much as possible any influence from other factors. This may seem intuitively obvious, but it is sometimes hard to avoid and must be documented. When you do have randomly selected data, you can see how similar these representations at 5% N, 10% N, and 100% N are to the original probability distribution in the following charts, where the samples are blue, the original is red, and the overlap is purple:

This is not just coincidence; there is a mathematical reason, known as the Central Limit Theorem. Suppose we have a data set A of size N, mean μ, and standard deviation σ, and take a random sample Â1 of size N̂ (usually < N). We can calculate the sample’s mean value, which we will label μ̂1, and which in general will be different than μ. If we then repeat this for a total of n sampling sets Â1, Â2, …, Ân, the theorem states that the set of measured means μ̂1, μ̂2, …, μ̂n will form its own sampling distribution that is a normal distribution and has the following characteristics as N̂ → N: its mean approaches the mean μ of the entire data set, and its standard deviation σμ̂ shrinks toward zero, roughly as σ/√N̂.

In other words, as the size of the sample sets increases, the measured “mean of the means” approaches that of the entire data set, which is probably no surprise. Perhaps more important, however, is that the width of the sampling distribution of means around the actual mean is very narrow in comparison to that of the complete data set. As a result, any of the μ̂k ≈ μ and we can often get by with a relatively small sample size to calculate an estimate of the mean.
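This narrowing of the sampling distribution is easy to see in a simulation. The sketch below builds a made-up, decidedly non-normal data set, draws many random samples (with replacement, so values can be selected more than once), and compares the spread of the sample means to the spread of the data itself.

```python
import random
import statistics

random.seed(42)

# A hypothetical skewed "100% data" set (not the tutorial data).
data = [random.expovariate(1 / 1200) for _ in range(5000)]
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

# Draw many random samples and record each sample's mean.
sample_size = 100
sample_means = [
    statistics.fmean(random.choices(data, k=sample_size)) for _ in range(500)
]

# The sampling distribution of the mean clusters tightly around mu:
# its spread is roughly sigma / sqrt(sample_size), far narrower than sigma.
spread = statistics.pstdev(sample_means)
```

Even though the underlying distribution is heavily skewed, the distribution of `sample_means` is narrow and centered on μ, which is the content of the Central Limit Theorem.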
On the other hand, if we want to improve an estimate, it will take a relatively large increase in N̂ for a significant effect, e.g. four times to halve the estimate error. Note that, because some values might be selected more than once in a random sample, even when N̂ = N as in the third picture above, there will be some variability in the sample means and σμ̂ will not be zero. This result is independent of whether the original distribution is normal, and it also applies to estimates of the median and mode (the latter being the most obvious in the above pictures).

One other result that follows is that any sample data set Â = {â1, â2, â3, …, âN̂} will have a smaller standard deviation σ̂ around its mean μ̂ than around that of the full data set — essentially, it’s offset from the actual mean and doesn’t include a contribution from the sampling distribution of the mean σμ̂. A better estimate of the standard deviation is therefore given by the standard error s, which can be found by adding together the variance of the sample data around its mean and the variance of the mean itself:

s2 = σ̂2 + σμ̂2

resulting in

s = √( ((â1 − μ̂)2 + (â2 − μ̂)2 + … + (âN̂ − μ̂)2)/(N̂ − 1) )

The difference from the sample standard deviation is the factor of N̂ − 1 as opposed to just N̂, which slightly increases the value of s towards the actual standard deviation.

Warning: ArcGIS doesn’t know if attribute data is 100% data or sample data, and it always assumes the former in its statistical summary. For large N̂ this doesn’t make much difference, but be careful for smaller values!

Another warning: Most statistics texts refer to 100% data as “population” data when distinguishing it from sample data, whether or not it actually represents population values!

Statistics of Multiple Attributes

A data set often has multiple attributes that may or may not depend on each other.

Dependence and Independence

Quite often two sets of data may be related to each other, at the very least because their values are measured at the same time or location, or both.
For example, a weather station might make hourly measurements of temperature, humidity, wind speed, etc. Census data is another common example, such as the layer MASSCENSUS2010BLOCKGROUPS.shp, whose attribute table includes information not only about total population but also the white population, black population, hispanic population, housing units, etc. in particular locations in a particular year:

Beyond the basic connection they have due to their location-based collection, these different sets of data might have other relationships, e.g. there can be simple constraints of definition such as:

POP_2010 = POP_WHITE + POP_BLACK + POP_NATV + POP_ASN + POP_ISLND + POP_OTHER + POP_MULTI

See the U.S. Census Bureau's document “About Race” to learn how they define these categories. The Census Bureau also allows for the possibility that a person of Hispanic or Latinx ethnicity could be in any one of these categories. See the U.S. Census Bureau Guidance on the Presentation and Comparison of Race and Hispanic Origin Data for more information.

Importantly, there can also be more complicated relationships resulting from societal factors. For example, the ratio of blacks to whites is not uniform but tends to be inversely related as whites and blacks cluster together in different locations. The relationship between different attributes can be visualized, to some extent, by plotting each pair within a record on a two-dimensional graph of their values, which is known as a scatterplot.

Procedure 5: Visualizing Attribute Relationships with Scatterplots
Scatterplots often reveal several types of relationships between attributes:
A linear relationship, such as the one between the Hispanic population and the “other race” population:

HISP = α + β × POP_OTHER

Recall that α (the Greek letter “alpha”) is the y-intercept and β (the Greek letter “beta”) is the slope of the line.

A constant relationship, where one attribute, such as the Native American population, stays relatively flat as another varies:

POP_NATV = α

An inverse relationship, such as the one between the black and white populations:

POP_BLACK = β / POP_WHITE

In other words, where there are more whites there are fewer blacks, and where there are fewer whites, there are more blacks. Inverse relationships can often be approximated by linear relationships with negative slopes.

Some pairs of attributes may have no obvious relationship, such as POP_OTHER vs. POP_MULTI, perhaps indicating an overlap in meaning or a more complicated relationship involving other attributes. Relationships between z-scores can sometimes be clearer, because these values are mostly smaller than 1 (mathematically speaking, nonlinear terms will be less important).

When an attribute remains constant relative to another attribute, or if they have a purely random relationship, we can say that they are independent of each other; if, on the other hand, the attribute has a clear mathematical relationship to another attribute, we can say they are dependent on each other. Somewhat confusingly, when expressed as a mathematical relationship such as the above, the attribute on the left of the equal sign is called the dependent variable or the response variable, and the attribute in the expression on the right is called the explanatory variable, which implies an asymmetric relationship that requires qualification or justification.

An important aphorism to remember when considering dependent relationships is that correlation does not imply causation, i.e. two attributes may be dependent upon each other not because one causes the other, but because they both arise from a third attribute. For example, black households are more likely to have lower incomes than white households, not because being black causes lower incomes but because of their historical origins and ongoing discrimination.
Correlation

The degree to which two sets of data have a linear relationship can be described by calculating their correlation, defined by Pearson as

ρ = (za,1 × zb,1 + za,2 × zb,2 + … + za,N × zb,N)/N

This expression multiplies two attributes’ z-scores feature-by-feature, sums the result, and divides by the total number N (replaced by N̂ − 1 for sample data sets). The correlation of two attributes will vary between −1 and +1, with the latter occurring if all pairs of values {ai, bi} are exactly the same (because the sum is then the same as that of the standard deviation squared), and the former when the values differ only by a minus sign. If two attributes are independent of each other, the correlation will be close to zero. This is obviously true when one of the attributes is constant, since that value will equal its mean and its z-score will always be zero. More generally, since z-scores are distributed around zero, there will be roughly the same number of positive and negative terms, which will tend to cancel each other out.

In ArcGIS, you can calculate the correlation of two attributes by calculating their z-scores, then calculating a third attribute that is the product of their z-scores, then summarizing the latter to find its mean value. (You can also calculate a linear regression; see the next section.) Excel provides a function CORREL which is somewhat easier to use to calculate correlations. For the Massachusetts data above, we can create a correlation matrix with the same form as the scatterplot matrix:
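The z-score recipe for ρ can be sketched in a few lines of Python; the toy values below (not the census attributes) confirm the extreme cases of +1 and −1 described above.

```python
import statistics

def z_scores(values):
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def correlation(a, b):
    # Mean of the feature-by-feature product of the two attributes' z-scores.
    za, zb = z_scores(a), z_scores(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

# Identical pairs give +1; values differing only by a minus sign give -1.
a = [1, 2, 3, 4, 5]
b = [-1, -2, -3, -4, -5]
```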
The color codes indicate the strength and sign of the correlation (similar to the standardized map above). From this we see that the POP_OTHER and HISP attributes have the strongest correlation at 0.95, while for POP_WHITE and POP_BLACK there is a weak negative correlation of −0.27, both matching our visual characterization.

Question: The second strongest correlation is between the white and total populations at 0.88; why do you think that would be?

Linear Regression

An attribute such as the Hispanic population can be characterized by its mean value and standard deviation, but consider the graph at the right, which plots HISP on the y axis vs. POP_OTHER on the x axis. The mean value of HISP, μ = 126 (the solid green line), is also plotted, along with the confidence interval 3σ = 636 (the dashed green line). Clearly a significant fraction of the HISP data is quite far from the mean and even outside of the 3σ confidence interval — but it’s much closer to the blue line, which varies with POP_OTHER.

If we want to model this relationship, the simplest type of relationship between two attributes A and B is a linear one, viz.

A = α + β × B

The coefficients α and β are called the intercept and slope, respectively. Note that if the slope β is zero, then A will be represented by the constant value α, which we might expect to be the mean μ. In general there will be a dispersion of data that prevents a perfect representation by such a line, as in the graph at the right. The difference between a dependent value and the corresponding calculated value of a representational line is known as a residual:

εi = ai − (α + β × bi)

(ε is the Greek letter “epsilon”). We’d like to calculate best-fit values for the coefficients α and β, a process known as linear regression. The most common procedure, ordinary least-squares regression, is based on the idea that the line that fits the data best is the one that minimizes the sum of the squared residuals:

ε12 + ε22 + … + εN2

As with the standard deviation
squaring the residuals puts values above and below the regression line on an even footing. Also note that, if the slope β is zero, the sum is the same as that in the expression for σa, since the mean μ is the value of α that minimizes the sum.

Question: Where have you seen a least-squares fit previously? (Hint: the residuals were represented by blue lines between two geographic locations.)

Multiple linear regression is also possible when there is more than one explanatory variable:

A = α + β1 × B1 + β2 × B2 + … + βn−1 × Bn−1

In this case, with n coefficients and n − 1 different explanatory variables, it’s helpful to express the latter as z-scores in order to compare their relative importance to the dependent variable. Then the slopes {βk} will represent the effect of a one-standard-deviation change in the corresponding variables. The derived expressions for the intercept α and slopes {βk} are unenlightening and won’t be listed here. But they can be calculated with a number of tools, including Excel and ArcGIS (see below).

As an example, consider the relationship discussed earlier,

HISP = α + β × POP_OTHER

which was notable because these two attributes appear to be strongly correlated. It has least-squares intercept and slope of

α = 16.5
β = 1.788

resulting in the equation

HISP = 16.5 + 1.788 × POP_OTHER

and the solid blue regression line that is plotted in the graph above.

Question: How might you interpret a slope of 1.788 in this case?

Goodness of Fit

How well a linear regression equation fits the data is an important consideration, and a number of statistical measures have been devised to test its goodness-of-fit. The standard error of the equation describes the distribution of the dependent values around the best-fit line, and is similar to the standard deviation around the mean value:

Σ = √( (ε12 + ε22 + … + εN2)/(N − n) )

Again the εi are the residuals of the dependent values, and smaller values represent a smaller spread from the regression line, as seen in the graph to the right.
As before N is the number of data points, so if more of them fit within a given spread of residuals, that will reduce the standard error. Finally, n is the number of coefficients; the more of them there are the greater the standard error, because they add degrees of freedom to the equation and make it easier to fit more precisely, even though the data hasn’t changed. It is therefore subtracted from the total number of data points N, which decreases the denominator and increases the standard error. Remember “n equations for n unknowns”? That means that one data point is required for each coefficient to determine them exactly, and the remaining N − n data points are responsible for the variation around the line (the residuals).

The standard error of the HISP(POP_OTHER) regression is

Σ = 67

Note that in the above graph, almost all of the data lies close to the regression line, falling within the confidence interval ±3Σ = ±201, denoted by the dashed blue lines. This is much better than simply describing the dependent variable by its mean value, since ±3σa = ±636. This model therefore accounts for a large fraction of the variation in the HISP data, leaving a much smaller set of residuals that must be accounted for by other factors. We can say that we have partitioned the variation between the model and the remaining residuals.

The coefficient of determination is a convenient and accepted way to compare the standard error of the equation Σ and the dependent variable’s standard deviation σa, and thereby describe the overall goodness-of-fit of the equation:

R2 = 1 − (ε12 + ε22 + … + εN2)/((a1 − μa)2 + (a2 − μa)2 + … + (aN − μa)2)

If the regression line perfectly fits the data, the residuals εi will all be zero and R2 will be one; when the residuals approach the standard deviation of the dependent variable, the second term will be one and R2 will be zero. One way to interpret the coefficient of determination is as a generalization of correlation to a set of explanatory variables.
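The least-squares coefficients and R2 can be computed directly for the single-explanatory-variable case. A minimal Python sketch, using a small made-up pairing of values (not the HISP/POP_OTHER attributes) and the standard textbook formula for the slope:

```python
import statistics

# Hypothetical paired attribute values (not the tutorial data).
b = [10, 20, 30, 40, 50, 60]       # explanatory variable
a = [25, 52, 70, 96, 125, 149]     # dependent variable

N = len(a)
mu_a, mu_b = statistics.fmean(a), statistics.fmean(b)

# Ordinary least squares: the slope and intercept that minimize the
# sum of squared residuals.
beta = sum((x - mu_b) * (y - mu_a) for x, y in zip(b, a)) / \
       sum((x - mu_b) ** 2 for x in b)
alpha = mu_a - beta * mu_b

residuals = [y - (alpha + beta * x) for x, y in zip(b, a)]

# Coefficient of determination: 1 minus the ratio of the residual variation
# to the dependent variable's total variation around its mean.
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mu_a) ** 2 for y in a)
r_squared = 1 - ss_res / ss_tot
```

Two properties worth noticing: the residuals of an ordinary least-squares fit always sum to zero, and for this nearly linear toy data R2 comes out close to one.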
It can be shown that, when there is only one explanatory variable, R2 will equal the square of the correlation ρ with the dependent variable. For the HISP(POP_OTHER) regression,

R2 = 0.90

which matches the correlation calculated above, since the square of 0.95 is 0.90.

Because the coefficient of determination can improve simply by adding more explanatory variables, i.e. by increasing n, a related quantity that provides a better estimate of significance is the adjusted coefficient of determination:

R̅2 = 1 − (1 − R2) × (N − 1)/(N − n)

R̅2 will always be less than or equal to R2, and it can be negative, unlike R2. The significance of your equation will be greatest when R̅2 is maximized. For the example regression, R2 ≈ R̅2 since N (4979) is much larger than n (2).

The F statistic is another common way to analyze the dependence of your model on the number of explanatory variables you’ve chosen. It compares the “explained” variance R2 that follows from these n − 1 variables to the “unexplained” variance 1 − R2 remaining in the N − n unfitted data points:

F = (R2/(n − 1)) / ((1 − R2)/(N − n))
F can be as small as 0, when the numerator R2/(n − 1) is 0: none of the variance in the dependent variable is explained. F can be as large as ∞, when the denominator (1 − R2)/(N − n) is 0: all of the variance in the dependent variable is explained. So the regression is better when F >> 1; for the HISP(POP_OTHER) regression, F = 45,000.

But could a different set of coefficient values be substituted and produce a better result? When coefficient values are selected with random probability and their F values are calculated, an F distribution results, such as the graph of ∂F p versus F shown at the right; clearly some F values are more likely than others. Generally speaking values of F >> 1 have a low probability per unit value ∂F p, and the total probability p that random coefficient values will have F > FRegression is very small, as suggested by the red portion of the F distribution graph.

Is there a significant probability p that random coefficient values could produce better results than the regression best-fit? This question is an example of a null hypothesis. A significance level is a value of p below which you may decide to reject the null hypothesis, i.e. decide that FRegression is significant. Commonly these are stated in the form p < 0.1 or p < 0.05. The former represents a less-than-1-in-10 chance and the latter a less-than-1-in-20 chance that a random result will produce a better F. For the HISP(POP_OTHER) regression, p ≈ 0, so FRegression is clearly significant and we can reject the null hypothesis.

Standard Errors of the Coefficients

Once the overall goodness-of-fit has been established, the individual coefficients should come under scrutiny. Because the best-fit regression line is only one of many that could pass through the data, the coefficients also clearly have a range of values, e.g. tilting the line upward for a larger slope or downward for a smaller slope.
These values therefore have their own distributions, whose widths are described by the standard errors of the coefficients, which for the HISP(POP_OTHER) regression are sα = 1.1 and sβ = 0.008. You will commonly see coefficient errors expressed together with the coefficients in the form β ± sβ, e.g.

HISP = (16.5 ± 1.1) + (1.788 ± 0.008) × POP_OTHER

Note that this is an expression of just one possible confidence interval; to claim more certainty, a multiple of the standard error is generally necessary. In addition, we can set up another null hypothesis: could these coefficients be left out of the model with little effect, i.e. are they significantly different from zero? A simple test of their significance is based on the t-statistic:

tα = (α – 0)/sα = 15
tβ = (β – 0)/sβ = 200

Like the F-statistic, we can test these values with the t distribution, which charts the probability that a random set of values could produce the observed coefficient. When the t values are greater than about two, i.e. the coefficients are more than two standard errors away from zero, the coefficients are considered good estimates. More precisely, suppose the data were completely random, e.g. HISP showed no dependence on POP_OTHER; then we would expect β to be zero and the intercept α to equal the mean μ of HISP.

The coefficient of determination for the dependence of the HISP attribute on the POP_OTHER attribute is good, but looking at the scatterplot matrix there appears to be correlation not just with POP_OTHER but also with POP_MULTI and, to a lesser extent, with POP_BLACK and POP_NATV. In general, we also know that Spanish-speaking people can be of any racial background. We may therefore be able to produce a better fit by including these attributes in the analysis with a multiple linear regression.
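The standard errors and t-statistics above can be computed from the residuals of the fit. Here is a minimal sketch for a one-variable least-squares fit, using the textbook formulas on synthetic data (the coefficients 16.5 and 1.788 are borrowed from the example regression only to generate illustrative data; this is not ArcGIS output).

```python
import math
import random

random.seed(2)

# Synthetic data built around the example coefficients, plus noise.
x = [float(i) for i in range(100)]
y = [16.5 + 1.788 * xi + random.gauss(0, 5) for xi in x]
N = len(y)

# Ordinary least squares for y = alpha + beta * x
x_mean = sum(x) / N
y_mean = sum(y) / N
s_xx = sum((xi - x_mean) ** 2 for xi in x)
beta = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
alpha = y_mean - beta * x_mean

# Residual variance, with N - 2 degrees of freedom (two coefficients fitted).
ss_res = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
s2 = ss_res / (N - 2)

# Standard errors of the coefficients.
se_beta = math.sqrt(s2 / s_xx)
se_alpha = math.sqrt(s2 * (1 / N + x_mean ** 2 / s_xx))

# t-statistics: how many standard errors each coefficient is from zero.
t_alpha = (alpha - 0) / se_alpha
t_beta = (beta - 0) / se_beta

print(f"alpha = {alpha:.2f} +/- {se_alpha:.2f}  (t = {t_alpha:.1f})")
print(f"beta  = {beta:.3f} +/- {se_beta:.3f}  (t = {t_beta:.1f})")
```

Both t values come out far above two here, so by the rule of thumb in the text neither coefficient could be dropped from the model without a significant effect.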
Procedure 6: Multiple Linear Regression
ArcGIS provides a tool for calculating the ordinary least squares fit to a multiple linear regression of an attribute dependent on multiple other attributes, providing detailed statistical characteristics of a fit described by the equation

A = α + β1 × B1 + β2 × B2 + … + βn–1 × Bn–1

This includes the coefficient of determination R², which means the tool can also be used to calculate the correlation between any pair of attributes.
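To make the equation concrete, here is a minimal sketch of such a fit with two explanatory variables, solved through the normal equations (Xᵀ X) c = Xᵀ a with Gaussian elimination. This is the underlying arithmetic, not the ArcGIS tool itself, and the data is synthetic, not the tutorial’s census attributes.

```python
import random

random.seed(3)

# Synthetic data: A depends on two explanatory attributes B1 and B2.
b1 = [random.uniform(0, 100) for _ in range(200)]
b2 = [random.uniform(0, 100) for _ in range(200)]
a = [5.0 + 1.5 * u + 0.5 * v + random.gauss(0, 2) for u, v in zip(b1, b2)]

# Design matrix with a leading column of 1s for the intercept alpha.
X = [[1.0, u, v] for u, v in zip(b1, b2)]
n = 3  # number of coefficients: alpha, beta1, beta2

# Normal equations: (X^T X) c = X^T a
xtx = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
xta = [sum(row[i] * ai for row, ai in zip(X, a)) for i in range(n)]

# Solve the 3x3 system by Gaussian elimination with partial pivoting.
M = [row[:] + [rhs] for row, rhs in zip(xtx, xta)]
for col in range(n):
    pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
    M[col], M[pivot] = M[pivot], M[col]
    for r in range(col + 1, n):
        factor = M[r][col] / M[col][col]
        for c in range(col, n + 1):
            M[r][c] -= factor * M[col][c]
coef = [0.0] * n
for r in range(n - 1, -1, -1):
    coef[r] = (M[r][n] - sum(M[r][c] * coef[c] for c in range(r + 1, n))) / M[r][r]

alpha, beta1, beta2 = coef
print(f"A = {alpha:.2f} + {beta1:.3f} * B1 + {beta2:.3f} * B2")

# Coefficient of determination for the multiple fit.
pred = [sum(c * xi for c, xi in zip(coef, row)) for row in X]
a_mean = sum(a) / len(a)
r2 = 1 - (sum((ai - pi) ** 2 for ai, pi in zip(a, pred))
          / sum((ai - a_mean) ** 2 for ai in a))
print(f"R^2 = {r2:.4f}")
```

The fitted coefficients recover the values used to generate the data to within a few standard errors, and R² is computed exactly as in the single-variable case, from explained versus total variance.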
Excel provides a function LINEST that can also be used to calculate regression coefficients and standard errors, but it’s a bit cumbersome to use.