Sunday, November 15, 2015

Correlation and Spatial Autocorrelation

 
Part 1:  Correlation, using Excel and SPSS to create a scatterplot

Introduction:  This analysis examines what type of correlation, if any, exists between distance and sound level.  Below is a chart of the variables.
 
 
 
 
 

The null hypothesis states that there is no linear association between distance and sound level.  The alternative hypothesis states that there is a linear association between distance and sound level.  The Pearson correlation statistic was calculated using a two-tailed test and a significance level of 0.05.  The calculation gives a correlation of -0.896, which represents a strong negative relationship between distance and sound level.  The null hypothesis is rejected, indicating that there is a linear association between distance and sound.  As distance increases, the sound level decreases.
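The test described above can be sketched in a few lines of Python. This is only an illustration: the distance and sound readings are made up (they are not the values from the chart), the function names are my own, and the critical value 2.306 is the two-tailed t value for 8 degrees of freedom at the 0.05 level, which applies only because the made-up sample has 10 observations.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def t_statistic(r, n):
    """t statistic for H0: no linear association (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical readings for illustration only, NOT the data from the chart.
distance = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sound = [100, 95, 93, 88, 85, 80, 79, 72, 70, 66]

r = pearson_r(distance, sound)
t = t_statistic(r, len(distance))

T_CRITICAL = 2.306  # two-tailed, alpha = 0.05, df = 10 - 2 = 8
reject_null = abs(t) > T_CRITICAL

print(f"r = {r:.3f}, t = {t:.2f}, reject null: {reject_null}")
```

SPSS reports a p-value directly instead of comparing against a critical value, but the decision is the same: reject the null hypothesis when the p-value is below 0.05.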

2.  This example uses census tract and population data for Detroit, MI, from the Excel sheets provided.  I created a correlation matrix with all the variables listed on the Excel spreadsheet.

The correlation matrix provides statistical evidence for a variety of positive and negative correlations between race and education level.  It also shows the type of employment associated with each race.  Results are discussed in terms of strength of association.
Perfect Association- all points fit on the trend line.
Strong Association- point pattern is tightly packed near the trend line.
Weak Association- point pattern is widely spread around the trend line.
No Association- point pattern is distributed around the trend line with no pattern.

Correlation Strength:

 
 
 

For example, there is a strong positive correlation (.753) between median home value and holding a bachelor's degree, which suggests that the more education people have, the more income they have to afford a nicer home.  This of course only indicates a correlation, not that everyone with more education has more money.  There is also a moderate positive correlation (.698) between the white population and holding a bachelor's degree, which shows that tracts with more white residents tend to have more residents with bachelor's degrees.  On the other end of the spectrum, the matrix indicates that the black population does not have the same correlation with a bachelor's degree: the coefficient is -.305, a low negative correlation with education.  Median household income shows a high positive correlation (.883) with median home value.  This suggests that households with higher incomes tend to have higher-valued homes.  There is a trend present in this matrix: tracts with more white residents and more bachelor's degrees have higher home values and higher household incomes.  The black and Asian populations have a low correlation with household income and home values.  Overall, the matrix suggests that the Hispanic population is negatively correlated with wealth and education, suggesting that the Hispanic population is at a strong disadvantage.
 
Part 2:  Spatial Autocorrelation
 
 
Introduction: 
 

The Texas Election Commission is interested in analyzing election patterns for the state of Texas from 1980 to 2012. The commission is primarily concerned with the clustering of particular voting patterns and whether these patterns have remained consistent over this period.  Percent democratic vote and voter turnout data for both the 1980 and 2012 elections have been analyzed to determine whether clustering is occurring in Texas, and whether similar voting patterns persist over this period. Furthermore, the Texas Election Commission wants to know, if clustering is occurring, whether certain population variables influence the patterns. Therefore, data on the percent Hispanic population in Texas has been used in relation to the voting data, considering Texas’s significant Hispanic population. After statistical and spatial analysis of the data, the Texas Election Commission is able to provide identifiable voting pattern information to the governor.
Important Definitions:
In statistics, Moran's I is a measure of spatial autocorrelation developed by Patrick Alfred Pierce Moran. Spatial autocorrelation is characterized by a correlation in a signal among nearby locations in space. Spatial autocorrelation is more complex than one-dimensional autocorrelation because spatial correlation is multi-dimensional (i.e. 2 or 3 dimensions of space) and multi-directional.
Local Indicators of Spatial Autocorrelation (LISA)
These maps provide a spatial component of spatial autocorrelation. They use spatial weights to determine clustering.  Every colored area on the map is statistically significant; white counties are not significant.  These are not true tests, because they are not based on the central limit theorem.
 
  
Methods:

To identify whether clustering of certain voting patterns is occurring, and whether these patterns are consistent over time, the data is analyzed through spatial autocorrelation. Spatial autocorrelation analysis produces a spatial representation that can be used to identify whether the distribution of a variable indicates a systematic pattern over space. If clustering is occurring in voting patterns in the state of Texas, spatial autocorrelation will show not only whether there is clustering, but also the areas in which clustering is occurring. The Texas Election Commission is also interested in whether certain population variables influence possible clustering patterns. In addition to the percent democratic vote and voter turnout for 1980 and 2012, the percent Hispanic population is taken into consideration to examine whether any relationship exists between certain voting patterns and the fairly dense Hispanic population in Texas.
 
 

The Moran's I calculation compares the value of a variable at each location with the values at neighboring locations, as defined by the spatial weights, and produces a number that generally falls between -1.0 (strong dispersion, where unlike values sit next to each other) and 1.0 (strong clustering), with values near 0 indicating a random spatial pattern. This number expresses the strength of the autocorrelation. Not only can GeoDa produce a Moran's I statistic, it also produces a scatterplot of four quadrants indicating where each observed value for the tested variable lies. The quadrants distinguish areas of high values surrounded by other areas of high values (Quadrant I), areas of low values surrounded by areas of high values (Quadrant II), areas of low values surrounded by other areas of low values (Quadrant III), and areas of high values surrounded by areas of low values (Quadrant IV). Because areas closer to one another tend to be more similar than areas farther away, most of the observed values for a variable will fall within Quadrants I and III of the scatterplot. Values that fall within Quadrants II and IV tend to indicate outliers, representing areas that are unlike their surrounding areas.  The Moran's I statistic is helpful in determining the strength of clustering patterns for a variable, while the scatterplot helps identify the details of those patterns.
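To make the statistic concrete, here is a minimal Python sketch of global Moran's I using the standard formula I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all weights. The six-cell "chain" of neighbors and the values are made up for illustration, the function names are my own, and this is not necessarily how GeoDa builds its weights or computes the statistic internally.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total of all spatial weights
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    denom = sum(d * d for d in dev)
    return (n / s0) * (cross / denom)

def chain_weights(n):
    """Binary weights for n cells in a row: each cell neighbors the next."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1
        w[i + 1][i] = 1
    return w

# Hypothetical values: low values grouped together, then high values.
clustered = morans_i([1, 1, 1, 5, 5, 5], chain_weights(6))

# Hypothetical values: low and high values alternate, so neighbors differ.
alternating = morans_i([1, 5, 1, 5, 1, 5], chain_weights(6))

print(round(clustered, 3), round(alternating, 3))
```

The grouped values give a positive result and the alternating values give a negative one, which is the contrast between clustering and dispersion that the statistic is meant to capture.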
 
 
The LISA cluster map is also generated through GeoDa, and can be used alongside the Moran's I calculation. A cluster map was created for each variable to identify the specific areas where clustering of that variable is significant. The cluster map incorporates the placement of each value on the Moran's I scatterplot and displays the exact locations of areas of high and low values in comparison to one another. The map helps to identify exactly where clustering occurs by showing where the areas of high values and areas of low values are located, as well as the location of certain outliers. Once the Moran's I calculation provides evidence of significant clustering, the LISA cluster map can put into perspective where the clustering is actually occurring.
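The idea behind a cluster map can be sketched with a local Moran's I for each area. The sketch below assumes row-standardized weights (the spatial lag is the average deviation of the neighbors), labels each area by its scatterplot quadrant, and estimates significance with a one-sided conditional permutation test. The data, the neighbor lists, and the function name are hypothetical, and GeoDa's exact procedure may differ.

```python
import random

def local_moran(values, neighbors, permutations=999, seed=0):
    """Return (local I, quadrant label, pseudo p-value) for each area.

    Assumes every area has at least one neighbor and that the values
    are not all identical.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # variance of the values
    rng = random.Random(seed)
    results = []
    for i in range(n):
        k = len(neighbors[i])
        lag = sum(z[j] for j in neighbors[i]) / k  # average neighbor deviation
        local_i = z[i] * lag / m2
        if z[i] >= 0:
            label = "High-High" if lag >= 0 else "High-Low"
        else:
            label = "Low-Low" if lag < 0 else "Low-High"
        # Conditional permutation: keep area i fixed, shuffle who its neighbors are.
        others = z[:i] + z[i + 1:]
        extreme = 0
        for _ in range(permutations):
            draw = rng.sample(others, k)
            perm_i = z[i] * (sum(draw) / k) / m2
            if local_i >= 0 and perm_i >= local_i:
                extreme += 1
            elif local_i < 0 and perm_i <= local_i:
                extreme += 1
        p_value = (extreme + 1) / (permutations + 1)
        results.append((local_i, label, p_value))
    return results

# Hypothetical six areas in a row; each area neighbors the adjacent ones.
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
grouped = local_moran([1, 1, 1, 5, 5, 5], chain, permutations=99)
mixed = local_moran([1, 5, 1, 5, 1, 5], chain, permutations=99)

for local_i, label, p_value in grouped:
    print(f"{local_i:+.2f}  {label:<9}  p = {p_value:.2f}")
```

On a cluster map, only areas whose pseudo p-value falls below the chosen cutoff would be colored by their label; the rest would be left white. With only six made-up areas the p-values here are not meaningful, so the example is useful mainly for seeing how the labels are assigned.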
 
In addition to the spatial autocorrelation statistics represented through Moran's I scatterplots and LISA cluster maps, simple correlation statistics are also useful for determining relationships between variables. Significant correlations between variables, particularly between the percent Hispanic population and specific voting patterns, are useful for determining why clustering is occurring. A correlation matrix run through SPSS provides the correlation statistics comparing each of the five variables to one another in order to identify whether any of the variables have a strong linear relationship with one another. If there are significant correlations between certain variables, then those correlations can possibly explain the reason for certain voting patterns and clustering.
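A correlation matrix is simply the Pearson coefficient computed for every pair of variables. The Python sketch below shows the structure of such a matrix. The variable names and county values are invented for illustration and are not the Texas data, and SPSS additionally reports a significance value for each pair, which is omitted here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    """Nested dict of Pearson r for every pair of named columns."""
    names = list(columns)
    return {
        a: {b: pearson_r(columns[a], columns[b]) for b in names}
        for a in names
    }

# Hypothetical values for five counties, NOT the actual Texas data.
data = {
    "pct_hispanic": [10, 25, 40, 60, 85],
    "pct_dem_2012": [20, 30, 45, 55, 75],
    "turnout_2012": [65, 60, 52, 45, 38],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Each variable correlates perfectly with itself, so the diagonal is 1, and the matrix is symmetric, so only one triangle needs to be read.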
 
 
 
Results:
 
The data for the first variable, percent democratic vote in 1980, produced a fairly strong Moran's I statistic of 0.5752. This statistic indicates there is evident clustering of percent democratic vote throughout the state of Texas in 1980. The scatterplot produced alongside the Moran's I statistic reflects clustering of areas with high democratic votes surrounded by other areas of high democratic votes, along with areas of low democratic votes next to other areas with low democratic votes. The LISA cluster map shows precisely where these high and low democratic voting areas were located in 1980. The areas with a clustering of high democratic vote are apparent in the southernmost part of the state, along with a few areas in the eastern part of the state. The areas with very low democratic vote are located predominantly in the northern and mid-western parts of the state.




 
The data for percent democratic vote in 2012 produced similar results to the 1980 data in both the Moran's I scatterplot and the LISA cluster map. The Moran's I statistic for democratic vote in 2012, though similar to the 1980 value, shows a somewhat stronger spatial autocorrelation at 0.6959. The clustering in 2012 is slightly more apparent than in 1980; however, areas of high democratic vote are still surrounded by other areas of high democratic vote, and areas of low democratic vote are still surrounded by other areas of low democratic vote. In addition to the similarity of the Moran's I statistics for 1980 and 2012, the location of clustering for democratic vote is also similar. Clustering of high democratic vote is still located toward the southernmost part of the state, and areas of low democratic vote are primarily toward the northern part of the state.

 
 





Voter turnout in 1980 was also analyzed through a Moran's I scatterplot and LISA cluster map in order to identify noticeable clustering patterns. The Moran's I value of 0.4681 indicates there is a considerable clustering pattern in the state of Texas in regard to voter turnout in 1980. The scatterplot shows noticeably more outlier areas for the voter turnout variable than for the percent democratic vote variable. However, the majority of the data represents clustering of areas of high voter turnout next to other areas of high voter turnout, along with clustering of areas of low voter turnout surrounded by other areas of low voter turnout.  The LISA map displays the exact locations where this clustering is occurring. The locations with consistently high voter turnout are primarily in the northernmost part of the state, with a few areas toward the center of the state. The map also indicates that the large areas of low voter turnout are located in the southernmost part of the state, as well as in a small area toward the mid-western part of the state.
 
Voter turnout 1980.
 
The voter turnout data for 2012 shows similar clustering patterns to 1980 in both the Moran's I scatterplot and the LISA cluster map. The Moran's I value of 0.3359 for voter turnout in 2012 is lower than the Moran's I value of 0.4681 for 1980. Though the value indicates there is evident clustering of voter turnout in 2012, the clustered areas of high and low voter turnout are not as dense as in 1980. The LISA cluster map for voter turnout displays clustered areas that are comparable to 1980. Even though there is noticeable clustering of high voter turnout in the northern part of the state during the 2012 election, in 1980 the northern part of the state had a much more expansive area of high voter turnout. There is still similar clustering of high voter turnout in various areas of central Texas in 2012, just as there was in 1980. In addition to similar clustering patterns for high voter turnout from 1980 to 2012, there is also a consistent pattern of low voter turnout in the southern part of Texas. In 2012, southern Texas maintained a significant area of low voter turnout, just as it did in 1980.
 
 
The last variable analyzed through a Moran's I scatterplot and LISA cluster map was the percent Hispanic population throughout Texas in 2010. This data was used to identify whether clustering of the Hispanic population is comparable to the identifiable clustering of certain voting patterns.  The Moran's I value of 0.7787 indicates a very strong clustering pattern of the Hispanic population.  The Moran's I scatterplot shows very obvious clustering of highly populated Hispanic areas as well as of areas with very low Hispanic populations.  The LISA cluster map shows the specific areas where high Hispanic population is clustered, and the areas where Hispanic population is very low. The map indicates that the entire southern part of Texas, along the Texas and Mexico border, is a widespread area of high Hispanic population. In contrast, the northwestern part of Texas shows a vast area of very low Hispanic population. These clustering patterns can be examined in relation to the clustering in voting patterns to indicate whether there is a relationship between the Hispanic population and particular voting patterns.
 
Percent Hispanic in 2010



 
 
 

In addition to comparing the Moran's I scatterplots and the LISA maps to identify a relationship between the Hispanic population and voting patterns, results from a correlation matrix help solidify any relationship that may be observed.  The correlation matrix produced several statistically significant relationships between certain variables. The Pearson correlation statistic was statistically significant when comparing the percent Hispanic population to all voting pattern variables.
Conclusion:
The large majority of the results indicate there is a definite clustering of certain voting patterns in the state of Texas.  Not only is clustering of voting patterns occurring, the patterns also appear to remain consistent over time.  The southern part of Texas shows a clustering of a high percentage of democratic votes in 1980.  This clustering of democratic votes has remained fairly consistent into 2012.  There is also a clustering of low democratic votes in the northern part of Texas, which has likewise remained consistent from 1980 to 2012.  The southern part of Texas has maintained a pattern of low voter turnout from 1980 to 2012.  These results suggest consistent clustering patterns in which southern Texas is more democratic and has lower voter turnout, and northern Texas is less democratic but has noticeably higher voter turnout.
 
In addition to these consistent patterns, there is also a strong correlation between the percent Hispanic population in Texas and both percent democratic vote and voter turnout. The statistics indicate that areas with larger Hispanic populations are also areas with a higher percentage of democratic votes in 2012. Both variables show strong clustering along the southern border of Texas. The correlation statistics also show a strong relationship between percent Hispanic population and voter turnout in both the 1980 and 2012 elections. The results indicate that areas with high Hispanic populations are also areas with low voter turnout. Both of these variables, once again, cluster along the southern Texas border. The results indicate a strong relationship between Hispanic population and particular voting patterns, which matches the cluster maps showing southern Texas as an area of high Hispanic population and, similarly, an area of high percent democratic vote and low voter turnout.  Overall, the results are conclusive in supporting the idea that the Hispanic population in Texas has a significant influence on the clustering of particular voting patterns.