Tuesday, December 1, 2015

Regression Analysis

 
 

Part 1:

A study collected data on crime rates and poverty for a town. A local news station retrieved some of this data and claimed that as the number of kids receiving free lunches increases, so does crime in the same area.

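After conducting a regression analysis in SPSS, one would say that the station's claim is only weakly supported because the R squared value is .173 (Figure 1).  R squared ranges from 0 to 1, and a value of .173 means the percentage of free lunches explains only about 17% of the variation in crime rate, so the relationship has little strength.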

If a new area of town was identified with 23.5% of its children receiving free lunch, the regression equation (y = a + bx) predicts a corresponding crime rate of about 22.29%.
The coefficients table (Figure 2) supplies the constant (a) and the slope (b) needed for this equation, and x is the percentage changed into a decimal.  The final equation can be solved:

Y = 21.89 + 1.685 (.235)
Y ≈ 22.29
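The substitution above can be checked with a few lines of Python. This is only a sketch: the constant and slope are the rounded values quoted in the post, the function name is made up for illustration, and x is entered as a decimal because that is how the post does it. If the SPSS data stored the percentage as 23.5 rather than 0.235, the same function would be called with 23.5 instead.

```python
# Rounded coefficients quoted in the post (Figure 2).
INTERCEPT = 21.89  # a, the constant
SLOPE = 1.685      # b, the slope


def predict_crime_rate(free_lunch):
    """Return the predicted crime rate from y = a + b*x."""
    return INTERCEPT + SLOPE * free_lunch


# The post enters 23.5% as the decimal 0.235.
prediction = predict_crime_rate(0.235)
print(round(prediction, 2))  # 22.29
```

Because the coefficients are rounded, the last few digits can differ slightly from what SPSS reports with full precision.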

The null hypothesis states that there is no linear association between crime rate and the percentage of free lunches given out.  The alternative hypothesis states that there is a linear relationship between crime rate and the percentage of free lunches given out.  To determine whether a linear relationship exists between the two variables, linear regression analysis is used.  The null hypothesis can be rejected because the significance value of .005 (Figure 2) is below 0.05.  However, the linear association between crime rate and the percentage of free lunches given out is very weak.
One would not be confident using the percentage of free lunches to predict crime rate, because the R squared value of .173 is much closer to 0 than it is to 1.  The significance value of .005 means the relationship is unlikely to be due to chance, but a statistically significant relationship can still be a weak one, and regression alone does not show that one variable causes the other.

Figure 1:  Model Summary. The R squared value of .173 indicates a weak relationship because it is much closer to 0 than to 1.


Figure 2:  Coefficients table showing the significance level of .005.

Part 2:

Introduction:

The UW System is interested in determining whether certain factors influence enrollment at two different schools, the University of Wisconsin Whitewater and the University of Wisconsin Eau Claire. Enrollment at each university may be influenced by factors such as income and education in each county, as well as the distance of each county from each university. These variables can shape a student’s choice between universities, thus affecting enrollment at different schools. Data on enrollment at each university, as well as income, percent with a bachelor’s degree, and distance for each county in Wisconsin, is used. To determine whether these variables influence enrollment, the data is analyzed using regression analysis. After performing regression analysis on the data, the UW System can determine which factors are most significant in influencing enrollment at the University of Wisconsin Whitewater and the University of Wisconsin Eau Claire. Once significant factors are determined, maps are used alongside the regression statistics to identify spatial patterns of enrollment based on the most influential variables.

Methods: 

Regression analysis will determine whether or not to reject the null hypothesis, which states that there is no linear relationship between each variable and enrollment at each university. If the result is statistically significant, the null hypothesis is rejected in favor of the alternative hypothesis, which states that there is a linear relationship between the variable and enrollment. The regression analyses are performed in SPSS to determine whether any of the three suspected variables has a significant relationship to enrollment at each university. Six different regression analyses are performed, pairing the enrollment data for each of the two universities with each of the three variables. The results of each analysis indicate which variables have significant relationships to enrollment at each university, and thus which factors are more influential in a student’s decision to attend a given university.

The file provided includes the following information for each county:

  • Education (percent with a Bachelor's degree)
  • An income variable
  • Distance from the center of each county
  • Number of students attending the different UW schools
  • Number of students under the age of 24
  • County Name

The focus was on two UW schools, Eau Claire and Whitewater.  The information for these two schools was copied into a new spreadsheet for this analysis.  Two additional fields, PopdistanceEC and PopdistanceWW, were created to normalize the data.


Figure 3: Excel fields used and created
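The post does not spell out how PopdistanceEC and PopdistanceWW were calculated. One plausible reading, suggested by the later "Normalization Population/Distance" label, is county population divided by the county's distance from campus. The sketch below assumes that formula; the function name and the sample numbers are hypothetical and are not taken from the spreadsheet.

```python
def pop_distance(population, distance):
    """Population divided by distance (assumed meaning of the Popdistance fields)."""
    if distance <= 0:
        # A county at distance 0 would need special handling in a real dataset.
        raise ValueError("distance must be positive")
    return population / distance


# Made-up example counties: (population, distance to campus).
counties = {"County A": (100000, 50.0), "County B": (20000, 40.0)}
normalized = {
    name: pop_distance(pop, dist) for name, (pop, dist) in counties.items()
}
print(normalized)
```

Under this assumption, a large county far from campus and a small county close to campus can end up with similar values, which is the point of normalizing.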

The variables chosen were:

Population Distance
Age 24 and under
Median Household Income

These variables were used for both UWEC and UWW.  Results were chosen for mapping based on significance level, and the maps were made using the residuals.  The residual is the amount of deviation of each point from the best fit line or regression line.  It represents the difference between the actual and the predicted value of Y.
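The residual described above is simple to compute once the constant and slope are known. The following is a minimal sketch with made-up numbers, not values from the SPSS output; the function name and all inputs are hypothetical.

```python
def residual(actual_y, x, intercept, slope):
    """Actual value of Y minus the value predicted by the regression line."""
    predicted_y = intercept + slope * x
    return actual_y - predicted_y


# Hypothetical line: y = 20 + 0.5x, so x = 100 predicts y = 70.
above = residual(actual_y=120.0, x=100.0, intercept=20.0, slope=0.5)
below = residual(actual_y=40.0, x=100.0, intercept=20.0, slope=0.5)
print(above, below)  # 50.0 -30.0
```

A positive residual means the point sits above the regression line, and a negative residual means it sits below, which is the distinction the residual maps draw between counties.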

Results:

Figures 4 and 5 show the regression analysis of UWEC enrollment against the normalized population distance, and Figure 6 maps the residuals.  Figure 4 is the coefficients table.  The table is used to write the regression equation because it has the constant (B) and the slope (b, popdistanceec).  The significance level is also in this table and helps decide whether to map the result.  The result is significant at the 99% level, so the residuals should be mapped.  Figure 5 is the model summary and has the R squared value (.945).  R squared illustrates how much X explains Y and ranges from 0 to 1, with 1 being very strong.  The R squared value of .945 is very strong, meaning that UWEC enrollment depends closely on population distance.  Figure 6 is the map of the residuals.  Milwaukee County, in red, falls in the lowest residual class, 5.5 or more deviations below the best fit line or regression line.  Northwestern Wisconsin sends a lot of students to UWEC, and the counties above the best fit line, in blue, mainly include Wood, Barron, Marathon and Chippewa Counties.  The counties in green, in northern Wisconsin, also send a high number of students, and Eau Claire County is among them.  One would think Eau Claire County would be in blue as well, but a great many students from around the state attend UWEC.  We reject the null hypothesis because there is a significant linear relationship.
Figure 4:  Coefficient table which shows the constant (B) and the significance level.

Figure 5:  Model Summary shows the R-square value.


Figure 6:  Map of residuals for the University of Wisconsin Eau Claire and the distance from which students come for school.

Figures 7 and 8 show the regression analysis for UW Whitewater.  Figure 7 shows the constant of 24.153 and a slope of 0.068, along with a significance level of 99%, indicating that the residuals should be mapped.  Figure 8 shows the R squared value of 0.779, which indicates a strong relationship between enrollment and population distance.  Figure 9 is the map of the residuals for UW Whitewater and the normalized population distance.  Milwaukee County, in red again, is far below the best fit line or regression line.  Dane, Jefferson, Rock and Waukesha Counties, in blue, are above the line.  Marinette County sends a lot of students compared to the rest of northern and central Wisconsin.  We reject the null hypothesis because there is a significant linear relationship.
Figure 7:  Coefficients table with the constant (B) of 24.153 and the slope of 0.068.  The significance level is 99%.

Figure 8:  Model Summary has the R squared value of 0.779.

Figure 9:  UW Whitewater Population Distance from Campus. 


Figures 10 and 11 show the regression analysis for UW Eau Claire and the population under age 24.  Figure 10 is the coefficients table, with the constant (B) of 66.590 and the slope of 0.006.  The significance value of .005 corresponds to the 99.5% level, meaning the result is significant and the residuals should be mapped.  Figure 11, the model summary, shows the R squared value of .109, which is very low and indicates very little relationship between the two variables.  Figure 12 is the map of residuals for UW Eau Claire and the population under the age of 24.  Eau Claire County, in blue, is the only county that is noticeably above the best fit line.  A few green counties near Eau Claire County are also above the best fit or regression line.  Milwaukee County, in red again, is far below.  We reject the null hypothesis because there is a significant linear relationship.
Figure 10: Coefficients table with the constant of 66.590 and the slope of 0.006. The significance level is .005, or 99.5%.
Figure 11:  Model Summary showing the R squared value of .109.
 
Figure 12:  Map of residuals for UW Eau Claire and the population under age 24.
 

Figures 13 and 14 show the regression analysis for UW Whitewater and the population age 24 and under.  Figure 13 is the coefficients table with the constant (B) of 16.284 and the slope of 0.016.  The significance value is .000, which is significant at the 99% level.  This shows that the residuals should be mapped.  Figure 14 is the model summary with the R squared value of .515.  Figure 15 is the map of the residuals for students at UW Whitewater and the population 24 and under.  Rock, Jefferson, Walworth and Waukesha Counties are far above the best fit line. This could be due to the fact that the city of Whitewater falls on the county lines of Walworth, Jefferson and Rock Counties.  The green counties lie directly around the blue ones, almost creating a buffer zone.  It makes a great deal of sense for the green area to surround the college area, as many students could live off campus in neighboring counties.  Milwaukee, Eau Claire, La Crosse, Brown, Portage and Winnebago Counties are far below the best fit line.  We reject the null hypothesis because there is a significant linear relationship.

Figure 13:  Coefficients table with the constant (B) of 16.284 and slope of 0.016.  Significance level of .000 or 99%. 


Figure 14:  Model summary table shows the R squared value of .515.


Figure 15:  Map of Students 24 and under in UW Whitewater. 


Figures 16 and 17 show the regression information.  Figure 16 is the coefficients table with the constant (B) of -80.982, the slope of 0.006 and the significance level of .104, or 89.6%.  This result is not significant enough to map.  Figure 17 shows the R squared value of 0.037, which is very small and does not show a relationship between the two variables.  We fail to reject the null hypothesis because there is no significant linear relationship.
Figure 16:  Coefficients table showing the constant (B) -80.982 and the slope of 0.006 and a significance level of 0.104 or 89.6%.

Figure 17:  Model Summary with R Squared value of 0.037.


Figures 18 and 19 show the regression information.  Figure 18 is the coefficients table showing the constant (B) of -579.631 and a slope of 0.022.  The significance value of .000 is significant at the 99% level, which shows that the residuals should be mapped.  Figure 19, the model summary, shows the R squared value of 0.329, which indicates a fairly weak relationship between the two variables.  Figure 20 is the map for UW Whitewater and median household income in relation to the population.  The blue counties, Dane, Jefferson, Waukesha, Milwaukee, Rock and Walworth, are above the best fit line for these two variables.  The red counties, St. Croix, Pierce, Outagamie, Calumet, Washington and Ozaukee, are far below the best fit line. They occur in pairs of neighboring counties, indicating that there could be a geographical relationship to income in similar areas, although the pairs themselves are spread out across the state.  Given that Milwaukee County is a generally poor region, one would assume its neighboring county and the others are relatively similar.  We reject the null hypothesis because there is a significant linear relationship.



Figure 18:  Coefficients table showing the constant (B) of -579.631, the slope of 0.022 and the significance level of .000, or 99%.
Figure 19:  Model Summary shows the R squared value of 0.329.
Figure 20: Map of UW Whitewater Median Household Income and Population. 
Conclusion:           
When considering the statistics as well as the residual maps, conclusions can be made about the factors that influence enrollment at different schools. Not only can the statistics determine which factors are statistically significant and have the most influence, but they also provide information concerning the pattern and strength of the influence. This information is particularly helpful when used alongside the residual maps, as the influence of certain significant factors varies by location. Overall, the statistics provide the means to determine which factors are most influential, but the maps allow clearer interpretation of where each variable has the most influence. Some factors deemed the most influential in determining enrollment at different schools are more significant in some counties than in others. Because of this, different areas seem to be more influenced by one variable and may not be as influenced by another.
Based on analysis of all the data, the most significant factor influencing enrollment at both universities is distance. When considering how other significant factors influence enrollment at each university, the influence is not the same throughout the state. While some counties may be more influenced by the percentage of bachelor’s degrees, other counties, specifically the ones closer to the university, are more influenced by distance.
The question being asked is which of the available variables help explain why students choose the schools they do.  I would say that, of the variables chosen, the one that best describes why students choose the schools they do is the number of people in each county under the age of 24.  I say this because, for both UW Whitewater and UW Eau Claire, the maps made using this data show the most information both below and above the best fit line.  Although some of the data contradicts itself, I think an overall analysis of this variable shows why students go where they do.  Another thing to keep in mind is that Milwaukee is below the best fit line in almost all the maps. This may be due to the fact that Milwaukee is one of the most segregated places in the United States, and there is also a lot of poverty in that area.






Sunday, November 15, 2015

Correlation and Spatial Autocorrelation

 
Part 1:  Correlation and using Excel and SPSS to create a scatterplot. 

Introduction:  This is an analysis of what type of correlation if any there is between distance and sound.  Below is a chart of the variables. 
 
 
 
 
 

The Null Hypothesis states that there is no linear association between distance and sound level.  The Alternative Hypothesis states there is a linear association between distance and sound level.  The Pearson correlation statistic was calculated using a two-tailed test and a significance level of 0.05.  The correlation calculation shows a value of -.896, which represents a strong negative relationship between distance and the sound level.  The Null Hypothesis is rejected, indicating that there is a linear association between distance and sound.  As distance increases, the sound level decreases.
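For readers without SPSS, the Pearson correlation used above can be sketched in plain Python. The distance and sound values below are invented for illustration and are not the data from this exercise, so the result will not match the value reported in the post; the function name is also hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Made-up example: sound level falls as distance grows.
distance = [10, 20, 30, 40, 50]
sound = [90, 85, 70, 66, 50]
r = pearson_r(distance, sound)
print(round(r, 3))
```

A value close to -1 means the points fall near a downward-sloping line, while a value close to 0 means there is little linear association.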

2.  This example uses the census tracts and population data for Detroit, MI from the Excel sheets provided.  I created a correlation matrix with all the variables listed in the Excel spreadsheet.

The correlation matrix provides statistical evidence for a variety of positive and negative correlations between race and education level.  It also shows the type of employment associated with each race.  Results will be discussed in regard to strength of association.
Perfect Association: all points fit on the trend line.
Strong Association: point pattern is tightly packed near the trend line.
Weak Association: point pattern is widely spread around the trend line.
No Association: point pattern is distributed around the trend line with no pattern.

Correlation Strength:

 
 
 

For example, there is a strong positive correlation (.753) between Median Home Value and the percentage of people with a bachelor's degree, which suggests that the more education you have, the more income you will have to be able to afford a nicer home.  This of course only implies that there is a correlation, not that everyone with more education has more money.  There is also a moderate correlation (.698) between the white population and having a bachelor's degree, which shows that tracts with more white residents tend to have more residents with a bachelor's degree.  On the other end of the spectrum, the matrix indicates that the black population does not have the same correlation to a bachelor's degree.  For the black population the value is -.305, a low negative correlation to education.  Median Household Income shows a high positive correlation (.883) with Median Home Value.  This suggests that if you have a higher income, you have a house valued higher.  There is a trend present in this matrix.  It indicates that tracts with more white residents holding a bachelor's degree have higher home values and higher household incomes.  The black and Asian populations have a low correlation to household income and home values.  Overall, the matrix suggests that the Hispanic population displays a negative correlation to wealth and education, suggesting that the Hispanic population is at a strong disadvantage.
 
Part 2:  Spatial Autocorrelation
 
 
Introduction: 
 

The Texas Election Commission is interested in analyzing election patterns for the state of Texas from 1980 to 2012. The state of Texas is predominantly concerned about clustering of particular voting patterns and whether or not these patterns have remained consistent over this period of time.   Percent democratic vote and voter turnout data for both 1980 and 2012 elections have been analyzed to determine whether or not clustering is occurring in Texas, as well as if similar voting patterns are consistent over this period of time. Furthermore, the Texas Election Commission wants to know, if clustering is occurring, whether or not certain population variables influence certain patterns. Therefore data regarding percent Hispanic population in Texas has been used in relation to the voting data, considering Texas’s significant Hispanic population. After statistical and spatial analysis of the data, the Texas Election Commission is able to provide identifiable voting pattern information to the governor. 
Important Definitions:
In statistics, Moran's I is a measure of spatial autocorrelation developed by Patrick Alfred Pierce Moran. Spatial autocorrelation is characterized by a correlation in a signal among nearby locations in space. Spatial autocorrelation is more complex than one-dimensional autocorrelation because spatial correlation is multi-dimensional (i.e. 2 or 3 dimensions of space) and multi-directional.
Local Indicators of Spatial Autocorrelation (LISA)
These maps provide a spatial component of spatial autocorrelation. They use spatial weights to determine clustering.  All colored areas on the map are significant; white counties are not significant.  These are not true tests, because they are not based on the central limit theorem.
 
  
 Methods:

In order to efficiently identify whether clustering of certain voting patterns is occurring, and whether these patterns are consistent over time, the data is analyzed through spatial autocorrelation. Spatial autocorrelation analysis produces a spatial representation which can be used to identify whether the distribution of a variable indicates a systematic pattern over space. If clustering is occurring in voting patterns in the state of Texas, spatial autocorrelation will portray not only whether there is clustering, but also the areas in which clustering is occurring. The Texas Election Commission is also interested in whether certain population variables influence possible clustering patterns. In addition to the percent democratic vote and voter turnout for 1980 and 2012, the percent Hispanic population is taken into consideration to examine whether any relationship exists between certain voting patterns and the fairly dense Hispanic population in Texas.
 
 

The Moran's I calculation compares the value of a variable at each location with the values at other locations, weighted by proximity, and produces a number between -1.0 and 1.0 that describes the strength of the autocorrelation. Values near 1.0 indicate strong clustering, values near 0 indicate a random pattern, and values near -1.0 indicate dispersion, where unlike values sit next to one another. Not only can GeoDa produce a Moran's I statistic, it also produces a scatterplot of four quadrants indicating where each observed value for the tested variable lies. Quadrants range from areas of high values surrounded by areas of other high values of a certain variable (Quadrant I), to areas of low values surrounded by areas of other low values (Quadrant III), as well as areas of low values surrounded by areas of high values (Quadrant II), and areas of high values surrounded by areas of low values (Quadrant IV). Because areas closer to one another tend to be more similar than areas further away, most of the observed values for a variable will fall within Quadrants I and III of the scatterplot. Values of a variable that fall within Quadrants II and IV tend to indicate outliers, representing areas that are unlike the surrounding areas.  The Moran's I statistic is helpful in determining the strength of clustering patterns for certain variables, while the scatterplot helps identify details concerning clustering patterns.
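To make the statistic described above more concrete, here is a minimal sketch of a global Moran's I calculation in plain Python. It is not how GeoDa is implemented: it uses a tiny made-up example of four areas in a row, with a simple binary weights matrix in which each area neighbors only the areas next to it. GeoDa typically row-standardizes its weights, so its numbers would differ. The function and variable names are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of all weights (often written S0).
    s0 = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations for every pair of areas.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    ss = sum(d * d for d in dev)
    return (n / s0) * (cross / ss)


# Four areas in a row: each area neighbors the one(s) beside it.
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 2, 3, 4], chain_weights)    # similar values adjacent
alternating = morans_i([1, 4, 1, 4], chain_weights)  # unlike values adjacent
print(round(clustered, 3), round(alternating, 3))  # 0.333 -1.0
```

In this toy example, placing similar values next to each other gives a positive statistic, while alternating high and low values gives -1.0, the dispersed end of the range.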
 
 
The LISA cluster map is also generated through GeoDa, and can be used alongside the Moran's I calculation. A cluster map was created for each variable, identifying the specific areas where clustering of that variable is significant. The cluster map incorporates the placement of each value on the Moran's I scatterplot and displays the exact locations of areas of high and low values in comparison to one another. The map helps to identify exactly where clustering occurs by representing where the areas of high values and areas of low values are located, as well as the location of certain outliers. After the Moran's I calculation provides evidence for significant clustering, the LISA cluster map can put into perspective where the clustering is actually occurring.
 
In addition to spatial autocorrelation statistics represented through Moran's I scatterplots and LISA cluster maps, simple correlation statistics are also useful for determining any relationship between certain variables. Significant correlations between certain variables, particularly between the percent Hispanic population and specific voting patterns, are useful for determining why clustering is occurring. A correlation matrix run through SPSS provides the correlation statistics comparing each of the five variables to one another in order to identify whether any of the variables have a strong linear relationship to one another. If there are significant correlations between certain variables, then those correlations can possibly explain the reason for certain voting patterns and clustering.
 
 
 
Results:
 
The data for the first variable, percent democratic vote in 1980, produced a fairly strong Moran's I statistic of 0.5752. This statistic indicates there is evident clustering of percent democratic vote throughout the state of Texas in 1980. The scatterplot produced alongside the Moran's I statistic reflects clustering of areas with high democratic votes surrounded by other areas of high democratic votes, along with areas of low democratic votes next to other areas with low democratic votes. The LISA cluster map portrays precisely where these high and low democratic voting areas in 1980 are located. The areas with a clustering of high democratic vote are apparent in the southernmost part of the state, along with a few areas in the eastern part of the state. The areas with very low democratic vote are located predominantly in the northern and mid-western parts of the state.




 
The data for the percent democratic vote in 2012 produced similar results to the 1980 data in both the Moran's I scatterplot and LISA cluster map. The Moran's I statistic for democratic vote in 2012, though similar to the 1980 value, shows a stronger spatial autocorrelation at 0.6959. The clustering in 2012 is slightly more apparent than in 1980; however, areas of high democratic vote are still surrounded by other areas of high democratic vote, and areas of low democratic vote are still surrounded by other areas of low democratic vote. In addition to the similarity between the Moran's I statistics for 1980 and 2012, the location of clustering for democratic vote is also similar. Clustering of high democratic vote is still located towards the southernmost part of the state, and areas of low democratic vote are primarily towards the northern part of the state.

 
 





The data for voter turnout in 1980 was also analyzed through a Moran's I scatterplot and LISA cluster map in order to identify noticeable clustering patterns. The Moran's I value of 0.4681 indicates there is a considerable clustering pattern in the state of Texas in regard to voter turnout in 1980. The scatterplot shows noticeably more outlier areas for the voter turnout variable compared to the percent democratic vote variable. However, the majority of the data represents clustering of areas of high voter turnout next to other areas of high voter turnout, along with clustering of areas of low voter turnout surrounded by other areas of low voter turnout.  The LISA map displays the exact locations where this clustering is occurring. The locations with consistently high voter turnout are primarily in the northernmost part of the state, with a few areas towards the center of the state. The map also indicates that the large areas of low voter turnout are located in the southernmost part of the state, as well as a small area toward the mid-western part of the state.
 
Voter turnout 1980.
 
The voter turnout data for 2012 shows similar clustering patterns compared to 1980 in both the Moran's I scatterplot and LISA cluster map. The Moran's I value of 0.3359 for voter turnout in 2012 is lower than the Moran's I value of 0.4681 for 1980. Though the value indicates there is evident clustering of voter turnout in 2012, the clustered areas of high and low voter turnout are not as dense as in 1980. The LISA cluster map for voter turnout displays clustered areas that are comparable to 1980. Even though there is noticeable clustering of high voter turnout in the northern part of the state during the 2012 election, in 1980 the northern part of the state had a much more expansive area of high voter turnout. There is still similar clustering of high voter turnout in various areas of central Texas in 2012, just as there was in 1980. In addition to similar clustering patterns for high voter turnout from 1980 to 2012, there is also a consistent pattern of low voter turnout in the southern part of Texas. In 2012, southern Texas maintained a significant area of low voter turnout, just as it did in 1980.
 
 
The last variable analyzed through a Moran's I scatterplot and LISA cluster map was the percent Hispanic population throughout Texas in 2010. This data was used to identify whether clustering of the Hispanic population is comparable to the identifiable clustering of certain voting patterns.  The Moran's I value of 0.7787 indicates an extremely strong clustering pattern of the Hispanic population.  There are very obvious clustering patterns of highly populated Hispanic areas, as well as areas with very low Hispanic populations, as indicated by the Moran's I scatterplot.  The LISA cluster map portrays the specific areas where high clustering of Hispanic population is present, and the areas where Hispanic population is very low. The map indicates that the entire southern part of Texas, along the Texas and Mexico border, is a widespread area of high Hispanic population. In contrast, the northwestern part of Texas shows a vast area of very low Hispanic population. These clustering patterns can be examined alongside the clustering in voting patterns to indicate whether there is a relationship between the Hispanic population and particular voting patterns.
 
Percent Hispanic in 2010



 
 
 

In addition to comparing the Moran's I scatterplots and the LISA maps to identify a relationship between the Hispanic population and voting patterns, results from a correlation matrix help solidify any relationship that may be observed.  The correlation matrix produced several statistically significant relationships between certain variables. The Pearson correlation statistic was statistically significant when comparing the percent Hispanic population to all voting pattern variables.
Conclusion:
 The large majority of the results indicate there is a definite clustering of certain voting patterns in the state of Texas.  Not only is clustering of voting patterns occurring, it also appears to remain consistent over time.  The southern part of Texas shows a clustering of a high percentage of democratic votes in 1980.  This clustering of democratic votes has remained fairly consistent into 2012.  There is also a clustering pattern of low democratic votes in the northern part of Texas, which has also remained consistent from 1980 to 2012.  The southern part of Texas has maintained a pattern of low voter turnout from 1980 to 2012.  These results suggest consistent clustering patterns in which southern Texas is mostly more democratic and has lower voter turnout, and northern Texas is less democratic but with noticeably stronger voter turnout.
 
In addition to these consistent patterns, there is also a strong correlation between the percent Hispanic population in Texas and both percent democratic vote and voter turnout. The statistics indicate that areas with larger Hispanic populations are also areas with a higher percentage of democratic votes in 2012. Both variables portray strong clustering along the southern border of Texas. The correlation statistics also show a strong relationship between percent Hispanic population and voter turnout in both the 1980 and 2012 elections. The results indicate that areas with high Hispanic populations are similarly areas with low voter turnout. Both these variables, once again, fall along the southern Texas border. The results specify a strong relationship between Hispanic population and particular voting patterns, which is comparable to the cluster maps portraying southern Texas as an area of high Hispanic population and similarly an area of high percent democratic vote, as well as low voter turnout.  Overall, the results support the idea that the Hispanic population in Texas has a significant influence on the clustering of particular voting patterns.
 
 


Tuesday, October 27, 2015

Significance Testing

Part 1:  t and z tests:


2.   The Department of Agriculture and Live Stock Development in Kenya estimates that yields in a certain district should approach the following amounts in metric tons per hectare (averages based on data from the whole country):  groundnuts 0.5; cassava 3.70; and beans 0.30.  A survey of 100 farmers produced these results.


Crop            Sample Mean     Standard Deviation
Groundnuts      0.40            1.07
Cassava         3.40            1.42
Beans           0.33            0.14

The hypothesis test for groundnuts is a two-tailed test with a confidence level of 95%.

For groundnuts I used a z test because the sample size (100) is over 30.

Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean.

Alternative Hypothesis:  There is a significant difference between the mean and the hypothesized mean.

Conclusion:  We fail to reject the null hypothesis.  The calculated z of about -0.93 falls between the critical values of -1.96 and 1.96, so there is no significant difference between the actual and the hypothesized groundnut yields.

The probability is 82.38% that there will be similar yields next year. 

The hypothesis test for cassava is a two-tailed test with a confidence level of 95%.

For cassava I used a z test because the sample size (100) is over 30.




Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean.

Alternative Hypothesis:  There is a significant difference between the mean and the hypothesized mean.

Conclusion:  We reject the null hypothesis.  The calculated z of about -2.11 falls below the lower critical value of -1.96, so the actual cassava yields are significantly lower than the hypothesized yields.

The probability is 98.26%  that there will be similar yields next year.
 
The hypothesis test for beans is a two-tailed test with a confidence level of 95%.

For beans I used a z test because the sample size (100) is over 30.


Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean.

Alternative Hypothesis:  There is a significant difference between the mean and the hypothesized mean. 

Conclusion:  We reject the null hypothesis.  The calculated z of about 2.14 exceeds the upper critical value of 1.96, so the actual bean yields are significantly higher than the hypothesized yields.
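The three z tests above can be checked with a short script. This is only a sketch using Python's standard library, not the method used for the assignment; the names `CROPS`, `N_FARMERS` and `z_statistic` are made up for illustration, and the figures are copied from the table above. It also prints the cumulative normal probability for each |z|, which appears to be where the 82.38% and 98.26% figures come from when z is rounded to two decimals and looked up in a z table.

```python
import math
from statistics import NormalDist

# Figures copied from the yield table above:
# crop -> (hypothesized mean, sample mean, sample standard deviation)
N_FARMERS = 100
CROPS = {
    "Groundnuts": (0.50, 0.40, 1.07),
    "Cassava": (3.70, 3.40, 1.42),
    "Beans": (0.30, 0.33, 0.14),
}


def z_statistic(hypothesized_mean, sample_mean, sample_sd, n):
    """One-sample z: (sample mean - hypothesized mean) / standard error."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Two-tailed critical value at 95% confidence (about 1.96)
Z_CRITICAL = NormalDist().inv_cdf(0.975)

results = {}
for crop, (mu, mean, sd) in CROPS.items():
    z = z_statistic(mu, mean, sd, N_FARMERS)
    reject_null = abs(z) > Z_CRITICAL
    # Cumulative probability below |z| on the standard normal curve
    cumulative = NormalDist().cdf(abs(z))
    results[crop] = (z, reject_null, cumulative)
    print(f"{crop}: z = {z:.2f}, reject null = {reject_null}, "
          f"cumulative probability = {cumulative:.4f}")
```

Running it reproduces the decisions above: groundnuts fail to reject, while cassava and beans reject.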

Wilderness Parks

A survey of all the users of a wilderness park taken in 1960 revealed that the average number of persons per party at the park was 2.8.  In a random sample of 25 parties in 1985, the average was 3.7 with a standard deviation of 1.45.
This was tested with a one-tailed test with a confidence level of 95%.

 For the Wilderness Parks I used a t-test because the sample size (25) is less than 30.

   
Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean. 

Alternative Hypothesis:  There is a significant difference between the mean and the hypothesized mean. 

Conclusion:  We reject the null hypothesis because there is a significant difference between the sample mean and the hypothesized mean.  The number of people per party in 1985 is higher than in 1960.  The two results are not directly comparable, because the 1960 survey covered all users while the 1985 survey was only a sample.  It is possible that this sample simply has more people per party than another sample would have.  I feel that if this were done again with another 25 random parties, the results could be quite different, with more or fewer people per party depending on the sample.
The critical t value is 1.711 (one-tailed, 24 degrees of freedom), and the probability is 83%.
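The t-test for the wilderness park can be sketched the same way. The variable names below are invented for illustration. The value 1.711 is hard-coded rather than computed: it is the standard table value for a one-tailed test at 95% confidence with 25 - 1 = 24 degrees of freedom, and it matches the 1.711 reported above.

```python
import math

# Figures from the wilderness park problem above
POPULATION_MEAN_1960 = 2.8   # average persons per party, all users, 1960
SAMPLE_MEAN_1985 = 3.7       # average persons per party, 1985 sample
SAMPLE_SD = 1.45
N_PARTIES = 25

# Assumption: one-tailed critical t at 95% confidence with 24 degrees of
# freedom, taken from a standard t table rather than computed here
T_CRITICAL = 1.711

standard_error = SAMPLE_SD / math.sqrt(N_PARTIES)
t_statistic = (SAMPLE_MEAN_1985 - POPULATION_MEAN_1960) / standard_error

# One-tailed test: reject only if t is above the critical value
reject_null = t_statistic > T_CRITICAL
print(f"t = {t_statistic:.2f}, critical t = {T_CRITICAL}, reject null = {reject_null}")
```

The calculated t of about 3.10 is well above 1.711, which agrees with the decision to reject the null hypothesis.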
 
Part II. Chi-Squared Testing
 
This is an analysis of what is considered "Up North."  I made a map of Wisconsin using the United States Census Bureau's American FactFinder site.  I then decided, based on the location of Highway 29, which counties I would consider North and which ones South.  I then assigned the values 1 and 2 to the counties in order to do the Chi-square calculations.
 
  
I decided to make Clark County part of the south even though it is farther north than most of the other southern counties.  I chose this based on the location of Highway 29.  I placed Marathon County in the north even though Highway 29 cuts it nearly in half.
 
Next I chose three variables for the Chi-squared testing: beaches, picnic areas and bike trails.  The following maps show the location of each variable and the results of the analysis.
 
Location of Bike Trails. 
 
 
On this map the dark blue counties have 100 bike trails or more.  Three counties in the south have 100 or more: Dane, Milwaukee and Waukesha.  The northern part of the state has only one county, Door County, with over 100.  The next color on the legend is green (50-99), which also covers more counties in the south.  The dark red or rust color marks counties with 25-49 bike trails, and by the map the number of these counties is close to even between north and south.  The last category is light blue, which is 0-24 bike trails per county.  Judging from the map, this category is also about the same in the north as in the south.

Below are the Chi-squared results.  I opened SPSS and imported the dBase file exported from ArcMap, then I chose Analyze, Descriptive Statistics, Crosstabs, made sure to check the Chi-square box, and clicked Continue.  These are the results of this process.

The Chi-squared test shows the following results.

     1.00
   -.507
---------
  0.493  X 100=  49.3%

The Pearson Chi-Squared value is 2.329 with  3 degrees of freedom.
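As a rough check, the .507 in the subtraction above can be recovered from the Pearson Chi-Squared value of 2.329 with 3 degrees of freedom. The sketch below assumes that .507 is the significance (p) value reported by SPSS. It uses a closed-form formula that only works for exactly 3 degrees of freedom, so the function name `chi2_p_value_df3` is a made-up helper and not a general chi-squared routine.

```python
import math


def chi2_p_value_df3(chi_squared):
    """Upper-tail probability of a chi-squared statistic with 3 degrees of freedom.

    For 3 degrees of freedom the tail probability has the closed form
    P(X > x) = erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2).
    """
    x = chi_squared
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Pearson Chi-Squared value reported for bike trails
p_value = chi2_p_value_df3(2.329)

# Same subtraction as in the post: (1.00 - p) x 100
confidence = (1 - p_value) * 100
print(f"p = {p_value:.3f}, confidence = {confidence:.1f}%")
```

Since 49.3% is far below the 95% confidence level, this result would not support rejecting the null hypothesis for bike trails.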


  Public beaches by county.

This map shows the number of public beaches per county.  The purple counties are those with 20 or more public beaches.  Door County is the one in the north, and Kenosha, on the bottom right of the map, and Dane, slightly northwest of it, are the two counties in the south that have the most public beaches.  The next category is green, with 8-13 public beaches, and judging by the colors on the map there are many more counties with 8-13 public beaches in the north.  Next are the blue counties, with 5-7 beaches per county.  This category appears to be evenly distributed across the state.  The last is pink, with 0-4 public beaches, and there appear to be more of these counties in the south.  The northeastern part of the state appears to have the most public beaches.

The results in this chart were obtained with the same process as above.

Chi-squared test shows the following results.
     1.00
   -  .07
------------
   .93    X   100=  93%

 Picnic areas by county. 


This map shows the number of picnic areas per county.  The dark blue areas represent the counties that have 100 or more picnic areas.  According to the map, all of the counties with 100 or more are in the south, the southeast to be more precise.  The next color is bright green, for 50-99 picnic areas per county.  Most of these are in the south as well; there are a few just north of Highway 29 but none in the northernmost part of the map.  The next category is light blue (25-49), which appears to be pretty evenly distributed across the state.  The last color is teal blue, which is 0-24 picnic areas.  This category is mostly in the northern part of the state.

The results in this chart were obtained with the same process as above.


      1.00
     -.021
------------
  0.979  X 100= 97.9%

We reject the null hypothesis because there is a significant difference in the number of picnic areas in the south compared to the north.
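The three subtractions above all follow the same pattern, so they can be put side by side. This is a sketch under the assumption that .507, .07 and .021 are the significance values from the SPSS Chi-square output and that the cutoff is the usual 95% confidence level (alpha of 0.05). The names `P_VALUES` and `confidence_percent` are only for illustration.

```python
# Assumption: 95% confidence level, so alpha is 0.05
ALPHA = 0.05

# Assumption: these are the significance values used in the subtractions above
P_VALUES = {
    "Bike trails": 0.507,
    "Public beaches": 0.07,
    "Picnic areas": 0.021,
}


def confidence_percent(p_value):
    """Same arithmetic as in the post: (1.00 - p) x 100."""
    return (1 - p_value) * 100


decisions = {}
for variable, p in P_VALUES.items():
    reject_null = p < ALPHA
    decisions[variable] = (confidence_percent(p), reject_null)
    print(f"{variable}: {confidence_percent(p):.1f}% confidence, reject null = {reject_null}")
```

Under these assumptions only picnic areas (97.9%) clears the 95% level. Public beaches (93%) comes close but falls short, and bike trails (49.3%) is nowhere near it.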


Conclusion:  Up North is a relative term that depends on the person doing the testing.  It can be based on many different criteria.  Highway 29 was chosen in this case as the line dividing the north and south.  The maps illustrate the number of bike trails, beaches and picnic areas.  The only county in the north that had a large number in each of these categories is Door County.  Door County is located on the lake, so it has a lot of recreational activities, and it is a much more visited area than many of the other counties up north due to its location.  The maps also show that the south has many more picnic areas and bike trails.  This is most likely because the southern part of the state is much more urbanized and populated.  The northern part of the state does not have any large cities or tourist areas besides Door County.