Tuesday, October 27, 2015

Significance Testing

Part I.  t and z Tests


2.   The Department of Agriculture and Livestock Development in Kenya estimates that yields in a certain district should approach the following amounts in metric tons per hectare (averages based on data from the whole country):  groundnuts 0.50; cassava 3.70; and beans 0.30.  A survey of 100 farmers gave the following results.


Crop          Sample Mean    Standard Deviation
Groundnuts    0.40           1.07
Cassava       3.40           1.42
Beans         0.33           0.14

The hypothesis test for groundnuts is a two-tailed test with a confidence level of 95%.

For groundnuts I used a z test because the sample size is over 30.

Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean.

Alternative Hypothesis:  There is a significant difference between the mean and the hypothesized mean.

Conclusion:  We fail to reject the null hypothesis because there is no significant difference between the actual yield and the hypothesized mean.  The z-score (about -0.93) is inside the critical values of -1.96 and 1.96.

A z-score of 0.93 in absolute value corresponds to a probability of 82.38% in the standard normal table.

The hypothesis test for cassava is a two-tailed test with a confidence level of 95%.

For cassava I used a z test because the sample size is over 30.




Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean.

Alternative Hypothesis:  There is a significant difference between the mean and the hypothesized mean.

Conclusion:  We reject the null hypothesis because there is a significant difference between the actual and the hypothesized yields.  The sample mean is lower than the hypothesized mean, and the z-score (about -2.11) falls below the critical value of -1.96.

A z-score of 2.11 in absolute value corresponds to a probability of 98.26% in the standard normal table.
 
The hypothesis test for beans is a two-tailed test with a confidence level of 95%.

For beans I used a z test because the sample size is over 30.


Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean.

Alternative Hypothesis:  There is a significant difference between the mean and the hypothesized mean. 

Conclusion:  We reject the null hypothesis because there is a significant difference between the actual and hypothesized mean yield of beans.  The sample mean is higher than the hypothesized mean, and the z-score (about 2.14) is above the critical value of 1.96.
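The three z tests above can be checked with a few lines of Python.  This is only a sketch using the sample means, standard deviations and sample size from the table; the function name one_sample_z is my own, and it assumes the standard one-sample formula z = (sample mean - hypothesized mean) / (standard deviation / sqrt(n)).

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, hypothesized_mean, sd, n):
    """Return the z statistic for a one-sample z test."""
    standard_error = sd / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Two-tailed critical value at the 95% confidence level (about 1.96).
critical = NormalDist().inv_cdf(0.975)

n = 100  # farmers surveyed
# crop: (sample mean, hypothesized mean, standard deviation)
crops = {
    "Groundnuts": (0.40, 0.50, 1.07),
    "Cassava": (3.40, 3.70, 1.42),
    "Beans": (0.33, 0.30, 0.14),
}

results = {}
for crop, (mean, hypothesized, sd) in crops.items():
    z = one_sample_z(mean, hypothesized, sd, n)
    reject = abs(z) > critical  # two-tailed decision rule
    results[crop] = (z, reject)
    print(f"{crop}: z = {z:.2f}, reject null: {reject}")
```

The decision for each crop depends only on whether the absolute value of z is larger than the critical value.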

Wilderness Parks

A survey of all the users of a wilderness park taken in 1960 revealed that the average number of persons per party at the park was 2.8.  In a random sample of 25 parties in 1985, the average was 3.7 with a standard deviation of 1.45.
This was tested with a one-tailed test with a confidence level of 95%.

For the Wilderness Parks I used a t-test because the sample size is less than 30.

   
Null Hypothesis:  There is no significant difference between the mean and the hypothesized mean. 

Alternative Hypothesis:  The sample mean is significantly greater than the hypothesized mean.

Conclusion:  We reject the null hypothesis because there is a significant difference between the sample mean and the hypothesized mean.  The number of people per party in 1985 is higher than in 1960.  The two surveys are not fully comparable, because the 1960 survey covered all users while the 1985 survey was a sample.  It is possible that this sample simply has more people per party than another sample would have.  I feel that if this were done again with another 25 random parties the results could be quite different, with more or fewer people per party depending on the sample.
The critical value of t is 1.711 (one-tailed test, 24 degrees of freedom).  The probability was recorded as 83%.
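The t statistic for the park data can be worked out the same way.  This is a rough sketch, not the original calculation:  the function name one_sample_t is my own, and the 1.711 is taken as the one-tailed critical value from a t table for 24 degrees of freedom at the 95% confidence level.

```python
from math import sqrt


def one_sample_t(sample_mean, hypothesized_mean, sd, n):
    """Return the t statistic for a one-sample t test."""
    standard_error = sd / sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


n = 25  # parties sampled in 1985
t = one_sample_t(3.7, 2.8, 1.45, n)
degrees_of_freedom = n - 1

# Assumed table value: one-tailed, alpha = 0.05, 24 degrees of freedom.
T_CRITICAL = 1.711

# One-tailed (upper) decision rule: reject when t is above the critical value.
reject_null = t > T_CRITICAL
print(f"t = {t:.2f}, df = {degrees_of_freedom}, reject null: {reject_null}")
```

With these numbers t comes out a little above 3, well past 1.711, which agrees with the decision to reject the null hypothesis.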
 
Part II. Chi-Squared Testing
 
This is an analysis of what is considered "Up North."  I made a map of Wisconsin using data from the United States Census Bureau's American FactFinder site.  I then decided, based on the location of Highway 29, which counties I would consider North and which ones South.  I then assigned values of 1 and 2 to the counties in order to do the Chi-squared calculations.
 
  
I decided to make Clark County part of the South even though it is farther north than most of the other southern counties.  I chose this based on the location of Highway 29.  I placed Marathon County in the North even though Highway 29 cuts it nearly in half.
 
Next I chose three variables for the Chi-squared testing:  beaches, picnic areas and bike trails.  The following maps show the location of each variable and the results of the analysis.
 
Location of Bike Trails. 
 
 
This map shows the number of bike trails per county.  The dark blue counties have 100 or more bike trails.  Three counties in the south have 100 or more:  Dane, Milwaukee and Waukesha.  The northern part of the state has only one county with over 100, Door County.  The next color on the legend is green (50-99), which also has more counties in the south.  The dark red or rust color shows the counties that have 25-49, and by the map the north and south are close to even in the number of counties.  The last category is light blue, which is 0-24 bike trails per county.  Judging from the map, this category is also about the same in the north as in the south.

Below are the Chi-squared results.  I opened SPSS and imported the dBASE file exported from ArcMap, then chose Analyze, Descriptive Statistics, Crosstabs, made sure to check the Chi-square box, and clicked Continue.  These are the results of this process.

The Chi-squared test shows the following results.

1.00 - 0.507 = 0.493
0.493 x 100 = 49.3%

The Pearson Chi-Squared value is 2.329 with  3 degrees of freedom.
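As a check on the numbers above, the upper-tail probability for a Chi-squared value of 2.329 with 3 degrees of freedom can be computed directly.  This is a sketch that assumes the 0.507 subtracted above is the significance (p) value from the SPSS output.  The helper chi2_sf_df3 is my own name, and it uses the closed-form expression that only applies to 3 degrees of freedom.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf_df3(x):
    """Upper-tail probability of a Chi-squared value x with 3 degrees of freedom.

    Closed form valid for 3 degrees of freedom only:
    erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2)
    """
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


p_value = chi2_sf_df3(2.329)
confidence = (1 - p_value) * 100
print(f"p = {p_value:.3f}, (1 - p) x 100 = {confidence:.1f}%")
```

A p value this far above 0.05 would mean the bike trail counts do not differ significantly between north and south at the 95% confidence level.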


  Public beaches by county.

This map shows the number of public beaches per county.  The purple counties have 20 or more public beaches.  Door County is the one in the north; Kenosha, on the bottom right of the map, and Dane, slightly northwest of it, are the two counties in the south that have the most public beaches.  The next category is green, with 8-13 public beaches, and judging by the colors on the map there are many more counties with 8-13 public beaches in the north.  Next are the blue counties, with 5-7 beaches per county, which appear to be evenly distributed across the state.  The last is pink, 0-4 public beaches, and there appear to be more of these counties in the south.  The northeastern part of the state appears to have the most public beaches.

The results of this chart were achieved with the same process as above. 

Chi-squared test shows the following results.
1.00 - 0.07 = 0.93
0.93 x 100 = 93%

 Picnic areas by county. 


This map shows the number of picnic areas per county.  The dark blue areas represent the counties that have 100 or more picnic areas.  According to the map, all of the counties with 100 or more are in the south, the southeast to be more precise.  The next color is bright green (50-99 picnic areas per county); most of these are in the south as well, with a few just north of Highway 29 but none in the northernmost part of the map.  The next category, light blue (25-49), appears to be fairly evenly distributed across the state.  The last color is teal blue, which is 0-24 picnic areas.  This category is mostly in the northern part of the state.

The results of this chart were achieved with the same process as above.


1.00 - 0.021 = 0.979
0.979 x 100 = 97.9%

We reject the null hypothesis because there is a significant difference in the number of picnic areas in the south compared to the north.
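The decision rule behind these results can be written out as a short sketch.  It assumes the values subtracted from 1.00 in each calculation (0.507, 0.07 and 0.021) are the significance (p) values from SPSS, and that the cutoff is 0.05, matching the 95% confidence level used in Part I.

```python
ALPHA = 0.05  # assumed cutoff, matching a 95% confidence level

# Assumed to be the significance values reported for each variable.
p_values = {
    "Bike trails": 0.507,
    "Public beaches": 0.07,
    "Picnic areas": 0.021,
}

decisions = {}
for variable, p in p_values.items():
    # Reject the null hypothesis only when p is below the cutoff.
    decisions[variable] = "reject" if p < ALPHA else "fail to reject"
    print(f"{variable}: p = {p}, {decisions[variable]} the null hypothesis")
```

Under these assumptions only picnic areas fall below the cutoff.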


Conclusion:  "Up North" is a relative term that depends on the person doing the testing.  It can be based on many different criteria.  Highway 29 was chosen in this case as the line dividing the north and south.  The maps illustrate the number of bike trails, beaches and picnic areas.  The only county in the north that has a large number in each of these categories is Door County.  Door County is located on the lake, so it has a lot of recreational activities and is much more visited than many of the other counties up north.  The maps also show that the south has many more picnic areas and bike trails; this is most likely because the southern part of the state is much more urbanized and populated.  The northern part of the state does not have any large cities or tourist areas besides Door County.


 

Wednesday, October 7, 2015

Z scores, Mean Center, and Standard Distance


Introduction:

This assignment focused on locating the Mean Center and Weighted Mean Center for arrest data from 2003 and 2009.  We also found the Standard Distance and Standard Deviation for the same years.  The question we are looking at focuses on the areas of Eau Claire where there is a high number of arrests for disorderly conduct.  I am looking at the 2003 and 2009 arrest data for disorderly conduct and locating the Mean Center, the Weighted Mean Center, Standard Deviation and Standard Distance.  I will use these to try to narrow down the part of Eau Claire where most of these arrests occur, and to see whether the two years differ or are similar.  I will also use the locations of the Eau Claire bars to see if there is any correlation between arrests and distance to a bar.

Methodology:

First I am using the Mean Center, which is a spatial measure of central tendency on a Cartesian plane (x, y), such as (longitude, latitude), constructed from the average of the x and y values.  The center point is the average x and y value of the input features; if the input features are grouped by a case field, a mean center is calculated from the average x and y values of the centroids in each group.  In this case I used the Mean Center tool in ArcToolbox to calculate it.  I also used the Weighted Mean Center, which is a little different:  it is calculated like the Mean Center but with a weight value associated with each point.  In this case the weight was the count field from the join with the map.  This adds a little more information about the arrests and their location relative to a bar.
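The two calculations can be sketched in plain Python.  This is not the ArcMap tool itself, only the averaging it is based on; the coordinates and arrest counts below are made up for illustration, and the function names are my own.

```python
def mean_center(points):
    """Average x and average y of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return (mean_x, mean_y)


def weighted_mean_center(points, weights):
    """Average x and y, with each point counted in proportion to its weight."""
    total = sum(weights)
    mean_x = sum(w * x for (x, _), w in zip(points, weights)) / total
    mean_y = sum(w * y for (_, y), w in zip(points, weights)) / total
    return (mean_x, mean_y)


# Hypothetical arrest locations and arrest counts (not the Eau Claire data).
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
arrests = [1, 1, 1, 5]

center = mean_center(points)
weighted_center = weighted_mean_center(points, arrests)
print(center)           # (2.0, 2.0)
print(weighted_center)  # (1.0, 3.0)
```

In this made-up example the weighted center is pulled toward the location with 5 arrests, which is the same effect that moves the Weighted Mean Center away from the Mean Center on the maps.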

Map 1:  This map represents the arrests for 2003, with the Mean Center in red and the Weighted Mean Center in green.  Notice that the Weighted Mean Center is slightly southwest of the Mean Center.  The big blue dots are locations where 13-30 arrests occurred, which is the highest category; many of the larger blue dots are fairly close to the Mean Center and the Weighted Mean Center.  There is one large blue dot indicating a large number of arrests, but there is no information in this map to indicate the reason for it.

Map 2:  This map shows the same information as the previous map, but for the year 2009.  Notice that the Mean Center and the Weighted Mean Center are in nearly the same place, and the large green dot northeast of the Mean Center is also a place of concern for this year.

 
Map 3:  From 2003 to 2009 there is not much difference in the Mean Center and the Weighted Mean Center.  They are so similar that it was difficult to find colors for this map that you could see well due to the overlap.  The area with the most arrests is basically the same in both years.  In 2009 there is a new development that is not in the 2003 data:  the large pink dot to the left of the Mean Center is much larger in the 2009 data.  Perhaps a bar was added within that 6-year span.

 

Next we are looking at the Standard Distance.  The Standard Distance is the spatial equivalent of the standard deviation.  It measures the degree to which features are concentrated or dispersed around the mean center.  It provides a single summary measure of the feature distribution around a given point and is expressed as the radius of a circle.  The Weighted Standard Distance is used in connection with the Weighted Mean Center.  This is also done in ArcMap by opening ArcToolbox, choosing Spatial Statistics Tools, Measuring Geographic Distributions, and then Standard Distance; in the input box you can choose to weight it by the arrests in 2003.  Each circle uses 1 standard deviation.
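The Standard Distance can be sketched in Python as well.  This is only an illustration of the formula, assuming the form that divides by the number of points (or by the sum of the weights in the weighted case); the points and arrest counts are made up and the function name is my own.

```python
from math import sqrt


def standard_distance(points, weights=None):
    """Standard distance of (x, y) points around their (weighted) mean center."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted: every point counts once
    total = sum(weights)
    # (Weighted) mean center of the points.
    center_x = sum(w * x for (x, _), w in zip(points, weights)) / total
    center_y = sum(w * y for (_, y), w in zip(points, weights)) / total
    # Average squared distance from the center, then the square root.
    mean_squared_distance = sum(
        w * ((x - center_x) ** 2 + (y - center_y) ** 2)
        for (x, y), w in zip(points, weights)
    ) / total
    return sqrt(mean_squared_distance)


# Hypothetical arrest locations and arrest counts (not the Eau Claire data).
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
arrests = [1, 1, 1, 5]

sd_plain = standard_distance(points)
sd_weighted = standard_distance(points, arrests)
print(f"standard distance: {sd_plain:.3f}")
print(f"weighted standard distance: {sd_weighted:.3f}")
```

In this made-up example the weighted circle is smaller because most of the weight sits at one location, similar to how the Weighted Standard Distance is slightly smaller on the 2003 map.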

Map 4:  This map for 2003 shows that the Weighted Standard Distance, shown in pink, is slightly smaller and slightly farther south.  The Standard Distance, in blue, is slightly larger and somewhat farther north.  Maybe this has to do with the large blue square that has appeared as a high-arrest location on all the maps so far.  There were just as many arrests in the northernmost large square as in the other large squares inside the Standard Distance.
 
Map 5:  This map of 2009 shows almost no change from 2003.  The areas are very similar and the areas of concern are still very close if not the same.  There is one small difference between the 2003 and 2009 maps:  on the rightmost part, between the two circles, there is a group of larger pink circles indicating more arrests there, and these places are not as prevalent in 2003.

Map 6:  This map shows the Weighted Standard Distance for 2003 and 2009.  As discussed in the previous maps, there is very little difference between the years.  The arrests and their distance from the bars are still concentrated in the same areas.
 

Results:

I am using Map 7, which shows the relationship between arrests for disorderly conduct and the locations of Eau Claire bars.  I found a strong association between the location of bars and the number of arrests in the area near the bars' Mean Center.  As indicated below, the areas in red show a high number of arrests in the Eau Claire block groups located on or near Water St.  This area is known to be frequented by many college students and has many bars located a short distance from the college and from each other.  The previous maps also show that this is an area of concern:  Maps 1-6 show the concentration of bars and the concentration of arrests in the same areas.  A large number of the bars are located near the bars' Mean Center, labeled with a black asterisk.  There is still the large number of arrests in the northern part of Maps 1-6 that has not been addressed.  My best explanation is that there are only two bars in that particular area, and either one or both may be places where a lot of drinking occurs.  Given the number of arrests in that area in both years, there must be some factor that this analysis does not capture.
Map 7:  This map shows the locations of the Eau Claire bars and the standard deviation of the number of arrests per block group.  It has much of the same information as the other maps.  It shows the locations of the bars in relation to the Mean Center of the bars, which coincides with the areas of many arrests.  The red blocks have many bars in close proximity.  The red area is more than 2.5 standard deviations above the mean, which means that there is a large number of arrests near these particular bars.  The dark orange area is also in close proximity, but the lighter orange block in the lower right side of the map has only a few bars on its border and none inside its perimeter.  Further information would be needed to know the reason for this.
Figure 1:  This figure shows the PolyID of particular block groups and the number of arrests there.  I calculated the z-scores for them to indicate their distance from the mean.  Block group 41 is not very far from the mean; 46 is much higher than the mean, so one would assume it is one of the dark red areas on Map 7; 57 is located in the light green, which is less than 0.50 standard deviations.
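A z-score is the number of standard deviations a value sits from the mean, z = (value - mean) / standard deviation.  The sketch below shows the calculation on made-up arrest counts, since the actual block group counts are only in the figure.  It assumes the population standard deviation; the function name z_scores is my own.

```python
from statistics import mean, pstdev


def z_scores(counts):
    """Return the z-score of each count relative to the whole list."""
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation (an assumption)
    return [(count - mu) / sigma for count in counts]


# Hypothetical arrest counts per block group (not the Eau Claire data).
counts = [2, 4, 4, 4, 5, 5, 7, 9]

scores = z_scores(counts)
for count, z in zip(counts, scores):
    print(f"{count} arrests: z = {z:+.2f}")
```

A block group with a z-score near 0 is close to the mean, while one with a large positive z-score has many more arrests than is typical.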
If this pattern holds next year in Eau Claire, based on this data, what number of disorderly conduct arrests will be exceeded 70% of the time?

Figure 2:  This calculation shows that the number of arrests will exceed 1.29 70% of the time.
Figure 3:  This calculation shows that the number of arrests will exceed 11.92 20% of the time.
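These two calculations work backwards from a probability to a value:  find the z-score that leaves the given share of the normal curve above it, then convert it back with value = mean + z x standard deviation.  The sketch below assumes the arrest counts are roughly normally distributed.  The mean and standard deviation are made up, because the ones used in the figures are not given in the text, and the function name is my own.

```python
from statistics import NormalDist


def value_exceeded(mean, sd, share):
    """Value that a normally distributed count exceeds `share` of the time."""
    # Exceeded `share` of the time means (1 - share) of the curve lies below it.
    return NormalDist(mean, sd).inv_cdf(1 - share)


# Hypothetical mean and standard deviation (not the values behind the figures).
example_mean = 5.0
example_sd = 2.0

exceeded_70 = value_exceeded(example_mean, example_sd, 0.70)
exceeded_20 = value_exceeded(example_mean, example_sd, 0.20)
print(f"exceeded 70% of the time: {exceeded_70:.2f}")
print(f"exceeded 20% of the time: {exceeded_20:.2f}")
```

The value exceeded 70% of the time falls below the mean, and the value exceeded 20% of the time falls above it, which matches the pattern of 1.29 and 11.92 in the figures.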

Conclusion:

The overall results show that in both 2003 and 2009 there are many more arrests near bars, and even more so where there are multiple bars in a small area.  There is a specific area on and near Water Street that is a cause for concern.  I think the implications of this are huge.  If these results ended up in the hands of someone who really cared about the situation, they could set stricter guidelines for the bars and how much alcohol they can serve.  If the bar owners came together on better rules for when to cut someone off, this could prevent over-drinking, which could prevent disorderly conduct arrests.  The bar owners would lose money doing this, though, which takes you right back to where you started, with young college students drinking past their limits.  The results do not say all of this, but the Mean Center for bars and the Mean Center for arrests are basically in the same spot.  My recommendations are more police patrols and more security in the bars.  This is not just about disorderly conduct; it is also about the possibility of drinking and driving, which I am sure is a whole analysis of its own.  These results should show that there is a problem, but problems are not usually solved when the solution causes a loss of revenue.