Statistics

Introduction

Statistics deals with techniques for collecting, analyzing and drawing conclusions from data. A sample is a collection of items from a larger aggregate (the population) about which we wish information. Collection of information by sample surveys saves time and money.

Example of a sample survey. When freight travels from place A to place B over two different railroads, the total freight charge is divided between them. A document called a waybill states the amount due to each railroad for any shipment. Reviewing every waybill during a 6-month period to determine how much money is due to each railroad is time consuming. In one experiment, the population consisted of 22,984 waybills covering a 6-month period, from which a sample of 2072 waybills was chosen. The results were: amount due from the complete population, $64,651; estimated amount from the sample, $64,568; difference, $83. The sample saved over $4000 in clerical costs.

The variability from item to item is characteristic of the populations encountered in statistical studies. For this reason, problems in conducting sample surveys deal with (1) how to select the sample so that it does not give a distorted view of the population, and (2) how to make statements about the population from the results of the sample.

A value computed from the sample serves as an estimate of some characteristic of the population; it is called a point estimate. Because of variation, the sample point estimate is not the true population value. Therefore, we add to it a statement indicating how far the point estimate is likely to be from the true value. One way is to supplement the point estimate with an interval estimate. For example, we can say that from the sample evidence we are confident that the number of farmers in Boone County who sprayed their cornfields for control of the European corn borer was between 345 and 736. By "confident" we mean that the probability is 95 chances in 100 that the interval from 345 to 736 contains the true but unknown number of farmers in Boone County who sprayed. The interval is called a 95% confidence interval.
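The step from a point estimate to an interval estimate can be sketched in Python. The numbers and the `interval_estimate` helper below are illustrative assumptions, and the formula used is the standard normal-approximation interval for a proportion (the Boone County interval was not necessarily computed this way):

```python
import math

def interval_estimate(n_sampled, n_yes, population_size, z=1.96):
    """95% interval estimate for the number of population members with
    some attribute, scaled up from the sample proportion.

    Uses the normal approximation to the sampling distribution of a
    proportion; the finite-population correction is omitted for
    simplicity. All inputs here are hypothetical."""
    p = n_yes / n_sampled                      # sample proportion (point estimate)
    se = math.sqrt(p * (1 - p) / n_sampled)    # standard error of the proportion
    low = population_size * (p - z * se)
    high = population_size * (p + z * se)
    return low, high
```

For instance, if 27 of 100 sampled farmers out of a population of 2000 had sprayed, the point estimate would be 540 farmers, bracketed by an interval computed as above.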

The process of making statements about the population from the results of samples is called statistical inference.

It is easy to select a sample that gives a distorted picture of the population (think of an interviewer in an opinion poll who picks sample families mainly among friends). Human judgment may result in biased sampling even when the population can be inspected carefully before the sample is drawn (students asked to pick a representative sample of rocks from a table tend to choose the larger ones).

Random sampling is an unbiased method of sampling. In random sampling without replacement the chance of selection at any draw is the same for all items in the population not previously drawn. Random sampling without replacement allows any member of the population to appear only once in the sample.

https://math.stackexchange.com/questions/489772/probability-of-sampling-with-and-without-replacement

There are \({n \choose r}\) ways to select a sample of \(r\) items from a population of \(n\) items. To count the samples that include one specific item, we must pick that item, which can be done in \({1 \choose 1} = 1\) way, leaving a pool of \(n-1\) items from which we need to select \(r-1\) to fill out the sample. Therefore, there are

$${1 \choose 1}{n-1 \choose r-1}$$

ways to select such samples. So the probability that a specific item appears in the sample is

$$ \frac{{n-1 \choose r-1}}{{n \choose r}} = \frac{(n-1)!}{(r-1)!(n-r)!}\div\frac{n!}{r!(n-r)!}= \frac{(n-1)!}{(r-1)!(n-r)!}\cdot\frac{r!(n-r)!}{n!} = \mathbf{\frac{r}{n}} $$
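The result \(r/n\) can be checked by simulation. The function below is an illustrative sketch (not from the source) that draws many samples without replacement and counts how often a given item is included:

```python
import random

def inclusion_frequency(n, r, item, trials=20000):
    """Estimate the probability that `item` appears in a simple random
    sample of size r drawn without replacement from range(n).

    random.sample draws without replacement, so each trial is one
    sample of r distinct items."""
    hits = sum(item in random.sample(range(n), r) for _ in range(trials))
    return hits / trials
```

For n = 10 and r = 3, the estimated frequency should hover near r/n = 0.3.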

Random sampling with replacement gives every item in the population an equal chance of being chosen at any draw.

If the sample is a small fraction of the population (say less than 2%), sampling with and without replacement gives practically identical results, since an individual is very unlikely to appear more than once in a sample.
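One way to see this is to compare, for a specific individual, the chance of being included at least once under the two schemes. The helper below is a hypothetical illustration using the exact formulas:

```python
def inclusion_prob(n, r):
    """For one specific individual in a population of n, return the
    probability of appearing in a sample of size r, (a) without
    replacement (exactly r/n, as derived above) and (b) with
    replacement (complement of missing on all r independent draws)."""
    without = r / n
    with_repl = 1 - (1 - 1 / n) ** r
    return without, with_repl
```

For n = 10,000 and r = 100 (a 1% sample), the two probabilities agree to about four decimal places, confirming that the distinction hardly matters for small sampling fractions.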

Random sampling gives every item in the population an equal chance of being selected and measured, which protects us against distortion or misrepresentation of the population. However, random sampling makes no use of any knowledge we may have about the structure of the population.

The sample of railroad waybills described earlier used stratified random sampling. All waybills were first classified according to the total size of the bill. All waybills with totals over $40 were selected, because we need an accurate sample of the large bills. A random sample of 50% of the waybills with totals between $20 and $40 was chosen, and so on. Only 1% of the waybills with totals under $5 was selected.
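This stratification rule can be sketched in Python. The function name and the representation of waybills as a list of dollar amounts are assumptions, and the unspecified intermediate strata ("and so on") are deliberately left out:

```python
import random

def stratified_sample(waybills, rng=random):
    """Sample waybill amounts at a rate depending on the bill size:
    keep all bills over $40, half of the bills between $20 and $40,
    and 1% of the bills under $5. Strata between $5 and $20 ("and so
    on" in the text) are omitted from this sketch."""
    sample = []
    for bill in waybills:
        if bill >= 40:
            sample.append(bill)                 # census of the large bills
        elif bill >= 20:
            if rng.random() < 0.5:              # 50% sampling rate
                sample.append(bill)
        elif bill < 5:
            if rng.random() < 0.01:             # 1% sampling rate
                sample.append(bill)
        # bills from $5 up to $20: intermediate strata not specified here
    return sample
```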

We use random digits to select random samples. Tables of random digits are created by a process designed to give each digit from 0 to 9 an equal chance of appearing at every draw. The following is an example of the table of random digits:

   00-04 05-09 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49

00 11164 36318 75061 37674 26320 75100 10431 20418 19228 91792 
01 21215 91791 76831 58678 87054 31687 93205 43685 19732 08468 
02 10438 44482 66558 37649 08882 90870 12462 41810 01806 02977 
03 36792 26236 33266 66583 60881 97395 20461 36742 02852 50564 
04 73944 04773 12032 51414 82384 38370 00249 80709 72605 67497 

05 49563 12872 14063 93104 78483 72717 68714 18048 25005 04151 
06 64208 48237 41701 73117 33242 42314 83049 21933 92813 04763 
07 51486 72875 38605 29341 80749 80151 33835 52602 79147 08868 
08 99756 26360 64516 17971 48478 09610 04638 17141 09227 10606 
09 71325 55217 13015 72907 00431 45117 33827 92873 02953 85474 
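A common way to use such a table is to read its digits in fixed-width groups and keep each group that labels a not-yet-chosen member of the population. The function below is a hypothetical sketch of this procedure, reading the digit stream left to right:

```python
def sample_from_digits(digit_stream, population_size, sample_size):
    """Select a sample by reading a string of random digits in
    fixed-width groups. A group is kept when it indexes a population
    member (0 .. population_size-1) not already chosen; other groups
    are skipped. Stops once sample_size labels are collected."""
    width = len(str(population_size - 1))   # digits needed per label
    chosen = []
    for i in range(0, len(digit_stream) - width + 1, width):
        label = int(digit_stream[i:i + width])
        if label < population_size and label not in chosen:
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen
```

Reading the first row of the table above ("11164 36318 ...") in two-digit groups to choose 5 members from a population of 100 yields labels 11, 16, 43, 63, 18.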

When we draw samples from the table of random digits, a stem and leaf table helps to organize those numbers. An example of the random allotment of 75 persons to three foods in stem and leaf form:

         C           Liquid H          Solid H
0 | 5           0 | 2,9,8,3,7     0 | 1,4,6
1 | 0,1,6,5,2   1 |               1 | 3,4,7,8,9
2 | 2,5,0,6,7   2 | 1,9,3         2 | 4,8
3 | 7           3 | 6,5,4,0,3     3 | 1,2,8,9
4 | 0           4 | 2,4,7,8,5,9   4 | 1,3,6
5 | 9,4,7,6     5 | 2,1,5         5 | 0,3,8
6 | 0,4,5       6 | 2,9           6 | 1,3,6,7,8
7 | 1,5,4,3,2   7 | 0             7 |
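Building a stem-and-leaf table from two-digit numbers is straightforward to mechanize; this small helper is an illustration (not the book's procedure) that groups each number by its tens digit:

```python
from collections import defaultdict

def stem_and_leaf(numbers):
    """Group two-digit numbers by their tens digit (the stem); the
    units digits (the leaves) are listed in order of appearance, as in
    the allotment table above."""
    table = defaultdict(list)
    for x in numbers:
        table[x // 10].append(x % 10)   # stem = tens digit, leaf = units digit
    return dict(table)
```

For example, the numbers 5, 10, 11, 16 would appear on stems 0 and 1 as `0 | 5` and `1 | 0,1,6`.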

When drawing conclusions from comparative studies, we want to answer the following question: are we convinced that there is a real difference between the effects of the different treatments? To answer it, we consider the null hypothesis that there is no difference between the effects of the two treatments in the population. Then we ask: do our sample data agree or disagree with this hypothesis? To decide, we calculate the probability that an average difference between the treatments as great as that observed in our sample could arise solely from person-to-person variability if the null hypothesis were true. If this probability is small, say 1 in 20 or 1 in 100, we reject the hypothesis and conclude that there was a real difference between the effects of the two treatments. This technique is called a test of statistical significance.
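One concrete version of such a test is the randomization (permutation) test, which estimates this probability by repeatedly re-shuffling the pooled measurements between the two groups. The sketch below is illustrative; the text does not specify which significance test was used:

```python
import random

def permutation_test(group_a, group_b, trials=10000, rng=random):
    """Approximate the probability, under the null hypothesis of no
    treatment effect, of an absolute mean difference at least as large
    as the one observed, by randomly reassigning the pooled
    measurements to the two groups."""
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)                       # one random reassignment
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / trials                       # estimated significance probability
```

A small returned probability (say below 0.05) leads us to reject the null hypothesis of no treatment difference.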

One type of statistical study is the observational study, in which conclusions are less confident because we can seldom be sure that we have measured and adjusted correctly for all important variables on which our groups differed systematically. The investigator in an observational study lacks the power to create the groups to be compared and is restricted to choosing which observations or data are collected and analyzed.