Math 10
Math 10 is De Anza's Elementary Statistics course. It has a pre-requisite of Math 114 (equivalent to a second year of algebra) and a suitable proficiency in reading comprehension (refer to the De Anza catalogue for all the particulars). The course is intended as an introduction to collecting and analyzing data (2 weeks), probability distributions (3 weeks), and inference (6 weeks).
For those interested in a more advanced statistics course, see Engineering Statistics (Math 23). Math 23 is offered only from time to time (check the schedule and/or contact the Physical Sciences, Mathematics and Engineering Division office at 408-864-8774 for current scheduling information). The course has a Math 1C (third quarter Calculus) pre-requisite.
Practice problems (calculator or spreadsheet) for finding the sample mean and the sample standard deviation of data sets given in different formats.
1. {1,3,5,7,7.5,8,9,15}. Answer: mean = 6.94; st. dev. = 4.23
2. {10,20,20,20,24,26,48,100,150,160,200}. Answer: mean = 70.73; st. dev. = 69.23.
3. The following data has been grouped in the form: (data value, frequency).
(0.60,5);(0.85,2);(1.5,4);(4,10);(5.25,6);(8,2);(9,1)
Answer: mean = 3.57; st. dev. = 2.37.
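For those who prefer a script to a calculator or spreadsheet, the three practice problems above can be checked with a short Python sketch. It uses only the standard library; the helper name `grouped_mean_sd` is made up for this illustration and is not part of the course materials. Both routines use the sample (n - 1) form of the standard deviation.

```python
import math
import statistics


def grouped_mean_sd(pairs):
    """Sample mean and sample st. dev. for (data value, frequency) pairs.

    Hypothetical helper written for this illustration.
    """
    n = sum(freq for _, freq in pairs)
    mean = sum(value * freq for value, freq in pairs) / n
    # Sum of squared deviations, each weighted by its frequency.
    ss = sum(freq * (value - mean) ** 2 for value, freq in pairs)
    return mean, math.sqrt(ss / (n - 1))


data1 = [1, 3, 5, 7, 7.5, 8, 9, 15]
data2 = [10, 20, 20, 20, 24, 26, 48, 100, 150, 160, 200]
grouped = [(0.60, 5), (0.85, 2), (1.5, 4), (4, 10), (5.25, 6), (8, 2), (9, 1)]

mean1, sd1 = statistics.mean(data1), statistics.stdev(data1)
mean2, sd2 = statistics.mean(data2), statistics.stdev(data2)
mean3, sd3 = grouped_mean_sd(grouped)

for label, mean, sd in [("1", mean1, sd1), ("2", mean2, sd2), ("3", mean3, sd3)]:
    print(f"Problem {label}: mean = {mean:.2f}; st. dev. = {sd:.2f}")
```

Expanding the grouped data into a plain list (each value repeated by its frequency) and calling `statistics.stdev` on it should give the same result as the helper.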
Math 10. Here are 10 review questions for Chs. 1-3. [Answers appear after question 10].
1. Which of the following data lists has the smallest variance?
(a) -5, -5, -4, -4, -10
(b) -1, 0, 0, 0, 0, 1
(c) 100,100,100,100,100,100,100,100,100
(d) x, x, x, x, x, x, x, x + 1
2. The distribution of people's heights is approximately bell-shaped symmetrical with mean = 69 inches and standard deviation = 3 inches. In a gathering of 500 randomly selected individuals, how tall would you expect the tallest person to be? Explain your reasoning.
3. At a toll booth, you record the color of every car that goes by. What type of data are you recording?
4. The following data gives the number of free throws I missed in order to make five free throws. I repeated this experiment 100 times. The data is given as an ordered pair in the form (X = number of free throws missed, frequency).
(0,30); (1,40); (2,20); (3,5); (4,5)
(a) On average, how many free throws would I attempt in order to make five free throws?
(b) What type of data was recorded?
(c) What was the largest number of free throws I ever attempted?
(d) In how many of the 100 experiments did I make the first five free throws I attempted?
5. The distribution of weights for newborns is symmetrical with mean = 5.7 lbs.
(a) If the third quartile is 7.4 lbs., what is the first quartile?
(b) In a random sample of 1000 newborns, how many are expected to have weights below 7.4 lbs.?
(c) What is the average weight of a newborn?
(d) How many of the newborns in the sample of 1000 from part (b) are expected to have weights falling within the IQR?
6. Tire manufacturer XYZ uses two different processes to make tires. Process A accounts for 80% of the tires made and process B accounts for the rest. Under normal driving conditions, it is known that 90% of the process A tires will last at least 40K miles while 98% of the process B tires will last at least 40K miles.
(a) If one tire is randomly chosen from a large batch of tires, what is the probability the tire will last fewer than 40K miles?
(b) If a randomly chosen tire blew up after 40K miles, what is the probability it was made using process B?
(c) Construct a tree diagram for the situation of the problem.
(d) Construct a table for the situation of the problem.
7. A summary for the distribution of annual incomes for a certain category of workers in the United States is as follows:
mean = $70,000; st. deviation = $15,000; median = $45,000;
first quartile = $25,000; third quartile = $70,000
minimum = $9,000; maximum = $150,000
(a) Construct a box-plot for these data.
(b) Are there any outliers in this distribution? Explain.
(c) What is the shape of this distribution? Explain.
(d) What would be a 'typical' annual income for the middle 50% of these workers? Explain your choice.
8. Suppose P(A) = 0.2 and P(B) = 0.6. If P(A knowing B) = 0.1, what is P(B knowing A)?
9. When running an experiment, why is replication important?
10. Problem #5, page 2-56.
Answers:
1. (c)
2. 75 to 78 inches tall. That is, somewhere between 2 and 3 st. deviations to the right of the mean.
3. Attribute or Qualitative or Categorical.
4. (a) [0 + 40 + 40 + 15 + 20]/100 = 1.15. So, on average, it would take me about 5 + 1.15 = 6.15 free throws to make 5.
(b) Quantitative: discrete
(c) 9 (the most I ever missed was 4, plus the 5 free throws made)
(d) 30
5. (a) 4 lbs.; (b) 750; (c) 5.7 lbs.; (d) 500
6. (a) (0.8)(0.1) + (0.2)(0.02) = 0.084;
(b) (0.2)(0.98)/[(0.2)(0.98) + (0.8)(0.9)] = 0.196/0.916 = 0.214.
7. (b) Since mean > median by a lot, the income data are right skewed. So, to check for outliers, use Q1, Q3, and the IQR. Thus, Q1 - 1.5(IQR) = 25,000 - 1.5(70,000 - 25,000) = -42,500. So, there are no outliers in the left tail. Similarly, Q3 + 1.5(IQR) = 70,000 + 1.5(70,000 - 25,000) = 137,500. Since 150,000 > 137,500, there is at least one outlier in the right tail.
8. Use Bayes' Theorem: P(B knowing A) = [P(B)/P(A)]P(A knowing B). This gives: [0.6/0.2]0.1 = 0.3.
9. Replication reduces the effect of chance variation on the results, since we have more data.
10. In 1941, Ted Williams' batting average was 0.406. The mean and st. deviation for batting averages that year were 0.278 and 0.037, respectively. So, the z value associated with Williams' batting average that year was (0.406 - 0.278)/0.037 = 3.46. Sure, this is an outlier, but not an outrageous one.
Today, a major league baseball player with a batting average of 0.400 would have a z score of (0.400 - 0.260)/0.033 = 4.24. So, 4.24 st. deviations to the right of the mean is an outrageous outlier. More than likely, this won't happen for a long time. Well, notice that it hasn't happened since 1941!
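The arithmetic in answers 6 and 7 can be reproduced with a few lines of Python. This is only a sketch of the computations described above (law of total probability, Bayes' rule, and the 1.5(IQR) outlier fences); the variable names are chosen for this illustration.

```python
# Question 6: tire processes A and B.
p_a, p_b = 0.80, 0.20             # share of tires made by each process
p_last_a, p_last_b = 0.90, 0.98   # P(lasts at least 40K miles | process)

# (a) Law of total probability: P(tire lasts fewer than 40K miles).
p_fail = p_a * (1 - p_last_a) + p_b * (1 - p_last_b)

# (b) Bayes' rule: P(process B | tire lasted at least 40K miles).
p_last = p_a * p_last_a + p_b * p_last_b
p_b_given_last = p_b * p_last_b / p_last

# Question 7(b): outlier fences from the quartiles.
q1, q3 = 25_000, 70_000
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
minimum, maximum = 9_000, 150_000

print(f"6(a): {p_fail:.3f}   6(b): {p_b_given_last:.3f}")
print(f"7(b): fences = ({lower_fence:,.0f}, {upper_fence:,.0f})")
print("Outlier in left tail: ", minimum < lower_fence)
print("Outlier in right tail:", maximum > upper_fence)
```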
Math 10. Review questions for Chs. 4-6 and Sample Exam link.
Sample Exam 2
Answers to these problems are provided after problem #12.
1. If X ~ H(20,30,2) then X is best approximated by:
(a) H(30,20,2)
(b) B(2, 0.60)
(c) H(2,20,30)
(d) B(2,0.40)
2. If X ~ B(40,0.3), what is the standard deviation of X?
(a) 12
(b) sqrt(12)
(c) sqrt(8.4)
(d) sqrt(28)
3. A 'sonar' system detects submarines with 90% accuracy. The system operates independently from one submarine to the next.
(a) Of the next 20 submarines, what is the probability the system will not detect exactly 2?
(b) Of the next 20 submarines, what is the probability the system will detect at least 16?
(c) On average, out of 20 submarines, how many will the system detect?
(d) Out of the next 100 submarines, what is the probability the system will detect 88 or 89?
(e) What is the probability the system will fail exactly three times before detecting the first submarine?
4. Discuss the differences between the binomial and hypergeometric distributions.
5. Discuss the similarities between the binomial and hypergeometric distributions.
6. Let Y represent the time to blood coagulation for certain flesh wounds. It is known that for a 'typical' flesh wound, on average, the time to coagulation is about 3 minutes. Assume Y follows an exponential distribution.
(a) Suppose a random patient is treated for a flesh wound. What is the probability the patient will bleed for over 5 minutes?
(b) Suppose a random sample of 100 patients with flesh wounds is treated. What is the probability that the average coagulation time for this sample of patients will exceed 3.5 minutes?
(c) What is the average of the total bleeding time for the 100 patients in the sample of part (b)?
(d) What is the 95th percentile for the distribution of Y? What does this number represent?
7. What is the difference between the Central Limit Theorem and the Law of Large Numbers?
8. Is the correct answer to the last question in Lab 1 (which, by the way, many of you missed) predicated by the Central Limit Theorem or the Law of Large Numbers? Explain.
9. As some of you may know, the U.S. military arsenal consists of numerous 'smart weapons'. For example, 'smart missiles' hit their target with 99.5% accuracy! In an all out war, it is not uncommon to fire 10000 missiles a day. In modern warfare, many of those missiles are aimed at targets located in civilian areas (like some key military facility in the middle of town, for instance). Is it truly surprising that in such war scenarios, reports of these missiles killing innocent civilians are almost a daily occurrence? Explain your reasoning.
10. Let X ~ N(50,25). Show that IQR/st. deviation = 1.35.
11. In a raffle, there are 2000 tickets sold. Each ticket cost $5. You bought 5 tickets. There are two prizes. The first prize is worth $5000 and the second prize is worth $1000. All 2000 tickets are placed in a hat and two tickets are drawn. The first ticket drawn wins the first prize. The second ticket drawn wins the second prize. What is the 'approximate' expected profit (or loss) for your $25 investment?
12. Suppose X ~ R(2,10).
(a) What is the variance of X?
(b) What is the probability that a randomly chosen X exceeds 12?
(c) What is the probability that a randomly chosen X falls between 3 and 5?
(d) What is the 95th percentile for the distribution of X?
(e) If a random sample of 1000 x's is taken, what is the probability that the sum of the 1000 x's will be less than 6000?
Answers
1. d
2. c
3. (a) 0.2852 (b) 0.9568 (c) 18 (d) 0.2187 (e) 0.0009
4. Binomial: the Bernoulli trials in the sequence are independent.
Hypergeometric: the Bernoulli trials in the sequence are not independent.
5. Hypergeometric and Binomial are similar when, for the Hypergeometric, n/(n1 + n2) <= 0.05. That is, when the sizes of the two groups are large in relation to how many items are being sampled.
6. (a) e^(-5/3) = 0.1889 (b) 0.0478 (c) 300 min. (d) 8.9872 min. The interpretation for 8.9872 minutes is this: 95% of all such wounds would not have bled for over 8.9872 minutes.
7. The main difference is that the CLT deals with statements about distributions of certain random variables. The LLN addresses the issue of how large samples yield statistics that converge to (i.e., get close to) the value of the corresponding population parameter.
8. Skip (we didn't do Lab 1)
9. No, it's not surprising. Expect (0.005)(10000) = 50 missiles to miss their intended target.
10. Skip
11. I'll do this problem in class -- please, remind me.
12. (a) (10 - 2)^2/12 = 64/12 (b) 0 (c) 1/4 (d) (0.95)(8) + 2 = 9.6 (e) Note that the mean of X = 6. Therefore, the mean of the sum of 1000 x's is (6)(1000) = 6000. Thus, P(sum < 6000) = 0.5 since the distribution of "sums" is approximately Normal (by the CLT) and therefore symmetric about its mean.
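The numerical answers to problems 3 and 6, and the ratio asked for in problem 10, can be checked with the Python standard library. The following is a sketch, not a required method: `binom_pmf` is a helper defined here for illustration, part 3(d) uses the exact binomial rather than a Normal approximation, and part 6(b) relies on the CLT with the exponential's standard deviation equal to its mean.

```python
import math
from statistics import NormalDist


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p). Helper defined for this illustration."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Problem 3: the system detects each submarine with probability 0.9.
p = 0.9
ans_3a = binom_pmf(2, 20, 1 - p)                          # exactly 2 misses
ans_3b = sum(binom_pmf(k, 20, p) for k in range(16, 21))  # at least 16 detected
ans_3c = 20 * p                                           # expected detections
ans_3d = binom_pmf(88, 100, p) + binom_pmf(89, 100, p)    # 88 or 89 of 100
ans_3e = (1 - p) ** 3 * p                                 # 3 failures, then a detection

# Problem 6: exponential coagulation time with mean 3 minutes.
mean_time = 3
ans_6a = math.exp(-5 / mean_time)                          # P(Y > 5)
xbar = NormalDist(mu=mean_time, sigma=mean_time / math.sqrt(100))
ans_6b = 1 - xbar.cdf(3.5)                                 # P(sample mean > 3.5)
ans_6c = 100 * mean_time                                   # mean total time
ans_6d = -mean_time * math.log(1 - 0.95)                   # 95th percentile

# Problem 10: IQR / st. deviation for a Normal distribution.
# The ratio is the same whether 25 in N(50,25) is read as the
# variance or as the st. deviation; sigma = 5 is assumed here.
x = NormalDist(mu=50, sigma=5)
ratio = (x.inv_cdf(0.75) - x.inv_cdf(0.25)) / x.stdev

print(f"3: {ans_3a:.4f} {ans_3b:.4f} {ans_3c:.0f} {ans_3d:.4f} {ans_3e:.4f}")
print(f"6: {ans_6a:.4f} {ans_6b:.4f} {ans_6c} {ans_6d:.4f}")
print(f"10: IQR/st. deviation = {ratio:.2f}")
```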
Math 10. Review questions for Chs. 7-9. (Answers are provided after problem #9.)
1. A random sample of 100 fans attending a San Francisco Giants baseball game shows that 65 of them spend at least $20 in concessions (i.e., food and beverage.)
(a) Find the estimated percentage of San Francisco Giants baseball fans who spend at least $20 in concessions.
(b) Compute the Error Bound, using a 95% confidence level, associated with your estimate from part (a).
(c) Give the 95% confidence interval about the true population proportion of those fans attending San Francisco Giants baseball games who spend at least $20 in concessions.
2. Comment on the appropriateness/accuracy of each of the following four interpretations for the confidence interval obtained in part (c) of question 1 above.
(a) There is a 95% probability that the true percentage of San Francisco Giants baseball fans who attend games and spend at least $20 in concessions is included in the interval.
(b) If the price of concession items is reduced, expect an increase, with 95% confidence, in the numbers of fans spending at least $20 in concessions, of about the magnitude of the Error Bound computed in part (b) problem 1.
(c) If another random sample of 100 San Francisco Giants baseball fans is taken, we are 95% confident that the percentage of those spending at least $20 in concessions will fall in the computed confidence interval.
(d) There is a 95% probability that the percentage of San Francisco Giants baseball fans who spend at least $20 in concessions is somewhere around 65%.
3. We wish to test the following hypotheses:
Null: mu = 50 vs. Alternative: mu < 50
To this end, an SRS of size n = 25 is taken from the population, yielding a sample mean of 48.5. It is known that the population standard deviation = 4.
(a) Give the random variable of interest and its corresponding distribution.
(b) Sketch a graph and shade the region corresponding to the p-value for this test.
(c) Compute the p-value.
(d) Based on the computed p-value from part (c) above, would you favor the Null or the Alternative? Explain your decision and state what it means.
4. For the test of hypotheses of problem 3, set up the computation for beta if the test is conducted at the alpha = 0.05 significance level. Suppose the true population mean = 49.
5. We wish to test the following hypotheses:
Null: mu = 30 vs. Alternative: mu < 30
Suppose a sample of size n = 16 is taken and the sample mean and standard deviation are 28 and 4 respectively.
(a) What distribution would you use to conduct the test?
(b) Find the critical value associated with alpha = 0.05.
(c) Based on the critical value from (b) above, what is your decision and conclusion?
6. Page 9-16, #7.
7. Page 8-35, #11.
8. Page 8-35, #14.
9. Page 9-24, #6 and #10.
Answers
1. (a) 65%; (b) 0.0935; (c) (0.557,0.744)
2. (a) Not good. The 95% C.L. is not a probability.
(b) The C.L. does not address the issue of what will happen if concession prices are reduced.
(c) Any C.I. addresses only population parameters and not sample statistics.
(d) 'Probability' is misused in this context.
3. (a) Xbar ~ N(50, 16/25)
(b) Sorry, can't do a sketch within this platform.
(c) p-value = P(Xbar < 48.5 assuming mu = 50) = P(Z < -1.88) = 0.03
(d) Since 0.03 < 0.10, reject the Null. Conclude that the population mean appears to be less than 50.
4. Beta = P(Xbar > 48.68 assuming mu = 49) = 0.6554. Note that the left critical value for Xbar = 48.68.
5. This problem pertains to Chapter 9.
6. This problem pertains to Chapter 9.
7. If you do this test under both hypotheses, you should find that the two-tailed test produces the highest Beta.
8. The key assumption is that the students enrolled in Professor Jenkins' class are a random sample of all the students enrolled in Elementary Statistics for that academic term.
9. This problem pertains to Chapter 9.
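The computations behind answers 1, 3, and 4 can be sketched in Python using the standard library's `NormalDist`. Variable names are chosen for this illustration. Because the script carries full precision instead of rounding the critical value to 48.68, its value for beta may differ from the one above in the third decimal place.

```python
import math
from statistics import NormalDist

z = NormalDist()  # standard Normal distribution

# Problem 1: 95% confidence interval for a proportion.
n, successes = 100, 65
p_hat = successes / n
z_star = z.inv_cdf(0.975)  # about 1.96
error_bound = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - error_bound, p_hat + error_bound)

# Problem 3: left-tailed z test of mu = 50 with known sigma.
mu0, sigma, n3, sample_mean = 50, 4, 25, 48.5
se = sigma / math.sqrt(n3)  # st. deviation of Xbar
p_value = z.cdf((sample_mean - mu0) / se)

# Problem 4: beta when alpha = 0.05 and the true mean is 49.
alpha, mu_true = 0.05, 49
critical_xbar = mu0 + z.inv_cdf(alpha) * se  # reject the Null if Xbar is below this
beta = 1 - NormalDist(mu_true, se).cdf(critical_xbar)

print(f"1: p-hat = {p_hat:.2f}, EB = {error_bound:.4f}, "
      f"CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"3: p-value = {p_value:.4f}")
print(f"4: critical value = {critical_xbar:.2f}, beta = {beta:.4f}")
```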
Math 10. Review questions for Final Exam. Answers/comments for questions 3-10 are at the bottom.
1. Carefully go over the midterms and make sure you understand all the concepts in them.
2. Work the Test Yourself for all chapters.
3. Consider the following hypothesis test:
Null hypothesis: mu = 50 vs. Alternative hypothesis: mu < 50
Suppose the level of significance is set at 0. Then:
(a) A type II error will always be made.
(b) A type I error will always be made.
(c) A valid null hypothesis will never be rejected.
(d) A valid alternative hypothesis will never be rejected.
4. A random sample has been taken from a population. A statistician, using this sample, needs to decide whether to construct a 90% confidence interval or a 95% confidence interval for the population mean. How will these intervals differ?
(a) The 90% confidence interval will not be as wide as the 95% confidence interval.
(b) The 90% confidence interval will be wider than the 95% confidence interval.
(c) Which interval is wider depends on how large the sample is.
(d) Which interval is wider depends on whether the sample is biased.
(e) Which interval is wider depends on whether the Student's-t distribution or the Standard Normal distribution is used.
5. A geneticist hypothesizes that half of a given population will have brown eyes while the other half will be split evenly between green and blue-eyed people. In a random sample of 60 people from this population, the individuals are distributed as shown below. What is the value of the Chi-Squared statistic for the goodness of fit test on these data?
brown eyes = 34; green eyes = 15; blue eyes = 11.
(a) Less than 1
(b) At least 1, but less than 10
(c) At least 10, but less than 20
(d) At least 20, but less than 50
(e) At least 50
6. The degrees of freedom for the situation of problem 5 above are:
(a) 59
(b) 57
(c) 2
(d) 3
(e) 58
7. A hypothesis test is conducted to ascertain which of four groups of professionals (accountants, economists, mathematicians, biologists) has, on average, a higher annual income. Which of the following distributions should be used?
(a) Chi-Square
(b) Student's-t
(c) Normal
(d) F
(e) Exponential
8. A consulting statistician reported the results from a learning experiment to a psychologist. The report stated that on one particular phase of the experiment a statistical test result yielded a p-value of 0.30. Based on this p-value, which of the following conclusions should be reached by the psychologist?
(a) The test was statistically significant because a p-value of 0.30 is far larger than an alpha of 0.05.
(b) The test was statistically significant because 1 - 0.30 = 0.70 is greater than an alpha of 0.10 with high statistical power.
(c) The test was statistically significant because 2*0.30 = 0.60 which is higher than 0.50.
(d) The test was not statistically significant because, if the null hypothesis is true, one could expect to obtain a test statistic as extreme as that observed about 30% of the time.
(e) The test was not statistically significant because, if the null hypothesis is true, one could expect to obtain a test statistic at least as extreme as that observed about 70% of the time.
9. An SRS of 50 observations produces a sample mean of 15. A 95% confidence interval for the corresponding population mean is (12,18). Which of the following statements is true?
(a) 95% of the population measurements fall between 12 and 18.
(b) 95% of the sample measurements fall between 12 and 18.
(c) If 100 samples of 50 observations each are taken, 95% of the sample means would fall between 12 and 18.
(d) P(12 < mu < 18) = 0.95
(e) If mu = 19, the sample x-bar of 15 would be unlikely to occur.
10. Let X represent an approximately Normal distribution with mean = 100 and variance = 400. Suppose one million random samples of size 100 are taken from this population. For each sample, the sample variance is computed. Thus, one may think of these computed one million sample variances to be a very close approximation to the sampling distribution of sample variances. Denote this sampling distribution by W. Then,
(a) The distribution of W is Normal.
(b) The distribution of W is Chi-Square.
(c) The sampling distribution of Y = 99*sample variance/400 is Chi-Square.
(d) The sampling distribution of the one million sample variances follows an F distribution.
Answers and comments
3. (c) Level of significance refers to "alpha". "Alpha" is the probability of making a Type I error. A Type I error relates to the Null Hypothesis.
4. (a) Less confidence implies a smaller Error Bound.
5. (b) The computed Chi-Square value is 1.6.
6. (c) Degrees of freedom = number of outcomes - 1 = 3 - 1 = 2
7. (d) The ANOVA procedure uses the F distribution.
8. (e) See definition of p-value (top of page 8-16 in the book).
9. (e) Note: (d) is incorrect because once the confidence interval is computed, the population mean either falls in it or not (with probability 1 or 0).
10. (c)
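The Chi-Square value quoted in answer 5 and the degrees of freedom in answer 6 come from the usual goodness of fit computation: sum (observed - expected)^2/expected over the categories. A minimal Python sketch, with dictionary names chosen for this illustration:

```python
# Problems 5 and 6: goodness of fit for the eye color data.
observed = {"brown": 34, "green": 15, "blue": 11}
# Hypothesized proportions: half brown, the rest split evenly.
proportions = {"brown": 0.50, "green": 0.25, "blue": 0.25}

n = sum(observed.values())
expected = {color: n * share for color, share in proportions.items()}

chi_square = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)
degrees_of_freedom = len(observed) - 1  # number of outcomes - 1

print(f"expected counts: {expected}")
print(f"Chi-Square = {chi_square:.2f}, df = {degrees_of_freedom}")
```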
ANOVA