Title: | Datasets for Stat2 |
---|---|
Description: | Datasets for the textbook Stat2: Modeling with Regression and ANOVA (second edition). The package also includes data for the first edition, Stat2: Building Models for a World of Data, and a few functions for plotting diagnostics. |
Authors: | Ann Cannon, George Cobb, Bradley Hartlaub, Julie Legler, Robin Lock, Thomas Moore, Allan Rossman, Jeffrey Witmer |
Maintainer: | Robin Lock <[email protected]> |
License: | GPL-3 |
Version: | 2.0.0 |
Built: | 2025-03-09 02:24:17 UTC |
Source: | https://github.com/statmanrobin/stat2data |
Datasets for Stat2: Modeling with Regression and ANOVA (second edition) and Stat2: Building Models for a World of Data (first edition)
Package: | Stat2Data |
Type: | Package |
Version: | 2.0.0 |
Date: | 2018-12-29 |
License: | GPL-2 |
LazyLoad: | yes |
This package includes datasets for both the first and second editions of the text.
Ann Cannon, George Cobb, Bradley Hartlaub, Julie Legler, Robin Lock, Thomas Moore, Allan Rossman, Jeffrey Witmer
Maintainer: Robin Lock <[email protected]>
Age, price, and mileage of used Honda Accords in 2017
A data frame with 30 observations on the following 3 variables.
Age
Age of used Honda Accord car
Price
Price (in $1,000's)
Mileage
Mileage (in 1,000's of miles)
Information on used Honda Accords obtained from cars.com.
Cars.com, February 2017 using zip code 44107, Lakewood, Ohio
Congressional votes on the American Health Care Act in 2017
A data frame with 430 observations on the following 11 variables.
STATE
State name
Dist
Congressional district
Party
Party affiliation (D=Democrat, R=Republican)
Dem
1=Democrat, 0=Republican
Rep
1=Republican, 0=Democrat
uni2013
Percentage of citizens without health care in 2013
uni2015
Percentage of citizens without health care in 2015
uniChange
uni2015 - uni2013
Member
Name of representative
AHCAvote
1=yes, 0=no
Trump
1=Trump won district, 0=Clinton won district
On May 4, 2017, the U.S. House of Representatives voted, by the narrow margin of 217-213, to pass the American Health Care Act. Most Republicans voted Yes, while all Democrats voted No.
https://fivethirtyeight.com/features/obamacare-has-increased-insurance-coverage-everywhere/
https://docs.google.com/spreadsheets/d/1VfkHtzBTP5gf4jAu8tcVQgsBJ1IDvXEHjuMqYlOgYbA/edit#gid=0
https://www.nytimes.com/interactive/2017/05/04/us/politics/house-vote-republican-health-care-bill.html
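The indicator and difference columns described for this dataset can be derived from the other variables. The following is a minimal sketch in R, assuming a small made-up data frame; the three rows and their values are hypothetical and are not taken from the actual data.

```r
# Hypothetical rows for illustration only; not actual records from the dataset.
votes <- data.frame(
  Party   = c("D", "R", "R"),
  uni2013 = c(14.2, 17.5, 12.0),
  uni2015 = c(9.1, 12.3, 8.4)
)

# Indicator variables: 1 when the condition holds, 0 otherwise.
votes$Dem <- as.integer(votes$Party == "D")
votes$Rep <- as.integer(votes$Party == "R")

# Change in the uninsured percentage, following the definition of uniChange.
votes$uniChange <- votes$uni2015 - votes$uni2013

print(votes)
```

Because Dem and Rep always sum to 1, only one of them can enter a regression model that also includes an intercept.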
Ontime arrivals for American and Delta airlines at LaGuardia and O'Hare airports
A data frame with 10333 observations on the following 5 variables.
airline
American or Delta
airport
LGA=LaGuardia, ORD=O'Hare
OnTime
no or yes
IndOHare
Is the airport ORD? (1=yes or 0=no)
IndDelta
Is the airline Delta? (1=yes or 0=no)
Ontime/late data for individual flights to LaGuardia and O'Hare airports by American and Delta airlines.
Data collected on 9/20/16 from http://www.transtats.bts.gov/ot_delay/OT_DelayCause1.asp?pn=1
Growth of alfalfa sprouts in acidic conditions
A dataset with 15 observations on the following 3 variables.
Ht4 |
Height of alfalfa sprouts after four days |
Acid |
Amount of acid: 1.5HCl, 3.0HCl, or water |
Row |
a through e with a = closest to window and e =farthest from window |
Some students were interested in how an acidic environment might affect the growth of plants. They planted alfalfa seeds in 15 cups and randomly chose five to get plain water, five to get a moderate amount of acid (1.5M HCl), and five to get a stronger acid solution (3.0M HCl). The plants were grown in an indoor room so the students assumed that the distance from the main source of daylight (windows) might have an effect on growth rates. For this reason, they arranged the cups in five rows of three with one cup from each Acid level in each row. These are labeled in the data set as Row: a=farthest from the window through e=nearest to the window.
Neumann, A., Richards, A. L., and Randa, J. (2001). Effects of acid rain on alfalfa plants. Unpublished manuscript, Oberlin College.
data(Alfalfa)
US Senate party affiliation and votes on confirming Samuel Alito for the Supreme Court
A data frame with 100 observations on the following 6 variables.
State
State name
Senator
Senator's name
Party
Party affiliation (D=Democrat, R=Republican)
ConfVote
Confirmation vote (Nay=against or Yea=for)
StateOpinion
Percentage of state residents supporting the choice
Vote
1=for or 0=against
Data from the U.S. Senate vote on January 31, 2006 to confirm Samuel Alito to a position on the Supreme Court.
These numbers are taken from Kastellec, J.P., Lax, J.R., and Phillips, J. (2010), "Public Opinion and Senate Confirmation of Supreme Court Nominees," Journal of Politics, 72(3): 767-84. In this paper the authors used opinion polls and an advanced statistical method known as multilevel regression and poststratification to determine the StateOpinion levels.
Amyloid-beta and cognitive impairment for a sample of Catholic priests
A data frame with 57 observations on the following 2 variables.
Group
mAD=Alzheimer's, MCI=mild impairment, NCI=no impairment
Abeta
Amount of Abeta from the posterior cingulate cortex (pmol/g tissue)
Amyloid-beta (Abeta) is a protein fragment that has been linked to Alzheimer's disease. Autopsies from a sample of Catholic priests included measurements of Abeta (pmol/g tissue from the posterior cingulate cortex) from three groups: subjects who had exhibited no cognitive impairment before death, subjects who had exhibited mild cognitive impairment, and subjects who had mild to moderate Alzheimer's disease.
Violetta N. Pivtoraiko, Eric E. Abrahamson, Sue E. Leurgans, Steven T. DeKosky, Elliott J. Mufson, Milos D. Ikonomovic (2015) Cortical pyroglutamate amyloid-beta levels and cognitive decline in Alzheimer's disease. Neurobiology of Aging (36) 12-19. Data are read from Figure 1, panel d.
Daily prices and trading volume of Apple stock from July 21st to October 21st in 2016
A data frame with 66 observations on the following 4 variables.
Date
Date as mm/dd/yyyy
Price
Closing price of Apple stock
Change
Change in price from previous day
Volume
Number of shares traded (in millions)
Closing price of Apple stock (AAPL) for each trading day in a three month period from 7/21/2016 to 10/21/2016 as well as the change in stock price and number of shares traded.
Data downloaded from Nasdaq historical prices at http://www.nasdaq.com/symbol/aapl/historical
Score results from an archery class
A dataset with 18 observations on the following 7 variables.
Attendance |
Number of days in class |
Average |
Average score over all days |
Sex |
Coded as f or m |
Day1 |
Archery score on first day |
LastDay |
Archery score on last day |
Improvement |
Last day - first day score |
Improve |
1=improved or 0=did not improve |
In 2002, Heather Tollerud, a Saint Olaf College student, undertook a study of the archery scores of students at the college who were enrolled in an archery course. Students taking the course record a score for each day they attend class from the first until the last day. Hopefully the instruction they receive helps them to improve their game.
Student project
Six-year graduation data for 214,555 students in 2004
A data frame with 214555 observations on the following 3 variables.
Student
Athlete or NonAthlete
Race
Black or White
Grad
1=graduated within 6 years, otherwise 0
Six-year graduation data from 2004 for male non-athletes and for male athletes, where "Athlete" means football or basketball player. These data show Simpson's Paradox.
Victor Matheson, College of the Holy Cross, collected the summary statistics.
Data are derived from the summary tables in:
Matheson, V., "Research Note: Athletic Graduation Rates and Simpson's Paradox," Economics of Education Review, Vol. 26:4 (August 2007), 516-520.
Data from an experiment on reaction times to audio or visual stimuli by Oberlin College students.
A data frame with 72 observations on the following 4 variables.
Subject
Subject IDs coded s1 to s36
ResponseTime
Time to respond to a stimulus (in ms)
Stimulus
Type of stimulus (auditory or visual)
Group
Musician or NonMusician
Subjects in a reaction time study were asked to press a button as fast as possible after being exposed to either an auditory stimulus (a burst of white noise) or a visual stimulus (a circle flashing on a computer screen). Average reaction times (ms) were recorded for between 10 and 20 trials for each type of stimulus for each subject. Data also identifies which subjects are musicians.
Arjuna Pettit, Jr. and Jeremy Potterfield at Oberlin College
Measurements of noise levels for different filters to reduce pollution levels of automobiles.
A dataset with 36 observations on the following 4 variables.
Noise |
Noise level (decibels) |
Size |
Vehicle size: 1=small, 2=medium, or 3=large |
Type |
1=standard filter or 2=new filter |
Side |
Side of vehicle: 1=right or 2=left |
In a 1973 testimony before the Air and Water Pollution Subcommittee of the Senate Public Works Committee, John McKinley, President of Texaco discussed a new filter that had been developed to reduce pollution. Questions were raised about the effects of this filter on other measures of vehicle performance. The data set AutoPollution gives the results of an experiment on 36 different cars. The cars were randomly assigned to get either this new filter or a standard filter and the noise level for each car was measured.
Data explanation and link can be found at http://lib.stat.cmu.edu/DASL/Stories/airpollutionfilters.html.
A.Y. Lewin and M.F. Shakun, Policy Sciences: Methodology and Cases, Pergamon Press, 1976, p 313.
Backpack weights for a sample of college students
A data frame with 100 observations on the following 9 variables.
BackpackWeight |
Backpack weight (in pounds) |
BodyWeight |
Body weight (in pounds) |
Ratio |
BackpackWeight/BodyWeight |
BackProblems |
0=no or 1=yes |
Major |
Code for academic major |
Year |
Year in school |
Sex |
A factor with levels Female and Male |
Status |
Graduate or undergraduate? G or U |
Units |
Number of credits taken that quarter |
A survey of students at California Polytechnic State University (San Luis Obispo) collected data to investigate the question of whether back aches might be due to carrying heavy backpacks.
Mintz J., Mintz J., Moore K., and Schuh K., "Oh, My Aching Back! A Statistical Analysis of Backpack Weights," Stats: The Magazine for Students of Statistics, vol. 32, 2002, pp. 17-19.
Game times and boxscore information for baseball games
A data frame with 15 observations on the following 7 variables.
Game |
Code for opposing teams |
League |
AL=American League or NL=National League |
Runs |
Total number of runs scored (both teams) |
Margin |
Margin of victory (Winner-Loser score) |
Pitchers |
Total number of pitchers used (both teams) |
Attendance |
Number of spectators at the game |
Time |
Total time for the game (in minutes) |
Data were collected for 15 Major League Baseball (MLB) games played on August 26, 2008.
This dataset was used in the first edition, but updated to BaseballTimes2017 for the second edition.
Data from boxscores at www.baseball-reference.com
Times for one day's major league baseball games
A data frame with 14 observations on the following 7 variables.
Game
MLB teams that played
League
AL=American League, IL=Interleague, or NL=National League
Runs
Runs scored by the two teams combined
Margin
Winning margin
Pitchers
Number of pitchers used total for two teams
Attendance
Announced attendance
Time
Time in minutes to play the game
Data from all MLB games played on August 11, 2017. There were no extra-inning games and no rain delays.
https://www.baseball-reference.com/boxes/?month=8&day=11&year=2017
Data from an experiment to see if the number of bee stings depends on previous stings.
A data frame with 18 observations on the following 3 variables.
Occasion |
Trial: I to IX |
Treatment |
Fresh or Stung |
Stingers |
Number of stingers |
If you are stung by a bee, does that make you more likely to get stung again? Might bees leave behind a chemical message that tells other bees to attack you? To test this hypothesis, scientists dangled a 4x4 array of 16 muslin-wrapped cotton balls over a beehive. Eight of 16 balls had been previously stung; the other eight were fresh. The response was the total number of new stingers left behind by the bees. The process was repeated for a total of nine trials.
Used in first edition, but not second edition.
Free, J.B. (1961) "The stinging response of honeybees," Animal Behavior, Vol. 9, pp 193-196.
An experiment on the effects of a hormone on blood calcium levels in robins
A data frame with 20 observations on the following 5 variables.
Bird
ID number for each bird (1 to 20)
Sex
female or male
Hormone
Treated with hormone (no or yes)
Group
Combined Sex and Hormone (F No, F Yes, M No, or M Yes)
Ca
Blood calcium level (mg per 100 ml)
An experiment looked at the effects of treatment with a hormone for increasing the concentration of calcium in birds. Twenty birds (robins) were used in the study, ten male and ten female, equally divided between the hormone and no hormone treatments.
Bliss, Chester (1970), Statistics in Biology, McGraw-Hill
Nest and species characteristics for North American passerines
A data frame with 84 observations on the following 12 variables.
Species |
Latin species name |
Common |
Common species name |
Page |
Page in a bird manual describing the species |
Length |
Mean body length for the species (in cm) |
Nesttype |
Type of nest |
Location |
Location of nest |
No.eggs |
Number of eggs |
Color |
Egg color (0=plain/solid or 1=speckled/spotted) |
Incubate |
Mean length of time (in days) the species incubates eggs in the nest |
Nestling |
Mean length of time (in days) the species cares for babies in the nest until fledged |
Totcare |
Total care time = Incubate+Nestling |
Closed |
1=closed nest (pendant, spherical, cavity, crevice, burrow), 0=open nest (saucer, cup) |
Amy R. Moore, as a student at Grinnell College in 1999, wanted to study the relationship between species characteristics and the type of nest a bird builds, using data collected from available sources. For the study, she collected data by species for 84 separate species of North American passerines.
Project by Amy Moore at Grinnell College
The Birder's Handbook, by Ehrlich, et al. (1988)
Systolic blood pressure, weight and smoking status for a sample of 500 adults
A data frame with 500 observations on the following 3 variables.
SystolicBP |
Systolic blood pressure (mm of Hg) |
Smoke |
Y=smoker or N=non-smoker |
Overwt |
1=normal, 2=overweight, or 3=obese |
Data on systolic blood pressure, along with smoker status and weight status, for a sample of 500 adults.
Data are part of a larger case study for the 2003 Annual Meeting of the Statistical Society of Canada.
http://www.ssc.ca/en/education/archived-case-studies/case-studies-for-the-2003-annual-meeting-blood-pressure.
Body measurements for a sample of blue jays
A data frame with 123 observations on the following 9 variables.
BirdID |
ID tag for bird |
KnownSex |
Sex coded as F or M |
BillDepth |
Thickness of the bill measured at the nostril (in mm) |
BillWidth |
Width of the bill (in mm) |
BillLength |
Length of the bill (in mm) |
Head |
Distance from tip of bill to back of head (in mm) |
Mass |
Body mass (in grams) |
Skull |
Distance from base of bill to back of skull (in mm) |
Sex |
Sex coded as 0=female or 1=male |
Body measurements for captured blue jays. Values are averaged for birds captured more than once.
Data from Keith Tarvin, Department of Biology, Oberlin College
Brain tissue pH at time of death
A data frame with 54 observations on the following 5 variables.
pH
Brain tissue pH
Sex
F or M
Ethnicity
AfricanAmerican, Asian, Caucasian, or PacificIslander
Age
Age at death
DeathType
Cause of death (Cardiac, Other, or Suicide)
These are data from a PNAS article (supplemental file) on pH in brain tissue samples for controls and for people who had Major Depressive Disorder. We extracted just the controls (roughly 3/4 of whom died of cardiac arrest).
Jun Z. Li et al. (2013), "Circadian patterns of gene expression in the human brain and disruption in major depressive disorder," PNAS, vol 110, no. 24, www.pnas.org/cgi/doi/10.1073/pnas.1305814110
Data extracted from Supporting Information, Table S4: Li et al. www.pnas.org/cgi/content/short/1305814110
Passing statistics for football quarterback Drew Brees in 2016
A data frame with 16 observations on the following 5 variables.
Game
Game number (1 is the first game of the regular season)
Opponent
Opponent abbreviation
Completed
Number of completed passes
Attempts
Pass attempts
Yards
Passing yards
Drew Brees was the quarterback for the NFL's New Orleans Saints football team in 2016. This dataset shows some of his passing statistics for each of the 16 regular season games.
http://www.espn.com/nfl/player/gamelog/_/id/2580/year/2016
Poll attitudes towards British trade unions
A data frame with 17 observations on the following 7 variables.
Date |
Month of the poll, Aug-77 to Sep-79 |
AgreePct |
Percent who agree (unions have too much power) |
DisagreePct |
Percent who disagree |
NetSupport |
DisagreePct-AgreePct |
Months |
Months since August 1975 |
Late |
1=after 1986 or 0=before 1986 |
Unemployment |
Unemployment rate |
The British polling company Ipsos MORI conducted several opinion polls in the UK between 1975 and 1995 in which they asked whether people agree or disagree with the statement "Trade unions have too much power in Britain today".
Data from the Ipsos MORI website at
http://www.ipsos-mori.com/researchpublications/researcharchive/poll.aspx?oItemID=94
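The two derived columns, Months and NetSupport, follow directly from their definitions. The sketch below shows only the arithmetic those definitions imply; the helper name `months_since_aug1975` and the percentages are hypothetical, and the source does not say how the package authors actually computed these columns.

```r
# Months elapsed between August 1975 and a given poll month.
# (Hypothetical helper, named here for illustration.)
months_since_aug1975 <- function(year, month) {
  (year - 1975) * 12 + (month - 8)
}

months_since_aug1975(1977, 8)  # Aug-77 gives 24
months_since_aug1975(1979, 9)  # Sep-79 gives 49

# NetSupport = DisagreePct - AgreePct, using made-up percentages.
agree_pct    <- 75
disagree_pct <- 20
net_support  <- disagree_pct - agree_pct
net_support                    # -55
```

A negative NetSupport therefore means that more respondents agreed than disagreed that unions have too much power.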
Measurements for a sample of butterflies in Greenland
A data frame with 32 observations on the following 4 variables.
Temp
Average temperature for preceding summer (Celsius)
Wing
Average wing length (mm)
Sex
Female or Male
Species
All are Bc (Boloria chariclea)
Scientists measured wing length of a species of butterfly, Boloria chariclea (Bc), in Greenland each year from 1996 through 2013. They also recorded summer temperatures.
Digitized data from plots in Bowden, J. et al., "High-Arctic butterflies become smaller with rising temperatures", published in Biology Letters 11: 20150574
Senate votes for Corporate Average Fuel Economy (CAFE) bill
A data frame with 100 observations on the following 7 variables.
Senator |
Senator's name |
State |
Code for senator's state |
Party |
Party affiliation: D=Democrat, I=Independent, R=Republican |
Contribution |
Contributions from car manufacturers (dollars) |
LogContr |
Log of (Contribution+1) |
Dem |
1=Democrat/Independent or 0=Republican |
Vote |
1=yes or 0=no |
The Corporate Average Fuel Economy (CAFE) Bill was proposed by Senators John McCain and John Kerry to improve the fuel economy of cars and light trucks sold in the United States. However a critical vote on an amendment in March of 2002 threatened to indefinitely postpone CAFE. The amendment charged the National Highway Traffic Safety Administration to develop a new standard, the effect being to put on indefinite hold the McCain-Kerry bill. It passed by a vote of 62-38. A political question of interest is whether there is evidence of monetary influence on a senator's vote. Scott Preston, a professor of statistics at SUNY, Oswego, collected data on this vote which includes the vote of each senator (1=Yes or 0=No) and monetary contributions that each of the 100 senators received over his or her lifetime from the car manufacturers.
Thanks to Prof. Scott Preston from SUNY Oswego for the data.
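The LogContr definition adds 1 before taking the logarithm, which matters when a contribution is zero. The sketch below illustrates why, using made-up dollar amounts. The source does not state the base of the logarithm, so the natural log used here is an assumption.

```r
# Hypothetical lifetime contributions in dollars; not actual values from the data.
contribution <- c(0, 5000, 250000)

# The log of zero is -Inf, which cannot be used as a predictor in a model.
log(contribution)

# Adding 1 first keeps every value finite; a zero contribution maps to 0.
# Natural log is assumed here because the source does not give the base.
log_contr <- log(contribution + 1)
log_contr
```

For large contributions the added 1 changes the result very little, so the transformation mainly affects senators with little or no money received.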
An experiment on calcium supplements and blood pressure in 21 men
A data frame with 21 observations on the following 2 variables.
Treatment |
Calcium or Placebo |
Decrease |
Beginning-ending blood pressure |
The purpose of this study was to see whether daily calcium supplements can lower blood pressure. The subjects were 21 men; each was randomly assigned either to a treatment group or to a control group. Those in the treatment group took a daily pill containing calcium. Those in the control group took a daily pill with no active ingredients. Each subject's blood pressure was measured at the beginning of the 12-week study, and again at the end. The decrease in blood pressure (begin-end) was recorded (so a negative value means blood pressure increased).
Dataset downloaded from online data source Data and Story Library,
http://lib.stat.cmu.edu/DASL/Stories/CalciumandBloodPressure.html
US Senate vote on Klobuchar amendment to lower drug prices
A data frame with 94 observations on the following 6 variables.
Senator
Name of the Senator
Contributions
Amount of money received from the pharmaceutical industry over 6 years
Party
D=Democrat or R=Republican
State
Abbreviation for Senator's state
RollCall
Nay or Yea
Vote
Against or With what drug makers wanted
January 2017 vote in the U.S. Senate related to repealing part of ObamaCare. The "Klobuchar amendment" to a bill was introduced with the purpose of lowering drug prices by allowing prescription drugs to be imported from Canada.
The data exclude two senators who did not vote on the amendment and four senators who were new to Congress and thus had received no money from the drug industry. The remaining 94 senators represent 49 states (every state except California) and each of these senators had received at least $3,000.
Data obtained from:
http://www.senate.gov/legislative/LIS/roll_call_lists/roll_call_vote_cfm.cfm?congress=115&session=1&vote=00020
http://maplight.org/us-congress/interest/H4300/view/all
Cancer survival with ascorbate supplement
A dataset with 64 observations on the following 2 variables.
Survival |
Survival time (in days) |
Organ |
Breast, Bronchus, Colon, Ovary, or Stomach |
In the 1970's doctors wondered if giving terminal cancer patients a supplement of ascorbate would prolong their lives. They designed an experiment to compare cancer patients who received ascorbate to cancer patients who did not receive the supplement. The result of that experiment was that, in fact, ascorbate did seem to prolong the lives of these patients. But then a second question arose. Was the effect of the ascorbate different when different organs were affected by the cancer? The researchers took a second look at the data. This time they concentrated only on those patients who received the ascorbate and divided the data up by which organ was affected by the cancer. They had 5 different organs represented among the patients (all of whom only had one organ affected): Stomach, bronchus, colon, ovary, and breast.
From the article "Supplemental Ascorbate in the Supportive Treatment of Cancer: Reevaluation of Prolongation of Survival Times in Terminal Human Cancer" by Ewan Cameron and Linus Pauling, Proceedings of the National Academy of Sciences of the United States of America, Vol. 75, No. 9 (Sep., 1978), pp. 4538-4542.
Measurements on a sample of Manduca Sexta caterpillars
A data frame with 267 observations on the following 18 variables.
Instar |
Coded from 1 (smallest) to 5 (largest) indicating stage of the caterpillar's life |
ActiveFeeding |
Indicator (Y or N) of whether or not the animal is actively feeding |
Fgp |
Indicator (Y or N) of whether or not the animal is in a free growth period |
Mgp |
Indicator (Y or N) of whether or not the animal is in a maximum growth period |
Mass |
Body mass (in grams) |
LogMass |
Log (base 10) of body mass |
Intake |
Wet food intake (in grams/day) |
LogIntake |
Log (base 10) of Intake |
WetFrass |
Amount of frass (solid waste) produced (in grams/day) |
LogWetFrass |
Log (base 10) of WetFrass |
DryFrass |
Amount of frass, after drying, produced (in grams/day) |
LogDryFrass |
Log (base 10) of DryFrass |
Cassim |
CO2 assimilation (ingestion - excretion) |
LogCassim |
Log (base 10) of Cassim |
Nfrass |
Nitrogen in frass |
LogNfrass |
Log (base 10) of Nfrass |
Nassim |
Nitrogen assimilation (ingestion - excretion) |
LogNassim |
Log (base 10) of Nassim |
Student and faculty researchers at Kenyon College conducted numerous experiments with Manduca Sexta caterpillars to study biological growth.
We thank Professors Harry Itagaki, Drew Kerkhoff, Chris Gillen, and Judy Holdener and their students for sharing this data from research supported by NSF InSTaRs grant #0827208.
Shooting percentages for two Cav players
A data frame with 1940 observations on the following 3 variables.
Player
Frye or Irving
ShotType
Two or Three
Hit
1=made or 0=missed
Shooting success on 2-point shots and 3-point shots for the 2016-17 NBA season for two Cleveland Cavalier basketball players, Kyrie Irving and Channing Frye. Each case is a shot attempt. These data show Simpson's Paradox.
http://www.espn.com/nba/player/splits/_/id/6442/kyrie-irving http://www.espn.com/nba/player/splits/_/id/2754/type/total/channing-frye
Nutrition content for a sample of 36 different brands of breakfast cereals
A data frame with 36 observations on the following 4 variables.
Cereal |
Brandname of cereal |
Calories |
Calories per serving |
Sugar |
Grams of sugar per serving |
Fiber |
Grams of fiber per serving |
Data give nutrition contents (per serving) for 36 breakfast cereals.
These data were collected by Patricia Benedict, Ronald Brahler, and Kenneth Motz, who read the nutritional labels on the boxes, in an attempt to learn whether cereals high in fiber are also high in sugar and calories. The cereals are all of those that were sold at Russo Stop & Shop in University Heights, OH, in July, 1990.
Comparison of two treatments for nausea in chemotherapy
A data frame with 2 observations on the following 4 variables.
Drug |
Prochlorperazine or THC |
Effective |
Count of effective cases |
NotEffective |
Count of noneffective cases |
Patients |
Number of patients in the treatment |
An article in the New England Journal of Medicine described a study on the effectiveness of medications for combatting nausea in patients undergoing chemotherapy treatments for cancer. In the experiment, 157 patients were divided at random into two groups. One group of 78 patients was given a standard antinausea drug called prochlorperazine, while the other group of 79 patients received THC (the active ingredient in marijuana). Both medications were delivered orally and no patients were told which of the two drugs they were taking. The response measured was whether or not the patient experienced relief from nausea when undergoing chemotherapy. Dataset is a 2 x 2 table of counts.
Sallan SE, Cronin C, Zelen M, Zinberg NE (1980), "Antiemetics in patients receiving chemotherapy for cancer: a randomized comparison of delta-9-tetrahydrocannabinol and prochlorperazine," New England Journal of Medicine, 302(3) p.135-138.
Age at first speaking and aptitude test scores
A data frame with 21 observations on the following 3 variables.
Child |
ID for each child |
Age |
Age at first speaking (in months) |
Gesell |
Gesell Aptitude Test Score |
The data are from a study about whether there is a relationship between the age at which a child first speaks (in months) and his or her score on a Gesell Aptitude Test taken later in childhood.
These data were originally collected by L.M. Linde of UCLA but were first published by M.R. Mickey, O.J. Dunn, and V. Clark, "Note on the use of stepwise regression in detecting outliers," Computers and Biomedical Research, 1 (1967), pp. 105-111. The data have been used by several authors. We found them in David Moore's Basic Practice of Statistics, WH Freeman (2004)
2016 US Democratic Presidential primary results
A data frame with 31 observations on the following 5 variables.
State
ID for primary state
Delegates
Percentage of delegates won by Clinton
PaperTrail
Was a paper trail available for votes cast? (No Paper Trail or Paper Trail)
PopularVote
Percentage of votes won by Clinton
AfAmPercent
Percentage of African-Americans in the state
In 2016 Hillary Clinton won the Democratic nomination for U.S. President over Bernie Sanders. A paper was circulated that claimed to show evidence of election fraud based, among other things, on Clinton doing better in states that don't have a paper trail for votes cast in a primary election than she did in states that have a paper trail. Data is for the 31 states that held Democratic primaries in 2016.
https://docs.google.com/spreadsheets/d/1cszGOhbmHDTHH5ntaGPmeX55RgMMaoBhgqO1Wx-9TRk/edit#gid=0
http://kff.org/other/state-indicator/distribution-by-raceethnicity/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
Data on 60 customers at a clothing retailer
A data frame with 60 observations on the following 8 variables.
ID |
Case ID |
Amount |
Net dollar amount spent by customers in their latest purchase from this retailer |
Recency |
Number of months since the last purchase |
Freq12 |
Number of purchases in the last 12 months |
Dollar12 |
Dollar amount of purchases in the last 12 months |
Freq24 |
Number of purchases in the last 24 months |
Dollar24 |
Dollar amount of purchases in the last 24 months |
Card |
1 for customers who have a private-label credit card with the retailer, 0 if not |
This dataset represents a random sample of 60 customers from a large clothing retailer. The manager of the store is interested in predicting how much a customer will spend on his or her next purchase based on one or more of the available explanatory variables.
Personal communication with David Cameron who completed a more extensive consulting project for the retailer.
Rainfall amounts from a cloud seeding experiment (winter only)
A data frame with 28 observations on the following 7 variables.
Seeded |
Treatment coded as S=seeded or U=unseeded |
Season |
All in Winter |
TE |
Rainfall in East (treatment) |
TW |
Rainfall in West (treatment) |
NC |
Rainfall in North (control) |
SC |
Rainfall in South (control) |
NWC |
Rainfall in Northwest (control) |
Researchers were interested in whether seeded clouds would produce more rainfall. An experiment was conducted in Tasmania between 1964 and 1971 and rainfall amounts were measured in inches per rainfall period. The researchers measured the amount of rainfall in two target areas: East (TE) and West (TW). They also measured the amount of rainfall in three control locations. Clouds were coded as being either seeded (treatment) or unseeded (control). This is a subset (only Winter months) of the larger CloudSeeding2 dataset. All rainfall amounts are in inches.
Data were accessed from the website www.statsci.org/data/oz/cloudtas.html. This is the web home of the Australasian Data and Story Library (OzDASL).
A.J. Miller, D.E. Shaw, L.G. Veitch, and E.J. Smith, (1979) "Analyzing the results of a cloud-seeding experiment in Tasmania" in Communications in Statistics: Theory and Methods, A8 (10), pp. 1017-1047.
Rainfall amounts from a cloud seeding experiment
A data frame with 108 observations on the following 8 variables.
Period |
ID for time period |
Seeded |
Treatment coded as S=seeded or U=unseeded |
Season |
Coded as Autumn, Spring, Summer, or Winter |
TE |
Rainfall in East (treatment) |
TW |
Rainfall in West (treatment) |
NC |
Rainfall in North (control) |
SC |
Rainfall in South (control) |
NWC |
Rainfall in Northwest (control) |
Researchers were interested in whether seeded clouds would produce more rainfall. An experiment was conducted in Tasmania between 1964 and 1971 and rainfall amounts were measured in inches per rainfall period. The researchers measured the amount of rainfall in two target areas: East (TE) and West (TW). They also measured the amount of rainfall in three control locations. Clouds were coded as being either seeded (treatment) or unseeded (control). A subset (only Winter months) of these data is stored in CloudSeeding. All rainfall amounts are in inches.
Data were accessed from the website www.statsci.org/data/oz/cloudtas.html. This is the web home of the Australasian Data and Story Library (OzDASL).
A.J. Miller, D.E. Shaw, L.G. Veitch, and E.J. Smith, (1979) "Analyzing the results of a cloud-seeding experiment in Tasmania" in Communications in Statistics: Theory and Methods, A8 (10), pp. 1017-1047.
Daily carbon dioxide measurements for April through November 2011
A data frame with 237 observations on the following 2 variables.
CO2
Carbon dioxide (CO2) level (in parts per million)
Day
Number of day in 2011 (April 1 = day 91)
Scientists at a research station in Brotjacklriegel, Germany recorded CO2 levels, in parts per million, in the atmosphere for each day from the start of April through November in 2011.
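The Day variable counts days from January 1, 2011. As a small sketch of how a calendar date maps onto that numbering, the base R snippet below converts a date string to its day of the year. The helper name day_of_2011 is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of Stat2Data): convert a 2011 calendar date
# to the day-of-year numbering used by the Day variable.
day_of_2011 <- function(date_string) {
  as.integer(format(as.Date(date_string), "%j"))
}

day_of_2011("2011-04-01")  # 91, matching "April 1 = day 91"
day_of_2011("2011-11-30")  # day number for the end of November
```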
This dataset was renamed to CO2Germany for the second edition.
http://gaw.empa.ch/gawsis/reports.asp?StationID=-739519191
Daily carbon dioxide measurements for April through November 2011
A data frame with 237 observations on the following 2 variables.
CO2
Carbon dioxide (CO2) level (in parts per million)
Day
Number of day in 2011 (April 1 = day 91)
Scientists at a research station in Brotjacklriegel, Germany recorded CO2 levels, in parts per million, in the atmosphere for each day from the start of April through November in 2011.
http://gaw.empa.ch/gawsis/reports.asp?StationID=-739519191
Monthly carbon dioxide readings at Mauna Loa, Hawaii
A data frame with 360 observations on the following 4 variables.
Year
Year (1988 - 2017)
Month
Month (1=Jan. to 12=Dec.)
CO2
Atmospheric carbon dioxide level (ppm)
t
Time interval (t=1 to 360)
Monthly average carbon dioxide readings (1988 - 2017) at the Mauna Loa Observatory in Hawaii. Data collected and disseminated by ESRL (Earth System Research Laboratory) of the U.S. NOAA (National Oceanic and Atmospheric Administration).
Data downloaded for MLO (Mauna Loa) from the ESRL/GMD data page at https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html
Monthly carbon dioxide readings at the South Pole
A data frame with 348 observations on the following 4 variables.
Year
Year (1988 - 2016)
Month
Month (1=Jan. to 12=Dec.)
CO2
Atmospheric carbon dioxide level (ppm)
t
Time interval (t=1 to 348)
Monthly average carbon dioxide readings (1988 - 2016) at the South Pole. Data collected and disseminated by ESRL (Earth System Research Laboratory) of the U.S. NOAA (National Oceanic and Atmospheric Administration).
Data downloaded for SPO (South Pole) from the ESRL/GMD data page at https://www.esrl.noaa.gov/gmd/dv/data/
Drug interaction study with oral contraceptives
A data frame with 44 observations on the following 6 variables.
ID
ID number for each of the women
StudyPeriod
1=first or 2=second
Treatment
Drug or Placebo
EE
Bioavailability of the ethinyl estradiol component of the oral contraceptive (in pg*hr/ml)
ComparisonValues
Comparison values used for a Tukey nonadditivity plot
Residuals
Residuals used for a Tukey nonadditivity plot
Twenty-two female subjects were allocated randomly to one of two treatment sequences in a two period crossover design. The two treatments were a new Drug D or placebo, both given concomitantly with a standard oral contraceptive which was given in both study periods. The oral contraceptive has two components, ethinyl estradiol (EE) and norethindrone (NET). The purpose of the study was to evaluate whether the presence of Drug D affected the bioavailability of each of the oral contraceptive components. Note that our dataset does not include the NET variable.
Thomas E. Bradstreet & Deborah L. Panebianco (2004) "An Oral Contraceptive Drug Interaction Study", Journal of Statistics Education, 12:1, DOI: 10.1080/10691898.2004.11910719
This function produces a plot of standardized residuals versus leverage values for a regression model. Horizontal boundaries identify mild or more extreme standardized residuals. Vertical boundaries identify mild and more severe high leverage points. Curved boundaries identify mild and more severe values of Cook's D.
cooksplot(mod)
mod |
a regression model from lm( ) |
The plot shows standardized residuals (vertical) versus leverage values (horizontal) for all cases in a regression model.
Horizontal (blue) boundaries mark standardized residuals beyond +/- 2 (mild) and +/- 3 (more severe).
Vertical (green) boundaries mark leverage points beyond 2(k+1)/n (mild) and 3(k+1)/n (more severe), where k = number of predictors and n = number of cases.
Curved (red) boundaries mark influential points beyond 0.5 (mild) and 1.0 (more severe) using Cook's D.
Unusual points are labeled with a case number.
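As a rough sketch of the quantities behind these boundaries, the base R functions rstandard(), hatvalues(), and cooks.distance() return the three measures for a model fit with lm( ). The data below are simulated for illustration only, and the flagging table is an assumption about how the cutoffs could be applied, not the package's own implementation.

```r
# Simulated illustrative data (not from the package).
set.seed(1)
x <- c(runif(19, 0, 10), 25)          # the last case has an unusual x value
y <- 3 + 2 * x + rnorm(20)
mod <- lm(y ~ x)

n <- length(y)                        # number of cases
k <- length(coef(mod)) - 1            # number of predictors

std_res  <- rstandard(mod)            # standardized residuals (vertical axis)
leverage <- hatvalues(mod)            # leverage values (horizontal axis)
cooks_d  <- cooks.distance(mod)       # Cook's D (curved boundaries)

# Apply the mild and more severe cutoffs described above.
flags <- data.frame(
  case         = seq_len(n),
  resid_mild   = abs(std_res) > 2,
  resid_severe = abs(std_res) > 3,
  lev_mild     = leverage > 2 * (k + 1) / n,
  lev_severe   = leverage > 3 * (k + 1) / n,
  cook_mild    = cooks_d > 0.5,
  cook_severe  = cooks_d > 1
)

# Cases that cross at least one boundary
flags[rowSums(flags[, -1]) > 0, ]
```

Cook's D combines the other two measures: for each case it equals the squared standardized residual divided by (k+1), times leverage/(1 - leverage), which is why its boundaries appear as curves in a residual versus leverage plot.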
A plot showing standardized residuals versus leverage values with boundaries for unusual cases
data(AccordPrice)
mod1=lm(Price~Age,data=AccordPrice)
cooksplot(mod1)
Medical facilities and doctors in a sample of counties.
A data frame with 53 observations on the following 4 variables.
County
County name, state
MDs
Number of medical doctors
Hospitals
Number of community hospitals
Beds
Number of beds in the hospitals
Data compiled from information provided by the American Medical Association on the availability of health care in counties in the United States. A random sample of 53 counties was chosen from among counties with at least two community hospitals.
Physicians–American Medical Association, Chicago, IL, Physician Characteristics and Distribution in the U.S., annual (copyright), accessed May 17, 2006. Community hospitals–Health Forum LLC, an American Hospital Association (AHA) Company, Chicago, IL, Hospital Statistics, and unpublished data (copyright), e-mail accessed May 4, 2006 (related Internet site http://www.healthforum.com).
Other web sources:
http://www.ama-assn.org/
http://www.healthforum.com/healthforum/html/data_statistics/data_statistics.html
http://www.cms.hhs.gov
http://www.ssa.gov
Oxygen intake of crabs with different noise sources
A data frame with 34 observations on the following 3 variables.
Mass
Mass of the crab
Oxygen
Rate of oxygen consumption (mu moles h^-1)
Noise
Source of noise (ambient or ship)
Animals that are stressed might increase their oxygen consumption. Biologists measured oxygen consumption of shore crabs that were either exposed to 7.5 minutes of ship noise or 7.5 minutes of ambient harbor noise.
Wale MA, Simpson SD, Radford AN. (2013) "Size-dependent physiological responses of shore crabs to single and repeated playback of ship noise", Biol Lett 9: 20121194. http://dx.doi.org/10.1098/rsbl.2012.1194
Digested calories with different types of fiber in crackers
A data frame with 48 observations on the following 3 variables.
Subj |
ID for the subject |
Fiber |
Type of fiber: bran, combo, control, or gum |
Calories |
Digested calories |
Twelve female subjects were fed a controlled diet, with crackers before every meal. There were four different kinds of crackers: control, bran fiber, gum fiber, and a combination of both bran and gum fiber. Over the course of the study, each subject ate all four kinds of crackers, one kind at a time, for a stretch of several days. The order was randomized. The response is the number of digested calories, measured as the difference between calories eaten and calories passed through the system.
Subset of the data at http://lib.stat.cmu.edu/DASL/Datafiles/Fiber.html.
Variables that might be related to whether students overdraw a checking account.
A data frame with 450 observations on the following 4 variables.
Age |
Age of the student (in years) |
Sex |
0=male or 1=female |
DaysDrink |
Number of days drinking alcohol (in past 30 days) |
Overdrawn |
Has student overdrawn a checking account? 0=no or 1=yes |
Researchers conducted a survey of 450 undergraduates in large introductory courses at either Mississippi State University or the University of Mississippi. There were close to 150 questions on the survey, but only four of these variables are included in this dataset. (You can consult the paper to learn how the variables beyond these 4 affect the analysis.) The primary interest for the researchers was factors relating to whether or not a student has ever overdrawn a checking account.
Worthy S.L., Jonkman J.N., Blinn-Pike L. (2010), "Sensation-Seeking, Risk-Taking, and Problematic Financial Behaviors of College Students," Journal of Family and Economic Issues, 31: 161-170
Lengths of cuckoo eggs laid in other birds' nests
A data frame with 120 observations on the following 2 variables.
Bird |
Type of bird nest: mdw_pipit (meadow pipit), tree_pipit, hedge_sparrow, robin, wagtail, or wren |
Length |
Cuckoo egg length (in mm) |
Cuckoos are known to lay their eggs in the nests of other (host) birds. The eggs are then adopted and hatched by the host birds. The data give the lengths of cuckoo eggs found in nests of various other bird species.
Downloaded from DASL at http://lib.stat.cmu.edu/DASL/Datafiles/cuckoodat.html
"The Egg of Cuculus Canorus. An Enquiry into the Dimensions of the Cuckoo's Egg and the Relation of the Variations to the Size of the Eggs of the Foster-Parent, with Notes on Coloration", by Oswald H. Latter, Biometrika, Vol. 1, No. 2 (Jan., 1902), pp. 164-176.
Data from a first day class survey in an introductory statistics course
A data frame with 43 observations on the following 13 variables.
Section |
Section: 1 or 2 |
Class |
Year in school: Freshman, Sophomore, Junior, or Senior |
Sex |
F=female or M=male |
Distance |
Distance (in miles) to get to campus |
Height |
Height (in inches) |
Handedness |
Left, Right, or Ambidextrous |
Coins |
Value of coins student has (in class) |
WhiteString |
Estimated length of a white string (in inches) |
BlackString |
Estimated length of a black string (in inches) |
Reading |
Expected amount of reading during the semester (pages/week) |
TV |
Hours of TV watched per week |
Pulse |
Resting pulse rate (beats per minute) |
Texting |
Number of text messages in past 24 hours |
An instructor at a small liberal arts college distributed a data survey on the first day of class. The data for two different sections of the course are given in this dataset.
Student survey in an introductory statistics class.
The rate of lactic acid turnover was measured by two methods for normal and diabetic dogs.
A data frame with 20 observations on the following 4 variables.
Dog
Code for individual dogs (d1 through d10)
Method
Tracer method to measure response (infuse or inject)
Operation
Pancreas removed to make the dog diabetic? (no or yes)
Response
Rate for biochemical turnover of lactic acid
Five dogs had their pancreas removed to make them diabetic (Operation=yes), the other five were normal (Operation=no). The rate of turnover of lactic acid was measured for each dog by two methods, infusion and injection.
Forbath, N., A. B. Kenshole, and G. Hetenyi, Jr. (1967), "Turnover lactic acid in normal and diabetic dogs calculated by two tracer methods," Am. J. Physiol. v. 212, pp. 1179-1183.
Price and characteristics for a sample of 351 diamonds
A data frame with 351 observations on the following 6 variables.
Carat |
Size of the diamond (in carats) |
Color |
Coded as D (most white/bright) through J |
Clarity |
Coded as IF, VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 |
Depth |
Depth (as a percentage of diameter) |
PricePerCt |
Price per carat |
TotalPrice |
Price for the diamond (in dollars) |
Data for a sample of diamonds. The clarity of the diamonds ranges from IF (internally flawless) through VVS1 (very, very slightly included), VS1 (very slightly included), to SI3 (slightly included) in the order listed above.
Diamond data obtained from AwesomeGems.com on July 28, 2005.
A subset of 307 cases with the most frequent colors from the Diamonds data
A data frame with 307 observations on the following 6 variables.
Carat |
Size of the diamond (in carats) |
Color |
Coded as D (most white/bright) through G |
Clarity |
Coded as IF, VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3 |
Depth |
Depth (as a percentage of diameter) |
PricePerCt |
Price per carat |
TotalPrice |
Price for the diamond (in dollars) |
A subset of the Diamonds data, containing only those with the most frequent colors D, E, F, and G. The clarity of the diamonds ranges from IF (internally flawless) through VVS1 (very, very slightly included), VS1 (very slightly included), to SI3 (slightly included) in the order listed above.
Diamond data obtained from AwesomeGems.com on July 28, 2005.
Iridium levels in prehistoric rock layers
A data frame with 28 observations on the following 4 variables.
ID
Sample identifier
Source
Type of rock (Limestone or Shale)
Depth
Depth of the sample (in meters)
Iridium
Iridium concentration (ppb)
The question of interest is whether a volcanic eruption or asteroid strike had created a dust cloud that led to extinction of most dinosaurs. Rock samples taken in Gubbio, Italy were measured for the concentration of iridium (a rare metal which is more common in asteroids). The deeper the sample, the older the rocks are. A sudden increase in iridium at some point in time would lend support for the asteroid hypothesis.
Ramsey, Fred L. and Daniel W. Schafer (2002). The Statistical Sleuth, 2nd ed., Pacific Grove, CA, Duxbury, pp.405-407.
State-by-state information from the 2008 U.S. presidential election
A dataframe with 51 observations on the following 7 variables.
State |
Name of the state |
Abr |
Abbreviation for the state |
Income |
Per capita income in the state as of 2007 (in dollars) |
HS |
Percentage of adults with at least a high school education |
BA |
Percentage of adults with at least a college education |
Dem.Rep |
Difference in %Democrat and %Republican (according to 2008 Gallup survey) |
ObamaWin |
1=Obama (Democrat) wins state in 2008 or 0=McCain (Republican) wins |
This dataset contains information from all 50 states and the District of Columbia for the 2008 U.S. presidential election.
State income data from: Census Bureau Table 659. Personal Income Per Capita (in 2007)
High school data from: U.S. Census Bureau, 1990 Census of Population,
http://nces.ed.gov/programs/digest/d08/tables/dt08_011.asp
College data from: Census Bureau Table 225. Educational Attainment by State (in 2007)
% Democrat and %Republican:
http://www.gallup.com/poll/114016/state-states-political-party-affiliation.aspx#1
2016 presidential election and state demographic data
A data frame with 50 observations on the following 8 variables.
State
State name
Abr
Abbreviation for state name
Income
Per capita income in the state
HS
Percent high school grads
BA
Percent college grads
Adv
Percent with advanced degrees
Dem.Rep
Democratic lean - Republican lean in 2015 Gallup poll
TrumpWin
Trump won the state? (1=yes or 0=no)
This dataset contains information from all 50 states for the 2016 U.S. presidential election. It is similar to Election08 for the 2008 election.
Income data from
https://www.census.gov/search-results.html?q=per+capita+income+by+state&search.x=0&search.y=0&search=submit&page=1&stateGeo=none&searchtype=web&cssp=SERP
2015 data via American Community Survey
https://en.wikipedia.org/wiki/List_of_U.S._states_by_educational_attainment
from U.S. Census Bureau, "2011-2015 American Community Survey 5-Year Estimates," factfinder.census.gov. Retrieved 2017-01-19.
http://www.gallup.com/poll/188969/red-states-outnumber-blue-first-time-gallup-tracking.aspx
Age and height of male African elephants
A data frame with 138 observations on the following 3 variables.
Age
Age (in years)
Height
Shoulder height (in cm)
Firstborn
Firstborn? (1=yes, 0=no)
Data on 138 male African elephants that lived through droughts in the first two years of life.
Data are from Phyllis Lee, Stirling University, and are related to Lee, P., et al. (2013), "Enduring consequences of early experiences: 40-year effects on survival and success among African elephants (Loxodonta africana)," Biology Letters, 9: 20130011.
Age and height of African elephants
A data frame with 288 observations on the following 3 variables.
Age
Age (in years)
Height
Shoulder height (in cm)
Sex
F=female or M=male
Data on 288 African elephants that lived through droughts in the first two years of life.
Data are from Phyllis Lee, Stirling University, and are related to Lee, P., et al. (2013), "Enduring consequences of early experiences: 40-year effects on survival and success among African elephants (Loxodonta africana)," Biology Letters, 9: 20130011.
This function produces an empirical logit plot for a binary response variable and a single quantitative predictor variable.
emplogitplot1(formula, data = NULL, ngroups = 3, breaks = NULL, yes = NULL, padj = TRUE, out = FALSE, showplot = TRUE, showline = TRUE, ylab = "Log(Odds)", xlab = NULL, dotcol = "black", linecol = "blue", pch = 16, main = "", ylim = NULL, xlim = NULL, lty = 1, lwd = 1, cex = 1)
formula |
A formula of the form (binary) Response~Predictor |
data |
A dataframe |
ngroups |
Number of groups to use (not needed if breaks is used), ngroups="all" uses all unique values |
breaks |
A vector of endpoints for the bins (not needed if ngroups is used) |
yes |
Set a value for the response to be counted for proportions (optional) |
padj |
Should proportions be adjusted to avoid zero and one? (default is TRUE) |
out |
Should the function return a dataframe with group information? (default is FALSE) |
showplot |
Show the plot? default is TRUE |
showline |
Show the regression line? default is TRUE |
ylab |
Text label for the vertical axis (default is "Log(Odds)") |
xlab |
Text label for the horizontal axis (default is NULL) |
dotcol |
Color for the dots (default is "black") |
linecol |
Color for the line (default is "blue") |
pch |
Plot character for the dots (default is 16) |
main |
Title for plot |
ylim |
Limits for the vertical axis |
xlim |
Limits for the horizontal axis |
lty |
Line type (default is 1) |
lwd |
Line width (default is 1) |
cex |
Multiplier for plot symbols |
Values of the quantitative explanatory variable will be grouped into ngroups roughly equal sized groups, unless breaks is used to determine the boundaries of the groups. Using ngroups="all" will make each distinct value of the explanatory variable its own group.
We find an adjusted proportion for the binary response variable within each of the groups with (Number yes + 0.5)/(Number of cases + 1). This is converted to an adjusted log odds log(adjp/(1-adjp)). The adjustment avoids problems if there are no "successes" or all "successes" in a group. What constitutes a "success" can be specified with yes= and the proportion adjustment can be turned off (if no group proportions are likely to be zero or one) with padj=FALSE.
The function plots the log odds versus the mean of the explanatory variable within each group. A least squares line is fit to these points. The plot can be suppressed with showplot=FALSE.
The out=TRUE option will return a dataframe with the boundaries of each group, proportion, adjusted proportion, mean explanatory variable, and (adjusted or unadjusted) log odds.
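The calculation described above can be sketched in base R. This is a minimal illustration on made-up data, not the package's implementation: the quantile-based grouping rule and all variable names here are assumptions.

```r
# Made-up illustrative data (not from the package): binary response y, predictor x.
set.seed(2)
x <- round(runif(60, 2.5, 4), 2)
y <- rbinom(60, 1, plogis(-10 + 3 * x))

ngroups <- 3
# Cut x at quantiles to get roughly equal-sized groups (an assumed grouping
# rule; the package may choose its boundaries differently).
cuts  <- quantile(x, probs = seq(0, 1, length.out = ngroups + 1))
group <- cut(x, breaks = cuts, include.lowest = TRUE)

n_cases <- as.vector(tapply(y, group, length))   # cases per group
n_yes   <- as.vector(tapply(y, group, sum))      # "successes" per group
x_mean  <- as.vector(tapply(x, group, mean))     # mean predictor per group

adj_p     <- (n_yes + 0.5) / (n_cases + 1)       # adjusted proportion
adj_logit <- log(adj_p / (1 - adj_p))            # adjusted log odds

group_table <- data.frame(group = levels(group), n_cases, n_yes,
                          x_mean, adj_p, adj_logit)

# Least squares line through the (group mean, adjusted log odds) points
fit <- lm(adj_logit ~ x_mean, data = group_table)
coef(fit)
```

Because of the adjustment, a group with zero successes or all successes still gets a finite log odds, so every group contributes a point to the fitted line.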
A dataframe with group information (if out=TRUE)
data(MedGPA)
emplogitplot1(Acceptance~GPA,data=MedGPA)
GroupTable=emplogitplot1(Acceptance~MCAT,ngroups=5,out=TRUE,data=MedGPA)
emplogitplot1(Acceptance~MCAT,data=MedGPA,breaks=c(0,34.5,39.5,50.5),dotcol="red",linecol="black")
data(Putts1)
emplogitplot1(Made~Length,data=Putts1,ngroups="all")
This function produces an empirical logit plot for a binary response variable with a single quantitative predictor variable broken down by a single categorical factor.
emplogitplot2(formula, data = NULL, ngroups = 3, breaks = NULL, yes = NULL, padj = TRUE, out = FALSE, showplot = TRUE, showline = TRUE, ylab = "Log(Odds)", xlab = NULL, putlegend = "n", levelcol = NULL, pch = NULL, main = "", ylim = NULL, xlim = NULL, lty = NULL, lwd = 1, cex = 1)
formula |
A formula of the form (binary) Response~Quantitative Predictor+Factor |
data |
A dataframe |
ngroups |
Number of groups to use (not needed if breaks is used), ngroups="all" uses all unique values |
breaks |
A vector of endpoints for the bins (not needed if ngroups is used) |
yes |
Set a value for the response to be counted for proportions (optional) |
padj |
Should proportions be adjusted to avoid zero and one? (default is TRUE) |
out |
Should the function return a dataframe with group and factor information? (default is FALSE) |
showplot |
Show the plot? default is TRUE |
showline |
Show the regression lines? default is TRUE |
ylab |
Text label for the vertical axis (default is "Log(Odds)") |
xlab |
Text label for the horizontal axis (default is NULL) |
putlegend |
Position for the legend (default is "n" for no legend) |
levelcol |
Vector of colors for the factor levels |
pch |
Plot character for the dots |
main |
Title for plot |
ylim |
Limits for the vertical axis |
xlim |
Limits for the horizontal axis |
lty |
Line type (default is 1) |
lwd |
Line width (default is 1) |
cex |
Multiplier for plot symbols |
Values of the quantitative explanatory variable will be grouped into ngroups roughly equal sized groups, unless breaks is used to determine the boundaries of the groups. Using ngroups="all" will make each distinct value of the explanatory variable its own group.
We find a proportion for the binary response variable within each of the groups created from the quantitative variable crossed with the categorical variable. To avoid problems with proportions of zero and one, we compute an adjusted proportion with (Number yes + 0.5)/(Number of cases + 1). This is converted to an adjusted log odds log(adjp/(1-adjp)). What constitutes a "success" can be specified with yes= and the proportion adjustment can be turned off (if no group proportions are likely to be zero or one) with padj=FALSE.
The function plots the log odds versus the mean of the explanatory variable within each group with different colors for each of the categories defined by the categorical variable. A least squares line is fit to these points within each categorical group. The plot can be suppressed with showplot=FALSE.
The out=TRUE option will return a dataframe with the boundaries of each group, proportion, adjusted proportion, mean explanatory variable, and (adjusted or unadjusted) log odds.
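A minimal base R sketch of the crossed calculation follows, using made-up data. The quantile-based bins, the column names, and the use of aggregate() are assumptions for illustration, not the package's implementation.

```r
# Made-up illustrative data (not from the package).
set.seed(3)
n <- 120
x <- runif(n, 0, 10)
f <- factor(sample(c("A", "B"), n, replace = TRUE))
y <- rbinom(n, 1, plogis(-2 + 0.4 * x + 0.8 * (f == "B")))

# Group x into 3 roughly equal-sized bins using quantiles (an assumed rule).
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
            include.lowest = TRUE)

# One row per (bin, factor level) cell: successes, cases, and sum of x.
d <- data.frame(x = x, f = f, y = y, bins = bins, one = 1)
cells <- aggregate(cbind(y, one, x) ~ bins + f, data = d, FUN = sum)
names(cells) <- c("bins", "f", "n_yes", "n_cases", "x_sum")

cells$x_mean    <- cells$x_sum / cells$n_cases
cells$adj_p     <- (cells$n_yes + 0.5) / (cells$n_cases + 1)   # adjusted proportion
cells$adj_logit <- log(cells$adj_p / (1 - cells$adj_p))        # adjusted log odds

# Separate least squares line for each level of the factor
fits <- lapply(split(cells, cells$f),
               function(cell_rows) coef(lm(adj_logit ~ x_mean, data = cell_rows)))
fits
```

Comparing the fitted slopes across factor levels gives a quick look at whether the lines are roughly parallel on the log odds scale.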
A dataframe with group information (if out=TRUE)
data(MedGPA)
emplogitplot2(Acceptance~GPA+Sex,data=MedGPA)
GroupTable2=emplogitplot2(Acceptance~MCAT+Sex,ngroups=5,out=TRUE,data=MedGPA,putlegend="topleft")
emplogitplot2(Acceptance~MCAT+Sex,data=MedGPA,breaks=c(0,34.5,39.5,50.5), levelcol=c("red","blue"),putlegend="bottomright")
Experiment on the effects of oxygen on sugar metabolism by bacteria
A data frame with 16 observations on the following 3 variables.
Sugar |
Type of sugar: Galactose or Glucose |
O2Conc |
Oxygen concentration |
Ethanol |
Ethanol concentration |
Many biochemical reactions are slowed or prevented by the presence of oxygen. For example, there are two simple forms of fermentation, one which converts each molecule of sugar to two molecules of lactic acid, and a second which converts each molecule of sugar to one each of lactic acid, ethanol, and carbon dioxide. This experiment was designed to compare the inhibiting effect of oxygen on the metabolism of two different sugars, glucose and galactose, by Streptococcus bacteria. In this case there were four levels of oxygen that were applied to the two kinds of sugar.
Renamed to SugarEthanol in second edition.
Data are found in Statistics: The Exploration and Analysis of Data by Jay Devore and Roxy Peck (2008). St. Paul, MN: West.
The original article is Yamada T., Takahashi-Abbe S., Abbe K. (1985) "Effects of oxygen concentration on pyruvate formate-lyase in situ and sugar metabolism of Streptococcus mutans and Streptococcus sanguis," Infection and Immunity, pp. 129-134.
Data from an experiment relating pupil dilation to sexual orientation.
A data frame with 106 observations on the following 4 variables.
DilateDiff
Difference in pupil dilation when looking at same-sex and opposite-sex nude photographs
Sex
F=female or M=male
Gay
1=gay or 0=not, based on Kinsey scale score greater than 3
SexMale
0=female or 1=male
DilateDiff is, essentially, the difference in pupil dilation when looking at (a) same-sex nudes and (b) opposite-sex nude photographs. More specifically, multiple measurements of pupil size were taken under each of the two conditions, together with a third condition that involved a neutral stimulus. Within-subject z-scores were then computed, which led to the DilateDiff numbers used here.
G. Rieger and R.C. Savin-Williams (2012), "The Eyes Have It: Sex and Sexual Orientation Differences in Pupil Dilation Patterns," in PLoS ONE. The full study included 325 students. Here we are analyzing a subset of the data that excludes White students.
Grip strength, attractiveness, and shoulder-hip ratio for men
A data frame with 38 observations on the following 5 variables.
MaxGripStrength
Measurement of strength of hand grip
SHR
Shoulder to hip ratio
Partners
Number of sexual partners (lifetime)
Attractive
Attractiveness rating
AgeFirstSex
Age of first sex
Facial attractiveness of several men was rated by female college students. Maximum grip strength was also measured, along with shoulder to hip ratio, age of first sex, and number of sex partners.
Shoup, M. L. and Gallup, G.G., Jr. (2008), "Men's Faces Convey Information about Their Bodies and Their Behavior: What You See is What You Get," Evolutionary Psychology, 6(3): 469-479.
Ratings from a facial photo and actual faithfulness.
A data frame with 170 observations on the following 7 variables.
SexDimorph
Rating of sexual dimorphism (masculinity for males, femininity for females)
Attract
Rating of attractiveness
Cheater
Was the face subject unfaithful to a partner? (1=yes or 0=no)
Trust
Rating of trustworthiness
Faithful
Rating of faithfulness
FaceSex
Sex of face (F=female or M=male)
RaterSex
Sex of rater (F=female or M=male)
College students were asked to look at a photograph of an opposite-sex adult face and to rate the person, on a scale from 1 (low) to 10 (high), for attractiveness. They were also asked to rate trustworthiness, faithfulness, and sexual dimorphism (i.e., how masculine a male face is and how feminine a female face is). Overall, 68 students (34 males and 34 females) rated 170 faces (88 men and 82 women).
This dataset is based on G. Rhodes et al. (2012), "Women can judge sexual unfaithfulness from unfamiliar men's faces," Biology Letters, November 2012. All of the 68 raters were heterosexual Caucasians, as were the 170 persons who were rated. (We have deleted 3 subjects with missing values and 16 subjects who were over age 35.)
Draft selection times for a fantasy baseball league
A data frame with 24 observations on the following 9 variables.
Round |
Round of the draft (1 to 24) |
DJ |
Draft time (in seconds) for D.J. |
AR |
Draft time (in seconds) for A.R. |
BK |
Draft time (in seconds) for B.K. |
JW |
Draft time (in seconds) for J.W. |
TS |
Draft time (in seconds) for T.S. |
RL |
Draft time (in seconds) for R.L. |
DR |
Draft time (in seconds) for D.R. |
MF |
Draft time (in seconds) for M.F. |
Time (in seconds) for participants in a draft for a fantasy baseball league to make a selection at each round.
Mathematical Science Baseball League historical records (online).
Experiment on effects of diets on weight gain of rats
A data frame with 60 observations on the following 3 variables.
Gain
Weight gain (in grams per week)
Protein
Level of protein (Hi or Lo)
Source
Source of protein (Beef, Cereal, or Pork)
Data from this experiment compared weight gain for 60 baby rats that were fed different diets. Half of the rats had low-protein diets (Lo) and the rest had high-protein (Hi). The source of protein was either beef, cereal, or pork.
C. P. Wilsie, Iowa State College Agricultural Station (1944) via Snedecor and Cochran
Fertility measurements for a sample of women who have difficulty getting pregnant
A data frame with 333 observations on the following 10 variables.
Age |
Age (in years) |
LowAFC |
Smallest antral follicle count |
MeanAFC |
Average antral follicle count |
FSH |
Maximum follicle stimulating hormone level |
E2 |
Fertility level |
MaxE2 |
Maximum fertility level |
MaxDailyGn |
Maximum daily gonadotropin level |
TotalGn |
Total gonadotropin level |
Oocytes |
Number of egg cells |
Embryos |
Number of embryos |
A medical doctor and her team of researchers collected a variety of data on women who were having trouble getting pregnant. A key method for assessing fertility is a count of antral follicles (LowAFC or MeanAFC) that can be performed with noninvasive ultrasound. Researchers are interested in how the other variables are related to these counts.
We thank Dr. Priya Maseelall and her research team for sharing these data.
Field goal results in the National Football League (NFL) by distance
A data frame with 51 observations on the following 7 variables.
Row |
Case ID |
Dist |
Distance of the attempt (in yards) |
N |
Number of kicks attempted from that distance |
Makes |
Number of kicks made from that distance |
PropMakes |
Proportion of attempts made |
Blocked |
Number of kicks blocked |
PropBlocked |
Proportion of kicks blocked |
This dataset summarizes all 8520 field goals attempted by place kickers in the National Football League (NFL) during regular season games for the 2000 through the 2008 seasons. Results are counts (attempted, made, and blocked) and proportions (made and blocked) for each distance.
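Data summarized as counts per distance can be modeled directly with a binomial logistic regression. The sketch below uses made-up counts (not the actual NFL data) with the same column names as this dataset to show how PropMakes relates to Makes and N, and how the counts could be passed to glm(). Fitting this particular model is an assumption about how the data might be used.

```r
# Made-up counts for illustration only (not the actual NFL data).
fg <- data.frame(
  Dist  = c(20, 30, 40, 50),
  N     = c(100, 200, 180, 60),
  Makes = c(97, 180, 135, 33)
)

# Proportion made at each distance
fg$PropMakes <- fg$Makes / fg$N

# Binomial logistic regression on grouped counts:
# the response is (number made, number missed) at each distance.
fit <- glm(cbind(Makes, N - Makes) ~ Dist, family = binomial, data = fg)
coef(fit)
```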
We thank Sean Forman and Doug Drinen of Sports Reference LLC for providing us with the NFL field goal data set.
Film data from Maltin's Movie and Video Guide
A data frame with 100 observations on the following 9 variables.
Title |
Movie title |
Year |
Year the movie was released |
Time |
Running time (in minutes) |
Cast |
Number of cast members listed in the guide |
Rating |
Maltin rating (range is 1 to 4, in steps of 0.5) |
Description |
Number of lines of text Maltin uses to describe the movie |
Origin |
Country: 0 = USA, 1 = Great Britain, 2 = France, 3 = Italy, 4 = Canada |
Time_code |
long=90 minutes or longer, short=under 90 minutes |
Good |
1=rating of 3 stars or better, 0=any lower rating |
One statistician movie fan decided to use statistics to study the movie ratings in his favorite movie guide, Movie and Video Guide (1996), by Leonard Maltin. Maltin rates movies on a one-star to four-star system, in increments of half-stars, with higher numbers being better. The guide also includes additional information on each film. The statistician used a random number generator to select a simple random sample of 100 movies rated by the Guide.
Data from Leonard Maltin's Movie and Video Guide (1996)
NCAA Final Four by seed with indicator for Tom Izzo's teams from 1985 - 2010.
A dataset with 1664 observations on the following 4 variables.
Year |
Year (1985 - 2010) |
Seed |
Seed in NCAA men's basketball tournament: 1 to 16 |
Final4 |
1 =made Final Four or 0 =did not make Final Four |
Izzo |
1 =team coached by Tom Izzo or 0 =not an Izzo team |
Each year 64 college teams are selected for the NCAA Division I Men's Basketball tournament, with 16 teams placed in each of four regions. Within each region the teams are seeded from 1 to 16, with the (presumed) best team as the 1 seed and the (presumed) weakest team as the 16 seed; this practice of seeding teams began in 1979 for the NCAA tournament. Only one team from each region (so four teams each year) advances to the Final Four. This dataset is the same as FinalFourLong, except that the data start in 1985 and there is an extra column that is an indicator for Michigan State teams coached by Tom Izzo.
Updated to FinalFourIzzo17 in second edition.
Final Four teams and their seed can be found at
http://www.championshiphistory.com/ncaahoops.php.
NCAA Final Four by seed with indicator for Tom Izzo's teams for 1985 - 2017
A data frame with 2112 observations on the following 4 variables.
Year
Year 1985 - 2017
Seed
Seed in NCAA men's basketball tournament: 1 to 16
Final4
1 = made Final Four or 0 = did not make Final Four
Izzo
1 = team coached by Tom Izzo or 0 = not an Izzo team
Each year 64 college teams are selected for the NCAA Division I Men's Basketball tournament, with 16 teams placed in each of four regions. Within each region the teams are seeded from 1 to 16, with the (presumed) best team as the 1 seed and the (presumed) weakest team as the 16 seed; this practice of seeding teams began in 1979 for the NCAA tournament. Only one team from each region (so four teams each year) advances to the Final Four. This dataset is an extension of FinalFourIzzo (that ended in 2010) and the same as FinalFourLong17, except that the data start in 1985 and there is an extra column that is an indicator for Michigan State teams coached by Tom Izzo.
Final Four teams and their seed can be found at http://www.championshiphistory.com/ncaahoops.php
NCAA Final Four by seed with individual cases for each team each year
A data frame with 2048 observations on the following 3 variables.
Year |
Year (1979 - 2010) |
Seed |
Seed in NCAA men's basketball tournament: 1 to 16 |
Final4 |
1 =made Final Four or 0 =did not make Final Four |
Each year 64 college teams are selected for the NCAA Division I Men's Basketball tournament, with 16 teams placed in each of four regions. Within each region the teams are seeded from 1 to 16, with the (presumed) best team as the 1 seed and the (presumed) weakest team as the 16 seed; this practice of seeding teams began in 1979 for the NCAA tournament. Only one team from each region (so four teams each year) advances to the Final Four. This dataset has a row (case) for each team in the NCAA Division I Men's Basketball tournament from 1979 to 2010 along with its seed and an indicator for whether the team made the Final Four that year.
Updated to FinalFourLong17 in second edition.
Final Four teams and their seed can be found at
http://www.championshiphistory.com/ncaahoops.php.
NCAA Final Four by seed with individual cases for each team each year
A data frame with 2496 observations on the following 4 variables.
Year
Year (1979 - 2017)
Seed
Seed in NCAA men's basketball tournament: 1 to 16
Final4
1 = made Final Four or 0 = did not make Final Four
Each year 64 college teams are selected for the NCAA Division I Men's Basketball tournament, with 16 teams placed in each of four regions. Within each region the teams are seeded from 1 to 16, with the (presumed) best team as the 1 seed and the (presumed) weakest team as the 16 seed; this practice of seeding teams began in 1979 for the NCAA tournament. Only one team from each region (so four teams each year) advances to the Final Four. This dataset has a row (case) for each team in the NCAA Division I Men's Basketball tournament from 1979 to 2017 along with its seed and an indicator for whether the team made the Final Four that year. This dataset is an extension of FinalFourLong (that went through 2010).
Final Four teams and their seed can be found at
http://www.championshiphistory.com/ncaahoops.php
NCAA Final Four participation summarized each year by seed
A data frame with 512 observations on the following 4 variables.
Year |
Year (1979 - 2010) |
Seed |
Seed in NCAA men's basketball tournament: 1 to 16 |
In |
Number of teams at that seed who made the Final Four that year |
Out |
Number of teams at that seed who did not make the Final Four that year |
Each year 64 college teams are selected for the NCAA Division I Men's Basketball tournament, with 16 teams placed in each of four regions. Within each region the teams are seeded from 1 to 16, with the (presumed) best team as the 1 seed and the (presumed) weakest team as the 16 seed; this practice of seeding teams began in 1979 for the NCAA tournament. Only one team from each region (so four teams each year) advances to the Final Four. This dataset is similar to FinalFourLong, except that each row combines the count of the results (make/don't make the Final Four) for each seed, so that In+Out= 4 for each row.
Updated to FinalFourShort17 in second edition.
Final Four teams and their seed can be found at
http://www.championshiphistory.com/ncaahoops.php.
NCAA Final Four participation summarized each year by seed
A data frame with 624 observations on the following 4 variables.
Year
Year 1979 to 2017
Seed
Seed in NCAA men's basketball tournament: 1 to 16
In
Number of teams at that seed who made the Final Four that year
Out
Number of teams at that seed who did not make the Final Four that year
Each year 64 college teams are selected for the NCAA Division I Men's Basketball tournament, with 16 teams placed in each of four regions. Within each region the teams are seeded from 1 to 16, with the (presumed) best team as the 1 seed and the (presumed) weakest team as the 16 seed; this practice of seeding teams began in 1979 for the NCAA tournament. Only one team from each region (so four teams each year) advances to the Final Four. This dataset is similar to FinalFourLong17, except that each row combines the count of the results (make/don't make the Final Four) for each seed, so that In + Out = 4 for each row. This dataset is an extension of FinalFourShort (that went through 2010).
Final Four teams and their seed can be found at
http://www.championshiphistory.com/ncaahoops.php
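The relationship between the long and short layouts can be sketched as follows. This is an illustrative Python sketch, not package code: the helper name `long_to_short` and the Final Four seeds used for the single example year are hypothetical. It collapses one row per team into one row per seed, so that In + Out = 4, and also shows the row counts implied by 64 teams (or 16 seeds) per year.

```python
from collections import Counter


def long_to_short(long_rows):
    """Collapse (Year, Seed, Final4) team rows into per-seed In/Out counts."""
    made = Counter()
    total = Counter()
    for year, seed, final4 in long_rows:
        total[(year, seed)] += 1
        made[(year, seed)] += final4
    return [
        {"Year": y, "Seed": s, "In": made[(y, s)], "Out": total[(y, s)] - made[(y, s)]}
        for (y, s) in sorted(total)
    ]


# Hypothetical single year: four regions, seeds 1-16, with made-up Final Four seeds.
final_four_seeds = [1, 1, 2, 5]  # one advancing seed per region (illustrative only)
long_rows = [
    (2017, seed, 1 if seed == final_four_seeds[region] else 0)
    for region in range(4)
    for seed in range(1, 17)
]
short_rows = long_to_short(long_rows)

# Row counts implied by the layout for 1979 through 2017.
n_years = 2017 - 1979 + 1    # 39 tournaments
long_count = 64 * n_years    # one row per team per year
short_count = 16 * n_years   # one row per seed per year
```

With these counts, `long_count` is 2496 and `short_count` is 624, matching the observation totals stated for the 1979 to 2017 long and short datasets.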
Finger tap rates after drug administration
A data frame with 12 observations on the following 4 variables.
Subject
Subject code (I, II, III, or IV)
Drug
Drug administered (Ca = caffeine, Pl = placebo, or Th = theobromine)
TapRate
Finger taps in a fixed time interval
Scientists Scott and Chen published research that compared the effects of caffeine with those of theobromine (a similar chemical found in chocolate) and with those of a placebo. Their experiment used four human subjects and took place over several days. Each day each subject swallowed a tablet containing one of caffeine, theobromine, or the placebo. Two hours later they were timed while tapping a finger in a specified manner (that they had practiced earlier, to control for learning effects). The response is the number of taps in a fixed time interval.
Renamed FranticFingers in second edition.
The data was found in Statistics in Biology, Vol. 1, by C. I. Bliss (1967), New York: McGraw Hill.
The original article is Scott, C. C. and Chen, K. K. (1944), "Comparison of the action of 1-ethyl theobromine and caffeine in animals and man," Journal of Pharmacology and Experimental Therapeutics, v. 82, pp. 89-97.
Predicting first-year college GPA
A data frame with 219 observations on the following 10 variables.
GPA |
First-year college GPA on a 0.0 to 4.0 scale |
HSGPA |
High school GPA on a 0.0 to 4.0 scale |
SATV |
Verbal/critical reading SAT score |
SATM |
Math SAT score |
Male |
1 = male, 0 = female |
HU |
Number of credit hours earned in humanities courses in high school |
SS |
Number of credit hours earned in social science courses in high school |
FirstGen |
1 = student is the first in her or his family to attend college, 0 =otherwise |
White |
1 = white students, 0 = others |
CollegeBound |
1 =attended a high school where >=50% students intended to go on to college, 0 =otherwise |
The FirstYearGPA dataset contains information from a sample of 219 first-year students at a midwestern college that might be used to build a model to predict their first-year GPA.
A sample from a larger set of data collected in 1996 by a professor at this college.
Fertility measurement for eggs from a sample of 35 lake trout
A data frame with 35 observations on the following 4 variables.
Age |
Age of the fish (in years) |
PctDM |
Percentage of the total egg material that is solid |
Month |
Month fish was caught: Sep =September or Nov =November |
Sept |
Indicator with 1 =September or 0 =November |
Researchers collected samples of female lake trout from Lake Ontario in September and November of 2002 through 2004. A goal of the study was to investigate the fertility of fish that had been stocked in the lake. One measure of the viability of fish eggs is percent dry mass (PctDM) which reflects the energy potential stored in the eggs by recording the percentage of the total egg material that is solid. Values of the PctDM for a sample of 35 lake trout (14 in September and 21 in November) are given in this dataset along with the age (in years) of the fish.
Lantry, O'Gorman, and Machut (2008), "Maternal Characteristics versus Egg Size and Energy Density," Journal of Great Lakes Research 34(4): 661-674.
Body measurements for a sample of 28 mammal species from a Fitch paper on acoustic allometry
A data frame with 28 observations on the following 5 variables.
Species
species of mammal
Order
Order (Carnivora or Primates)
Wt
Body weight (in kg)
Skull
Skull length (in cm)
Palate
Palate length (in cm)
Data on mammal species from a Zoology paper about acoustic allometry by W. Tecumseh Fitch.
Fitch, W. Tecumseh (2000), "Skull dimensions in relation to body size in nonhuman mammals: The causal bases for acoustic allometry," Zoology, 103, 40-58.
Flight response of Pacific Brant to overflights of helicopters
A dataset with 464 observations on the following 7 variables.
FlockID |
Flock ID |
Altitude |
Altitude of the overflight by the helicopter (in 100m) |
Lateral |
Lateral distance (in 100m) between the aircraft and flock |
Flight |
1 =more than 10% of flock flies away or 0 =otherwise |
AltLat |
Product of Altitude x Lateral |
AltCat |
Altitude categories: low =under 3, mid =3 to 6, high =over 6 |
LatCat |
Lateral categories: 1 = under 10 to 4 = over 30 |
A 1994 study collected data on the effects of air traffic on the behavior of the Pacific Brant (a small migratory goose). The data record flight responses to helicopter "overflights" and were collected to see how the proximity of a flight, both lateral and altitudinal, relates to the propensity of the Brant to flee the area. For this experiment, air traffic was restricted to helicopters because previous research had ascertained that helicopters created a more radical flight response than other aircraft. The data are in FlightResponse. Each case represents a flock of Brant that was observed during one overflight in the study. Flocks were determined observationally as contiguous collections of Brant, with flock sizes varying from 10 to 30,000 birds.
Data come from the book Statistical Case Studies: A Collaboration Between Academe and Industry, Roxy Peck, Larry D. Haugh, and Arnold Goodman, editors; SIAM and ASA, 1998.
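AltLat and AltCat are derived from Altitude and Lateral. The Python sketch below illustrates the two rules as listed in the variable table (it is not package code, and the function names are hypothetical). It assumes that the boundary values 3 and 6 fall in the mid category, which is how "3 to 6" is read here.

```python
def alt_lat(altitude, lateral):
    """Interaction term: product of Altitude and Lateral (both in units of 100 m)."""
    return altitude * lateral


def alt_cat(altitude):
    """Altitude category: low = under 3, mid = 3 to 6, high = over 6."""
    if altitude < 3:
        return "low"
    if altitude <= 6:  # assumption: both endpoints 3 and 6 count as mid
        return "mid"
    return "high"
```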
Florida death penalty cases by race of defendant and victim
A data frame with 326 observations on the following 4 variables.
Penalty
Was death penalty given? (No or Yes)
Defendant
Race of the defendant (Black or White)
White.Victim
Was the victim white? (1 = yes or 0 = no)
Black.Victim
Was the victim black? (1 = yes or 0 = no)
Mike Radelet's data on imposition of the death penalty for murderers in Florida broken down by race of the victim and defendant.
Radelet, M. (1981), "Racial Characteristics and Imposition of the Death Penalty," American Sociological Review, 46, 918-927.
Data from an experiment on calcium binding to proteins
A data frame with 51 observations on the following 2 variables.
Calcium |
Log of free calcium concentration |
ProteinProp |
Proportion of protein bound to calcium |
Suzanne Rohrback used a novel approach in a series of experiments to examine calcium binding proteins.
Thanks to Suzanne Rohrback for providing these data from her honors experiments at Kenyon College.
Finger tap rates after drug administration
A data frame with 12 observations on the following 4 variables.
ID
Case ID
Rate
Finger taps in a fixed time interval
Subj
Subject code (A, B, C, or D)
Drug
Drug administered (Ca = caffeine, Pl = placebo, or Th = theobromine)
Scientists Scott and Chen published research that compared the effects of caffeine with those of theobromine (a similar chemical found in chocolate) and with those of a placebo. Their experiment used four human subjects and took place over several days. Each day each subject swallowed a tablet containing one of caffeine, theobromine, or the placebo. Two hours later they were timed while tapping a finger in a specified manner (that they had practiced earlier, to control for learning effects). The response is the number of taps in a fixed time interval.
The data was found in Statistics in Biology, Vol. 1, by C. I. Bliss (1967), New York: McGraw Hill.
The original article is Scott, C. C. and Chen, K. K. (1944), "Comparison of the action of 1-ethyl theobromine and caffeine in animals and man," Journal of Pharmacology and Experimental Therapeutics, v. 82, pp. 89-97.
Sexual activity and lifetimes of fruit flies
A data frame with 125 observations on the following 7 variables.
ID |
a numeric vector |
Partners |
Number of female partners: 0, 1, or 8 |
Type |
0 =pregnant, 1 =virgin, 9 =none |
Longevity |
Lifespan (in days) |
Thorax |
Length of thorax (in mm) |
Sleep |
Percent of day sleeping |
Treatment |
1 pregnant, 1 virgin, 8 pregnant, 8 virgin, or none |
Hanley and Shapiro (1994) report on a study conducted by Partridge and Farquhar (1981) about the sexual behavior of fruit flies. It was already known that increased reproduction leads to shorter life spans for female fruit flies. But the question remained whether an increase in sexual activity would also reduce the life spans of male fruit flies. The researchers designed an experiment to answer this question. They had a total of 125 male fruit flies to use and they randomly assigned each of the 125 to one of five groups: no female partners, one pregnant female, one virgin female, eight pregnant females, or eight virgin females (coded in Treatment).
The data are given as part of the data archive on the Journal of Statistics Education website
and can be found on the page
http://www.amstat.org/publications/jse/jse_data_archive.htm.
Hanley and Shapiro (1994), "Sexual Activity and the Lifespan of Male Fruitflies: A Dataset That Gets Attention," Journal of Statistics Education
v.2, n.1
http://www.amstat.org/publications/jse/v2n1/datasets.hanley.html
Results from an experiment on male fruit flies with different levels of sexual activity and competition from other males
A data frame with 201 observations on the following 7 variables.
Mated
Was the fly allowed mating opportunities? (n or y)
Alone
Did the fly live alone? (y = yes or n = no, lived near another male)
Mating
How many mating opportunities was the fly given?
Total
Total duration of mating time over all opportunities (in seconds)
Size
Size of the thorax (in mm)
Lifespan
Lifespan (in hours, starting at the 12th day)
Activity
Number of times a movement detector was tripped starting in the 12th day
Researchers randomly assigned virgin male fruit flies to one of two treatments: live alone or live in an environment where they can sense one other male fly. Flies were randomly allocated to either have mating opportunities with female flies or to not have such opportunities. Those flies that were given mating opportunities were given 3, 4, or 5 opportunities to mate (Mating measures this number). Researchers also measured size, lifespan and activity levels of the fruit flies.
The file we are using is the link called survival at
http://rsbl.royalsocietypublishing.org/content/suppl/2013/02/25/rsbl.2012.1188.DC1.html
The article talking about the data is at
http://rsbl.royalsocietypublishing.org/content/9/2/20121188.full
Experiment with a ball swirling through a funnel
A data frame with 120 observations on the following 3 variables.
Funnel
Height of the funnel (inches)
Tube
Height of the drop tube (inches)
Time
Time (in seconds) for the ball to drop/swirl through the funnel
Data from a class experiment in which a steel ball was rolled through a plastic tube into a long plastic funnel. The angle of the funnel and the angle of the tube with respect to the flat table could be adjusted by changing the height of either (Funnel measured from the table, Tube measured from the top of the funnel). The ball rolls down the tube, then swirls around the funnel until dropping out at the bottom. Total trip time was measured with a stopwatch. Heights were adjusted after every two drops in a randomized order.
The funnel dropping experiment was originally described in Gunter, B. (1993) "Through a Funnel Slowly with Ball Bearing and Insight to Teach Experimental Design," The American Statistician, Vol. 47. These data come from a class experiment based on the setup in that article.
Brightness and fecundity of female glow-worms
A data frame with 26 observations on the following 2 variables.
Lantern
Length of glow lantern (in mm)
Eggs
Number of eggs laid
Data on 26 female glow-worms captured in Finland. Female glow-worms attract males by glowing with part of their abdomen (lantern). Researchers believe the brightness of glow might be related to mating success.
Hopkins J, Baudry G, Candolin U, Kaitala A. (2015), "I'm sexy and I glow it: female ornamentation in a nocturnal capital breeder," Biol. Lett. 11: 20150599.
http://dx.doi.org/10.1098/rsbl.2015.0599
Measurements for a sample of goldenrod galls
A data frame with 1055 observations on the following 9 variables.
Gdiam03 |
Gall diameter in 2003 (in mm) |
Stdiam03 |
Stem diameter in 2003 (in mm) |
Wall03 |
Wall thickness in 2003 (in mm) |
Fate03 |
b = beetle present, e = early death, f = living fly larva, g = living wasp, o = pupal case, u = unknown |
Gdiam04 |
Gall diameter in 2004 (in mm) |
Stdiam04 |
Stem diameter in 2004 (in mm) |
Wall04 |
Wall thickness in 2004 (in mm) |
Fate04 |
b = beetle present, e = early death, f = living fly larva, g = living wasp, o = pupal case, u = unknown |
Fly04 |
Fly in 2004? n or y |
Biology students collected measurements on goldenrod galls at the Brown Family Environmental Center at Kenyon College.
Thanks to the Kenyon College Department of Biology for sharing these data.
Data on houses sold between 2005 and 2015 in Grinnell, Iowa
A data frame with 929 observations on the following 15 variables.
Date
Coded value for date of sale (Jan 1, 2005=16436)
Address
Street address of the house
Bedrooms
Number of bedrooms
Baths
Number of bathrooms
SquareFeet
The square footage of the home's living space
LotSize
Lot size (in acres)
YearBuilt
Year the house was built; many pre-1900 homes are listed as 1900
YearSold
The year the house was sold, for this case
MonthSold
The month the house was sold (1=Jan, 2=Feb, to 12=Dec)
DaySold
Day of the month the house was sold (1 to 31)
CostPerSqFt
SalePrice / SquareFeet (rounded to the nearest penny)
OrigPrice
List price of the house when originally put on the market (dollars)
ListPrice
List price at the time of sale (dollars)
SalePrice
Sale price of the house (dollars)
SPLPPct
(SalePrice / ListPrice) * 100
A local Grinnell realtor, Matt Karjalahti, put these data together to see what patterns might be found, perhaps leading to improvements in how houses are sold or bought. He asked Grinnell College economists Lee Logan and Eric Ohrn to help with the analysis, and we obtained the data from them.
Thanks to Grinnell realtor Matt Karjalahti who originally collected the data and Grinnell College economists Lee Logan and Eric Ohrn who gave us the data.
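CostPerSqFt and SPLPPct are both ratios of other columns. The following Python sketch shows the two formulas from the variable list with hypothetical inputs; the function names are illustrative and the code is not part of the package.

```python
def cost_per_sqft(sale_price, square_feet):
    """SalePrice / SquareFeet, rounded to two decimal places (cents)."""
    return round(sale_price / square_feet, 2)


def splp_pct(sale_price, list_price):
    """Sale price as a percentage of the list price at the time of sale."""
    return 100 * sale_price / list_price
```

For example, a hypothetical house that sold for 150000 dollars with 1200 square feet gives a CostPerSqFt of 125.0, and a sale at 95000 dollars against a 100000 dollar list price gives an SPLPPct of 95.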
Grocery store sales with different discounts
A data frame with 36 observations on the following 5 variables.
Discount |
Amount of discount: 5.00%, 10.00%, or 15.00% |
Store |
Store number (1-12) |
Display |
Featured End of Aisl, Featured Middle of A, or Not Featured |
Sales |
Number sold during one week |
Price |
Wholesale price (in dollars) |
Grocery stores and product manufacturers are always interested in how well the products on the store shelves sell. An experiment was designed to test whether the amount of discount given on products affected the amount of sales of that product. There were three levels of discount, 5%, 10%, and 15%, and sales were held for a week. The total number of products sold during the week of the sale was recorded. The researchers also recorded the wholesale price of the item put on sale.
These data are not real, though they are simulated to approximate an actual study. The data come from John Grego, Director of the Stat Lab at University of South Carolina.
Presence/absence of gunnels (eels) at shoreline quadrats
A data frame with 1592 observations on the following 10 variables.
Gunnel |
1= gunnel present in the quadrat or 0=gunnel absent |
Time |
Minutes after midnight |
Fromlow |
Time in minutes from low tide |
Slope |
Slope (to nearest 10 degrees) perpendicular to waterline |
Rw |
Percentage cover in quadrat of rockweed/algae/plants |
Amphiso |
Density of crustacean food: 0=none to 4=high |
Subst |
Substratum: 1=solid rock, 2=rocky cobbles, 3=mixed pebbles/sand, 4=fine sand, 5=mud, 6=mixed mud/shell detritus, 7=cobbles on solid rock, 8=cobbles on mixed pebbles/sand, 9=cobbles on fine sand, 10=cobbles on mud, 11=cobbles on mixed mud/shell detritus, 12=cobbles on shell detritus, 13=shell detritus |
Pool |
Standing water deep? 1=yes or 2=no |
Water |
Standing water in the quadrat? 1=yes or 2=no |
Cobble |
Rocky cobbles? 1=yes or 2=no |
This dataset comes from a study on the habitat preferences of a species of eel, called a gunnel. Biologist Jake Shorty sampled quadrats along a coastline and recorded whether or not the species was found in the quadrat.
Thanks to Jake Shorty, Bowdoin biology student, for this dataset.
Survey data to see if subjects can guess author's sex from handwriting specimens
A data frame with 204 observations on the following 8 variables.
Individual
Survey Respondent Number
Gender
Gender of Respondent (0 = male, 1 = female)
Survey1
Percent correct on Survey 1
Survey2
Percent correct on Survey 2
FemaleID
Percent correct in identifying female specimens on Survey 1
MaleID
Percent correct in identifying male specimens on Survey 1
Both
Percent correctly identified on Survey 1 AND Survey 2
DIFF
Survey1 - Survey2
Bradley and colleagues at Clarke University gave two identical surveys to a sample of 203 students (each student did the survey twice). Each survey contains 25 writing specimens and students were asked to identify whether the author is male or female. Of the 25 specimens, 12 are written by a female, 13 by a male.
An example of the survey form can be found at
https://docs.google.com/forms/d/1sO6vlsozsORbqaCTsA7Ta0qZL7_6_MCEPJ7tYeKYyvI/viewform
Bradley, S., (2015), "Handwriting and Gender: A Multi-use Dataset", JSE (Datasets and Stories). March 2015.
http://www.amstat.org/publications/jse/v23n1/bradley.pdf
Data for a sample of hawks from three different species
A data frame with 908 observations on the following 19 variables.
Month |
8 = August to 12 = December |
Day |
Date in the month |
Year |
Year: 1992-2003 |
CaptureTime |
Time of capture (HH:MM) |
ReleaseTime |
Time of release (HH:MM) |
BandNumber |
ID band code |
Species |
CH =Cooper's, RT =Red-tailed, SS =Sharp-Shinned |
Age |
A = Adult or I = Immature |
Sex |
F =Female or M =Male |
Wing |
Length (in mm) of primary wing feather from tip to wrist it attaches to |
Weight |
Body weight (in gm) |
Culmen |
Length (in mm) of the upper bill from the tip to where it bumps into the fleshy part of the bird |
Hallux |
Length (in mm) of the killing talon |
Tail |
Measurement (in mm) related to the length of the tail (invented at the MacBride Raptor Center) |
StandardTail |
Standard measurement of tail length (in mm) |
Tarsus |
Length of the basic foot bone (in mm) |
WingPitFat |
Amount of fat in the wing pit |
KeelFat |
Amount of fat on the breastbone (measured by feel) |
Crop |
Amount of material in the crop, coded from 1 =full to 0 =empty |
Students and faculty at Cornell College in Mount Vernon, Iowa, collected data over many years at the hawk blind at Lake MacBride near Iowa City, Iowa. The data set that we are analyzing here is a subset of the original data set, using only those species for which there were more than 10 observations. Data were collected on random samples of three different species of hawks: Red-tailed, Sharp-shinned, and Cooper's hawks.
Many thanks to the late Professor Bob Black at Cornell College for sharing these data with us.
Tail lengths for two hawk species
A data frame with 838 observations on the following 2 variables.
Species |
RT =Red-tailed, SS =Sharp-shinned |
Tail |
Length of tail (in mm) |
Tail lengths measured for a sample of 838 hawks observed in Mount Vernon, Iowa. Note: HawkTail2 has these data in unstacked format and they are a subset of the data in Hawks which has a third species (Cooper's hawk).
Observations by students and faculty at Cornell College.
Tail lengths for two hawk species
A data frame with observations on the following 2 variables.
Tail_RT |
Tail length (in mm) for a sample of Red-tailed hawks |
Tail_SS |
Tail length (in mm) for a sample of Sharp-shinned hawks |
Tail lengths measured for a sample of hawks observed in Mount Vernon, Iowa. Note: HawkTail has similar data in stacked format. The Hawks dataset has more variables and a third species (Cooper's hawk).
Observations by students and faculty at Cornell College.
Percentage of correctly identified words in a hearing test
A data frame with 96 observations on the following 3 variables.
Subj |
Subject number (1 - 24) |
List |
List of words: L1, L2, L3, or L4 |
Percent |
Percent (out of 50) of words correctly identified |
Audiologists use standard lists of 50 words to test hearing; the words are calibrated, using subjects with normal hearing, to make all 50 words on the list equally hard to hear. The goal of the study described here was to see how four such lists, denoted by L1-L4 in this dataset, compared when played at low volume with a noisy background. The response is the percentage of words identified correctly.
Data downloaded from DASL at http://lib.stat.cmu.edu/DASL/Datafiles/Hearing.html.
Loven, F. (1981), "A Study of the Interlist Equivalency of the CID W-22 Word List Presented in Quiet and in Noise." Unpublished MS Thesis, University of Iowa.
Monthly US residential consumption of fuel oil (1983-2016)
A data frame with 408 observations on the following 4 variables.
Year
Year (1983 to 2016)
Month
Month (1=Jan through 12=Dec)
t
Time index (1 to 408)
FuelOil
Residential consumption of fuel oil (in 1,000 barrels/day)
U.S. residential consumption of distillate fuel oil each month from January 1983 through December 2016.
U.S. Energy Information Administration website, https://www.eia.gov/totalenergy/data/monthly/index.php
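The time index t runs from 1 to 408 because the series covers 34 years of 12 months each. A short Python sketch of that bookkeeping follows; it assumes t simply counts consecutive months starting at January 1983, and the function name is hypothetical (this is not package code).

```python
def time_index(year, month, first_year=1983):
    """Consecutive month counter: 1 for January of first_year, 13 for the next January."""
    return 12 * (year - first_year) + month


n_months = 12 * (2016 - 1983 + 1)  # 34 years of monthly data
```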
Data on hiking trails for each of the 46 "High Peaks" in the Adirondack mountains
A data frame with 46 observations on the following 6 variables.
Peak |
Name of the mountain |
Elevation |
Elevation at the highest point (in feet) |
Difficulty |
Rating of difficulty of the hike: 1 (easy) to 7 (most difficult) |
Ascent |
Vertical ascent (in feet) |
Length |
Length of hike (in miles) |
Time |
Expected trip time (in hours) |
Forty-six mountains in the Adirondacks of upstate New York are known as the High Peaks with elevations near or above 4000 feet (although modern measurements show a couple of the peaks are actually slightly under 4000 feet). A goal for hikers in the region is to become a "46er" by scaling each of these peaks. This dataset gives information about the hiking trails up each of these peaks.
High Peaks data available at http://www.adirondack.net/tour/hike/highpeaks.cfm. Thanks to Jessica Chapman at St. Lawrence University for recommending this dataset.
Data from games played by the Grinnell College men's basketball team between 1997 and 2006
A data frame with 147 observations on the following 22 variables.
Game |
An ID number assigned to each game |
Opp |
Name of the opponent school for the game |
Home |
Indicator variable where 1 = home game and 0 = away game |
OppAtt |
Number of field goal attempts by the opposing team |
GrAtt |
Number of field goal attempts by Grinnell |
Gr3Att |
Number of three-point field goal attempts by Grinnell |
GrFT |
Number of free throw attempts by Grinnell |
OppFT |
Number of free throw attempts by the opponent |
GrRB |
Total number of Grinnell rebounds |
GrOR |
Number of Grinnell offensive rebounds |
OppDR |
Number of defensive rebounds the opposing team had |
OppPoint |
Points scored in the game by the opponent |
GrPoint |
Points scored in the game by Grinnell |
GrAss |
Number of assists Grinnell had in the game |
OppTO |
Number of turnovers the opposing team gave up |
GrTO |
Number of turnovers Grinnell gave up |
GrBlocks |
Number of blocks Grinnell had in the game |
GrSteal |
Number of steals Grinnell had in the game |
X40Point |
Indicator variable that is 1 if some Grinnell player scored 40 or more points |
X30Point |
Indicator variable that is 1 if some Grinnell player scored 30 or more points |
WinLoss |
1 =Grinnell win or 0 =Grinnell loss |
PtDiff |
Point differential for the game (Grinnell score minus Opponent's score) |
Since 1991, David Arseneault, men's basketball coach of Grinnell College, has developed a unique, fast-paced style of basketball that he calls "the system." This dataset comes from the 147 games the Grinnell team played within its athletics conference from the 1997-98 season through the 2005-06 season.
These data were collected by Grinnell College students Eric Ohrn and Ben Johannsen.
Price and related characteristics of horses listed for sale on the internet
A data frame with 50 observations on the following 5 variables.
HorseID |
ID code for each horse |
Price |
Price (in dollars) |
Age |
Age of the horse (in years) |
Height |
Height of the horse (in hands) |
Sex |
f = female or m = male |
Undergraduate students at Cal Poly collected data on prices of 50 horses advertised for sale on the internet. Predictor variables of price include the age and height of the horse (in hands), as well as its sex.
Cal Poly students using a horse sale website.
Selling price and characteristics for a sample of 20 houses in a small town
A data frame with 20 observations on the following 3 variables.
Price |
Selling price (in dollars) |
Size |
Size of the house (in square feet) |
Lot |
Area of the house's lot (in square feet) |
This dataset contains selling prices for 20 houses that were sold in 2008 in a small midwestern town. The file also contains data on the size of each house (in square feet) and the size of the lot (in square feet) that the house is on.
Updated to HousesNY in second edition.
Data collected from zillow.com in June 2008.
House prices for a sample of houses in Canton NY
A data frame with 53 observations on the following 5 variables.
Price
Estimated price (in $1,000's)
Beds
Number of bedrooms
Baths
Number of bathrooms
Size
Floor area of the house (in 1,000 square feet)
Lot
Size of the lot (in acres)
Data scraped from Zillow.com for a sample of houses near the 13617 zip code (Canton, NY, a small town in upstate NY). Houses on lots bigger than five acres (often farms) were excluded.
Data scraped from the Zillow.com website using an app at http://myslu.stlawu.edu/~clee/dataset/zillow/ (April 2017)
Data for a sample of 200 patients at an Intensive Care Unit (ICU)
A data frame with 200 observations on the following 9 variables.
ID |
Patient ID code |
Survive |
1 =patient survived to discharge or 0 =patient died |
Age |
Age (in years) |
AgeGroup |
1 = young (under 50), 2 = middle (50-69), 3 = old (70+) |
Sex |
1 =female or 0 =male |
Infection |
1 =infection suspected or 0 =no infection |
SysBP |
Systolic blood pressure (in mm of Hg) |
Pulse |
Heart rate (beats per minute) |
Emergency |
1 =emergency admission or 0 =elective admission |
This dataset contains information for a sample of 200 patients who were part of a larger study conducted in a hospital's Intensive Care Unit (ICU). Since an ICU often deals with serious, life-threatening cases, a key variable to study is patient survival, which is coded in the Survive variable as 1 if the patient lived to be discharged and 0 if the patient died.
Data downloaded from The Data and Story Library (DASL), http://lib.stat.cmu.edu/DASL/Datafiles/ICU.html.
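The AgeGroup coding described above can be sketched as a small function. This is a minimal illustration of the documented cutoffs (under 50, 50-69, 70+); the function name and the sample ages are hypothetical and are not taken from the dataset.

```python
def age_group(age):
    """Return the AgeGroup code for an age in years: 1=young, 2=middle, 3=old."""
    if age < 50:
        return 1  # young (under 50)
    if age < 70:
        return 2  # middle (50-69)
    return 3      # old (70+)

# Hypothetical ages, chosen to exercise each cutoff
ages = [27, 50, 69, 70, 87]
codes = [age_group(a) for a in ages]
print(codes)  # [1, 2, 2, 3, 3]
```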
Infant mortality rates in the United States by decade (1920-2010)
A data frame with 10 observations on the following 2 variables.
Mortality
Deaths within one year of birth (per 1000 births)
Year
Year (1920-2010 by decades)
Infant mortality (deaths within one year of birth per 1,000 births) in the US from 1920 - 2010 (by decade).
CDC National Vital Statistics Reports at http://www.cdc.gov/nchs/data/nvsr/nvsr57/nvsr57_14.pdf and https://www.cdc.gov/nchs/data/nvsr/nvsr64/nvsr64_09.pdf
Consumer Price Index (CPI) each month for 2009 through 2016
A data frame with 96 observations on the following 5 variables.
Month
Month: 1=January to 12=December
Year
Year (2009 to 2016)
CPI
Consumer Price Index (base=100 in 1982-1984)
CPIPctDiff
Monthly percent change in CPI
t
Time index (1 to 96)
Monthly Consumer Price Index for 2009 to 2016 as produced by the Bureau of Labor Statistics (Series Id. CUUR0000SA0). Based on prices for all items in the U.S. city average for all urban consumers (not seasonally adjusted). The base period is 1982-1984 = 100.
Data downloaded from the Bureau of Labor Statistics at https://www.bls.gov/data/
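A monthly percent change such as CPIPctDiff can be sketched as follows. The formula used here, 100 × (current − previous) / previous, is an assumption about how the variable was computed; the function name and index values are hypothetical, and the first month is left as None because no earlier value is available in this toy series.

```python
def monthly_pct_change(cpi):
    """Percent change from the previous month; None where no prior month exists."""
    changes = [None]
    for prev, curr in zip(cpi, cpi[1:]):
        changes.append(100 * (curr - prev) / prev)
    return changes

# Hypothetical index values, not actual CPI readings
cpi = [200.0, 201.0, 198.99]
pct = monthly_pct_change(cpi)
print(pct)
```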
Congressional votes on an ObamaCare health insurance bill in 2009
A dataset with 435 observations on the following 9 variables.
Party |
Party affiliation: D =Democrat or R =Republican |
Dist. |
Congressional district (State-Number) |
InsVote |
Vote on the health insurance bill: 1 =yes or 0 =no |
Rep |
Indicator for Republicans |
Dem |
Indicator for Democrats |
Private |
Percentage of non-senior citizens in district with private health insurance |
Public |
Percentage of non-senior citizens in district with public health insurance |
Uninsured |
Percentage of non-senior citizens in district with no health insurance |
Obama |
District winner in 2008 presidential election: 1 =Obama or 0 =McCain |
On 7 November 2009 the U.S. House of Representatives voted, by the narrow margin of 220-215, for a bill to enact health insurance reform. Most Democrats voted yes while almost all Republicans voted no. This dataset contains data for each of the 435 representatives.
Insurance data are from the American Community Survey (http://www.census.gov/acs/www/data_documentation/data_main/).
Roll call of congressional votes on this bill can be found at http://clerk.house.gov/evs/2009/roll887.xml.
True IQ and guessed IQ (from a photo) for 40 women
A data frame with 40 observations on the following 3 variables.
Age
Age of woman
GuessIQ
Guessed IQ
TrueIQ
Actual IQ
One hundred sixty raters (75 men and 85 women) took part in judging intelligence (on a 1=high to 7=low scale) based on photographs of students. The ratings were converted to z-scores and then put on an IQ scale to compare to actual measured IQ. There were photos of 80 students, 40 men and 40 women. This data set contains data for the 40 women.
Kleisner K, Chvatalova V, Flegr J (2014), "Perceived Intelligence Is Associated with Measured Intelligence in Men but Not Women," PLoS ONE 9(3): e81237. doi:10.1371/journal.pone.0081237.
Reporting rates for bi-weekly jury pools in Franklin County Court (Columbus, OH).
A data frame with 52 observations on the following 4 variables.
Period |
Sequential 2-week periods over the course of a year |
PctReport |
Percentage of selected jurors who report |
Year |
1998 or 2000 |
I2000 |
Indicator for data from the year 2000 |
Tom Shields, jury commissioner for the Franklin County Municipal Court in Columbus, Ohio, is responsible for making sure that the judges have enough potential jurors to conduct jury trials. Jury duty for this court is two weeks long, so Tom must bring together a new group of potential jurors twenty-six times a year. Random sampling methods are used to obtain a sample of registered voters in Franklin County every two weeks, and these individuals are sent a summons to appear for jury duty. One of the most difficult aspects of Tom's job is to get those registered voters who receive a summons to actually appear at the courthouse for jury duty. This dataset contains the 1998 and 2000 data for the percentages of individuals who reported for jury duty after receiving a summons. The reporting dates vary slightly from year to year, so they are coded sequentially from 1, the first group to report in January, to 26, the last group to report in December. A variety of methods were used after 1998 to try to increase participation rates.
Franklin County Municipal Court
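The layout of the Period, Year, and I2000 variables can be sketched as below: 26 two-week periods for each of the two years gives the 52 rows. The row ordering is an assumption, and PctReport is omitted because no values are invented here.

```python
# Build the 52 (Year, Period) combinations with the I2000 indicator.
# Ordering (all of 1998, then all of 2000) is an assumption for illustration.
rows = [
    {"Period": period, "Year": year, "I2000": 1 if year == 2000 else 0}
    for year in (1998, 2000)
    for period in range(1, 27)
]

print(len(rows))   # 52
print(rows[0])     # {'Period': 1, 'Year': 1998, 'I2000': 0}
print(rows[-1])    # {'Period': 26, 'Year': 2000, 'I2000': 1}
```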
Pitch-by-pitch data for baseball pitcher Clayton Kershaw in the 2013 season
A data frame with 3402 observations on the following 24 variables.
BatterNumber
Number of batters faced so far that game
Outcome
One of 14 possible results for a pitch (e.g. Ball, Ball In Dirt, Called Strike, ..., Swinging Strike (Blocked))
Class
One of three classifications (B=ball, S=strike, or X=in play)
Result
From pitcher's perspective (Neg=ball or hit, Pos=strike or out)
Swing
Did the batter swing at the pitch? (No or Yes)
Time
Date and time of the pitch (format yyyy-mm-ddThh:mm:ssZ)
StartSpeed
Speed leaving the pitcher's hand (in mph)
EndSpeed
Speed crossing home plate (in mph)
HDev
Horizontal movement (inches)
VDev
Vertical movement (inches)
HPos
Horizontal position at home plate (inches from center, positive is catcher's right)
VPos
Vertical position at home plate (inches above the ground)
PitchType
Code for pitch type (CH=changeup, CU=curve, FF=fastball, or SL=slider)
Zone
1-9 in theoretical strike zone (upper left to lower right), 11-14 are out of strike zone
Nasty
A measure on a 0-100 scale of difficulty of the pitch to hit (100 is most difficult)
Count
Ball-strike count (0-0, 0-1, 0-2, 1-1, 1-2, 2-1, 2-2, 3-1, or 3-2)
BallCount
Number of balls before the pitch (0, 1, 2, or 3)
StrikeCount
Number of strikes before the pitch (0, 1, or 2)
Inning
Inning of the game
InningSide
Portion of the inning (bottom=pitcher at home or top=pitcher away)
Outs
Number of outs when the pitch is thrown
BatterHand
Batter's stance (L=left or R=right)
ABEvent
Result of the at bat (several possibilities)
Batter
Name of the batter faced
The dataset includes information for 3,402 individual pitches thrown by Los Angeles Dodgers pitcher Clayton Kershaw during the 2013 regular season, when he won the Cy Young Award as the best pitcher in the National League. Many variables are measured using Major League Baseball's PITCHf/x system, which uses cameras in each ballpark to track characteristics of each pitch thrown.
Data scraped from the MLB GameDay website (http://gd2.mlb.com/components/game/mlb/) using pitchRx
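The Time format and the relationship between Count, BallCount, and StrikeCount can be sketched as follows. This assumes the trailing Z in the timestamp denotes UTC (as in ISO 8601); the helper names and the sample timestamp are hypothetical and not drawn from the dataset.

```python
from datetime import datetime, timezone

def parse_pitch_time(stamp):
    """Parse a yyyy-mm-ddThh:mm:ssZ timestamp, treating Z as UTC (an assumption)."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def count_label(balls, strikes):
    """Combine ball and strike counts into a label like '3-2'."""
    return f"{balls}-{strikes}"

# Hypothetical timestamp in the documented format
t = parse_pitch_time("2013-04-01T20:15:30Z")
print(t.year, t.month, t.day, t.hour)  # 2013 4 1 20
print(count_label(3, 2))               # 3-2
```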
Hourly water temperatures from Gulf of Mexico near Key West, Florida
A data frame with 6572 observations on the following 3 variables.
DateTime
Date and time of reading (format mm/dd/yyyy h:00)
WaterTemp
Water temperature (in degrees Fahrenheit)
t
Time index (1 to 673)
Hourly readings of water temperatures from a measuring device in the Gulf of Mexico near Key West, Florida. The hourly temperatures are provided from October 3, 2016 to October 3, 2017 and were obtained from station 8724580. A few missing values have been interpolated to provide a complete series.
National Oceanic and Atmospheric Administration (2017), Key West Ocean Temperature Data, October 3, 2016 to October 3, 2017, https://www.nodc.noaa.gov, Accessed on October 4, 2017
Data were obtained by Kyle Johnston for his Senior Exercise (a capstone project).
Body measurements for a sample of 198 children
A data frame with 198 observations on the following 5 variables.
Height |
Height (in inches) |
Weight |
Weight (in pounds) |
Age |
Age (in months) |
Sex |
0 =male or 1 =female |
Race |
0 =white or 1 =other |
This dataset comes from a 1977 anthropometric study of body measurements for children. Subjects in this sample are between the ages of 8 and 18 years old, selected at random from the much larger dataset of the original study.
A sample of 198 cases from the NIST's AnthroKids dataset at http://ovrt.nist.gov/projects/anthrokids/
Lifetimes for potato leafhoppers on various sugar diets
A data frame with 8 observations on the following 2 variables.
Diet |
Control, Fructose, Glucose, or Sucrose |
Days |
Number of days until half the leafhoppers in a dish died |
The goal of this study was to compare the effects of four diets on the lifespan of small insects called potato leafhoppers. One of the four was a control diet: just distilled water with no nutritive value. Each of the other three diets had a particular sugar added to the distilled water, one of glucose, sucrose, or fructose. Leafhoppers were sorted into groups of eight and each group was put into one of eight lab dishes. Each of the four diets was added to two dishes, chosen using chance.
"Survival and behavioral responses of the potato leafhopper, Empoasca Fabae (Harris), on synthetic media," MS thesis by Douglas Dahlman (1963), Iowa State University. The data can be found in Analyzing Experimental Data by Regression by David M. Allen and Foster B. Cady, Belmont, CA: Lifetime Learning (Wadsworth).
Measurements of Dodonaea viscosa leaves
A data frame with 252 observations on the following 5 variables.
Width
Average width (in mm)
Length
Average length (in mm)
LWRatio
Length divided by Width
Area
Area (in sq. mm)
Year
Year the leaves were collected
Data on samples of leaves from the species Dodonaea viscosa subsp. angustissima (common name hopbush), which have been collected in a certain region of South Australia for many years.
Guerin, G., Wen, H., Lowe, A. (2012), "Leaf morphology shift linked to climate change," Biol. Lett., 8, doi: 10.1098/rsbl.2012.0458
Treatment results for leukemia patients
A data frame with 51 observations on the following 9 variables.
Age |
Age at diagnosis (in years) |
Smear |
Differential percentage of blasts |
Infil |
Percentage of absolute marrow leukemia infiltrate |
Index |
Percentage labeling index of the bone marrow leukemia cells |
Blasts |
Absolute number of blasts, in thousands |
Temp |
Highest temperature of the patient prior to treatment, in degrees Fahrenheit |
Resp |
1 =responded to treatment or 0 =failed to respond |
Time |
Survival time from diagnosis (in months) |
Status |
0 =dead or 1 =alive |
A study involved 51 untreated adult patients with acute myeloblastic leukemia who were given a course of treatment, after which they were assessed as to their response.
Data come from Statistical Analysis Using S-Plus (Brian S. Everitt; first edition 1994, Chapman & Hall).
Factors relating to Mississippi River levee failure
A data frame with 82 observations on the following 14 variables.
Failure
Did the levee fail? (1=yes or 0=no)
Year
Year
RiverMile
Location along the river (mile marker)
Sediments
Sediments present? (1=yes or 0=no)
BorrowPit
Borrow pit present? (1=yes or 0=no)
Meander
Type of meander (1=inside bend, 2=outside bend, 3=chute, 4=straight)
ChannelWidth
Width of the river channel (in meters)
FloodwayWidth
Width of floodway (in meters, levee to levee, levee to bluff, or bluff to bluff, as appropriate)
ConstrictionFactor
Constriction of the floodway over time (1880s to present)
LandCover
1=open water, 2=grassy, 3=agricultural, 4=forest
VegWidth
Vegetative buffer width (in meters)
Sinuosity
River length divided by valley length for 10 miles up- and down-valley from levee site
Dredging
Dredging intensity
Revetement
Is there a stone structure (wall) meant to hold up the bank? (1=yes or 0=no)
The goal of this investigation was to test the relative importance of geologic, geomorphic, and other physical factors that have led to levee failures through the past century along much of the Mississippi River.
A. Flor, N. Pinter, W.F. Remo (2010), "Evaluating Levee Failure Susceptibility on the Mississippi River Using Logistic Regression Analysis," Engineering Geology, Vol. 116, pp. 139-148
Dementia study comparing two groups of patients
A data frame with 39 observations on the following 3 variables.
Type
DLB=Dementia with Lewy Bodies or DLB/AD=DLB and Alzheimer's Disease
APC
Annualized Percentage Change from baseline volume of the brain
MMSE
Change in functional performance on the Mini Mental State Examination
Brain MRIs were used to study the brains of patients with Dementia with Lewy Bodies, some of whom also were diagnosed with Alzheimer's Disease.
Z. Nedelska et al. (2015), "Pattern of brain atrophy rates in autopsy-confirmed dementia with Lewy bodies," Neurobiology of Aging, 36: 452-461.
Dementia Study with Lewy Bodies
A data frame with 20 observations on the following 3 variables.
group
DLB/AD=DLB and Alzheimer's Disease
APC
Annualized Percentage Change from baseline volume of the brain
MMSE
Change in functional performance on the Mini Mental State Examination
Brain MRIs were used to study the brains of patients with Dementia with Lewy Bodies. These are the cases that were also diagnosed with Alzheimer's Disease. This is a subset of LewBody2Groups.
Z. Nedelska et al. (2015), "Pattern of brain atrophy rates in autopsy-confirmed dementia with Lewy bodies," Neurobiology of Aging, 36: 452-461.
Winning distances in men's Olympic long jump competitions (1900 - 2008)
A data frame with 26 observations on the following 2 variables.
Year |
Year of the Olympics (1900 - 2008) |
Gold |
Winning men's long jump distance (in meters) |
Gold medal winning distances for the men's long jump at the Olympics from 1900 to 2008.
Updated to LongJumpOlympics2016 in second edition.
Historical Olympic long jump results at http://trackandfield.about.com/od/longjump/qt/olymlongjumpmen.htm
Gold medal distance for Olympic men's long jump
A data frame with 28 observations on the following 2 variables.
Year
Olympic Year (1900-2016)
Gold
Gold medal distance (in meters)
Gold medal winning distances for the men's long jump at the Olympics from 1900 to 2016.
Historical Olympic long jump results at http://trackandfield.about.com/od/longjump/qt/olymlongjumpmen.htm
Hours of sleep for teenagers
A data frame with 446 observations on the following 3 variables.
Person
Case ID number
Age
Age (in years)
Outcome
Average at least 7 hours of sleep? (1=yes or 0=no)
Data from a sample of 446 teens, aged 14 to 18, who answered the question, "On an average school night, how many hours of sleep do you get?" The outcome variable records whether or not each person averages at least 7 hours of sleep.
Wahlstrom, K., Dretzke, B., Gordon, M., Peterson, K., Edwards, K., & Gdula, J. (2014) "Examining the Impact of Later School Start Times on the Health and Academic Performance of High School Students: A Multi-Site Study," Center for Applied Research and Educational Improvement. St Paul, MN: University of Minnesota.
Which "lost" letters will be returned by the public?
A data frame with 140 observations on the following 8 variables.
Location |
Where letter was "lost": DesMoines, GrinnellCampus, or GrinnellTown |
Address |
Address on the letter: Confederacy or Peaceworks |
Returned |
1 =letter was returned or 0 =letter was not returned |
DesMoines |
Indicator for letters left in Des Moines |
GrinnellTown |
Indicator for letters left in the town of Grinnell |
GrinellCampus |
Indicator for letters left on the Grinnell campus |
Peaceworks |
Indicator for letters addressed to Iowa Peaceworks |
Confederacy |
Indicator for letters addressed to Friends of the Confederacy |
In 1999 Grinnell College students Laurelin Muir and Adam Gratch conducted an experiment for an introductory statistics class. They intentionally "lost" 140 letters in either the city of Des Moines, the town of Grinnell, or on the Grinnell College campus. Half of each sample were addressed to Friends of the Confederacy and the other half to Iowa Peaceworks. The students kept track of which letters were eventually returned.
Student project at Grinnell College
Training records for a marathon runner
A dataset with 1128 observations on the following 9 variables.
Date |
Training date |
Miles |
Miles for training run |
Time |
Training time (in minutes:seconds:hundredths) |
Pace |
Running pace (in minutes:seconds:hundredths per mile) |
ShoeBrand |
Addidas, Asics, Brooks, Izumi, Mizuno, or New Balance |
TimeMin |
Training time (in minutes) |
PaceMin |
Running pace (in minutes per mile) |
Short |
1 = 5 miles or less or 0 =more than 5 miles |
After2004 |
1 = for runs after 2004 or 0 =for earlier runs |
Information from training records of a marathoner over a five-year period from 2002 to 2006.
Data from training records of one of the Stat2 authors.
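The conversion from the Time and Pace strings to the numeric TimeMin and PaceMin variables, along with the Short indicator, can be sketched as follows. This assumes the strings are colon-separated as minutes:seconds:hundredths; the function names and the sample run are hypothetical.

```python
def to_minutes(stamp):
    """Convert 'minutes:seconds:hundredths' to decimal minutes (format is assumed)."""
    minutes, seconds, hundredths = (int(part) for part in stamp.split(":"))
    return minutes + (seconds + hundredths / 100) / 60

def short_run(miles):
    """Short indicator: 1 for runs of 5 miles or less, 0 otherwise."""
    return 1 if miles <= 5 else 0

# Hypothetical training run: 5 miles in 45 minutes 30 seconds
miles = 5.0
time_min = to_minutes("45:30:00")
pace_min = time_min / miles  # minutes per mile
print(time_min, pace_min, short_run(miles))
```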
Daily changes in two stock market indices
A dataset with 56 observations on the following 5 variables.
DJIAch |
Change in Dow Jones Industrial Average |
Date |
Date: 06-Aug-09 to 02-Nov-09 |
Nik225ch |
Change in Nikkei 225 stock average |
Up |
Indicator for positive Nikkei change |
lagNik |
Previous day's Nikkei change |
This dataset contains data on daily changes from two stock markets over 56 days from 06-Aug-09 to 02-Nov-09. The Dow Jones Industrial Average is based in New York and the Nikkei 225 is a stock index in Japan.
Dow Jones Industrial Average: http://markets.cbsnews.com/cbsnews/quote/historical?Month=11&Symbol=310%3A998313&Year=2009&Range=12&tag=cbsnewsSectionsArea
Historical Nikkei 225 index: http://markets.cbsnews.com/cbsnews/quote/historical?Month=11&Symbol=992%3A1900000035&Year=2009&Range=12&tag=cbsnewsSectionsArea
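The derived variables Up and lagNik can be sketched from a series of daily Nikkei changes as shown below. The helper names and the values are hypothetical, and how the first day's lag was handled in the actual dataset is not documented, so it is left as None here.

```python
def lag(values):
    """Shift a series by one day; the first day has no previous value."""
    return [None] + list(values[:-1])

def up_indicator(values):
    """1 when the change is positive, 0 otherwise."""
    return [1 if v > 0 else 0 for v in values]

# Hypothetical daily changes in the Nikkei 225
nik225ch = [12.5, -3.0, 7.25]
lag_nik = lag(nik225ch)
up = up_indicator(nik225ch)
print(lag_nik)  # [None, 12.5, -3.0]
print(up)       # [1, 0, 1]
```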
Semester enrollments in mathematics courses
A dataset with 11 observations on the following 3 variables.
AYear |
Academic year (for the fall) |
Fall |
Fall semester total enrollments |
Spring |
Spring semester total enrollments |
Total enrollments in mathematics courses at a small liberal arts college were obtained for each semester from Fall 2001 to Spring 2012.
The data were obtained from http://Registrar.Kenyon.edu on June 1, 2012.
Results from a Math Placement exam at a liberal arts college
A dataset with 2696 observations on the following 16 variables.
Student |
Identification number for each student |
Gender |
0=Female, 1=Male |
PSATM |
PSAT score in Math |
SATM |
SAT score in Math |
ACTM |
ACT Score in Math |
Rank |
Adjusted rank in HS class |
Size |
Number of students in HS class |
GPAadj |
Adjusted GPA |
PlcmtScore |
Score on math placement exam |
Recommends |
Recommended course: R0, R01, R1, R12, R2, R3, R4, R6, or R8 |
Course |
Actual course taken |
Grade |
Course grade |
RecTaken |
1=recommended course, 0=otherwise |
TooHigh |
1=took course above recommended, 0=otherwise |
TooLow |
1=took course below recommended, 0=otherwise |
CourseSuccess |
1=B or better grade, 0=grade below B |
Scores and course results for students taking a math placement exam at a college.
Personal correspondence
Medical school admission status and information on GPA and standardized test scores
A dataset with 55 observations on the following 11 variables.
Accept |
Status: A =accepted to medical school or D =denied admission |
Acceptance |
Indicator for Accept: 1 =accepted or 0 =denied |
Sex |
F =female or M =male |
BCPM |
Bio/Chem/Physics/Math grade point average |
GPA |
College grade point average |
VR |
Verbal reasoning (subscore) |
PS |
Physical sciences (subscore) |
WS |
Writing sample (subscore) |
BS |
Biological sciences (subscore) |
MCAT |
Score on the MCAT exam (sum of VR+PS+WS+BS) |
Apps |
Number of medical schools applied to |
This dataset has information gathered on 55 medical school applicants from a liberal arts college in the Midwest.
Data collected at a midwestern liberal arts college.
Comparing meniscus repair methods on cadaver knees
A data frame with 18 observations on the following 4 variables.
Method
Meniscus repair method (1=Vertical Suture, 2=Meniscus Arrow, 3=FasT-Fix)
FailureLoad
Load at failure (in Newtons)
Displacement
Displacement (in mm)
Stiffness
Stiffness (Newtons/mm)
Eighteen lightly embalmed cadaveric knee specimens were used in a study to compare three different methods of meniscus repair. The specimens were randomly assigned to one of the three treatments: vertical suture, meniscus arrow, or FasT-Fix. They were evaluated on three different response variables: load at failure, stiffness, and displacement.
P. Borden, J. Nyland, D.N.M. Caborn, D. Pienkowski (2003), "Biomechanical Comparison of the FasT-Fix Meniscal Repair Suture System with Vertical Mattress Sutures and Meniscus Arrows," The American Journal of Sports Medicine, Vol. 31, #3, pp. 374-378
Dataset downloaded from http://www.stat.ufl.edu/~winner/data/meniscus.txt
Admissions to a mental health emergency room and full moons
A dataset with 36 observations on the following 3 variables.
Month |
Month of the year |
Moon |
Relationship to full moon: After, Before, or During |
Admission |
Number of emergency room admissions |
Some researchers in the early 1970s set out to study whether there is a "full-moon" effect on emergency room admissions at a mental health hospital. They separated the data over 12 months into rates before the full moon (mean number of patients seen 4-13 days before the full moon), during the full moon (the number of patients seen on the full moon day), and after the full moon (mean number of patients seen 4-13 days after the full moon).
Introduction to Mathematical Statistics and its Applications by Richard J. Larsen and Morris L. Marx. Prentice Hall: Englewood Cliffs, NJ, 1986.
The original discussion of the study is in Blackman, S., and Catalina, D. (1973). "The moon and the emergency room." Perceptual and Motor Skills 37, 624-626.
Body size and metabolic rate of Manduca sexta caterpillars
A dataset with 305 observations on the following 7 variables.
Computer |
ID number of the computer used to measure metabolic rate |
BodySize |
Size of the caterpillar (in grams) |
LogBodySize |
Log (base 10) of BodySize |
Instar |
Number from 1 (smallest) to 5 (largest) indicating stage of the caterpillar's life |
CO2ppm |
Carbon dioxide concentration (in ppm) |
Mrate |
Metabolic rate |
LogMrate |
Log (base 10) of metabolic rate |
Marisa Stearns collected and analyzed body size and metabolic rates for Manduca sexta caterpillars.
We thank Professor Itagaki and his research students for sharing these data.
Commute times for four cities
A data frame with 2000 observations on the following 3 variables.
City
Boston, Houston, Minneapolis, or Washington
Distance
Distance of commute (in miles)
Time
Time of commute (in minutes)
The data are distances (miles) and times (minutes) of daily commute (one-way) for random samples of 500 commuters in each of four cities (Boston, Houston, Minneapolis, Washington) in 2007. The random samples were taken from the Metropolitan Public Use File of the 2007 American Housing Survey.
2007 American Housing Survey https://www.census.gov/programs-surveys/ahs/data/2007/ahs-2007-public-use-file--puf-.html
Health services data for 83 metropolitan areas
A dataset with 83 observations on the following 16 variables.
City |
Name of the metropolitan area |
NumMDs |
Number of physicians |
RateMDs |
Number of physicians per 100,000 people |
NumHospitals |
Number of community hospitals |
NumBeds |
Number of hospital beds |
RateBeds |
Number of hospital beds per 100,000 people |
NumMedicare |
Number of Medicare recipients in 2003 |
PctChangeMedicare |
Percent change in Medicare recipients (2000 to 2003) |
MedicareRate |
Number of Medicare recipients per 100,000 people |
SSBNum |
Number of Social Security recipients in 2004 |
SSBRate |
Number of Social Security recipients per 100,000 people |
SSBChange |
Percent change in Social Security recipients (2000 to 2004) |
NumRetired |
Number of retired workers |
SSINum |
Number of Supplemental Security Income recipients in 2004 |
SSIRate |
Number of Supplemental Security Income recipients per 100,000 people |
SqrtMDs |
Square root of number of physicians |
The U.S. Census Bureau regularly collects information for many metropolitan areas in the United States, including data on number of physicians and number (and size) of hospitals. This dataset has such information for 83 different metropolitan areas.
This dataset is in the first edition, but was replaced by CountyHealth in the second edition.
U.S. Census Bureau: 2006 State and Metropolitan Area Data Book (Table B-6)
http://www.census.gov/prod/2006pubs/smadb/smadb-06.pdf
Effects of transcranial magnetic stimulation (TMS) on migraine headaches
A data frame with 2 observations on the following 4 variables.
Group
Treatment group (Placebo or TMS)
Yes
Count of number of patients that were pain-free in each group
No
Count of number of patients that had pain in each group
Trials
Number of patients in each group
A study investigated whether a handheld device that sends a magnetic pulse into a person's head might be an effective treatment for migraine headaches. Researchers recruited 200 subjects who suffered from migraines and randomly assigned them to receive either the TMS (transcranial magnetic stimulation) treatment or a sham (placebo) treatment from a device that did not deliver any stimulation. Subjects were instructed to apply the device at the onset of migraine symptoms and then assess how they felt two hours later. This dataset is a two-way table of the results.
This dataset was called TMS in the first edition.
Based on results in R. B. Lipton, et al. (2010), "Single-pulse Transcranial Magnetic Stimulation for Acute Treatment of Migraine with Aura: A Randomised, Double-blind, Parallel-group, Sham-controlled Trial," Lancet Neurology, 9(4):373-380.
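The structure of this two-way table (Yes, No, and Trials for each group) can be sketched as below, together with the pain-free proportion per group. The counts are hypothetical placeholders and are not the study's results; PropYes is an illustrative name that is not a variable in the dataset.

```python
# Hypothetical counts only; these are NOT the values reported in the study.
table = [
    {"Group": "Placebo", "Yes": 20, "No": 80},
    {"Group": "TMS", "Yes": 40, "No": 60},
]

for row in table:
    row["Trials"] = row["Yes"] + row["No"]       # patients in the group
    row["PropYes"] = row["Yes"] / row["Trials"]  # proportion pain-free

# Difference in pain-free proportions (TMS minus Placebo)
diff = table[1]["PropYes"] - table[0]["PropYes"]
print([(r["Group"], r["Trials"], r["PropYes"]) for r in table])
print(diff)
```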
Attitudes towards ethics of a famous Milgram experiment
A dataset with 37 observations on the following 2 variables.
Results |
Treatment group: Actual, Complied, or Refused |
Score |
Ethical score from 1 (not at all ethical) to 9 (completely ethical) |
One of the most famous and most disturbing psychological studies of the twentieth century took place in the laboratory of Stanley Milgram at Yale University. Milgram's subjects were asked to monitor the answers of a "learner" and to push a button to deliver shocks whenever the learner gave a wrong answer. The more wrong answers, the more powerful the shock.
Even Milgram himself was surprised by the results: Every one of his subjects ended up delivering what they thought was a dangerous 300-volt shock to a slow "learner" as punishment for repeated wrong answers. Even though the "shocks" were not real and the "learner" was in on the secret, the results triggered a hot debate about ethics and experiments with human subjects.
To study attitudes on this issue, Harvard graduate student Mary Ann DiMatteo conducted a randomized comparative experiment. Her subjects were 37 high school teachers who did not know about the Milgram study. Using chance, DiMatteo assigned each teacher to one of three treatment groups:
Group 1: Actual results. Each subject in this group read a description of Milgram's study, including the actual results that every subject delivered the highest possible "shock."
Group 2: Many complied. Each subject read the same description given to the subjects in Group 1, except that the actual results were replaced by fake results, that many but not all subjects complied.
Group 3: Most refused. For subjects in this group, the fake results said that most subjects refused to comply.
After reading the description, each subject was asked to rate the study according to how ethical they thought it was, from 1 (not at all ethical) to 9 (completely ethical).
"An experimental study of attitudes toward deception" by Mary Ann DiMatteo. Unpublished manuscript, Department of Psychology and Social Relations, Harvard University (1972).
Data for Major League Baseball teams from the 2007 regular season
A dataset with 30 observations on the following 21 variables.
Team |
Name of the team |
League |
League: AL or NL |
Wins |
Number of wins for the season (out of 162 games) |
Losses |
Number of losses for the season |
WinPct |
Proportion of games won (Wins/162) |
BattingAvg |
Team batting average |
Runs |
Number of runs scored |
Hits |
Number of hits |
HR |
Number of home runs hit |
Doubles |
Number of doubles hit |
Triples |
Number of triples hit |
RBI |
Number of runs batted in |
SB |
Number of stolen bases |
OBP |
On base percentage |
SLG |
Slugging percentage |
ERA |
Earned run average (earned runs allowed per 9 innings) |
HitsAllowed |
Number of hits against the team |
Walks |
Number of walks allowed |
StrikeOuts |
Number of strikeouts (by the team's pitchers) |
Saves |
Number of games saved (by the team's pitchers) |
WHIP |
Number of walks and hits per inning pitched |
Data for all 30 Major League Baseball (MLB) teams for the 2007 regular season. This includes team batting statistics (BattingAvg through SLG) and team pitching statistics (ERA through WHIP).
Updated to MLBStandings2016 in second edition.
Data downloaded from baseball-reference.com:
http://www.baseball-reference.com/leagues/MLB/2007-standings.shtml
http://www.baseball-reference.com/leagues/MLB/2007.shtml
Major League Baseball (MLB) standings and team statistics for the 2016 season
A data frame with 30 observations on the following 21 variables.
Team
Team name
League
AL=American or NL=National
Wins
Number of wins for the season (out of 162 games)
Losses
Number of losses for the season
WinPct
Proportion of games won
BattingAverage
Team batting average
Runs
Number of runs scored
Hits
Number of hits
HR
Number of home runs hit
Doubles
Number of doubles hit
Triples
Number of triples hit
RBI
Number of runs batted in
SB
Number of stolen bases
OBP
On base percentage
SLG
Slugging percentage
ERA
Earned run average (earned runs allowed per 9 innings)
HitsAllowed
Number of hits against the team
Walks
Number of walks allowed
StrikeOuts
Number of strikeouts (by the team's pitchers)
Saves
Number of games saved (by the team's pitchers)
WHIP
Number of walks and hits per inning pitched
Data for all 30 Major League Baseball (MLB) teams for the 2016 regular season. This includes team batting statistics (BattingAverage through SLG) and team pitching statistics (ERA through WHIP).
Data downloaded from baseball-reference.com:
http://www.baseball-reference.com/leagues/MLB/2016-standings.shtml
http://www.baseball-reference.com/leagues/MLB/2016.shtml
Body size and eggs produced for a species of moths
A dataset with 39 observations on the following 2 variables.
BodyMass |
Log of body size measured in grams |
Eggs |
Number of eggs present |
Researchers were interested in an association between body size and the number of eggs produced by a species of moths.
We thank Professor Itagaki and his students for sharing these data from experiments on Manduca sexta.
Effects of altering serotonin levels on social interactions of mice
A data frame with 48 observations on the following 3 variables.
Contacts
Number of social contacts the mouse had during the experiment
Sex
F=female or M=male
Genotype
Minus, Mixed, or Plus (see description below)
Serotonin is a chemical that influences mood balance in humans. But how does it affect mice? Scientists genetically altered mice by "knocking out" the expression of a gene, tryptophan hydroxylase 2 (Tph2), that regulates serotonin production. With careful breeding, the scientists produced three types of mice that we label as “Minus” for Tph2-/-, “Plus” for Tph2+/+, “Mixed” for Tph2+/-. The variable Genotype records Minus/Plus/Mixed. The variable Contacts is the number of social contacts that a mouse had with other mice during an experiment and the variable Sex is “M” for males and “F” for females.
Beis D, Holzwarth K, Flinders M, Bader M, Wohr M, Alenina N., (2015) "Brain serotonin deficiency leads to social communication deficits in mice," Biol. Lett. 11:20150057.
http://dx.doi.org/10.1098/rsbl.2015.0057
To get the data from the above link, click on the "Figures and Data" tab, then click on the "Juvenile SocInter Behavior Data" link. The downloaded file is messy and needs a great deal of cleaning to produce this dataset.
Estimates of 45 seconds with different music playing
A data frame with 60 observations on the following 6 variables.
MusicBg
Music playing in the background (no or yes)
Subject
Code for each subject (subj1 through subj20)
Sex
Subject's sex (f=female or m=male)
TimeGuess
Subject's time estimating 45 seconds (in seconds)
Music
Type of music (calm, control, or upbeat)
Accuracy
Absolute value of TimeGuess minus 45
Participants were asked to judge when 45 seconds had passed in silence (control), while listening to an upbeat song (Metropolis, by David Guetta and Nicky Romero), and while listening to a calm song (Bach's Das Wohltemperierte Klavier, Prelude in C Major). The order in which the three conditions were experienced was randomized for each participant. Time until subject guessed 45 seconds had elapsed (TimeGuess) and the magnitude of the difference from 45 (Accuracy) were recorded.
Data collected by Ksenia Vlasov at Oberlin College.
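The relationship between TimeGuess and Accuracy described above can be sketched in a few lines. The package itself is for R; Python is used here only for illustration, and the guesses below are made-up values, not rows from the dataset.

```python
# Hypothetical guesses (in seconds); not values from the MusicTime data.
TARGET_SECONDS = 45

def accuracy(time_guess):
    """Absolute difference between a guess and the 45-second target."""
    return abs(time_guess - TARGET_SECONDS)

guesses = [41.0, 45.0, 52.5]
accuracies = [accuracy(g) for g in guesses]
print(accuracies)  # [4.0, 0.0, 7.5]
```

Smaller values of Accuracy mean a guess closer to 45 seconds, regardless of whether the subject stopped early or late.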
Data from births in North Carolina in 2001
A dataset with 1450 observations on the following 15 variables.
ID |
Patient ID code |
Plural |
1 =single birth, 2 =twins, 3 =triplets |
Sex |
Sex of the baby 1 =male 2 =female |
MomAge |
Mother's age (in years) |
Weeks |
Completed weeks of gestation |
Marital |
Marital status: 1 =married or 2 =not married |
RaceMom |
Mother's race: 1 =white, 2 =black, 3 =American Indian, 4 =Chinese, 5 =Japanese, 6 =Hawaiian, 7 =Filipino, or 8 =Other Asian or Pacific Islander |
HispMom |
Hispanic origin of mother: C =Cuban, M =Mexican, N =not Hispanic, O =Other Hispanic, P =Puerto Rico, S =Central/South America |
Gained |
Weight gained during pregnancy (in pounds) |
Smoke |
Smoker mom? 1 =yes or 0 =no |
BirthWeightOz |
Birth weight in ounces |
BirthWeightGm |
Birth weight in grams |
Low |
Indicator for low birth weight, 1 =2500 grams or less |
Premie |
Indicator for premature birth, 1 =36 weeks or sooner |
MomRace |
Mother's race: black, hispanic, other, or white |
This dataset contains data on a sample of 1450 birth records that statistician John Holcomb selected from the North Carolina State Center for Health and Environmental Statistics.
Thanks to John Holcomb at Cleveland State University for sharing these data.
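As a rough illustration of how the two indicator variables follow from the definitions in the table (Low is 1 at 2500 grams or less, Premie is 1 at 36 weeks or sooner), here is a minimal Python sketch. The function names and sample values are hypothetical and are not part of the package.

```python
def low_indicator(birth_weight_gm):
    """1 if birth weight is 2500 grams or less, otherwise 0."""
    return 1 if birth_weight_gm <= 2500 else 0

def premie_indicator(weeks):
    """1 if gestation is 36 completed weeks or sooner, otherwise 0."""
    return 1 if weeks <= 36 else 0

# Hypothetical records: (BirthWeightGm, Weeks)
records = [(2500, 36), (3400, 39), (2100, 40)]
flags = [(low_indicator(g), premie_indicator(w)) for g, w in records]
print(flags)  # [(1, 1), (0, 0), (1, 0)]
```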
Standings for National Football League teams in 2007
A dataset with 32 observations on the following 10 variables.
Team |
Team name |
Conference |
Conference: AFC or NFC |
Division |
Division within conference: ACE, ACN, ACS, ACW, NCE, NCN, NCS, NCW |
Wins |
Number of wins (out of 16 games) |
Losses |
Number of losses |
WinPct |
Proportion of games won (Wins/16) |
PointsFor |
Total points scored by the team |
PointsAgainst |
Total points scored against the team |
NetPts |
PointsFor minus PointsAgainst |
TDs |
Number of touchdowns scored by the team |
Data for all 32 National Football League (NFL) teams for the 2007 regular season.
Updated to NFLStandings2016 in the second edition.
Data downloaded from www.nfl.com
Standings and team statistics for National Football League (NFL) teams in the 2016 season
A data frame with 32 observations on the following 11 variables.
Team
Team name
Wins
Wins in the 2016 regular season (out of 16 games)
Losses
Losses in the 2016 regular season
Ties
Ties in the 2016 regular season (ties are very rare in the NFL)
WinPct
Winning percentage = (Wins+0.5*Ties)/16 games
PointsFor
Points scored
PointsAgainst
Points allowed
NetPts
Points scored minus Points allowed
YardsFor
Offensive yards gained by the team
YardsAgainst
Offensive yards against the team
TDs
Touchdowns scored
Standings for the 2016 regular season of the National Football League (NFL), along with points scored and allowed for each team in its 16 games.
Data downloaded from:
http://www.pro-football-reference.com/years/2016/
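The WinPct and NetPts definitions listed above are simple arithmetic. The sketch below, in Python for illustration only, applies them to a made-up team record; the numbers are not taken from the dataset.

```python
GAMES = 16  # regular-season games per team in 2016

def win_pct(wins, ties):
    """Winning percentage, counting each tie as half a win."""
    return (wins + 0.5 * ties) / GAMES

def net_pts(points_for, points_against):
    """Points scored minus points allowed."""
    return points_for - points_against

# Hypothetical team: 7 wins, 8 losses, 1 tie
print(win_pct(7, 1))      # 0.46875
print(net_pts(354, 362))  # -8
```

With no ties, the formula reduces to Wins/16, which is the definition used for the 2007 standings.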
Characteristics of nursing homes in New Mexico.
A dataset with 52 observations on the following 7 variables.
Beds |
Number of beds in the nursing home |
InPatientDays |
Annual medical in-patient days (in hundreds) |
AllPatientDays |
Annual total patient days (in hundreds) |
PatientRevenue |
Annual patient care revenue (in hundreds of dollars) |
NurseSalaries |
Annual nursing salaries (in hundreds of dollars) |
FacilitiesExpend |
Annual facilities expenditure (in hundreds of dollars) |
Rural |
1=rural or 0=non-rural |
The data were collected by the Department of Health and Social Services of the State of New Mexico and cover 52 of the 60 licensed nursing facilities in New Mexico in 1988.
Downloaded from DASL at http://lib.stat.cmu.edu/DASL/Datafiles/Nursingdat.html
Howard L. Smith, Niell F. Piland, and Nancy Fisher, "A Comparison of Financial Performance, Organizational Characteristics, and Management Strategy Among Rural and Urban Nursing Facilities," Journal of Rural Health, Winter 1992, pp 27-40.
Experiment to measure the effect of ultrasound on desorbing oil from sand
A data frame with 40 observations on the following 4 variables.
Salt
Type of water (1=salt water or 0=distilled water)
Ultra
Amount of time each sample was exposed to ultrasound (5 or 10 minutes)
Oil
Amount of oil in the sample (5 ml or 10 ml)
Diff
Difference in the amount of oil removed between the ultrasound run and an equivalent control run with no ultrasound (Diff = Treatment - Control)
This data set is the result of a science fair experiment run by a high school student. The basic question was whether exposing sand with oil in it (think oil spill) to ultrasound could help the oil desorb from it better than from sand that was not exposed to ultrasound. There were two levels of ultrasound tested (5 minutes and 10 minutes) and two levels of oil (5 ml and 10 ml). There was also a question of whether exposure to salt water or fresh water made a difference, so half the samples had salt water and the others distilled water. Each combination of factor levels was replicated 5 times. There was also an equivalent number of control observations run, with all factors the same but without any exposure to ultrasound. Each experimental run was paired with an appropriate control run, and the response variable is the difference in the amount of oil removed between the experimental run and the control run.
Experiment run by Las Vegas high school student Chris Mathews for a science fair project in spring 2016.
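To see how the design described above yields 40 observations, the following Python sketch (for illustration; the package is for R) enumerates the 2 x 2 x 2 factor combinations with 5 replicates each and computes a response as Diff = Treatment - Control. The oil amounts passed to `diff` are hypothetical.

```python
from itertools import product

SALT_LEVELS = [0, 1]     # 0 = distilled water, 1 = salt water
ULTRA_LEVELS = [5, 10]   # minutes of ultrasound exposure
OIL_LEVELS = [5, 10]     # ml of oil in the sample
REPLICATES = 5

# One entry per experimental run: (Salt, Ultra, Oil)
runs = [
    combo
    for combo in product(SALT_LEVELS, ULTRA_LEVELS, OIL_LEVELS)
    for _ in range(REPLICATES)
]
print(len(runs))  # 40

def diff(treatment, control):
    """Oil removed in the ultrasound run minus oil removed in its paired control."""
    return treatment - control

print(diff(2.5, 1.5))  # 1.0 (hypothetical amounts)
```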
Measurements of the pesticide fenthion in olive oil over time
A dataset with 18 observations on the following 7 variables.
SampleNumber |
Code (1-6) for sample of olive oil |
Group |
Code for group: 1 or 2 |
Day |
Time (in days) when sample was measured: 0, 281, or 365 |
Fenthion |
Amount of fenthion (pesticide) |
FenthionSulphoxide |
Amount of fenthion sulphoxide |
FenthionSulphone |
Amount of fenthion sulphone |
Time |
Code (0, 3, or 4) for the number of days |
Fenthion is a pesticide used against the olive fruit fly in olive groves. It is toxic to humans so it is important that there be no residue left on the fruit or in olive oil that will be consumed. One theory was that if there is residue of the pesticide left in the olive oil, it would dissipate over time. Chemists set out to test that theory by taking a random sample of small amounts of olive oil with fenthion residue and measuring the amount of fenthion in the oil at three different times over the year - day 0, day 281 and day 365.
Data provided by Rosemary Roberts and discussed in "Persistence of fenthion residues in olive oil" by Chaido Lentza-Rizos, Elizabeth J. Avramides, and Rosemary A. Roberts in Pest Management Science, Vol. 40, Issue 1, Jan. 1994, pp. 63-69.
Number of damaged O-rings on space shuttle launches and launch temperature
A dataset with 24 observations on the following 2 variables.
Temp |
Code for temperature (in degrees F): Above65 or Below65 |
Failures |
Number of O-ring failures |
The space shuttle Challenger exploded shortly after liftoff in January 1986. The subsequent investigation focused on the failure of O-ring seals, which allowed liquid hydrogen and oxygen to mix and explode. These failures might be related to temperature at the launch site, which was near freezing (32 degrees F) on that day. This dataset shows the number of O-ring failures on previous shuttle launches, along with an indicator for whether the temperature was above or below 65 degrees F.
Data can be found in "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure" by Siddhartha R. Dalal, Edward B. Fowlkes, and Bruce Hoadley in Journal of the American Statistical Association, Vol. 84, No. 408 (Dec. 1989), pp. 945-957.
Survey of college students to look at factors related to having overdrawn a checking account.
A dataset with 450 observations on the following 4 variables.
Age |
Age of the student (in years) |
Sex |
0 =male or 1 =female |
DaysDrink |
Number of days drinking alcohol (in past 30 days) |
Overdrawn |
Has student overdrawn a checking account? 0 =no or 1 =yes |
Researchers conducted a survey of 450 undergraduates in large introductory courses at either Mississippi State University or the University of Mississippi. There were close to 150 questions on the survey, but only four of these variables are included in this dataset. (You can consult the paper to learn how the variables beyond these 4 affect the analysis.) The primary interest for the researchers was factors relating to whether or not a student has ever overdrawn a checking account.
Renamed as CreditRisk in second edition.
Worthy S.L., Jonkman J.N., Blinn-Pike L. (2010), "Sensation-Seeking, Risk-Taking, and Problematic Financial Behaviors of College Students," Journal of Family and Economic Issues, 31: 161-170
Comparing methods for measuring the size of oysters
A data frame with 30 observations on the following 5 variables.
ID
ID number of each oyster
Weight
Weight (in grams)
Volume
Volume (in cubic centimeters)
ThreeD
Measurement from a 3D system (pixels)
TwoD
Measurement from a 2D cross-section (pixels)
In 2001, engineers at an R&D lab, Agri-Tech, Inc., in Woodstock, Virginia, designed a 3-D system that they hoped would improve on the existing 2-D system for measuring the size of oysters. The 3-D system used computer scanning to estimate an oyster's volume, whereas the old 2-D system estimated a cross-sectional area. The data show the results from both systems, as well as the actual weight and volume of each oyster used in calibration.
Data found at JSE data archive: http://ww2.amstat.org/publications/jse/jse_data_archive.htm with the filenames of 30oysters. Contributors are G. Andy Chang, G. Jay Kerns, D. J. Lee, and Gary L. Stanek.
Original article is: Lee, D., Lane, R., and Chang, G., (2001) "Three-dimension Reconstruction for High-speed Volume Measurement," Proceedings of the International Society for Optical Engineering, Machine Vision and Three-Dimensional Imaging Systems for Inspection and Metrology, Volume 4189, p.258-267.
Votes for George Bush and Pat Buchanan in Florida counties for the 2000 U.S. presidential election
A dataset with 67 observations on the following 3 variables.
County |
Name of the Florida county |
Buchanan |
Number of votes for Pat Buchanan |
Bush |
Number of votes for George Bush |
The race for the presidency of the United States in the fall of 2000 was very close, with the electoral votes from Florida determining the outcome. In the disputed final tally in Florida, George W. Bush won by just 537 votes over Al Gore, out of almost 6 million votes cast. About 2.3% of the votes cast in Florida were awarded to other candidates. One of those other candidates was Pat Buchanan, who did much better in Palm Beach County than he did anywhere else. Palm Beach County used a unique "butterfly ballot" that had candidate names on either side of the page with "chads" to be punched in the middle. This non-standard ballot seemed to confuse some voters, who punched votes for Buchanan that may have been intended for a different candidate. This dataset shows the number of votes for Bush and Buchanan in each Florida county.
Florida county data for the 2000 presidential election can be found at
http://election.dos.state.fl.us/elections/resultsarchive/Index.asp?ElectionDate=11/7/00
Monthly traffic (in 1,000's) across the Peace Bridge between Canada and the U.S.
A data frame with 156 observations on the following 4 variables.
Year
Year (2003 to 2015)
Month
Month (1 to 12)
Traffic
Vehicles (in 1,000's)
t
Time frame (1 to 156)
Monthly traffic (in thousands of vehicles) across the Peace Bridge between the U.S. and Canada near Niagara Falls between January 2003 and December 2015. Note PeaceBridge2012 has only the last four years of this series.
http://www.peacebridge.com/index.php/historical-traffic-statistics/yearly-volumes
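The time index t can be reconstructed from Year and Month. The sketch below (Python, for illustration only) assumes that t simply counts months consecutively from January of the first year, which is consistent with 13 years x 12 months = 156 rows; the function name and the `start_year` parameter are hypothetical.

```python
def time_index(year, month, start_year=2003):
    """Consecutive month number, with January of start_year as t = 1."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return 12 * (year - start_year) + month

first_t = time_index(2003, 1)
last_t = time_index(2015, 12)
# Same idea for the shorter PeaceBridge2012 series, which starts in 2012
last_t_2012 = time_index(2015, 12, start_year=2012)
print(first_t, last_t, last_t_2012)  # 1 156 48
```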
Monthly traffic (in 1,000's) across the Peace Bridge between Canada and the U.S.
A data frame with 48 observations on the following 4 variables.
Year
Year (2012 to 2015)
Month
Month (1 to 12)
Traffic
Vehicles (in 1,000's)
t
Time frame (1 to 48)
Monthly traffic (in thousands of vehicles) across the Peace Bridge between the U.S. and Canada near Niagara Falls between January 2012 and December 2015. Note PeaceBridge2003 has similar data starting in 2003.
http://www.peacebridge.com/index.php/historical-traffic-statistics/yearly-volumes
Daily walking amounts recorded on a personal pedometer from September-December 2011
A dataset with 68 observations on the following 8 variables.
Steps |
Total number of steps for the day |
Moderate |
Number of steps at a moderate walking speed |
Min |
Number of minutes walking at a moderate speed |
kcal |
Number of calories burned walking at a moderate speed |
Mile |
Total number of miles walked |
Rain |
Type of weather (rain or shine) |
Day |
Day of the week (U =Sunday, M =Monday, T =Tuesday, W =Wednesday, R =Thursday, F =Friday, S =Saturday) |
DayType |
Coded as Weekday or Weekend |
A statistics professor regularly keeps a pedometer in his pocket. It records not only the number of steps taken each day, but also the number of steps taken at a moderate pace, the number of minutes walked at a moderate pace, and the number of miles total that he walked. He also added to the data set the day of the week, whether it was rainy, sunny, or cold (on sunny days he often biked, but on rainy or cold days he did not), and whether it was a weekday or weekend.
One of the Stat2 authors
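As a small illustration of how DayType relates to the single-letter Day codes, the Python sketch below maps each code to Weekday or Weekend. It assumes that Saturday and Sunday are the only weekend days; the function name is hypothetical and not part of the package.

```python
WEEKEND_CODES = {"U", "S"}                 # Sunday, Saturday
WEEKDAY_CODES = {"M", "T", "W", "R", "F"}  # Monday through Friday

def day_type(day_code):
    """Return 'Weekend' or 'Weekday' for a one-letter day code."""
    if day_code in WEEKEND_CODES:
        return "Weekend"
    if day_code in WEEKDAY_CODES:
        return "Weekday"
    raise ValueError(f"unknown day code: {day_code!r}")

print([day_type(d) for d in "UMTWRFS"])
```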
Size of perch caught in a Finnish lake
A dataset with 56 observations on the following 4 variables.
Obs |
Observation number |
Weight |
Weight (in grams) |
Length |
Length (in centimeters) |
Width |
Width (in centimeters) |
This dataset comes from a sample of fish (perch) caught at Lake Laengelmavesi in Finland.
JSE Data Archive, http://www.amstat.org/publications/jse/jse_data_archive.htm, submitted by Juha Puranen.
Effects of additives to pig feed on weight gain
A dataset with 12 observations on the following 3 variables.
WgtGain |
Daily weight gain (hundredths of a pound over 1.00) |
Antibiotic |
Antibiotic in the feed? No or Yes |
B12 |
Vitamin B12 in the feed? No or Yes |
A scientist in Iowa was interested in additives to standard pig chow that might increase the rate at which the pigs gained weight. Two factors of interest were vitamin B12 and antibiotics. To perform the experiment, the scientist randomly assigned 12 pigs, three to each of the diet combinations (Antibiotic only, B12 only, both, and neither).
Data are found in Statistical Methods by George W. Snedecor and William G. Cochran (1967). Ames, IA: The Iowa State University Press.
Original source is Iowa Agricultural Experiment Station (1952). Animal Husbandry Swine Nutrition Experiment No. 577.
Data from pine seedlings planted in 1990
A dataset with 1000 observations on the following 15 variables.
Row |
Row number in pine plantation |
Col |
Column number in pine plantation |
Hgt90 |
Tree height at time of planting (cm) |
Hgt96 |
Tree height in September 1996 (cm) |
Diam96 |
Tree trunk diameter in September 1996 (cm) |
Grow96 |
Leader growth during 1996 (cm) |
Hgt97 |
Tree height in September 1997 (cm) |
Diam97 |
Tree trunk diameter in September 1997 (cm) |
Spread97 |
Widest lateral spread in September 1997 (cm) |
Needles97 |
Needle length in September 1997 (mm) |
Deer95 |
Type of deer damage in September 1995: 0 = none, 1 = browsed |
Deer97 |
Type of deer damage in September 1997: 0 = none, 1 = browsed |
Cover95 |
Thorny cover in September 1995: 0 = none; 1 = some; 2 = moderate; 3 = lots |
Fert |
Indicator for fertilizer: 0 = no, 1 = yes |
Spacing |
Distance (in feet) between trees (10 or 15) |
This dataset contains data from an experiment conducted by the Department of Biology at Kenyon College at a site near the campus in Gambier, Ohio. In April 1990, student and faculty volunteers planted 1000 white pine (Pinus strobus) seedlings at the Brown Family Environmental Center. These seedlings were planted in two grids, distinguished by 10- and 15-foot spacings between the seedlings. Several variables were measured and recorded for each seedling over time (in 1990, 1996, and 1997).
Thanks to the Kenyon College Department of Biology for sharing these data.
Dopamine levels with different amounts of phenylalanine in diets
A data frame with 20 observations on the following 4 variables.
Subject
Initials to identify each subject
Diet
Level of phenylalanine in diet (Low or Normal)
DietControl
Ability to follow prescribed diet (Good or Poor)
Y
Concentration of dopamine (micrograms per milligram of creatinine)
Phenylketonuria (PKU) is an enzyme deficiency that keeps a person from being able to synthesize enough dopamine. The amino acid phenylalanine inhibits the enzyme needed to synthesize dopamine, and so to some extent, a diet low in phenylalanine can moderate the symptoms of PKU. In short, less phenylalanine in the diet should lead to more dopamine in the brain. The dopamine level for each patient was measured after a normal diet and after a week on a low phenylalanine diet.
Krause, Halminski, McDonald, Dembure, Salvo, Freides, and Elsas (1985) "Biochemical and Neuropsychological Effects of Elevated Plasma Phenylalanine in Patients with Treated Phenylketonuria," J. of Clinical Investigation, Volume 75, January 1985, 40-48
Several of the values were altered slightly in ways that would not change the analysis except to simplify the arithmetic.
Survey of political activity for Grinnell College students
A dataset with 59 observations on the following 9 variables.
Year |
Class year (1 to 4) |
Sex |
0 =male or 1 =female |
Vote |
Voting status: 0 =not eligible, 1 =eligible/not registered, 2 =registered/didn't vote, 3 =voted |
Paper |
Read news (per week): 0 =never, 1 =less than once, 2 =once, 3 =2 or 3 times, 4 =daily |
Edit |
Read editorial page? 0 =no or 1 =yes |
TV |
Watch TV news: 0 =never, 1 =less than once, 2 =once, 3 =2 or 3 times, 4 =daily |
Ethics |
Politics should be ruled by: 1 =ethical considerations to 5 =practical power |
Inform |
How informed are you about politics? 1 =uninformed to 5 =very well informed |
Participate |
Missing if Vote=0, 0 if Vote=1 or 2, 1 if Vote=3 |
Students Jennifer Wolfson and Meredith Goulet conducted a survey in the spring of 1992 of Grinnell College students to ascertain patterns of political behavior. They took a simple random sample of 60 students who were U.S. citizens and conducted phone interviews. Using several "call backs" they obtained 59 responses.
Student survey at Grinnell College
Polls for 2008 U.S. presidential election
A dataset with 102 observations on the following 11 variables.
PollTaker |
Polling organization |
PollDates |
Dates the poll data were collected |
MidDate |
Midpoint of the polling period |
Days |
Number of days after August 28th (end of Democratic convention) |
n |
Sample size for the poll |
Pop |
A =all, LV =likely voters, RV =registered voters |
McCain |
Percent supporting John McCain |
Obama |
Percent supporting Barack Obama |
Margin |
Obama percent minus McCain percent |
Charlie |
Indicator for polls after Charlie Gibson interview with VP candidate Sarah Palin (9/11) |
Meltdown |
Indicator for polls after Lehman Brothers bankruptcy (9/15) |
The file Pollster08 contains data from 102 polls that were taken during the 2008 U.S. Presidential campaign. These data include all presidential polls reported on the internet site pollster.com that were taken between August 29th, when John McCain announced that Sarah Palin would be his running mate as the Republican nominee for vice president, and the end of September.
Downloaded from pollster.com
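Two of the derived variables, Days and Margin, follow directly from the definitions in the table. The Python sketch below (for illustration; the package is for R) computes them for a made-up poll. The function names and the poll values are hypothetical.

```python
from datetime import date

CONVENTION_END = date(2008, 8, 28)  # end of the Democratic convention

def days_after_convention(mid_date):
    """Number of days from August 28, 2008 to the poll's midpoint date."""
    return (mid_date - CONVENTION_END).days

def margin(obama_pct, mccain_pct):
    """Obama percent minus McCain percent (positive favors Obama)."""
    return obama_pct - mccain_pct

# Hypothetical poll with midpoint September 15, 2008
print(days_after_convention(date(2008, 9, 15)))  # 18
print(margin(49, 44))                            # 5
```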
Unpopped kernels in bags of microwave popcorn
A dataset with 12 observations on the following 3 variables.
Unpopped |
Number of unpopped kernels (adjusted for size difference) |
Brand |
Orville or Seaway |
Trial |
Trial number |
Two students, Lara and Lisa, conducted an experiment to compare Orville Redenbacher's Light Butter Flavor vs. Seaway microwave popcorn. They made 12 batches of popcorn, 6 of each type, cooking each batch for four minutes. They noted that the microwave oven seemed to get warmer as they went along so they kept track of six trials and randomly chose which brand would go first for each trial. For a response variable they counted the number of unpopped kernels and then adjusted the count for Seaway for having more ounces per bag of popcorn (3.5 vs 3.0).
Student project
Compare prices for Porsche and Jaguar cars offered for sale at an internet site
A dataset with 60 observations on the following 5 variables.
Car |
Car model: Jaguar or Porsche |
Price |
Price (in $1,000's) |
Age |
Age of the car (in years) |
Mileage |
Previous miles driven (in 1,000's) |
Porsche |
Indicator for Porsche (1) or Jaguar (0) |
Two students collected samples of Porsche and Jaguar cars that were offered for sale at an internet site. In addition to asking price, they recorded the model year (converting to age) and mileage of each advertised car.
Student project data collected from autotrader.com in Spring 2007.
Prices for Porsche cars offered for sale at an internet site.
A dataset with 30 observations on the following 3 variables.
Price |
Asking price for the car (in $1,000's) |
Age |
Age of the car (in years) |
Mileage |
Previous miles driven (in 1,000's) |
A student was interested in prices for used Porsche sports cars being sold on the internet. He selected a random sample of 30 Porsches from the ones being advertised at autotrader.com. For each car he recorded the asking price, mileage, and model year (which he converted to age).
This dataset was replaced by AccordPrice for second edition.
Data collected for a student project from autotrader.com in February 2007.
Pulse rates before and after exercise for a sample of statistics students
A dataset with 232 observations on the following 7 variables.
Active |
Pulse rate (beats per minute) after exercise |
Rest |
Resting pulse rate (beats per minute) |
Smoke |
1 =smoker or 0 =nonsmoker |
Sex |
1 =female or 0 =male |
Exercise |
Typical hours of exercise (per week) |
Hgt |
Height (in inches) |
Wgt |
Weight (in pounds) |
Students in a Stat2 class recorded resting pulse rates (in class), did three "laps" walking up/down a nearby set of stairs, and then measured their pulse rate after the exercise. They provided additional information about height, weight, exercise, and smoking habits via a survey.
Data compiled over several semesters from students taking a Stat2 course.
Putting results for a golfing statistician
A dataset with 587 observations on the following 2 variables.
Length |
Length of the putt (in feet) |
Made |
1 =made the putt or 0 =missed the putt |
A statistician golfer kept careful records of every putt he attempted when playing golf, recording the length of the putt and whether or not he was successful in making the putt. This dataset has one case for each of the 587 attempted putts. A different form of the same data (Putts2) accumulates counts of makes and misses for each putt length.
Personal observations by one of the Stat2 authors
Putting results for a golfing statistician (by length of the putts)
A dataset with 5 observations on the following 4 variables.
Length |
Length of the attempted putt (in feet) |
Made |
Number of putts made at this length |
Missed |
Number of putts missed at this length |
Trials |
Total number of putts attempted at this length |
A statistician golfer kept careful records of every putt he attempted when playing golf, recording the length of the putt and whether or not he was successful in making the putt. For each different length, this dataset records the number of putts made, missed, and the total number of attempts from that length. A similar dataset, Putts1, has one case for each of the 587 attempted putts, showing the length and outcome.
Personal observations by one of the Stat2 authors
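The relationship between the one-row-per-putt form (Putts1) and the counts-per-length form (Putts2) can be sketched as a simple aggregation. The Python below is illustrative only, and the six putts are made up rather than taken from the data.

```python
from collections import defaultdict

# Hypothetical putts in the Putts1 layout: (Length, Made), Made is 1 or 0
putts = [(3, 1), (3, 1), (3, 0), (4, 1), (4, 0), (4, 0)]

made = defaultdict(int)
missed = defaultdict(int)
for length, result in putts:
    if result == 1:
        made[length] += 1
    else:
        missed[length] += 1

# One row per length, in the Putts2 layout
table = [
    {
        "Length": length,
        "Made": made[length],
        "Missed": missed[length],
        "Trials": made[length] + missed[length],
    }
    for length in sorted(set(made) | set(missed))
]
for row in table:
    print(row)
```

Trials always equals Made plus Missed, so the aggregated table carries the same information as the individual records.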
Hypothetical putting results for a golfing statistician
A data frame with 5 observations on the following 4 variables.
Length
Length of the attempted putt (in feet)
Made
Number of putts made at this length
Missed
Number of putts missed at this length
Trials
Total number of putts attempted at this length
This is a hypothetical revision of the table of putting success in Putts2 that helps demonstrate overdispersion.
Modified from personal observations by one of the Stat2 authors.
Demographics and a measurement of racial animus in cities based on Google searches
A data frame with 196 observations on the following 7 variables.
MediaMarket
City (State)
Age65Plus
Percentage 65 and older
BachPlus
Percentage with a bachelor's degree
Black
Percentage of African-Americans
Hispanic
Percentage of Hispanics
ObamaKerry
Percentage of vote won by Obama in 2008 minus Kerry percentage in 2004
Animus
Measurement (0-250) of racial animus
Professor Seth Stephens-Davidowitz studies the level of racial animus across different areas in America by measuring the percent of Google search queries that include racially charged language. A measurement, Animus, is derived from his algorithm and is scaled to be between 0 (low racial animus) and 250 (high racial animus). The dataset includes those values along with demographic information about each media market.
Chae DH, Clouston S, Hatzenbuehler ML, Kramer MR, Cooper HLF, Wilson SM, et al. (2015) "Association between an Internet-Based Measure of Area Racism and Black Mortality," PLoS ONE 10(4): e0122963. doi:10.1371/journal.pone.0122963
Experiment comparing the ability of twins (one urban, one rural) to clear airborne radioactive particles from their lungs
A data frame with 30 observations on the following 3 variables.
TwinPair
Identifies the twin pairs (1 to 15)
Env
Residential environment (Rural or Urban)
Rate
Clearance rate (percentage radioactive particles remaining after one hour)
To assess lung health, the scientists measured "tracheobronchial clearance rate," that is, in English, "How fast do your lungs get rid of nasty stuff?" Each subject agreed to inhale an aerosol of radioactive Teflon particles. A Geiger counter held to the chest measured the radioactivity just after inhaling, and again one hour later. The clearance rate was the percentage of radioactivity remaining – the lower the better. Subjects were 15 sets of identical twins, each pair with one twin living in an urban environment and the other in a rural environment.
Per Camner MD & Klas Philipson MSc (1973) "Urban Factor and Tracheobronchial Clearance," Archives of Environmental Health: An International Journal, 27:2, 81-84, DOI: 10.1080/00039896.1973.10666323 Link to the article: https://doi.org/10.1080/00039896.1973.10666323
Sample of homes in Northampton, MA to see whether being close to a bike trail enhances the value of the home
A data frame with 104 observations on the following 30 variables.
HouseNum
Unique house number
Acre
Lot size for the house (in acres)
AcreGroup
Lot size groups (<= 1/4 acre or > 1/4 acre)
Adj1998
Estimated 1998 price (in thousands of 2014 dollars)
Adj2007
Estimated 2007 price (in thousands of 2014 dollars)
Adj2011
Estimated 2011 price (in thousands of 2014 dollars)
BedGroup
Bedroom groups (1-2 beds, 3 beds, or 4+ beds)
Bedrooms
Number of bedrooms
BikeScore
Bike friendliness (0-100 score, higher scores are better)
Diff2014
Difference in price between 2014 estimate and adjusted 1998 estimate (in thousands of dollars)
Distance
Distance (in feet) to the nearest entry point to the rail trail network
DistGroup
Distance groups, compared to 1/2 mile (Closer or Farther Away)
GarageSpaces
Number of garage spaces (0-4)
GarageGroup
Any garage spaces? (no or yes)
Latitude
Latitude (for mapping)
Longitude
Longitude (for mapping)
NumFullBaths
Number of full baths (includes shower or bathtub)
NumHalfBaths
Number of half baths (no shower or bathtub)
NumRooms
Number of rooms
PctChange
Percentage change from adjusted 1998 price to 2014 (value of zero means no change)
Price1998
Zillow 10 year estimate from 2008 (in thousands of dollars)
Price2007
Zillow price estimate from 2007 (in thousands of dollars)
Price2011
Zillow price estimate from 2011 (in thousands of dollars)
Price2014
Zillow price estimate from 2014 (in thousands of dollars)
SFGroup
SquareFeet group (<= 1500 sf or > 1500 sf)
SquareFeet
Square footage of interior finished space (in thousands of sf)
StreetName
Street name
StreetNum
House number on street
WalkScore
Walk friendliness (0-100 score, higher scores are better)
Zip
Location (1060 = Northampton or 1062 = Florence)
This dataset comprises 104 homes in Northampton, MA that were sold in 2007. The authors used Google Maps to measure the shortest distance along streets and pathways from each home to a rail trail, and recorded the Zillow.com estimate of each home's price in 1998 and 2011. Additional attributes such as square footage, number of bedrooms, and number of bathrooms are available from a realty database from 2007. We divide the houses into two groups based on distance to the trail (DistGroup).
From July 2015 JSE Datasets and Stories: "Rail Trails and Property Values: Is There an Association?", Ella Hartenian, Smith College and Nicholas J. Horton, Amherst College.
http://www.amstat.org/publications/jse/v23n2/horton.pdf
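Several variables in this dataset are derived from others. The Python sketch below (illustrative only) shows one plausible reading of those definitions. It assumes that PctChange is the usual percent change relative to the adjusted 1998 price, that Distance is in feet as listed above, and that a home exactly 1/2 mile from the trail counts as Closer; none of these details are confirmed by the documentation, and the prices are made up.

```python
HALF_MILE_FEET = 5280 / 2  # 1/2 mile expressed in feet

def diff_2014(price2014, adj1998):
    """Price difference in thousands of dollars."""
    return price2014 - adj1998

def pct_change(price2014, adj1998):
    """Assumed formula: percent change from the adjusted 1998 price to 2014."""
    return 100.0 * (price2014 - adj1998) / adj1998

def dist_group(distance_feet):
    """Assumption: exactly 1/2 mile is grouped with the closer homes."""
    return "Closer" if distance_feet <= HALF_MILE_FEET else "Farther Away"

# Hypothetical home: adjusted 1998 price 200, 2014 price 250 (thousands)
print(diff_2014(250, 200))   # 50
print(pct_change(250, 200))  # 25.0
print(dist_group(1000), "/", dist_group(5000))
```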
Measurements for a hypothetical set of nine rectangles.
A data frame with 9 observations on the following 5 variables.
Case
ID number for each rectangle
Width
Width (1, 4, or 10)
Length
Length (1, 4, or 10)
Area
Area
logArea
Log (base 10) of area
Areas for rectangles of width 1, 4, or 10 and length of 1, 4, or 10.
Areas computed for a hypothetical set of rectangles.
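A table with this structure can be generated from the three possible side lengths. The Python sketch below is for illustration; the order in which cases are numbered is an assumption and may not match the dataset.

```python
import math
from itertools import product

SIDES = [1, 4, 10]

# One row per (Width, Length) pair; case ordering here is assumed
rectangles = [
    {
        "Case": case,
        "Width": width,
        "Length": length,
        "Area": width * length,
        "logArea": math.log10(width * length),
    }
    for case, (width, length) in enumerate(product(SIDES, repeat=2), start=1)
]

for row in rectangles:
    print(row)
```

Because log10(Width x Length) = log10(Width) + log10(Length), logArea is additive in the logs of the two sides even though Area itself is multiplicative.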
Data on religiosity of countries from the Pew Global Attitudes Project
A dataset with 44 observations on the following 9 variables.
Country |
Name of country |
Religiosity |
A measure of degree of religiosity for residents of the country |
GDP |
Per capita Gross Domestic Product in the country |
Africa |
Indicator for countries in Africa |
EastEurope |
Indicator for countries in Eastern Europe |
MiddleEast |
Indicator for countries in the Middle East |
Asia |
Indicator for countries in Asia |
WestEurope |
Indicator for countries in Western Europe |
Americas |
Indicator for countries in North/South America |
The Pew Research Center's Global Attitudes Project surveyed people around the world and asked (among many other questions) whether they agreed that "belief in God is necessary for morality," whether religion is very important in their lives, and whether they pray at least once per day. The variable Religiosity is the sum of the percentage of positive responses on these three items, measured in each of 44 countries. The dataset also includes the per capita GDP for each country and indicator variables that record the part of the world the country is in.
Data from the 2007 Spring Survey conducted through the Pew Global Attitudes Project at http://www.pewglobal.org.
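Since Religiosity is described as the sum of three percentages, its possible range is 0 to 300. A minimal Python sketch, with made-up percentages and a hypothetical function name:

```python
def religiosity(pct_belief_needed, pct_religion_important, pct_pray_daily):
    """Sum of the percentages of positive responses on the three items."""
    return pct_belief_needed + pct_religion_important + pct_pray_daily

# Hypothetical country: 60%, 55%, and 40% positive responses
score = religiosity(60, 55, 40)
print(score)  # 155
```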
A student measured her pulse several times a day over 26 days.
A data frame with 104 observations on the following 3 variables.
Pulse
Pulse rate (beats per minute)
Time
Time of day (evening, morning, noon, one)
Day
Day1 to Day26
A student measured her pulse in the morning, at noon, at 1:00, and in the evening for each of 26 days.
Data supplied by a student at Oberlin College.
Quarterly production of residual oil in the U.S. from 1983 to 2016
A data frame with 136 observations on the following 7 variables.
Year
Year (1983 to 2016)
Qtr
Quarter (1=Jan-Mar, 2=Apr-June, 3=July-Sep, 4=Oct-Dec)
t
Time index (1 to 136)
Oil
Residual fuel oil distribution (in million gallons/day)
LogOil
Natural logarithm of Oil
The U.S. Energy Information Administration tracks the production and distribution of various types of petroleum products. The category for this dataset is residual oil: the heavier oils (often called No. 5 and No. 6) that remain after lighter oils (such as No. 4 home heating oil) are distilled away in the refining process. Residual oil is used in steam-powered ships, power plants, and other industrial applications.
U.S. Energy Information Administration website - Refiner sales volumes for residual fuel oil and No. 4 heating oil at https://www.eia.gov/petroleum/data.php#consumption. Specific webpage is
https://www.eia.gov/dnav/pet/pet_cons_refres_d_nus_VTR_mgalpd_m.htm.
Contributions to a supplemental retirement account (1997-2012)
A dataset with 16 observations on the following 2 variables.
Year |
1997-2012 |
SRA |
Annual contribution to the Supplemental Retirement Account |
A faculty member opened a supplemental retirement account (SRA) in 1997 to invest money for retirement. This dataset shows the annual contributions to that account. Annual contributions were adjusted downward during sabbatical years in order to maintain a steady family income.
Individual records kept by the faculty member.
Data on firefighter promotion exams as part of the Ricci v. DeStefano court case
A data frame with 118 observations on the following 5 variables.
Race
Race of firefighter (B=black, H=Hispanic, or W=white)
Position
Promotion desired (Captain or Lieutenant)
Oral
Oral exam score
Written
Written exam score
Combine
Combined score (written exam gets 60% weight)
The city of New Haven, Connecticut administered exams (both written and oral) in November and December of 2003 to firefighters hoping to qualify for promotion to either Lieutenant or Captain in the city fire department. A final score consisting of a 60% weight for the written exam and a 40% weight for the oral exam was computed for each person who took the exam. For each person who took the exams, there are measurements on their race (black, white, or Hispanic), which position they were trying for (Lieutenant, Captain), scores on the oral and written exams, and the combined score. These data were used as part of a court case (Ricci v. DeStefano) dealing with racial discrimination.
Data (RicciData.csv) and documentation (Ricci.txt) downloaded from
http://www.amstat.org/publications/jse/jse_data_archive.htm
An article on using these data: Miao, W. (2011) "Did the Results of Promotion Exams Have a Disparate Impact on Minorities? Using Statistical Evidence in Ricci v. DeStefano," JSE 19:1 at
www.amstat.org/publications/jse/v19n1/wilson.pdf
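The 60/40 weighting described above can be written out directly. The sketch below uses made-up exam scores (not values from the dataset) and the hypothetical vector names written, oral, and combine to show how a combined score of this kind is computed.

```r
# Hypothetical exam scores, for illustration only (not taken from the data)
written <- c(80, 65.5, 92)
oral    <- c(70, 88.25, 75)

# Combined score: 60% weight on the written exam, 40% on the oral exam
combine <- 0.6 * written + 0.4 * oral
combine
```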
Concentrations of elements in river water samples from upstate NY
A dataset with 12 observations on the following 27 variables.
River |
One of four rivers: Grasse, Oswegatchie, Raquette, or St. Regis |
Site |
Location: 1=UpStream, 2=MidStream, 3=Downstream |
Al |
Aluminum |
Ba |
Barium |
Br |
Bromine |
Ca |
Calcium |
Ce |
Cerium |
Cu |
Copper |
Dy |
Dysprosium |
Er |
Erbium |
Fe |
Iron |
Gd |
Gadolinium |
Ho |
Holmium |
K |
Potassium |
La |
Lanthanum |
Li |
Lithium |
Mg |
Magnesium |
Mn |
Manganese |
Nd |
Neodymium |
Pr |
Praseodymium |
Rb |
Rubidium |
Si |
Silicon |
Sr |
Strontium |
Y |
Yttrium |
Yb |
Ytterbium |
Zn |
Zinc |
Zr |
Zirconium |
Some geologists were interested in the water chemistry of rivers in upstate New York. They took water samples at three different locations in four rivers (Grasse, Oswegatchie, Raquette, and St. Regis). The sampling sites were chosen to investigate how the composition of the water changes as it flows from the source to the mouth of each river. The sampling sites were labeled as upstream, midstream, and downstream. This dataset contains the concentrations (parts per million) of a variety of elements in those water samples. The dataset RiverIron contains the information for iron (Fe) alone, along with the log of the concentration.
Thanks to Dr. Jeff Chiarenzelli of the St. Lawrence University Geology Department for the data.
Chiarenzelli, Lock, Cady, Bregani and Whitney, "Variation in river multi-element chemistry related to bedrock buffering: an example from the Adirondack region of northern New York, USA", Environmental Earth Sciences, Volume 67, Number 1 (2012), 189-204
Amounts of iron in water samples of four rivers
A dataset with 12 observations on the following 4 variables.
River |
One of four rivers: Grasse, Oswegatchie, Raquette, or St. Regis |
Site |
Location of the site: DownStream, MidStream, or Upstream |
Iron |
Iron concentration in the water sample (parts per million) |
LogIron |
Log (base 10) of iron concentration |
Some geologists were interested in the water chemistry of rivers in upstate New York. They took water samples at three different locations in four rivers (Grasse, Oswegatchie, Raquette, and St. Regis). The sampling sites were chosen to investigate how the composition of the water changes as it flows from the source to the mouth of each river. The sampling sites were labeled as upstream, midstream, and downstream. This dataset contains the concentrations of iron in the samples. The dataset RiverElements has similar concentration data for many other elements.
Thanks to Dr. Jeff Chiarenzelli of the St. Lawrence University Geology Department for the data.
Chiarenzelli, Lock, Cady, Bregani and Whitney, "Variation in river multi-element chemistry related to bedrock buffering: an example from the Adirondack region of northern New York, USA", Environmental Earth Sciences, Volume 67, Number 1 (2012), 189-204
A sample of 30 field goal attempts in the National Football League
A dataset with 30 observations on the following 13 variables.
ID |
ID number |
PlayerID |
Code for player |
LastName |
Last name |
FirstName |
First name |
Year |
Year |
Team |
Abbreviation for team name |
Date |
Code for date: mmddyy |
FGAttempts |
Field goals attempted by the kicker that game |
FGMade |
Field goals made by the kicker that game |
Attempt |
Which attempt during the game? |
Result |
1=made the field goal or 0=missed |
Yards |
Number of yards for the field goal attempt |
Block |
1=attempt blocked or 0=not blocked |
This is a subset of just 30 field goal attempts selected at random from the larger sample of attempts made by NFL kickers that is summarized in FGByDistance.
We thank Sean Forman and Doug Drinen of Sports Reference LLC for providing us with the NFL field goal data set.
Ant counts on samples of different kinds of sandwiches
A dataset with 48 observations on the following 5 variables.
Trial |
Trial number |
Bread |
Type of bread: Multigrain, Rye, White, or Wholemeal |
Filling |
Type of filling: HamPickles, PeanutButter, or Vegemite |
Butter |
Butter on the sandwich? no or yes |
Ants |
Number of ants on the sandwich |
As young students, Dominic Kelly and his friends enjoyed watching
ants gather on pieces of sandwiches. Later, as a university student, Dominic
decided to study this with a more formal experiment. He chose three types of
sandwich fillings (vegemite, peanut butter, and ham & pickles), four types
of bread (multigrain, rye, white, and wholemeal), and put butter on some of
the sandwiches.
To conduct the experiment he randomly chose a sandwich,
broke off a piece, and left it on the ground near an ant hill. After several
minutes he placed a jar over the sandwich bit and counted the number of
ants. He repeated the process, allowing time for ants to return to the hill
after each trial, until he had two samples for each combination of the three
factors.
Margaret Mackisack, "Favourite Experiments: An Addendum to What is the Use of Experiments Conducted by Statistics Students?",
Journal of Statistics Education (1994)
http://www.amstat.org/publications/jse/v2n1/mackisack.supp.html
A sample of SAT scores and grade point averages for statistics students
A dataset with 24 observations on the following 3 variables.
MathSAT |
Score (out of 800) on the mathematics portion of the SAT exam |
VerbalSAT |
Score (out of 800) on the verbal portion of the SAT exam |
GPA |
Grade point average (0.0-4.0 scale) |
In recent years many colleges have re-examined the traditional role the scores on the Scholastic Aptitude Tests (SAT's) play in making decisions on which students to admit. Do SAT scores really help predict success in college? To investigate this question a group of 24 introductory statistics students supplied the data in this dataset showing their score on the Verbal and Math portions of the SAT as well as their current grade point average (GPA) on a 0.0-4.0 scale.
Student survey in an introductory statistics course.
Area of sea ice in the Arctic measured yearly in September (1979 to 2015)
A data frame with 37 observations on the following 4 variables.
Year
Year (1979 - 2015)
Extent
Extent of arctic sea ice (in million square km)
Area
Area of arctic sea ice (in million square km)
t
Index for year (t=1 in 1979)
Climatologists have been measuring the amount of sea ice in both the Arctic and Antarctic regions for a number of years. This datafile gives information about the amount of sea ice in the arctic region as measured in September (the time when the amount of ice is at its least) since 1979. The basic research question is to see if we can use time to model the amount of sea ice.
In fact, there are two ways to measure the amount of sea ice: Area and Extent. Area measures the actual amount of space taken up by ice. Extent measures the area inside the outer boundaries created by the ice. If there are areas inside the outer boundaries that are not ice (think about a slice of swiss cheese), then the Extent will be a larger number than the Area. In fact, this is almost always true.
Data from ftp://sidads.colorado.edu/DATASETS/NOAA/G02135/Sep/N_09_areaV2.txt, updating the data from
Witt, G. (2013) "Using Data from Climate Science to Teach Introductory Statistics," JSE 21:1 available at www.amstat.org/publications/jse/v21n1/witt.pdf
Metamorphosis rates for sea slugs exposed to different water samples
A dataset with 36 observations on the following 2 variables.
Time |
Minutes after tide comes in |
Percent |
Proportion of 15 sea slug larvae that metamorphose |
Sea slugs, common on the coast of southern California, live on vaucherian seaweed. The larvae from these sea slugs need to locate this type of seaweed to survive. A study was done to try to determine whether chemicals that leach out of the seaweed attract the larvae. Seawater was collected over a patch of this kind of seaweed at 5-minute intervals as the tide was coming in and, presumably, mixing with the chemicals. The idea was that as more seawater came in, the concentration of the chemicals was reduced. Each sample of water was divided into 6 parts. Fifteen larvae were then introduced to this seawater to see what percentage metamorphosed (an indication that the desired chemical was detected).
Data downloaded from http://www.stat.ucla.edu/projects/datasets/seaslug-explanation.html
A paper based on these data: Krug, P.J. and R.K. Zimmer. 2000b. Larval settlement: chemical markers for tracing production, transport, and distribution of a waterborne cue. Marine Ecology Progress Series, vol. 207: 283-296.
Heart rates for a sample of six tree shrews at each of three stages of sleep.
A data frame with 18 observations on the following 4 variables.
ID
Row ID
Shrew
Shrew ID (A through F)
Phase
Phase of sleep (DSW=deep wave, LSW=light wave, or REM=dreaming)
Rate
Heart rate (beats per minute)
Heart rates were recorded for a sample of six tree shrews at each of three stages of sleep.
Berger, R. J. and Walker, J. M. (1972) "The Polygraphic Study of Sleep in the Tree Shrew," Brain, Behavior, and Evolution, v. 5, pp. 62
This function computes autocorrelations for various lags of a time series.
sluacf(series, lags = 1, maxlag = NULL, ndiff = 0, sdiff = 0)
series |
a time series object |
lags |
a multiplier for the lag. For example, use lags=12 for monthly data. |
maxlag |
maximum number of lags to compute |
ndiff |
number of regular differences to take before finding the ACF |
sdiff |
number of seasonal differences (with seasonal period specified by lags) |
This is a wrapper for the acf function which allows for specifying regular (ndiff) and seasonal (sdiff) differences. The lags parameter specifies the seasonal lag and adjusts the default lags in the returned acf object to go 1, 2, ..., rather than showing fractional lags (for example, 1/12, 2/12, ... for monthly data). The lag 0 autocorrelation is set to NA (rather than 1) so that it won't show when the acf is plotted.
An object of class "acf"
data(SeaIce)
ExtentY=ts(SeaIce$Extent,start=1979)
sluacf(ExtentY)
sluacf(ExtentY, maxlag=8,ndiff=1)
data(Inflation)
CPIts=ts(Inflation$CPI,start=c(2009,1),frequency=12)
CPIacf=sluacf(CPIts,maxlag=36,lags=12)
plot(CPIacf,lwd=2,ci.type="ma",xlim=c(1,36),xaxp=c(0,36,6),main="")
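The behavior described for this wrapper (differencing, rescaled lags, and a blanked lag 0 value) can be approximated with base R alone. The following is a rough sketch under stated assumptions, not the package's source code: the series is simulated, and the steps simply mirror the description in the details above.

```r
# Simulated monthly series (illustration only; not package data)
set.seed(1)
x <- ts(cumsum(rnorm(120)) + 5 * sin(2 * pi * (1:120) / 12), frequency = 12)

# One regular difference followed by one seasonal difference (period 12)
dx <- diff(diff(x, lag = 1), lag = 12)

# Base acf() reports lags in units of the series period (1/12, 2/12, ...)
a <- acf(dx, lag.max = 36, plot = FALSE)

# Rescale the lags to 0, 1, 2, ... and blank out the lag 0 autocorrelation
a$lag <- a$lag * 12
a$acf[1] <- NA

# The modified object is still of class "acf" and can be plotted as usual
plot(a, main = "")
```

Setting the first autocorrelation to NA keeps the trivial lag 0 value of 1 from dominating the vertical scale of the plot.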
Weight and wing length for a sample of Savannah sparrows
A dataset with 116 observations on the following 3 variables.
Treatment |
Nest adjustment: control, enlarged, or reduced |
Weight |
Weight (in grams) |
WingLength |
Wing length (in mm) |
Priscilla Erickson from Kenyon College collected data on a stratified random sample of 116 Savannah sparrows at Kent Island. The sparrows came from nests that were reduced, controlled (no change), or enlarged.
We thank Priscilla Erickson and Professor Robert Mauck from the Department of Biology at Kenyon College for allowing us to use these data.
Land area and number of mammal species for islands in Southeast Asia
A dataset with 14 observations on the following 5 variables.
Name |
Name of the island |
Area |
Area (in sq. km) |
Species |
Number of mammal species |
logArea |
Natural logarithm (base e) of Area |
logSpecies |
Natural logarithm (base e) of Species |
This dataset shows the number of mammal species and the area for 14 islands in Southeast Asia. Biologists have speculated that the number of species is related to the size of an island and would like to be able to predict the number of species given the size of an island.
Heaney, Lawrence R. (1984) "Mammalian species richness on islands on the Sunda Shelf, Southeast Asia," Oecologia, 61:11-17.
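Because the dataset carries natural logs of both Area and Species, one common way to predict species from area is a regression on the log-log scale. The sketch below uses made-up islands that follow an exact power law (an assumption chosen so the answer is known in advance); it is not a fit to the actual data.

```r
# Made-up islands following Species = 2 * Area^0.25 exactly (illustration only)
Area <- c(10, 100, 1000, 10000, 100000)
Species <- 2 * Area^0.25

# Regress log(Species) on log(Area); log() is the natural logarithm
fit <- lm(log(Species) ~ log(Area))
coef(fit)  # intercept near log(2), slope near 0.25

# Predict for an island of 5000 sq. km, back-transformed to the species scale
pred <- exp(predict(fit, newdata = data.frame(Area = 5000)))
```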
Highway fatality rates 1987-2007
A dataset with 21 observations on the following 3 variables.
Year |
Year (1987-2007) |
FatalityRate |
Number of fatalities on interstate highways (per 100 million vehicle-miles) |
StateControl |
0=1987-1994 or 1=1995-2007 |
In 1987 the federal government allowed the speed limit on interstate highways to be 65 mph in most areas. In 1995 federal restrictions were eliminated, so that states assumed control of setting speed limits on interstate highways. This data set compares fatality rates for years before and after the states assumed control for highway speed limits.
Data from the National Highway Traffic Safety Administration website at
http://www-fars.nhtsa.dot.gov/Main/index.aspx
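The StateControl indicator follows directly from the year. A minimal sketch, using only the year range and cutoff stated above (the vector names are chosen to match the variable list):

```r
# Years covered by the dataset
Year <- 1987:2007

# 0 for 1987-1994 (federal limit), 1 for 1995-2007 (state control)
StateControl <- as.integer(Year >= 1995)
table(StateControl)
```

An indicator coded this way can be used as a predictor, alone or with Year, to compare fatality rates before and after states assumed control.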
Experiment on the effects of oxygen on sugar metabolism by bacteria
A data frame with 16 observations on the following 3 variables.
Sugar
Type of sugar (Galactose or Glucose)
Oxygen
Oxygen concentration
Ethanol
Ethanol concentration
Many biochemical reactions are slowed or prevented by the presence of oxygen. For example, there are two simple forms of fermentation, one which converts each molecule of sugar to two molecules of lactic acid, and a second which converts each molecule of sugar to one each of lactic acid, ethanol, and carbon dioxide. This experiment was designed to compare the inhibiting effect of oxygen on the metabolism of two different sugars, glucose and galactose, by Streptococcus bacteria. In this case there were four levels of oxygen that were applied to the two kinds of sugar.
Data are found in Statistics: The Exploration and Analysis of Data by Jay Devore and Roxy Peck (2008). St. Paul, MN: West.
The original article is Yamada T., Takahashi-Abbe S., Abbe K. (1985) "Effects of oxygen concentration on pyruvate formate-lyase in situ and sugar metabolism of Streptococcus mutans and Streptococcus sanguis," Infection and Immunity, pp. 129-134.
Data on serious suicide attempts in Shandong, China
A data frame with 2571 observations on the following 11 variables.
Person_ID
ID number
Hospitalised
Hospitalised? (no or yes)
Died
Died? (no or yes)
Urban
Urban area? (no, unknown, or yes)
Year
Year (2009, 2010, or 2011)
Month
Month (1=Jan through 12=December)
Sex
Sex (female or male)
Age
Age (years)
Education
Education level (iliterate, primary, Secondary, Tertiary, or unknown)
Occupation
One of ten occupation categories
method
One of nine possible methods
Data from a study of serious suicide attempts over three years in a predominantly rural population in Shandong, China.
Sun J, Guo X, Zhang J, Wang M, Jia C, Xu A (2015) "Incidence and fatality of serious suicide attempts in a predominantly rural population in Shandong, China: a public health surveillance study," BMJ Open 5(2): e006762. https://doi.org/10.1136/bmjopen-2014-006762
Data downloaded via Dryad Digital Repository. https://doi.org/10.5061/dryad.r0v35
Attitudes towards the Swahili language among Kenyan school children
A dataset with 480 observations on the following 4 variables.
Province |
NAIROBI or PWANI |
Sex |
female or male |
Attitude.Score |
Score (out of a possible 200 points) on a survey of attitude towards the Swahili language |
School |
Code for the school: A through L |
Hamisi Babusa, a Kenyan scholar, administered a survey to 480 students from Pwani and Nairobi provinces about their attitudes towards the Swahili language. In addition, the students took an exam on Swahili. From each province, the students were from 6 schools (3 girls schools and 3 boys schools) with 40 students sampled at each school, so half of the students from each province were males and the other half females. The survey instrument contained 40 statements about attitudes towards Swahili and students rated their level of agreement to each. Of these questions, 30 were positive questions and the remaining 10 were negative questions. On an individual question the most positive response would be assigned a value of 5 while the most negative response would be assigned a value of 1. By summing (adding) the responses to each question, we can find an overall Attitude Score for each student. The highest possible score would be 200 (an individual who gave the most positive possible response to every question). The lowest possible score would be 40 (an individual who gave the most negative response to every question).
Thanks to Dr. Babusa of Kenyatta University for sharing these data.
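The scoring arithmetic described above can be checked in a few lines. This sketch assumes each of the 40 items has already been coded so that 5 is the most positive response (as the description states); the responses are randomly generated for a hypothetical student, not drawn from the survey.

```r
n_items <- 40

# Bounds on the total score when each item is coded 1 (most negative) to 5 (most positive)
max_score <- n_items * 5   # 200
min_score <- n_items * 1   # 40

# Responses for one hypothetical student (random, for illustration only)
set.seed(42)
responses <- sample(1:5, n_items, replace = TRUE)

# Overall attitude score is the sum over all items
score <- sum(responses)
```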
Comparing intestine lengths for tadpoles with and without exposure to Bd fungus
A data frame with 27 observations on the following 4 variables.
Treatment
Exposed to fungus (Bd=yes or Control=no)
Body
Length of body (in mm)
GutLength
Length of intestine (in mm)
MouthpartDamage
Measure of damage to the mouth (e.g. missing teeth)
Biologists wondered whether tadpoles can adjust the relative length of their intestines if they are exposed to a fungus called Batrachochytrium dendrobatidis (Bd).
Venesky MD, Hanlon SM, Lynch K, Parris MJ, Rohr JR. (2013) "Optimal digestion theory does not predict the effect of pathogens on intestinal plasticity," Biol Lett 9: 20130038. http://dx.doi.org/10.1098/rsbl.2013.0038
Daily closing prices of Apple, Google, and Microsoft stocks (12/1/2015 to 12/1/2017)
A data frame with 504 observations on the following 5 variables.
Date
Date (coded as mm/dd/yyyy)
AAPL
Apple Inc. closing price
GOOG
Alphabet Inc. (Google) closing price
MSFT
Microsoft Corp. closing price
t
Time index (1 to 504)
Closing price of Apple (AAPL), Google/Alphabet (GOOG) and Microsoft (MSFT) stocks for each trading day in a two-year period from 12/1/2015 to 12/1/2017.
Data downloaded using the Quandl R package (12/2/2017)
State teen pregnancy rates, Civil War participation, and church attendance.
A data frame with 50 observations on the following 4 variables.
State
State abbreviation
CivilWar
Role in Civil War (B=border, C=Confederate, O=other, or U=union)
Church
Percentage who attended church in previous week (from a state survey)
Teen
Number of pregnancies per 1000 teenage girls in state
State level data on teen pregnancies, church attendance, and role in the U.S. Civil War.
2010 teen pregnancy rate, per 1000 teenage women, per year. Source: Guttmacher Institute, via Tanya Lewis (5 May 2014) "Teen pregnancy rates by state," https://www.livescience.com
Prices and number of pages for a sample of college textbooks
A dataset with 30 observations on the following 2 variables.
Pages |
Number of pages in the textbook |
Price |
Price of the textbook (in dollars) |
Two undergraduate students at Cal Poly - San Luis Obispo took a random sample of 30 textbooks from the campus bookstore in the fall of 2006. They recorded the price and number of pages in each book, in order to investigate the question of whether number of pages can be used to predict price.
Student project
Votes in the US Senate on Clarence Thomas nomination for the US Supreme Court
A data frame with 100 observations on the following 6 variables.
State
State name
Senator
Senator name
Party
Party affiliation (D=Democrat or R=Republican)
ConfVote
Confirmation vote (Nay or Yea)
StateOpinion
Percentage of state residents supporting the choice
Vote
Numeric coding for vote (1=for or 0=against)
Data from the U.S. Senate vote on October 15, 1991 to confirm Clarence Thomas to a position on the Supreme Court.
These numbers are taken from Kastellec, J.P., Lax, J.R., and Phillips, J. (2010), "Public Opinion and Senate Confirmation of Supreme Court Nominees," Journal of Politics, 72(3): 767-84. In this paper the authors used opinion polls and an advanced statistical method known as multilevel regression and poststratification to determine the StateOpinion levels.
Compare prices for Porsche, Jaguar, and BMW cars offered for sale at an internet site
A dataset with 90 observations on the following 8 variables.
CarType |
BMW, Jaguar, or Porsche |
Price |
Asking price (in $1,000's) |
Age |
Age of the car (in years) |
Mileage |
Previous miles driven (in 1,000's) |
Car |
0=Porsche, 1=Jaguar, 2=BMW |
Porsche |
Indicator with 1=Porsche and 0=otherwise |
Jaguar |
Indicator with 1=Jaguar and 0=otherwise |
BMW |
Indicator with 1=BMW and 0=otherwise |
Two students collected samples of Porsche, Jaguar, and BMW cars that were offered for sale at an internet site. In addition to asking price, they recorded the model year (converting to age) and mileage of each advertised car. The PorschePrice dataset (from the first edition) has only the Porsche data and the PorscheJaguar dataset has the data for those two models.
This dataset has been updated (with different car models) to ThreeCars2017 for the second edition.
Student project data collected from autotrader.com in Spring 2007.
Data from cars.com for a sample of three different models of used cars in 2017
A data frame with 90 observations on the following 7 variables.
CarType
Model (Accord, Maxima, or Mazda6)
Age
Age of used car (years)
Price
Price (in thousands of dollars)
Mileage
Mileage (in thousands of miles)
Mazda6
Is the car a Mazda6? (1=yes or 0=no)
Accord
Is the car an Accord? (1=yes or 0=no)
Maxima
Is the car a Maxima? (1=yes or 0=no)
Data for a sample of cars from three models (Mazda6, Honda Accord, Nissan Maxima) from a website. The dataset AccordPrice is a subset of this file.
Data obtained from cars.com, February 2017 using zip code 44107, Lakewood, Ohio.
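The three 0/1 indicator columns carry the same information as CarType. As a small sketch of how such indicators relate to the factor, the code below builds them for four made-up rows (the rows are illustrative, not records from the dataset).

```r
# Four made-up rows, for illustration only
CarType <- factor(c("Accord", "Maxima", "Mazda6", "Accord"))

# One 0/1 indicator per model, named as in the variable list above
cars <- data.frame(
  CarType,
  Mazda6 = as.integer(CarType == "Mazda6"),
  Accord = as.integer(CarType == "Accord"),
  Maxima = as.integer(CarType == "Maxima")
)

# Every car belongs to exactly one model, so each row of indicators sums to 1
rowSums(cars[, c("Mazda6", "Accord", "Maxima")])
```

Because the indicators sum to 1 in every row, a regression with an intercept should include at most two of the three.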
Effect of a waiter leaving a joke or an advertisement on getting a tip
A dataset with 211 observations on the following 5 variables.
Card |
Type of card used: Ad, Joke, or None |
Tip |
1=customer left a tip or 0=no tip |
Ad |
Indicator for Ad card (1=ad card left or 0=no ad card) |
Joke |
Indicator for Joke card (1=joke card left or 0=no joke card) |
None |
Indicator for no card (1=no card left or 0=ad or joke card left) |
Can telling a joke affect whether or not a waiter in a coffee bar receives a tip from a customer? A study investigated this question at a coffee bar at a famous resort on the west coast of France. The waiter randomly assigned coffee-ordering customers to one of three groups: When receiving the bill one group also received a card telling a joke, another group received a card containing an advertisement for a local restaurant, and a third group received no card at all. He recorded whether or not each customer left a tip.
Gueguen, Nicholas (2002), "The Effects of a Joke on Tipping When it is Delivered at the Same Time as the Bill," Journal of Applied Social Psychology, 32, 1955-1963.
List and outcomes for passengers on the Titanic
A dataset with 1313 observations on the following 6 variables.
Name |
Passenger name |
PClass |
Passenger class: *=missing, 1st, 2nd, or 3rd |
Age |
Age (in years) |
Sex |
female or male |
Survived |
1=survived or 0=died |
SexCode |
1=female or 0=male |
The Titanic was a British luxury ocean liner that sank famously in the icy North Atlantic on its maiden voyage in April of 1912. Of the approximately 2200 passengers on board, 1500 died. The high death rate was blamed largely on the inadequate supply of lifeboats, a result of the manufacturer's claim that the ship was "unsinkable." A partial data set of the passenger list was compiled by Philip Hinde in his Encyclopedia Titanica and is given in this dataset.
Philip Hinde's Encyclopedia Titanica, http://www.encyclopedia-titanica.org/.
Data may also be downloaded from the Australasian Data and Story Library (OzDASL) at
http://www.statsci.org/data/general/titanic.html.
Effects of transcranial magnetic stimulation (TMS) on migraine headaches
A dataset with 2 observations on the following 4 variables.
Group |
Treatment group: Placebo or TMS |
Yes |
Count of number of patients that were pain-free |
No |
Count of number of patients that had pain |
Trials |
Number of patients in the group |
A study investigated whether a handheld device that sends a magnetic pulse into a person's head might be an effective treatment for migraine headaches. Researchers recruited 200 subjects who suffered from migraines and randomly assigned them to receive either the TMS (transcranial magnetic stimulation) treatment or a sham (placebo) treatment from a device that did not deliver any stimulation. Subjects were instructed to apply the device at the onset of migraine symptoms and then assess how they felt two hours later. This dataset is a two-way table of the results.
This dataset was renamed Migraines in the second edition.
Based on results in R. B. Lipton, et al. (2010) "Single-pulse Transcranial Magnetic Stimulation for Acute Treatment of Migraine with Aura: A Randomised, Double-blind, Parallel-group, Sham-controlled Trial," 9(4):373-380.
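A two-way table stored in this Yes/No/Trials layout can be analyzed directly as grouped binomial data. The sketch below uses placeholder counts, chosen only so that the two groups total 200 patients; they are assumptions for illustration and should not be read as the study's results.

```r
# Placeholder counts (assumed for illustration; not asserted to be the study's values)
tms <- data.frame(Group = c("Placebo", "TMS"),
                  Yes   = c(22, 39),
                  No    = c(78, 61))
tms$Trials  <- tms$Yes + tms$No
tms$PropYes <- tms$Yes / tms$Trials

# Logistic regression on grouped counts: cbind(successes, failures) ~ predictor
fit <- glm(cbind(Yes, No) ~ Group, family = binomial, data = tms)

# Estimated odds ratio of being pain-free, TMS relative to Placebo
odds_ratio <- unname(exp(coef(fit)["GroupTMS"]))
```

With only two groups the model is saturated, so the fitted odds ratio equals the one computed by hand from the table.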
Rushing yards for each game LaDainian Tomlinson played in the 2006 National Football League (NFL) regular season.
A dataset with 16 observations on the following 4 variables.
Game |
Week number in the 2006 season |
Opponent |
Name of opposing team |
Attempts |
Number of rushing attempts |
Yards |
Total yards gained rushing for the game |
For each of the sixteen games the San Diego Chargers played in the 2006 NFL regular season we have the number of times LaDainian Tomlinson ran the ball and the total yards he gained.
This data set from the first edition was replaced by BreesPass in the second edition.
Data downloaded from http://www.pro-football-reference.com/players/T/TomlLa00/gamelog/2006/
This function produces a Tukey nonadditivity plot for a two-way ANOVA model.
TukeyNonaddPlot(formula, data, out = "n", main = "Tukey Nonadditivity Plot", ylab = "Residuals")
formula |
A formula for a two-way ANOVA in the form Response~FactorA+FactorB (or FactorA*FactorB) |
data |
A dataframe |
out |
Control what is returned. Default is "n"=nothing. Other options are "comp" for the comparisons, "line" for the equation of the line, and "resid" for the cell residuals. |
main |
Add a title, default is "Tukey Nonadditivity Plot" |
ylab |
Label vertical axis, default is "Residuals" |
More details need to be written
Depends on the option set with out.
data(Dinosaurs)
TukeyNonaddPlot(Iridium~Source*factor(Depth),data=Dinosaurs)
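Since the details for this function are not yet written, the following sketch may help show what a Tukey nonadditivity plot generally involves. It follows the standard textbook construction (residuals from the additive model plotted against comparison values) on a made-up table; it is an assumption that the function works along these lines, and this is not its source code.

```r
# Made-up 3x3 two-way layout with one observation per cell (illustration only)
y <- matrix(c(2,  4,  8,
              3,  7, 13,
              5, 11, 22), nrow = 3, byrow = TRUE)

grand  <- mean(y)
rowEff <- rowMeans(y) - grand   # effects for the row factor
colEff <- colMeans(y) - grand   # effects for the column factor

# Residuals from the additive model: y - (grand + rowEff + colEff)
res <- y - (grand + outer(rowEff, colEff, "+"))

# Comparison values: product of the two effects divided by the grand mean
comp <- outer(rowEff, colEff) / grand

# A clear linear trend of residuals on comparison values suggests nonadditivity
line_fit <- lm(as.vector(res) ~ as.vector(comp))
slope <- unname(coef(line_fit)[2])

plot(as.vector(comp), as.vector(res),
     xlab = "Comparison values", ylab = "Residuals")
abline(line_fit)
```

In the usual interpretation, a slope near 0 is consistent with an additive model, while a clearly nonzero slope points toward an interaction that a power transformation of the response might remove.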
Experiment comparing the ability of twins (one urban, one rural) to clear airborne radioactive particles from their lungs
A dataset with 14 observations on the following 3 variables.
Pair |
Code for the twin pair: A - G |
Environ |
Living environment: Rural or Urban |
Percent |
Percentage of radioactivity remaining in lungs |
This dataset is from a study to compare the effect of living environment (rural or urban) on human lung function, where the researchers were able to locate seven pairs of twins with one twin in each pair living in the country, the other in a city. To measure lung function, twins inhaled an aerosol of radioactive Teflon particles. By measuring the level of radioactivity immediately and then again after an hour, the scientists could measure the rate of "tracheobronchial clearance." The percentage of radioactivity remaining in the lungs after an hour told how quickly subjects' lungs cleared the inhaled particles.
This dataset was renamed as RadioactiveTwins for the second edition.
"Urban factor and tracheobronchial clearance" by Per Camner and Klas Philipson in Archives of Environmental Health, V. 27 (1973), page 82. Data can be found in Introduction to Mathematical Statistics and its Applications, 2nd Edition by Richard J. Larson and Morris L. Marx. Englewood Cliffs, NJ: Prentice Hall, p. 548.
Ratings of an OCD symptom in psychotherapy sessions
A data frame with 44 observations on the following 3 variables.
Group
Time frame of the session (I=early through VI=late)
Score
Rating of OCD symptom on a 1 to 4 scale
Symbol
Indicator for groups I, III, and IV
A patient had been diagnosed with OCD (obsessive/compulsive disorder) and underwent a series of psychotherapy sessions. Notes from the sessions were presented to three different experienced therapists who rated sessions with a particular OCD symptom (defense of undoing) on a 1 to 4 scale (smaller values indicating worse symptoms). If all three judges agreed on the stage of a session, that determined the category. Otherwise, they discussed until they reached a consensus. The sessions were also grouped into six groups with I being the earliest sessions and VI being the latest.
Sampson, Harold, Joseph Weiss, L. Mlodansky, and Edward Hause (1972) "Defense analysis and the emergence of warded off mental contents," Archives of General Psychiatry, v. 26, pp. 524-532.
Price of US stamp for first class mail 1885-2012
A dataset with 25 observations on the following 2 variables.
Year |
Years when stamp price changed |
Price |
Cost of a US first class stamp (in cents) |
The data record the year and price for each change in price for a US first class (1 ounce, domestic letter) stamp since 1885.
http://about.usps.com/who-we-are/postal-history/domestic-letter-rates-1863-2011.htm
Experiment to compare visual and verbal performance
A data frame with 80 observations on the following 5 variables.
Subject
Subject number (s1 to s20)
Task
Follow a letter (Visual) or a sentence (Verbal)
Report
Point response (Visual) or say response (Verbal)
Group
Combination of Task+Report (Letter Point, Letter Say, Sentence Point, or Sentence Say)
Time
Response time (in seconds)
Subjects carried out two kinds of tasks, one visual (identify letters) and one verbal (identify sentences), and reported the results in either of two ways, one visual (pointing at a response) and one verbal (speaking a response). Time to complete each task was recorded in seconds.
Original experiment from Brooks, L. R. (1968) "Spatial and verbal components of the act of recall," Canadian J. Psych. V 22, pp. 349-368. These data were collected from a Mount Holyoke College psychology class.
Voltage drop over time as a capacitor discharges
A dataset with 50 observations on the following 2 variables.
Voltage | Voltage (in volts) |
Time | Time after charging (in seconds) |
A capacitor was charged with a nine-volt battery and then a voltmeter recorded the voltage as the capacitor was discharged. Measurements were taken every 0.02 seconds.
Measurements recorded by one of the authors.
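A capacitor discharging through a resistance is commonly modeled by exponential decay, V(t) = V0 * exp(-t / tau). The source does not state this model, so it is an assumption here. Under that assumption, log(Voltage) is linear in Time and the slope estimates -1/tau. The following Python sketch illustrates the idea with noise-free, hypothetical stand-in values (not the package data); the names `fit_log_linear`, `V0_TRUE`, and `TAU_TRUE` are invented for illustration.

```python
import math

def fit_log_linear(times, voltages):
    """Least-squares fit of log(voltage) = a + b * time; returns (a, b)."""
    logs = [math.log(v) for v in voltages]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return intercept, slope

# Hypothetical stand-in values, NOT the package data:
# 50 readings taken every 0.02 seconds from an assumed 9-volt start.
V0_TRUE, TAU_TRUE = 9.0, 0.5
times = [0.02 * i for i in range(50)]
voltages = [V0_TRUE * math.exp(-t / TAU_TRUE) for t in times]

intercept, slope = fit_log_linear(times, voltages)
v0_hat = math.exp(intercept)   # estimated starting voltage
tau_hat = -1.0 / slope         # estimated time constant (seconds)
```

With real measurements the fit would not be exact, and a residual plot of log(Voltage) against Time would be a reasonable check on whether the exponential model is adequate.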
An experiment to see if special exercises help babies learn to walk sooner
A dataset with 24 observations on the following 2 variables.
Group | Treatments: exercise control, final report, special exercises, or weekly report |
Age | Age (in months) when first walking |
Scientists wondered if they could get babies to walk sooner by prescribing a set of special exercises. Their experimental design included four groups of babies and the following treatments:
Special exercises: Parents were shown the special exercises and encouraged to use them with their children. They were phoned weekly to check on their child's progress.
Exercise control: These parents were not shown the special exercises, but they were told to make sure their babies spent at least 15 minutes a day exercising.
Weekly report: Parents in this group were not given instructions about exercise. Like the parents in the treatment group, however, they received a phone call each week to check on progress.
Final report: These parents were not given weekly phone calls or instructions about exercises. They reported at the end of the study.
Zelazo, Phillip R., Nancy Ann Zelazo, and Sarah Kolb (1972), “Walking in the Newborn,” Science, v. 176, pp. 314-315.
Daily pedometer data for one of the authors
A data frame with 223 observations on the following 7 variables.
StepCount
Number of steps taken in the day
Kcal
Calories burned (according to pedometer)
Miles
Miles walked
Weather
cold, rain, or shine
Day
Day of week (F=Friday, M=Monday, R=Thursday, S=Saturday, T=Tuesday, U=Sunday, W=Wednesday)
Walk
Were the dogs walked? (1=yes or 0=no)
Steps
Steps in units of 1,000 (so StepCount/1000)
One of the authors recorded daily pedometer data, the weather, and whether or not he walked the dogs.
Pedometer records kept by one of the authors.
An experiment to see if financial incentives improve weight loss
A dataset with 38 observations on the following 3 variables.
WeightLoss | Weight loss (in pounds) after four months |
Group | Treatment group: Control or Incentive |
Month7Loss | Weight loss (in pounds) after seven months |
Researchers investigated whether financial incentives would help people lose weight more successfully. Some participants in the study were randomly assigned to a treatment group that was offered financial incentives for achieving weight loss goals, while others were assigned to a control group that did not use financial incentives. All participants were monitored over a four month period and the net weight change (Before - After in pounds) at the end of this period was recorded for each individual. Then the individuals were left alone for three months with a followup weight check at the seven-month mark to see whether weight losses persisted after the original four months of treatment.
The 4-month data alone (with missing values omitted) is stored in WeightLossIncentive4.
The 7-month data alone (with missing values omitted) is stored in WeightLossIncentive7.
“Financial incentive-based approaches for weight loss,” Journal of the American Medical Association by Volpp, John, Troxel, et al., Vol. 300, no. 22, pp. 2631-2637 (Dec. 2008)
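The relationship among the three weight loss datasets (one combined table with missing values, plus two single-outcome tables with missing values omitted) can be sketched in Python as follows. The records below are hypothetical placeholders, not the package data, and the helper `complete_cases` is an illustrative name rather than part of the package.

```python
# Hypothetical records, NOT the package data; None marks a missing value.
records = [
    {"Group": "Control", "WeightLoss": 2.0, "Month7Loss": 1.5},
    {"Group": "Control", "WeightLoss": None, "Month7Loss": None},
    {"Group": "Incentive", "WeightLoss": 10.0, "Month7Loss": None},
    {"Group": "Incentive", "WeightLoss": 8.5, "Month7Loss": 6.0},
]

def complete_cases(rows, column):
    """Keep Group plus one outcome column, dropping rows where it is missing."""
    return [
        {"Group": r["Group"], column: r[column]}
        for r in rows
        if r[column] is not None
    ]

# Analogue of the 4-month-only table (Group and WeightLoss)
four_month = complete_cases(records, "WeightLoss")
# Analogue of the 7-month-only table (Group and Month7Loss)
seven_month = complete_cases(records, "Month7Loss")
```

Each derived table can have a different number of rows, which is consistent with the documented sizes of 38, 36, and 33 observations for the combined, 4-month, and 7-month datasets.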
Weight loss after four months with/without a financial incentive
A dataset with 36 observations on the following 2 variables.
WeightLoss | Weight loss (in pounds) after four months |
Group | Treatment group: Control or Incentive |
Researchers investigated whether financial incentives would help people lose weight more successfully. Some participants in the study were randomly assigned to a treatment group that was offered financial incentives for achieving weight loss goals, while others were assigned to a control group that did not use financial incentives. All participants were monitored over a four month period and the net weight change (Before - After in pounds) at the end of this period was recorded for each individual. Then the individuals were left alone for three months with a followup weight check at the seven-month mark to see whether weight losses persisted after the original four months of treatment. This dataset has only the non-missing 4-month data. The 7-month data are in WeightLossIncentive7 and both measurements (including missing values) are in WeightLossIncentive.
“Financial incentive-based approaches for weight loss,” Journal of the American Medical Association by Volpp, John, Troxel, et al., Vol. 300, no. 22, pp. 2631-2637 (Dec. 2008)
Weight loss after seven months with/without a financial incentive
A dataset with 33 observations on the following 2 variables.
Group | Treatment group: Control or Incentive |
Month7Loss | Weight loss (in pounds) after seven months |
Researchers investigated whether financial incentives would help people lose weight more successfully. Some participants in the study were randomly assigned to a treatment group that was offered financial incentives for achieving weight loss goals, while others were assigned to a control group that did not use financial incentives. All participants were monitored over a four month period and the net weight change (Before - After in pounds) at the end of this period was recorded for each individual. Then the individuals were left alone for three months with a followup weight check at the seven-month mark to see whether weight losses persisted after the original four months of treatment. This dataset has only the non-missing 7-month data. The 4-month data are in WeightLossIncentive4 and both measurements (including missing values) are in WeightLossIncentive.
“Financial incentive-based approaches for weight loss,” Journal of the American Medical Association by Volpp, John, Troxel, et al., Vol. 300, no. 22, pp. 2631-2637 (Dec. 2008)
Mortality data over 20 years for 1314 women from Whickham, England
A data frame with 1314 observations on the following 5 variables.
Outcome
Status at 20-year follow-up (Alive or Dead)
Smoker
Smoker at baseline? (No or Yes)
Age
Age (in years at baseline)
AgeGroup
Age group (18-64 or 65+)
Alive
Numeric code for Outcome (1=alive or 0=dead)
Twenty-year mortality, smoking status, and age for 1314 women in Whickham, England. We have named this Whickham2 to distinguish it from Whickham, which is a file in the mosaicData package.
A version of these data is in the mosaicData package, but the data are originally from:
Appleton, D. R., French, J. M., and Vanderpump, M.P. (1996), "Ignoring a Covariate: An Example of Simpson's Paradox," The American Statistician, 50, 340-341.
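The cited article uses these data as an example of Simpson's Paradox, where a comparison made on pooled data reverses once a covariate (here, age group) is taken into account. The Python sketch below shows the arithmetic of such a reversal using hypothetical counts chosen only for illustration; they are not the Whickham2 counts, and the names `counts`, `within`, and `pooled_rate` are invented for this example.

```python
# Hypothetical counts, NOT the Whickham2 data:
# counts[AgeGroup][Smoker] = (deaths, total women)
counts = {
    "18-64": {"Yes": (100, 500), "No": (60, 400)},
    "65+": {"Yes": (45, 50), "No": (170, 200)},
}

# Death rate within each age group, by smoking status
within = {
    age: {smoker: deaths / total for smoker, (deaths, total) in by_smoker.items()}
    for age, by_smoker in counts.items()
}

def pooled_rate(smoker):
    """Death rate for one smoking status, ignoring age group."""
    deaths = sum(counts[age][smoker][0] for age in counts)
    total = sum(counts[age][smoker][1] for age in counts)
    return deaths / total

overall = {smoker: pooled_rate(smoker) for smoker in ("Yes", "No")}
```

In these made-up numbers, smokers have the lower pooled death rate but the higher rate within each age group, because the non-smokers are concentrated in the older group. Checking whether the real data behave the same way would mean tabulating Outcome by Smoker, both overall and within AgeGroup.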
Percentage of different types of words recalled
A dataset with 40 observations on the following 4 variables.
Subject | Code to identify each subject: A to J |
Abstract | Words were abstract? No or Yes |
Frequent | Words were common? No or Yes |
Percent | Percentage of words recalled (out of 25) |
One hundred words were presented to each subject in a randomized order. The goal of the experiment was to see whether some kinds of words were easier to remember than others. In particular, are common words like potato, love, diet, and magazine easier to remember than less common words like manatee, hangnail, fillip, and apostasy? Are concrete words like coffee, dog, kale, and tambourine easier than abstract words like beauty, sympathy, fauna, and guile? There were 25 words each of four kinds, obtained by crossing the two factors of interest, Abstraction (concrete or abstract) and Frequency (common or rare).
This dataset appears in the first edition, but is not used in the second edition.
Data from a student laboratory project, Department of Psychology and Education, Mount Holyoke College.
Results from the online game Words with Friends (solo play)
A data frame with 444 observations on the following 11 variables.
Points
Number of points scored by the author
OppPoints
Number of points scored by opponent ("solo")
WinMargin
Points minus OppPoints, so margin of victory (or loss)
Start
Did the author go first or pass? (first or pass)
Ss
Number of S tiles (0 to 5)
BlanksNumber
Number of Blank tiles (0 to 2)
J
Did the author get the J tile? (1=yes, 0=no)
Q
Did the author get the Q tile? (1=yes, 0=no)
X
Did the author get the X tile? (1=yes, 0=no)
Z
Did the author get the Z tile? (1=yes, 0=no)
Blanks
Number of Blank tiles (0blanks, 1blank, or 2blanks)
Results collected from one of the authors playing the "solo" mode of Words with Friends.
Author's iPhone
Results from an experiment to move wet/dry objects with wrinkled/non-wrinkled fingers
A data frame with 80 observations on the following 7 variables.
Participant
Participant ID (p1 to p20)
Time
Time to move objects (seconds)
Condition
non-wrinkled/dry, non-wrinkled/wet, wrinkled/dry, or wrinkled/wet
Fingers
Status of fingers (non or wrinkled)
Objects
Status of objects (dry or wet)
WrinkledThenNon
Wrinkled first? (1=yes or 0=no)
DryThenWet
Dry first? (1=yes or 0=no)
Each of 20 participants was measured doing a "transfer task" several times under each of four conditions. The transfer task was to pick up an item with the right hand thumb and index finger, pass the item through a small hole and grab it with the left hand, and then put the item into a box that had a hole in the lid. Sometimes the participant's fingers were wrinkled; sometimes the items were sitting in water.
Kareklas, Nettle, and Smulders (2013) "Water-induced finger wrinkles improve handling of wet objects", Biology Letters, http://dx.doi.org/10.1098/rsbl.2012.0999
Data from the Youth Risk Behavior Surveillance System
A data frame with 13387 observations on the following 6 variables.
ride.alc.driver
1=rode with a drinking driver in past 30 days or 0=did not
female
1=female or 0=male
grade
Year in high school: 9, 10, 11, or 12
age4
Age (in years)
smoke
Ever smoked? 1=yes or 0=no
DriverLicense
Have a driver's license? 1=yes or 0=no
This dataset is derived from the 2007 Youth Risk Behavior Surveillance System (YRBSS), which is an annual survey conducted by the Centers for Disease Control and Prevention (CDC) to monitor the prevalence of health-risk youth behaviors. This dataset focuses on whether or not youths have recently (in past 30 days) ridden with a drunk driver.
http://www.cdc.gov/HealthyYouth/yrbs/index.htm
Risky behavior (riding with a drinking driver) in youths
A dataset with 13387 observations on the following 6 variables.
ride.alc.driver | 1=rode with a drinking driver in past 30 days or 0=did not |
female | 1=female or 0=male |
grade | Year in high school: 9, 10, 11, or 12 |
age4 | Age (in years) |
smoke | Ever smoked? 1=yes or 0=no |
DriverLicense | Have a driver's license? 1=yes or 0=no |
This dataset is derived from the 2007 Youth Risk Behavior Surveillance System (YRBSS), which is an annual survey conducted by the Centers for Disease Control and Prevention (CDC) to monitor the prevalence of health-risk youth behaviors. This dataset focuses on whether or not youths have recently (in past 30 days) ridden with a drunk driver.
This dataset was renamed as YouthRisk for the second edition.
The article "Which Young People Accept a Lift From a Drunk or Drugged Driver?" in Accident Analysis and Prevention (July 2009, pp. 703-9) provides more details.
A more recent version of the full dataset is available at http://www.cdc.gov/brfss/technical_infodata/surveydata.htm.
Survey of students in grades 9-12 concerning health-risk behaviors
A dataset with 500 observations on the following 6 variables.
Sleep
Average hours sleep on school night (10 or more hours, 9 hours, down to 4 or less hours)
Sleep7
Seven or more hours of sleep? (0=no or 1=yes)
SmokeLife
Ever smoked? (No or Yes)
SmokeDaily
Regular smoker? (No or Yes)
MarijuaEver
Ever smoked marijuana? (0=no or 1=yes)
Age
Age (in years)
Data from the Centers for Disease Control's Youth Risk Behavior Surveillance System (YRBSS).
This data set is from the first edition, but not used in the second edition.
http://www.cdc.gov/HealthyYouth/yrbs/index.htm
Data from 220 cases in Florida where a "Stand your ground" defense was used.
A data frame with 220 observations on the following 5 variables.
Convicted
Was the defendant convicted? (No or Yes)
IndWhiteVictim
Was the victim white? (1=yes or 0=no)
IndWhiteDefendant
Was the defendant white? (1=yes or 0=no)
VictimRace
Race of the victim (Minority or White)
DefendantRace
Race of the defendant (Minority or White)
Inspired by the Trayvon Martin case, this dataset combines fatal and non-fatal cases of assault in Florida in which the defendant used the Stand Your Ground law in defense. These data show Simpson's Paradox. Race of the victim is more important than race of the defendant.
Data from Tampa Bay Times, male plus female cases, as of 2/8/15 – final posted data http://www.tampabay.com/stand-your-ground-law/nonfatal-cases http://www.tampabay.com/stand-your-ground-law/fatal-cases