
# NNN Blog

The College Board released its annual report on student loan debt earlier this week. The report provides many opportunities for teaching QR in multiple disciplines (education, economics, sociology, political science, and psychology at least). What I want to focus on is the question: Who is in the population or sample?

Consider this representative excerpt from the report's executive summary:

STUDENT DEBT
About 60% of students who earned bachelor's degrees in 2012-13 from the public and private nonprofit institutions at which they began their studies graduated with debt. They borrowed an average of \$27,300, an increase of 13% over five years and 19% over a decade.

–In 2013, 40% of borrowers with outstanding education debt owed less than \$10,000, and another 29% owed between \$10,000 and \$25,000; 4% of borrowers owed \$100,000 or more. This debt includes borrowing for both undergraduate and graduate studies.

–In 2014, 2.5 million federal Direct Loan borrowers were in repayment plans that limit their payments to a specified percentage of their incomes. These borrowers constituted 14% of those in repayment plans; they held 28% of the total outstanding debt in repayment plans.

–In the third quarter of 2013-14, 9% of borrowers with outstanding Federal Direct Student Loans were in default. These borrowers held 5% of total outstanding debt.

–For-profit institutions accounted for 32% of those who entered repayment in 2010-11, and 44% of those who defaulted by the end of September 2013.

As you read through these bullet points, note the careful QR reading required. While all of the statements come under the same broad heading (which may lead careless readers to think all of the statistics come from the same population or sample), in fact there are clearly at least 7 different, overlapping populations:

• Those who earned a BA in 2012-13
• Those who earned a BA in 2012-13 and took out a student loan
• Those who held student debt in 2013 (regardless of when they graduated/acquired that debt and excluding those who had previously held student loan debt but paid it off). (Note in particular that this sample includes student debt from graduate or professional school while the two previous samples would not include such loans.)
• Those who held student debt in 2014
• Those who held student debt in the third quarter of 2013-14
• Those who entered repayment in 2010-11
• Those who entered repayment in 2010-11 and defaulted by September 30, 2013.

Some of the differences probably matter little (e.g., those who held student debt in 2014 vs. those who held student debt in the third quarter of 2013-14). Other differences are enormously important. Look, for example, at the preamble, which says that among the 60% of BA graduates in 2012-13 who acquired debt, the average debt was \$27,300. The very next sentence tells us that 4% of "borrowers" "in 2013" owed \$100,000 or more. The report is very clear that the latter figure "includes borrowing for both undergraduate and graduate studies," but it is easy for readers to fail to see that the populations behind the former and latter figures are entirely different.

Note that I am not criticizing the College Board. I might suggest a tweak here or there that would, to my eye, make the distinctions clearer. But the bigger point is that we need to teach students to be continually asking themselves "Who is in the population/sample?" to avoid mistaken conclusions of their own.
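One way to practice asking that question is to work out what the shares in the excerpt imply. If a group makes up a share p of the borrowers in some population and holds a share q of that population's debt, then the group's mean balance is q/p times the population's mean balance. The sketch below applies that arithmetic to two bullets from the excerpt. The function name and labels are mine, not the College Board's, and the two rows deliberately keep their different denominators.

```python
# Shares quoted in the College Board excerpt, in percent.
# Each tuple: (label, share of borrowers, share of outstanding debt).
# The two rows refer to DIFFERENT populations, exactly as in the report.
groups = [
    ("income-based plans (among borrowers in repayment plans, 2014)", 14, 28),
    ("in default (among Federal Direct Loan borrowers, Q3 2013-14)", 9, 5),
]


def relative_mean_balance(borrower_share, debt_share):
    """Group mean balance divided by the mean balance of its own population."""
    return debt_share / borrower_share


ratios = {label: relative_mean_balance(b, d) for label, b, d in groups}
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f} times the population mean balance")
```

Read this way, borrowers in income-based plans carry about twice the average balance of their population, while borrowers in default carry a little more than half the average balance of theirs. Because the denominators differ, the two ratios should not be compared directly without first asking who is in each population.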

## Standard Errors or Typical Errors?

Nate Silver has made election prediction sexy. This election cycle I've seen many estimates of the probability of a Republican takeover of the Senate. And when I say "many estimates" I don't mean different sources; I mean vastly different probabilities. Today I see the New York Times gives Republicans a 70% chance while the [Washington Post](http://www.washingtonpost.com/wp-dre/politics/election-lab-2014) puts the figure at 95%. (Silver sets the probability at 68.9%. We can take up the topic of over-articulated precision in another blog!) In this context, those numbers mean very different things.
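A quick way to see how far apart these forecasts are is to restate each probability as odds in favor, p / (1 - p). The short sketch below does that for the three figures quoted above; the function and variable names are only illustrative.

```python
# Forecast probabilities of a Republican Senate takeover quoted in the post.
forecasts = {
    "New York Times": 0.70,
    "Nate Silver": 0.689,
    "Washington Post": 0.95,
}


def odds_in_favor(p):
    """Convert a probability p (0 < p < 1) into odds in favor, p / (1 - p)."""
    return p / (1.0 - p)


odds = {source: odds_in_favor(p) for source, p in forecasts.items()}
for source, value in odds.items():
    print(f"{source}: {forecasts[source]:.1%} is about {value:.1f} to 1")
```

A 70% chance is a bit better than 2 to 1, while a 95% chance is 19 to 1. In odds terms, the two newspapers differ by roughly a factor of eight.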

So, what's the source of variation in these estimates? Earlier in this election cycle one explanation was that people were estimating different probabilities. You could find estimates for "the probability that Republicans would take control if the election were held today" and for "the probability that Republicans will take control on election night." In the middle of summer, these are two vastly different concepts because the latter allows for the wide range of events that might shift elections over the course of three or four months.

But surely that can't be the explanation for the divergence in today's Times and Post estimates. Even if they are asking different questions regarding timing, we are fewer than four days from election day and many ballots have already been cast. It seems clear that the differences here are due to model specification. When students learn statistics, we teach them how to construct standard errors to account for random sampling error. There's nothing wrong with that, but as these election forecasts make clear, the far more typical specification error often swamps sampling error.

Fortunately, the idea of omitted variables bias or other specification error can be intuitively understood by undergraduates regardless of their mathematical prowess. Happy Election Weekend!
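For readers who want to see omitted variables bias in action, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data are synthetic, the true coefficients (1.0 on x and 2.0 on an omitted variable z) are invented, and it has nothing to do with any actual election model. It regresses y on x alone and reports the usual standard error.

```python
import math
import random

rng = random.Random(2014)  # fixed seed so the run is repeatable
n = 5000
TRUE_EFFECT_X = 1.0  # assumed true coefficient on x
TRUE_EFFECT_Z = 2.0  # assumed true coefficient on the omitted variable z

x = [rng.gauss(0, 1) for _ in range(n)]
z = [0.5 * xi + rng.gauss(0, 1) for xi in x]  # z is correlated with x
y = [TRUE_EFFECT_X * xi + TRUE_EFFECT_Z * zi + rng.gauss(0, 1)
     for xi, zi in zip(x, z)]


def simple_ols(xs, ys):
    """Slope and conventional standard error from regressing ys on xs alone."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    std_error = math.sqrt(rss / (m - 2) / sxx)
    return slope, std_error


slope, std_error = simple_ols(x, y)
print(f"true effect of x:    {TRUE_EFFECT_X}")
print(f"estimate omitting z: {slope:.3f} (standard error {std_error:.3f})")
```

Under these assumptions the estimated slope settles near 2 rather than 1, because x picks up credit for the part of z that moves with it. The reported standard error is small, so it measures sampling noise faithfully and says nothing about the much larger specification error.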

## Compared to What: Infectious Disease Edition

The Washington Post has a great online infographic comparing attributes of the spread of Ebola to those of more common diseases such as Chicken Pox or Influenza. The site really drives home how important it is to provide context when presenting data to people who are not intimately familiar with the topic. (While we have some understanding of Influenza transmission from personal experience, that experience doesn't translate well into the statistics on transmission, for example.) I could imagine giving students the Ebola data only and asking them to draw some conclusions. Then I could give them the comparison data and ask how that added information alters their understanding of what is happening and how we might want to respond.

## Spurious Correlation

Tyler Vigen has some great examples of correlations that are surely not evidence of causation. One downside of the list of examples, however, is that they are all time series. The underlying problem is that the two time series considered are not stationary; both are trending, which explains the high correlation. (The one exception is the example of the numbers of Nicolas Cage movies and people drowning after falling into swimming pools. I may be wrong, but to my eye those two series look stationary.)
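The trending explanation is easy to demonstrate with made-up data. The sketch below builds two series that are unrelated except that each has its own upward linear trend (the slopes, noise level, and seed are arbitrary assumptions), then compares the correlation of the levels with the correlation of the period-to-period changes.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    m = len(a)
    mean_a = sum(a) / m
    mean_b = sum(b) / m
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def first_differences(series):
    """Change from each period to the next, which removes a linear trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(7)  # fixed seed so the run is repeatable
n = 500
# Two independent series: each is a linear trend plus its own random noise.
series_a = [0.5 * t + rng.gauss(0, 1) for t in range(n)]
series_b = [0.3 * t + rng.gauss(0, 1) for t in range(n)]

corr_levels = pearson(series_a, series_b)
corr_changes = pearson(first_differences(series_a), first_differences(series_b))
print(f"correlation of levels:  {corr_levels:.3f}")
print(f"correlation of changes: {corr_changes:.3f}")
```

The levels are almost perfectly correlated simply because both series rise over time. Once the trend is differenced away, the correlation falls to roughly zero, which is what we would expect from two series that have nothing to do with each other.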

One other interesting fact I learned from the site: The number of people who die by becoming tangled in bedsheets has more than doubled in the last decade to almost 800. What can explain this steadily growing national epidemic?!

## All Depends on How You Count

The most recent jobs report is great fodder for student discussions of basic QR-in-measurement issues. This story from Yahoo provides a succinct summary. The table in the middle of the article motivates discussions about:

• How do we define unemployment? Do we want to adjust that measure for underemployed?
• Is employment always employment? Does it matter what kind of job people hold?
• How does the current labor market compare with that at other times (in particular, the time before the financial crisis)?

Ultimately, the table could motivate a good, all-around discussion of the importance of seeking multiple measures if you want a complete understanding of a complex issue like economic recovery.
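To make the first discussion question concrete, students could compute a headline unemployment rate and a broader rate from the same counts. The sketch below uses entirely hypothetical numbers (in millions), not figures from the jobs report or the Yahoo story. The broader rate is patterned on the structure of the Bureau of Labor Statistics' U-6 measure, which adds marginally attached workers and people working part time for economic reasons.

```python
# HYPOTHETICAL counts in millions, chosen only to illustrate the arithmetic.
labor_force = 156.0          # employed plus unemployed
unemployed = 9.0             # jobless and actively looking for work
marginally_attached = 2.0    # want work but have not searched recently
part_time_economic = 7.0     # want full-time work but can only find part-time

# Headline rate: unemployed as a share of the labor force.
headline_rate = unemployed / labor_force

# Broader rate: also counts the marginally attached and involuntary
# part-timers, and adds the marginally attached to the denominator.
broad_rate = (unemployed + marginally_attached + part_time_economic) / (
    labor_force + marginally_attached
)

print(f"headline rate: {headline_rate:.1%}")
print(f"broader rate:  {broad_rate:.1%}")
```

With these invented counts, the broader rate comes out at about twice the headline rate. The point for students is that both numbers describe the same labor market, and which one is "the" unemployment rate depends on how you count.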