Using MATLAB to understand distributions: Pokémon GO

Benjamin Bratton
Princeton University

Summary

This problem set teaches students how to describe real-world distributions. Data science skills covered include measures of central tendency and spread, transformations of distributions, exploratory statistics, and scripted data analysis. The problem set also helps students explore metrics of biodiversity and work with datasets that are not their own.

Learning Goals

After completing this activity, students will be able to

  • Import and curate data from a comma-separated format
  • Calculate the median and mean from a set of observations
  • Calculate the standard deviation and inter-quartile range from a set of observations
  • Plot a histogram showing the frequency of certain observations
  • Use the Lilliefors test to test whether distributions are normally distributed
  • Transform data to a space where they are normally distributed
  • Quantify biodiversity using the Shannon index and the Simpson index
  • Compare statistics and their sensitivity to extreme values
  • Compare the Shannon index and Simpson index as measures of biodiversity

MATLAB is the main vehicle of instruction during this activity. Students are required to write code to analyze the dataset as well as perform and interpret the analysis. Students will gain confidence in writing their own tools to interact with data instead of simply executing pre-built code.

The problem set is designed in a way that forces students to come up with their own strategy for executing the desired task. In some places, explicit MATLAB functions are suggested; in others, only a description of the required computation is given. This leaves the algorithmic design and implementation up to the student.

This problem set can also be used as a vehicle to help students learn "best practices" in programming, either as a teaching tool or as an assessment tool for skills that had been previously taught.

Context for Use

This problem set is geared toward students taking an introductory biology course, particularly one with either a lab or discussion section. Students need a working installation of MATLAB and the Statistics and Machine Learning Toolbox to perform this activity. While it could be demonstrated in a lecture style format, it is designed for a format where individuals or pairs write their own code to interact with the dataset provided.

This problem set should take about 60 minutes to complete and could be used as a lab activity or an independent, self-paced activity.

Before beginning this problem set, students should be comfortable
- importing data
- writing scripts and/or functions
- reading the help documentation to understand the syntax of built-in functions
- performing basic plotting tasks

This problem set assumes that the students have had some experience with representing data in a tabular layout. It may be beneficial if the students have had some experience with basic statistical tools such as how to calculate the mean, standard deviation and median of simple datasets.
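
As a refresher on those basic tools, the following is a minimal sketch of the four summary statistics used in the problem set. The counts are made-up numbers, not the provided dataset, and the variable names are illustrative only. In many MATLAB releases, iqr is part of the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical spawn counts for eight species (made-up numbers, not the real dataset)
counts = [3 5 8 12 15 21 40 600];

avg    = mean(counts);    % arithmetic mean, pulled upward by the extreme value 600
med    = median(counts);  % middle of the sorted values, robust to the extreme value
sd     = std(counts);     % sample standard deviation (normalized by N-1)
iqrVal = iqr(counts);     % inter-quartile range

fprintf('mean = %.1f, median = %.1f\n', avg, med);
fprintf('std  = %.1f, IQR    = %.1f\n', sd, iqrVal);

% Dropping the single extreme observation moves the mean far more than the median
trimmed = counts(counts < 600);
fprintf('without 600: mean = %.1f, median = %.1f\n', mean(trimmed), median(trimmed));
```

Comparing the two printed mean/median pairs gives students a concrete feel for which statistics are sensitive to extreme values.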

This problem set is designed to be an introduction to working with data, but it could be modified to include a data collection component or have some of the "tips" removed for students with more computational experience. For use in an entirely computational course, it would also be easy to add exercises for plotting the datasets based on their geographical location and/or merging in additional data from online sources. For example, the included dataset provides the number of times various species of Pokémon were seen during the summer of 2016 and could be further divided or aggregated by the type of Pokémon. Another modification for a computational course would be a comparison of run times for the different implementations of the analysis designed by the students.
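
One possible shape for the aggregation-by-type extension is sketched below. The table, its column names (species, type, count), the single simplified type per species, and the values are all hypothetical; the provided dataset may be organized differently.

```matlab
% Hypothetical table: species, a single simplified type, and sightings (made-up values)
species = {'Pidgey'; 'Rattata'; 'Zubat'; 'Magikarp'; 'Psyduck'};
type    = categorical({'Normal'; 'Normal'; 'Poison'; 'Water'; 'Water'});
count   = [120; 95; 60; 30; 25];
T = table(species, type, count);

% Assign a group number to each row based on its type
[groupIdx, typeNames] = findgroups(T.type);

% Sum the sightings within each group
totalByType = splitapply(@sum, T.count, groupIdx);

% Collect the result in a small summary table
byType = table(typeNames, totalByType, 'VariableNames', {'type', 'totalCount'});
disp(byType)
```

The same findgroups/splitapply pattern works with other reducers, such as @mean or @median, if students are asked to compare types rather than total them.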

Description and Teaching Materials

This activity is a problem set that walks students through the process of analyzing a dataset of Pokémon GO spawn counts. Depending on the course format, students should be able to work on the problems at their own pace. In its ideal laboratory or small group discussion format, the instructor(s) will help provide feedback and troubleshoot issues that arise with code execution. Students should be allowed to fight with the code somewhat, but not to the point of extreme frustration. The final results that the students produce are a collection of figures, code and short answers to questions, so this activity works best if the course already has some mechanism for "turning in" electronic homework.

Choice of programming language

The tools built into core MATLAB and the Statistics and Machine Learning Toolbox make MATLAB a good choice for rapid development of data analytics. One could use another language to perform the statistical analysis or the presentation of the data, but MATLAB is my preferred choice for this activity due to its excellent documentation and unified "feel" as a language. (1) Data analytics often requires the use of algorithms that others have developed. Specifically for this problem set, students are introduced to the Lilliefors test for normality. MATLAB's documentation makes it easier to find usage and syntax information, as well as references to the primary literature explaining the algorithms and their use. (2) MATLAB's visualization and analysis tools "feel" like they belong to the same language. For students who are new to programming and code design, a single unified feel across both analysis and visualization is very important.
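
To illustrate how little code the Lilliefors test requires, here is a minimal sketch using lillietest from the Statistics and Machine Learning Toolbox. The data are a synthetic log-normal sample generated for this example, not the Pokémon GO dataset, and the variable names are illustrative.

```matlab
rng(1);                              % fixed seed so the sketch is repeatable
x = exp(randn(200, 1) + 3);          % synthetic log-normal sample (assumption, not real data)

% lillietest returns h = 1 when normality is rejected at the default 5% level
[hRaw, pRaw] = lillietest(x);        % raw, strongly skewed data
[hLog, pLog] = lillietest(log(x));   % log is the natural logarithm

% A different base only rescales the values by a constant, so the decision is unchanged
hLog10 = lillietest(log10(x));

fprintf('raw data:   h = %d, p = %.3g\n', hRaw, pRaw);
fprintf('log data:   h = %d, p = %.3g\n', hLog, pLog);
fprintf('log10 data: h = %d\n', hLog10);
```

lillietest reports p-values only within a tabulated range, so MATLAB may print a warning when the p-value falls outside it; the returned h is still valid.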

Instructor notes and tips

  • Most students find the process of importing and parsing the text file to be difficult, even after they inspect the file and understand how it is parsed. Depending on how much exposure the students have had to file I/O, this section can be given to them to puzzle over, or an example script can be provided.
  • For older versions of MATLAB, the example code in "Section 2.3 Frequency of Events" may need to be modified. Such an example is included but commented out in the instructor file. See also the MathWorks blog post on arithmetic expansion.
  • For the latter part of this exercise, the students are asked to consider only the non-zero observations. Some students follow the instructions and make a new variable in which zeros are replaced with NaNs, but then use their original variable when performing calculations.
  • MATLAB's function "log" takes the natural logarithm of the data. Depending on the context of the course, this activity could be expanded to include a discussion of taking logs with different bases. In particular, Section 4 discusses transformations of data into spaces where the data are normally distributed. The values of the transformed data depend on the base of the log, but the shape of the distribution does not.
  • Depending on the context, for example in a biostats course, this activity could benefit from a more in-depth discussion of multiple-hypothesis testing and the concept of p-hacking. Currently, there is only a minor discussion of it in Section 4.2. See also the Wikipedia article on p-hacking.
  • Depending on the context, for example in an applied math course, Section 5.1 could benefit from a more in-depth discussion of issues of numerical stability. In some situations, when the probability of a single observation (p) is very small, p log p can be unstable. This leads some to set a threshold for p below which p log p is forced to zero.
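
The thresholding idea in the last note can be sketched as follows. The counts are made up, the threshold value is an assumption chosen for illustration, and the Simpson index is written in its 1 - sum(p^2) form, which is only one of several conventions; the instructor file may use a different one.

```matlab
% Hypothetical counts per species (made-up values); zeros mark species never seen
counts = [50 30 15 5 0 0];

p = counts / sum(counts);        % relative frequency of each species

% Naive evaluation: 0 * log(0) is 0 * (-Inf), which MATLAB evaluates to NaN
naiveShannon = -sum(p .* log(p));

% Guarded evaluation: treat p*log(p) as 0 when p is at or below a threshold
threshold = 0;                   % assumption: only exact zeros are dropped here
terms = zeros(size(p));
keep = p > threshold;
terms(keep) = p(keep) .* log(p(keep));
shannon = -sum(terms);

% Simpson index in its 1 - sum(p^2) form
simpson = 1 - sum(p .^ 2);

fprintf('naive Shannon   = %g\n', naiveShannon);
fprintf('guarded Shannon = %.4f\n', shannon);
fprintf('Simpson (1 - sum p^2) = %.4f\n', simpson);
```

Forcing the term to zero is consistent with the limit of p log p as p approaches zero, which is why the guarded sum is the conventional reading of the Shannon index when some species have no observations.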

Files included

For courses that are already set up to work with MATLAB, included are live script files with example code/solutions for the instructor as well as a blank live script for the students.

For courses that do not already have MATLAB, these live scripts have been exported to PDF.

For all courses, an example CSV-formatted dataset is included.
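
For instructors who want to hand out a starting point for the import step, the following is a minimal sketch using readtable. It writes its own tiny stand-in file so that it runs without the real dataset; the column names species and count and all values are hypothetical and may not match the provided file.

```matlab
% Write a tiny stand-in CSV so the sketch runs without the real dataset.
% The column names "species" and "count" are hypothetical.
csvFile = [tempname '.csv'];
fid = fopen(csvFile, 'w');
fprintf(fid, 'species,count\n');
fprintf(fid, 'Pidgey,120\nRattata,95\nZubat,0\n');
fclose(fid);

% readtable takes column names from the header row and infers column types
T = readtable(csvFile);

% Basic curation: check the size and confirm the counts were read as numbers
disp(size(T))
assert(isnumeric(T.count), 'count column was not parsed as numeric');

delete(csvFile);   % clean up the temporary file
```

With the real file, only the path passed to readtable and the column names used afterward would need to change.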

Assessment

There are three major items that are produced by the problem set. Depending on the focus of the course, the weighting of importance on these three can be adjusted.

1) Did the students answer the short answer questions correctly, such as "Out of 73 species, how many of them were seen more frequently than the mean?"
2) Did the student include the figures/plots which describe the distributions of spawn events, with the extra lines for mean, median, etc.? This could also include points for readability/design of the plots.
3) Did the student include the code, with comments, that was used to analyze this dataset and could be reused on other datasets? This could also include points for specific best practices that were taught as extensions to this problem set.
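
For reference when grading the first two items, a submission might look roughly like the sketch below. The counts are made-up numbers rather than the 73 species in the provided dataset, and the variable names and plot styling are illustrative only.

```matlab
% Hypothetical spawn counts (made-up values, not the real dataset)
counts = [2 3 3 4 5 6 8 10 14 45];

avg = mean(counts);
med = median(counts);

% Logical comparison: how many species were seen more often than the mean?
nAboveMean = nnz(counts > avg);
fprintf('%d of %d species exceed the mean of %.1f\n', nAboveMean, numel(counts), avg);

% Histogram with vertical reference lines for the mean and median
figure;
histogram(counts, 10);               % 10 bins, an arbitrary choice
hold on
yl = ylim;                           % current y-axis limits, used for the line height
plot([avg avg], yl, 'r-',  'LineWidth', 1.5);
plot([med med], yl, 'k--', 'LineWidth', 1.5);
hold off
xlabel('Number of sightings');
ylabel('Number of species');
legend('counts', 'mean', 'median');
```

In a skewed sample like this one, far fewer than half of the species exceed the mean, which is the point the short answer question is designed to bring out.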

References and Resources

For further reading:

  • A PDF description and explanation of the Lilliefors test
    • https://www.utdallas.edu/~herve/Abdi-Lillie2007-pretty.pdf
  • A publication of the US Department of Agriculture about quantification of biodiversity
    • Gaines, Harrod, and Lehmkuhl. Monitoring Biodiversity: Quantification and Interpretation. PNW-GTR-443, March 1999. Available online at https://pdfs.semanticscholar.org/16ca/66e9ba9a23fe9fce7358b907d3de32f1306d.pdf