Using MATLAB to understand distributions: Pokémon GO
Summary
Learning Goals
After completing this activity, students will be able to
- Import and curate data from a comma-separated format
- Calculate the median and mean from a set of observations
- Calculate the standard deviation and interquartile range from a set of observations
- Plot a histogram showing the frequency of certain observations
- Use the Lilliefors test to check whether data are normally distributed
- Transform data to a space where they are normally distributed
- Quantify biodiversity using the Shannon index and the Simpson index
- Compare statistics and their sensitivity to extreme values
- Compare the Shannon index and Simpson index as measures of biodiversity
The problem set is designed so that students must come up with their own strategy for executing each task. In some places, explicit MATLAB functions are suggested; in others, only the desired computation is described. This leaves the algorithmic design and implementation up to the student.
This problem set can also be used as a vehicle to help students learn "best practices" in programming, either as a teaching tool or as an assessment tool for skills that have been previously taught.
Context for Use
This problem set should take about 60 minutes to complete and could be used as a lab activity or as an independent, self-paced activity.
Before beginning this problem set, students should be comfortable
- importing data
- writing scripts and/or functions
- reading the help documentation to understand the syntax of built-in functions
- performing basic plotting tasks
This problem set assumes that the students have had some experience with representing data in a tabular layout. It may be beneficial if the students have had some experience with basic statistical tools such as how to calculate the mean, standard deviation and median of simple datasets.
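As a rough illustration of the statistical background mentioned above, the following sketch computes the mean, standard deviation and median of a small vector and shows how each reacts to one extreme value. The numbers are invented for illustration and are not taken from the spawn dataset; only core MATLAB functions are used.

```matlab
% Toy counts, invented for illustration (not taken from the spawn dataset)
counts = [2 3 3 4 5 6 7];
countsWithOutlier = [counts 500];    % same data plus one extreme value

% Mean and standard deviation shift strongly when the extreme value is added
meanBefore = mean(counts);
meanAfter  = mean(countsWithOutlier);
stdBefore  = std(counts);
stdAfter   = std(countsWithOutlier);

% The median moves only slightly
medianBefore = median(counts);
medianAfter  = median(countsWithOutlier);

fprintf('mean:   %.2f -> %.2f\n', meanBefore, meanAfter);
fprintf('std:    %.2f -> %.2f\n', stdBefore, stdAfter);
fprintf('median: %.2f -> %.2f\n', medianBefore, medianAfter);
```

Students who can read and predict the output of a snippet like this are likely to have the background the problem set assumes.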
This problem set is designed to be an introduction to working with data, but it could be modified to include a data collection component or have some of the "tips" removed for students with more computational experience. For use in an entirely computational course, it would also be easy to add exercises for plotting the data based on geographical location and/or merging in additional data from online sources. For example, the included dataset provides the number of times various species of Pokémon were seen during the summer of 2016 and could be further divided or aggregated by the type of Pokémon. Another possible modification for a computational course would be a comparison of run times for the different implementations of the analysis designed by the students.
Description and Teaching Materials
This activity is a problem set that walks students through the process of analyzing a dataset of Pokémon GO spawn counts. Depending on the course format, students should be able to work on the problems at their own pace. In its ideal laboratory or small group discussion format, the instructor(s) will help provide feedback and troubleshoot issues that arise with code execution. Students should be allowed to fight with the code somewhat, but not to the point of extreme frustration. The final results that the students produce are a collection of figures, code and short answers to questions, so this activity works best if the course already has some mechanism for "turning in" electronic homework.
Choice of programming language
The tools built into core MATLAB and the Statistics and Machine Learning Toolbox make MATLAB a good choice for rapid development of data analytics. One could use another language to perform the statistical analysis or the presentation of the data, but MATLAB is my preferred choice for this activity for two reasons: its excellent documentation and its unified "feel" as a language. (1) Data analytics often requires the use of algorithms that others have developed. In this problem set specifically, students are introduced to the Lilliefors test for normality. MATLAB's documentation makes it easier to find usage and syntax information, along with pointers to the primary literature explaining the algorithms and their use. (2) MATLAB's visualization and analysis tools "feel" like they belong to the same language. For students who are new to programming and code design, a single consistent feel across both analysis and visualization is very important.
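To give a sense of what calling the Lilliefors test looks like, here is a minimal sketch that is not taken from the problem set itself. It assumes the Statistics and Machine Learning Toolbox is installed (it provides lillietest) and uses a synthetic log-normal sample in place of the real spawn counts.

```matlab
% Requires the Statistics and Machine Learning Toolbox (lillietest).
rng(1);                               % fixed seed so the sketch is repeatable
x = exp(1 + 0.8 * randn(200, 1));     % synthetic log-normal sample, not real spawn data

% h = 1 means the null hypothesis of normality is rejected at the 5% level.
% lillietest may warn when p falls outside its tabulated range.
[hRaw, pRaw] = lillietest(x);
[hLog, pLog] = lillietest(log(x));    % log is the natural logarithm

% A different base only rescales the transformed data, so the decision
% is expected to be the same
hLog10 = lillietest(log10(x));

fprintf('raw data:   h = %d, p = %.3g\n', hRaw, pRaw);
fprintf('log data:   h = %d, p = %.3g\n', hLog, pLog);
fprintf('log10 data: h = %d\n', hLog10);
```

The synthetic sample is strongly skewed, so the test on the raw values should reject normality, while the log-transformed values are drawn from a normal distribution by construction.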
Instructor notes and tips
- Most students find the process of importing and parsing the text file to be difficult, even after they inspect the file and understand how it is parsed. Depending on how much exposure the students have had to file I/O, this section can be given to them to puzzle over, or they can be provided with an example script.
- For older versions of MATLAB, the example code in "Section 2.3 Frequency of Events" may need to be modified. Such an example is included but commented out in the instructor file. See also the MathWorks blog post on arithmetic expansion.
- For the latter part of this exercise, the students are asked to consider only the non-zero observations. Some students remember to follow the instructions and make a new variable replacing zeros with NaNs, but then use their original variable instead when performing calculations.
- MATLAB's function "log" takes the natural logarithm of the data. Depending on the context of the course, this activity could be expanded to include a discussion of taking logs with different bases. In particular, Section 4 discusses transformations of data into spaces where the data are normally distributed. The values of the transformed data depend on the base of the log, but the shape of the distribution does not.
- Depending on the context, for example in a biostats course, this activity could benefit from a more in-depth discussion of multiple-hypothesis testing and the concept of p-hacking. Currently, there is only a minor discussion of it in section 4.2. See also the Wikipedia article on p-hacking.
- Depending on the context, for example in an applied math course, section 5.1 could benefit from a more in-depth discussion of numerical stability. In some situations, when the probability of a single observation (p) is very small, the term p log p can be unstable; in particular, it cannot be evaluated directly when p is exactly zero. This leads some to set a threshold for p below which p log p is forced to zero.
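The p log p issue in the last note can be sketched as follows. This is a minimal illustration rather than the code in the instructor file: the counts are invented, the cutoff is simply p > 0, and the Simpson index is written in one common form (1 minus the sum of squared proportions), which may differ from the convention used in the problem set.

```matlab
% Hypothetical counts for five species (invented for illustration)
counts = [50 30 15 5 0];

p = counts / sum(counts);             % proportion of observations per species

% Naive Shannon index: 0*log(0) evaluates to 0*(-Inf) = NaN in MATLAB
shannonNaive = -sum(p .* log(p));

% Guarded version: treat p*log(p) as 0 by keeping only positive proportions.
% A larger cutoff could be used in place of 0 if a threshold is preferred.
pPos = p(p > 0);
shannon = -sum(pPos .* log(pPos));    % Shannon index, natural log

% Simpson index in the form 1 - sum(p.^2); other conventions exist
simpson = 1 - sum(p.^2);

fprintf('naive Shannon:   %g\n', shannonNaive);
fprintf('guarded Shannon: %.4f\n', shannon);
fprintf('Simpson:         %.4f\n', simpson);
```

The unobserved species contributes nothing to either index once the guard is applied, which makes a convenient starting point for discussing where a threshold should sit.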
Files included
For courses that are already set up to work with MATLAB, live script files are included: one with example code and solutions for the instructor, and a blank one for the students.
- pokemonGoDistributions_instructor_livescript (MATLAB Live Script 91kB Jan9 19)
- pokemonGoDistributions_student_livescript (MATLAB Live Script 10kB Jan9 19)
For courses that do not already have MATLAB, these live scripts have been exported to PDF.
- pokemonGoDistributions_student_pdf (Acrobat (PDF) 58kB Jan9 19)
- pokemonGoDistributions_instructor_pdf (Acrobat (PDF) 154kB Jan9 19)
For all courses, an example CSV-formatted dataset is included.
- Pokemon Go Spawn Dataset (Comma Separated Values 20kB Oct5 16)
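One possible way to import and curate a file of this kind is sketched below. It does not read the real dataset: it writes a tiny stand-in CSV to a temporary location, and the column names "species" and "count" are hypothetical, so the actual header and layout should be checked by inspecting the real file first.

```matlab
% Build a tiny stand-in CSV so the sketch runs on its own.
% The column names "species" and "count" are hypothetical.
csvFile = [tempname '.csv'];
fid = fopen(csvFile, 'w');
fprintf(fid, 'species,count\n');
fprintf(fid, 'Pidgey,120\nRattata,95\nMew,0\n');   % invented counts
fclose(fid);

T = readtable(csvFile);           % one row per species
counts = T.count;                 % numeric column as a column vector

% Curate: keep the original vector and make a copy in which
% zero counts are marked as missing (NaN)
countsNonZero = counts;
countsNonZero(countsNonZero == 0) = NaN;

delete(csvFile);                  % remove the temporary file
disp(T);
```

Keeping both counts and countsNonZero under distinct names makes it easier to see which variable is being used in later calculations.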
Assessment
1) Did the students answer the short answer questions correctly, such as "Out of 73 species, how many of them were seen more frequently than the mean?"
2) Did the student include the figures/plots that describe the distributions of spawn events, with the extra lines for mean, median, etc.? This could also include points for readability/design of the plots.
3) Did the student include the code, with comments, that was used to analyze this dataset and could be reused on other datasets? This could also include points for specific best practices that were taught as extensions to this problem set.
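For reference when grading the first two items, the kind of computation and figure being assessed might look like the sketch below. The counts are invented (ten species rather than the 73 in the dataset), and xline assumes MATLAB R2018b or newer; older versions would need a different way to draw the reference lines.

```matlab
% Invented counts for ten species (not the real dataset)
counts = [1 2 2 3 4 5 8 12 30 133];

mu  = mean(counts);
med = median(counts);
nAboveMean = sum(counts > mu);    % species seen more often than the mean

% Histogram with reference lines for the mean and the median
figure;
histogram(counts, 10);
hold on
xline(mu,  'r-',  'mean');
xline(med, 'k--', 'median');
hold off
xlabel('Number of sightings');
ylabel('Number of species');

fprintf('%d of %d species were seen more often than the mean\n', ...
    nAboveMean, numel(counts));
```

With skewed counts like these, only a few species exceed the mean, which is the pattern the short-answer question is meant to draw out.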
References and Resources
For further reading:
- A PDF description and explanation of the Lilliefors test: https://www.utdallas.edu/~herve/Abdi-Lillie2007-pretty.pdf
- A publication of the US Department of Agriculture about quantification of biodiversity: Gaines, Harrod, and Lehmkuhl. Monitoring Biodiversity: Quantification and Interpretation. PNW-GTR-443, March 1999. Available online at https://pdfs.semanticscholar.org/16ca/66e9ba9a23fe9fce7358b907d3de32f1306d.pdf