From zero to Matlab in six weeks - with Freshmen
Course URL: http://geoweb.princeton.edu/people/simons/FRS-SESC.html
In this Freshman Seminar, you will combine field observations of the natural world with quantitative modelling and interpretation to answer questions like: "How have Earth and human histories been recorded in the geology of Princeton, the Catskills, and Spain, and what experiments can you do to query such archives of the past?" In the classroom, through problem sets, and around campus, you will gain practical experience collecting geological and geophysical data in geographic context, and analyzing these data using statistical techniques such as regression and time series analysis, with the programming language Matlab. During the required one-day field trip and over the week-long Fall break field trip, you will engage in research projects that focus on the cycles and shifts in Earth's shape, climate, and life that occur now on timescales of days, and have been recorded in rocks over timescales of millions of years. The classroom component of this Freshman Seminar will have graded (bi)weekly assignments built around on-campus data collection, data preparation or analysis, and scientific programming. A significant part of your assessment comes from writing assignments that teach you to communicate your scientific results, and culminate in an original research paper and an oral presentation for an audience of peers, Freshman Seminar alumni, and invited guests from the university community.
This is a science class: students should come prepared with an aptitude for, and a willingness to learn, the quantitative aspects of scientific inquiry. Scientific writing is an integral part of this seminar and its assessment. We teach and require the use of LaTeX and BibTeX!
What do we want students to know and be able to do at the end of the course?
Students will be exposed to natural science at its most fundamental level, which is: making models from data - as relating to the Geosciences (but without much explicit 'domain' knowledge and certainly no rote learning of any kind). The core skills that we will teach and expect the students to master are:
- Making observations: measurement, instrumentation, interpretation. This involves equipment (smartphones, at the most basic level, have GPS receivers, magnetometers, and accelerometers!). Unlike those in other departments, with the possible exception of Astrophysics, our experiments are not actively conducted. The only experiments that we have are the ones that nature did for us (often millions of times, under varying circumstances), in the "natural laboratory". This also involves our own minds: what is important, what is interesting? What do our eyes see, how do we distill the essence of such information, and how do we record it without mechanical equipment (e.g. by sketching, in physical or digital field notebooks)?
Assessment is by grading of their lab reports and field notes.
- Predictive modeling: data analysis, statistical inference, computational techniques. This involves computer code. Unlike in Computer Science courses, we will teach students how to write computer code for a specific purpose. As such it is a much more gentle introduction to algorithms and data structures. We will 'build the hammer' as we are 'making the cabinet' with it - we won't just build 'an awesome hammer'. But like Computer Science courses, it is about proper computer programming, to which we introduce the students, from scratch, and bring them up to a basic level of proficiency, enough for them to meaningfully build on Objective 1 above.
Assessment is by grading of their code samples, or directly of the figures and data analysis conducted using the code that they wrote.
- Scientific writing: reporting, referencing, and communicating uncertainty. Few courses on campus teach scientific writing per se. As with the example of Computer Science above, scientific writing isn't just about logic, presentation and proper sentence structure; it is about having something to say, wanting to say it, and saying it well. It is also about how to access and use the literature, quote the sources in the proper manner (which varies widely by discipline, but is more or less consistent through the natural sciences), and present uncertainty (without falling into the journalistic trap of simply presenting two opposing viewpoints) while exercising sound scientific judgment. We will teach the formal aspects of this (making it look good) but also the ethical aspects (attribution, plagiarism, reproducible research, data and code sharing practices, etc).
Assessment is by grading of their oral presentations and written papers.
What sorts of activities or assignments will help students meet these objectives?
Ad 1 above, the essential vehicles are the campus exercises (outdoors) and the field trip. On campus, students learn to use the equipment; off-campus, they learn how to use it for an actual research purpose. On-campus exercises end up as short written laboratory reports, which are graded. Field trip research activities are recorded in an edited (by the student) laboratory (field) notebook, which is graded, and evolve into an end-of-year research project, which is proposed, fine-tuned, peer-reviewed, and written up and separately graded in the form of a substantial piece of written work. The lab notebooks in question, a vital part in the practicing scientist's life, are both physical (for use in the field) and digital (for more digested and edited field notes, usually compiled and annotated at the end of the day while the memory is fresh).
Ad 2 above, we introduce the students to Matlab (as a computer programming language!) in a one-to-two-hour formal session every week (during "class time") and in a one-hour informal help session every week (during "lunch time" - a very effective low-threshold venue). It's hard to teach someone how to code, but it is workable in the intimate setting of our course - since the field trip location and budget limit our enrollment, we are in the fortunate position of being able to bring everyone up to speed relatively quickly. The classroom will be "flipped" to some degree. We are working on producing short video clips (inspired by Minute-Physics and Minute-Earth - hits on YouTube!) that introduce students to a feature of the computer language (both "low-level", e.g. what are 'strings' and 'structures', how do 'for' and 'while' loops work, what is the difference between a 'script' and a 'function', how to control 'input-output' features?; and "high-level", e.g. how to make and annotate a histogram, how to mesh and render a surface, how to perform linear regression, how to do hypothesis testing?). The student can watch these ahead of time (and repeatedly afterwards), try a small exercise at home, and come to class for a thorough review, troubleshooting, and further learning. On campus and in the field the students will collect so much (digital) data that the taught tools will be immediately applied to the real-world setting from which they derive. This will strengthen students' learning by doing, and by working on a project (often in a small team) about which they care and feel ownership. We count on formal coding support from the instructors, but also on informal support (using the "they join us for lunch and they help students with their questions in the meantime" model that has worked so well over the past few years in our other courses) from graduate students and post-docs working in our Department and elsewhere.
Ad 3 above, there are first, the 'mechanics': we require all written work in LaTeX and all bibliographic information in BibTeX; these are open-source packages used by most physical scientists and engineers. Second, the 'fundamentals': how to formulate and pursue a hypothesis, how to track down sources and published data, how to report on data collected by the students themselves, how to represent data and illustrate inference using graphs and diagrams. Third, the 'ethics': what is an appropriate use of sources, and what isn't? Who owns an idea? What is an appropriate level of editorializing? In short, how to write well, for a specific audience, of professional (though not necessarily active in the same field) scientists. We envisage a stepped approach to producing a final paper that comes very close to actual practice: (a) brainstorm the hypothesis among peers and with the instructors, (b) find the material, (c) produce a short proposal, (d) circulate a draft, (e) peer-review the final 'submitted' version, (f) incorporate suggested changes, (g) print and proof for final 'publication'. Our past experience shows us that specific guidance (and graded feedback) on each of these steps, individually, produces very impressive results at the end.
How do we evaluate student performance?
Least loved part of the process: by grading! But with detailed individual feedback, and by group feedback during class time! (Students do complain it's hard -- harsh even, but they find the process fair, and value the feedback, and our accessibility.) Every week something is due, and during class there are 5-minute quizzes every week (about some reading assigned, or on a piece of code that 'works', etc). Inasmuch as the materials under 1 and 3 above are 'traditional', we don't require long philosophical considerations.
As far as 2 is concerned: how to specifically evaluate proficiency in writing computer code? Programming assignments (both the 'educational' ones that ask for specific pieces that do something specific, and the 'research' ones where the objective is to 'get something done', as a manageably broken-down piece of a larger computational task of data analysis or statistical inference) will result in 'code'. Functions, scripts, subroutines, compiled sets: the students' efforts under any of these possible forms (but all in the programming language Matlab) will be posted to us, such that we are able to try and run their code (and suggest modifications!). Code will be evaluated for usability, portability, speed of execution, and yes, 'elegance' - which will be taught by example and grown by experience.
Page numbers refer to the PDF document posted below under 01. The cheat sheet of which we write is the PDF document below under 02, with the LaTeX source code under 03.
Two example Matlab functions are posted under 04 and 05, and the image that they load is found under 06. Note that you need to verify that they are named correctly after downloading.
We begin by showing some illustrations of what Matlab can do in the context of geological data analysis. Example: the analysis of images, e.g. a scanned picture of a layered rock (p02, p03 top), and how Matlab's canny edge detection routine is able to identify and define the boundaries between the layers (p03, bottom). Admittedly, canny is rather sophisticated, but using mostly homegrown tools we can get some very good results for image segmentation ourselves (p04). At this point the students should be well motivated... how cool is that... but how? How do we achieve control over our digital environment? Through programming. Next item of business is to introduce the very small handful of the most basic Matlab commands. My personal list of top-fifteen commands (p05, p06) leads to a cheat-sheet with function names, what they do, and examples (p07) that we hand out in hard-copy (and also as the LaTeX source code, since we teach LaTeX also... but that's another story), and we ask the students to continue filling out this sheet (by hand or in their own digital version) throughout the semester. We continue with a short list of other vital commands (p08) that we introduce by functionality types: addressing, and logical operators. All of these commands were shown "live" in Matlab as projected on-screen. Next up is to actually use the command line to execute basic functions, culminating in the writing of the first script. In our case, we show the students how to load one of the images that we prepared, and cut a profile through the color values and show them as a function of the "column" number, as running left to right through the image (p09). After spending time with the students until they manage this procedure on their own command line, we turn their script into their first function, which goes through the entire suite of manipulations and ends in the production of an annotated figure, saved in a publication-quality format (p10), ready for inclusion in their first lab report.
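To give a flavor of that first script, here is a minimal sketch of cutting a profile through an image. It is not the course's own code (that is posted under 04 and 05): the image is a made-up stand-in built in memory so that the sketch runs on its own, and the variable names and the output file name `profile_sketch.png` are hypothetical.

```matlab
% Hypothetical stand-in for a scanned image of a layered rock: a grayscale
% image whose brightness varies from left to right in sinusoidal bands.
% (In class the image is loaded from a file instead, e.g. with IMREAD.)
ncols = 200; nrows = 100;
bands = 128 + 100*sin(2*pi*(1:ncols)/40);   % one brightness value per column
img = uint8(repmat(bands, nrows, 1));       % the same bands in every row

% Cut a horizontal profile through the middle row of the image
row = round(nrows/2);
prof = double(img(row, :));                 % convert to double for arithmetic

% Plot the gray value as a function of column number, left to right
figure
plot(1:ncols, prof, 'k-')
xlabel('column number')
ylabel('gray value')
title(sprintf('Profile through row %d', row))
axis tight

% Save the annotated figure (hypothetical file name) for a lab report
print('-dpng', '-r300', 'profile_sketch.png')
```

Wrapping these lines in a function that takes the image and the row number as inputs is then a small step.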
And then it's on to more sophisticated things, such as making histograms of the colors in the image (p11). This first session ends with a lab homework assignment that is extremely simple (p12), and mostly designed for students to walk through the process - of generating a simple figure (p13) - from the beginning to the end (and also, to teach them how to use our course management system for the submission of assignments). Whether throughout the session or explicitly at the end (p14), we spend time on the basics of code hygiene, and the essential pieces of what makes a good Matlab function.
We send students around campus to collect GPS data of a series of control points and a control line, which they revisit daily over the course of the week. We give the students a csv-file data template, provide them with a data set collected by a student in a previous year, and walk them through a simple script to plot the data, with different symbols, means and standard deviations added, in a georeferenced plot, and then iterate through the process so they can identify outliers in the data sets. We discuss various ways of doing data analysis and leave them with an assignment to integrate their own data with that of the class, which we discuss in great detail during the next class.
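A histogram of the colors can be sketched along the following lines. This is an illustration under stated assumptions, not the lecture code: the image is random and made up, and the counting is done with `accumarray` so that every step is visible (Matlab's built-in histogram functions would do the same job in one call).

```matlab
% Hypothetical stand-in for a loaded RGB image: random 8-bit colors,
% with the red channel made brighter on average than the other two.
nrows = 60; ncols = 80;
img = uint8(zeros(nrows, ncols, 3));
img(:,:,1) = uint8(128 + 127*rand(nrows, ncols));  % red: values 128..255
img(:,:,2) = uint8(255*rand(nrows, ncols));        % green: values 0..255
img(:,:,3) = uint8(255*rand(nrows, ncols));        % blue: values 0..255

% Count how often each of the 256 possible values occurs in each channel
counts = zeros(256, 3);
for k = 1:3
  vals = double(img(:,:,k));
  % ACCUMARRAY adds a 1 to bin (value+1) for every pixel
  counts(:,k) = accumarray(vals(:) + 1, 1, [256 1]);
end

% One annotated panel per color channel
figure
names = {'red', 'green', 'blue'};
for k = 1:3
  subplot(3, 1, k)
  bar(0:255, counts(:,k), 1)
  xlim([0 255])
  ylabel('pixel count')
  title(sprintf('%s channel', names{k}))
end
xlabel('color value (0-255)')
```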
In the first PDF document posted below under 07, you will find the plot generated during class. In the first plot (entitled "data collection") a massive outlier is easily identifiable. In the second plot (entitled "data curation") that outlier has been removed, and a new plot has been made. The lab assignments themselves are posted below under 08 and 09, and a simple script with the command sequence to ingest the data and plot the results is posted under 10. A template, a slightly more complete data file from the instructor, and a student's result are given to the student for processing during this class, and these are posted under 11, 12, and 13. Note that the csv files may need to be edited, i.e. the header line removed, etc., for which we recommend the simplest of text editors.
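The step from "data collection" to "data curation" can be sketched as follows. The coordinates are invented for illustration, and the outlier rule used here (distance from the median, measured in scaled median absolute deviations) is our choice for this sketch, not necessarily the procedure followed in class; the threshold `nsig` is likewise an assumption.

```matlab
% Hypothetical repeated GPS fixes of one control point, in decimal degrees.
% The numbers are made up for illustration; the ninth fix is a gross outlier.
lon = [-74.6551 -74.6552 -74.6550 -74.6553 -74.6551 ...
       -74.6552 -74.6550 -74.6551 -74.6420 -74.6552];
lat = [ 40.3461  40.3462  40.3460  40.3461  40.3463 ...
        40.3461  40.3462  40.3460  40.3585  40.3461];

% Robust center and spread: the median and the median absolute deviation
% (MAD) are barely affected by a single wild value, unlike mean and std.
% The factor 1.4826 makes the MAD comparable to a standard deviation.
madlon = 1.4826*median(abs(lon - median(lon)));
madlat = 1.4826*median(abs(lat - median(lat)));

% Flag any fix more than nsig robust deviations from the median (a choice!)
nsig = 5;
bad = abs(lon - median(lon)) > nsig*madlon | ...
      abs(lat - median(lat)) > nsig*madlat;

% Mean and standard deviation before and after curation
fprintf('all data : mean lon %.4f, std %.4f\n', mean(lon), std(lon))
fprintf('curated  : mean lon %.4f, std %.4f\n', mean(lon(~bad)), std(lon(~bad)))

% "Data collection" (everything) next to "data curation" (outlier removed)
figure
subplot(1, 2, 1)
plot(lon, lat, 'bo'); hold on
plot(mean(lon), mean(lat), 'r+', 'MarkerSize', 12); hold off
xlabel('longitude (deg)'); ylabel('latitude (deg)')
title('data collection')
subplot(1, 2, 2)
plot(lon(~bad), lat(~bad), 'bo'); hold on
plot(mean(lon(~bad)), mean(lat(~bad)), 'r+', 'MarkerSize', 12); hold off
xlabel('longitude (deg)'); ylabel('latitude (deg)')
title('data curation')
```

The median is used for flagging because a single massive outlier inflates the ordinary standard deviation so much that the outlier can hide behind it.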
Now we let the students be creative! Everyone came to class with their personal GPS data file, and we teach them how to integrate everyone's data sets into a master plot showing the results of the locations of all the control points collected during the past week. We walk through the various ideas of what to do next posted in the lab assignments (see Week 2).
The master plot is posted below under 14, and the script to make it under 15. Note that of course there will be a need for additional data files, and some edits will be required. We make these scripts "live" with the students in class, and then give them to the students with the task of modifying them for their own purposes. The students are learning how to troubleshoot, how to program, and how to deal with their own computational problems. Getting through the simple task of making one "simple" plot takes a full three weeks of sessions! But they manage.
Now it is time for some more serious fare. We spend the lecture and lab time on the creation of a series of increasingly sophisticated scripts to load in a data file containing some piece of terrestrial topography (a NetCDF file, as it happens), and we show how to plot it, annotate it, draw profiles through it (horizontal, vertical, hand-drawn), look at slopes, draw polygons by hand for analysis, make histograms, and analyze roughness. Each of the little scripts also writes a figure file (a PNG file) for inclusion into a lab report.
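As a rough sketch of how several students' files can be merged into one master plot: the file names, student names, and coordinates below are all hypothetical, and the files are written to a temporary directory first so that the example is self-contained. We read with `dlmread` and a row offset to skip the header line; newer Matlab versions recommend `readmatrix` instead.

```matlab
% Write three tiny made-up csv files (hypothetical names and values), each
% with a header line, to stand in for the files students bring to class.
workdir = tempdir;
students = {'alice', 'bob', 'carol'};
for k = 1:numel(students)
  fname = fullfile(workdir, sprintf('%s_gps.csv', students{k}));
  fid = fopen(fname, 'w');
  fprintf(fid, 'point,lon,lat\n');
  for p = 1:4
    % four control points, each perturbed slightly per student
    fprintf(fid, '%d,%.5f,%.5f\n', p, ...
            -74.6550 + 0.001*p + 0.00005*k, 40.3460 + 0.0005*p - 0.00005*k);
  end
  fclose(fid);
end

% Read every file back in, skipping the header line, and overlay the points
symbols = {'o', 's', '^'};
alldata = [];
figure; hold on
for k = 1:numel(students)
  fname = fullfile(workdir, sprintf('%s_gps.csv', students{k}));
  d = dlmread(fname, ',', 1, 0);      % row offset 1 skips the header
  plot(d(:,2), d(:,3), symbols{k})
  alldata = [alldata; d];             % grow the master data set
end
hold off
legend(students, 'Location', 'northwest')
xlabel('longitude (deg)'); ylabel('latitude (deg)')
title('master plot of all control points')

% Class-wide mean position of each control point
for p = 1:4
  sel = alldata(:,1) == p;
  fprintf('point %d: mean lon %.5f, mean lat %.5f\n', ...
          p, mean(alldata(sel,2)), mean(alldata(sel,3)))
end
```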
Under 16 below, we list the lecture materials, and under 17, we make the entire script sequence available, including the data files, the figures generated, and the LaTeX source. Under 18 below, we list an example homework that we assigned to build on these competences. Some customization will be needed for your own purposes, of course!
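For readers without the data file at hand, the kinds of operations involved can be sketched on a synthetic surface. Everything below is made up for illustration: the grid, its spacing, and the roughness measure (standard deviation of a detrended profile) are our assumptions, not the contents of the script sequence posted under 17. With a real NetCDF file, the grid would be read with Matlab's `ncread` rather than computed.

```matlab
% Hypothetical stand-in for a gridded elevation model. With a real NetCDF
% file one would read the grid with NCREAD; the surface below is made up.
x = 0:100:10000;                      % easting in m, 100 m spacing
y = 0:100:8000;                       % northing in m
[X, Y] = meshgrid(x, y);
Z = 500 + 0.02*X + 150*sin(2*pi*X/4000).*cos(2*pi*Y/5000);   % elevation in m

% Map view with annotation
figure
subplot(2, 2, 1)
imagesc(x, y, Z); axis xy image
colorbar
xlabel('easting (m)'); ylabel('northing (m)'); title('elevation (m)')

% Horizontal (west-east) profile through the middle row of the grid
irow = round(numel(y)/2);
prof = Z(irow, :);
subplot(2, 2, 2)
plot(x, prof)
xlabel('easting (m)'); ylabel('elevation (m)')
title(sprintf('profile at northing %d m', y(irow)))

% Slope magnitude from finite differences, converted to degrees
[dZdx, dZdy] = gradient(Z, x, y);
slope = atand(sqrt(dZdx.^2 + dZdy.^2));
subplot(2, 2, 3)
hist(slope(:), 30)
xlabel('slope (degrees)'); ylabel('number of grid cells')

% A crude roughness measure: the standard deviation of the profile
% after removing its best-fitting straight line
p = polyfit(x, prof, 1);
rough = std(prof - polyval(p, x));
fprintf('profile roughness: %.1f m\n', rough)
```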
In this lecture we play with cycles! We have the students generate some time series that are superpositions of sines and cosines, and then we try to find their amplitudes, periods, and phase angles. We cover two methods. The first is by simple correlation analysis. Generate a signal of a certain period, amplitude and phase, and then generate a trial signal with which to correlate the first. The idea is that when the correlation between the signals is high, you must have found the right period and amplitude! (Note the special role of the phase in the correlation analysis!) But that's not all. We also teach the students how to formally invert for the best-fitting amplitude given a trial period, or a set of trial periods. For this we use Matlab's "pinv" function, and with this, we teach the basics of regression analysis.
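Both methods can be sketched on a synthetic signal as follows. The amplitude, period, phase, noise level, and range of trial periods are made up for illustration, and the variable names are ours; the functions handed out in class are posted under 20 and may well be organized differently.

```matlab
% Synthetic time series: one sinusoid with known amplitude, period, phase,
% plus noise. All numbers are made up for illustration.
t = (0:0.5:200)';                     % time in hours
A = 3; T = 24; phi = pi/5;            % the "unknowns" we will try to recover
y = A*cos(2*pi*t/T - phi) + 0.5*randn(size(t));

% Method 1: correlate the data with trial cosines over a range of periods.
% The trial signal has zero phase, so a phase mismatch lowers the
% correlation even at the right period: hence the role of the phase.
trialT = 10:0.5:40;
r = zeros(size(trialT));
for k = 1:numel(trialT)
  c = corrcoef(y, cos(2*pi*t/trialT(k)));
  r(k) = c(1, 2);
end

% Method 2: for each trial period, invert for the coefficients of a cosine
% and a sine by least squares, using the pseudo-inverse. Since
% A*cos(wt - phi) = a*cos(wt) + b*sin(wt) with a = A*cos(phi), b = A*sin(phi),
% the problem is linear in a and b, and the phase needs no separate search.
misfit = zeros(size(trialT));
coef = zeros(2, numel(trialT));
for k = 1:numel(trialT)
  G = [cos(2*pi*t/trialT(k)) sin(2*pi*t/trialT(k))];
  coef(:, k) = pinv(G)*y;
  misfit(k) = norm(y - G*coef(:, k));
end
[~, best] = min(misfit);
Tbest = trialT(best);
Aest = sqrt(sum(coef(:, best).^2));
phiest = atan2(coef(2, best), coef(1, best));
fprintf('period %.1f h, amplitude %.2f, phase %.2f rad\n', Tbest, Aest, phiest)

figure
subplot(2, 1, 1); plot(trialT, r, 'o-')
xlabel('trial period (h)'); ylabel('correlation coefficient')
subplot(2, 1, 2); plot(trialT, misfit, 'o-')
xlabel('trial period (h)'); ylabel('misfit norm')
```

Fitting a cosine and a sine together is what makes the regression insensitive to the unknown phase, which the single-cosine correlation is not.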
We play with knocking out values, adding noise, interpolating and cleaning up data, and so on. We do all of this on the command line, but then turn our code into a script, and subsequently into a function. The students leave the lecture with two functions by which they can replicate the analysis that we did in class. They also leave with two more sets of commands and a data set (a magnetic time series collected by us on campus) which we ask them to analyze - e.g. for diurnal variations, i.e. the 24-hour period, as part of a lab assignment.
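Knocking out values and cleaning up afterwards might look like the sketch below. The signal, the noise level, and the position of the gap are made up, and linear interpolation is only one of several possible ways to fill a gap.

```matlab
% Made-up hourly time series with a 24-hour cycle
t = (0:95)';                          % four days of hourly samples
y = 10 + 2*cos(2*pi*t/24);

% Add noise, then "knock out" a stretch of values by setting them to NaN
ynoisy = y + 0.2*randn(size(t));
ygappy = ynoisy;
ygappy(30:35) = NaN;

% Clean up: interpolate linearly across the gap using only the good samples
good = ~isnan(ygappy);
yfilled = ygappy;
yfilled(~good) = interp1(t(good), ygappy(good), t(~good), 'linear');

figure
plot(t, y, 'k-', t, ygappy, 'bo', t(~good), yfilled(~good), 'r*')
legend('noise-free signal', 'noisy data with gap', 'interpolated values')
xlabel('time (h)'); ylabel('signal')
```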
Under 19 below, we list a set of figures output by various invocations of the command-line goings on, scripts, and ultimately, functions. Under 20 below, we give the Matlab source files and the sample data set.
We are professionals! Forget correlation and regression analysis, we are now going to do spectral analysis via the windowed periodogram. Move over, Fourier! We spend a great deal of time on the massaging of data (column csv format, as in previous weeks), and the proper date formatting. We breeze past the correlation-coefficient approach from the previous week (writing a first script with the students) and then move on to the Fourier analysis approach (writing a second script with the students).
We perform crude significance analysis on the power spectral density. It's incredibly gratifying that we should see the diurnal cycle (in temperature records!), but even more so, that we can teach Freshmen to conduct sophisticated time series analysis with state-of-the-art statistical methods in, ultimately, just a few lines of code.
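Those few lines might look something like the sketch below. It is an illustration under stated assumptions rather than the code posted under 22: the temperature record is synthetic, the Hann taper is written out by hand to avoid toolbox dependencies, and the significance level assumes a white-noise background, which real temperature records need not have.

```matlab
% Made-up hourly temperature record: a diurnal cycle plus noise
N = 24*30;                            % thirty days of hourly samples
dt = 1;                               % sampling interval in hours
t = (0:N-1)'*dt;
temp = 15 + 4*cos(2*pi*t/24 - 1) + 2*randn(N, 1);

% Remove the mean and taper with a Hann window (written out by hand so that
% no toolbox is needed) to reduce spectral leakage
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/(N-1));
x = (temp - mean(temp)).*w;

% One-sided windowed periodogram, normalized by the window power
X = fft(x);
nf = floor(N/2) + 1;
f = (0:nf-1)'/(N*dt);                 % frequency in cycles per hour
P = abs(X(1:nf)).^2*dt/sum(w.^2);
P(2:end-1) = 2*P(2:end-1);            % fold in the negative frequencies

% Crude significance level, ASSUMING a white-noise background: periodogram
% values then scatter about the noise level like an exponential variable,
% whose 95th percentile is -log(0.05)/log(2) times its median
thresh = -log(0.05)/log(2)*median(P(2:end));

[~, ipk] = max(P(2:end));             % skip zero frequency
fpk = f(ipk + 1);
fprintf('strongest period: %.1f h\n', 1/fpk)

figure
loglog(f(2:end), P(2:end), 'b-'); hold on
loglog(f([2 end]), [thresh thresh], 'r--'); hold off
xlabel('frequency (cycles per hour)'); ylabel('power spectral density')
legend('windowed periodogram', 'crude 95% level')
```

The median is used to estimate the background level because, unlike the mean, it is hardly affected by the few frequency bins that carry the diurnal peak.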
Under 21 below, we give two figures output by the two main pieces of code that we write with the students in class, and under 22, we give all the source code.
01 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 1.2MB Jul14 16) Week 01 Lecture materials
02 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 304kB Jul14 16) Week 01 Cheat sheet
03 From zero to Matlab in six weeks - with Freshmen ( 3kB Jul14 16) Week 01 Cheat sheet (LaTeX source)
04 From zero to Matlab in six weeks - with Freshmen (Matlab File 2kB Jul16 16) Week 01 Matlab function 1 (rename to varves2.m)
05 From zero to Matlab in six weeks - with Freshmen (Matlab File 1015bytes Jul16 16) Week 01 Matlab function 2 (rename to varves3.m)
07 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 47kB Jul16 16) Week 02 Lecture materials
08 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 2MB Jul16 16) Week 02 Lab assignment I
09 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 394kB Jul16 16) Week 02 Lab assignment II
10 From zero to Matlab in six weeks - with Freshmen (Matlab File 2kB Jul16 16) Week 02 Matlab script (rename to simplescript.m)
11 From zero to Matlab in six weeks - with Freshmen (Comma Separated Values 134bytes Jul16 16) Week 02 Data template (rename to template.csv)
12 From zero to Matlab in six weeks - with Freshmen (Comma Separated Values 327bytes Jul16 16) Week 02 Template GPS data set (rename to fjsimonsl02a.csv)
13 From zero to Matlab in six weeks - with Freshmen (Comma Separated Values 16kB Jul16 16) Week 02 Sample GPS data set (rename to adriantl01a_H.csv so it can be read by simplescript.m above)
14 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 9kB Jul16 16) Week 03 Lecture materials
15 From zero to Matlab in six weeks - with Freshmen (Matlab File 1kB Sep12 16) Week 03 Matlab function (rename to emgps.m)
16 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 995kB Oct16 16) Week 04 Lecture materials
17 From zero to Matlab in six weeks - with Freshmen (Zip Archive 4.2MB Oct16 16) Week 04 Matlab scripts for topographic analysis, generated figures, sample topography data set, and lecture LaTeX source
18 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 562kB Oct16 16) Week 04 Lab assignment
19 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 1.2MB Oct16 16) Week 05 Lecture materials
20 From zero to Matlab in six weeks - with Freshmen (Zip Archive 1MB Oct16 16) Week 05 Matlab scripts for time-series analysis, and sample geomagnetic data set
21 From zero to Matlab in six weeks - with Freshmen (Acrobat (PDF) 64kB Oct16 16) Week 06 Lecture materials
22 From zero to Matlab in six weeks - with Freshmen (Zip Archive 371kB Oct16 16) Week 06 Matlab scripts for power-spectral density (Fourier) analysis, and sample temperature data set
References and Notes:
The Elements of MATLAB Style by Richard K. Johnson
Essential MATLAB for Engineers and Scientists by Brian H. Hahn and Daniel T. Valentine