# How do I create and interpret histograms? Binning data for analysis in the Earth sciences

## An introduction to representing data with a histogram

Have you ever wondered how your quiz grade of 83% compared to the rest of the class? You may have seen your instructor display quiz scores on a special type of bar graph called a histogram (Figure 1). A histogram is a visual representation of a single variable (typically numeric) sorted into bins (or buckets) of values, and it can help you answer this question.

Based on the histogram, your quiz score of 83% was about the same as most of the class, who earned B's (in the 80%-90% range). This range is the tallest bar in Figure 1, displaying the highest count or frequency.

In Earth and Environmental Sciences, histograms help visualize the underlying spread of values in a dataset, provide a feel for the high and low values of the data, and identify any potential outliers. Histograms are especially useful for large datasets (e.g., more than 20 points) that aren't easily understood by looking through the raw numbers.

## When do I use histograms?

Histograms are a simple and powerful first step in analyzing one variable -- they are typically the first graph made.

For example, one renewable energy source is a wave power generator, which uses ocean wave oscillations to generate energy. Can you determine which location would be better for a wave generator that is designed for higher wave heights: Coos Bay, Oregon or Bill Baggs State Park, Florida? What if the generator was designed to maximize energy with higher frequency waves? Examine the histograms of wave height at each location (Figure 2) to answer these questions.

Because histograms help visualize the underlying distribution of a dataset, they are closely related to statistical measures such as mean, median, mode, and skew. Without any complex calculations, a histogram immediately shows the modality of a distribution: one dominant peak (unimodal, like the wave height histograms) or two (bimodal, as when half the class earns an A- and most others score a C+). As in the example above, you can immediately answer questions about how often you expect to see values of different magnitudes.

## How do I create a histogram from my data?

• Bin: a range of values for your data
• Count: the number of values in each bin
• Frequency: similar to count, but can be expressed as the total number OR a percent of the total

Once you have collected your data, you will need to define bins (ranges of values) into which you will sort the data. The selection of bin range is critical to the usefulness of your histogram. Sometimes the type of data you have will suggest a good size for the bins, and sometimes you will decide the bin ranges after looking at your data a bit. For example, if you are counting pebbles on a stream bed, you will likely use pre-set size categories (e.g., small is 4-8 mm, medium is 8-16 mm, etc.) as your bin ranges.

The San Gabriel River drains from the San Gabriel and San Bernardino Mountains in semi-arid southern California. Streamflow here is driven by winter precipitation and varies significantly over the course of the year. How would you initially characterize the flow of water through this river?

This is a set of monthly mean streamflow data (Q = discharge) for the San Gabriel River in California in cubic feet per second (cfs). These data are from USGS stream gage 11087020 at the Whittier Narrows Dam from 1995-2022:

Mean Monthly Discharge, San Gabriel River

| Month | Mean Q (cfs) |
| --- | --- |
| Jan | 372 |
| Feb | 410 |
| Mar | 149 |
| Apr | 103 |
| May | 84 |
| Jun | 46 |
| Jul | 38 |
| Aug | 43 |
| Sep | 33 |
| Oct | 72 |
| Nov | 106 |
| Dec | 212 |

The general steps for creating a histogram are:

Step 1: Select your bin range--use established ranges for the topic or look at your data and estimate what would be good.

Step 2: Sort your data by value (low-to-high or high-to-low).

Step 3: Count the number of data points that fall into each bin range. This is your frequency. You may report this as a raw count or convert it to a percent of the total.

Step 4: Plot the data as bars with the frequency on the vertical axis and the bins on the horizontal axis.

Now apply these steps to the San Gabriel River data.

Step 1: Select your bins in increments of 100 cfs.

Step 2: Sort your data by value (high-to-low). Currently the data are listed chronologically, January to December.

Step 3: Count the number of data points that fall into each bin range. This is your count.

Step 4: Plot the data as bars with your bin values on the x-axis (horizontal) and the count values on the y-axis (vertical). Don't forget to shade in your bars and add axis labels and a title to your plot!
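The four steps above can also be sketched in a few lines of Python. This is only a minimal illustration using the twelve mean monthly discharge values listed earlier; the function name `count_fixed_width_bins` is made up for this example, and it assumes that a value falling exactly on a bin boundary (such as 100 cfs) is counted in the higher bin, which the steps above do not specify.

```python
# Mean monthly discharge (cfs), January to December, from the data listed above.
discharge_cfs = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]

BIN_WIDTH = 100  # Step 1: bins in increments of 100 cfs


def count_fixed_width_bins(values, width):
    """Return {bin_lower_edge: count} for bins covering [edge, edge + width).

    Assumes non-negative whole-number values and that a value exactly on a
    boundary belongs to the higher bin (an assumption made for this sketch).
    """
    top = (max(values) // width + 1) * width          # upper edge of the last bin
    counts = {lower: 0 for lower in range(0, top, width)}
    for value in sorted(values):                      # Step 2: sort the data
        counts[(value // width) * width] += 1         # Step 3: count per bin
    return counts


counts = count_fixed_width_bins(discharge_cfs, BIN_WIDTH)

# Step 4: a text stand-in for the bar chart (one '#' per data point).
for lower, count in counts.items():
    label = f"{lower}-{lower + BIN_WIDTH} cfs"
    print(f"{label:>12} | {'#' * count} ({count})")
```

Sorting is not strictly needed for the computer to count correctly; it is kept here only to mirror Step 2 of the hand method.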

Alternatively, you might want to create bins after examining your data. Notice that most of the streamflow values are in the lower bins and the last three bins each contain a single high value. Let's try setting the bin sizes based on the data to see more information.

Step 1: Starting with the sorted dataset from Step 2 above, find the range of your data (the difference between the high and low values).

Step 2: Create bins by dividing the range by the number of bins. What are the bin ranges for 9 bins?

Step 3: Count the number of data points that fall into each bin range.

Step 4: Plot the data.

Step 5: Sometimes histograms are presented as percent frequency instead of counts. Can you convert this histogram to percent frequency?
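If you want to check your hand calculation, the data-driven binning in Steps 1 to 5 can be sketched in Python as follows. This is one possible approach, not the only one: the function name `equal_width_histogram` is hypothetical, and the sketch assumes the highest value is placed in the last bin rather than in a bin of its own.

```python
# Mean monthly discharge (cfs), January to December, from the data listed above.
discharge_cfs = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]

N_BINS = 9


def equal_width_histogram(values, n_bins):
    """Split the data range into n_bins equal-width bins and count each bin."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins                     # Steps 1-2: range / number of bins
    edges = [low + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for value in values:                              # Step 3: count per bin
        # The maximum value would land one past the end, so clamp it
        # into the last bin (an assumption made for this sketch).
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return edges, counts


edges, counts = equal_width_histogram(discharge_cfs, N_BINS)

# Step 5: convert counts to percent frequency.
percent = [100 * count / len(discharge_cfs) for count in counts]

# Step 4: a text stand-in for the plot.
for i, (count, pct) in enumerate(zip(counts, percent)):
    print(f"{edges[i]:6.1f}-{edges[i + 1]:6.1f} cfs | {'#' * count} ({count}, {pct:.1f}%)")
```

The percent frequencies always sum to 100, which is a quick check that every data point was counted exactly once.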

## How many bins should I use?

If you are creating your own bins, the number of bins (and therefore the range each bin covers) will affect how your data may be interpreted. If you do not have enough bins, you will lose detail that may be important. If you have too many bins, you will lose the overall characteristics of your data. In the figure below, with a bin size of 5, you lose enough resolution that you do not see the second peak in this bimodal distribution. With a bin size of 0.2, you start to see too much noise, which gives the false impression that there may be more peaks. The appropriate number of bins for your dataset depends on how many data points you have and the overall spread of the data.
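To experiment with this trade-off yourself, here is a small Python sketch that bins the same values at three different bin widths. The data are hypothetical, invented for this illustration only (they are not the data behind the figure), and the widths 5, 1, and 0.25 were chosen simply to keep the arithmetic exact.

```python
import math

# Hypothetical bimodal data (made up for illustration), clustered near 3 and 7.
values = [2.1, 2.6, 2.8, 3.0, 3.1, 3.3, 3.4, 3.9, 4.4,
          5.2, 6.1, 6.6, 6.8, 7.0, 7.2, 7.3, 7.9, 8.4]


def histogram(values, width):
    """Return {bin_lower_edge: count}, including empty bins between min and max."""
    first = math.floor(min(values) / width)
    last = math.floor(max(values) / width)
    counts = {i * width: 0 for i in range(first, last + 1)}
    for value in values:
        counts[math.floor(value / width) * width] += 1
    return counts


for width in (5, 1, 0.25):
    print(f"\nBin width = {width}")
    for lower, count in histogram(values, width).items():
        print(f"{lower:>5.2f} | {'#' * count}")
```

With these made-up values, the widest bins produce two equal bars and hide both clusters, a width of 1 shows the two peaks, and the narrowest bins break the data into many bars of height 0, 1, or 2 that are hard to interpret.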

## Creating a histogram with a spreadsheet

In practice, you are unlikely to ever construct a histogram "by hand". Remember that histograms are most useful for understanding larger datasets. You will likely be using a spreadsheet, a statistics package, or a programming language like Python. Let's start with a spreadsheet.

Thinking about streamflow in the San Gabriel River, averaging each month over a long time period may not tell us much about how the river actually flows. Did we lose a lot of the variability? Are there more low flows and high flows that we are missing? Here is a dataset of the monthly average flow in this stream from 1995-2022.

Download the document "A very short introduction to Excel" (Acrobat (PDF) 43kB Jun7 23) for a quick reference on the simple Excel functions used here.

Open the file and take a look at the dataset: San Gabriel River monthy discharge 1995_2022.xlsx (Excel 2007 (.xlsx) 17kB Jun6 23)

Step 1: After you download the file, open it in Excel by double-clicking on it. How many years do these data span? How many discharge measurements are there?

Step 2: Plot the histogram

Step 3a: Choose your bin size

Step 3b: Choose the number of bins to plot

## How do I read histograms?

• median: value that splits the dataset in half, such that half of the data are larger and half are smaller
• mean: the most commonly used "average" of a dataset (the sum of the values divided by the number of values)
• mode: the most repeated value (aka the highest bin in your histogram)
• skew: when the data are NOT symmetric and are distributed with a longer tail to one side of the peak

A histogram tells you about the underlying shape of the data (the distribution). For larger datasets (more than 10 data points), histograms are the perfect first step for visualizing data because they show how frequently values fall into a particular bin. When looking at a histogram, keep these questions in mind:

• What is the shape of the data? Is it symmetric, skewed, uniform, or bimodal? (Figure XX)
• Where is the center of the data? The value with the same number of data to the right and to the left is the median (Figure XX)
• What is the average of the data? The arithmetic average of the data is the mean
• What is the most common value? The highest bar is the mode (Figure XX)
• What is the spread of the data? The difference between the lowest and highest value bin is the range

A histogram makes it easy to see which values are most common and which values are least common in a data set. Here is a good visualization of common patterns seen in histograms:

Frequently, statistics such as mean, median, mode, and skew are used to describe the shape and pattern (i.e., the distribution) of the data. These values can be calculated explicitly from the data (see Introductory Statistics) but can also be inferred from the histograms:

When a dataset is unimodal and symmetric, the data are roughly equally distributed on either side of the peak value, and the mean, median, and mode are approximately equal (as shown in Figure XXA above). Frequently, however, data are asymmetric, with a longer tail of values on one side of the peak; this is called skew. For example, a positive or right-skewed dataset has a tail extending toward larger magnitudes, such that the mean is larger than the median (Figure XXB). A negative or left-skewed dataset (Figure XXC) has a tail extending toward lower magnitudes, such that the mean is smaller than the median.
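The mean-versus-median comparison described here can be checked numerically. The following Python sketch uses the twelve mean monthly discharge values for the San Gabriel River given earlier and the standard `statistics` module; treating "mean larger than median" as a sign of right skew is the rule of thumb from the paragraph above, not a formal skewness calculation.

```python
from statistics import mean, median

# Mean monthly discharge (cfs), January to December, from the data listed earlier.
discharge_cfs = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]

mean_q = mean(discharge_cfs)
median_q = median(discharge_cfs)

# Rule of thumb: compare the mean with the median to infer the skew direction.
if mean_q > median_q:
    shape = "right-skewed (positive skew)"
elif mean_q < median_q:
    shape = "left-skewed (negative skew)"
else:
    shape = "roughly symmetric"

print(f"mean = {mean_q} cfs, median = {median_q} cfs, so the data look {shape}")
```

For these data, a few high winter flows pull the mean well above the median, which matches the long right-hand tail seen in the streamflow histogram.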

## Where do you use histograms in Earth science?

Histograms can be used in a wide variety of Earth and environmental science problems, especially when continuous numerical data span a large range of values or magnitudes.

• Ecology: use a histogram to investigate the distribution of plankton size from data collected at different sites.
• Sedimentology: histograms are frequently used to investigate the distribution of sediment sizes (e.g., from pebble counts or sieving). They can be used in conjunction with more advanced techniques, such as cumulative histograms or cumulative distribution functions, to infer information about the depositional environment.
• Seismology: histograms can be a quick way to visualize the magnitude of earthquake events at different locations. They can also show how magnitudes change over time, for example by comparing histograms of earthquakes from two different time periods.
• Hydrology: histograms are used very frequently to look at distributions of rainfall, river discharge, or other climatic processes in order to compare changes in time or location (e.g., does the distribution of rainfall magnitude in Puerto Rico change between 1950-1960 and 2010-2020?).
• Planetary: use a histogram to compare crater perimeters on Mars with crater perimeters on the Moon.
• Geography: with satellite imagery, histograms are very useful for visualizing the distribution of pixel values in a given band or for automatically calculating thresholds that define the presence of vegetation in an image (e.g., Otsu's method).

## Next steps

Two format options we can choose from