How do I create and interpret histograms?
Binning data for analysis in the Earth sciences

Initial Publication Date: August 11, 2023

This module is available for public use, but it is undergoing revision after classroom implementation with the Math Your Earth Science Majors Need project.

An introduction to representing data with a histogram

Have you ever wondered how your quiz grade of 83% compared to the rest of the class? You may have seen your instructor display quiz scores on a special type of bar graph called a histogram (Figure 1). A histogram is a visual representation of a single variable (typically numeric) sorted into bins (or buckets) of values, and it can help you answer this question. Based on the histogram, your quiz score of 83% was about the same as most of the class, who earned B's (in the 80%-90% range). That range is the tallest bar in Figure 1, displaying the highest count or frequency.

In Earth and Environmental Sciences, histograms help visualize the underlying spread of values in a dataset, provide a feel for the high and low values of the data, and identify any potential outliers. Histograms are especially useful for large datasets (e.g., more than 20 points), which aren't easily understood by looking through the raw numbers.

When do I use histograms?

Figure 2: Two plots showing the histogram of wave heights for Bill Baggs State Park in FL compared to Coos Bay in OR from 1980-2022.
Provenance: Alejandra Ortiz, Colby College
Reuse: This item is offered under a Creative Commons Attribution-NonCommercial-ShareAlike license http://creativecommons.org/licenses/by-nc-sa/3.0/ You may reuse this item for non-commercial purposes as long as you provide attribution and offer any derivative works under a similar license.

Histograms are a simple and powerful first step in analyzing one variable -- they are typically the first graph made. For example, one renewable energy source is a wave power generator, which uses ocean wave oscillations to generate energy. Can you determine which location would be better for a wave generator that is designed for higher wave heights - Coos Bay, Oregon or Bill Baggs State Park, Florida? What if the generator was designed to maximize energy with higher frequency waves? Examine the histograms of wave height at each location to answer these questions (Figure 2).

Show Answer: Does Coos Bay or Bill Baggs generally have larger wave heights? Which site has higher frequency waves?

The green bar histogram shows that the most typical wave height at Coos Bay is 2 meters (15,000 occurrences), compared to the most common wave height of 0.5 meters at Bill Baggs (80,000 occurrences). This means the coast of Oregon will be better for a generator designed for larger waves, but a generator that produces more electricity from shorter, more frequent waves will be better off the coast of Florida!

Because histograms help visualize the underlying distribution of a dataset, they are closely related to statistical measures such as mean, median, mode, and skew. Without performing any complex calculations, histograms immediately identify distribution modality: one dominant peak frequency (unimodal--like the wave height histograms) or two (bimodal--like if half the class got an A- and most others scored a C+). As in the example above, you can immediately answer questions about how often you expect to see values of different magnitudes.

How do I create a histogram from my data?

Once you have collected your data, you will need to define bins (ranges of values) into which you will sort the data. The selection of bin range is critical to the usefulness of your histogram. Sometimes the type of data you have will suggest a good size for the bins, and sometimes you will decide the bin ranges after looking at your data a bit. For example, if you are counting pebbles on a stream bed you will likely use pre-set categories (e.g., small is 4-8 mm, medium is 8-16 mm, etc.) and use those for your bin ranges.

The San Gabriel River drains from the San Gabriel and San Bernardino Mountains in semi-arid southern California. Streamflow here is driven by winter precipitation and varies significantly over the course of the year. How would you initially characterize the flow of water through this river?

This is a set of monthly mean streamflow data (Q = discharge) for the San Gabriel River in California in cubic feet per second (cfs). These data are from USGS stream gage 11087020 at the Whittier Narrows Dam from 1995-2022:

Bin: a range of values for your data
Count: the number of values in each bin
Frequency: similar to count, but can be expressed as the total number OR a percent of the total

Mean Monthly Discharge San Gabriel River

Month	Mean Q (cfs)
Jan	372
Feb	410
Mar	149
Apr	103
May	84
Jun	46
Jul	38
Aug	43
Sep	33
Oct	72
Nov	106
Dec	212

The general steps for creating a histogram are:

Step 1: Select your bin range--use established ranges for the topic or look at your data and estimate what would be good.

Step 2: Sort your data by value (low-to-high or high-to-low).

Step 3: Count the number of data points that fall into each bin range. This is your frequency. You may report this as a raw count or convert it to a percent of the total.

Step 4: Plot the data as bars with the frequency on the vertical axis and the bins on the horizontal axis.

Step 1: Select your bins in increments of 100 cfs.

Step 2: Sort your data by value (high-to-low). Currently the data are listed chronologically, January to December.

Step 3: Count the number of data points that fall into each bin range. This is your count.

Step 4: Plot the data as bars with your bin values on the x-axis (horizontal) and the count values on the y-axis (vertical). Don't forget to shade in your bars and add axis labels and a title to your plot!
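
The four steps above can also be sketched in a few lines of Python. This is a minimal illustration rather than part of the original exercise: the variable names are made up for the example, and placing a value that falls exactly on a bin edge into the upper bin is an assumption (none of these twelve values sits on an edge).

```python
# Monthly mean discharge (cfs) from the table above, January to December.
monthly_q = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]

BIN_WIDTH = 100  # Step 1: bins in increments of 100 cfs

# Step 2: sort the data from high to low.
sorted_q = sorted(monthly_q, reverse=True)

# Step 3: count the values in each bin.
# Assumption: a bin covers lower <= Q < lower + BIN_WIDTH.
n_bins = max(monthly_q) // BIN_WIDTH + 1
counts = [0] * n_bins
for q in sorted_q:
    counts[q // BIN_WIDTH] += 1

# Step 4: "plot" the bars as text, one row per bin.
for i, count in enumerate(counts):
    lower = i * BIN_WIDTH
    print(f"{lower:>3}-{lower + BIN_WIDTH:<3} cfs | {'#' * count} ({count})")
```

The text bars stand in for the plotted histogram; a spreadsheet or plotting library would draw the same counts as shaded bars with axis labels and a title.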

Alternatively, you might want to create bins after examining your data. Notice that most of the streamflow values are in the lower bins and the last three bins each hold a single outlier. Let's try setting the bin sizes based on the data to see more information.

Step 1: Starting with the sorted dataset from Step 2 above, find the range of your data (the difference between the high and low values).

Step 2: Create bins by dividing the range by the number of bins. What are the bin ranges for 9 bins?

Step 3: Count the number of data points that fall into each bin range.

Step 4: Plot the data

Note that with more bins we have a better picture of the streamflow. The largest group of flows (5 of the 12) is less than 75 cfs, and there are a number of outliers, or data outside the range of most of the data.
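
As a cross-check on the range-based binning, here is a small Python sketch of Steps 1 to 3. It assumes the bin width is rounded up to a whole number of cfs so that the last bin still contains the maximum value, and that a value on an edge goes into the upper bin; the names are illustrative only.

```python
import math

# Monthly mean discharge (cfs) from the table above.
monthly_q = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]
N_BINS = 9

# Step 1: range of the data.
low, high = min(monthly_q), max(monthly_q)
data_range = high - low  # 410 - 33 = 377 cfs

# Step 2: bin width = range / number of bins, rounded up (assumption).
width = math.ceil(data_range / N_BINS)  # 377 / 9 is about 41.9, so 42
edges = [low + i * width for i in range(N_BINS + 1)]

# Step 3: count the values that fall into each bin.
counts = [0] * N_BINS
for q in monthly_q:
    counts[(q - low) // width] += 1

for i, count in enumerate(counts):
    print(f"{edges[i]}-{edges[i + 1]} cfs: {count}")
```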

Step 5: Sometimes histograms are presented as percent frequency instead of counts. Can you convert this histogram to percent frequency?

To convert to percentages, first you need the total number of data points, n = 12. Divide each count by the total: for example, the first bin is 5/12 = 41.7%. The shape of this plot will look the same as the previous one, but the values on the y-axis will be in percent (0-100) instead of counts. You can now easily see that about 42% of the average streamflows are less than 75 cfs, and 67% of the average streamflows are less than 117 cfs.

Bin: Q	Frequency
33-75 cfs	41.7%
75-117 cfs	25%
117-159 cfs	8.3%
159-201 cfs	0%
201-243 cfs	8.3%
243-285 cfs	0%
285-327 cfs	0%
327-369 cfs	0%
369-411 cfs	16.7%
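
The conversion from counts to percent frequency can be sketched in Python as follows. The bin start (33 cfs) and width (42 cfs) are taken from the table above; the running cumulative total is an addition for illustration, showing where figures like "67% are less than 117 cfs" come from.

```python
# Monthly mean discharge (cfs) from the table above.
monthly_q = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]
LOW, WIDTH, N_BINS = 33, 42, 9  # bin start, bin width, number of bins

# Count the values in each bin (lower <= Q < lower + WIDTH).
counts = [0] * N_BINS
for q in monthly_q:
    counts[(q - LOW) // WIDTH] += 1

# Percent frequency = count / total number of data points * 100.
total = len(monthly_q)  # n = 12
percents = [100 * c / total for c in counts]

cumulative = 0.0
for i, pct in enumerate(percents):
    lower = LOW + i * WIDTH
    cumulative += pct
    print(f"{lower}-{lower + WIDTH} cfs: {pct:5.1f}%  (cumulative {cumulative:5.1f}%)")
```

Because every count is divided by the same total, the bars keep their relative heights; only the y-axis units change.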

How many bins should I use?

If you are creating your own bins, the number/range size of your bins will affect how your data may be interpreted. If you do not have enough bins, you will lose detail that may be important. If you have too many bins, you will lose overall characteristics of your data. In the figure below, with bin size of 5, you lose enough resolution that you do not see the second peak in this bimodal distribution. With bin size 0.2, you start to see too much noise which gives the false impression that there may be more peaks. The appropriate number of bins for your dataset depends on how many data points you have and the overall spread of the data.
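
To see this effect without a figure, the sketch below bins a small, entirely hypothetical set of quiz scores (one cluster near a C+, one near an A-) with three different bin widths and counts the peaks in each histogram. The scores, the bin widths, and the simple peak-counting rule are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical quiz scores: one cluster in the high 70s, one in the low 90s.
scores = [74, 76, 76, 77, 78, 78, 79, 81, 88, 90, 90, 91, 92, 92, 93, 95]

def bin_counts(values, width):
    """Count values per bin; bins start at multiples of width."""
    tally = Counter(v // width for v in values)
    first, last = min(tally), max(tally)
    # Keep empty bins between the lowest and highest occupied bins.
    return [tally.get(i, 0) for i in range(first, last + 1)]

def count_peaks(counts):
    """Number of bars strictly taller than both neighbors (missing neighbors count as 0)."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

for width in (50, 5, 1):
    counts = bin_counts(scores, width)
    print(f"bin width {width:>2}: {len(counts):>2} bins, {count_peaks(counts)} peak(s)")
```

With the widest bins the two clusters merge into a single bar, the middle width shows both clusters, and the narrowest bins break the same data into many small, noisy peaks.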

Creating a histogram with a spreadsheet

In practice, you are unlikely to ever construct a histogram "by hand." Remember that histograms are most useful for understanding larger datasets. You will likely be using a spreadsheet, a statistics package, or a programming language like Python. Let's start with a spreadsheet.

Thinking about the streamflow in the San Gabriel, maybe looking at monthly averages over a long time period does not really tell us much about how the river flows. Did we lose a lot of the variability? Are there more low flows and high flows that we are missing? Here is a dataset of the monthly average flow in this stream from 1995-2022.

Download A very short introduction to Excel (PDF, 43 kB) for a quick reference on the simple Excel functions used here.

Open the file and take a look at the dataset: San Gabriel River monthy discharge 1995_2022.xlsx (Excel .xlsx, 17 kB)

Step 1: After you download the file, open it in Excel by double-clicking on it. How many years do these data span? How many discharge measurements are there?

You can use Excel's built-in functions such as COUNT, MIN, and MAX to find these answers.

The file contains 324 data points (streamflow measurements) with a high value of 5133 cfs and a low value of 0.

Step 2: Plot the histogram

Step 3a: Choose your bin size

For Macs: Right-click a blue bar in the histogram chart and select the format data series option. In the options, switch from auto bins to bin width and input a bin width of 100 cfs.

For PCs: Click the plus sign in the upper right-hand corner of the new plot to open the plot options, then select axes, then more axis options.

To set your bin size, select "bin width" and enter 100 (cfs).

Step 3b: Choose the number of bins to plot

For Macs: Right-click a blue bar in the histogram chart and select the format data series option. In the options, switch from auto bins to number of bins and input 20 bins.

For PCs: In the options, select number of bins and input 20 bins. Note how these bins differ from the automatic bins that Excel selected.
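
Steps 3a and 3b are two ways of making the same choice, because bin width and number of bins are linked through the range of the data. The Python sketch below shows that arithmetic using the low and high values reported for this file. It is a rough illustration only, and Excel may place its bin edges somewhat differently.

```python
import math

# Low and high values reported for the dataset (cfs).
q_min, q_max = 0, 5133
data_range = q_max - q_min

# Step 3a: fix the bin width; the number of bins follows from the range.
# The "+ 1" keeps a bin available for a maximum that lands exactly on an edge.
bin_width = 100
n_bins_for_width = math.floor(data_range / bin_width) + 1

# Step 3b: fix the number of bins; the bin width follows from the range.
n_bins = 20
width_for_n_bins = data_range / n_bins

print(f"{bin_width} cfs bins -> about {n_bins_for_width} bins")
print(f"{n_bins} bins -> bins about {width_for_n_bins:.2f} cfs wide")
```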

Step 4: Scale your x-axis

For Macs: Right-click a blue bar in the histogram chart and select the format data series option. In the options, click on overflow bin and enter a value of 500 cfs.

For PCs: You can group outliers into one bin by selecting "overflow bin" and entering a value (e.g., 500), or let Excel select the overflow value for you by leaving the setting as "auto". Don't forget to add axis labels and titles to your final histogram!
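
The idea behind an overflow bin can be sketched in Python as follows. The flow values here are hypothetical stand-ins, not the values in the spreadsheet, and the rule that a value exactly equal to the overflow limit stays in the last regular bin is an assumption.

```python
# Hypothetical monthly flows (cfs) for illustration; NOT the spreadsheet data.
flows = [0, 12, 35, 48, 90, 130, 210, 260, 340, 470, 505, 880, 1620, 5133]

BIN_WIDTH = 100  # width of the regular bins (cfs)
OVERFLOW = 500   # everything above this value shares one bin

n_regular = OVERFLOW // BIN_WIDTH
counts = [0] * (n_regular + 1)  # the last slot is the overflow bin

for q in flows:
    if q > OVERFLOW:
        counts[-1] += 1
    else:
        # min() keeps a value equal to OVERFLOW in the last regular bin.
        counts[min(q // BIN_WIDTH, n_regular - 1)] += 1

labels = [f"{i * BIN_WIDTH}-{(i + 1) * BIN_WIDTH}" for i in range(n_regular)]
labels.append(f"> {OVERFLOW}")

for label, count in zip(labels, counts):
    print(f"{label:>8} cfs | {'#' * count}")
```

Grouping the few very large flows into one bar keeps the x-axis short enough that the shape of the common, low flows stays visible.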

How do I read histograms?

median: value that splits the dataset in half, such that half of the data are larger and half are smaller
mean: the most commonly used "average" of a dataset (the sum of the values divided by the number of values)
mode: the most repeated value (aka the highest bin in your histogram).
skew: when the data are NOT symmetric and distributed with longer tails to one side of the peak

A histogram tells you about the underlying shape of the data (the distribution). For larger datasets (more than 10 data points), histograms are the perfect first step for visualizing data because they show how frequently values fall into each bin. When looking at a histogram, keep these questions in mind:

What is the shape of the data? Is it symmetric, skewed, uniform, or bimodal? (Figure 6)
Where is the center of the data? The value with the same number of data to the right and to the left is the median (Figure 7)
What is the average of the data? The arithmetic average of the data is the mean (Figure 7)
What is the most common value? The highest bar is the mode (Figure 7)
What is the spread of the data? The difference between the lowest and highest value bin is the range
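
The center, average, most common bin, and spread can also be computed directly, which is a useful check on what you read off a histogram. The Python sketch below uses the twelve monthly San Gabriel River values from earlier on this page and the standard `statistics` module. The 100 cfs bin width is reused from the worked example, and the range is computed from the data values themselves rather than from bin edges.

```python
import statistics

# Monthly mean discharge (cfs) for the San Gabriel River, from the table earlier on this page.
monthly_q = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]

median_q = statistics.median(monthly_q)      # center: half the values lie on each side
mean_q = statistics.mean(monthly_q)          # arithmetic average
data_range = max(monthly_q) - min(monthly_q) # spread

# Most common bin: the 100 cfs bin holding the most values.
BIN_WIDTH = 100
counts = {}
for q in monthly_q:
    lower = q // BIN_WIDTH * BIN_WIDTH
    counts[lower] = counts.get(lower, 0) + 1
modal_lower = max(counts, key=counts.get)

print(f"median: {median_q} cfs")
print(f"mean:   {mean_q} cfs")
print(f"range:  {data_range} cfs")
print(f"tallest bar: {modal_lower}-{modal_lower + BIN_WIDTH} cfs ({counts[modal_lower]} values)")
```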

A histogram makes it easy to see which values are most common and which values are least common in a data set. Here is a good visualization of common patterns seen in histograms:

Frequently, statistics such as mean, median, mode, and skew are used to describe the shape and pattern (i.e., the distribution) of the data. These values can be calculated explicitly from the data (see Introductory Statistics) but can also be inferred from the histograms:

When a dataset is unimodal and symmetric, the data are roughly equally distributed on either side of the peak value and the mean, median, and mode are approximately equal (as shown in Figure ) above. Frequently, however, data are asymmetric, with a longer tail of values on one side of the peak; this is called skew. For example, a positive or right-skewed dataset has a longer tail toward larger magnitudes, such that the mean is larger than the median (Figure ). A negative or left-skewed dataset (Figure ) has a longer tail toward lower magnitudes, such that the mean is smaller than the median.
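
The mean-versus-median comparison described above can be turned into a quick check in Python. This is only a rule-of-thumb sketch: the function name and the 5% tolerance are hypothetical choices for illustration, and a proper skewness statistic would be the more rigorous tool.

```python
import statistics

def skew_direction(values, tolerance=0.05):
    """Rough skew check: compare the mean to the median.

    The difference counts as negligible when it is within `tolerance`
    (a hypothetical 5% by default) of the data range.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "roughly symmetric"
    return "right (positive) skew" if mean > median else "left (negative) skew"

# San Gabriel River monthly mean discharge (cfs) from earlier on this page.
monthly_q = [372, 410, 149, 103, 84, 46, 38, 43, 33, 72, 106, 212]

print(skew_direction(monthly_q))          # mean 139 vs. median 93.5
print(skew_direction([1, 2, 3, 4, 5]))    # made-up symmetric values
print(skew_direction([1, 8, 9, 9, 10]))   # made-up values with a low tail
```

For the streamflow data the few large winter flows pull the mean well above the median, which matches the long right-hand tail seen in the histogram.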

Where do you use histograms in Earth science?

Histograms can be used in a wide variety of earth and environmental science problems especially when there are continuous numerical data collected that span a large range of values or magnitudes.

Ecology: use a histogram to investigate the distribution of plankton size from data collected at different sites.
Sedimentology: histograms are frequently used to investigate the distribution of sediment sizes (e.g., from pebble counts or sieving) and can be used in conjunction with more advanced techniques such as cumulative histograms or cumulative distribution functions to infer information about the depositional environment.
Seismology: histograms can be a quick way to visualize the magnitude of earthquake events at different locations and could even be used to understand how this might change over time, for example by comparing histograms of earthquakes from two different time periods.
Hydrology: histograms are used very frequently to look at distributions of rainfall, river discharge, or other climatic processes to compare changes in either time or location (e.g., does the distribution of rainfall magnitude in Puerto Rico change between 1950-1960 and 2010-2020?).
Planetary: use a histogram to compare crater perimeters on Mars with crater perimeters on the Moon.
Geography: when using satellite imagery, histograms become very useful for visualizing the distribution of pixel values in a given band or for automatically calculating thresholds that define the presence of vegetation in an image (e.g., the Otsu method).

Next steps

I am ready to PRACTICE!

If you think you have a handle on the steps above, click on this bar to try practice problems with worked answers.
Or, if you want even more practice, see 'More help' below.

More help (resources for students)

Khan Academy Histograms
Wolfram Mathworld Histograms
Lab Xchange has a strong description of reading and interpreting histograms
This Complete Guide to Histograms does a strong job of walking through understanding and creating histograms

Pages written by Freddi-Jo Bruschke (CSU Fullerton) and Alejandra Ortiz (Colby College).