Histograms - Practice Problems
Solving Earth science problems with data binning
Creating histograms by hand
Problem 1: In fluvial geomorphology, it is often important to know the dominant sediment size in a river bed, so students in the field might do pebble counts. Using the data below, collected at Mill Creek in the San Bernardino Mountains, CA, construct a histogram of the pebble counts for each of the three transects on this creek.
Class | Size (cm) | Transect 1 | Transect 2 | Transect 3
Cobbles | 9.6 - 12.8 | 17 | 5 | 14
Cobbles | 12.8 - 19.2 | 9 | 20 | 11
Cobbles | 19.2 - 25.6 | 10 | 14 | 5
Boulders | 25.6 - 38.4 | 6 | 4 | 4
Boulders | 38.4 - 51.2 | 2 | 3 | 6
Boulders | 51.2 - 102.4 | 0 | 0 | 1
Can you draw a histogram for each transect and determine what is the most common grain size in each transect given the table of pebble counts?
- Determine the bin size for your histogram
In pebble counts, the bins are pre-defined by the sizes measured, so you can use each row of the size column as a bin (i.e., 9.6-12.8 cm, 12.8-19.2 cm, 19.2-25.6 cm, etc.).
- Determine the count of values in each bin
Because the data are pre-binned, we already have the counts in each size bin. If we use the six bins found in step 1, we can read the counts directly from the table; for example, for Transect 1, the 38.4-51.2 cm bin has a count of 2.
- Plot the histogram for each transect
Now create three graphs with the same x-axis, with the bins labeled from 9.6 cm to 102.4 cm. The y-axes should range from zero to the maximum value in your dataset (~20). For Transect 1, make a bar for each row, where the height of the bar is the count in that row (Figure 1). Repeat this for each transect.
- Calculate the modal grain size for each transect.
Remember that the mode is defined as the value that occurs most often in the dataset. For histograms, we look for the highest bar (the one with the largest frequency of occurrence, or count) and then read its x-value (bin) as the mode. In this case, for Transect 1 the most common sediment is the smallest size of cobbles, 9.6-12.8 cm. For Transect 2, it is medium-sized cobbles, 12.8-19.2 cm, and for Transect 3, it is again the smallest size of cobbles (9.6-12.8 cm), the same as Transect 1 (Figure 2).
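The four steps above can also be sketched in a few lines of Python. This is only an illustration, not part of the original exercise: the counts are copied from the pebble-count table, while the helper names `modal_bin` and `text_histogram` are made up for this sketch. It draws a simple text histogram (one `#` per pebble) and reports the bin with the tallest bar for each transect.

```python
# Pebble counts copied from the Mill Creek table; bins are the pre-defined size classes (cm).
bins = ["9.6-12.8", "12.8-19.2", "19.2-25.6", "25.6-38.4", "38.4-51.2", "51.2-102.4"]
counts = {
    "Transect 1": [17, 9, 10, 6, 2, 0],
    "Transect 2": [5, 20, 14, 4, 3, 0],
    "Transect 3": [14, 11, 5, 4, 6, 1],
}


def modal_bin(bin_labels, bin_counts):
    """Return the label of the bin with the largest count (the mode)."""
    tallest = max(range(len(bin_counts)), key=lambda i: bin_counts[i])
    return bin_labels[tallest]


def text_histogram(bin_labels, bin_counts):
    """Build one line per bin, drawing one '#' per pebble counted."""
    return [f"{label:>11} cm | {'#' * n} ({n})" for label, n in zip(bin_labels, bin_counts)]


# Hypothetical helper output: the modal bin for every transect.
modes = {name: modal_bin(bins, c) for name, c in counts.items()}

for name, c in counts.items():
    print(name)
    print("\n".join(text_histogram(bins, c)))
    print(f"  modal bin: {modes[name]} cm\n")
```

Using the same `bins` list for all three transects mirrors the advice to give the three graphs a shared x-axis, which makes the bar heights directly comparable.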
Problem 2: Sedimentologists use grain size distributions to help identify the possible origins of sediment samples. Using the sieve data below, draw a histogram for each sample and decide which sediment sample has the smallest average grain size?
Grain Size (mm) | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5
> 2 mm | 5.00 | 0.03 | 2.86 | 5.00 | 0.00
1 - 2 mm | 2.00 | 39.98 | 25.47 | 5.00 | 1.00
0.5 - 1 mm | 1.92 | 11.63 | 20.40 | 27.27 | 6.45
0.25 - 0.5 mm | 43.77 | 10.00 | 4.38 | 0.04 | 41.74
0.125 - 0.25 mm | 4.16 | 0.11 | 0.23 | 19.38 | 8.00
0.063 - 0.125 mm | 1.00 | 0.00 | 0.02 | 1.69 | 5.00
Total Sample Mass (g) | 57.85 | 61.75 | 53.36 | 58.37 | 62.19
- Determine the bin size for your histogram
In sediment sieves, the bins are pre-defined by the sizes of the sieves used, so you can use each row of grain size as a bin (i.e., >2 mm, 1-2 mm, 0.5-1 mm, etc.).
- Determine the frequency of values in each bin
The data are pre-binned by mass. However, because the total sample mass differs from sample to sample, we need to normalize the data and plot by frequency of occurrence. To do this, take the mass retained in each sieve size, divide it by the total sample mass (the last row), and multiply by 100 to express it as a percentage. Your table should look like the one below. We will plot our y-values using these % mass values (you can also think of each value as a frequency of occurrence: in Sample 1, X% of the sediment falls in a specific grain size).
Percent Mass
Grain Size (mm) | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5
> 2 mm | 8.64 | 0.05 | 5.35 | 8.57 | 0.00
1 - 2 mm | 3.46 | 64.74 | 47.74 | 8.57 | 1.61
0.5 - 1 mm | 3.32 | 18.84 | 38.23 | 46.71 | 10.37
0.25 - 0.5 mm | 75.66 | 16.19 | 8.21 | 0.06 | 67.12
0.125 - 0.25 mm | 7.19 | 0.18 | 0.43 | 33.20 | 12.86
0.063 - 0.125 mm | 1.73 | 0.00 | 0.04 | 2.89 | 8.04
Total Mass | 100 | 100 | 100 | 100 | 100
- Plot the histogram for each sample
Now create five graphs with the same x-axis, with the bins labeled from 0.063 mm to 2 mm. The y-axes should range from 0 to 100. For Sample 1, make a bar for each row, where the height of the bar is the % mass in that row (Figure XX). Repeat this for each sample.
- Determine the center of the dataset (the average)
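For these histograms, the smallest sediment size is on the left side of the x-axis and grain size increases moving right, so we are looking for the histogram with most of its % mass on the left. Samples 1 and 5 both have their mode in the 0.25-0.5 mm bin, which is finer than the modes of Samples 2, 3, and 4. Comparing the two, Sample 5 has more of its mass in the two finest bins (about 21% versus 9% for Sample 1) and almost none coarser than 1 mm (about 2% versus 12%), so Sample 5 has the smallest average grain size.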
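The normalization step in this problem can be checked with a short Python sketch. The masses are copied from the sieve table; the function name `percent_mass` is just a label chosen for this illustration. Note one assumption: the sketch divides by the sum of each column rather than by the listed total mass, so values can differ from the table in the last decimal place (for Sample 4 the column sums to 58.38 g, while the table lists 58.37 g).

```python
# Sieve masses (g) copied from the table, ordered from coarsest to finest bin.
bins_mm = ["> 2", "1-2", "0.5-1", "0.25-0.5", "0.125-0.25", "0.063-0.125"]
mass_g = {
    "Sample 1": [5.00, 2.00, 1.92, 43.77, 4.16, 1.00],
    "Sample 2": [0.03, 39.98, 11.63, 10.00, 0.11, 0.00],
    "Sample 3": [2.86, 25.47, 20.40, 4.38, 0.23, 0.02],
    "Sample 4": [5.00, 5.00, 27.27, 0.04, 19.38, 1.69],
    "Sample 5": [0.00, 1.00, 6.45, 41.74, 8.00, 5.00],
}


def percent_mass(masses):
    """Normalize sieve masses to percent of the sample total (sum of the column)."""
    total = sum(masses)
    return [100.0 * m / total for m in masses]


percent = {name: percent_mass(m) for name, m in mass_g.items()}

# The modal bin of each sample is the bin holding the largest share of mass.
modal = {name: bins_mm[p.index(max(p))] for name, p in percent.items()}

for name, p in percent.items():
    row = "  ".join(f"{value:6.2f}" for value in p)
    print(f"{name}: {row}   modal bin: {modal[name]} mm")
```

Because every sample is rescaled to 100%, the five histograms can share a y-axis from 0 to 100 even though the samples have different total masses.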
Creating histograms in Excel
Problem 3: Geologists can classify volcanic eruptions based on the VEI, volcanic explosivity index, a way to measure the relative explosiveness of volcanic eruptions. It measures how much volcanic material is ejected, the height of the material thrown into the atmosphere, and how long the eruptions last. The scale is logarithmic, or based on 10; therefore, an increase of 1 on the scale indicates an eruption 10 times more powerful than the number before it on the scale.
Using the attached Excel file VEI_1600_2023_AllData.xlsx (Excel 2007 (.xlsx) 41kB Jun7 23), find the modal elevation of all the volcanic eruptions recorded in the dataset. Also, is the VEI right- or left-skewed? Given the skew of the VEI data, do we expect the mean to be larger or smaller than the median?
- Open the file in Excel
Download the Excel file and save it to your computer. Double-click the Excel file to open it.
- Create a histogram of the volcano elevations
Select the elevation column, then click insert --> choose the histogram chart type.
- Edit the histogram for readability
For Macs: Right-click a bar in the histogram chart and select format data series. In the options, switch from auto bins to bin width, enter a reasonable bin width (perhaps 500 m), and use the overflow and underflow bins to clean up the image. For PCs: Click the plus sign in the upper right-hand corner of the new plot to open the plot options, then select axes, then more axis options.
- Find the modal elevation.
Look for the largest peak (the tallest bar) on the histogram, then read down to the x-axis and find the elevation range. If you use a bin width of 500 m, the modal range is 1,500 - 2,000 m, with 120 volcanoes in that bin.
- Create a new histogram of the VEI values
Select the VEI column, then click insert --> choose the histogram chart type.
- Edit the VEI value histogram for readability
For Macs: Right-click a bar in the histogram chart and select format data series. In the options, switch from auto bins to bin width, enter a reasonable bin width (1), and use the overflow and underflow bins to clean up the image. For PCs: Click the plus sign in the upper right-hand corner of the new plot to open the plot options, then select axes, then more axis options.
- Interpret the histogram
Looking at the graph of VEI, we see that the graph is unimodal (one dominant peak at 2), but the data are not symmetrically distributed around this peak. There are more values (a longer tail) on the right side of the peak, so these data are right-skewed.
- Descriptive Statistics
In our histogram, we can clearly see that the mode for VEI is 2. We also know that for right-skewed histograms the median is smaller than the mean, because the mean is "pulled" up by the few larger VEI values that skew the average.
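To see why a right-hand tail pulls the mean above the median, here is a small Python sketch. The VEI values in it are invented for illustration and are NOT taken from VEI_1600_2023_AllData.xlsx; they are only shaped like the histogram described above (one peak at 2 and a longer tail to the right).

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical VEI values (made up for this sketch, not from the Excel file):
# a single peak at 2 with a longer tail toward larger values.
vei = [0] * 5 + [1] * 12 + [2] * 30 + [3] * 14 + [4] * 8 + [5] * 3 + [6] * 1

# A bin width of 1 means each integer VEI value is its own bin.
counts = Counter(vei)
for value in sorted(counts):
    print(f"VEI {value} | {'#' * counts[value]} ({counts[value]})")

vei_mode = mode(vei)
vei_median = median(vei)
vei_mean = mean(vei)

print(f"mode = {vei_mode}, median = {vei_median}, mean = {vei_mean:.2f}")
print("mean > median:", vei_mean > vei_median)
```

In this made-up sample the mode and median both sit at the peak, while the handful of large values in the tail lifts the mean above them, which is the pattern expected for right-skewed data.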
Problem 4: Through automated analyses of Landsat satellite imagery, the sizes of over 1500 atoll islands were collected and saved as an Excel file. When we look at the atoll island width, are the data uniform or skewed? Can you tell from the histogram whether the median or the mean atoll island width is the smaller value? AtollIslands_ACO_Landsat.xlsx (Excel 2007 (.xlsx) 75kB Jun7 23)
- Open the file in Excel
Download the Excel file and save it to your computer. Double-click the Excel file to open it.
- Create a histogram of the atoll island width
Select the atoll island width column, then click insert --> choose the histogram chart type.
- Edit the histogram for readability
For Macs: Right-click a bar in the histogram chart and select format data series. In the options, switch from auto bins to bin width, enter a reasonable bin width (perhaps 25 m), and use the overflow and underflow bins to clean up the image. For PCs: Click the plus sign in the upper right-hand corner of the new plot to open the plot options, then select axes, then more axis options.
- Interpret the histogram
Looking at the graph of the atoll island width, we see that the graph is unimodal (one dominant peak at 170-196 m), but the data are not symmetrically distributed around this peak. There are more values (a longer tail) on the right side of the peak, so these data are right-skewed.
- Descriptive Statistics
In our histogram, we can clearly see that the mode is 170-196 m for atoll island width, but we know that for right skewed histograms, the median is smaller than the mean, because the mean value will be "pulled" up by the larger atoll island width values that skew our average.
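What the spreadsheet does when you choose a bin width and an overflow bin can be sketched in Python. The island widths below are invented for illustration and are NOT from AtollIslands_ACO_Landsat.xlsx. The function `bin_counts` and its `overflow_at` parameter are names made up for this sketch, and the 400 m overflow threshold is an arbitrary choice.

```python
from collections import Counter


def bin_counts(values, bin_width, overflow_at=None):
    """Count values in fixed-width bins, keyed by each bin's lower edge.

    Values at or above overflow_at (if given) are pooled into a single
    overflow bin keyed by overflow_at.
    """
    counts = Counter()
    for v in values:
        if overflow_at is not None and v >= overflow_at:
            counts[overflow_at] += 1
        else:
            counts[int(v // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical island widths in metres (made up for this sketch).
widths_m = [120, 135, 140, 155, 160, 165, 170, 172,
            180, 185, 190, 210, 240, 300, 410, 650]

counts = bin_counts(widths_m, bin_width=25, overflow_at=400)
modal_edge = max(counts, key=counts.get)

for edge, n in counts.items():
    label = f">= {edge}" if edge == 400 else f"{edge}-{edge + 25}"
    print(f"{label:>8} m | {'#' * n} ({n})")
print(f"modal bin: {modal_edge}-{modal_edge + 25} m")
```

Pooling the few very wide islands into one overflow bin keeps the long right-hand tail from stretching the x-axis, which is the "clean up the image" effect described in the editing step.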
Reading histograms
Problem 5: In the early 2000s, hydraulic fracturing became a common method to retrieve fossil fuels "trapped" in rocks like shale. One concern about this practice was the potential to induce earthquakes. Examine and compare the two histograms graphically displaying the magnitude of earthquakes in Oklahoma before and after 2008.
- What is the shape of the distribution of data? Is it symmetric, skewed, uniform, or bimodal?
Do you see one or two "highest" bars? Are the data distributed evenly on each side of the highest bars? Each of these histograms is unimodal (one highest value). The pre-2008 histogram shows earthquake magnitudes skewed toward the highest values (right-skewed), but this is likely because earthquakes below magnitude 2.5 are not presented in the data set. The post-2008 histogram shows earthquake magnitudes distributed symmetrically about the most frequent value. Note that you would expect the distribution of earthquake magnitudes to be skewed, as the frequency of earthquakes should decrease as the magnitude increases.
- Where is the center of the data? What is the average value of the data?
The median (center) of the data is found on the histogram by determining how many data points you have and locating the bin value of the data point in the middle.
- What is the most common earthquake magnitude before 2008 and after 2008?
Identify the largest bars on each histogram. Pre-fracking, there were ~72 earthquakes of magnitude 2.5-3; after 2008, there were over 3000 earthquakes of magnitude 2.5-3. You can see that magnitude 2.5-3 earthquakes are the most common in both time periods (the mode), but the number of earthquakes has increased significantly.
- What is the spread of the data? What is the range of earthquake magnitudes before and after 2008?
Identify the highest and lowest value of each histogram. Before 2008, earthquake magnitudes ranged from 2.5 to 4.5. After 2008, earthquake magnitudes ranged from 0.5 to 6. Although they are not as frequent as the lower-magnitude earthquakes, the larger earthquakes began occurring only after 2008.
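The recipe for locating the median on a histogram (count the data points, then find the bin holding the middle one) can be sketched as a running total over the bars. In the Python sketch below, only the ~72 count for the magnitude 2.5-3 bin comes from the text; the other three counts are invented for illustration, and the function name `median_bin` is a label chosen for this sketch.

```python
def median_bin(bin_labels, bin_counts):
    """Return the label of the bin that contains the middle data point.

    With an even number of points, this returns the bin holding the
    upper of the two middle points.
    """
    total = sum(bin_counts)
    middle = (total + 1) / 2  # 1-based position of the middle point
    running = 0
    for label, n in zip(bin_labels, bin_counts):
        running += n  # cumulative count up to and including this bin
        if running >= middle:
            return label
    raise ValueError("histogram contains no data")


# Pre-2008 style histogram: 72 is read from the text, the rest are hypothetical.
magnitude_bins = ["2.5-3", "3-3.5", "3.5-4", "4-4.5"]
quake_counts = [72, 30, 10, 3]

center = median_bin(magnitude_bins, quake_counts)
print(f"{sum(quake_counts)} earthquakes, median falls in the magnitude {center} bin")
```

A histogram only tells you which bin the median falls in, not its exact value, so the answer is a range rather than a single number.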
Problem 6: Below, we have two histograms of the measured pH of coal-mine discharges in Pennsylvania. How would you describe the shape of these histograms?
- Is it a uniform distribution?
Do the bars all have roughly the same height, or does the frequency (y-axis) vary from bin to bin? In this case, we can see clear variations from 0-20% frequency across the pH values shown, so it is NOT a uniform distribution.
- Is it unimodal, bi-modal, or something else?
How many peaks do you see? In both graph A and B, we see two distinct peaks. We would call this bi-modal.
- What is the skew?
How are the data distributed around the peaks? In this case, the data seem to be distributed about evenly on either side of both peaks in both graphs, so this is a roughly symmetric graph.
- What is the most common measured pH? Find the mode of each histogram
Remember that the mode is the most repeated value in the dataset; on a histogram, that is the bin with the tallest bar. In this case, we can see that the anthracite coalfield (A) has a mode of 3.25-3.75 pH and the surface coal mines (B) have a mode of 6.25-6.75 pH.
- Is the histogram for each band uniform or are there distinct peaks?
For each histogram, check how many peaks there are. Only the near-infrared band (the bottom histogram) has a single dominant peak; all the other bands (blue, green, and red) are bi-modal or multimodal.
- Find the mode of each histogram
Remember that the mode is the most repeated value in the dataset; on a histogram, that is the bin with the tallest bar. In this case, we can see that the unimodal near-infrared band has the smallest mode.
- What is the shape of this distribution? Is this a uniform distribution? Is there a single most likely value? Are the data skewed?
If you're having trouble visualizing the shape of the histogram, trace an outline of the histogram by connecting the top of each bar. There should be two peaks visible, so we know this histogram is bimodal and it is not uniform. The single, most likely value, or mode, is found at the tallest peak (-4.5 km). By looking at the distribution, the histogram is skewed slightly to the right (it is easier to see this if you outline your histogram!)
- Can you estimate the median elevation of the Earth? The average of the data?
The median elevation of the Earth is -1 km (or 1 km below sea level). The average of the data is likely similar and also close to -1 km. You may notice that the median and mean are not close to either peak. An alternative approach to this problem would be to separate the bimodal distributions into two separate unimodal distributions and then estimate the median and mean of each unimodal distribution. With this approach, the median and mean for each unimodal distribution would be a similar value to the mode for each unimodal distribution.
- What is the most common value (mode)?
Following similar logic to the median and mean in a bimodal histogram (above), we report the mode of each peak. The mode is the most repeated value in the data set, or the tallest bar. Here, the modes can be found at -4.5 km and 0.5 km. If strictly considering the whole histogram, the mode is -4.5 km.
- What is the range of elevations on the Earth?
The range is found by subtracting the lowest number on the x-axis of the histogram from the highest number. Here, the highest number is 4.5 km and the lowest is -6.5 km. Thus, the range is found by the following equation: 4.5 km - (-6.5 km) = 11 km.
- Compare the distribution of values in each unit. Are they the same shape? If they are skewed, are they skewed in the same direction?
Each of these three histograms is unimodal; however, they differ in skew. The first (a) is skewed to the left, (b) is skewed to the right, and (c) is symmetric.
- If you wanted to use the median as a single value to represent each unit, would you expect the medians to be the same or different?
We would expect the medians to be different due to this difference in skew. While the median is typically the same value as the mode in a symmetric distribution, histograms that skew right have medians that fall to the right of the mode, and histograms that skew left have medians that fall to the left of the mode.
- What is the mode of each unit? Are they the same?
The mode is the tallest bar of each histogram. For (a), this is $2.6 \times 10^5$, for (b) it is $6 \times 10^4$, and for (c) it is $2 \times 10^5$. They are not the same.
- Is the spread of the data the same for each geologic unit or different? What is the range?
The spread is visualized by the x-axis. We can see that the spread is different for each. The range is calculated by taking the smallest value on the x-axis and subtracting this from the greatest value. For (a) the greatest value is $2.9 \times 10^5$ and the smallest value is $2.2 \times 10^5$. We can calculate the range for (a) using the following equation: $2.9 \times 10^5 - 2.2 \times 10^5 = 0.7 \times 10^5$. We can repeat this process for (b): $1.4 \times 10^5 - 2 \times 10^4 = 1.2 \times 10^5$. We do this once more for (c): $3 \times 10^5 - 1 \times 10^5 = 2 \times 10^5$.
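The three range calculations can be double-checked with a few lines of Python. The smallest and greatest x-axis values are the ones quoted in the text; the variable names are chosen just for this sketch.

```python
# (smallest, greatest) x-axis values for each unit, as quoted in the text.
extents = {
    "a": (2.2e5, 2.9e5),
    "b": (2e4, 1.4e5),
    "c": (1e5, 3e5),
}

# Range = greatest value minus smallest value.
ranges = {unit: greatest - smallest for unit, (smallest, greatest) in extents.items()}
widest = max(ranges, key=ranges.get)

for unit, value in ranges.items():
    print(f"({unit}) range = {value:.1e}")
print(f"unit with the largest spread: ({widest})")
```

Putting the three ranges side by side makes it easy to see that the spread differs from unit to unit.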
Next Steps
TAKE THE QUIZ!!
I think I'm competent with histograms and I am ready to take the quiz! This link takes you to WAMAP. If your instructor has not given you instructions about WAMAP, you may not have to take the quiz. Or you can go back to the Histogram explanation page.