Scatterplots and Correlation
Scatterplots
Scatterplots show the relationship between two (usually) continuous variables. Recall that continuous variables take on many different numeric values; age and income are examples. Scatterplots are very useful for data visualization because they give us an intuition for the direction of the relationship between variables (positive or negative) and the strength of that relationship. Usually, we are interested in both.
With a scatterplot, we normally assume that one variable is the independent variable. Most researchers denote the independent variable as X; it is the input to the model. The dependent variable, denoted Y, is the output from the model. One way to keep these straight is that the dependent variable depends on another variable in the model, the independent variable. Just as X comes before Y in the alphabet, a change in X results in some change in Y. In some cases, the independent variable X may be a “cause” of the dependent variable Y, but in most cases, causation is difficult to establish. We discuss the distinction between correlation and causation toward the end of the chapter.
In the examples below, we will be using the State Kids Count data. In both scatterplots, the dependent variable is the state infant mortality rate (imr). We will construct two scatterplots using two different independent variables: the percentage of low-birth-weight babies in each state and the median family income in each state. Figure 1 shows the scatterplot for infant mortality (y axis) and low-birth-weight babies (x axis).
Figure 1: The relationship between low birth weights and infant mortality
Here low birth weight is on the x axis and the infant mortality rate is on the y axis. This scatterplot helps answer two questions.
⦁ Direction of Relationship. The graph shows a positive relationship between low birth weights and the state infant mortality rate: as the percentage of low-birth-weight babies increases, so does infant mortality. This makes sense, since low-birth-weight babies are often premature or have other health difficulties that make survival less likely. States with a high percentage of low-birth-weight infants will therefore also tend to have higher overall infant mortality rates.
⦁ Strength of the Relationship. The way to determine the strength of the relationship in a scatterplot is to look at how tightly (or loosely) the data points cluster around the line. This line is the “best fit” line for the data. This graph shows a strong relationship between low birth weight and infant mortality, but interpreting graphs can be a bit like interpreting art! While the direction of the relationship is usually easy to determine, judging the strength of the relationship from a scatterplot alone is a subjective exercise.
Figure 2: The relationship between median family income and infant mortality
⦁ Direction of Relationship. The graph shows a negative relationship between state median family income and the state infant mortality rate: in states with higher median family incomes, infant mortality is lower. This also makes sense, since in states with higher family incomes, more private resources are available throughout pregnancy, which reduces infant mortality.
⦁ Strength of the Relationship. The way to determine the strength of the relationship in a scatterplot is to look at how tightly (or loosely) the data points cluster around the line. In this respect, the data fit the line well, but not as tightly as in Figure 1. Again, though, such an interpretation is inherently subjective.
The Correlation Coefficient
Scatterplots are helpful for visualizing the association between X and Y, but graphs cannot provide a precise numerical estimate of the relationship. That numerical estimate is called the correlation coefficient, often denoted as r in published research. Correlation coefficients tell us both the direction of the relationship between X and Y and the strength of the relationship. The correlation coefficient is easy to interpret once we understand its properties.
Box 1: Properties of the Correlation Coefficient
Correlation Coefficient Property 1: r will always indicate a positive or negative relationship through its sign.
Correlation Coefficient Property 2: r will always lie within a defined range between -1 and 1. r is a normalized measure, which means that r does not depend on the scale of measurement of a variable. For example, age and income are measured on different scales, but r is not affected by those scales; it will always lie between -1 and +1.
Correlation Coefficient Property 3: r is bidirectional. This means that the correlation between X and Y is exactly the same as the correlation between Y and X. In other words, the “ordering” of the independent and dependent variable is irrelevant to the value of r.
Correlation Coefficient Property 4: r measures the strength of the linear relationship between X and Y. That means it measures how well the data fit along a straight line. r is also an effect size measure.
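The properties in Box 1 can be checked directly. The chapter's software is Stata, but the sketch below uses Python so the calculation is visible step by step; the five data points are hypothetical values invented for illustration, not the actual Kids Count data.

```python
import math

def pearson_r(x, y):
    """Compute the Pearson correlation coefficient r for two lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the two sum-of-squared-deviation terms.
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical data for five states: low-birth-weight share and infant mortality.
lbw = [6.0, 7.5, 8.0, 9.1, 10.2]
imr = [5.1, 5.9, 6.4, 7.0, 8.2]

r = pearson_r(lbw, imr)
print(round(r, 3))                       # Property 1: the sign shows direction
print(pearson_r(imr, lbw) == r)          # Property 3: r is bidirectional
# Property 2: rescaling a variable (say, measuring it in thousandths)
# leaves r unchanged, because r is normalized.
rescaled = [v * 1000 for v in lbw]
print(round(pearson_r(rescaled, imr), 3) == round(r, 3))
```

Swapping the two arguments or rescaling either variable leaves r unchanged, which is exactly what Properties 2 and 3 claim.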
Correlation Coefficient Effect Size
Property 4 says that r measures the degree to which the data fit along a single straight line. But what does an r=0.58 or an r=-0.10 tell us? Is this a large effect? This brings in the concept of effect size. Effect sizes tell us how strong the relationship is between variables, and they help to answer the question of substantive significance (McCloskey, 1996). Cohen (1988) offers this guidance for benchmarking r. Note that the effect size depends only on the magnitude of r, not its sign.
Table 1: Cohen’s Effect Size Benchmarks for r
r Value (negative)    r Value (positive)    Effect Size
-0.1 to -0.3          0.1 to 0.3            Small
-0.3 to -0.5          0.3 to 0.5            Medium
-0.5 to -1.0          0.5 to 1.0            Large
We can now answer the question of what an r=0.58 means in terms of effect size. Using Cohen’s benchmarks, 0.58>0.50, so we conclude that there is a large effect size, or in other words, a strong relationship between X and Y. For r=-0.10, the magnitude |-0.10|=0.10 falls in the small range, so there is a small effect size, or equivalently a weak relationship between X and Y.
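Because classification depends only on the magnitude of r, Table 1 can be written as a short lookup rule. A minimal Python sketch, assuming (as is conventional, though the table does not say so) that each benchmark is the lower bound of its band and that anything below 0.1 is labeled "negligible":

```python
def cohen_effect_size(r):
    """Classify a correlation using Cohen's (1988) benchmarks on |r|."""
    magnitude = abs(r)          # sign is irrelevant to effect size
    if magnitude >= 0.5:
        return "large"
    if magnitude >= 0.3:
        return "medium"
    if magnitude >= 0.1:
        return "small"
    return "negligible"        # below Cohen's smallest benchmark (assumption)

print(cohen_effect_size(0.58))   # large
print(cohen_effect_size(-0.10))  # small
```

Both of the chapter's worked examples come out as expected: 0.58 is large, and -0.10 is small.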
Correlation Coefficients for Infant Mortality, Low Birthweight and Median Family Income
The Stata output below is called a correlation matrix. A correlation matrix shows how each variable is correlated with every other variable. This matrix contains only three variables: imr (infant mortality rate), lobweight (low birth weight), and mhhif (median family income).
The first thing you’ll notice is the three ones on the diagonal. Those cells report the correlation of each variable with itself, which is always exactly 1.
Figure 3: Correlation Matrix for Infant Mortality Data
The correlation between infant mortality and low birth weight is 0.66 (rounded). Based on Cohen’s benchmarks, anything above r=0.5 is considered a large effect size, so we conclude that the correlation shows a strong relationship between the variables. The correlation between infant mortality and median family income is -0.59. Because the magnitude 0.59 exceeds Cohen’s 0.5 benchmark, it is also a large effect size. Notice that the matrix also reports the correlation between low birth weight and median family income as -0.47. This correlation would be classified as a medium effect size because its magnitude falls between 0.3 and 0.5.
Correlation and Causation
Correlation does not necessarily mean causation. Correlation can only establish that two variables are related to one another mathematically. Consider a simple example where a researcher is looking at the relationship between snow cone consumption and swimming pool accidents. The researcher finds that there is a positive correlation between snow cone consumption and swimming pool accidents. Are we to conclude that eating snow cones causes swimming accidents? Here the relationship is not causal even though a correlation exists: a third variable, hot summer weather, drives both snow cone consumption and swimming, which in turn produces more pool accidents. Correlation cannot establish causation. Instead, researchers must use theory to explain and justify why correlations exist between variables.
Review
⦁ Scatterplots show the relationship between two continuous variables
⦁ The correlation coefficient r measures the linear association between two variables
⦁ The sign tells us the direction of the relationship
⦁ The effect size can be determined by using Cohen’s effect size benchmarks
⦁ Usually, correlations are displayed in a correlation matrix that shows the pairwise correlation between the variables
⦁ Correlation matrices are an easy way to see how all the variables in a list are related.
⦁ Correlation cannot establish causation
Stata Code
*Scatterplots and Correlation
* This Code Uses the Annie E. Casey Foundation Data
*Figure 1
twoway (scatter imr lobweight) (lfit imr lobweight)
*Figure 2
twoway (scatter imr mhhif) (lfit imr mhhif)
*Correlation Matrix
correlate imr lobweight mhhif