Statistics

Bivariate Data & Scatter Plots

1

Correlation Type Match

Draw a line from each correlation type to its description.

Strong positive
Weak positive
Strong negative
No correlation
Weak negative

Points scattered randomly with no pattern
Points cluster tightly along an upward line
Points loosely trend upward with much scatter
Points cluster tightly along a downward line
Points loosely trend downward with much scatter
2

Positive or Negative Correlation

Circle the correct correlation direction for each scenario.

As temperature increases, ice cream sales increase. This is:

Positive correlation
Negative correlation
No correlation

As the age of a car increases, its resale value decreases. This is:

Negative correlation
Positive correlation
No correlation

As altitude increases, air temperature generally decreases. This is:

Negative correlation
Positive correlation
No correlation

As hours of study increase, exam marks tend to increase. This is:

Positive correlation
Negative correlation
No correlation
3

Strong or Weak Correlation

Circle whether each description suggests a strong or weak correlation.

Data points cluster very tightly around an upward-sloping line:

Strong correlation
Weak correlation
No correlation

Data points are widely scattered but show a slight downward trend:

Weak correlation
Strong correlation
No correlation

Almost every increase in x leads to a predictable decrease in y:

Strong correlation
Weak correlation
No correlation

Points form a rough cloud with only a vague upward drift:

Weak correlation
Strong correlation
No correlation
4

Classify the Correlation Strength

Sort each scenario into the correct column: Strong, Weak, or No Correlation.

Height vs shoe size
Hours of study vs exam mark
Shoe size vs favourite colour
Temperature vs ice cream sales
Age of car vs car value
Number of pets vs maths score
Distance driven vs fuel used
Hours of sleep vs number of siblings
Rainfall vs umbrella sales

Strong Correlation
Weak Correlation
No Correlation
5

Variable Pairs & Expected Correlation

Draw a line from each variable pair to the expected correlation direction.

Hours of exercise per week & fitness level
Number of cigarettes smoked & lung capacity
Shoe size & IQ
Advertising spend & product sales
Distance from equator & average temperature

No correlation expected
Positive — more exercise, higher fitness
Negative — more cigarettes, lower lung capacity
Positive — more advertising, more sales
Negative — further from equator, lower temperature
6

Independent vs Dependent Variable

Circle the independent variable (the one we control or expect to influence the other) in each investigation.

Investigating how study time affects exam results. The independent variable is:

Study time (hours)
Exam result (%)
Student name

Testing whether temperature affects plant growth. The independent variable is:

Temperature (°C)
Plant height (cm)
Type of soil

Exploring the link between age and reaction time. The independent variable is:

Age (years)
Reaction time (ms)
Number of trials

Does distance from school affect travel time? The independent variable is:

Distance from school (km)
Travel time (minutes)
Mode of transport
7

Reading Scatter Plot Axes

Circle the correct answer about what each axis represents on a scatter plot.

The horizontal axis (x-axis) of a scatter plot typically shows:

The independent variable
The dependent variable
The frequency

The vertical axis (y-axis) of a scatter plot typically shows:

The dependent variable
The independent variable
The sample size

On a scatter plot of 'Hours of sunlight vs Plant growth', the x-axis should show:

Hours of sunlight
Plant growth (cm)
Number of plants

Each point on a scatter plot represents:

One data pair (one observation of both variables)
The average of all data
A single variable measurement
8

Categorical vs Numerical Bivariate Data

Sort each variable pair into the correct column: both variables are numerical (suitable for a scatter plot) or at least one is categorical (not suitable for a scatter plot).

Height (cm) vs Weight (kg)
Favourite sport vs Gender
Temperature (°C) vs Rainfall (mm)
Eye colour vs Hair colour
Age (years) vs Reaction time (ms)
Brand of phone vs Satisfaction rating (1–5)
Hours of screen time vs Hours of sleep
State of residence vs Annual income ($)

Both Numerical (scatter plot)
Includes Categorical (not scatter plot)
9

Line of Best Fit Properties

Circle the correct statement about lines of best fit.

A line of best fit should:

Pass through or near as many points as possible with roughly equal points above and below
Connect the first and last data points
Pass through every data point

If a scatter plot shows a strong negative correlation, the line of best fit:

Slopes downward from left to right
Slopes upward from left to right
Is horizontal

A line of best fit is most useful when:

There is a clear linear trend in the data
The data points form a curved pattern
There is no correlation between the variables

If the data has no correlation, a line of best fit:

Is not meaningful and should not be drawn
Should still be drawn through the middle
Will always be perfectly horizontal
10

Interpolation vs Extrapolation

Circle the correct answer about using a line of best fit for predictions.

Using a line of best fit to predict a value within the range of the collected data is called:

Interpolation
Extrapolation
Correlation

Using a line of best fit to predict a value beyond the range of the collected data is called:

Extrapolation
Interpolation
Regression

Which type of prediction is generally more reliable?

Interpolation, because the trend is supported by nearby data
Extrapolation, because it extends the known pattern
Both are equally reliable

Data was collected for students who studied between 1 and 6 hours. Predicting the mark of a student who studied 12 hours is:

Extrapolation and may be unreliable
Interpolation and is reliable
Not possible with a line of best fit
11

Correlation vs Causation

Circle the correct answer about the difference between correlation and causation.

Correlation means:

Two variables tend to change together in a predictable pattern
One variable directly causes the other to change
The variables are always related by a formula

Causation means:

A change in one variable directly produces a change in the other
Two variables happen to change at the same time
The correlation coefficient is close to 1

Ice cream sales and drowning rates both increase in summer. This is an example of:

Correlation without causation — a third variable (hot weather) drives both
Causation — ice cream causes drowning
No correlation at all

A randomised controlled experiment can help establish:

Causation
Only correlation
Neither correlation nor causation
12

Identifying Outliers

Circle the correct answer about outliers in scatter plots.

An outlier on a scatter plot is:

A data point that lies far from the overall pattern of the other points
The point closest to the line of best fit
Any point on the x-axis

What effect can an outlier have on a line of best fit?

It can pull the line toward itself, making the fit less accurate for the rest of the data
It has no effect on the line of best fit
It always improves the accuracy of the line

Before removing an outlier, you should:

Investigate whether it is a data entry error or a genuine unusual observation
Always remove it because outliers are mistakes
Ignore it completely

A student recorded the heights and weights of 20 classmates. One point is far from all others. The best first step is:

Check whether the data was recorded correctly for that student
Delete the point immediately
Draw the line of best fit through it
13

Steps in a Bivariate Investigation

Put the steps for conducting a bivariate data investigation in the correct order.

Formulate a question about the relationship between two numerical variables
Plan data collection: decide on sample size, method, and how to record both variables
Collect the data systematically, recording paired values
Organise the data in a table of ordered pairs
Construct a scatter plot with the independent variable on the x-axis
Describe the association: direction, form, and strength
Draw a line of best fit if the trend is approximately linear
Use the line to make predictions (interpolation) and draw conclusions
14

Scatter Plot Pattern Match

Draw a line from each scatter plot description to its correlation type.

Points rise steeply from left to right in a tight band
Points fall gradually from left to right with wide scatter
Points form a random cloud with no trend
Points fall steeply from left to right in a tight band
Points rise gradually from left to right with wide scatter

Strong positive correlation
Weak negative correlation
No correlation
Strong negative correlation
Weak positive correlation
15

Predict Using a Line of Best Fit

Use the described line of best fit to circle the best prediction.

A line of best fit for 'Hours studied (x) vs Exam mark (y)' passes through (2, 50) and (6, 80). Predict the mark for 4 hours of study:

65
55
75

Using the same line, predict the mark for 5 hours of study:

72.5
60
80

The line of best fit for 'Temperature (x) vs Hot drinks sold (y)' passes through (10°C, 60) and (30°C, 20). Predict sales at 20°C:

40
50
30

Using the same line, would you trust a prediction for sales at 50°C?

No — 50°C is far outside the data range (extrapolation)
Yes — the line can be extended indefinitely
Yes — as long as we use the equation
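
Each prediction in this activity comes from the straight line through two given points. The following is a minimal Python sketch of that calculation; the function name `predict_from_two_points` is a hypothetical helper introduced here for illustration and is not part of the worksheet.

```python
def predict_from_two_points(p1, p2, x):
    """Predict y at x using the straight line through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    gradient = (y2 - y1) / (x2 - x1)      # rise over run
    return y1 + gradient * (x - x1)       # point-gradient form of the line


# Hours studied (x) vs exam mark (y): line through (2, 50) and (6, 80)
mark_at_4 = predict_from_two_points((2, 50), (6, 80), 4)

# Temperature (x) vs hot drinks sold (y): line through (10, 60) and (30, 20)
sales_at_20 = predict_from_two_points((10, 60), (30, 20), 20)

print(mark_at_4, sales_at_20)
```

The same function will return a number for 50°C as well, but the two points given only span 10°C to 30°C. That value would be an extrapolation, and here it comes out as a negative number of drinks.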
16

Interpreting r-values

The correlation coefficient (r) measures the strength and direction of a linear association. Circle the correct interpretation.

An r-value of +0.95 indicates:

Strong positive linear correlation
Weak positive correlation
No correlation

An r-value of −0.82 indicates:

Strong negative linear correlation
Weak negative correlation
Strong positive correlation

An r-value of +0.15 indicates:

Weak positive correlation (close to no linear relationship)
Strong positive correlation
Perfect correlation

An r-value of 0 indicates:

No linear correlation (but a non-linear relationship may still exist)
A perfect negative correlation
The data has no variability

An r-value of −1 indicates:

A perfect negative linear correlation — all points lie exactly on a downward line
No correlation
A weak negative correlation
17

Valid vs Invalid Conclusions

Sort each conclusion into the correct column: Valid or Invalid based on scatter plot data.

There is a positive association between hours studied and exam marks
Studying more hours causes higher exam marks
As temperature increases, hot drink sales tend to decrease
Hot weather causes people to stop drinking hot drinks entirely
There appears to be no linear relationship between shoe size and IQ
Countries that eat more chocolate win more Nobel Prizes, so chocolate makes people smarter
The data suggests a strong negative correlation between car age and resale value
Since ice cream sales and sunburn rates are correlated, eating ice cream causes sunburn

Valid Conclusion
Invalid Conclusion
18

Confounding (Third) Variables

Circle the most likely confounding variable that could explain the observed correlation.

Correlation: cities with more fire stations have more crime. The confounding variable is likely:

City population size
Number of firefighters
Colour of fire trucks

Correlation: children who eat breakfast score higher on tests. A possible confounding variable is:

Overall family socioeconomic status and home support
The brand of cereal eaten
The colour of the breakfast bowl

Correlation: people who sleep more tend to weigh less. A confounding variable could be:

Overall health habits (exercise, diet, stress levels)
Pillow type
Bedroom wall colour

Correlation: countries with higher chocolate consumption per capita have more Nobel Prize winners. The confounding variable is likely:

National wealth and investment in education and research
Type of chocolate preferred
Average temperature of the country
19

Design a Bivariate Investigation

Design a detailed bivariate data investigation.

Design a statistical investigation to test whether there is a relationship between the number of hours people exercise per week and their resting heart rate. In your response, describe: (a) the variables and which is independent/dependent, (b) how you would collect data (sample size, method, potential bias), (c) what you would expect the scatter plot to look like and what correlation you predict, (d) how you would draw and use a line of best fit, and (e) whether finding a correlation would prove that exercise causes a lower resting heart rate. Explain your reasoning.

20

Analyse a Dataset

Analyse the following bivariate dataset and describe the association.

A teacher recorded the number of hours each student spent on their phone per day and their average test score:

Phone hours: 1, 2, 2, 3, 3, 4, 4, 5, 6, 7
Test score: 88, 82, 85, 75, 78, 70, 65, 60, 55, 50

(a) What type of correlation does this data suggest?
(b) Estimate the strength of the correlation (strong, moderate, or weak).
(c) If you drew a line of best fit, would its gradient be positive or negative?
(d) Predict the test score for a student who uses their phone for 3.5 hours per day.
(e) Would it be appropriate to predict the score for a student who uses their phone for 15 hours per day? Why or why not?
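
One way to check a line drawn by eye for this dataset is to fit a least squares line with technology. The sketch below is one possible approach in plain Python, not the only acceptable method; `least_squares_line` is a hypothetical helper name, and a line of best fit drawn by hand will give slightly different values.

```python
phone_hours = [1, 2, 2, 3, 3, 4, 4, 5, 6, 7]
test_scores = [88, 82, 85, 75, 78, 70, 65, 60, 55, 50]


def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x   # line passes through (mean_x, mean_y)
    return gradient, intercept


gradient, intercept = least_squares_line(phone_hours, test_scores)

# 3.5 hours lies inside the observed range of 1 to 7 hours (interpolation)
predicted_at_3_5 = gradient * 3.5 + intercept

# 15 hours is far beyond the largest observed value of 7 (extrapolation)
predicted_at_15 = gradient * 15 + intercept

print(round(gradient, 2), round(intercept, 2))
print(round(predicted_at_3_5, 1), round(predicted_at_15, 1))
```

The 15-hour value comes out below zero, which is impossible for a test score, so the code itself illustrates why part (e) asks about the reliability of extrapolation.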

21

Correlation vs Causation Explained

Explain the difference between correlation and causation using examples.

Using your own examples, explain the difference between correlation and causation. Include: (a) one example where two variables are correlated AND one causes the other, (b) one example where two variables are correlated but neither causes the other (identify the confounding variable), and (c) an explanation of why scientists use controlled experiments rather than observational studies to establish causation.

22

Critique a Study's Conclusions

Read the study summary and critique its conclusions.

A newspaper reports: 'A study of 500 adults found that people who drink more coffee tend to live longer. Researchers concluded that coffee extends your lifespan.' Critique this conclusion by addressing: (a) Does correlation prove causation here? (b) What confounding variables might explain this relationship? (c) What type of study would be needed to establish whether coffee actually extends lifespan? (d) How might the sample or data collection method affect the reliability of the findings?

23

Compare Two Scatter Plots

Compare two bivariate datasets and their scatter plots.

Two investigations were conducted at a school:

Investigation A: Hours of sleep vs Reaction time (ms)
Sleep: 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
Reaction: 420, 380, 400, 340, 350, 300, 310, 270, 280, 250

Investigation B: Hours of TV vs Reaction time (ms)
TV: 1, 2, 2, 3, 3, 4, 5, 5, 6, 7
Reaction: 310, 280, 350, 300, 370, 320, 290, 340, 360, 300

(a) Describe the correlation you would expect in each investigation.
(b) Which investigation would likely show a stronger correlation? Explain why.
(c) For the investigation with the stronger correlation, describe what the line of best fit would look like.
(d) Can either investigation prove causation? Why or why not?

24

True or False — Statistics Concepts

Circle TRUE or FALSE for each statement about bivariate data and scatter plots.

A correlation coefficient (r) can have a value of 1.5.

FALSE — r always lies between −1 and +1
TRUE

A scatter plot can only show positive correlations.

FALSE — scatter plots can show positive, negative, or no correlation
TRUE

If r = 0, there is definitely no relationship between the variables.

FALSE — r = 0 means no linear relationship, but a non-linear relationship may exist
TRUE

Interpolation is more reliable than extrapolation.

TRUE — interpolation predicts within the data range where the trend is supported
FALSE

An outlier should always be removed from a dataset.

FALSE — outliers should be investigated before deciding whether to keep or remove them
TRUE

The independent variable is placed on the x-axis of a scatter plot.

TRUE
FALSE
25

Collect Data and Predict

Plan a real data collection and make predictions.

You want to investigate whether there is a relationship between the distance students live from school and the time it takes them to travel to school. (a) Which variable is independent and which is dependent? (b) Describe how you would collect data from at least 15 students. (c) What type of correlation do you predict? Explain your reasoning. (d) Sketch what you think the scatter plot might look like (describe it in words). (e) Identify one potential source of bias in your data collection and how you would minimise it.

26

Identify Confounding Variables

Identify and explain confounding variables in real-world correlations.

For each of the following correlations, identify at least one confounding variable and explain how it could account for the observed relationship: (a) Students who eat breakfast tend to get better grades. (b) Countries with more televisions per household have longer life expectancies. (c) People who own more books tend to earn higher salaries. (d) Suburbs with more parks have lower rates of obesity. For one of these examples, describe how you could design a study to test whether the relationship is causal rather than just a correlation.

27

Collect Bivariate Data at Home

Collect your own bivariate data and create a scatter plot.

  1. Record the temperature and the number of people at a local park over several days. Create a scatter plot. Is there a correlation?
  2. Survey family members: compare their height with their arm span. Plot the data and describe the association.
  3. Track your screen time and hours of sleep for a week. Create a scatter plot and describe any pattern you observe.
  4. Measure the length and width of 10 different leaves from the same type of tree. Plot the data and describe the association.
28

Find Correlations in Daily Life

Look for examples of correlation (and possible causation) in your everyday life and in the media.

  1. Find a news article that claims one thing causes another. Identify whether the evidence shows correlation or causation. What confounding variables might be involved?
  2. Over a week, record two variables you think might be related (e.g., time spent outdoors vs mood rating 1–10). Create a scatter plot and describe what you find.
  3. Look at the nutrition labels on 10 food items. Plot sugar content vs calorie count. Is there a correlation? Is it what you expected?
  4. Ask five people to estimate how far they live from the nearest shop (in km) and how often they visit per week. Plot the data and describe any pattern.
29

Correlation — Describe and Classify

Describe scatter plot correlations accurately.

For each pair of variables, state whether you would expect a positive correlation, negative correlation, or no correlation, and give a brief reason:
(a) Study hours and exam score
(b) Temperature and hot chocolate sales
(c) Shoe size and intelligence
(d) Height and weight of adults
(e) Daily exercise and resting heart rate

30

Scatter Plot Description to Correlation Type

Draw a line from each scatter plot description to the correct correlation type.

Points trend from bottom-left to top-right with a moderate amount of scatter
Points are randomly scattered with no pattern
Points cluster loosely from top-left to bottom-right
Points curve upward steeply then level off
Points cluster very tightly along a nearly perfect line (upward)

No correlation
Non-linear relationship
Strong positive linear correlation
Weak negative linear correlation
Moderate positive linear correlation
31

Line of Best Fit — Equation and Interpretation

Find and interpret the equation of a line of best fit.

A scatter plot shows study hours (x) and exam scores (y) for 10 students. The line of best fit passes through (2, 55) and (8, 85). Find: (a) The gradient (m) of the line of best fit. (b) The y-intercept. (c) The equation of the line. (d) Predict the score for a student who studies 5 hours. (e) Explain what the gradient means in context.

32

Correlation Coefficient r — Interpret

Circle the correct interpretation of each correlation coefficient.

r = 0.92 means:

Strong positive linear correlation
Weak positive linear correlation
Strong negative linear correlation

r = −0.15 means:

Very weak negative linear correlation (close to no linear relationship)
Strong negative linear correlation
Perfect negative correlation

r = 0 means:

No linear relationship (but a non-linear relationship may still exist)
Perfect correlation
Exactly half strong and half weak

r = −0.85 means:

Strong negative linear correlation
Weak negative correlation
No relationship
33

Causation vs Correlation

Sort each example into the correct column: Likely Causal, or Correlation but NOT Causation.

Higher cigarette smoking rates → higher rates of lung cancer
Ice cream sales and drowning rates both peak in summer
More study hours → better exam scores
Number of Nicolas Cage films per year correlates with pool drowning deaths
Higher alcohol consumption → increased liver disease risk
Countries with more TVs per capita have higher life expectancy (wealth confound)

Likely Causal
Correlation but NOT Causation
34

Residuals and Goodness of Fit

Calculate and interpret residuals from a line of best fit.

Using the model: Exam score = 5 × (study hours) + 45, calculate the residual for each student:
(a) Studied 3 hrs, scored 62
(b) Studied 6 hrs, scored 72
(c) Studied 9 hrs, scored 92
For each, state whether the line overestimates or underestimates the actual score. What does a pattern of large residuals suggest about the model?
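
The residual calculation can be sketched in a few lines of Python. This assumes the usual convention that residual = actual value minus predicted value, so a positive residual means the line underestimates the actual score; the function name `predicted_score` is a hypothetical helper.

```python
def predicted_score(study_hours):
    """Model from the question: exam score = 5 * study hours + 45."""
    return 5 * study_hours + 45


# (study hours, actual score) for the three students in the question
students = [(3, 62), (6, 72), (9, 92)]

residuals = []
for hours, actual in students:
    residual = actual - predicted_score(hours)   # actual minus predicted
    residuals.append(residual)
    if residual > 0:
        verdict = "line underestimates the actual score"
    elif residual < 0:
        verdict = "line overestimates the actual score"
    else:
        verdict = "line matches the actual score"
    print(hours, actual, residual, verdict)
```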

35

Extrapolation — When to Be Careful

Critique the use of extrapolation beyond the data range.

A model for plant height over time gives h = 1.5t + 3 (h in cm, t in weeks) based on data from weeks 1–8. A student uses this to predict the height at week 52. (a) What prediction does the model give? (b) Explain why this prediction is likely unreliable. (c) What factors might limit the plant's actual growth?

Explain the difference between interpolation and extrapolation. Which is more reliable and why? Give an example of each using a scatter plot context.
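
A small Python sketch can make the distinction concrete by labelling each prediction according to whether the input falls inside the observed range. The function name `predict_height` and its `data_range` parameter are hypothetical, chosen for illustration; the model and the week 1–8 range come from the question above.

```python
def predict_height(t_weeks, data_range=(1, 8)):
    """Evaluate h = 1.5t + 3 and label the prediction by where t falls."""
    low, high = data_range
    height_cm = 1.5 * t_weeks + 3
    kind = "interpolation" if low <= t_weeks <= high else "extrapolation"
    return height_cm, kind


print(predict_height(5))    # week inside the observed range
print(predict_height(52))   # week far outside the observed range
```

The model returns a height for any week it is given. The label is a reminder that only values inside weeks 1–8 are supported by data.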

36

Scatter Plot Correlations in a Research Study

Record correlation types observed across 20 variable pairs in a dataset.

Item                                             | Tally | Total
Strong positive correlation (r > 0.7)            |       |
Moderate positive correlation (0.3 < r ≤ 0.7)    |       |
Weak/no correlation (−0.3 ≤ r ≤ 0.3)             |       |
Moderate negative correlation (−0.7 ≤ r < −0.3)  |       |
Strong negative correlation (r < −0.7)           |       |
37

Two-Way Tables — Bivariate Categorical Data

Construct and analyse a two-way frequency table.

100 students were surveyed about sport preferences and gender:
  • 60 are female: 25 prefer netball, 20 prefer swimming, 15 prefer soccer
  • 40 are male: 5 prefer netball, 10 prefer swimming, 25 prefer soccer

(a) Construct the two-way table.
(b) What percentage of females prefer swimming?
(c) What percentage of the students who prefer soccer are male?
(d) Is there an association between gender and sport preference? Justify.

Draw here
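
A two-way table and its row and column percentages can also be built with technology. The sketch below uses a plain Python dictionary with the survey counts from this question; the variable names are illustrative choices, not a required layout.

```python
# Counts from the survey: rows are gender, columns are preferred sport
counts = {
    "Female": {"Netball": 25, "Swimming": 20, "Soccer": 15},
    "Male":   {"Netball": 5,  "Swimming": 10, "Soccer": 25},
}
sports = ["Netball", "Swimming", "Soccer"]

row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {s: sum(counts[g][s] for g in counts) for s in sports}

# Print the two-way table with marginal totals
print("Gender".ljust(8) + "".join(s.rjust(10) for s in sports) + "Total".rjust(10))
for g, row in counts.items():
    print(g.ljust(8) + "".join(str(row[s]).rjust(10) for s in sports)
          + str(row_totals[g]).rjust(10))
print("Total".ljust(8) + "".join(str(col_totals[s]).rjust(10) for s in sports)
      + str(sum(row_totals.values())).rjust(10))

# Row percentage: share of females who prefer swimming
pct_females_swimming = 100 * counts["Female"]["Swimming"] / row_totals["Female"]
# Column percentage: share of students preferring soccer who are male
pct_soccer_male = 100 * counts["Male"]["Soccer"] / col_totals["Soccer"]
print(round(pct_females_swimming, 1), pct_soccer_male)
```

Part (b) divides by a row total and part (c) divides by a column total. Mixing the two up is the most common slip with two-way tables.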
38

Outliers in Bivariate Data

Identify and analyse outliers in scatter plots.

Explain what an outlier means in the context of bivariate data. How does an outlier differ from a point that is merely an extreme value on one axis? Describe how outliers can affect the line of best fit and the correlation coefficient r.

In a scatter plot of shoe size vs reading level for 30 children aged 5–15, there is a strong positive correlation (r = 0.82). Does this mean bigger feet cause better reading? Identify the confounding variable and explain how it creates a spurious correlation.

39

Scatter Plot Variables — Independent vs Dependent

Sort each variable: which is the independent variable (x-axis) and which is dependent (y-axis)?

Hours of sunlight per day
Crop yield per hectare
Advertising spend ($)
Monthly sales revenue ($)
Daily temperature (°C)
Number of beach visitors
Years of experience
Annual salary

Independent Variable (x)
Dependent Variable (y)
40

Collect Your Own Bivariate Data

Design and conduct a data collection activity to investigate bivariate relationships.

  1. Measure your reaction time (use an online reaction time test) 10 times at different times of day (morning, afternoon, evening). Record time-of-day and reaction time. Create a scatter plot and describe any pattern you see.
  2. Record the outside temperature and the number of people wearing jackets on 7 different days when you go out. Create a scatter plot. Is there a negative correlation?
  3. Survey at least 15 people on two numerical variables (e.g. hours of sleep vs energy rating out of 10). Plot the scatter graph and calculate the correlation coefficient using a spreadsheet.
41

Pearson's Correlation Coefficient — Calculation

Calculate and interpret Pearson's correlation coefficient.

For the 5 data points (1, 2), (2, 4), (3, 5), (4, 4), (5, 7):
(a) Calculate the mean of x and the mean of y.
(b) Calculate Σ(x − x̄)(y − ȳ), Σ(x − x̄)², and Σ(y − ȳ)².
(c) Use r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²] to find r.
(d) Interpret the value of r you found.

Draw here
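
Hand calculations for this question can be checked with a short script that follows the same steps (a) to (c). This is a minimal sketch in plain Python; the variable names `sum_xy`, `sum_xx` and `sum_yy` are illustrative labels for the three sums in part (b).

```python
from math import sqrt

points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 7)]
xs = [x for x, _ in points]
ys = [y for _, y in points]

# (a) means of x and y
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# (b) the three sums of deviations
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
sum_xx = sum((x - mean_x) ** 2 for x in xs)
sum_yy = sum((y - mean_y) ** 2 for y in ys)

# (c) Pearson's r from the formula in the question
r = sum_xy / sqrt(sum_xx * sum_yy)

print(mean_x, mean_y)
print(round(sum_xy, 2), round(sum_xx, 2), round(sum_yy, 2), round(r, 3))
```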
42

Steps to Draw a Line of Best Fit

Put the steps in the correct order for drawing a line of best fit by eye.

Plot all data points on a clearly labelled scatter graph
Identify the overall trend (positive, negative, no correlation)
Draw a straight line that best represents the trend
Ensure approximately equal numbers of points above and below the line
Make sure the line passes through or near the mean point (x̄, ȳ)
Select two points on the line (not data points) to calculate the equation
43

Critique a Statistical Claim

Critically evaluate a statistical claim involving correlation.

A newspaper headline reads: 'Research shows children who eat breakfast score higher on tests — proof that breakfast improves brain function.' Critically evaluate this claim. Identify: (a) what type of study this might be, (b) at least two confounding variables, (c) why correlation does not prove causation, (d) what type of study would be needed to establish causation.

44

Correlation Strength — Match the Description

Draw a line from each correlation coefficient to its description.

r = 1.0
r = 0.8
r = 0.3
r = 0
r = −0.7
r = −1.0

Moderate negative correlation
Perfect positive correlation
No linear correlation
Strong positive correlation
Weak positive correlation
Perfect negative correlation
45

Least Squares Regression Line

Understand and apply the line of best fit equation.

Explain what the least squares regression line minimises. Why is it called 'least squares'?

The regression line for study hours (x) vs test score (y) is ŷ = 42 + 8x. Interpret the slope and y-intercept in context.

Predict the test score for a student who studies 6 hours. Is this interpolation or extrapolation?

Predict the score for a student who studies 15 hours. Why should this prediction be treated with caution?
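
The name 'least squares' can be illustrated numerically: the regression line is the one whose sum of squared residuals is smaller than that of any other straight line. The sketch below uses a small hypothetical dataset, invented purely for illustration and unrelated to the line ŷ = 42 + 8x above, and compares the fitted line with a few nudged alternatives. The helper name `sum_squared_residuals` is also an assumption.

```python
# Hypothetical (study hours, test score) pairs, invented for illustration only
data = [(1, 52), (2, 57), (3, 68), (4, 71), (5, 84), (6, 88)]


def sum_squared_residuals(gradient, intercept):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in data)


# Least squares gradient and intercept from the standard formulas
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
gradient = (sum((x - mean_x) * (y - mean_y) for x, y in data)
            / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - gradient * mean_x

best = sum_squared_residuals(gradient, intercept)

# Nudging the gradient or intercept away from these values makes the sum larger
nudged = [
    sum_squared_residuals(gradient + 0.5, intercept),
    sum_squared_residuals(gradient - 0.5, intercept),
    sum_squared_residuals(gradient, intercept + 2),
    sum_squared_residuals(gradient, intercept - 2),
]
print(round(best, 2), [round(s, 2) for s in nudged])
```

Squaring the residuals stops positive and negative residuals from cancelling, and it penalises large misses more heavily than small ones.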

46

Correlation or Causation?

Sort each claim as showing genuine causation or merely correlation.

Smoking and lung cancer
Ice cream sales and drowning rates
Exercise and improved cardiovascular health
Number of TVs owned and life expectancy
Vaccination and reduced disease incidence
Shoe size and reading ability in children

Likely causation
Correlation only
47

Collecting and Graphing Bivariate Data

Design and carry out a small bivariate data investigation.

Choose two variables you believe might be correlated (e.g. temperature and ice cream sales, hours of sleep and concentration). State a hypothesis about their relationship.

Describe how you would collect data for your two variables. How many data points would you collect? What controls would you apply?

Sketch the shape of the scatter plot you would expect to see if your hypothesis is correct.

Draw here

How would you calculate r for your data? What value of r would support your hypothesis?

48

Scatter Plot Patterns Identified

Tally each type of correlation pattern observed in the scatter plots you studied.

Item            | Tally | Total
Strong positive |       |
Weak positive   |       |
Strong negative |       |
Weak negative   |       |
No correlation  |       |
49

Identify the Correct Interpretation

Circle the best interpretation of each statistical statement.

r = 0.85 between height and shoe size means:

Height causes larger shoe size
There is a strong positive linear association
Knowing height exactly predicts shoe size

The slope of the regression line is 2.5. This means:

For each 1-unit increase in x, y increases by 2.5 on average
x is 2.5 times y
When x = 0, y = 2.5

An outlier in a scatter plot:

Can strongly affect the regression line
Should always be deleted
Proves the data is wrong

Extrapolation beyond the data range is unreliable because:

The linear pattern may not continue outside the data range
The formula changes outside the range
We run out of decimal places
50

Residuals and Model Quality

Assess how well a regression model fits the data.

Define a residual in the context of regression analysis.

A student scores 68 on a test. The regression model predicts 74. Calculate and interpret the residual.

If residuals are randomly scattered above and below the regression line, what does this suggest about the model?

If residuals show a curved pattern, what does this suggest? What model might be better?

51

Bivariate Data Investigation at Home

Design and conduct a small bivariate data study using household data.

  1. Collect data on two variables for at least 10 observations (e.g. temperature vs electricity bill for 10 months). Draw a scatter plot and estimate the correlation.
  2. Research a real Australian dataset (e.g. ABS website). Find two related variables and describe their correlation.
  3. Look at a health or fitness app on your phone or family member's phone. Find two variables that are tracked and describe any pattern you see.
  4. Research Simpson's Paradox: a situation where a trend appears in groups of data but disappears or reverses when groups are combined. Write a short summary.
  5. Find a scatter plot in a scientific journal or newspaper. Write three observations about the data shown, including the direction, strength, and any outliers.
52

Non-Linear Relationships in Data

Recognise when a linear model is not appropriate.

Sketch scatter plots showing: (a) a linear relationship, (b) a curved (quadratic) relationship, (c) no relationship. Label each.

Draw here

Population data for a city over 10 years shows exponential growth. Why would a linear regression model be inappropriate here?

What transformations (e.g. log, square root) could linearise an exponential relationship in data? Explain how you would apply them.
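
For an exponential model y = a × bᵗ, taking logarithms gives log(y) = log(a) + t × log(b), which is a straight line in t. The sketch below checks this numerically using a hypothetical population that grows by 20% per year. The starting value of 50 000, the growth rate, and the helper name `pearson_r` are all assumptions made for illustration.

```python
from math import log, sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical city population growing 20% per year for 10 years
years = list(range(11))
population = [50_000 * 1.2 ** t for t in years]

r_raw = pearson_r(years, population)                    # curved relationship
r_log = pearson_r(years, [log(p) for p in population])  # log(y) is linear in t
print(round(r_raw, 4), round(r_log, 4))
```

The raw data can still give a fairly high r even though the pattern is curved, so a scatter plot or residual plot is a better check of linearity than r alone. After the log transform the points lie on a straight line.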

53

Pearson's Correlation Coefficient

Understand and calculate Pearson's r.

Explain what Pearson's correlation coefficient r measures. What are its maximum and minimum values?

For data: x = {2, 4, 6, 8, 10}, y = {5, 9, 13, 16, 21}. Calculate the mean of x and mean of y. Then calculate r using the formula or technology. Interpret the result.

Can two variables have r ≈ 0 but still have a strong non-linear relationship? Explain and give an example.

54

Scatter Plot Vocabulary

Match each scatter plot term to its correct description.

Response variable
Explanatory variable
Outlier
Cluster
Line of best fit
Extrapolation

Predicting outside the range of the data
A point clearly separated from the main pattern
Placed on the x-axis; independent variable
Minimises the sum of squared residuals
A group of points separate from others
Placed on the y-axis; depends on x
55

Confounding Variables and Study Design

Identify confounding variables and distinguish study types.

Define a confounding variable. Give an example of how a confounder could lead to a misleading correlation.

A study finds that areas with more hospitals have higher death rates. Does this mean hospitals cause death? Identify the confounding variable.

Explain the difference between an observational study and a randomised controlled experiment. Which one can establish causation?

Design a controlled experiment to test whether lack of sleep causes lower test scores. Describe your key controls.