Instaskills, n,

one-page quick learning so you can hit the ground running while on the job


Understanding Statistics: Basic Data Analysis


This page provides you with some basic data analytics to help you quickly make sense of social science or population data.  All data types can be represented in a basic data matrix structure of rows (which represent 'cases' or 'observations') and columns (which represent 'variables') in a spreadsheet format.


VARIABLE TYPES

There are two kinds of variables - categorical and quantitative (which can be continuous or discrete).  It is important to understand the different types as the type determines the sort of analysis we do.

Binary - categorical - two categories (e.g. male/female)

Nominal - categorical - more than two categories - sometimes called 'named' categories (e.g. social class, region of residence)

Ordinal - categorical - ranked or ordered categories - numeric codes 1,2,3 etc are used as labels but the numeric order corresponds to the ordering of categories  (e.g. class of university degree, level of job satisfaction) 

Interval/scaled - continuous - differences have the same meaning at different points of the scale - we know the order and the exact differences between the values, but there is no true zero (e.g. temperature in degrees Celsius, calendar year)

Ratio - continuous - like interval but has a true zero as its point of origin, so ratios of values are meaningful (e.g. elapsed time, age)

Discrete count - discrete - (e.g. number of children in a family, number of patients per year)


Variable Type       Descriptive Statistics                                Graphing data
-----------------   ---------------------------------------------------   -------------------------------------------------------------
Binary              Frequencies; descriptive of 0/1 variable; crosstabs   Barcharts
Nominal             Frequencies; crosstabs                                Barcharts; piecharts
Ordinal             Frequencies; crosstabs                                Barcharts; piecharts; box and whiskers
Interval or ratio   Descriptive; grouped frequencies                      Scatterplots; histograms; stem & leaf plots; box and whiskers
Discrete counts     Frequencies (if few values); descriptive              Histograms; stem & leaf plots; box and whiskers


SHOWING RELATIONSHIPS


Categorical independent variable with a categorical dependent variable:

CROSSTABS

Example: table of counts of households with/without computer

BAR CHART in clustered or stacked form

Categorical independent variable with an interval/ratio dependent variable:

COMPARE MEANS

Example: table of mean income by sex

Interval/ratio independent variable with a categorical dependent variable:

CROSSTABS

Example: table of counts of households with/without computer by age-group of head of household

Interval/ratio independent variable with an interval/ratio dependent variable:

COMPARE MEANS (with grouped variables)

Example: table of mean income by age-group

SCATTERPLOT

Example: graph of income by age
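
As a rough illustration of the crosstab and compare-means ideas above, the sketch below builds both tables from a handful of made-up records using only the Python standard library. The records, variable names and values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records: one dict per case (row), one key per variable (column).
records = [
    {"sex": "F", "computer": "yes", "income": 32},
    {"sex": "F", "computer": "yes", "income": 28},
    {"sex": "F", "computer": "no", "income": 21},
    {"sex": "M", "computer": "yes", "income": 35},
    {"sex": "M", "computer": "no", "income": 25},
    {"sex": "M", "computer": "no", "income": 24},
]

# Crosstab: count the cases in each combination of two categorical variables.
crosstab = Counter((r["sex"], r["computer"]) for r in records)

# Compare means: average an interval/ratio variable within each category.
incomes_by_sex = defaultdict(list)
for r in records:
    incomes_by_sex[r["sex"]].append(r["income"])
mean_income = {sex: mean(values) for sex, values in incomes_by_sex.items()}

for key in sorted(crosstab):
    print(key, crosstab[key])
print(mean_income)
```

In a statistics package the same two tables would normally come from a built-in crosstab or compare-means command; the point here is only to show what those commands count and average.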


DATA TYPES

  1. Survey
  2. Aggregate
  3. Time series
  4. Experiment
  5. Event based


What is Survey Data?

Survey data come from a questionnaire administered to a number of respondents, usually a random sample of members of a population of interest.  The aim is to draw conclusions about the population.

Cases = Respondents

Variables = questionnaire responses

Requirements: estimation of population quantities; inference about relationships between variables

Issues to note: non-response (non-contacts) introducing bias; missing data (refusals, not-applicable); reliability and validity of measures; sampling and non-sampling errors; sample design; weighting to deal with complex design and any differential non-response


What is Aggregate Data?

Aggregate data are statistics about a set of administrative, political, social or economic units.

Cases = administrative units (e.g. schools, local authority areas, general practices)

Variables = measures of characteristics of the unit - usually aggregates of individual level data such as counts and percentages

Requirements: description (e.g. ranking) and estimation; inference about relationships between variables

Issues to note: be careful when drawing inferences about individual behaviour from aggregate data (ecological inference); Geographical Information Systems (GIS) can be used to map and manipulate spatial data


What is Time Series Data?

Time series data involve a set of measurements on an entity of interest over time.

Cases = time points

Variables = measures

Requirements: relationships between series; behaviour of series  (e.g. existence of trends, cycles, lagged relationships, effect of shocks)

Issues to note: time dependence between rows (needs special techniques, such as the time series methods used in econometrics); changes in the definition of measures over time


What is Experiment Data?

Experiment data involve the random allocation of treatments to subjects. Measures of the effects of the treatment are taken together with measures of potential explanatory variables (covariates).  If the subjects are randomly selected from a larger population conclusions can be generalised to the population. 

Cases = subjects

Variables = measures of treatment effects and covariates

Requirements: inference about the effect of treatments controlling for covariates and within subgroups

Issues to note: analysis must take account of experimental design


What is Event based Data?

Event based data involve observations which are constructed from a set of possible events which could occur (e.g. countries being at war or not; people dead or alive)

Cases = possible events

Variables = measures of the characteristics of the participants in the event and outcomes


Measures of outcome

Public health outcomes can be measured in a number of ways but the three most common are rate, proportion and ratio.  

A rate measures the frequency of an event in a population over a stated period of time.  The numerator must be included in the denominator.  A rate is usually expressed per 1,000, 10,000 or 100,000 persons, e.g. number of car accidents per 100,000 people per year.  A proportion is sometimes confused with a rate, but it has no time component.  Like a rate, it includes the numerator in the denominator, e.g. 25% of young people in the neighbourhood have hay fever.
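
To make the difference between a rate and a proportion concrete, here is a minimal sketch in Python. The function names and all the figures are hypothetical, invented only for illustration.

```python
def rate_per(events, population, years=1.0, per=100_000):
    """Events per `per` persons per year (a rate has a time component)."""
    person_years = population * years
    return events * per / person_years


def proportion(cases, total):
    """Share of the total that are cases (no time component)."""
    return cases / total


# Hypothetical figures: 450 car accidents in one year in a population of 300,000.
accident_rate = rate_per(events=450, population=300_000, years=1)

# Hypothetical figures: 50 of 200 young people have hay fever.
hay_fever_share = proportion(cases=50, total=200)

print(f"{accident_rate:.1f} accidents per 100,000 people per year")
print(f"{hay_fever_share:.0%} of young people have hay fever")
```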

A ratio is a value obtained by dividing one number by another.  The two numbers may or may not be related to each other.  Relative risk (risk ratio) is the frequency of the outcome in the exposed group divided by the frequency of the outcome in the unexposed group.  If the frequency is the same, then the ratio is 1.  If the outcome is more frequent in those exposed, then the ratio will be greater than 1, implying an increased risk associated with the exposure.  A ratio of less than 1 implies the outcome is less frequent among the exposed.


Relative odds (odds ratio) is the odds of exposure among the case group divided by the odds of exposure among the controls.  Again, an odds ratio of 1 means the odds are equal in the two groups.
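
The two ratios above can be sketched from a 2x2 table of counts. This is an illustrative example only: the function names and the counts are hypothetical, and the same table is reused for both measures purely to keep the example short (in practice relative risk comes from cohort-type data and the odds ratio from case-control data).

```python
def relative_risk(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Frequency of outcome in the exposed divided by frequency in the unexposed."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)
    return risk_exposed / risk_unexposed


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_in_cases = exposed_cases / unexposed_cases
    odds_in_controls = exposed_controls / unexposed_controls
    return odds_in_cases / odds_in_controls


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the outcome.
rr = relative_risk(30, 70, 10, 90)
oratio = odds_ratio(30, 70, 10, 90)

print(f"relative risk = {rr:.2f}")   # greater than 1: outcome more frequent in the exposed
print(f"odds ratio    = {oratio:.2f}")
```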

Univariate Statistics 

This is essentially your mean, median and mode (also known as the measures of central tendency).  Another useful measure to know is the standard deviation.  This shows you how widely dispersed cases are around the mean.  If most cases are near the mean, it will be low.  A big standard deviation means that there is a wide dispersion.  
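
If you have Python to hand, its built-in statistics module covers these measures. The sketch below uses made-up scores; the two extra lists are hypothetical and share the same mean to show how the standard deviation separates tightly clustered cases from widely dispersed ones.

```python
import statistics

# Hypothetical scores for eight cases.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)

# Two hypothetical data sets with the same mean but different spread.
tight = [4, 5, 5, 5, 5, 6]
wide = [0, 1, 5, 5, 9, 10]
sd_tight = statistics.stdev(tight)  # sample standard deviation
sd_wide = statistics.stdev(wide)

print("mean:", mean_score, "median:", median_score, "mode:", mode_score)
print(f"SD when cases are near the mean: {sd_tight:.2f}")
print(f"SD when cases are widely spread: {sd_wide:.2f}")
```

Note that statistics.stdev gives the sample standard deviation (dividing by n - 1); statistics.pstdev gives the population version.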

If you want to save yourself some time on the calculator and don't have access to a statistics software package, this calculator is very useful.

Mean, Median, Mode Calculator









Bivariate Statistics

Bivariate statistics look at the relationship between two variables.  Mostly this is done through a contingency table (cross-tabulation).  A good 'rule of thumb' is to put the independent variable across the top of the table and then read the column percentages.  The chi-square test is the most commonly used statistical inference test for contingency tables.  Note that it only tells you whether two categorical variables are independent or not.  It does not indicate anything about the strength of the relationship or the direction of causality.
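
As a rough illustration of the chi-square test described above, the sketch below computes the statistic for a hypothetical 2x2 table using only the Python standard library. The counts and the helper names (chi_square, p_value_df1) are invented for illustration, and the p-value shortcut shown is valid only when there is 1 degree of freedom.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail probability of a chi-square value with 1 degree of freedom only."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts. Columns = independent variable (two groups of 50),
# rows = dependent variable (has a computer: yes, no).
table = [[40, 20],
         [10, 30]]

stat, df = chi_square(table)
p = p_value_df1(stat)

# Column percentages for the "yes" row: 80% in the first group, 40% in the second.
col_totals = [sum(col) for col in zip(*table)]
yes_percent = [100 * table[0][j] / col_totals[j] for j in range(2)]

print(f"chi-square = {stat:.2f}, df = {df}, p = {p:.5f}")
print("column % with a computer:", yes_percent)
```

A small p-value here only suggests the two variables are not independent; the column percentages are what show the pattern.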

There are also t-tests, which test the difference in means between two groups.  A t-test uses the t-statistic, the t-distribution and the degrees of freedom to determine statistical significance.  Calculating a t-test requires three key data values: the difference between the mean values of the two data sets (called the mean difference), the standard deviation of each group, and the number of data values in each group.  There are different types of t-test, but they can be categorised as dependent (the two groups are related in some way, e.g. the same subjects measured twice) or independent.
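
To show how the three data values combine, here is a minimal sketch of an independent-samples t statistic. It assumes Welch's version of the test (which does not require equal variances); that choice, the function name and the scores are assumptions made for illustration. Turning the statistic into a p-value needs the t-distribution, which is left to a statistics package or table.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two independent groups."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    var_a = statistics.variance(group_a)  # squared sample standard deviation
    var_b = statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)
    se_a, se_b = var_a / n_a, var_b / n_b
    t = mean_diff / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical scores for two independent groups.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```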

T-tests can be one-sided or two-sided tests of significance (worth noting that all comparisons of three groups or more are two-sided).  For a one-sided t-test, you specify the direction of the difference in advance.  A two-sided t-test can detect a difference in either direction, so on its own it only tells you that the groups differ, not in which direction.  This is because we usually start with a null hypothesis of no difference between the two groups.  For example, a two-sided independent samples t-test found that males had different self-esteem scores from females.  With a one-sided test, you would test specifically whether the scores were higher (or lower).  Another example of one-sided test use is when you want the alternative hypothesis to state that the treatment is better than the placebo.  In general, two-sided tests should be used unless there is good reason not to.  A two-sided test at the 0.05 significance level places 0.025 in each tail of the distribution, whereas a one-sided test places the full 0.05 in one tail.

A scatterplot is used to show the relationship between two continuous variables.  A useful term to know is 'precision'.  This is the amount of spread in the points on the graph.  High precision means the points hug the fitted line; low precision means the points are spread out.  The coefficient of determination (R²) tells you what proportion of the variation in one variable (the dependent) is accounted for or explained by the other (the independent).  This is often referred to as the 'goodness of fit'.  The statistic varies between 0 and 1.  Be careful interpreting a value of 0: it can mean no relationship, or that the relationship is curvilinear rather than linear.
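
The warning about interpreting 0 can be seen in a small sketch. With a single independent variable, R² is the square of Pearson's correlation coefficient r, which is what the code below computes. The function name and the data are hypothetical, picked so that one set lies exactly on a straight line and the other on a perfect curve.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]  # points exactly on a straight line
curved = [v ** 2 for v in x]     # points exactly on a U-shaped curve

r2_linear = pearson_r(x, linear) ** 2
r2_curved = pearson_r(x, curved) ** 2

print(f"R² for the straight line: {r2_linear:.2f}")
print(f"R² for the curve:         {r2_curved:.2f}")
```

The curved data are perfectly predictable from x, yet a straight-line R² reports no fit at all, which is why plotting the scatterplot first is worthwhile.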

Linear regression is used to model the relationship between two continuous variables.  T-tests are used to compare the means of two groups, including before-and-after comparisons on the same subjects.
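
A simple linear regression fits a straight line through the scatterplot by least squares. The sketch below is one minimal way to do this by hand; the function name and the age and income figures are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: age in years and income in thousands.
age = [20, 30, 40, 50, 60]
income = [18, 25, 31, 38, 44]

slope, intercept = fit_line(age, income)
predicted_at_45 = intercept + slope * 45

print(f"income = {intercept:.2f} + {slope:.2f} * age")
print(f"predicted income at age 45: {predicted_at_45:.2f}")
```

The slope is read as the change in the dependent variable for a one-unit change in the independent variable.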





Multivariate Statistics

This looks at the relationship between three or more variables.  You are likely to explore this in multiple, logistic and loglinear regression models.  Analysis of variance (ANOVA) generalises the t-test beyond two means to test whether three or more population means are equal.
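
To give a feel for what ANOVA calculates, here is a minimal sketch of the one-way F statistic, which compares the variation between group means with the variation within groups. The function name and the data are hypothetical, and the p-value step (which needs the F-distribution) is left to a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic and (between, within) degrees of freedom for independent groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical scores for three independent groups.
f_three, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"three groups: F = {f_three:.2f} on {df_b} and {df_w} df")

# With only two groups, F equals the square of the pooled-variance t statistic.
f_two, _, _ = one_way_anova_f([[5, 6, 7, 8, 9], [1, 2, 3, 4, 5]])
print(f"two groups:   F = {f_two:.2f}")
```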
 
Often an Elaboration Model is sought.   Under this,  the researcher explores what happens to the relationship between two variables, when a third variable is held constant as a control. This examination focuses on the "partial relationships" found in each of the subsets created by the control variable.  Five main outcomes have been identified:
  1. Replication - the original relationship is reproduced (stays the same) in each partial relationship.
  2. Specification - the original relationship is replicated in one partial relationship but not in the others.
  3. Interpretation - the control variable comes after the independent variable (it intervenes between the independent and dependent variables) and helps to interpret when and how the relationship occurs.
  4. Explanation - the original relationship disappears or is much weaker in the partial tables once the control variable is introduced.  Usually this means the control variable comes before the independent variable.
  5. Suppressor - the bivariate contingency table suggested independence but a relationship appears in one or more of the partials.
You may also come across the above model being described in multiple regression models as the following categories:
  • Confirmation of original relationship
  • A spurious or an intervening relationship (the relationship is weaker in the partials or has disappeared altogether).  This may mean that the original independent variable affects the control variable, which in turn affects the dependent variable.  Or it could mean that the control variable is a determinant of both the independent and dependent variables.
  • An interaction (the original relationship is stronger in some partials than others).  This indicates that specific values of the control variable enhance the relationship between original independent and dependent variables, while others attenuate or suppress it. 

Choosing a statistical test

Difference between conditions

1 variable 2 conditions:
  • Independent subjects - use independent t-test (parametric) or Mann-Whitney (non-parametric)
  • Related subjects - use related t-test (parametric) or Wilcoxon (non-parametric)
1 variable > 2 conditions:
  • Independent measures  - use one factor independent measures ANOVA (parametric) or Kruskal-Wallis (non-parametric)
  • Repeated measures - use one factor repeated measures ANOVA (parametric) or Friedman (non-parametric; FYI uses chi-square on the ranks)
2 variables:
  • Independent measures on both variables, use two-factor independent measures ANOVA
  • One independent and one repeated measures factor, use two factor mixed measures ANOVA
  • Repeated measures on both variables, use two factor repeated measures ANOVA

Correlation

Two variables: use Pearson's r (parametric) or Spearman's r (non-parametric)

More than two variables:
Multiple correlation R 
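
As a hedged illustration of the parametric versus non-parametric choice for two variables, the sketch below computes Spearman's coefficient as Pearson's r applied to ranks (tied values share the average rank). The function names and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_r(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y always rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]

print(f"Pearson's r  = {pearson_r(x, y):.3f}")
print(f"Spearman's r = {spearman_r(x, y):.3f}")
```

Because Spearman's coefficient depends only on the ordering of the values, it is less affected by the extreme final value than Pearson's r.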


Compare frequency counts

Nominal/categorical variables: use chi-square (p<0.01)