Instaskills, n,
one paged quick learning so you can hit the ground running while on the job
More Statistical Understanding
Following on from the understanding basic statistics page, this page provides some information on calculating sample sizes, different types of regression models and other multivariate analysis. This may be of use if you're doing surveillance, epidemiology or other modelling for health intelligence.
Multivariate analysis encompasses a variety of statistical methods used to analyse measurements on two or more variables. Regression analysis is a major subset of multivariate analysis that includes methods for predicting values of one or more response variables from one or more predictor variables.
A model is a description of a relationship connecting the variables of interest. It becomes a statistical model when it is fitted to sample data with the aim of generalising beyond the sample to the underlying population from which the sample was drawn.
You will come across the terms 'Bayesian' and 'Frequentist'. Bayesian methods make statements about the relative evidence for parameter values given a dataset. Frequentist methods compare the relative chance of datasets given a parameter value. Bayesian statistics starts from what has been observed and assesses possible future outcomes. Frequentist (or classical) statistics starts with an abstract experiment of what would be observed if one assumes something, and only then compares the outcomes of the abstract experiment with what was actually observed. The key difference for me is that Bayesians say we have prior information about the outcome and use this information in their modelling. To illustrate, if you lose your car keys, a frequentist will use a model to determine the likelihood of where you lost them and infer which area you should search. A Bayesian will note the places you've been since last seeing your car keys and use this information to adapt the model and limit the areas where you should search.
Calculating Sample Size
When calculating sample sizes for randomised controlled trials (RCTs) or observational studies (such as case-control studies), you need to account for the power and the type of outcome being measured.
For a continuous outcome (e.g. as compared by a t-test) you will need:
- the mean difference between treatment groups which you would like to be able to detect
- an estimate of the standard deviation within either group
For a binary outcome (e.g. good versus poor outcome) you will need:
- the expected proportion with a good outcome under the experimental treatment
- the expected proportion with a good outcome under the control treatment
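As a rough illustration of how the continuous-outcome inputs above (mean difference and standard deviation) turn into a sample size, here is a minimal sketch using the usual normal-approximation formula for comparing two means. The function name, the default alpha and power, and the example numbers are assumptions for illustration only. A formal calculation based on the t-distribution would give a slightly larger answer.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_continuous(mean_diff, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two means.

    Uses n = 2 * sd^2 * (z_(1-alpha/2) + z_power)^2 / mean_diff^2,
    assuming equal group sizes and the same standard deviation in both groups.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # value corresponding to the desired power
    n = 2 * sd ** 2 * (z_alpha + z_power) ** 2 / mean_diff ** 2
    return ceil(n)  # round up to a whole participant


# Hypothetical example: detect a 5-point difference when the SD is 10
print(n_per_group_continuous(mean_diff=5, sd=10))
```

With these made-up inputs the sketch gives about 63 participants per group. Halving the difference you want to detect roughly quadruples the required sample size.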
Sample Size for RCTs
- Difference in response rates (e.g. if the risk ratio is 0.88, then the risk of the outcome in the intervention group will be 88% of the risk in the control group: 88% of 30% = 26.4%)
- Response rate in one group
- Level of statistical significance (alpha) - usually 0.05 (5%)
- Power desired (1-beta) - usually at least 80%
- Whether the test is 1 sided or 2 sided
- Ratio of sample sizes in the 2 treatment groups - usually 1:1
- For clusters (in cluster RCTs), you'll need the cluster size and an estimate of the intra-cluster correlation coefficient (ICC).
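The inputs listed above can be combined as in the following sketch. It uses a common normal-approximation formula for comparing two proportions with 1:1 allocation, and inflates the result for clustering using the standard design effect, 1 + (cluster size - 1) x ICC. The function names, the cluster size of 20 and the ICC of 0.05 are hypothetical values chosen for illustration. The 30% control risk and risk ratio of 0.88 come from the example in the list.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_binary(p_control, p_intervention, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two proportions."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Sum of the binomial variances in the two groups
    variance = p_control * (1 - p_control) + p_intervention * (1 - p_intervention)
    n = z_sum ** 2 * variance / (p_control - p_intervention) ** 2
    return ceil(n)


def design_effect(cluster_size, icc):
    """Inflation factor for a cluster RCT with equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


p_control = 0.30
p_intervention = 0.88 * p_control  # risk ratio of 0.88 gives 26.4%

n_individual = n_per_group_binary(p_control, p_intervention)
n_cluster = ceil(n_individual * design_effect(cluster_size=20, icc=0.05))

print("Per group, individually randomised:", n_individual)
print("Per group, cluster randomised:", n_cluster)
```

Even a small ICC can inflate the required sample size substantially when clusters are large, which is why the cluster size and ICC are needed up front.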
Sample Size for Observational Studies
- Size of effect to be detected.
- Statistical significance level.
- Power of study (usually 0.8 or 0.9)
- Ratio of one group to the other (exposed versus unexposed; cases versus controls).
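To show how the four inputs above fit together, here is a minimal sketch for an unmatched case-control study. It uses a commonly used normal-approximation formula based on a pooled exposure proportion. The function name, the parameter names and the example values (20% of controls exposed, odds ratio of 2) are assumptions for illustration. Dedicated sample size software may apply corrections that give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist


def n_case_control(p_exposed_controls, odds_ratio, ratio=1.0, alpha=0.05, power=0.80):
    """Approximate numbers of cases and controls for an unmatched case-control study.

    p_exposed_controls: expected proportion of controls who are exposed
    odds_ratio: size of effect to be detected
    ratio: number of controls per case
    """
    p0 = p_exposed_controls
    # Proportion of cases exposed implied by the odds ratio
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    # Exposure proportion pooled across cases and controls, weighted by group size
    p_bar = (p1 + ratio * p0) / (1 + ratio)
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n_cases = (ratio + 1) / ratio * p_bar * (1 - p_bar) * z_sum ** 2 / (p1 - p0) ** 2
    n_cases = ceil(n_cases)
    return n_cases, ceil(ratio * n_cases)


# Hypothetical example: 20% of controls exposed, aiming to detect an odds ratio of 2
print(n_case_control(0.20, 2.0, ratio=1))  # one control per case
print(n_case_control(0.20, 2.0, ratio=2))  # two controls per case
```

Recruiting more controls per case reduces the number of cases needed, which can help when cases are rare, although the total sample size goes up.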
Types of Regression Models
Generalised (Least Squares) Linear Model or bivariate linear regression - the simplest form is bivariate linear regression, involving a straight line relationship between one response (dependent or regressand) variable and one predictor (independent, explanatory or regressor) variable.
Multiple regression - this extends the bivariate linear regression to include more than one predictor variable.
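As a small illustration of the bivariate case, the sketch below fits a straight line by ordinary least squares using only the standard library. The function name and the data are made up for illustration. In practice you would normally use a statistics package, which also handles multiple regression by estimating one coefficient per predictor.

```python
def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: predictor x and response y
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

The slope is the estimated change in the response for a one-unit increase in the predictor, and the intercept is the predicted response when the predictor is zero.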
Linear Models for Categorical Data - categorical data require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model. Types of linear models:
- Logit/logistic/multinomial logistic - used when the response variable is binary (logit/logistic) or has more than two categories (multinomial logistic), rather than continuous. Models how the logarithm of the odds of having a particular characteristic varies with the values of the predictor variables.
- Loglinear - this is used when we only have group level data and the data take the form of a contingency table. The dependent variable is the number of cases in a cell of a table.
- Life Tables - this is a statistical presentation (table or spreadsheet) of the life history of a cohort, commencing with the starting event, as the cohort is progressively thinned out over time by failures (i.e. terminating events). A life table is a basic building block for hazards models. There are two ways of calculating a life table: (1) the actuarial method and (2) the product-limit method (used in the estimation of hazard models and also known as the Kaplan-Meier life table; the survival function is calculated at each unique failure time). The actuarial method is often used by life insurance companies, as it can show the probability of a person of a certain age dying before their next birthday (these are often called mortality tables). These tables give the remaining life expectancy for people at different ages and the probability of surviving a particular year of age. Actuarial life tables are computed separately for men and women as they have different mortality rates.
- Cox Proportional Hazards Regression - this could be viewed as a multivariate life table where the hazard is a function of time and other specified predictor variables, such as residence and education.
- Kaplan-Meier Curve - this is a visual representation that shows the estimated probability of surviving (i.e. not yet having experienced the event) beyond each point in time.
- Discrete Time Model - used to study the patterns and correlates of the occurrences of events (marriages, deaths, becoming unemployed etc).
- Hierarchical linear modelling
- Random coefficients modelling (RC)
- Covariance components models
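The product-limit (Kaplan-Meier) life table mentioned in the list above can be sketched in a few lines. At each unique failure time, the running survival probability is multiplied by the proportion of those still at risk who did not fail. The function name and the follow-up data are hypothetical. Observations that end without a failure are treated as censored, and anyone censored at the same time as a failure is counted as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each unique failure time.

    times: follow-up time for each person
    events: 1 if the failure was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Group everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures > 0:
            survival *= 1 - failures / n_at_risk
            curve.append((t, survival))
        # Failures and censored observations both leave the risk set
        n_at_risk -= leaving
    return curve


# Made-up follow-up times (e.g. months) and event indicators
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(times, events):
    print(f"time {t}: estimated survival {s:.2f}")
```

Plotting these points as a step function gives the Kaplan-Meier curve.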
Discriminant Function Analysis or Latent Class Analysis - related to logistic regression
- Cluster Analysis - aim is the detection of patterns or indications of potentially interesting relationships in the data. Only when some pattern is thought to exist can the further steps be taken of setting up models and hypotheses for future investigation. The results are produced in the form of a graph or some other type of visual display.
- Factor Analysis & Covariance Structure Models (path analysis/LISREL models) - two methods of testing latent models. Latent variables are often theoretical concepts such as intelligence, which cannot be directly measured or cannot be measured without error. We have to make measurements using variables that are assumed to be indicators of the concepts that we are interested in. Factor analysis is a regression model for the observed variables on the unobserved latent variables or factors. There are two types: Exploratory Factor Analysis, where the detailed model relating the latent variables to the observed variables is not determined before the analysis, and Confirmatory Factor Analysis, where the number of latent variables is set by the analyst.
- Structural Equation Modelling - these look at tentative causal relations between a set of latent dependent and latent independent variables.
Causality
- Path Analysis - can be seen as an extension of the ordinary regression model. It analyses how a predictor variable affects the response variable not only directly but also indirectly through one or more intervening variables. The first step is to portray it in a diagram with arrows indicating the direction of causality.
- Graphical Chain Modelling - used to understand the causal structure underlying the dependence among variables. Variables are grouped into response, intermediate and explanatory variables. Intermediate variables can be treated as responses to some variables and as explanatory for others. Arrows point from explanatory variables to response variables.
It is worth noting that to establish causality, you need to be able to show that X came before Y (temporal priority of the independent variable), that the observed relationship between X and Y didn't happen by chance alone (empirical association), and that there is nothing else that accounts for the X -> Y relationship (non-spurious). In epidemiology, people use the Bradford Hill Criteria to show causality. These consist of:
- Strength
- Consistency
- Specificity
- Temporality
- Biological gradient
- Coherence
- Experimental evidence
- Analogy
- Plausibility