More Statistical Understanding

Following on from the 'Understanding Basic Statistics' page, this page provides some information on calculating sample sizes, different types of regression models and other multivariate analysis. This may be of use if you're doing surveillance, epidemiology or other modelling for health intelligence.

Multivariate analysis encompasses a variety of statistical methods used to analyse measurements on two or more variables. Regression analysis is a major subset of multivariate analysis that includes methods for predicting values of one or more response variables from one or more predictor variables.

A model is a description of a relationship connecting the variables of interest. It becomes a statistical model when it is fitted to sample data with the aim of generalising beyond the sample to the underlying population from which the sample was drawn.

You will come across the terms 'Bayesian' and 'Frequentist'. Bayesian methods make statements about the relative evidence for parameter values given a dataset. Frequentist methods compare the relative chance of datasets given a parameter value. Bayesian statistics starts from what has been observed and assesses possible future outcomes. Frequentist (or classical) statistics starts with an abstract experiment of what would be observed if one assumes something, and only then compares the outcomes of the abstract experiment with what was actually observed. The key difference for me is that Bayesians say we have prior information about the outcome and use this information in their modelling. To illustrate, if you lose your car keys, a frequentist will use a model to determine the likelihood of where you lost them and infer which area you should search. A Bayesian will note the places you've been since last seeing your car keys and use this information to adapt the model and limit the areas where you should search.
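To make the car keys illustration concrete, here is a minimal sketch of a Bayesian update in Python. The locations, the prior probabilities and the likelihood values are all made-up numbers for illustration only; the point is simply that prior belief multiplied by the likelihood of the evidence, then rescaled, gives the updated (posterior) belief.

```python
def bayes_update(prior, likelihood):
    """Return the posterior probability of each location given the evidence."""
    # Bayes' rule: posterior is proportional to prior x likelihood.
    unnormalised = {loc: prior[loc] * likelihood[loc] for loc in prior}
    total = sum(unnormalised.values())
    # Rescale so the posterior probabilities sum to 1.
    return {loc: value / total for loc, value in unnormalised.items()}


# Hypothetical prior belief about where the keys are (sums to 1).
prior = {"kitchen": 0.5, "car": 0.3, "office": 0.2}

# Hypothetical evidence: you have not been to the office since you last saw
# the keys, so that evidence is impossible if the keys were left there.
likelihood = {"kitchen": 1.0, "car": 1.0, "office": 0.0}

posterior = bayes_update(prior, likelihood)
for location, probability in posterior.items():
    print(f"{location}: {probability:.3f}")
```

The office drops out of the search and the remaining belief is shared between the kitchen and the car in proportion to the prior.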

Calculating Sample Size

When calculating sample sizes for randomised controlled trials (RCTs) or observational studies (e.g. case-control studies), you need to account for the power required and the type of outcome being measured.

For a continuous outcome (e.g. compared using a t-test) you will need:

mean difference between treatment groups which you would like to be able to detect
an estimate of the standard deviation within either group
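As a rough sketch of how these two inputs are used, the Python below applies the usual normal-approximation formula for comparing two means with equal group sizes: n per group = 2 x (z_alpha + z_beta)^2 x SD^2 / difference^2. The function name and the example values (a difference of 5 units, SD of 10) are illustrative assumptions, and dedicated sample size software based on the t-distribution will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_continuous(mean_diff, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # value corresponding to the desired power
    n = 2 * ((z_alpha + z_beta) * sd / mean_diff) ** 2
    return ceil(n)                                 # round up to whole participants


# Hypothetical example: detect a difference of 5 units when the SD is 10.
print(n_per_group_continuous(mean_diff=5, sd=10))
```

Halving the difference you want to detect roughly quadruples the sample size needed, which is why the choice of a realistic difference matters so much.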

For a binary outcome (e.g. compared using a chi-squared test) you will need:

expected proportion with a good outcome under the experimental treatment
expected proportion with a good outcome under the control treatment

It's worth noting that inferential statistics should only be used for studies that utilise random sampling. You will see studies based on convenience samples (e.g. surveys of students on site in a library or recruited from web platforms) reporting significance tests and confidence intervals. For these convenience samples, the summary statistics only tell us about the sample itself. To make inferences about the wider parent population (from which the sample is drawn), you need a random sample: the sample mean then becomes an unbiased estimate of the unknown population mean, and the sample standard deviation divided by the square root of the sample size gives the estimated standard error of the mean.

While statistical inference presupposes random sampling, even a random sample does not exactly reflect the properties of the parent population. This is called random sampling error. It decreases as the sample size increases and disappears completely when we can study the full population (a census). You can still have non-random errors, which can cause validity problems even when you have data on the full population, so it's important that your measures are valid and reliable.

Self-selection bias is another thing to be aware of: people can freely decide whether or not to participate in a study, so participants could end up being systematically different from those who didn't participate. Self-selection bias can be addressed by using 'missing data' analyses.
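To show how random sampling error relates to sample size, here is a small Python sketch that estimates the standard error of the mean as the sample standard deviation divided by the square root of the sample size, and builds an approximate 95% confidence interval from it. The data values are made up, and the multiplier of 1.96 is a large-sample approximation (for small samples a t-distribution multiplier would be more appropriate).

```python
from math import sqrt
from statistics import mean, stdev


def standard_error(sample):
    """Estimated standard error of the mean: sample SD / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


def approx_ci_95(sample):
    """Approximate 95% confidence interval for the population mean."""
    centre = mean(sample)
    margin = 1.96 * standard_error(sample)  # large-sample normal approximation
    return centre - margin, centre + margin


# Hypothetical random sample of five measurements.
sample = [4, 8, 6, 5, 7]
print(standard_error(sample))
print(approx_ci_95(sample))
```

Because the sample size sits under a square root, you need four times as many observations to halve the standard error.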

Sample Size for RCTs

Difference in response rates to be detected (e.g. if the risk ratio is 0.88, then the risk of the outcome in the intervention group will be 88% of the risk in the control group: 88% of 30% = 26.4%)
Response rate in 1 group
Level of statistical significance (alpha) - usually 0.05 (5%)
Power desired (1-beta) - usually at least 80%
1 sided or 2 sided test
Ratio of sample sizes in 2 treatment groups - usually 1:1
For clusters (in cluster RCTs), you'll need the cluster size and an estimate of the intra-cluster correlation coefficient (ICC).
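As a sketch of how the inputs listed above combine for an individually randomised trial with a binary outcome, the Python below uses a common normal-approximation formula for comparing two proportions with 1:1 allocation and a 2 sided test. The function name is an assumption for illustration, the example reuses the 30% control risk and 0.88 risk ratio from the list, and real trial planning software may apply continuity corrections that give somewhat larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_binary(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the binomial variances in the two groups.
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return ceil(n)


# Example from the text: control risk of 30% and a risk ratio of 0.88.
p_control = 0.30
p_treatment = 0.88 * p_control  # 26.4%
print(n_per_group_binary(p_control, p_treatment))
```

A small absolute difference (here 3.6 percentage points) needs a large trial, which is typical when the expected risk ratio is close to 1.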

Note on clusters: in cluster RCTs, people are allocated by clusters rather than individually (e.g. vaccine trials). Patients within the same cluster may be more similar to each other than patients from different clusters (think intra- and inter-cluster variation). This similarity is measured by the ICC, which has a value between 0 and 1: 0 means that 2 patients from the same cluster are no more similar than 2 patients from different clusters, and 1 means that 2 patients from the same cluster have identical outcomes. You need to allow for this similarity when you analyse your data, and it also affects the sample size required to achieve a given power. Calculate the sample size as for an individually randomised trial, then multiply it by the design effect, where design effect = 1 + [(k - 1) x ICC] and k = number of patients per cluster.
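The design effect calculation can be sketched in a few lines of Python. The function names and the example numbers (63 patients per arm before inflation, 10 patients per cluster, ICC of 0.02) are assumptions chosen purely for illustration.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Design effect = 1 + (k - 1) x ICC, where k is the number of patients per cluster."""
    return 1 + (cluster_size - 1) * icc


def inflate_for_clustering(n_individual, cluster_size, icc):
    """Multiply an individually randomised sample size by the design effect."""
    return ceil(n_individual * design_effect(cluster_size, icc))


# Hypothetical example: 63 patients per arm, clusters of 10, ICC of 0.02.
print(design_effect(10, 0.02))
print(inflate_for_clustering(63, 10, 0.02))
```

With an ICC of 0 the design effect is 1 and no inflation is needed; with an ICC of 1 it equals the cluster size, so each cluster effectively counts as a single patient.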

Sample Size for Observational Studies

Size of effect to be detected.
Statistical significance level.
Power of study (usually 0.8 or 0.9)
Ratio of one group to the other (exposed versus unexposed; cases versus controls).
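For a cohort-style comparison with a binary outcome, the four inputs above can be combined as in the Python sketch below, which allows unequal group sizes. It uses a normal-approximation formula for two proportions; the function name, the example risks (10% in the unexposed, 20% in the exposed) and the 2:1 ratio are illustrative assumptions, not values from any particular study.

```python
from math import ceil
from statistics import NormalDist


def n_exposed_unexposed(p_exposed, p_unexposed, ratio=1.0, alpha=0.05, power=0.80):
    """Approximate group sizes, where ratio = unexposed participants per exposed participant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # The unexposed group's variance contribution shrinks as that group gets larger.
    variance = p_exposed * (1 - p_exposed) + p_unexposed * (1 - p_unexposed) / ratio
    n_exposed = ceil((z_alpha + z_beta) ** 2 * variance / (p_exposed - p_unexposed) ** 2)
    n_unexposed = ceil(ratio * n_exposed)
    return n_exposed, n_unexposed


# Hypothetical example: risk of 20% in the exposed, 10% in the unexposed,
# with two unexposed participants recruited for every exposed participant.
print(n_exposed_unexposed(0.20, 0.10, ratio=2))
```

Recruiting more of the easier-to-find group reduces the number needed in the harder-to-find group, although the total sample size goes up.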

Main types of Statistical Analysis

There are six major types of statistical analysis.

Descriptive Statistical Analysis

This is the simplest form, using numbers to describe the qualities of a data set - e.g. mean, mode, median, frequencies, range, variation, standard deviation etc.
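As a quick illustration, most of these descriptive measures are available in Python's built-in statistics module. The data values below are made up.

```python
from statistics import mean, median, mode, stdev


def describe(values):
    """Return a few common descriptive statistics for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "standard deviation": stdev(values),  # sample standard deviation
    }


# Hypothetical data set.
data = [2, 4, 4, 5, 7, 9]
for name, value in describe(data).items():
    print(f"{name}: {value}")
```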

Inferential Statistical Analysis

This is used to make inferences or draw conclusions about a larger population based on the findings from a sample group.

Associational Statistical Analysis

This is used to make predictions and find causation. Can also be used to find relationships among multiple variables - e.g. correlation, regression, coefficients of variation.

Predictive Analysis

This uses statistical algorithms, computer simulation and machine learning tools to predict future events and behaviour based on new and historical data trends.

Exploratory Data Analysis

This is used to identify patterns and trends in a data set. Can also be used to determine relationships among samples in a population and find missing data points.

Causal Analysis

This is used to determine causation or why things happen the way they do. It can be used to uncover the underlying factors that led to an event.

Types of Regression Models

Generalised (Least Squares) Linear Model or bivariate linear regression - the simplest form is the bivariate linear regression, involving a straight line relationship between one response (dependent or regressand) variable and one predictor (independent, explanatory or regressor) variable.
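A minimal sketch of fitting a bivariate least squares line in plain Python is shown below. The function name and the data are made up for illustration; in practice you would normally use a statistics package, which also reports standard errors and goodness of fit.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least squares estimates of the intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = sum of cross-products / sum of squared deviations of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x.
predictor = [1, 2, 3, 4]
response = [3, 5, 7, 9]
intercept, slope = fit_line(predictor, response)
print(intercept, slope)
```

The slope is the expected change in the response variable for a one unit increase in the predictor variable.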

Multiple regression - this extends the bivariate linear regression to include more than one predictor variable.

Linear Models for Categorical Data - categorical data require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model. Types of linear models:

Logit/logistic/multinomial logistic - logit/logistic models are used when the response variable is binary (rather than continuous); multinomial logistic models extend this to a response with more than two categories. They model how the logarithm of the odds of having a particular characteristic varies with the values of the predictor variables.
Loglinear - this is used when we only have group level data and the data take the form of a contingency table. The dependent variable is the number of cases in a cell of a table.
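To illustrate what modelling 'the logarithm of the odds' means for a logit/logistic model, here is a short Python sketch that converts between probabilities and log odds. The intercept and slope in the example are hypothetical coefficients, not estimates from real data.

```python
from math import exp, log


def log_odds(probability):
    """Logit: the logarithm of the odds p / (1 - p)."""
    return log(probability / (1 - probability))


def predicted_probability(intercept, slope, x):
    """Turn a linear predictor on the log odds scale back into a probability."""
    linear_predictor = intercept + slope * x
    return 1 / (1 + exp(-linear_predictor))


# Hypothetical fitted model: log odds = -2.0 + 0.5 * x.
for x in (0, 2, 4, 6):
    print(x, round(predicted_probability(-2.0, 0.5, x), 3))
```

The model is linear on the log odds scale, so the predicted probability always stays between 0 and 1 however large or small the predictor gets.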

Survival and Event History Analysis - survival models are applied to data that specify the time elapsed until an event occurs. This concept of 'time elapsed' implies a starting event and a terminating event (e.g. birth and death, divorce and remarriage). Survival times are the observed times from the initiation of a process of interest (e.g. birth) to the occurrence of the event of interest (e.g. death). In practice, time is always measured in discrete units. When the discrete units are very small, time can be treated as if it were measured on a continuous scale. When they are larger (months, years or decades), it is more appropriate to use discrete time methods. An event consists of some qualitative change that occurs at a specific point in time; it does not refer to a gradual change. An event history is a longitudinal record of when events happened to a sample of individuals or collectives. It's worth noting the term 'censoring', which here means 'lost to observation'. Left censoring is when the event happens prior to the observational period, and right censoring is when the subject has not had the event by the time the observational period ends.
Life Tables - a life table is a statistical presentation (table or spreadsheet) of the life history of a cohort, commencing with the starting event, as the cohort is progressively thinned out over time by failures (i.e. terminating events). A life table is a basic building block for hazards models. There are two ways of calculating a life table: (1) the actuarial method and (2) the product-limit method (used in the estimation of hazard models and also known as the Kaplan-Meier life table; the survival function is calculated at each unique failure time). The actuarial method is often used by life insurance companies as it can show the probability of a person at a certain age dying before their next birthday (these tables are often called mortality tables). These statistics give the remaining life expectancy for people at different ages and stages and the probability of surviving a particular year of age. Actuarial life tables are computed separately for men and women as they have different mortality rates.
Cox Proportional Hazards Regression - this could be viewed as a multivariate life table where the hazard is a function of time and other specified predictor variables, such as residence and education.
Kaplan-Meier Curve - this is a visual representation of the survival function, showing the estimated probability of not yet having had the event at each point in time.
Discrete Time Model - used to study the patterns and correlates of the occurrences of events (marriages, deaths, becoming unemployed etc).
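The product-limit (Kaplan-Meier) calculation mentioned under life tables can be sketched in Python as below. The function name and the follow-up data are made up for illustration: each subject has a follow-up time and a flag that is 1 if the event was observed and 0 if the subject was right censored. Survival packages would normally also give confidence intervals, which this sketch leaves out.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function at each unique failure time."""
    survival = 1.0
    curve = []
    # Only times at which at least one event was observed change the estimate.
    failure_times = sorted({t for t, e in zip(times, events) if e})
    for t in failure_times:
        at_risk = sum(1 for u in times if u >= t)  # still under observation just before t
        failures = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1 - failures / at_risk         # conditional probability of surviving t
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times (e.g. months) and event flags (0 = right censored).
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Censored subjects count in the number at risk up to the time they are lost to observation, which is how the method uses their partial information rather than discarding them.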
Multilevel models - hierarchical regression analysis, designed to handle hierarchical and cluster data; looks at group effects on individuals when grouping is present. Types of models include:

Hierarchical linear modelling
Random coefficients modelling (RC)
Covariance components models

Multiple Classification Analysis (MCA) - related to multiple regression and is a technique for examining the interrelationship between several predictor variables and one dependent variable in the context of an additive model. Best explained as multiple regression with dummy variables! The response variable is quantitative and the predictor variables are categorical, represented by dummy variables.
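As a sketch of what representing a categorical predictor by dummy variables looks like, the Python below recodes a category into a set of 0/1 variables, leaving one category out as the reference. The function name, the 'residence' categories and the choice of reference category are assumptions for illustration.

```python
def dummy_code(values, reference):
    """Recode a categorical variable into 0/1 dummy variables, omitting the reference category."""
    levels = sorted(set(values) - {reference})
    return [
        {f"is_{level}": int(value == level) for level in levels}
        for value in values
    ]


# Hypothetical categorical predictor with three categories.
residence = ["urban", "rural", "suburban", "urban"]
for row in dummy_code(residence, reference="urban"):
    print(row)
```

A variable with three categories needs only two dummy variables: the reference category is the one where all the dummies are 0, and each coefficient is then interpreted relative to it.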

Discriminant Function Analysis or Latent Class Analysis - related to logistic regression
Cluster Analysis - aim is the detection of patterns or indications of potentially interesting relationships in the data. Only when some pattern is thought to exist can the further steps be taken of setting up models and hypotheses for future investigation. The results are produced in the form of a graph or some other type of visual display.
Factor Analysis & Covariance Structure Models (path analysis/LISREL models) - two methods of testing latent models. Latent variables are often theoretical concepts such as intelligence, which cannot be directly measured or cannot be measured without error. We have to make measurements using variables that are assumed to be indicators of the concepts that we are interested in. Factor analysis is a regression model for the observed variables on the unobserved latent variables or factors. There are two types: Exploratory Factor Analysis, where the detailed model relating the latent variables to the observed variables is not determined before the analysis, and Confirmatory Factor Analysis, where the number of latent variables is set by the analyst.
Structural Equation Modelling - these models look at tentative causal relations between a set of latent dependent and latent independent variables.

Causality

Path Analysis - can be seen as an extension of the ordinary regression model. It analyses how a predictor variable affects the response variable not only directly but also indirectly through one or more intervening variables. First step is to portray it in a diagram with arrows indicating direction of causality.
Graphical Chain Modelling - used to understand the causal structure underlying the dependence among variables. Variables are grouped into response, intermediate and explanatory variables. Intermediate variables can be treated as response to some variables and explanatory for others. Arrows point from explanatory variables to response variables.
Worth noting that to establish causality, you need to be able to show that X came before Y (temporal priority of the independent variable), that the observed relationship between X and Y didn't happen by chance alone (empirical association), and that there is nothing else that accounts for the X -> Y relationship (non-spuriousness). In epidemiology, people use the Bradford Hill Criteria to assess causality. These consist of:
- Strength
- Consistency
- Specificity
- Temporality
- Biological gradient
- Coherence
- Experimental evidence
- Analogy
- Plausibility