You have received the desired datasets.
What should you do next?

Initial steps for harmonisation

The aim of harmonisation (also referred to as data "cleaning") is to create a single usable dataset that combines the data from every study that provided them. The labels/values (for categorical/binary data) and scales (for continuous data) should be standardised across all studies.

Check all data you receive
  • Check that you have received all variables you expected to receive
  • Check that all variables are labelled appropriately. All categorical variables should have clear labels for each category (e.g. sex: 1 = male, 2 = female), and units for continuous variables should be clearly defined (e.g. pounds versus kilograms for weight)
  • Check for missing data
  • Check the data against published findings, including the number of participants, baseline characteristics and outcome measures
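
Some of the checks above can be partly automated. The following is a minimal R sketch, assuming the received dataset has been loaded as a data frame called ReceivedData. The example data, the list of expected variables and the published number of participants are all hypothetical values chosen for illustration.

```r
# Hypothetical received dataset, for illustration only
ReceivedData <- data.frame(
  PatID = 1:6,
  Age   = c(34, 51, NA, 47, 62, 29),
  Sex   = c(1, 2, 2, 1, NA, 2),
  Group = c(1, 2, 1, 2, 1, 2)
)

# 1. Have all expected variables been received? (hypothetical list)
expected_vars <- c("PatID", "Age", "Sex", "Group", "Duration")
missing_vars  <- setdiff(expected_vars, names(ReceivedData))
missing_vars

# 2. How many values are missing in each variable?
n_missing <- colSums(is.na(ReceivedData))
n_missing

# 3. Does the number of participants match the publication?
published_n <- 6  # hypothetical value taken from the trial report
n_matches   <- nrow(ReceivedData) == published_n
n_matches

# 4. Summaries to compare by eye with the published baseline table
summary(ReceivedData$Age)
table(ReceivedData$Sex, useNA = "ifany")
```

Any variable listed in missing_vars, or any summary that disagrees with the publication, is a query to raise with the data providers.
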
What to do if your checks identify problems?
  • Contact the authors who provided the data and politely ask them to help you resolve your queries about their data
  • If you receive more data, check these data in the same way
Decisions regarding harmonisation
  • Decisions regarding harmonisation (for instance, whether a study’s definition of a variable is “close enough” to the IPD project’s desired definition) should be made using clinical, statistical and other relevant expertise

On this page, we present an example master codebook for the IPD. The codebook records information about the data, including the variable names, definitions, values and labels. It helps all users of the IPD to understand what the data contain, including the measures, scales and labels. Every study that contributes to the IPD project will need to be mapped to the codebook (see Preparing data tab).

Some variables are coded at the study level (all participants from the same study receive the same value), whereas others are coded at the participant level (different participants from the same study may have different values).
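
The distinction can be illustrated with a short R sketch that adds study-level variables to two datasets and stacks them. The two trials, their values and their names are hypothetical, and the sketch assumes both datasets have already been renamed and recoded to match the codebook.

```r
# Two hypothetical received datasets, already renamed and recoded
trial_a <- data.frame(PatID = 1:3, Age = c(34, 51, 47), trtgrp = c(0, 1, 1))
trial_b <- data.frame(PatID = 1:2, Age = c(62, 29),     trtgrp = c(1, 0))

# Study-level variables: one value repeated for every participant in a trial
trial_a$TrialID   <- 1
trial_a$TrialName <- "Trial A"
trial_b$TrialID   <- 2
trial_b$TrialName <- "Trial B"

# Stack the trials; rbind() needs the same column names in each dataset
ipd <- rbind(trial_a, trial_b)

# PatID values repeat across trials, so a participant is identified
# by the combination of TrialID and PatID
patid_repeats <- any(duplicated(ipd$PatID))
pair_repeats  <- any(duplicated(ipd[, c("TrialID", "PatID")]))
patid_repeats
pair_repeats
```

Age and trtgrp differ between participants within a trial, whereas TrialID and TrialName are constant within a trial and preserve the clustering of participants within studies.
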

Example of a master codebook for an IPD meta-analysis, with annotated notes:

| Variable | Variable description | Labels/values | Notes regarding variables in the codebook |
|---|---|---|---|
| TrialID | This variable uniquely identifies the trial that the observation originates from. | Continuous identifier | Study-level variable. This variable is important to uniquely identify the datasets within the IPD. This maintains the clustering in the data. |
| TrialName | The name of the trial the data originates from. | String/free text | Study-level variable. This will help you to identify which trial the data originates from. Alternatives to trial name are author name and year of publication. |
| PatID | Patient identifier; uniquely identifies an individual in a trial. | Continuous identifier | Participant-level variable. Each participant should be uniquely identified in the data. |
| Age | Age of participant in years. | Min = 0; Missing = . | Participant-level variable. |
| Sex | Participant’s gender. | 1 = Male; 2 = Female | Participant-level variable. |
| Trtgrp | Treatment group participant was randomised to. | 0 = Control group; 1 = Treatment group | Participant-level variable. |
| Duration | The duration of symptoms in years experienced by the participant. | Continuous; Min = 0; Missing = . | Participant-level variable. |
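
Once the codebook exists, harmonised data can be checked against it. The sketch below is one possible approach in R, not a prescribed method: the codebook is written as a list, and a hypothetical helper function, check_variable(), flags values that fall outside the permitted labels or below the minimum. The example data are invented and contain two deliberate errors, and treating Age as continuous is an assumption made here. Missing values (shown as "." in the codebook) are represented as NA in R.

```r
# Part of the codebook written as an R list (illustrative only)
codebook <- list(
  Sex      = list(type = "categorical", values = c(1, 2)),
  Trtgrp   = list(type = "categorical", values = c(0, 1)),
  Age      = list(type = "continuous",  min = 0),
  Duration = list(type = "continuous",  min = 0)
)

# Hypothetical harmonised data containing two deliberate errors
ipd <- data.frame(
  Sex      = c(1, 2, 3, NA),
  Trtgrp   = c(0, 1, 1, 0),
  Age      = c(34, 51, 47, 62),
  Duration = c(2.5, -1, 0.5, NA)
)

# Hypothetical helper: return the row numbers of non-missing values
# that break the codebook rule for one variable
check_variable <- function(x, rule) {
  if (rule$type == "categorical") {
    bad <- !is.na(x) & !(x %in% rule$values)
  } else {
    bad <- !is.na(x) & x < rule$min
  }
  which(bad)
}

# Apply the check to every variable named in the codebook
problems <- lapply(names(codebook), function(v) {
  check_variable(ipd[[v]], codebook[[v]])
})
names(problems) <- names(codebook)
problems
```

Here the check flags row 3 for Sex (value 3 has no label) and row 2 for Duration (a negative duration), while missing values are left alone.
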

For further examples, see:

  • Hee, S.W., Dritsaki, M., Willis, A., Underwood, M. and Patel, S. Development of a repository of individual participant data from randomized controlled trials of therapists delivered interventions for low back pain. European Journal of Pain 2017; 21(5): 815-826.
  • Thombs, B.D., Benedetti, A., Kloda, L.A. et al. The diagnostic accuracy of the Patient Health Questionnaire-2 (PHQ-2), Patient Health Questionnaire-8 (PHQ-8), and Patient Health Questionnaire-9 (PHQ-9) for detecting major depression: protocol for a systematic review and individual patient data meta-analyses. Systematic Reviews 2014; 3: 124.

There are many suitable ways to recode and prepare the data. On this page, we provide a few examples.

Convert data

The data you receive will often arrive in different file formats. You will need to convert them into the appropriate file format for the statistical software you will use (e.g. Stata (.dta) or R (.RData)). You should also convert all datasets into the same shape (wide or long format).
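
How files are read depends on the software; in R, for example, the haven package provides read_dta() for Stata files. Reshaping can be done in base R. The sketch below converts a hypothetical wide dataset (one row per participant) to long format (one row per participant per time point); the variable names score_0, score_12 and month are invented for illustration.

```r
# Hypothetical dataset received in wide format: one row per participant,
# with the outcome at baseline and at 12 months in separate columns
wide <- data.frame(
  PatID    = 1:3,
  score_0  = c(10, 14, 8),
  score_12 = c(6, 11, 8)
)

# Reshape to long format: one row per participant per time point
long <- reshape(
  wide,
  direction = "long",
  varying   = c("score_0", "score_12"),  # columns to stack
  v.names   = "score",                   # name of the stacked column
  timevar   = "month",                   # new column holding the time point
  times     = c(0, 12),                  # time value for each stacked column
  idvar     = "PatID"
)

# Sort by participant and time, and tidy the row names
long <- long[order(long$PatID, long$month), ]
rownames(long) <- NULL
long
```

Whichever shape you choose, apply it to every received dataset before combining them.
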

Rename variables

The datasets you receive will often name their variables differently from the names used in your IPD project.
For example, your master codebook might have a variable called “trtgrp”, which aims to identify the treatment group. In a dataset received, however, the variable for identifying the treatment group might have been called “Group” instead.

In Stata, you can rename the variable in your received dataset using the following code:

rename Group trtgrp

In R, you can use the following code:

colnames(ReceivedData)[colnames(ReceivedData)=="Group"] <- "trtgrp"
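
When several variables need renaming, a lookup of old and new names can be easier to review than a series of single renames. The following R sketch assumes a hypothetical received dataset; the original names (id, Group, age_y) and the lookup name_map are invented for illustration.

```r
# Hypothetical received dataset with the trial's own variable names
ReceivedData <- data.frame(
  id    = 1:3,
  Group = c(1, 2, 1),
  age_y = c(34, 51, 47)
)

# Lookup: names in the received data -> names in the master codebook
name_map <- c(id = "PatID", Group = "trtgrp", age_y = "Age")

# Rename only the columns that appear in the lookup; leave the rest alone
in_map <- names(ReceivedData) %in% names(name_map)
names(ReceivedData)[in_map] <- unname(name_map[names(ReceivedData)[in_map]])
names(ReceivedData)
```

Keeping one lookup per study also documents how each received dataset was mapped to the codebook.
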

Recoding categorical/binary data

Data will be recorded in different ways across the different datasets you have received. In addition to renaming variables, you will also need to standardise the variable labels and values across all received datasets.

For example, your master codebook might define the treatment group variable (trtgrp) to be coded 1 for the treatment group and 0 for the control group. In a dataset received, however, the relevant variable might have been coded as 2 for the treatment group and 1 for the control group.

In Stata, you can recode your variable using the following code:

recode trtgrp (2=1) (1=0)
label define trt 1 "Treatment group" 0 "Control group"
label values trtgrp trt

In R, you can use the following code:

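# Recode 1 to 0 first; recoding 2 to 1 first would leave every participant coded 0
ReceivedData$trtgrp[ReceivedData$trtgrp == 1] <- 0
ReceivedData$trtgrp[ReceivedData$trtgrp == 2] <- 1
# Attach labels; stating the levels explicitly keeps the mapping correct even if one group is absent
ReceivedData$trtgrp <- factor(ReceivedData$trtgrp, levels = c(0, 1), labels = c("Control group", "Treatment group"))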
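
It is worth confirming that a recode did what was intended. One way, sketched below in R under the assumption that the original variable is kept alongside the recoded one, is to cross-tabulate the old values against the new ones. The data are hypothetical.

```r
# Hypothetical received coding: 1 = control, 2 = treatment
ReceivedData <- data.frame(Group = c(1, 2, 2, 1, NA, 2))

# Recode into a new variable so the original values remain available
ReceivedData$trtgrp <- NA_real_
ReceivedData$trtgrp[which(ReceivedData$Group == 1)] <- 0
ReceivedData$trtgrp[which(ReceivedData$Group == 2)] <- 1

# Cross-tabulate old against new values: every participant should fall
# in the expected cell, and missing values should stay missing
check <- table(
  original = ReceivedData$Group,
  recoded  = ReceivedData$trtgrp,
  useNA    = "ifany"
)
check
```

Counts should appear only where an original value meets its intended new value; any other non-zero cell signals a recoding error.
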

Standardising continuous variables

Continuous variables in the received data may be recorded in very different ways. For instance, different studies may have used different units (e.g. kilograms vs. pounds for weight) or different directions (a high score may indicate more severe symptoms in one study and less severe symptoms in another). It is important to align units and direction across all received datasets.

For example, your master codebook might have a variable for the duration of symptoms in years (called “Duration”). In a dataset received, however, the duration of symptoms might have been recorded in months.

In Stata, you can rescale your variable using the following code:

* Convert months to years
replace Duration = Duration / 12

In R, you can use the following code:

# Convert months to years
ReceivedData$Duration <- ReceivedData$Duration / 12
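
Units and direction can be handled in the same way. The R sketch below converts weight from pounds to kilograms and reverses a score so that its direction matches the codebook. The variables, the 0 to 10 score range and the assumption that the codebook defines higher scores as more severe are all hypothetical.

```r
# Hypothetical received data: weight in pounds, and a 0-10 symptom score
# on which HIGHER values mean LESS severe symptoms
ReceivedData <- data.frame(
  weight_lb = c(150, 200, NA),
  score     = c(0, 7, 10)
)

# Units: convert pounds to kilograms (1 lb = 0.45359237 kg)
ReceivedData$weight_kg <- ReceivedData$weight_lb * 0.45359237

# Direction: reverse the score so that higher values mean MORE severe,
# assuming the scale runs from 0 to 10
scale_min <- 0
scale_max <- 10
ReceivedData$score_rev <- scale_max + scale_min - ReceivedData$score

# The reversed score should still lie within the scale limits
range(ReceivedData$score_rev)
```

Reversal of this kind only aligns direction; whether two different instruments can be placed on a common scale remains a decision for clinical and statistical expertise.
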