Data preparation takes up the vast majority of the time spent on the analytical part of a project. During this stage you should double-check your data: quality assurance and quality control are crucial. Have the data been input correctly? Did anything happen during the collection phase that may affect data quality? Make a note of anything suspicious, and check the data formats and variable units. Is it all consistent?

This is where all the processing of the data is done. We are not extracting any value from the data just yet; we are simply getting it into good condition, ready for analysis, by performing various data checks, such as logic and range checks.

It is good practice to create data summaries, look for impossible or implausible values, and visualise the data (e.g. to spot outliers). Create metadata and documentation too, and ideally process your data in a scripted way, so that all the steps you have undertaken and all the manipulations you have applied to the data are documented. Version control is important and can be an invaluable help in keeping track of any changes made.
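As a concrete illustration, here is a minimal sketch in Python with pandas, assuming a hypothetical file `study_data.csv` with made-up columns such as `participant_id` and `head_circumference_cm`; the exact summaries and plots will depend on your own data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("study_data.csv")

# Structural summary: column types and non-null counts reveal
# mis-typed columns and unexpected missingness.
df.info()

# Statistical summary: implausible minima/maxima stand out here.
print(df.describe())

# A quick visual check for outliers in one variable.
df["head_circumference_cm"].plot(kind="box")
plt.show()
```

Because the whole sequence lives in a script, every check is documented and can be re-run whenever the data change.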

There is a variety of terms used to describe the processes involved in data preparation. Many people refer to the earliest stage as preprocessing: the process of turning the raw data into a clean, structured dataset. The usual steps include cleaning (cleansing), transformation, integration and reduction. (Integration means combining data from multiple sources into one dataset ready for analysis; keep in mind that these datasets have to be compatible with each other.) It is here that missing, inaccurate and inconsistent data are dealt with. This process should only need to be performed once.
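For the integration step in particular, a merge on a shared key is the typical pattern. Below is a hedged sketch, again in pandas, with hypothetical file names and a made-up `participant_id` key; the point is that the keys (and their types and units) must agree across sources before combining.

```python
import pandas as pd

# Hypothetical sources: one row per participant, plus repeated measurements.
demographics = pd.read_csv("demographics.csv")
measurements = pd.read_csv("measurements.csv")

# Make the join keys compatible before merging (a common integration snag).
demographics["participant_id"] = demographics["participant_id"].astype(str)
measurements["participant_id"] = measurements["participant_id"].astype(str)

# indicator=True records, for each row, which source(s) it came from.
combined = measurements.merge(
    demographics, on="participant_id", how="left", indicator=True
)

# Measurements with no matching participant record need investigating.
unmatched = combined[combined["_merge"] == "left_only"]
print(f"{len(unmatched)} measurement rows have no demographics match")
```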

In contrast, data wrangling is carried out later, often alongside the analysis, and is the process of shaping the dataset into a format that works well for a particular part of the analysis. This may involve data extraction, filtering and grouping, or focusing on a specific level of accuracy, and so on.
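A minimal wrangling sketch, assuming the `combined` dataset from the previous example along with hypothetical `site` and `age_months` columns:

```python
import pandas as pd

# Extract and filter: restrict to one site and to infants under 12 months.
infants = combined[(combined["site"] == "A") & (combined["age_months"] < 12)]

# Group into age bands and summarise, at the level of detail this
# particular part of the analysis needs.
age_band = pd.cut(infants["age_months"], bins=[0, 3, 6, 12])
summary = (
    infants.groupby(age_band)["head_circumference_cm"]
           .agg(["count", "mean", "std"])
)
print(summary)
```

Unlike preprocessing, steps like these may be repeated many times, each tailored to a different analytical question.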

Sometimes, despite our best efforts, further problems are identified later on or during the analysis – this is why data maintenance is important.

Data maintenance is the process of continual improvement and regular checks, carrying out ongoing correction and verification. An example would be querying the data with the originator (e.g. a research nurse on a paediatric ward) using range and/or logic checks. A logic check might be that a participant cannot be prescribed the intervention before they have been consented into the study; a range check might be that the head circumference of an infant aged 2 months and 2 weeks should fall between 35 and 45 cm (covering the 3rd – 97th percentiles), so a value of 2 cm would be a concern.
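Checks like these translate directly into code. Here is a minimal sketch, reusing the hypothetical columns from earlier and the 35 – 45 cm range quoted above; in practice the plausible range would be looked up per age rather than hard-coded.

```python
import pandas as pd

# Hypothetical dataset with consent and prescription dates per participant.
df = pd.read_csv(
    "study_data.csv",
    parse_dates=["consent_date", "prescription_date"],
)

# Logic check: the intervention cannot be prescribed before consent.
logic_fail = df[df["prescription_date"] < df["consent_date"]]

# Range check: flag head circumferences outside the plausible range
# for this age group (35-45 cm covers the 3rd-97th percentiles).
range_fail = df[~df["head_circumference_cm"].between(35, 45)]

# Flagged rows become queries back to the data originator,
# not silent corrections.
print(logic_fail[["participant_id", "consent_date", "prescription_date"]])
print(range_fail[["participant_id", "head_circumference_cm"]])
```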

Particularly relevant keywords:

  • data processing
  • data curation
  • data integrity
  • data visualisation
  • reproducibility
  • metadata
  • data stewardship
  • data management