Prepare & process the data

You have accessed and collected all the relevant data. With the data at hand, you are now ready to start thinking about the data analysis and interpretation. However, before diving into the analysis, it is crucial to obtain an overview of the collected data and prepare it by cleaning and pre-processing.

What to expect

Pre-processing data is essential for an accurate and insightful data analysis. How well you clean and transform the data has a huge impact on the quality of the result, since incorrect and inconsistent data leads to false conclusions. The steps and techniques for data cleaning will vary from dataset to dataset. However, this section will provide an overview of the most important steps.

Pre-processing data is a very technical step that requires a profound knowledge of statistics and relevant software. While it isn’t the role of policymakers to perform this step, it’s important to understand what needs to be done and why. At the end of this section, you should feel confident to join technical meetings with data analysts, make meaningful and realistic data requests and promote structural change that facilitates the pre-processing of data.

How to get started

Before you start cleaning and pre-processing data, get to know the dataset. By examining the data, you can identify its structure, format and the types of variables present. It helps you identify data quality issues and determine appropriate strategies for handling various aspects of the data. This ensures that subsequent pre-processing steps are conducted effectively, and that the resulting data is suitable for the desired analysis or modelling tasks.

Let’s assume you work in the Ministry of Health, and you’re planning a healthcare awareness campaign. You received data on how different factors affect life expectancy to analyse what has the greatest impact and decide on what your campaign should focus on: eating habits or alcohol consumption.

Figure 1: Overview of data

You see that your dataset contains 14 columns with information on country, year, life expectancy and several factors that affect life expectancy such as alcohol, BMI, HIV/AIDS, etc. The varying numbers of non-null values in the third column of the table make you suspicious. 

One option for further exploration is to display the first rows of the dataset to gain a better understanding of its contents. This examination highlights various elements that require cleaning. We encounter structural errors like typos, inconsistent capitalization and duplicate entries.

Figure 2: First rows of data
light blue = typos; dark blue = missing values; orange = duplicates
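
If your analysts work in Python with the pandas library, this first look takes only a few lines. The following is a minimal sketch; the file name life_expectancy.csv is an assumption and stands in for wherever your data actually lives:

```python
import pandas as pd

# Load the dataset (the file name is hypothetical).
df = pd.read_csv("life_expectancy.csv")

# Structure of the dataset: columns, data types and non-null counts,
# the kind of overview shown in Figure 1.
df.info()

# First rows of the dataset, to spot typos, missing values and duplicates,
# as in Figure 2.
print(df.head())
```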

To further investigate your data and get an overview of the countries represented in the dataset, you could decide to create a bar chart that counts the number of entries for each country in the dataset. 

Figure 3: Bar chart counting the entries for each country name in the “Country” column of the dataset

You find that there are naming errors, especially with Afghanistan: the number of data points recorded for 'Afghanistan' is higher than for other countries, and several entries that clearly refer to Afghanistan contain typos. It becomes clear that certain data-cleaning procedures are necessary to ensure the validity of the data analysis.
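
A sketch of how such a count could be produced with pandas and matplotlib, continuing with the df DataFrame loaded above and assuming the column is called "Country":

```python
import matplotlib.pyplot as plt

# Count how many rows exist for each spelling of the country name.
counts = df["Country"].value_counts()

# Misspelled variants such as "Afganistan" would appear as separate,
# smaller bars next to the correct name (as in Figure 3).
counts.plot(kind="bar", figsize=(12, 4), title="Entries per country name")
plt.tight_layout()
plt.show()
```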

Once you’re familiarized with the dataset, you can begin the process of pre-processing the data. This typically involves the following five steps:

Figure 4: Process of pre-processing the data

Relevant steps of data pre-processing

Step 1: Data Cleaning

Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in the dataset:

Remove duplicates:

Duplicate records or observations can occur when data is collected from multiple sources, scraped from the web or obtained from clients or different departments. Data cleaning involves identifying and removing these duplicates to avoid bias in the analysis or modelling process.

Returning to our example of exploring the factors that influence life expectancy, you discovered the presence of duplicate records. To address this issue, you now remove duplicates and double-check you haven’t missed any:

Figure 5: Before and after removing a duplicate entry from a dataset
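
In pandas, removing exact duplicates from the df DataFrame and double-checking the result could look like this minimal sketch:

```python
# How many fully identical rows are there?
print(df.duplicated().sum())

# Drop exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()

# Double-check that no duplicates remain.
assert df.duplicated().sum() == 0
```
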
Fix structural errors:

The next step in data cleaning focuses on addressing structural errors. These errors can occur due to issues during measurement, data transfer or general carelessness in data management. For instance, the variable date might have entries that put the month first and others that put the day first. Structural errors also include typos or inconsistent capitalization in categorical features. To identify and fix these errors, data analysts typically use so-called summary tables of all entries in a column, which show useful information such as the number of missing values or the mean value of a column. Visualizations can also help to spot inconsistencies or irregularities (see more on how to “visualise data”).

In the first section, “How to get started”, we already spotted structural errors in the spelling of Afghanistan. Being aware of these errors, you can now clean the dataset (correcting typos and inconsistent capitalization) and ensure the accuracy of the analysis:

Figure 6: Data before and after cleaning.
Typos have been fixed, duplicates have been removed, etc.
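
A minimal sketch of such corrections in pandas; the column names and the misspellings in the mapping are illustrative assumptions:

```python
# Remove leading/trailing whitespace and harmonize capitalization
# in the country column.
df["Country"] = df["Country"].str.strip().str.title()

# Correct known typos explicitly (the misspellings below are illustrative).
typo_map = {"Afganistan": "Afghanistan", "Afghanistn": "Afghanistan"}
df["Country"] = df["Country"].replace(typo_map)

# A summary table of a numeric column: count, mean, min, max, etc.
print(df["Life expectancy"].describe())
```
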
Handle missing values:

Missing values are a common issue in datasets and can arise for various reasons, such as data entry errors, equipment malfunctions or survey non-responses. Before deciding how to handle this issue, you should explore what data is missing and why.

To get started, ask the following questions: 

  • Which variables have a lot of missing values?
  • Which observations have a lot of missing values?
  • Can you detect any patterns behind missing values? To answer this question, you can use a scatterplot to plot the missing values, create a point/bar chart where missing values are colored differently or reorder columns or rows (e.g., by the number of missing values).

Common ways to deal with missing values (a short Python sketch follows below):

  • Delete observations with missing values: One way to address missing values is by removing the rows that contain null values.

In our example dataset, you discovered that some rows contain missing values. To prevent potential biases that could affect the subsequent analysis, you decide to remove those rows entirely.

Figure 7: Before and after removing rows with missing values
  • Impute missing values for continuous variables: To handle missing values in columns with continuous numeric data, a statistical approach involves replacing the missing values with the mean, median or mode of the remaining values. This method helps preserve the data and prevents loss compared to simply deleting the rows with missing values.
  • Estimate missing values: Previous methods of handling missing values don’t take advantage of the relationship between the variable containing the missing value and other variables. However, you can build models to estimate missing values (e.g., using a regression model as described in the article Data Analysis and Interpretation). Estimating missing values is more prone to mistakes, therefore, it’s important to have a good understanding of the variables in your dataset and how they relate to each other.

Find a step-by-step guide for handling missing values in Python here
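
To illustrate the first two options above, here is a minimal pandas sketch using the df DataFrame from the earlier examples; the column name "Alcohol" is an assumption:

```python
# Number of missing values per column: which variables are affected?
print(df.isna().sum().sort_values(ascending=False))

# Option 1: drop all rows that contain any missing value (as in Figure 7).
df_dropped = df.dropna()

# Option 2: impute missing values of a continuous column with its mean.
df_imputed = df.copy()
df_imputed["Alcohol"] = df_imputed["Alcohol"].fillna(df_imputed["Alcohol"].mean())
```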

Filter outliers:

Outliers are data points that are significantly higher or lower than the nearby data points and the general pattern of the dataset or graph. These extreme values can disrupt the overall trend or distribution of the data, and they may indicate potential errors, anomalies or unique observations. It's important to identify and handle outliers appropriately during data analysis to ensure accurate and reliable results.

Common methods to detect outliers include Z-Score, Standard Deviation Method, Boxplot Method and Isolation Forest. The choice of method depends on factors such as data distribution, dimensionality, presence of noise and the specific requirements of the analysis. It's often a good idea to experiment with different methods and evaluate their performance on the specific dataset at hand.

You can treat outliers by either deleting them when you’re sure they’re incorrect and insignificant, or by replacing them, for instance, with the mean value or a prediction from a model (see above on how to handle missing values).

Outliers are sometimes extreme values that are valid and genuine observations. In such cases, it can make sense to keep the outlier and understand the reasons behind its value to uncover hidden knowledge that has the potential to improve your analysis.
However, handling outliers is quite complex and requires a data expert.
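
As an illustration, the boxplot (interquartile range) method could be sketched in pandas like this, again using the df DataFrame from above; the column name is an assumption:

```python
# Flag values that fall outside 1.5 times the interquartile range.
q1 = df["Life expectancy"].quantile(0.25)
q3 = df["Life expectancy"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["Life expectancy"] < lower) | (df["Life expectancy"] > upper)]
print(f"{len(outliers)} potential outliers found")
```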

Step 2: Data Transformation

Data transformation is the process of changing the format, structure or values of data to make it more suitable for analysis, modeling or other data processing tasks. It usually involves applying various operations or functions to the data to achieve the desired transformation. Below are two standard types of data transformation.

Standardizing and normalizing units of measure:

Standardizing units involves converting observations to a consistent unit of measure, whereas normalizing observations describes the process of scaling measurements to a common reference or range, typically to a scale between 0 and 1 or -1 and 1. This process enables the comparability of variables. By bringing all measurements to the same scale, analysts can accurately understand relationships and patterns within the data. Furthermore, it can significantly impact the performance of statistical models and machine learning algorithms. Some algorithms, such as regression models or neural networks, produce better results when the scale and magnitude of the input variables are standardized.
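
A minimal sketch of both operations in pandas, assuming the columns "BMI" and "Alcohol" exist in the df DataFrame:

```python
# Min-max normalization: scale a column to the range 0 to 1.
bmi = df["BMI"]
df["BMI_normalized"] = (bmi - bmi.min()) / (bmi.max() - bmi.min())

# Z-score standardization: rescale a column to mean 0 and standard deviation 1.
alcohol = df["Alcohol"]
df["Alcohol_standardized"] = (alcohol - alcohol.mean()) / alcohol.std()
```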

Creating new variables:

An analysis may require a new variable. This can be the case if, for example, averages or other calculations derived from the existing variables are of interest for the analysis.
In some cases, the relationship between a variable and the outcome variable may not follow the pattern that the model you’d like to apply assumes. For instance, a model may assume a linear relationship between two variables, while the observed relationship is curved or U-shaped. By squaring a variable or taking its logarithm or root, you can adjust for the non-linear relationship so that a linear regression model becomes more precise. Note that this changes the interpretation of the variable’s effect in the model. For more information on interpreting data analyses, see the section on how to “analyse and interpret data”.
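
Two small illustrations in pandas: a derived average and a log transformation. The column "GDP" is hypothetical and only serves as an example of a skewed variable:

```python
import numpy as np

# A new variable derived from existing ones: the average alcohol
# consumption per country across all years.
df["Alcohol_country_avg"] = df.groupby("Country")["Alcohol"].transform("mean")

# A log transformation to straighten out a curved relationship before
# fitting a linear model (log1p handles zero values gracefully).
# "GDP" is a hypothetical column used only for illustration.
df["log_GDP"] = np.log1p(df["GDP"])
```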

Step 3: Data Reduction

Data reduction refers to the process of reducing the size or complexity of data while preserving its essential information. Data reduction is commonly used when dealing with datasets that are large, unwieldy or contain redundant or irrelevant information. The objective is to simplify the data without losing critical insights or compromising the accuracy of analyses. The reduction in data volume directly translates to lower storage requirements, which in turn leads to cost savings in hardware, infrastructure and maintenance. Additionally, statistical models for data analysis perform better if irrelevant variables, which don’t help to explain a certain effect, are omitted.

Data reduction techniques can include:

Feature selection: 

This process is part of data analysis and involves selecting a subset of relevant features (another word for variables) from a larger set of available features. The objective is to reduce the complexity or size of the dataset and to enhance the interpretability and performance of your model by focusing on the most informative and discriminative features. There are different statistical techniques for determining which features carry the most informative value.
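
As a very simple illustration, numeric features can be ranked by the strength of their correlation with the outcome variable; the outcome column name is an assumption, and more robust selection techniques exist:

```python
# Absolute correlation of every numeric feature with the outcome variable,
# sorted from strongest to weakest (the outcome itself appears first with
# a correlation of 1).
correlations = (
    df.select_dtypes("number")
      .corr()["Life expectancy"]
      .abs()
      .sort_values(ascending=False)
)
print(correlations)
```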

Sampling:

Data sampling is a statistical technique for selecting a subset of data from a larger population or dataset. By carefully choosing a representative sample, data scientists or researchers can reduce the overall size of the data while still capturing meaningful information about the entire population. This subset can then be used to draw accurate conclusions, make inferences or perform analyses that are representative of the larger dataset, saving computational resources and time.
There are two main types of sampling methods that can be utilized:

  • Probability sampling: This method involves randomly selecting elements from the dataset, so that every element has a known chance of being included and the selection of one element doesn’t depend on the others (a simple random sample, sketched below, is the most common example).
  • Non-probability sampling: Non-probability sampling is another method of sampling where the selection of individuals or elements for the sample is based on the judgment of the analyst or researcher, rather than random chance. In this approach, the analyst decides which individuals or elements to include in the sample based on their expertise or convenience. However, since the selection isn’t based on randomization, it can be challenging to determine if the sample truly represents the larger population. 
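
A minimal sketch of drawing a simple random sample with pandas; the fraction and random seed are illustrative:

```python
# Probability sampling: draw a random 10% sample of the rows.
# random_state makes the sample reproducible.
sample = df.sample(frac=0.1, random_state=42)
print(len(sample), "rows in the sample")
```
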
Subset data:

Data subsetting involves extracting specific portions of a dataset based on certain criteria or conditions. This process allows us to focus on particular rows or columns of the data or isolate observations that possess specific characteristics. Subsetting is commonly performed during the data exploration phase to examine the relationship between variables, especially to identify potential interactions. It also enables us to extract subsets of data for separate analyses or to perform specific operations on them. Essentially, data subsetting helps us narrow down our data to relevant portions that are of interest for further investigation or analysis.

During the exploratory data analysis phase, you discovered a variable named "Status" that categorizes countries as either "Developed" or "Developing". However, for your analysis, you’re only concerned with developing countries. Therefore, you decide to create a subset of your data that includes information exclusively on developing countries. This subset will allow you to focus specifically on the relevant observations that meet your criteria of interest.

Figure 8: Selecting only entries where the Status variable shows the value "Developing"
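
In pandas, this subset could be created with a simple condition; the additional column selection is an illustrative assumption:

```python
# Keep only rows for developing countries (as in Figure 8).
developing = df[df["Status"] == "Developing"]

# Optionally keep only the columns needed for the analysis.
developing = developing[["Country", "Year", "Life expectancy", "Alcohol", "BMI"]]
```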

Step 4: Data Integration

Data integration is the process of combining and consolidating data from multiple sources into a unified and consistent view. It involves bringing together data from different systems, databases, files or formats, and transforming and harmonizing it to create a comprehensive and reliable dataset.

Harmonizing data

The first step in data integration is to harmonize the different datasets. The goal is to ensure that data collected from various systems or sources can be combined, compared and analysed effectively, eliminating discrepancies and inconsistencies. Harmonizing the data involves transforming, cleaning and reformatting it so that it adheres to a common set of standards and definitions. From the previous sections, you should already know what cleaning and transforming data entails. So, you only need to decide what standards all data sources should be aligned to. For instance, every dataset might contain a column with the variable “Date”. You need to decide which format “Date” should take (e.g., DD-MM-YYYY, MM-DD-YYYY or MM-DD-YY) and then transform the “Date” column in every dataset to this format.
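
A small sketch of harmonizing date formats in pandas; the two toy datasets and their formats are purely illustrative:

```python
import pandas as pd

# Two illustrative datasets that store the same date in different formats.
df_a = pd.DataFrame({"Country": ["Afghanistan"], "Date": ["01-03-2015"]})  # DD-MM-YYYY
df_b = pd.DataFrame({"Country": ["Afghanistan"], "Date": ["03/01/15"]})    # MM/DD/YY

# Convert both to one standard datetime representation so they can be
# compared and combined.
df_a["Date"] = pd.to_datetime(df_a["Date"], format="%d-%m-%Y")
df_b["Date"] = pd.to_datetime(df_b["Date"], format="%m/%d/%y")
```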

Merging data:

After harmonizing your datasets, you might want to merge them, combining multiple datasets into a single dataset based on a common attribute or key. For instance, the country name could be an attribute by which you want to merge. This means that two datasets with different information have a column with country names. Now, you can merge the information on, say, “Afghanistan” from both datasets into one larger dataset. The process of merging data is commonly used to consolidate information from different sources to create a more comprehensive and unified dataset. This can facilitate analyses and reporting. 
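
A minimal sketch of merging two harmonized datasets on a common key; the datasets and values below are illustrative:

```python
import pandas as pd

# Two illustrative, already harmonized datasets with different information
# about the same countries.
life = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                     "Life expectancy": [65.0, 77.8]})
alcohol = pd.DataFrame({"Country": ["Afghanistan", "Albania"],
                        "Alcohol": [0.01, 4.6]})

# Merge on the common key "Country" to obtain one combined dataset.
merged = pd.merge(life, alcohol, on="Country", how="inner")
print(merged)
```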

Step 5: Data Storage

Once the data has been cleaned and pre-processed, a decision must be made on how and where best to store the data. 

How to store data: Creating databases

When you’re dealing with large amounts of data from various sources, it makes sense to develop a database. Databases provide a structured way to store, retrieve and adjust information. Creating a database involves several steps, and the specific process may vary depending on the database management system (DBMS) you’re using. 

  1. Identify data requirements: Determine what data you need to store, the relationships between different data entities and the expected volume of data.
  2. Choose a DBMS: Select a suitable DBMS based on your requirements and the type of data you’re dealing with. Common DBMS options include MySQL, PostgreSQL, Microsoft SQL Server, Oracle, MongoDB, etc.
  3. Design the database schema: Create a logical model of your database by defining tables, columns and relationships between tables. This process is known as database schema design.
  4. Set constraints: Apply constraints to ensure data integrity, such as primary keys, unique constraints and check constraints.
  5. Populate data: Insert initial data into the tables. This can be done manually, through data import or programmatically using scripts (see the sketch after this list).
  6. Test and optimize: Thoroughly test the database to ensure it works as expected. Optimize the database structure and queries for better performance if necessary.
  7. Backup and recovery: Set up regular data backups and implement a disaster recovery plan to protect against data loss.
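
As a small illustration of steps 3 and 5, the sketch below writes the cleaned df DataFrame from the earlier examples into an SQLite database, a lightweight file-based DBMS that ships with Python; the file and table names are assumptions:

```python
import sqlite3
import pandas as pd

# Connect to (or create) a local SQLite database file.
conn = sqlite3.connect("health_data.db")

# Write the cleaned dataset to a table; pandas creates the schema for us.
df.to_sql("life_expectancy", conn, if_exists="replace", index=False)

# Read it back to check that the data was stored as expected.
check = pd.read_sql("SELECT COUNT(*) AS n_rows FROM life_expectancy", conn)
print(check)

conn.close()
```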

Where to store data: Cloud storage vs. Network Attached Storage

There are different options to store and access the data. Data can be stored on personal devices (such as hard drives, disk drives or USB drives), cloud storage and Network Attached Storage (NAS). Typically, the institution you’re working for has clear guidelines on which storage option to use, and personal devices will probably be the exception. 

Here are some practical considerations when deciding between cloud storage and NAS: 

Cloud storage involves storing data on remote servers managed by a third-party provider. The data is accessed over the internet, and the provider is responsible for the storage infrastructure and maintenance. The main advantages are convenient data access from anywhere with an internet connection and scalability without additional hardware investments. The main disadvantages include dependence on stable internet connectivity, data privacy concerns from relying on third-party providers, loss of data sovereignty and expensive service contracts.

NAS involves storing data on a dedicated storage device that is connected to a local network. It provides shared storage accessible by multiple devices within the network. Typically, public institutions have such networks already set up. The main advantage is the complete ownership of the data infrastructure and the centralized data management. However, typical drawbacks include limited accessibility outside the local network as well as security, maintenance and administration costs. Also, scaling of NAS storage capacity can be more complex and may require investing in additional hardware.

How do I know I’ve successfully pre-processed my data?

Review your work against each of the five steps before moving on:

  • Data Cleaning
  • Data Transformation
  • Data Reduction
  • Data Integration
  • Data Storage

What’s next? 

Before moving on, ensure that all pre-processing steps are properly documented. This will help others to understand the reasoning behind each change and ensure reproducibility. With that, you can move to the section “Analyse and Interpret Data”.

Related Use Cases

  • Indonesia: The Power of Open-Access Data to Mitigate Flooding in Indonesia
  • Vietnam: Using Online Job Vacancy Data for more evidence-informed labour policies in Vietnam