Validate, analyse & interpret the data

Analysing data is complex and there are many considerations to make. This section will provide you with an overview of the main considerations and common mistakes. You will also have a better understanding of what to look out for when interpreting data, so the conclusions you draw will be correct and robust.

What to expect

At the end of this section, you should be able to identify which type of data analysis is appropriate for your set of analytical questions. You’ll also be able to weigh the accuracy and interpretability of your desired analytical outcome. When interpreting data, you’ll be aware of common mistakes and limitations.

Policymakers usually do not perform the data analysis themselves. Rather, they need to make meaningful and realistic requests to their technical teams. Therefore, this section only provides an overview of the main considerations and common mistakes when analysing and interpreting data.

How to get started

Before starting any type of analysis, you should be very clear about the set of questions that you’d like to answer. You start with the problem you defined (see “Define your Problem Statement”) and make a list of everything you need to know to have a better understanding of your problem and solution. A trick that might help you to come up with a list of questions is to think about the counterfactual of the problem that you’re trying to solve and consider marginalized groups that are often left out of the discussion.  

Imagine you work in the Ministry of Health, and you’ve been asked to revise a policy regarding healthcare expenditures. To make a more informed decision about how healthcare expenditures should be designed, you focus on the different diseases, lifestyles, and socioeconomic circumstances and how they affect life expectancies. You come up with a long list of questions that your data analysis should answer. Here are a few:

  • Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, drinking alcohol, etc.?
  • What’s the impact of schooling on life expectancy?
  • What’s the impact of immunization coverage on life expectancy?

Analyse the data

Descriptive data analysis  

Descriptive data analysis is a statistical method used to summarize and describe the main characteristics of a dataset. It helps in understanding the key features, trends and patterns within the data without making any inferences or generalizations about a larger population. Often, a descriptive analysis is sufficient to draw relevant conclusions.

Let's consider the example of using life expectancy data. Suppose we have a dataset that includes information about the life expectancy of individuals over a certain period of time. The dataset might include variables such as year recorded, life expectancy, diseases, alcohol consumption, etc.

  1. Summary Statistics: The first step in descriptive analysis is to compute summary statistics. This could involve calculating measures such as the mean, median and standard deviation of life expectancy across years. These statistics provide an overall understanding of the average and spread of life expectancy and other variables in the dataset.
  1. Distribution Analysis: To further examine the distribution of life expectancy, you can create a histogram or a box plot. This helps you understand how many observations lie in a certain range of values. In statistics, the distribution of all observations can be split into so-called quartiles. The first quartile (the 25th percentile) is a value of life expectancy, e.g. 55 years, below which 25% of all observations fall. Examining the distribution of a variable will help you to detect outliers and understand the skewness of the data.
  1. Group Comparison: Descriptive analysis can also involve comparing life expectancy between different groups. For instance, you can compare different gender groups and examine the differences in life expectancy across these groups. This analysis can provide insights into the relationship between gender and life expectancy.
  1. Correlation Analysis: Another aspect is exploring the relationship between life expectancy and other variables. For example, you can investigate the correlation between life expectancy and factors such as healthcare expenditure, education level or income level. In statistics, you can calculate a correlation coefficient that tells you the strength and direction of a relationship between two variables. It ranges from -1 (perfect negative relationship) through 0 (no linear relationship) to +1 (perfect positive relationship). For instance, we might observe that the higher the income level, the higher the life expectancy. This is a positive relationship, and the coefficient tells you how strong it is. Scatter plots can help to further understand the strength and direction of these relationships.

By performing these types of analyses on the life expectancy data, you gain valuable insights into the average life expectancy, its distribution, trends over time and potential associations with other variables.  
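The four steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the records, the variable names and the `pearson` helper are hypothetical and chosen for this example, not taken from a real dataset.

```python
from statistics import mean, median, stdev, quantiles

# Hypothetical records: (gender, income in thousand USD, life expectancy in years)
records = [
    ("female", 12, 66.0),
    ("male", 10, 61.0),
    ("female", 25, 72.0),
    ("male", 22, 68.0),
    ("female", 40, 79.0),
    ("male", 38, 75.0),
    ("female", 55, 83.0),
    ("male", 50, 80.0),
]

life = [r[2] for r in records]
income = [r[1] for r in records]

# 1. Summary statistics: centre and spread of life expectancy
summary = {"mean": mean(life), "median": median(life), "stdev": stdev(life)}

# 2. Distribution: the three cut points that split the data into quartiles
q1, q2, q3 = quantiles(life, n=4)

# 3. Group comparison: average life expectancy per gender group
by_gender = {
    g: mean([r[2] for r in records if r[0] == g]) for g in ("female", "male")
}


# 4. Correlation: Pearson coefficient between income and life expectancy
def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


r = pearson(income, life)

print(summary)
print("quartile cut points:", q1, q2, q3)
print(by_gender)
print("correlation income vs. life expectancy:", round(r, 2))
```

In practice, your technical team would likely use a data analysis library and plots rather than plain lists, but the quantities being computed are the same.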

What’s statistical learning?

Statistical learning refers to a set of approaches to automatically learn patterns and relationships within data to make predictions or uncover insights. Prediction and inference are two fundamental concepts in statistical analysis and have different goals and methods. Understanding the difference between the two will help you to set up the appropriate analysis for your set of questions.

  1. Inference: Inference is concerned with generalizing and drawing conclusions about the underlying population from which a sample was taken.  

For example, in the case of life expectancy data, you might want to infer whether there’s a significant relationship between factors like education level and life expectancy at the population level. By using statistical techniques such as hypothesis testing or confidence intervals, you can draw conclusions about the relationship between these variables.

The goal of inference is to produce a better understanding of underlying patterns and relationships within the data and make broader statements about the population based on the sample data.

  1. Prediction: Prediction aims to estimate or forecast a specific outcome or variable of interest based on available data. In the context of life expectancy, you might be interested in predicting the average life expectancy of a community based on certain factors like age, gender, education level and healthcare access.

For example, you could use a technique such as regression analysis to build a model that predicts life expectancy based on these factors. The model would be trained using historical data from different communities where we know both the input variables (age, gender, education, healthcare access) and the corresponding life expectancy values. Once the model is trained, you can use it to predict the life expectancy of new communities given their characteristics and forecast how life expectancy will change in the future.

The goal of prediction is to provide accurate estimates or forecasts for new or future observations. It focuses on making specific predictions for individual cases that the model has not seen before, rather than drawing general conclusions about the population.
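To make the idea of training and predicting concrete, here is a minimal sketch of a one-variable linear regression fitted by ordinary least squares. The numbers are hypothetical, the single input (average years of schooling) is a simplification of the factors listed above, and the function names are illustrative.

```python
from statistics import mean

# Hypothetical training data: one value pair per community
schooling_years = [4, 6, 8, 10, 12]
life_expectancy = [58, 63, 66, 71, 74]


def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


intercept, slope = fit_line(schooling_years, life_expectancy)


def predict(years):
    return intercept + slope * years


# Prediction for a new community with 9 years of schooling on average
new_prediction = predict(9)

# Check accuracy on communities that were NOT used for training
held_out = [(7, 65), (11, 72)]
mean_abs_error = mean([abs(predict(x) - y) for x, y in held_out])

print("predicted life expectancy:", round(new_prediction, 1))
print("mean absolute error on held-out data:", round(mean_abs_error, 2))
```

Evaluating the model on held-out communities, rather than on the data it was trained on, is what tells you whether its predictions can be trusted for new cases.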

Trade-off between model complexity and interpretability

The trade-off between model complexity and interpretability refers to the fact that as a statistical or machine learning model becomes more complex and sophisticated, it tends to be less interpretable or understandable by humans.  

Suppose you want to develop a model to predict life expectancy based on factors such as age, gender, education level, income and healthcare access. To illustrate the trade-offs between different models, let’s consider two options, which are very far apart on the spectrum of complexity: A simple linear regression and a very complex neural network.

  1. Model Complexity:

Linear regression: Linear regression is a simple and interpretable model. It assumes a linear relationship between the input variables and target variable (life expectancy). The model estimates the contribution and impact of each variable on life expectancy. It’s relatively easy to interpret these relationships and understand how changes in the input variables affect the predicted life expectancy.

Neural network: Neural networks are highly complex and non-linear. They consist of multiple layers of interconnected nodes (neurons) and can capture intricate relationships and patterns within the data. These models are capable of learning complex representations and interactions between variables, which can potentially improve the accuracy of predictions. However, as the complexity increases, it becomes harder to interpret how the model arrives at its predictions. The relationship between the input variables and the output (life expectancy) is often obscured within the numerous layers and weights of the neural network, making it challenging to understand the underlying factors contributing to the prediction.

  1. Interpretability:

Linear regression: In the case of linear regression, the model provides interpretable coefficients for each input variable. For example, if the coefficient for education level is positive, it suggests that higher education is associated with increased life expectancy. This interpretability allows you to draw meaningful conclusions and make informed decisions based on the model's outputs.

Neural networks: On the other hand, neural networks often lack interpretability. The multiple layers and complex interactions make it difficult to explain why the model arrived at a particular prediction. The weights assigned to different variables, or the internal representations learned by the neural network, may not have a direct and intuitive interpretation. Consequently, it becomes challenging to gain insights into the factors that drive the predictions, limiting our ability to interpret and trust the model's outputs.
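The interpretability of linear regression can be illustrated with a small sketch that fits two input variables and reads off their coefficients. The data are hypothetical and were deliberately constructed from the rule "50 + 1.5 × schooling + 0.2 × income", so the fitted coefficients can be checked against known values; the `solve` helper is an illustrative implementation of Gaussian elimination, not a library function.

```python
# Hypothetical data, constructed as 50 + 1.5 * schooling + 0.2 * income
schooling = [4, 6, 8, 10, 12, 6]
income = [10, 30, 20, 50, 40, 60]
life = [58.0, 65.0, 66.0, 75.0, 76.0, 71.0]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


# Least squares via the normal equations (X'X) beta = X'y
rows = [[1.0, s, i] for s, i in zip(schooling, income)]
xtx = [[sum(r[a] * r[b] for r in rows) for b in range(3)] for a in range(3)]
xty = [sum(r[a] * y for r, y in zip(rows, life)) for a in range(3)]
intercept, coef_schooling, coef_income = solve(xtx, xty)

print("baseline:", round(intercept, 2))
print("years gained per extra year of schooling:", round(coef_schooling, 2))
print("years gained per extra income unit:", round(coef_income, 2))
```

Each coefficient can be read as "holding the other variable fixed, one more unit of this input is associated with this many additional years of predicted life expectancy". A neural network trained on the same data would offer no comparably direct reading of its weights. Note that these coefficients describe associations in the data and do not by themselves establish causation.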

Qualitative data analysis

Qualitative data analysis refers to the process of examining and interpreting non-numerical or non-quantifiable data to gain insights, identify patterns and generate meaningful interpretations. Qualitative data can include various types of information, such as interview transcripts, survey responses, field notes, observation records, open-ended questionnaire responses, audio or video recordings and textual data from documents or literature. Unlike quantitative data, which can be analysed using statistical methods, qualitative data requires a more interpretive and subjective approach.

Imagine you’ve done some interviews with health experts on factors that influence life expectancy. Here are some general steps of qualitative data analysis:

Step 1: Data Coding

  • Begin open coding: Read through the transcripts and identify segments related to factors affecting life expectancy, such as healthcare access, lifestyle choices, social support and environmental conditions.
  • Assign codes to these segments to describe their content. For example, "Healthcare access" and "Healthy lifestyle".

Step 2: Develop a Coding System

  • Create a coding system that includes main codes (e.g., "Healthcare access") and sub-codes (e.g., "Affordability of healthcare" and "Proximity to medical facilities").
  • Use the coding system consistently throughout the analysis to categorize relevant data.
Figure 1: Each transcribed segment is assigned one or more codes that describe the overall category by which you want to cluster your information.

Step 3: Axial Coding

  • Connect related codes to develop broader categories. For instance, connect “Social support” with "Healthcare access" and "Lifestyle choices" to understand the influence of social support structures on life expectancy.
  • Use axial coding to identify patterns and relationships between themes, such as how socioeconomic factors intersect with lifestyle choices.

Figure 2: Example of Axial Coding | Source: GIZ

Step 4: Thematic Analysis

  • Look for patterns and recurring themes within the data. Themes may include "Access to healthcare", "Social support networks" and "Health-conscious lifestyle".
  • Summarize and group related codes and categories to develop these overarching themes.

These steps are very generic and should give you a rough idea of how you could approach a qualitative data analysis. The steps for data collection and interpretation are left out as they’re covered in other parts of the navigator.  
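The coding steps can be mimicked in a short sketch. Coding is normally done by a human reader; the keyword matching below is only a crude stand-in to show how a coding system with main codes and sub-codes is applied and how co-occurring codes hint at connections for axial coding. The codebook keywords and the interview segments are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical coding system: main code -> sub-code -> trigger keywords
codebook = {
    "Healthcare access": {
        "Affordability of healthcare": ["afford", "cost", "expensive"],
        "Proximity to medical facilities": ["distance", "far", "nearest clinic"],
    },
    "Healthy lifestyle": {
        "Exercise": ["exercise", "walk"],
        "Diet": ["diet", "vegetables"],
    },
    "Social support": {
        "Family and community": ["family", "neighbour", "community"],
    },
}

# Hypothetical transcript segments
segments = [
    "Many patients cannot afford regular check-ups.",
    "The nearest clinic is far away, so families rely on neighbours for transport.",
    "People who walk daily and eat vegetables stay healthier.",
    "Community groups encourage older people to exercise together.",
]


def code_segment(text):
    """Return the set of (main code, sub-code) pairs whose keywords appear."""
    lowered = text.lower()
    found = set()
    for main, subs in codebook.items():
        for sub, keywords in subs.items():
            if any(keyword in lowered for keyword in keywords):
                found.add((main, sub))
    return found


main_counts = Counter()    # how often each main code occurs (thematic overview)
co_occurrence = Counter()  # which main codes appear together (axial coding hint)

for segment in segments:
    main_codes = sorted({main for main, _ in code_segment(segment)})
    main_counts.update(main_codes)
    co_occurrence.update(combinations(main_codes, 2))

print(main_counts)
print(co_occurrence)
```

Codes that frequently appear in the same segment, such as "Healthcare access" and "Social support" here, are candidates for the broader categories developed during axial coding.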

Natural Language Processing

Progress in artificial intelligence has enabled the development of algorithms and models that allow computers to understand, interpret and generate human language in a way that is both meaningful and useful. To find further information, search for the term “natural language processing” (NLP).

Interpreting data

In today's data-driven world, policymakers rely heavily on accurate and meaningful data analysis to make informed decisions that shape the course of society. However, misinterpreting data can have severe consequences, undermining the very essence of evidence-based policymaking. Erroneous conclusions drawn from misinterpreted data can yield policies that are ineffective, inefficient or even counterproductive. Such missteps squander resources, hinder progress and fail to address the needs of the communities that policymakers serve.

One of the gravest dangers of misinterpreting data lies in the perpetuation of biases and the reinforcement of existing inequalities. Data can inadvertently reflect societal biases due to various factors, such as biased data collection methods or skewed sample selection. Recognizing and challenging these biases are paramount to ensuring fair and equitable policies serve all segments of society.

Understanding the results of the data analysis

Correlation: Correlation refers to a statistical relationship between two variables. With life expectancy, we might examine the correlation between factors such as education level and life expectancy. For example, we might find a positive correlation between higher education levels and longer life expectancy. This means that, on average, individuals with higher education tend to live longer and vice versa. However, correlation alone doesn’t imply causation. It indicates that there’s a relationship between the variables, but it doesn’t explain the cause-and-effect relationship between them.

Causation: Causation, on the other hand, suggests a cause-and-effect relationship between variables. In the case of life expectancy, identifying causation would involve determining whether a specific factor directly causes changes in life expectancy. For example, we might investigate whether smoking directly causes a decrease in life expectancy. Establishing causation requires rigorous study designs, such as randomized controlled trials or carefully designed longitudinal studies that control for confounding factors. In general, causation is much more difficult to prove than correlation and should, therefore, only be claimed when the evidence clearly supports it.

Statistical Significance: Statistical significance is a measure that helps determine whether an observed result is likely to be due to a real effect or is simply due to chance. In the context of life expectancy data, statistical significance is used to assess whether a relationship or difference between two groups (e.g., smokers vs. non-smokers) is likely to reflect a real difference or whether it could have occurred by random chance. A statistically significant result is not automatically an important one: a very small difference can be significant if the sample is large.
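One way to make "could this have occurred by chance?" tangible is a permutation test, sketched below. It asks: if group labels were assigned at random, how often would the gap between the groups be at least as large as the one observed? The ages at death are hypothetical, and a permutation test is only one of several tests your analysts might choose.

```python
from itertools import combinations
from statistics import mean

# Hypothetical ages at death for two small groups
non_smokers = [78, 81, 75, 84, 79, 82, 77, 80]
smokers = [72, 70, 76, 68, 74, 71, 73, 75]

observed = mean(non_smokers) - mean(smokers)

pooled = non_smokers + smokers
n = len(non_smokers)
total = sum(pooled)

# Enumerate every possible way to split the 16 people into two groups of 8
extreme = 0
splits = 0
for idx in combinations(range(len(pooled)), n):
    group_sum = sum(pooled[i] for i in idx)
    diff = group_sum / n - (total - group_sum) / (len(pooled) - n)
    # Count splits whose gap is at least as large as the observed one
    if abs(diff) >= abs(observed) - 1e-12:
        extreme += 1
    splits += 1

p_value = extreme / splits

print("observed difference in years:", observed)
print("share of random splits at least as extreme (p-value):", p_value)
```

A small p-value, conventionally below 0.05, means that a gap this large would rarely arise from random labelling alone. It does not tell you how large or how important the difference is.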

Confidence Intervals: A confidence interval provides a range of values within which a population parameter, such as the mean or median, is likely to fall. It gives a measure of the uncertainty associated with an estimate. In the context of life expectancy data, a confidence interval can be used to estimate the range within which the true difference in life expectancy between two groups lies.

For example, the data analysts might calculate a 95% confidence interval for the difference in life expectancy between smokers and non-smokers. Let's say they find that the confidence interval is two to five years. Informally, this means that they’re 95% confident that the true difference in life expectancy between smokers and non-smokers falls within this range. More precisely, if the study were repeated many times, about 95% of the intervals calculated in this way would contain the true difference.

The confidence interval provides a measure of the precision of the estimate. A narrower confidence interval indicates more precise estimates, whereas a wider interval indicates greater uncertainty or variability in the data.
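As a rough sketch, a confidence interval for a difference in means can be computed from the group means, the group variances and the sample sizes. The example below uses a normal approximation; with samples this small, analysts would typically use a t-distribution instead, which gives a somewhat wider interval. The data and the `diff_ci` helper are hypothetical.

```python
from statistics import NormalDist, mean, variance

# Hypothetical ages at death for two small groups
non_smokers = [78, 81, 75, 84, 79, 82, 77, 80]
smokers = [72, 70, 76, 68, 74, 71, 73, 75]


def diff_ci(a, b, level=0.95):
    """Normal-approximation confidence interval for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    # Standard error of the difference between two independent means
    std_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return diff - z * std_error, diff + z * std_error


lower_95, upper_95 = diff_ci(non_smokers, smokers, level=0.95)
lower_99, upper_99 = diff_ci(non_smokers, smokers, level=0.99)

print("95% interval:", round(lower_95, 1), "to", round(upper_95, 1))
print("99% interval:", round(lower_99, 1), "to", round(upper_99, 1))
```

Asking for more confidence (99% instead of 95%) widens the interval, while collecting more data shrinks the standard error and narrows it.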

Human bias

Overgeneralization

refers to drawing broad conclusions based on limited or insufficient data. As policymakers, you must be very careful about which group or cohort the data is referring to and how it links to the general population.  

For instance, someone might observe a correlation between exercise habits and longer life expectancy in a study and then overgeneralize by assuming that all individuals who exercise regularly will have extended life spans. However, this overlooks other crucial factors such as genetics, lifestyle choices and access to healthcare that also influence life expectancy.

Confirmation bias

occurs when individuals seek or interpret data in a way that confirms their pre-existing beliefs or expectations while disregarding contradictory evidence. Therefore, it’s essential to question your own intentions related to the results of data analysis. In general, data analysis should follow sound and objective statistical methods with an open-ended outcome. In practice, however, statistics are often used solely to confirm the position of whoever produced them.

For example, if someone holds the belief that genetics is the sole determinant of life expectancy, they may search for and interpret data that supports this notion while disregarding evidence that highlights the influence of other factors like behavior or environmental factors.

Neglecting context

involves interpreting data without considering the broader context or the complexities surrounding the topic. The oversimplification and incomplete understanding of the factors influencing your target variable, such as life expectancy, can lead to the wrong conclusion.

For instance, let's say a study finds a correlation between income levels and life expectancy. Neglecting context would involve solely attributing differences in life expectancy to income while disregarding the potential influence of factors like access to healthcare, education, lifestyle choices or environmental factors that often interact with income to shape life expectancy outcomes.

Important types of data analysis bias

Sampling bias

occurs when the sample used in a study or analysis is not representative of the target population, leading to inaccurate conclusions. Sampling bias can occur if a study sample disproportionately includes individuals from certain demographics or geographic areas.

For example, if a study on life expectancy only includes participants from a specific age group or a particular socioeconomic background, the findings may not be applicable to the entire population. The conclusions drawn from such a sample would be limited in their generalizability and could result in biased interpretations as mentioned above.

Selection bias

arises when the selection of participants or data points is not random or representative. This typically happens when certain individuals or groups are systematically excluded or included in the study based on specific criteria.

For instance, if your data analysis on life expectancy only includes individuals who voluntarily participate or only includes those who have access to healthcare services, the findings may not accurately represent the entire population. This bias can lead to misleading interpretations of life expectancy patterns and factors.

Most of the time, the data handed to an analyst comes with inherent and often difficult-to-quantify bias. It is often left to the analyst to uncover how much this bias may affect conclusions. That is separate from the less common but distinct task of gaining deeper meaning from a reliable sample. Data providers (whether internal or external to an organization) are rarely fully transparent about the fact that most of the samples used are convenience samples, drawn from those willing to participate and share at the time of data collection, and that participants were rarely chosen specifically to be part of a sample pool designed to reduce overall bias. As a policymaker, you should always discuss with your data analyst what biases could potentially distort your results and what statistical methods could help to address them. In the end, you should always be transparent about potential biases that might have influenced your conclusions.
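One statistical method analysts may suggest for an unrepresentative sample is reweighting, sometimes called post-stratification. The sketch below assumes that the true population shares of each group are known from another source, such as a census; those shares and all sample values are hypothetical.

```python
from statistics import mean

# Hypothetical convenience sample: urban residents are heavily over-represented
sample = {
    "rural": [64, 66],
    "urban": [76, 78, 74, 77, 75, 79, 76, 73],
}

# Assumed known population shares (e.g. taken from a census)
population_share = {"rural": 0.6, "urban": 0.4}

# Naive estimate: treats the sample as if it were representative
all_values = [v for values in sample.values() for v in values]
naive_mean = mean(all_values)

# Reweighted estimate: each group's mean counts according to its population share
weighted_mean = sum(
    population_share[group] * mean(values) for group, values in sample.items()
)

print("naive estimate:", round(naive_mean, 1))
print("reweighted estimate:", round(weighted_mean, 1))
```

Reweighting only corrects for the characteristics you weight on. If the rural respondents who took part differ systematically from those who did not, some bias remains.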

Limitations

Confounding variables

are variables that are not the main focus of analysis but can influence the relationship between the variables being studied.  

For example, if your data analysts find a correlation between higher life expectancy and the consumption of a particular food item, it may be tempting to conclude that the food item directly causes increased life expectancy. However, there might be confounding variables at play. People who consume the food item may also have higher incomes, better access to healthcare or engage in other health-conscious behaviors that contribute to longer life expectancy.

To address confounding variables, data analysts employ various techniques such as statistical adjustments, stratification or regression analysis to isolate the effects of the variables of interest. However, it can be challenging to fully eliminate the impact of all confounding variables, and their presence may still introduce bias and affect the interpretation of data.
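Stratification, one of the techniques mentioned above, can be sketched as follows: compare the groups within each level of the suspected confounder instead of across the whole dataset. The records are hypothetical and were constructed so that income fully explains the apparent effect of the food item.

```python
from statistics import mean

# Hypothetical records: (eats the food item, income group, life expectancy)
records = [
    (True, "high", 80), (True, "high", 81), (True, "high", 79), (False, "high", 80),
    (True, "low", 68), (False, "low", 67), (False, "low", 68), (False, "low", 69),
]


def gap(rows):
    """Mean life expectancy of eaters minus that of non-eaters."""
    eaters = [life for eats, _, life in rows if eats]
    others = [life for eats, _, life in rows if not eats]
    return mean(eaters) - mean(others)


# Crude comparison ignores income entirely
crude_gap = gap(records)

# Stratified comparison: the same gap, computed separately per income group
stratified_gaps = {
    group: gap([r for r in records if r[1] == group]) for group in ("high", "low")
}

print("crude gap in years:", crude_gap)
print("gap within each income group:", stratified_gaps)
```

In this constructed example the crude gap of six years disappears once people with similar incomes are compared, because high-income people are both more likely to eat the food and more likely to live longer. Real data rarely separates this cleanly.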

Incomplete data

refers to missing or unavailable information, which can hinder the accurate interpretation of data. In the context of life expectancy, incomplete data can arise due to several reasons, including variations in data collection methods across different regions or countries, underreporting of deaths, errors in recording birth or death dates or gaps in historical data. Incomplete data can lead to biased conclusions and inaccurate assessments. Often, this deeply affects already marginalized groups, as they encounter even greater barriers to meaningful participation in society, at times being completely excluded from the data collection efforts.

For instance, if data from people in rural areas are missing, overall life expectancy may be underestimated or overestimated and the situation of those groups remains invisible, which might lead to inadequate policy decisions in these areas. In addition, if data collection methods change over time, it becomes challenging to compare life expectancy trends accurately.

To mitigate the impact of incomplete data, data analysts may use statistical techniques like imputation to estimate missing values or employ data validation methods to ensure the reliability of the available data. However, these methods may introduce their own limitations and biases (see article on “pre-process data”).
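A minimal sketch of the simplest form of imputation, replacing missing values with the mean of the observed ones, shows both the idea and one of its limitations. The values are hypothetical; analysts often prefer more sophisticated methods.

```python
from statistics import mean, stdev

# Hypothetical district-level life expectancy; None marks a missing report
reported = [71.0, None, 68.0, 75.0, None, 70.0]

observed = [v for v in reported if v is not None]
fill_value = mean(observed)

# Mean imputation: replace every missing value with the observed average
imputed = [fill_value if v is None else v for v in reported]

# Keep a record of which values were filled in, for transparency
was_missing = [v is None for v in reported]
share_missing = sum(was_missing) / len(reported)

print("share missing:", round(share_missing, 2))
print("imputed series:", imputed)
print("spread before:", round(stdev(observed), 2), "after:", round(stdev(imputed), 2))
```

The imputed series looks complete, but its spread is artificially smaller than that of the observed values, which can make results seem more certain than they are. Keeping the `was_missing` flags lets you report how much of the data was estimated.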

Comparison

Comparison is an effective method to interpret data and understand patterns within a dataset. When it comes to life expectancy data, the ability to compare and contrast across different populations, regions or time periods can provide valuable information about health disparities, socio-economic factors and the impact of various interventions or policies. Here are some ideas on how to compare data based on the example of life expectancy:  

  1. International Comparison: Comparing life expectancy across countries allows for a better understanding of global health disparities. It enables us to identify variations in healthcare systems, social determinants of health, economic development and other factors that contribute to differences in life expectancy. By examining countries with high life expectancies, we can uncover successful approaches and strategies that may be beneficial in improving health outcomes in other nations.
  1. Historical Comparison: Analysing life expectancy data over time allows us to track trends and assess the impact of social, economic and medical advancements on human longevity. Comparing life expectancy data from different decades or centuries can reveal patterns such as the effects of improved healthcare, vaccination programs, public health initiatives or changes in lifestyle behaviors. It helps identify major milestones in public health and areas that still require attention.
  1. Regional or Subnational Comparison: Comparing life expectancy within specific regions or subnational areas provides insights into health disparities within a country. It helps identify areas with lower life expectancies and allows policymakers to focus on targeted interventions and resource allocation. Comparisons between urban and rural areas, socioeconomic groups, or different ethnicities within a country can shed light on health inequalities and inform public health policies and programs.
  1. Gender Comparison: Analysing life expectancy data by gender highlights potential disparities in healthcare access, lifestyle behaviors or biological factors. It can lead to targeted interventions or policies aimed at improving health outcomes for specific gender groups.
  1. Comparative Analysis of Contributing Factors: Comparing life expectancy data alongside other relevant factors, such as income levels, education, healthcare expenditure or disease prevalence, helps identify correlations and potential causal relationships. Understanding how these factors interact with life expectancy can inform policy decisions, interventions and resource allocation to address disparities and improve overall health outcomes.
  1. Benchmarking: Using a benchmark or reference point for comparison, such as regional averages, global averages or best-performing countries, can provide a target value against which to evaluate progress. This enables policymakers and healthcare professionals to set goals, track improvements and learn from successful approaches implemented elsewhere.  
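As a small illustration of benchmarking and regional comparison, the sketch below measures each region's gap to a reference value and lists the regions that fall short, starting with the largest shortfall. The region names, values and the benchmark are hypothetical.

```python
# Hypothetical regional life expectancy in years
regions = {"North": 68.5, "South": 74.0, "East": 71.2, "West": 76.3}

# Hypothetical benchmark, e.g. the national average
benchmark = 72.0

# Gap to the benchmark: negative values mean the region is below the reference
gaps = {name: round(value - benchmark, 1) for name, value in regions.items()}

# Regions below the benchmark, largest shortfall first
below_benchmark = sorted(
    (name for name, g in gaps.items() if g < 0), key=gaps.get
)

print(gaps)
print("regions below benchmark:", below_benchmark)
```

The same pattern applies to the other comparisons in the list: replace the regions with countries, decades or gender groups, and choose a benchmark that is meaningful for your policy question.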

How do I know I’ve successfully analysed and interpreted my data?  

As mentioned in the introduction to this article, analysing data is typically a task for data analysts and not policymakers. However, even if you’re missing the technical knowledge to conduct the actual analysis, you should still be able to understand the main considerations when analysing data, so you can request the right types of analysis.

The following questions might help you to ensure your data analysis is done correctly:

After reading this article, you should be aware of the many pitfalls when interpreting data. As a policymaker, it is crucial to mitigate the risk of misinterpreting your data analysis as much as possible to ensure no wrong conclusions will be drawn.  

The following questions might help you to produce accurate and objective interpretations:

What’s next?

Learn how to visualize your data.

Case Downloads

Analysing data with no programming skills by using ChatGPT

Related Use Cases

Vietnam
Using Online Job Vacancy Data for more evidence-informed labour policies in Vietnam
Argentina
Using big data to understand movement around Buenos Aires