Identify data gaps

What to expect

Data gaps form when quality data that is critical to formulating effective policies for the citizens is not readily available.

For instance, there may be a pressing need to address educational levels in your country. The educational outcomes across different geographical regions (counties) in the country vary drastically, and it may be intuitive for you to start analysing the data there. However, are you considering data on other important aspects that impact education levels - such as residence neighbourhood, socio-economic status, gender or specific cultural aspects in these regions? Are there any political and historical factors that are also shaping citizen preferences? Do you have quality data that can help you make effective, evidence-based decisions rooted in these granular aspects?

Data gaps lead to missed opportunities for creating effective policies as well as an inability to properly attribute impact on the ground. People who are most at risk of being left behind are the ones most affected by data gaps, since they are the most likely to be under-represented or missed in the data. This section addresses how you can identify the type(s) of data gaps specific to your problem that are preventing your policies from reaching their full potential. This is a crucial transition point from analysing the data problem ‘as is’ to drawing up a plan for bridging the gaps.

How to get started

A good way to start is by looking at the data gaps from the ‘source’ or at the initial stages from where the data is being produced. This typically involves looking carefully at the description of the data set being used including raw data structure, design of survey questions (in the case of census or other national data sets) or a general analysis provided by the data producer.

Do you believe the data you are looking for is entirely missing?

Good to know: National statistical offices (NSOs) often hold more data than they disseminate. Talk with the relevant officials at your NSO before concluding that a data gap exists, to avoid duplicating solutions.

Commonly seen types of data gaps range from incomplete data and a lack of timeliness to inadequate data coverage for policy decisions and unreliable data flows. Below are some attributes of data to keep in mind while identifying and classifying your data gaps.

Classifying your data gaps:

Unavailable and incomplete data

Unavailable or incomplete data is often the underlying reason why you are not yet able to leverage data for effective policy design. You have probably identified this unavailability at the ‘problem definition’ or ‘map the data ecosystem’ stage. For example, you want to measure the impact of climate change on natural resources to formulate or strengthen your climate policy. However, historical data for annual mean temperatures, precipitation values or forest water balance is not easily available. Can you formulate an effective climate policy that works for all regional contexts in your country with little or no information on the impacts of climate change in specific regions over the years?

Data completeness refers to the comprehensiveness of data in terms of the attributes it covers relevant to the problem in question. Identifying critical data that must be complete for addressing your problem and matching it against what is available is a helpful step.

For example, while evaluating how Australian youth are faring, the Australian Institute of Health and Welfare assembled a wide range of administrative and survey data. However, during the analysis, it became clear that most surveys were administered only to people aged 15 and over or 18 and over, and therefore data for ages 12-15 was missing. Any policy change would need additional evidence on this age group to rest on a full picture and be truly impactful.
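A completeness check of this kind can be sketched in a few lines of Python. This is only an illustration: the function name, the target age range of 12 to 24 and the survey age limits are assumptions made for the example, not figures from the Australian study.

```python
def uncovered_ages(required_ages, survey_age_ranges):
    """Return the required ages that no survey covers."""
    covered = set()
    for low, high in survey_age_ranges:
        # Each survey covers every age from low to high, inclusive.
        covered.update(range(low, high + 1))
    return sorted(set(required_ages) - covered)


# Hypothetical inputs: a youth policy aimed at ages 12 to 24, and two
# surveys that only sample respondents aged 15+ and 18+ (capped at 24 here).
required = range(12, 25)
surveys = [(15, 24), (18, 24)]

missing = uncovered_ages(required, surveys)
print(missing)  # [12, 13, 14]
```

Listing the ages you need before looking at what the surveys cover makes the gap explicit, rather than leaving it to be discovered during analysis.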

Timeliness and periodicity

Data timeliness issues arise from the potential lag between the moment a data point is collected and the time it is used in your decisions. Most federal policies draw on data from the national census and from economic and health surveys, among others. While these data sources provide the national coverage that is important for decision-making, in most countries this data is collected only every ten years. Administering these surveys is very expensive, and increasing their frequency is not feasible. However, policies must account for the fact that data collected ten years ago may no longer reflect present conditions, even when adjusted with projections, and especially in the post-COVID era.

Women-owned enterprises accounted for about 25 per cent of total enterprises in Viet Nam as of 2019. The Ministry of Planning and Investment (MPI) of the Government of Viet Nam is undertaking a number of initiatives to strengthen women’s entrepreneurship in the country. However, the MPI also recognizes that COVID-19 had severe implications for women entrepreneurs, affecting their overall access to employment and entrepreneurship opportunities and significantly altering the entrepreneurship landscape (and perhaps even the numbers). A critical focus of the MPI has therefore been to develop fresh evidence and base its policies on the current context.
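One way to make timeliness concrete is to record when each data source was collected and compare its age with the maximum age you consider acceptable for the decision at hand. The sketch below assumes hypothetical source names, collection dates and thresholds; suitable thresholds depend on your own policy context.

```python
from datetime import date


def data_age_years(collected_on, as_of):
    """Whole years elapsed between data collection and its use."""
    years = as_of.year - collected_on.year
    # Subtract one if the anniversary has not yet been reached this year.
    if (as_of.month, as_of.day) < (collected_on.month, collected_on.day):
        years -= 1
    return years


def is_stale(collected_on, as_of, max_age_years):
    """True when the data is older than the acceptable maximum age."""
    return data_age_years(collected_on, as_of) > max_age_years


# Hypothetical inventory: (name, collection date, acceptable age in years).
sources = [
    ("population census", date(2011, 3, 1), 10),
    ("enterprise survey", date(2019, 12, 31), 2),
    ("school enrolment records", date(2023, 1, 15), 1),
]
as_of = date(2023, 6, 30)

stale = [name for name, collected, max_age in sources
         if is_stale(collected, as_of, max_age)]
print(stale)
```

A fast-changing topic, such as enterprise activity after a major shock, would typically warrant a much shorter acceptable age than a slow-moving one.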

Accuracy

Data accuracy is the degree to which data represents the real-world scenario and conforms to a verifiable source, i.e., the consistency of data with reality. Accurate data is essential for forecasting, planning, programme budgeting and strategy development in governments. Inaccurate data, on the other hand, can lead to wrong decisions and serious unintended consequences. For example, education data typically involves data compiled from school districts on graduation rates, drop-out rates, test score averages and attendance rates. Education data is often used to measure the success of a state or a school district, and policies are evaluated and redesigned based on it. But there is a problem: this information is not always reliable, and the fault lies in the way the data is collected (data entry), compiled and presented.

Tips for ensuring data accuracy

Gather data from the right sources, and vet external sources. For more information, see the next section on “Identify data sources”.
Adopt effective data entry practices to minimize mistakes (ensure fields are clear, provide drop-down options as much as possible, automate data entry where possible)
Regulate who can access and manipulate your data
Ensure data is reviewed and cleaned by experts
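The data entry tip can be illustrated with a small validation routine that mimics what drop-down options and clear fields enforce at the point of entry. This is a minimal sketch, assuming hypothetical field names, region options and percentage ranges; it does not describe any particular data entry system.

```python
# Hypothetical drop-down options for the "region" field.
ALLOWED_REGIONS = {"north", "south", "east", "west"}


def validate_record(record):
    """Return a list of problems found in one school-district record."""
    problems = []
    if record.get("region") not in ALLOWED_REGIONS:
        problems.append("region is not one of the allowed options")
    for field in ("graduation_rate", "attendance_rate"):
        value = record.get(field)
        # Rates are expected as percentages between 0 and 100.
        if not isinstance(value, (int, float)) or not 0 <= value <= 100:
            problems.append(f"{field} must be a number between 0 and 100")
    return problems


good = {"region": "north", "graduation_rate": 88.5, "attendance_rate": 93}
bad = {"region": "Nroth", "graduation_rate": 885, "attendance_rate": None}

print(validate_record(good))
print(validate_record(bad))
```

Checks like these catch typing mistakes and missing values before they reach the compiled data set, where they are much harder to trace.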

Granularity and quality

The data you work with needs to be detailed, granular and disaggregated for the conditions of different sectors of society to be understood, for example, showing:

Sex
Age
Ethnicity
Disability
Geographic areas
Religion
Income levels

Not all of these details are equally relevant to every issue. Addressing education levels in a country may demand different levels of data disaggregation than addressing agricultural productivity. Similarly, other factors, such as mobile phone ownership and bank account ownership, are increasingly important for understanding the context of present-day problems such as multidimensional poverty. Even more important is how these factors, taken together, can cause certain populations to be missed completely. For example, data from certain tribes in your country might not be easily available. Even in the data you have, you may be missing representation of women or other sexes within these tribes. How then could you make your policies and programmes work for the entire population? A lack of granularity in any of these aspects may create blind spots in your work.
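To see how combined factors can hide a population, it can help to list every combination of the disaggregation categories you expect and check which ones never appear in the data. The sketch below uses hypothetical categories and records purely for illustration.

```python
from itertools import product


def missing_groups(records, expected_values):
    """List expected category combinations with no records at all."""
    dimensions = list(expected_values)
    # Combinations that actually occur in the data.
    observed = {tuple(r[d] for d in dimensions) for r in records}
    # Every combination that should occur, given the expected categories.
    expected = product(*(expected_values[d] for d in dimensions))
    return [combo for combo in expected if combo not in observed]


# Hypothetical expected categories and survey records.
expected_values = {
    "sex": ["female", "male"],
    "region": ["coastal", "highland"],
}
records = [
    {"sex": "female", "region": "coastal"},
    {"sex": "male", "region": "coastal"},
    {"sex": "male", "region": "highland"},
]

gaps = missing_groups(records, expected_values)
print(gaps)  # [('female', 'highland')]
```

Each dimension looks complete on its own here, since both sexes and both regions appear; only the combined view shows that women in the highland region are absent.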

While evaluating how Australian youth are faring, the Australian Institute of Health and Welfare faced some critical data gaps: data on certain groups of young people was entirely missing:

young people of refugee and asylum seeker families
young people from culturally and linguistically diverse (CALD) backgrounds or children born overseas
young people living in, or who have lived in, out-of-home care
incarcerated young people
young people with disability
young people who identify as lesbian, gay, bisexual, trans and gender diverse, or young people who have intersex variations

These gaps limit the ability of governments to develop and implement targeted policy interventions for vulnerable populations.

While data gaps take many shapes and forms, recent years have brought into the spotlight the gender data gap: women are consistently under-represented or overlooked in data ecosystems, which has led to a lack of knowledge about their living conditions. Acknowledging this challenge, UN Women’s global gender data programme, Women Count, in collaboration with the Inter-Secretariat Working Group on Household Surveys (ISWGHS), has produced Counted and Visible: Toolkit to Better Utilize Existing Data from Household Surveys to Generate Disaggregated Gender Statistics. This resource may help you to bridge the gender data gap.

The Counted and Visible Toolkit provides recommendations and practical country examples on how to utilize existing data to generate disaggregated gender statistics.

Granularity of data is one of the biggest contributors to ensuring the quality and reliability of data. For more information on frameworks for ensuring data quality, see the data sources and reliability section.

How will I know I have successfully identified the data gaps?

Different aspects of your problem will have different levels of data maturity, so some gaps are more easily identifiable than others. However, one key test before moving forward in the process is to answer the question ‘Will I have all/most of the data I need to solve my problem if I am able to access the data identified?’ Sub-parts to this question may look like this:

Do I know what data I need to answer my defined problem statements?
Do I know what additional data would complete my information on my issue?
Is the most recent data available still useful for problem solving?

What's next?

Once you have identified your data gaps, the next step is to understand the feasibility of bridging them, given the limited resources and competing priorities you may have. Classifying your data gaps may be a good way to understand this feasibility.

Source: 'For Good Measure' - data gaps in the big world

The next section will take you through a number of resources, recommendations and examples on different types of data sources and how reliable data can be accessed, collected and used. However, at this stage, there are already a few resources that can be used to dive deep into your data gaps.

ADAPT helps National Statistical Offices (NSOs) and other data producers and users within the country to plan effectively for the data you require and to continuously monitor progress. Developed by PARIS21, ADAPT is a free, cloud-hosted, multilingual and consultative data planning tool that helps data producers adapt their data production to priority data needs. It promotes the reuse of data and the quality assessment of data sources. Additionally, it reinforces a co-ordinated data infrastructure in a national or regional context. ADAPT enables detailed data demand and supply analysis: the mismatch between required and available indicators is reflected as data gaps in ADAPT. While the tool supports various types of decision-making, this is its most relevant use for you at this stage, and you may consider partnering with your counterparts at the NSO to adopt it in critical policy planning.
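The idea behind a demand and supply analysis can be sketched as a simple comparison of two indicator lists. The code below is not ADAPT’s interface or data model; the function and the indicator names are hypothetical and only illustrate how a mismatch between required and available indicators surfaces as a list of gaps.

```python
def indicator_gaps(required, available):
    """Compare required (demand) and available (supply) indicators."""
    required, available = set(required), set(available)
    return {
        "gaps": sorted(required - available),     # needed, not produced
        "covered": sorted(required & available),  # needed and produced
        "unused": sorted(available - required),   # produced, not requested
    }


# Hypothetical indicator lists for an education policy.
required = ["attendance rate", "drop-out rate by sex", "graduation rate"]
available = ["attendance rate", "graduation rate", "teacher headcount"]

result = indicator_gaps(required, available)
print(result["gaps"])  # ['drop-out rate by sex']
```

The "unused" list is worth a look as well, since data that is already produced but not yet requested may partly answer a need you have not formulated.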

The tool takes you through the entire National Strategy for the Development of Statistics (NSDS) lifecycle: assessing, identifying, elaborating, executing and monitoring strategies backed by data. For a more comprehensive overview of how ADAPT can be used, see the user manual.

Note: If your experience in integrating data into policymaking is relatively limited, we recommend referring to the tool for step 1 (assessing) and step 2 (identifying) only.

ADAPT IN PRACTICE IN PARAGUAY

Paraguay: The National Institute of Statistics (INE) in Paraguay has been using the tool to get a clearer picture of the actors within the Paraguayan statistical system (SISEN) and of the data sources at its disposal. With ADAPT, 79 agencies in SISEN have been identified as part of the production chain or as data users. The tool has also helped identify and manage a large inventory of data that can be used for monitoring national development and producing official statistics. Currently, there are data from 311 sources, including 116 censuses and surveys, 149 administrative databases and 40 mixed data sources, all registered and specified in ADAPT.

ADAPT was used in planning the Paraguay National Strategy for Development of Statistics (NSDS). It highlighted the shortcomings and inconsistencies of public policies by showing that there are no follow-up tools to monitor whether an objective has been met. For example, half of public policies do not have indicators, even though indicators provide the information that policymakers rely on to improve their subsequent work. In this way, ADAPT has been instrumental in improving INE’s legal, institutional and strategic framework.