COVID-19 Vaccination Distribution & Trend


COVID-19, a pandemic that first emerged in 2019, has reached 193 countries as of today, with a death count of 2.96 million and 137 million reported cases. As an event of the century that has dramatically changed life for most people around the world, it has drawn global attention to the progress of vaccination, as well as to what actually drives the spread of the pandemic.

At a time when vaccines have started to roll out around the globe, we wanted to find out which nation-level factors might correlate with COVID-19 vaccination. While bigger nations such as the US are beginning to roll out intensive vaccination programs, other countries are struggling to secure a stable source of vaccines. For example, various media have reported that the U.S. is doing better than European countries with COVID-19 vaccinations. If we can identify the national factors behind the difference in vaccination rates, those aspects of a national profile can be taken into consideration to understand, and perhaps improve, the vaccination strategies of countries that are comparatively underperforming. We are interested in the significant divergence in vaccine distribution and vaccination rates between different countries.

We would like to answer the following questions in order to highlight these factors:

  • How many vaccines have been distributed in each country? What are the vaccination rates in each country?
  • Is there a correlation between the geolocation, population of a country, and vaccination distribution?
  • Is there a correlation between the economic and social indicators of a country and the vaccination progress?

Data Sources

The first dataset is named Country Statistics — UN Data. This dataset came as two CSV files, country_profile_variables.csv and kiva_country_profile_variables.csv. Combined, the dataset provides profiles for 86 + 229 = 315 countries, with over 50 features for each. The files were loaded into an SQL database for use. We used all usable numerical features of the dataset, a total of 15 columns. There was no time period associated with the entries in this dataset.

The second dataset is named COVID-19 World Vaccination Progress. This dataset also came as two CSV files, country_vaccinations.csv and country_vaccinations_by_manufacturer.csv. These two CSVs were loaded into the SQL database and contain different types of entries. The first CSV, country_vaccinations.csv, has entries for countries ranging from December 2020 to April 2021, listing the vaccination trends for each day. The second CSV, country_vaccinations_by_manufacturer.csv, lists the daily trends broken down by vaccine manufacturer over a similar time period as the first CSV. The variables used are dates, total vaccinations, country name, and daily vaccinations.

Data Processing

When preparing the data for analyzing vaccination distribution, we found that a lot of the data was missing. Since the data are sourced from each country's public health sector, the date on which each country began to record its vaccination progress, and the quality of the records, vary greatly. Since we are mainly focused on the total number of doses administered, we replaced all the missing records with 0 and used the maximum recorded value per country for this part of the calculation. We are also interested in how the vaccination progress differed between continents. However, with the given attribute “region”, we were not able to categorize the countries into their respective continents. To resolve this problem, we introduced a Python module named “pycountry_convert”, which provides functions that help us identify the continent that each country is in.
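The fill-and-maximum step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not our actual pipeline; the sample rows and the helper name total_doses_per_country are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (country, date, total_vaccinations).
# None marks a missing record in the source data.
rows = [
    ("Angola", "2021-03-10", 6169),
    ("Angola", "2021-03-11", None),
    ("Angola", "2021-03-12", 49000),
    ("Fiji", "2021-03-10", None),
]


def total_doses_per_country(rows):
    """Replace missing totals with 0, then keep the maximum per country."""
    totals = defaultdict(int)
    for country, _date, total in rows:
        value = 0 if total is None else total
        totals[country] = max(totals[country], value)
    return dict(totals)


totals = total_doses_per_country(rows)
```

Because total vaccinations are cumulative, the maximum is the latest reliable count, and a gap filled with 0 can never lower it.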

For the correlation analysis, we had to resolve several inconsistencies between the datasets, and even between entries within the same dataset.

The first inconsistency involved missing dates. While the data spanned about 120 days, most countries either had no values for some days or lacked the rows for those dates altogether. Therefore, we created a new data frame indexed by the unique dates and populated it by going through all countries and dates.
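One way to picture the date-indexed frame is the sketch below. It uses plain dictionaries rather than a data frame, and the sample records and the name build_date_grid are assumptions made for illustration.

```python
# Hypothetical sparse records: country -> {ISO date: daily vaccinations}.
records = {
    "A": {"2021-01-01": 100, "2021-01-03": 300},
    "B": {"2021-01-02": 50},
}


def build_date_grid(records):
    """Return (dates, grid) where every country has one slot per unique date."""
    # ISO-formatted date strings sort chronologically.
    dates = sorted({d for series in records.values() for d in series})
    grid = {
        country: [series.get(d) for d in dates]  # None where no record exists
        for country, series in records.items()
    }
    return dates, grid


dates, grid = build_date_grid(records)
```

Every country now shares the same date axis, and the gaps are explicit (None) instead of silently absent.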

The second inconsistency was that raw vaccination counts varied too much between countries. The larger a country's population, the more vaccinations it can be expected to administer. Therefore, we divided the number of vaccinations by the population, obtaining the daily vaccinations as a percentage of the population instead of a raw daily count.
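The normalization is a single division per value. A minimal sketch, where the function name to_percent_of_population and the numbers are hypothetical:

```python
def to_percent_of_population(daily_counts, population):
    """Convert daily vaccination counts to percentages of the population.

    Missing values (None) are passed through unchanged.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    return [
        None if count is None else 100.0 * count / population
        for count in daily_counts
    ]


# Hypothetical country with one million inhabitants.
percentages = to_percent_of_population([1000, None, 2500], 1_000_000)
```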

The third inconsistency was missing information for some of the profile variables. Even after the variables were converted to one unified ratio, some of them were missing for some countries. Therefore, to ensure that every country contributes equally to the vaccination trends and to the correlation between factors, we simply dropped the entries with missing profile information from the final result.
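Read as a complete-case filter, this step could look like the sketch below. This is one possible reading of the step, not necessarily what our code does; the variable names and values are hypothetical.

```python
# Hypothetical profiles: country -> {variable: value or None}.
profiles = {
    "A": {"gdp_per_capita": 4100.0, "urban_population_pct": 44.0},
    "B": {"gdp_per_capita": None, "urban_population_pct": 71.0},
    "C": {"gdp_per_capita": 39000.0, "urban_population_pct": 82.0},
}


def drop_incomplete(profiles, variables):
    """Keep only the countries that have a value for every listed variable."""
    return {
        country: profile
        for country, profile in profiles.items()
        if all(profile.get(v) is not None for v in variables)
    }


complete = drop_incomplete(profiles, ["gdp_per_capita", "urban_population_pct"])
```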

Analysis and Visualization

Using the data from Our World in Data, we have vaccination data for 176 countries. Here, we can see the statistical summary for the dataset:

While the average number of vaccines administered per country is 4.8 million, the standard deviation is extremely high at 21 million. This indicates that the total number of vaccines administered differs drastically from one country to another. The country with the highest number of vaccines administered is the United States, which has already administered around 189 million. On the other hand, the country with the fewest vaccines administered is Papua New Guinea, which has only 250 doses administered. The map below shows the number of vaccines administered per country.
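To see why a few very large countries push the standard deviation above the mean, consider the small sketch below. Only the 250 and 189 million figures come from the text; the two middle values are made up for illustration.

```python
from statistics import mean, stdev

# 250 (Papua New Guinea) and 189 million (United States) are from the text;
# the two values in between are hypothetical.
doses = [250, 1_000_000, 2_000_000, 189_000_000]

average = mean(doses)
spread = stdev(doses)  # sample standard deviation
```

Here a single very large value pulls the average far above the typical country and makes the spread larger than the average itself, which is the same pattern the summary statistics show.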

The second map shows the number of people fully vaccinated per hundred people in each country. From the two maps, we can see that the number of total vaccinations does not directly correlate with the vaccination progress. While the number of total vaccinations in the US and India is significantly higher, many countries in Europe and South America have the same or higher vaccination rates than these two countries. Looking at the data on the country level, we can see the top ten countries with the most people vaccinated per hundred people. While 40% of these countries are located in Europe, 30% are located in Asia. The rest are located in South America and Africa. The factors behind the differences will be discussed later in the correlation section.

On the continent level, we can see that Oceania has a significantly lower rate when it comes to the total number of vaccinations administered per hundred people. While all other continents have a rate greater than 5.1, Oceania has only 1.69 doses administered per hundred people. Based on our research and analysis, there are two potential factors that might contribute to this phenomenon. Firstly, the two main countries in Oceania, New Zealand and Australia, managed the pandemic well and kept the rate of new cases low. Since the threat of the pandemic is relatively lower than in other countries, citizens may not feel the same urgency to be vaccinated. Secondly, after new advice to pause AstraZeneca vaccinations because of their unclear link to severe blood clots, the Australian government abandoned all of its COVID-19 vaccination targets. The absence of any pressing vaccination goals is a plausible factor behind the low vaccination rate in Oceania.
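A continent-level rate of doses per hundred people can be computed by pooling doses and population before dividing. The sketch below assumes that pooled definition, which is only one option (averaging the country rates is another). All names and numbers in it are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (country, continent, doses administered, population).
rows = [
    ("X1", "Oceania", 300_000, 20_000_000),
    ("X2", "Oceania", 100_000, 5_000_000),
    ("Y1", "Europe", 6_000_000, 60_000_000),
]


def doses_per_hundred_by_continent(rows):
    """Pool doses and population per continent, then compute doses per 100 people."""
    doses = defaultdict(int)
    people = defaultdict(int)
    for _country, continent, administered, population in rows:
        doses[continent] += administered
        people[continent] += population
    return {c: 100.0 * doses[c] / people[c] for c in doses}


rates = doses_per_hundred_by_continent(rows)
```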

Correlation analysis involved three main steps to generate an interpretable result.

The first step was, as described in the data processing section, to create a data frame that covers every date included in the dataset. Because our initial visualization and summary statistics exhibited too many outliers, we normalized the data by converting counts to percentages of each country's population.

As the visualization showed, the vaccination data did not have full information for all the dates present in the dataset. Moreover, no general trend could be read from the data in this form, as there was no uniformity or pattern across countries. We needed to find a way to account for this as well.

Therefore, as the second major step of the analysis, we fitted a linear regression for each country and used its slope, which we call the degree, to reduce the country's trend to a single variable.

An example of the linear regression fit for a single country (e.g. Angola):

Because we only look for the rate at which daily vaccination progressed, the date inconsistency was no longer a major problem at this point. While some countries were discarded for lack of sufficient information, we still had enough data to compute a correlation. Moreover, differences in start and end dates could also be ignored, as we are now only concerned with the degree, that is, the rate at which vaccinations changed over time given the information available in the dataset.
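The idea of reducing each country to a single degree can be sketched with an ordinary least-squares slope over the days that actually have data. This is an illustration under assumptions, not our exact code: the helper names fit_slope and vaccination_degree and the sample series are hypothetical.

```python
from datetime import date


def fit_slope(points):
    """Ordinary least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    return sxy / sxx


def vaccination_degree(series):
    """Slope of a normalized series {ISO date: percent or None}, per day.

    Days without data are skipped; x is the number of days since the
    country's first available record.
    """
    observed = sorted(
        (date.fromisoformat(d), v) for d, v in series.items() if v is not None
    )
    if not observed:
        raise ValueError("series has no observed values")
    start = observed[0][0]
    return fit_slope([((d - start).days, v) for d, v in observed])


# Hypothetical normalized series with gaps.
example = {
    "2021-03-01": 0.0,
    "2021-03-02": None,
    "2021-03-03": 0.2,
    "2021-03-05": 0.4,
}
degree = vaccination_degree(example)
```

Since x counts days from each country's own first record, different start and end dates do not affect the slope, and missing days simply contribute no points.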

Finally, we produced the final correlation coefficients for all countries against the country profile variables as shown below.
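For reference, a correlation coefficient between the per-country degrees and each profile variable can be computed and ranked as in the sketch below. It assumes the Pearson coefficient, and all data and variable names in it are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-country degrees and profile variables, in the same country order.
degrees = [0.1, 0.2, 0.3]
profile_columns = {
    "urban_population_pct": [10.0, 20.0, 30.0],
    "sex_ratio": [5.0, 1.0, 4.0],
}

# Rank variables from the strongest positive to the strongest negative correlation.
ranking = sorted(
    ((name, pearson(degrees, values)) for name, values in profile_columns.items()),
    key=lambda item: item[1],
    reverse=True,
)
```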

  • Is there a correlation between the population of a country and vaccination distribution?

Yes. Our results show a correlation with the urban population, but not with the overall population. With a correlation coefficient of 0.1, urban population showed the strongest positive correlation with the rate at which vaccines were distributed among people.

By contrast, the overall population of a country showed almost no correlation, with a coefficient of about 0.02 with the vaccination rate.

  • Is there a correlation between the economic and social indicators of a country and the vaccination progress?

For economic indicators, yes. Among the top 15 indicators, economic indicators including the industry share of the economy, GDP per capita, and services and other activity ranked among the most highly correlated.

For social indicators, not quite. Indicators including sex ratio and the political factors showed an insignificant amount of correlation. However, interestingly, the number of individuals using the Internet ranked 5th among all indicators by correlation with the vaccination rate. While this is not enough to conclude that social indicators influence vaccination progress, the relationship does have a plausible explanation.

Correlation Conclusion and Interpretation

Overall, the correlations were too weak to establish any definite relationship between the profile variables and how vaccinations are distributed in a country. However, among the profile variables, the results suggest that the urgency of the COVID-19 response plays a role in the distribution of the vaccine. As urban population, economic indicators such as GDP per capita, and internet usage are among the top correlated profile factors, the results suggest that technologically connected, comparatively rich countries where the majority of the population lives in an urban environment tend to want vaccination as soon as possible.

A possible explanation for the relationship is that the populations of these countries were able to see the effect of COVID-19 for themselves, as the virus likely spreads most easily in densely populated urban areas. Moreover, since people there are able to quickly educate themselves with reports and statistics online, these countries could have quickly felt the need for vaccinations.

Our Source Code


Country Statistics — UN Data

COVID-19 World Vaccination Progress

Our World in Data

Interactive Map of Vaccine Distribution (Credit to Jessica)