Week 2 - Running My First Program

June 18, 2020

I chose Python to perform the data analysis. The variables I selected from the Gapminder data set are continuous numeric measurements rather than categorical ones, and each value is unique to its country. As a result, running the frequency code shown in the video gave a count of 1 for every value, along with the same percentage for each. An example of this is shown in the following 2 images.
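To illustrate why every count came out as 1, here is a minimal sketch using a small made-up data frame. The country names and values are hypothetical and are not taken from the Gapminder file. Because each measured value occurs only once, the frequency count is 1 for every value and every percentage is identical.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one unique value per country
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "co2emissions": [132000.0, 5872119000.0, 248358000.0, 2932108667.0],
})

# Frequency counts and proportions of the raw, unbinned values
counts = toy["co2emissions"].value_counts(sort=False)
percentages = toy["co2emissions"].value_counts(sort=False, normalize=True)

print(counts)       # every value appears exactly once
print(percentages)  # every value has the same share of the total
```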

After looking through the discussion forum, I decided to try one of the suggested solutions, which was to use pd.cut to create 'bucket categories' for each variable. The code shown below splits the values of the CO2 emissions variable into 4 equal-width intervals and labels each interval.

co2 = pd.cut(mydata["co2emissions"],4,labels=['low emissions','low-mid emissions','high-mid emissions','high emissions'])

This was done for each variable and then I adjusted the code to work with the newly created grouped variables (co2, elec and urban) to obtain the frequency distributions.
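A rough sketch of that step, using made-up numbers in place of the real file, might look like the following. The data frame toy and its values are assumptions for illustration only; the column name and labels follow the pd.cut line above.

```python
import pandas as pd

# Hypothetical values; None stands for a country with no data
toy = pd.DataFrame({"co2emissions": [1.0, 2.0, 3.0, 4.0, 50.0, 100.0, None]})

labels = ['low emissions', 'low-mid emissions',
          'high-mid emissions', 'high emissions']

# Split the observed range into 4 equal-width intervals and label them
co2 = pd.cut(toy["co2emissions"], 4, labels=labels)

# Counts and percentages per interval (missing values are left out)
co2_counts = co2.value_counts(sort=False)
co2_percent = co2.value_counts(sort=False, normalize=True) * 100

print(co2_counts)
print(co2_percent)
```

With skewed values like these, most observations land in the lowest interval, because the interval widths are set by the full range of the data rather than by how many values fall in each one.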

The Program

The blacked-out area is the file path on my PC.

Frequency Tables

Description of Frequency Distribution

The data set covers 213 countries and records each country's cumulative CO2 emissions, residential electricity use and urban rate. The total of 213 was obtained by running print(len(mydata)).
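The number of rows and the number of countries with usable data for a given variable are not the same thing, which is why the per-variable totals quoted in this post are smaller than 213. A small sketch with made-up values shows the difference. The frame toy and the column name urbanrate are assumptions here, and the sketch assumes missing entries are already stored as NaN.

```python
import pandas as pd

# Hypothetical data with some missing entries
toy = pd.DataFrame({
    "co2emissions": [1.0, None, 3.0, 4.0],
    "urbanrate": [46.7, 88.9, None, None],
})

n_rows = len(toy)        # counts every row, missing values or not
n_valid = toy.count()    # counts non-missing values in each column

print(n_rows)
print(n_valid)
```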

With regard to CO2 emissions, data was available for 200 countries. Of those 200, the vast majority (99%) fell into the low emissions category, while the remaining 1% was evenly split between the low-mid and high categories.

For electricity consumption in 2008, data was available for 136 countries. Of those 136, the majority (89.71%) fell into the low usage category, while the smallest share (0.74%) was in the high usage category. Additionally, 7.35% were in the lower-middle usage range and 2.21% were in the higher-middle range.

Data on the urban rate in 2008 was available for 203 countries. Of those 203, the higher-middle urbanisation category was the largest, with 33.50% of the countries falling into it. This was followed by the lower-middle category, to which 25.12% of countries belonged. Finally, the low and high urbanisation categories had the lowest frequency, at 20.69% each.

One limitation I faced with this exercise was that, as a beginner to Python, I did not know how to obtain the numerical values which bounded the intervals. Therefore I was only able to discuss the distribution in terms of the ranges.
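One possible way around this, offered as a sketch rather than as part of the original analysis, is the retbins argument of pd.cut, which returns the interval edges alongside the binned data. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical values standing in for the real column
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 50.0, 100.0], name="co2emissions")

labels = ['low emissions', 'low-mid emissions',
          'high-mid emissions', 'high emissions']

# retbins=True returns the bin edges as a second result
co2, edges = pd.cut(toy, 4, labels=labels, retbins=True)
print(edges)  # five edges bounding the four intervals

# Leaving out labels keeps the intervals themselves as the category names
intervals = pd.cut(toy, 4).cat.categories
print(intervals)
```

Note that pd.cut lowers the first edge slightly below the smallest value so that the minimum is included in the first interval.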
