Learning Data Management on Coursera

Posts

Showing posts from June, 2020

Week 4 - Creating Graphs for Data

June 22, 2020

The data from the Gapminder data set consisted of quantitative variables, so methods appropriate for quantitative data were used to analyse the relationships between them.

Univariate Graphs

Univariate graphs were generated for each of the 3 variables: co2emissions, relectricperperson and urbanrate. All of the variables were quantitative, so histograms were generated to represent the distribution of the data. The program used is shown below, as well as the output graphs.

The Program

Univariate Graph for co2emissions

This graph is unimodal, with most observations having cumulative emissions around 0.1e11, i.e. 10,000,000,000 (10 billion) metric tons. The graph is also skewed right, with most observations having the lowest emissions. There is also an outlier at around 3.4e11 (340 billion) metric tons.

Univariate Graph for relectricperperson

This graph also seems to be unimodal and skewed right, with the peak at around 250 kWh and most observations having the lowest...
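As a rough illustration of the univariate step described above, the sketch below bins a quantitative variable the way a histogram would and checks the direction of the skew. It is not the program from the post: the data are randomly generated stand-ins rather than the Gapminder values, and the plot_histogram helper is a hypothetical name that assumes matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed values only (NOT the real Gapminder data).
rng = np.random.default_rng(0)
mydata = pd.DataFrame({
    "co2emissions": rng.lognormal(mean=20, sigma=2, size=200),
})

values = mydata["co2emissions"].dropna()

# Bin the values the way a histogram would: 10 equal-width bins.
counts, bin_edges = np.histogram(values, bins=10)

# Positive skewness indicates a long right tail (skewed right).
skewness = values.skew()

# The modal bin is where the histogram peaks.
peak_bin = int(np.argmax(counts))
print(counts, skewness, peak_bin)


def plot_histogram(series, title, xlabel):
    """Hypothetical helper: draw the histogram (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt  # imported lazily so the summary above runs without it

    fig, ax = plt.subplots()
    ax.hist(series, bins=10)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Number of countries")
    return fig
```

Calling plot_histogram(values, "co2emissions", "Cumulative CO2 emissions (metric tons)") would draw the graph itself. When most observations fall in the first bin and the skewness is positive, that matches the "unimodal, skewed right" reading given in the post.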

Week 3 - Making Data Decisions

June 21, 2020

For this assignment, due to the nature of the dataset chosen, the only data management that was necessary was coding out missing data. In fact, as I discovered, the software had already disregarded the missing data, so this was not a strictly necessary step. However, it was a learning moment and allowed me to verify what I had assumed in last week's assignment, i.e. that the total number of observations (countries) was 213, based on the length, or number of rows, of the dataset. In the following I will discuss how I determined whether this assumption was correct. Furthermore, I had unknowingly already applied the technique of grouping variables with pd.cut in last week's assignment so that I could demonstrate and use frequency tables.

Replacing the Blanks with 'NaN'

This was done as in the video with the following code (program variable for co2 used as the example):

c1 = c1.replace(r'^\s*$', np.nan, regex=True)

where c1 was the program variable assigne...
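The effect of coding out blanks can be sketched as follows. This is a minimal example under stated assumptions: the five-row table is made up for illustration, and the pd.to_numeric conversion and the three summary variables are additions that do not appear in the post. It shows why the number of rows and the number of usable observations can differ.

```python
import numpy as np
import pandas as pd

# Illustrative data only: blanks stand in for countries with no recorded value.
mydata = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "co2emissions": ["1.5e8", " ", "2.0e9", "", "7.0e10"],
})

c1 = mydata["co2emissions"]

# Replace empty or whitespace-only strings with NaN, as in the post.
c1 = c1.replace(r"^\s*$", np.nan, regex=True)

# Convert the remaining strings to numbers; NaN entries stay NaN.
c1 = pd.to_numeric(c1)

total_rows = len(c1)        # number of rows, including missing values
observed = c1.count()       # count() ignores NaN
missing = c1.isna().sum()   # number of values coded as missing
print(total_rows, observed, missing)
```

Here len() counts every row, while count() counts only the rows that hold a value, so comparing the two is one way to check an assumption about the number of observations.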

Week 2 - Running My First Program

June 18, 2020

I chose to use Python to perform the data analysis. The chosen variables from the Gapminder data set were quantitative rather than categorical, so carrying out the frequency programming as shown in the video gave a count of 1 for each value, as well as the same percentage for each value. An example of this is shown in the following 2 images. This was because the values for the chosen variables were unique to each country. After looking through the discussion forum, I decided to try one of the suggested solutions, which was to use pd.cut to create 'bucket categories' for each variable. The code shown below was used to create 4 intervals of the values of the CO2 emissions variable, and label each interval.

co2 = pd.cut(mydata["co2emissions"], 4, labels=['low emissions', 'low-mid emissions', 'high-mid emissions', 'high emissions'])

This was done for each variable and then I adjusted the code to work with the newly created grouped variables ( ...
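The difference between the raw and the grouped frequency tables can be sketched as below. The six values are invented for illustration and are not the Gapminder figures. The pd.cut call follows the one quoted in the post, while the value_counts calls are one assumed way of producing the counts and percentages.

```python
import pandas as pd

# Illustrative values only (NOT the real Gapminder figures).
mydata = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "co2emissions": [1.0e8, 5.0e8, 2.0e9, 9.0e9, 4.0e10, 1.0e11],
})

# Every value is unique, so a raw frequency table has a count of 1 per value.
raw_counts = mydata["co2emissions"].value_counts()

# Split the range of values into 4 equal-width intervals and label them.
labels = ["low emissions", "low-mid emissions",
          "high-mid emissions", "high emissions"]
co2 = pd.cut(mydata["co2emissions"], 4, labels=labels)

# Frequency counts and percentages for the grouped variable.
counts = co2.value_counts(sort=False)
percentages = co2.value_counts(sort=False, normalize=True) * 100
print(counts)
print(percentages)
```

Because pd.cut with an integer creates equal-width intervals, right-skewed data tends to pile up in the lowest bucket. pd.qcut, which creates equal-count buckets, is an alternative if more balanced groups are wanted.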

Week 1 - Getting the Research Project Started

June 17, 2020

Codebooks and Datasets

Selecting the Variables

As I am interested in pursuing a career in renewable energy and sustainability, one of my main areas of interest was carbon dioxide (CO2) emissions. From my studies, I know that some fuels used in the generation of electricity produce these emissions and therefore, I chose the following two variables from the Gapminder codebook:

1. co2emissions - the cumulative metric tons of co2 produced from 1751 to 2006
2. relectricperperson - the residential electricity consumption per person during 2008 in kWh

Additionally, increased urbanisation implies increased human activity which means increased amounts of carbon dioxide emission. Thus a third variable was selected from the dataset:

3. urbanrate - percentage of total population living in urban areas in 2008

Research Question

Are CO2 emissions associated with the levels of residential electricity consumption and urbanisation?

Hypothes...