Udacity „Nanodegree Data Analysis“ Journey

Background

I finished my Bachelor of Science in Medical Informatics in 2015 and have now worked for three years as an Application Developer / Data Engineer / Data Warehouse Developer at a University Hospital. My work focuses heavily on relational databases, data aggregation from clinical source systems and data integration into the clinical / research data warehouse. Before my Udacity journey I already had a strong skill set in MSSQL, ETL processes and tools (e.g. SSIS) and programming in .NET/C#. However, I had not yet written a single line of Python. Since the data warehouse I work with is used to pseudonymize/anonymize research data and later transfer it securely to internal or external research groups, I interact a lot with people working in data science.

The University Hospital itself does not write any Python code of its own at this time. However, to support a research project from its early stages through to the bedside and productive use, the hospital will need at least the basic knowledge to integrate the algorithms into the enterprise environment. It is of great interest to the hospital that a research project does not end with the submission of the research paper, but only once the results have been successfully implemented in a valuable software solution that provides reliable decision support to the medical staff.

At that time I was not able to reduce my workload to 80%, so I had to find a way to dive as deep as possible into the Python language without affecting my performance at work. That is when I started googling for self-paced training programs offering some kind of certificate that is recognized beyond the healthcare industry. I realized very quickly that it would have to be an online course, and my final aim was simply to get the best possible cost-benefit ratio.

… And then I found Udacity.

Udacity Nanodegrees In General

The headline of the homepage already sounds very promising:

Master in-demand skills with our online learning programs available anytime, from any device. Build and design amazing projects. Earn a valued credential. Launch your career in Data Science, Machine Learning, AI, Android, iOS, Marketing and more. Be in demand.

https://eu.udacity.com/nanodegree

Udacity is an online educational institution offering courses for many different (and even very specific) tech skills. Its infrastructure offers:

  • real-life projects from industry leaders
  • 1-on-1 mentorship
  • career coaching
  • contact with a huge network of students from all around the world

…and all of this at very affordable prices. The best thing about it is that you can start and study whenever you want (you just have to reach the milestones) and that it is recognized by some of the top companies worldwide like Google, IBM, Amazon and others. It teaches you not only the subject you were originally interested in but also self-discipline, perseverance and ambition.

Subscription

The subscription is very easy and just requires setting up a Udacity account (you can even log in with your Google or Facebook account). Sometimes you can find discount codes on other websites that give you 10% off. You can either pay monthly or make a one-time payment. The Data Analysis Nanodegree that I chose costs about 1000 Swiss Francs and included everything listed above for a period of 5 months. As I stated before, the learning is self-paced and you can also graduate in much less time. It just depends on how much effort you can invest per week. Even after graduation all the materials, exercises and projects stay available for the rest of this 5-month period.

After setting up your account you get access to your so-called „Classroom“, which is of course just virtual. In the „Program Home“ you see an overview of the status of all your projects and you can browse to the specific courses. Figure 1 shows a screenshot of what the classroom looks like:

Udacity „Classroom“

Courses & Syllabus

The core curriculum includes the following topics and subtopics:

  • Introduction Data Analysis
    • Anaconda, Jupyter Notebook etc.
    • Data Analysis Process
    • Programming Workflow
  • Practical Statistics (95% Theory, 5% Python)
    • Descriptive Statistics
    • Probability & Distributions
    • Bayes Rule, Central Limit Theorem
    • Confidence Intervals, Hypothesis Testing
    • A/B Testing
    • Regressions
  • Data Wrangling (Python)
    • Gathering
    • Assessing
    • Cleaning
  • Data Visualization (Python)
    • Univariate, bivariate and multivariate data exploration
    • Explanatory visualizations
    • Communication of findings

So the Nanodegree covers a huge range of theoretical and technical components. All content is presented in illustrative videos (all with subtitles) and summary pages, with many additional resources provided in each content section.

Most of the analysis work is done in Python. Some of the most common Python libraries, such as pandas, NumPy and Matplotlib, are used during the course.

Exercises & Projects

During every course Udacity provides a variety of exercises with instant feedback. You can work directly in the classroom in emulated Jupyter notebooks. You are free to skip an exercise, but you will then have a harder time fulfilling the requirements of the mandatory projects. The projects have no hard deadline, but if you have not finished all projects successfully by the end of the 5-month period, you will not get a certificate. The Nanodegree consists of seven projects, but two of them are purely organizational.

Project 1 – Explore Weather Trends

In this project I analyzed local and global temperature data and compared the temperature trends where I live to the overall global temperature trends. I first had to extract the datasets with simple SQL queries and store them as CSV files. After importing the files into pandas data frames I did some basic investigations such as line charts, rolling averages and linear regressions, and wrote down all my observations.

Rolling Average Temperature World vs. Bern
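
The rolling average mentioned above can be sketched in a few lines of pandas. This is only a minimal illustration and not the code from my notebook: the temperature values, the column names and the helper `add_rolling_mean` are made up for the example.

```python
import pandas as pd


def add_rolling_mean(df, column, window):
    """Return a copy of df with an extra '<column>_rolling' column."""
    result = df.copy()
    result[f"{column}_rolling"] = result[column].rolling(window=window).mean()
    return result


# Made-up yearly average temperatures in degrees Celsius (illustration only).
temps = pd.DataFrame(
    {
        "year": [2000, 2001, 2002, 2003, 2004, 2005],
        "global_avg_temp": [9.2, 9.4, 9.6, 9.5, 9.3, 9.7],
        "city_avg_temp": [7.0, 7.4, 7.1, 7.9, 7.2, 7.6],
    }
).set_index("year")

# A 3-year window smooths out year-to-year noise. The first two rows stay
# NaN because the window is not yet full.
smoothed = add_rolling_mean(temps, "global_avg_temp", window=3)
smoothed = add_rolling_mean(smoothed, "city_avg_temp", window=3)
print(smoothed)
```

A larger window gives a smoother line but hides short-term changes, so the window size is a trade-off that depends on the question being asked.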

Feel free to have a look at the Jupyter Notebook and the report in my GitHub repository: https://github.com/patrick-hirschi/udacity_data_analyst/tree/master/Project1

Project 2 – Investigate a Dataset

For this project Udacity provided a set of possible topics and I chose to explore country indicators. I downloaded the following CSV files directly from Gapminder (https://www.gapminder.org/data/):

  • population_total.csv : Total Population (data after 2010 is based on the medium estimates from UN population division). The dataset even contains projections for the future. These projections are cut off in this project because none of the other datasets contain values for future years.
  • population_density.csv : Population density (people per sq. km of land area)
  • life_expectancy_years.csv : Life expectancy (years) – average number of years a newborn child would live if current mortality patterns were to stay the same
  • income_per_capita.csv : Income per person (GDP/capita, PPP$ inflation-adjusted)
  • educational_attainment.csv : Educational attainment, at least completed primary school, population 25+ years total (%)
  • tuberculosis_incd.csv : Incidence of tuberculosis
  • hiv_incd.csv : Incidence of HIV

I wanted to explore the trends for different countries in these datasets for the three main indicators: Health, Economy/Education, Population. For each of these three main indicators I had corresponding datasets to look at.

Seaborn pairplot to explore the correlations between all country indicators
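
To compare several indicators, the per-indicator files first have to be brought into one table. The sketch below shows one possible way to do that with pandas. It assumes each file has a `country` column followed by one column per year, and it uses tiny made-up frames instead of the real downloads; the helper `wide_to_long` is hypothetical.

```python
import pandas as pd


def wide_to_long(df, value_name):
    """Reshape one-column-per-year data into one row per country and year."""
    long_df = df.melt(id_vars="country", var_name="year", value_name=value_name)
    long_df["year"] = long_df["year"].astype(int)
    return long_df


# Made-up stand-ins for two of the indicator files (illustration only).
life = pd.DataFrame(
    {"country": ["A", "B"], "2000": [70.0, 60.0], "2001": [71.0, 61.0]}
)
income = pd.DataFrame(
    {"country": ["A", "B"], "2000": [30000, 5000], "2001": [31000, 5200]}
)

# Join the indicators on country and year so every row describes one
# country in one year.
indicators = wide_to_long(life, "life_expectancy").merge(
    wide_to_long(income, "income_per_capita"),
    on=["country", "year"],
    how="inner",
)
print(indicators)
print(indicators[["life_expectancy", "income_per_capita"]].corr())
```

A combined table like this could then be passed to `seaborn.pairplot` to draw all pairwise scatter plots at once.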

Again, feel free to have a look at everything on GitHub: https://github.com/patrick-hirschi/udacity_data_analyst/tree/master/Project2

Project 3 – Analyze A / B Test Results

Project 3 was all about understanding the results of an A/B test run by an e-commerce website. The goal was to work through a prefilled workbook to help a company understand whether they should implement the new page, keep the old page, or perhaps run the experiment longer before making their decision.

A/B Testing Logistic Regression
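
To give an idea of the kind of question an A/B test answers, here is a small sketch of a one-sided two-proportion z-test using only the Python standard library. It is a simpler stand-in and not necessarily the method used in the project workbook. The conversion counts and the function name `two_proportion_z_test` are invented for the example.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_old, n_old, conv_new, n_new):
    """One-sided z-test: does the new page convert better than the old one?"""
    p_old = conv_old / n_old
    p_new = conv_new / n_new
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_old + conv_new) / (n_old + n_new)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / std_err
    # P(Z >= z) for a standard normal variable.
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value


# Invented numbers: 10,000 visitors per page.
z, p = two_proportion_z_test(conv_old=1200, n_old=10000, conv_new=1180, n_new=10000)
print(f"z = {z:.3f}, one-sided p-value = {p:.3f}")
```

With a large p-value there is no evidence that the new page converts better, which would speak for keeping the old page or running the experiment longer.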

Feel free to have a look at everything on GitHub: https://github.com/patrick-hirschi/udacity_data_analyst/tree/master/Project3

Project 4 – Optimize your GitHub Profile

Project 4 was to create a GitHub repository for all the projects of this Nanodegree. The effort to do this was minimal. After setting up an account on GitHub, one only had to create the repository, install git and run a set of git commands in PowerShell to load everything into the repository.

GitHub Logo (https://github.com/)

The relevant commands, run in PowerShell, were:

git init
git add .
git commit -m "Project Submission Files"
git remote add origin https://github.com/patrick-hirschi/udacity_data_analyst.git
git push -u origin master

Project 5 – Wrangle and Analyze Data

Project 5 was the most intense of all projects. It covered all aspects of the data analysis process. After everything was successfully implemented, some additional documentation work remained. This resulted in two separate reports (act report and wrangle report). The subject of the project was the Twitter account WeRateDogs.

The account WeRateDogs from user @dog_rates has 8.1 million followers (June 2019) and its own website, weratedogs.com. It rates its users' dogs. The ratings are unusual: the owner of the account states that every dog deserves a 10/10, so a rating of 15/10 for a specific dog is entirely possible.

After a long wrangling process I ran three different analyses with visualizations in a Python Jupyter notebook. The resulting figures were stored as PNG files and described in the act report.

The detailed tasks were as follows:

  • Data wrangling, which consists of:
    • Gathering data (file download, Twitter API call)
    • Assessing data
    • Cleaning data
  • Storing, analyzing, and visualizing the wrangled data
  • Reporting on 1) data wrangling efforts in the wrangle report and 2) the data analyses and visualizations in the act report

Using the Twitter API to get the relevant project data required some additional steps. These included setting up a Twitter developer account and registering an application for the project.
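
As one example of what the cleaning step listed above can involve, the following sketch pulls a rating such as 15/10 out of a tweet text with a regular expression. It is an illustration only: the sample texts, the pattern and the function `extract_rating` are assumptions and not the code from my notebook.

```python
import re

# Numerator (optionally with decimals), a slash, then the denominator.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) of the first rating found, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Invented sample texts (illustration only).
print(extract_rating("This is Bella. 15/10 would pet"))
print(extract_rating("A very good dog. 13.5/10"))
print(extract_rating("No rating in this one"))
```

A pattern this simple also matches other fractions in the text, such as "24/7", so the extracted values still need to be assessed before they are trusted.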

GitHub: https://github.com/patrick-hirschi/udacity_data_analyst/tree/master/Project4

Project 6 – Improve your LinkedIn Profile

Just like Project 4, Project 6 was not a huge effort. Udacity provides a service that checks your LinkedIn profile and gives you detailed, qualified feedback with many recommendations for improvement.

You cannot fail this project, but it is in your own interest to have your social profiles look as professional as possible. The downside of this project is that the profile has to be in English, so if you do not have an English profile yet, the project might turn into a bigger effort than initially expected.

Project 7 – Communicate Data Findings

Project 7, the last project, focused on data visualization. The wrangling effort was minimal and most of the work was in the analysis.

I chose the PISA 2012 dataset.

PISA 2012 is the programme’s 5th survey. It assessed the competencies of 15-year-olds in reading, mathematics and science (with a focus on mathematics) in 65 countries and economies. In 44 of those countries and economies about 85 000 students also took part in an optional assessment of creative problem solving; and in 18 countries and economies, students were assessed in financial literacy.
Around 510 000 students between the ages of 15 years 3 months and 16 years 2 months participated in PISA 2012 as a whole representing about 28 million 15-year-olds globally.
The students took a paper-based test that lasted 2 hours. The tests were a mixture of open-ended and multiple-choice questions that were organised in groups based on a passage setting out a real-life situation. A total of about 390 minutes of test items were covered. Students took different combinations of different tests. They and their school principals also answered questionnaires to provide information about the students‘ backgrounds, schools and learning experiences and about the broader school system and learning environment.

OECD Homepage: http://www.oecd.org/pisa/aboutpisa/pisa-2012-results.htm

There were two main parts in the analysis section.

The first part was the exploratory data analysis, in which many of the Python data science and data visualization libraries were used to explore the dataset’s variables and understand the data’s structure, oddities, patterns and relationships. The exploration contained univariate, bivariate and multivariate analyses.

The second part was all about taking the main findings from the exploration and conveying them to others through an explanatory analysis. I had to create a slide deck that uses polished, explanatory visualizations to communicate my results.

Multivariate pairplot
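
The difference between univariate, bivariate and multivariate exploration can be shown on a tiny table with pandas. The sketch below uses invented scores and invented column names, not the real PISA variable names or values.

```python
import pandas as pd

# Invented student scores (illustration only).
students = pd.DataFrame(
    {
        "country": ["CH", "CH", "CH", "FI", "FI", "FI"],
        "gender": ["F", "M", "F", "M", "F", "M"],
        "math": [530, 540, 510, 520, 515, 525],
        "reading": [520, 500, 515, 510, 530, 495],
    }
)

# Univariate: the distribution of a single variable.
math_summary = students["math"].describe()

# Bivariate: the relationship between two numeric variables.
math_reading_corr = students["math"].corr(students["reading"])

# Multivariate: one variable broken down by two others.
mean_math = students.pivot_table(
    values="math", index="country", columns="gender", aggfunc="mean"
)

print(math_summary)
print(math_reading_corr)
print(mean_math)
```

In a notebook each of these steps would usually be paired with a plot, for example a histogram, a scatter plot and a grouped bar chart.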

At the very end I also had to include all analyses and results in a README.md file.

GitHub: https://github.com/patrick-hirschi/udacity_data_analyst/tree/master/Project5

Graduation

FINALLY! :-)

After completing all projects you instantly get access to the last button of the course, named „GRADUATE ->“. To get the certificate you need a government-issued identification (ID / passport) and a webcam to take a selfie. After your identity has been checked you get a link to access your certificate and also a summary of the syllabus.

Udacity Certificate

Conclusion

I can really recommend this course to anyone who is already working in an IT department and wants to get familiar with the Python programming language. It helps a lot to be something of a data geek who loves exploring complex data structures, analyzing them and finally summarizing the findings in illustrative figures. You do not need any experience in Python, but it helps a lot if you know another programming language and the basics of descriptive statistics (neither is mandatory). I was able to keep working at 100% during the whole course. That requires a lot of self-discipline and perseverance, and sometimes sacrificing weekends or weekday evenings.
