Course Description
Data scientists enjoy one of the top-paying jobs, with an average salary of $120,000 according to Glassdoor and Indeed. That’s just the average! And it’s not just about money – it’s interesting work too!
If you’ve got some programming or scripting experience, this course will teach you the techniques used by real data scientists in the tech industry – and prepare you for a move into this hot career path. This comprehensive course includes 68 lectures spanning almost 9 hours of video, and most topics include hands-on Python code examples you can use for reference and for practice. I’ll draw on my 9 years of experience at Amazon and IMDb to guide you through what matters, and what doesn’t.
The topics in this course come from an analysis of real requirements in data scientist job listings from the biggest tech employers. We’ll cover the machine learning and data mining techniques real employers are looking for, including:
Regression analysis
K-Means Clustering
Principal Component Analysis
Train/Test and cross validation
Bayesian Methods
Decision Trees and Random Forests
Multivariate Regression
Multi-Level Models
Support Vector Machines
Reinforcement Learning
Collaborative Filtering
K-Nearest Neighbor
Bias/Variance Tradeoff
Ensemble Learning
Term Frequency / Inverse Document Frequency
Experimental Design and A/B Tests
…and much more! There’s also an entire section on machine learning with Apache Spark, which lets you scale up these techniques to “big data” analyzed on a computing cluster.
If you’re new to Python, don’t worry – the course starts with a crash course. If you’ve done some programming before, you should pick it up quickly. This course shows you how to get set up on Microsoft Windows-based PCs; the sample code will also run on macOS or Linux desktop systems, but I can’t provide OS-specific support for them.
Each concept is introduced in plain English, avoiding confusing mathematical notation and jargon. It’s then demonstrated using Python code you can experiment with and build upon, along with notes you can keep for future reference.
If you’re a programmer looking to switch into an exciting new career track, or a data analyst looking to make the transition into the tech industry – this course will teach you the basic techniques used by real-world industry data scientists. I think you’ll enjoy it!
What are the requirements?
You’ll need a desktop computer (Windows, Mac, or Linux) capable of running Enthought Canopy 1.6.2 or newer. The course will walk you through installing the necessary free software.
Some prior coding or scripting experience is required.
At least high school level math skills will be required.
This course walks through getting set up on a Microsoft Windows-based desktop PC. While the code in this course will run on other operating systems, we cannot provide OS-specific support for them.
What am I going to get from this course?
Extract meaning from large data sets using a wide variety of machine learning, data mining, and data science techniques with the Python programming language.
Perform machine learning on “big data” using Apache Spark and its MLlib package.
Design experiments and interpret the results of A/B tests
Visualize clustering and regression analysis in Python using matplotlib
Produce automated recommendations of products or content with collaborative filtering techniques
Apply best practices in cleaning and preparing your data prior to analysis
What is the target audience?
Software developers or programmers who want to transition into the lucrative data science career path will learn a lot from this course.
Data analysts in finance or other non-tech industries who want to transition into the tech industry can use this course to learn how to analyze data using code instead of tools. However, you’ll need some prior experience in coding or scripting to be successful.
If you have no prior coding or scripting experience, you should NOT take this course – yet. Go take an introductory Python course first.
Section 1: Getting Started  

Lecture 1 
Introduction

02:44  
Lecture 2 
[Activity] Getting What You Need

02:37  
Lecture 3 
[Activity] Installing Enthought Canopy

06:19  
Lecture 4 
Python Basics, Part 1

15:58  
Lecture 5 
[Activity] Python Basics, Part 2

09:41  
Lecture 6 
Running Python Scripts

03:55  
Section 2: Statistics and Probability Refresher, and Python Practice
Lecture 7 
Types of Data

06:58  
Lecture 8 
Mean, Median, Mode

05:26  
Lecture 9 
[Activity] Using mean, median, and mode in Python

08:30  
Lecture 10 
[Activity] Variation and Standard Deviation

11:12  
Lecture 11 
Probability Density Function; Probability Mass Function

03:27  
Lecture 12 
Common Data Distributions

07:45  
Lecture 13 
[Activity] Percentiles and Moments

12:33  
Lecture 14 
[Activity] A Crash Course in matplotlib

13:46  
Lecture 15 
[Activity] Covariance and Correlation

11:31  
Lecture 16 
[Exercise] Conditional Probability

11:03  
Lecture 17 
Exercise Solution: Conditional Probability of Purchase by Age

02:18  
Lecture 18 
Bayes’ Theorem

05:23  
Section 3: Predictive Models  
Lecture 19 
[Activity] Linear Regression

11:01  
Lecture 20 
[Activity] Polynomial Regression

08:04  
Lecture 21 
[Activity] Multivariate Regression, and Predicting Car Prices

08:06  
Lecture 22 
Multi-Level Models

04:36  
Section 4: Machine Learning with Python  
Lecture 23 
Supervised vs. Unsupervised Learning, and Train/Test

08:57  
Lecture 24 
[Activity] Using Train/Test to Prevent Overfitting a Polynomial Regression

05:47  
Lecture 25 
Bayesian Methods: Concepts

03:59  
Lecture 26 
[Activity] Implementing a Spam Classifier with Naive Bayes

08:05  
Lecture 27 
K-Means Clustering

07:23  
Lecture 28 
[Activity] Clustering people based on income and age

05:14  
Lecture 29 
Measuring Entropy

03:09  
Lecture 30 
[Activity] Install GraphViz

Article  
Lecture 31 
Decision Trees: Concepts

08:43  
Lecture 32 
[Activity] Decision Trees: Predicting Hiring Decisions

09:47  
Lecture 33 
Ensemble Learning

05:59  
Lecture 34 
Support Vector Machines (SVM) Overview

04:27  
Lecture 35 
[Activity] Using SVM to cluster people using scikit-learn

05:36  
Section 5: Recommender Systems  
Lecture 36 
User-Based Collaborative Filtering

07:57  
Lecture 37 
Item-Based Collaborative Filtering

08:15  
Lecture 38 
[Activity] Finding Movie Similarities

09:08  
Lecture 39 
[Activity] Improving the Results of Movie Similarities

07:59  
Lecture 40 
[Activity] Making Movie Recommendations to People

10:22  
Lecture 41 
[Exercise] Improve the recommender’s results

05:29  
Section 6: More Data Mining and Machine Learning Techniques  
Lecture 42 
K-Nearest-Neighbors: Concepts

03:44  
Lecture 43 
[Activity] Using KNN to predict a rating for a movie

12:29  
Lecture 44 
Dimensionality Reduction; Principal Component Analysis

05:44  
Lecture 45 
[Activity] PCA Example with the Iris data set

09:05  
Lecture 46 
Data Warehousing Overview: ETL and ELT

09:05  
Lecture 47 
Reinforcement Learning

12:44  
Section 7: Dealing with Real-World Data
Lecture 48 
Bias/Variance Tradeoff

06:15  
Lecture 49 
[Activity] K-Fold Cross-Validation to avoid overfitting

10:55  
Lecture 50 
Data Cleaning and Normalization

07:10  
Lecture 51 
[Activity] Cleaning web log data

10:56  
Lecture 52 
Normalizing numerical data

03:22  
Lecture 53 
[Activity] Detecting outliers

07:00  
Section 8: Apache Spark: Machine Learning on Big Data  
Lecture 54 
[Activity] Installing Spark – Part 1

07:02  
Lecture 55 
[Activity] Installing Spark – Part 2

13:29  
Lecture 56 
Spark Introduction

09:10  
Lecture 57 
Spark and the Resilient Distributed Dataset (RDD)

11:42  
Lecture 58 
Introducing MLlib

05:09  
Lecture 59 
[Activity] Decision Trees in Spark
16:00  
Lecture 60 
[Activity] K-Means Clustering in Spark

11:07  
Lecture 61 
TF / IDF

06:44  
Lecture 62 
[Activity] Searching Wikipedia with Spark

08:11  
Lecture 63 
[Activity] Using the Spark 2.0 DataFrame API for MLlib

07:57  
Section 9: Experimental Design  
Lecture 64 
A/B Testing Concepts

08:23  
Lecture 65 
T-Tests and P-Values

05:59  
Lecture 66 
[Activity] Hands-on With T-Tests

06:04  
Lecture 67 
Determining How Long to Run an Experiment

03:24  
Lecture 68 
A/B Test Gotchas

09:26  
Section 10: You made it!  
Lecture 69 
More to Explore

02:59  
Lecture 70 
Don’t Forget to Leave a Rating!

Article  
Lecture 71 
Bonus Lecture: Discounts on my Spark and MapReduce courses!

01:28 
Instructor Biography
Frank Kane, Data Miner and Software Engineer
Frank Kane spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.