Machine Learning Essentials with Python and Spark (PySpark) is a foundation-level, three-day, hands-on course that teaches core skills and concepts for practicing modern machine learning at scale with Python and Spark. The course is geared for attendees new to machine learning who need introductory-level coverage of these topics rather than a deep dive into the math and statistics behind machine learning. The focus is on machine learning skills, as opposed to Spark essentials. Students will learn basic algorithms from scratch. For each machine learning concept, students will first learn about and discuss its foundations, applicability, and limitations, and then explore its implementation and use by reviewing and working with specific use cases.
Audience Profile
This course is geared for experienced, intermediate-skilled developers or others (with prior Python experience) who intend to start learning about and working with basic machine learning algorithms and concepts. Attendees should be comfortable programming in Python. Students should also be able to navigate the Linux command line and have basic knowledge of Linux editors (such as vi or nano) for editing code.
Student Testimonials
Instructor did a great job, from experience this subject can be a bit dry to teach but he was able to keep it very engaging and made it much easier to focus.
Student
Excellent presentation skills, subject matter knowledge, and command of the environment.
Student
Instructor was outstanding. Knowledgeable, presented well, and class timing was perfect.
Student
Prerequisites
Students should have the following incoming skills or equivalent experience:
· Strong basic Python skills. Attendees without a Python background may view labs as follow-along exercises or team with others to complete them. (NOTE: This course is also offered in Python (without Spark), Scala, or R; please inquire for details.)
· Good foundational mathematics in Linear Algebra and Probability
· Basic Linux skills, including familiarity with commands such as ls, cd, cp, and su
Take Before: Attending students should have incoming skills equivalent to those in the course(s) below:
· TTPS4800 Introduction to Python (3 days)
Detailed Class Syllabus
1. Machine Learning (ML) Overview
· Machine Learning landscape
· Machine Learning applications
· Understanding ML algorithms & models
2. Machine Learning in Python and Spark
· Spark ML Overview
· Introduction to Jupyter notebooks
· Working with Jupyter + Python + Spark
3. Machine Learning Concepts
· Statistics Primer
· Covariance, Correlation, Covariance Matrix
· Errors, Residuals
· Overfitting / Underfitting
· Cross-validation, bootstrapping
· Confusion Matrix
· ROC curve, Area Under Curve (AUC)
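As a rough illustration of the confusion matrix topic above, the following plain-Python sketch counts true/false positives and negatives for a binary classifier and derives precision and recall from them. The function names and the label data are hypothetical, made up for this example; they are not taken from the course materials, and in class the same ideas would typically be applied with Spark ML rather than hand-written loops.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def precision_recall(cm):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    precision = cm["tp"] / predicted_pos if predicted_pos else 0.0
    recall = cm["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up labels, for illustration only
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
precision, recall = precision_recall(cm)
print(cm)
print("precision:", precision, "recall:", recall)
```

With these sample labels the classifier makes one false positive and one false negative, so precision and recall both come out to 0.75.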
4. Feature Engineering (FE)
· Preparing data for ML
· Extracting features, enhancing data
· Data cleanup
· Visualizing Data
· Lab: data cleanup
· Lab: visualizing data
5. Linear Regression
· Simple Linear Regression
· Multiple Linear Regression
· Running LR
· Evaluating LR model performance
· Use case: House price estimates
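To give a feel for what "from scratch" can mean for simple linear regression, here is a minimal sketch that fits a line by ordinary least squares and evaluates it with R-squared. It assumes a single feature (house size) and uses invented prices; the function names and data are hypothetical and are not the course's lab code or dataset.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented data: size in square feet, price in thousands of dollars
sizes = [1000, 1500, 2000, 2500]
prices = [200, 290, 410, 500]

slope, intercept = fit_simple_linear_regression(sizes, prices)
r2 = r_squared(sizes, prices, slope, intercept)
estimate = slope * 1800 + intercept  # estimated price for an 1800 sq ft house
print(slope, intercept, r2, estimate)
```

Multiple linear regression follows the same idea with more than one feature, which is where a library implementation becomes more practical than hand-written sums.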
6. Logistic Regression
· Understanding Logistic Regression
· Calculating Logistic Regression
· Evaluating model performance
· Use case: credit card application, college admissions
7. Classification: SVM (Support Vector Machines)
· SVM concepts and theory
· SVM with kernel
· Use case: Customer churn data
8. Classification: Decision Trees & Random Forests
· Theory behind trees
· Classification and Regression Trees (CART)
· Random Forest concepts
· Use case: predicting loan defaults, estimating election contributions
9. Classification: Naive Bayes
· Theory
· Use case: spam filtering
10. Clustering (K-Means)
· Theory behind K-Means
· Running K-Means algorithm
· Estimating the performance
· Use case: grouping cars data, grouping shopping data
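The K-Means steps listed above (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched in plain Python as follows. This is an illustrative assumption of how such an exercise might look, not the course's lab code: the function names, the fixed starting centroids, and the sample points are all made up, and the within-cluster sum of squares (WCSS) is used here as one common way to estimate clustering quality.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Basic K-Means; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(math.dist(p, c) ** 2
               for c, members in zip(centroids, clusters)
               for p in members)


# Made-up 2-D points forming two obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, initial_centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
print(wcss(centroids, clusters))
```

Real implementations usually pick starting centroids randomly or with a seeding scheme, and results can depend on that choice; the fixed starting points here just keep the sketch deterministic.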
11. Principal Component Analysis (PCA)
· Understanding PCA concepts
· PCA applications
· Running a PCA algorithm
· Evaluating results
· Use case: analyzing retail shopping data
12. Recommendations (Collaborative filtering)
· Recommender systems overview
· Collaborative Filtering concepts
· Use case: movie recommendations, music recommendations
13. Performance
· Best practices for scaling and optimizing Apache Spark
· Memory caching
· Testing and validation
Time Permitting: Capstone Project
· Hands-on guided workshop utilizing skills learned throughout the course