Class Notes
Class GitHub Repo
- Lecture 1 - Welcome to Algorithms of Machine Learning
- Lecture 2 - Review, Python3, Jupyter, Linear Algebra, Probability
- Lecture 3 - Simple Linear Regression
- Lecture 4 - Linear Regression -- Gradient Descent
- Lecture 5 - Supervised vs Unsupervised
- Lecture 6 - Logistic Regression
- Lecture 7 - Model Evaluation Metrics
- Lecture 8 - Improving Your Model - Feature Engineering
- Lecture 9 - Lifecycle of a Model
- Lecture 10 - Support Vector Machines
- Lecture 11 - Support Vector Machines - The Kernel Hack
- Lecture 12 - Decision Trees
- Lecture 13 - Decision Trees Splitting Criteria
- Lecture 14 - Ensemble Models - Random Forests
- Lecture 15 - Sequential Modeling
- Lecture 16 - Bayesian Networks
- Lecture 17 - Reinforcement Learning
- Lecture 18 - KMeans Clustering
- Lecture 19 - Anomaly Detection
- Lecture 20 - Principal Component Analysis
- Lecture 21 - Deep Learning - Perceptron
- Lecture 22 - Deep Learning - Image Classification
- Lecture 23 - Model Drift Factors
- Lecture 24 - Model Interoperability
- Lecture 25 - Scaling Machine Learning for Production Systems
- Lecture 26 - Ethics in Machine Learning
Welcome to Algorithms of Machine Learning
In general, machine learning is all about making predictions and classifications.
Machine learning algorithms use training data to generate models. Models are generated functions used to predict and classify new, previously unseen data. Sometimes models are called classifiers.
Before we train a model, we can hold back a subset of the data for testing the predictive power of the model. This is called test data.
The predictive power of the model depends on many factors, including the quality of data, number of samples, algorithm used, and whether a pattern can be learned.
All machine learning algorithms have the same basic goal, but go about it using different statistical methods.
Like sorting algorithms, each machine learning algorithm has its own strengths and weaknesses. Some are simple. Some are complicated. There is no silver bullet.
Conceptually, a model is a mathematical function. The complexity of the function depends on the ML algorithm, and the amount/variety of training data used. It can be represented as an object, serialized to a file, and used repeatedly in research and production systems.
Video: Gentle Introduction to Machine Learning
About Me
I’m a Principal Machine Learning Engineer at Kount Inc, and manage the machine learning engineers.
I’m a machine learning practitioner, not a data scientist, but I get to interview and work with data scientists on a regular basis.
I have a software engineering background (a masters in computer science from Boise State). My training in machine learning was from my brother, professor Casey Kennington, his course Natural Language Processing, and on-the-job by data scientists like Josh Johnston. I’ve been doing Machine Learning in some form or another professionally for 5 years.
Definition of Machine Learning
Machine Learning is the process of building a model, or a function, with data.
data → ML algorithm → f(x)
f(x) is the model (a function built from data)
The line between the ML algorithm that creates the model and the model itself is sometimes fuzzy — consider them completely separate things. The ML algorithm creates the model.
A model maps input to output.
Think of the way you interpret the world - how you learn.
Patterns that you’ve evolved over time in your behavior. Things you’ve “learned” from previous experience. You’ve developed a mental model of the world around you. Dopamine makes you happy — you’ve “learned” this subconsciously. You do things to make yourself happy: eat a good meal, go on a run, talk to a friend, play with a baby. Over time, you collect more data points, and either reinforce or adjust your mental model. You learn “what works” and “what doesn’t”.
You’ve learned how to make choices based on subconscious probabilities throughout your life.
Taco Bell makes you gassy, so you avoid it.
You bring Dramamine to theme parks because you typically get motion sickness.
You tell your friend the party starts at 6:30pm instead of 7:00pm, because they are usually late.
You adjust your mental model based on new input. Sometimes you get it wrong.
You learn that you don’t “bring up that topic” around this person, because last time it resulted in an argument. This is a new thing you are learning.
You go to a foreign country and learn that they eat grasshoppers. Your model of “food” has just expanded to encompass more things. You’ve collected more data that matters in a different context.
Something that used to be socially okay now just became taboo. You adjust your mental model of what’s okay in society. You learn with new data, and apply it to your existing mental model.
The more experiences you have, the more “data” points you have, the more accurate your model will be — as long as your experiences are varied enough, but you still recognize repeatable patterns. This is a model!
This semester teaches you how to build a model using data points from a database or a data stream, and use that model to predict outcomes, or classify new data you’ve never seen before.
ML spans multiple disciplines
Computer Science
Mathematics
Statistics
Depending on the author's background, each textbook or YouTube video will use different terminology and a different approach.
This is not a deep learning class
Deep learning is just one facet of the large discipline of machine learning.
We will touch on deep learning, but the coursework is considered a broad introduction — a survey.
Tools
In industry we use Python 3, Jupyter, scikit-learn, numpy/pandas/matplotlib, Apache Spark, pyspark — many of the same tools we will use in class.
This class will introduce the mathematical equations and code examples.
Because these algorithms have been around for decades, in some cases centuries, they have extensive and formal mathematical proofs.
We will discuss ML algorithms and the reason they were invented.
We will categorize ML algorithms and analyze their strengths and weaknesses.
I will supplement the course with real examples from industry that we see and use every day.
There are many ML algorithms, but only roughly a dozen are widely used.
Things you should already know
Python 3, pandas, probability theory, matplotlib, numpy, linear algebra
Jupyter notebooks
The HYPE surrounding AI/ML is huge right now. Some of it is warranted, most is overblown.
Machine learning is extremely powerful. It can find patterns in immense amounts of data that cannot be found using other methods.
Use ML where necessary, but avoid it when a simpler approach will work.
Sort of like a regular expression. If you’re just checking the length of a string, or whether it’s an int, a regular expression is overkill. On the other hand, there are situations where a regular expression is an invaluable tool, such as validating the format of an email address.
Algorithms Cheat Sheet
Review - Python 3, Jupyter, Linear Algebra, Probability
We will use Python 3 exclusively
Python 3 is not backward compatible with Python 2. It's essentially a fork.
The two main native data structures in Python are a list and a dict
Libraries and functions
You should know how to use numpy, matplotlib, and pandas.
numpy - numeric python
Adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
Usually much faster than native lists. Most ML libraries in Python are built on numpy
Remember these things from CS 223:
transpose
dot product
add tensors
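A quick numpy refresher on those three operations (the array values here are made up):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A.T)     # transpose: flip rows and columns
print(A @ B)   # dot (matrix) product
print(A + B)   # element-wise tensor addition
```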
pandas - panel data
Pandas Dataframes Jupyter Notebook
Pandas docs
Become familiar with the data frame and its various functions - create, drop, insert, update, etc.
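A minimal sketch of those data frame operations (the column names and values are made up):

```python
import pandas as pd

# create a data frame
df = pd.DataFrame({"age": [19, 42, 65], "salary": [25000, 80000, 250000]})

df["senior"] = df["age"] >= 60              # insert a derived column
df.loc[df["age"] == 42, "salary"] = 85000   # update values in place
df = df.drop(columns=["senior"])            # drop a column
print(df)
```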
Jupyter Notebook
Based originally on the IPython Notebook
Used by 90%+ of data scientists across the world, from academia to industry.
Strengths: can run cells independently, embed images, code and documentation in the same place, hold data in memory
Weaknesses: difficult to version control, leads to monolithic design, cells can be run in any order, which can lead to confusion
Linear Algebra
Linear Algebra w/ Numpy Jupyter Notebook
Linear algebra is a prereq for this class. This is just review.
Best review for linear algebra is 3Blue1Brown. I recommend watching at least the first video, and the one on vectors.
Be able to answer: what is a tensor? What is a vector?
What is discrete vs continuous?
Probability
Video: Probability vs Likelihood
Conditional probability P(B|A) - P=Probability, B=event, |=given, A=event - the probability that event B happens given that event A has already happened.
Joint probability P(A∩B) - the probability that events A and B happen together
Distributions
Bernoulli Distribution
Models the probability of an event with 2 outcomes: success or failure, heads or tails.
Binomial Distribution
Similar to the Bernoulli distribution. A binomial variable also models a scenario where there are only 2 outcomes, but instead of one trial it models many.
Gamma Distribution
Entirely different from the past two. Instead of modeling the outcome of an event, the gamma distribution models the probability of when an event will occur. The gamma distribution is a bit more complicated than the binomial, as it is not so obviously related to a simple outcome like success or failure.
Poisson Distribution
The likelihood that a certain number of events will occur in a given time interval.
Normal Distribution
Most continuous measurements of any large population, such as height or weight, tend to fall into a normally distributed curve, or "bell curve", named for its shape.
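You can get a feel for each of these distributions by sampling from them with numpy (the parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

rng.binomial(n=1, p=0.5, size=10)          # Bernoulli: one trial, success/failure
rng.binomial(n=100, p=0.5, size=10)        # Binomial: many trials
rng.gamma(shape=2.0, scale=1.0, size=10)   # Gamma: when will an event occur?
rng.poisson(lam=3.0, size=10)              # Poisson: event counts per interval
rng.normal(loc=170, scale=10, size=10)     # Normal: e.g. heights in cm
```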
Simple Linear Regression
Linear Regression is a simple model
A line on a graph. y = mx + b
Plot a bunch of data points on a graph, find the line that is the "closest" to all the data points. Now use that line to make "predictions".
It's used to map a linear relationship between two variables, a "dependent" and an "independent" variable.
Think of this linear relationship as a conditional probability. e.g. Location on the y axis is conditional on the x axis.
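A minimal sketch of fitting and using that line with scikit-learn (the data points are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # m and b in y = mx + b
print(model.predict([[6]]))               # use the line to "predict"
```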
Properties of Linear Regression
Easy to learn, easy to interpret, fast to calculate - it's like the "hello world" of machine learning
Predicts on a continuous numeric scale - not typically used for classification (predicting temperature rather than "is it going to rain?")
Works quite well on data where there is a linear relationship
Bias vs Variance
All of machine learning is plagued by the bias/variance trade-off
Obviously, when fitting a linear regression line to data, it's not perfect. It doesn't pass through all the data points. There is some error. Is this a big deal?
The question is, do you want it to be perfect?
As a thought experiment, let's create an equation that plots a squiggly line through all the data points. Does that mean that future data points will land on that squiggle? Probably not.
Fitting the training data too well can create unnecessary complexity, and may not generalize well. Indeed, it may be WORSE than a simple model.
Fitting training data well but not generalizing to new data is called overfitting
Overfitting also plagues all of machine learning
Video: Bias vs Variance
In the example below, the left model has low bias & high variance, the right model has high bias & low variance.
Models that Generalize
What does it mean to have a model that generalizes well? It means the model performs well on unseen data - the predictions are accurate. It was trained well
A model that generalizes was trained on a variety of data, and finds a sweet-spot between variance and bias
The model is flexible enough to be okay with a few mistakes, but overall has good performance. This explains why sometimes simple models (Linear Regression) outperform more complex models (SVM). But, like with all data science, it depends on the data!
Example: a model trained to identify pictures with apples doesn't generalize well if it was only trained on pictures with red apples
Video Download: Model Generalization
Multivariate Linear Regression
Uses several independent variables simultaneously
If simple linear regression is a line, 3 variables becomes a plane, and so on through n dimensions.
Once you go beyond 3 dimensions (which most ML algorithms do) it becomes difficult for our feeble brains to conceptualize. This is why ML models are a "black box".
Linear Regression -- Gradient Descent
Finding the line
ML Algorithms iterate toward a "best fit" for the data to create a model
In the case of linear regression, it's the best equation for y=mx+b given the data
How do we measure "best fit" for linear regression? The "sum of least squares"
Sum of Least Squares
The cost function for linear regression.
Sum of the squared distances along the Y axis from the data points to the line. You want the smallest number possible
The distance between a data point and the regression line is called a "residual".
Why is least squares... squared? So negative (above the regression line) doesn't cancel out positive (below the regression line).
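That cost, written out as a small function (a sketch, assuming x and y are numpy arrays):

```python
import numpy as np

def sum_of_squared_residuals(m, b, x, y):
    """Cost of the line y = mx + b for the given data points."""
    residuals = y - (m * x + b)     # distances along the Y axis
    return np.sum(residuals ** 2)   # squared so +/- don't cancel out
```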
Cost Function
ML algorithms have a cost function (also called a loss function, error function, or objective function)
This function is different for each ML algorithm. With Linear Regression, it's the distance from the data to the line
If you were to graph the "cost" or "total distance" for 100s of possible regression lines for any given data set, linear regression costs would yield a curve, where the bottom is the lowest cost and the best parameters for the regression line.
The shape of this cost function makes linear regression a "convex optimization" problem
You can graph any cost function. Sometimes it's pretty like this curve, or maybe it will be crazy, all over the place with many peaks and valleys.
Gradient Descent
Assume the cost function is graphed. Remember, we are trying to find the "lowest" part of the graph.
Now, assume the cost function isn't graphed, because calculating all those possible fits is computationally expensive. Gradient Descent to the rescue!
A derivative is a Calculus technique for finding the steepness and direction of a slope.
Gradient descent takes the derivative of a cost function at a given point, and iterates "down" the slope — when the derivative is 0 (or close, like 0.001) then it’s the best/lowest cost, and the best fit.
Gradient descent jumps down the slope quickly at first, then slows down (makes smaller jumps) when it senses it's reaching the bottom. It doesn't want to accidentally go up the other side!
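A bare-bones sketch of gradient descent for y = mx + b (the learning rate and step count are arbitrary). The jumps shrink automatically because the derivative itself shrinks near the bottom:

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, steps=1000):
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        error = (m * x + b) - y
        # partial derivatives of the mean squared cost w.r.t. m and b
        dm = (2 / n) * np.sum(error * x)
        db = (2 / n) * np.sum(error)
        m -= learning_rate * dm   # step "down" the slope
        b -= learning_rate * db
    return m, b
```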
Gradient Descent is a Big Deal
Many machine learning algorithms use gradient descent - not just Linear Regression
Linear Regression can always find the global minimum, but many other ML algorithms have a harder time. Their cost functions are more complex, with many local minima
Stochastic Gradient Descent
Stochastic gradient descent is for when you have a LOT of data points to check against the current fit in your gradient descent.
Stochastic means “random” so it takes a random sample of data points rather than all the data points.
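The stochastic twist on the loop above, as a sketch: each step looks at a random sample of points instead of all of them (the batch size is arbitrary, and x and y are assumed to be numpy arrays):

```python
import numpy as np

rng = np.random.default_rng()

def sgd_step(m, b, x, y, learning_rate=0.01, batch_size=32):
    idx = rng.choice(len(x), size=batch_size)   # random sample of points
    error = (m * x[idx] + b) - y[idx]
    m -= learning_rate * (2 / batch_size) * np.sum(error * x[idx])
    b -= learning_rate * (2 / batch_size) * np.sum(error)
    return m, b
```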
Video: Stochastic Gradient Descent
Supervised vs Unsupervised
Examples of supervised ML algorithms
Linear Regression
Logistic Regression
Decision Tree / Random Forest
SVM
Naive Bayes Classifier
Nearest Neighbor
Examples of unsupervised ML algorithms
KMeans Clustering
Expectation Maximization
Supervised means Labeled Data
Supervised learning is done using a "ground truth", or in other words, we have prior knowledge of what the output values for our samples should be.
Video: Supervised vs Unsupervised Machine Learning
Whenever anyone tells me they are doing machine learning, my first question is "supervised" or "unsupervised"? If supervised, my second question is "what are the labels?"
What is a label?
Winners vs losers
Cancer vs healthy
Good vs Fraud
Rain vs Shine
Approve vs Decline
Fast vs slow
Is a cat vs Not a cat
etc.. you know what binary means!
Typically labeled with a 0 or 1 (you pick which is which, doesn’t really matter) associated with a bunch of features in a sample.
Also can be more than 2 categories!
A label is typically a column in a dataset.
Given a row in a database (also called a sample, or a vector, or instance) most columns are features, and one of the columns is the label.
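In pandas terms, that looks like this (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("transactions.csv")   # hypothetical labeled dataset
X = df.drop(columns=["is_fraud"])      # every other column is a feature
y = df["is_fraud"]                     # the label column: 0 or 1
```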
Jupyter Notebook: Supervised Machine Learning with Logistic Regression
How does data get labeled?
Good question!
By hand -- a human! Machines learn from humans.
Crowd sourcing.
AWS Mechanical Turk
AWS Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning.
Unsupervised Means No Labels
Learns "groupings" or "clusters" rather than classification or prediction
Compares and groups samples together based on salient features. You can have control over how many groups are created (KMeans)
May discover cross-sections that are novel and unexpected
Examples: advertising (grouping customers), anomaly detection (looking for outliers of a group), recommender systems
Weak Supervision
Insufficient quantity of labeled data
Imprecise or inexact labels
Inaccurate labels
Insufficient subject-matter expertise to label data
Insufficient time to label and prepare data
Examples: fraud detection, cybersecurity, biomedical
Semi Supervised
Uses a combination of some labeled, mostly unlabeled data
Why? It's expensive and time-consuming to label data!
We can still leverage unlabeled data to learn in a supervised environment
Uses a process called "self-supervised learning" to give data pseudo labels
"Pseudo Labeling" - predicting unlabeled data using labeled data (works with neural networks)
"Adversarial Training" - purposefully adding "noise" to labels, which helps generalization
Examples: speech analysis, image classification, natural language processing, robotics
Reinforcement Learning
Good for learning a process, or sequence of steps - i.e. simulations
You can start with zero examples, but you can classify the outcome.
A system "guesses" or "randomizes" a sequence of steps, then is given a "score" based on how well it did. +1 for good. -1 for bad. Those are added into the system to increase or decrease the probability the system will take those steps again.
Imagine reinforcement learning as teaching a dog "good" and "bad" behaviors by giving them a treat vs a spritz of water to the face. You are "reinforcing" behaviors - you're not "pre-labeling" but "post-labeling"
Simulations can iterate and learn millions of scenarios very quickly
Examples: AI in games such as StarCraft (more games than a single person can play in a lifetime), Go, Atari Games, learning where people "click" for content
Simulations don't transfer very well to the real world. For example, reinforcement learning doesn't transfer well with self-driving cars
Video: Google's DeepMind Learning to Walk via RL
Exercise: Which of these business requirements could be supervised vs unsupervised?
Customer Lifetime Value
Dynamic Pricing
Customer Segmentation
Recommending New Products
Stock Market Investment
Self Driving Cars
Inventory Optimization
Logistic Regression
Logistic Regression is a binary classifier
Logistic Regression is like linear regression, but morphs the line into an S-shaped curve
The equation that morphs the line into an S-curve is the sigmoid (logistic) function - the inverse of the logit function, which gives Logistic Regression its name
The logit function can be expressed as logit(p) = log (p / (1-p)) where p is probability
This is an extension of something called the "odds ratio"... simply (p / (1 - p))
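Both functions in a few lines of numpy:

```python
import numpy as np

def logit(p):
    """Log of the odds ratio p / (1 - p)."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: squashes any number into (0, 1) as an S-curve."""
    return 1 / (1 + np.exp(-z))
```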
Properties of Logistic Regression
If you're trying to predict or classify whether something is true/false, yes/no, on/off, or literally any two outcomes
Constrains the estimated probabilities to lie between 0 and 1
One of the most commonly used algorithms in ML
Simple, yet powerful
Technically a linear model
Once the model is trained, you can generate a list of the most useful features!
Video: Logistic Regression
Technically, Logistic Regression can deal with multiple categorical outputs
Logistic Regression Cost Function
Rather than the sum of squared residuals like Linear Regression, Logistic Regression uses Maximum Likelihood Estimation
The likelihood function (L) measures the probability of observing the particular set of dependent variable values (p1, p2, ..., pn) that occur in the sample: L = p1 × p2 × ... × pn
Training the Model
Steps are similar to Linear Regression - using gradient descent, find the lowest cost using the cost function specific for this algorithm
Since Logistic Regression is a linear classifier, you start with similar data to linear regression, but transform the graph to between 0 and 1 using log odds
Jupyter Notebook: Logistic Regression from Scratch
Model Evaluation Metrics
Model Evaluation is about Context
Good models aren't just about what you got right, but minimizing what you got wrong
"How good are you at stopping fraud?" "REALLY GOOD. We stopped all transactions."
"I think the most important thing in machine learning (and essentially every aspect of life) is to think carefully about what you're trying to optimize." - Nate Monnig, Senior Data Scientist at Kount Inc
Cross Validation
Splitting your data into a train/test set. 70/30 is common.
^ This is called "Holdout Validation"
What if you set a 90/10 split? Then a different 90/10 split? ...and tried 10 combinations of 90/10 splits so you get to try all combinations as a test set? This is called folding. The number of times you split the data equals the number of folds.
^ This is called "K-folds Validation"
There are also "Stratified K-fold Cross-Validation", "Leave One Out Cross-Validation", and "Repeated Random Test-Train Splits"
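A sketch of holdout validation and k-folds validation in scikit-learn (a built-in dataset stands in for real data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# holdout validation: a single 70/30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# k-folds validation: 10 folds, so each fold is a 90/10 split
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
print(scores.mean())
```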
Jupyter Notebook: Cross Validation
Confusion Matrix
Once a confusion matrix is filled out we can calculate two more metrics
Sensitivity = true positives / (true positives + false negatives) = percentage of those WITH the condition correctly identified
Specificity = true negatives / (true negatives + false positives) = percentage of those WITHOUT the condition correctly identified
i.e. Sensitivity = correctly identifying positives
i.e. Specificity = correctly identifying negatives
F1-score = 2 / ((1/recall) + (1/precision))
Accuracy = correct predictions / total predictions
Precision, Recall, and the F1 Score
Recall is the same thing as Sensitivity (see above)
Precision and Recall tell us how well the model deals with things it got wrong in each class prediction.
Recall gives us information about false negatives. Precision gives us information about false positives.
F1 score takes into account how the data is distributed. Useful when you have data with imbalanced classes.
The F1 score combines precision and recall of the model, and it is defined as the harmonic mean of the model's precision and recall
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0
F1 score vs accuracy score? They both give you insights, and it depends on your data set.
F1 tells you how well you balance between predicting both classes.
Accuracy might be high, but F1 can be 0 if your model only predicts one class.
F1 doesn't tell you which class is doing better, though (precision vs recall).
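All of these metrics are one import away in scikit-learn (the labels and predictions below are made up):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # made-up labels
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]   # made-up predictions

print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))   # information about false positives
print(recall_score(y_true, y_pred))      # information about false negatives
print(f1_score(y_true, y_pred))          # harmonic mean of the two
```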
Most Common Baseline
Which of your target labels has the highest count?
Your accuracy has to be higher than your most common class (the maximum baseline)
Otherwise, your model is no better than guessing with a weighted coin flip
Class imbalance is something to watch out for -- the more balanced the classes (50/50) the better the model can learn
This was my first mistake when I got into ML. I trained a model to detect fraud. Only 1% of transactions are fraudulent, so my model was 99% accurate. I thought that was amazing, but it was no better than a wild guess!
ROC Curve
Receiver operating characteristic
Makes it easy to identify the best threshold for a given method
Summarizes all confusion matrices for given thresholds
X axis = 1 - specificity (false positive rate - incorrectly classified)
Y axis = sensitivity (true positives - correctly classified)
The best points are the ones that are furthest from the diagonal line - up and to the left - and also depend on what false positive rate you are willing to accept
AUC - Area Under the Curve
A good way to get the feel for the overall model performance, rather than a specific threshold.
The higher the area under the curve, the better
A perfect AUC would hug the Y axis vertically, and the 1.0 line horizontally (100% perfect predictions with no false positives)
Be suspicious when a model is perfect. That usually means overfitting, or using test data in the training data
ROC and AUC
AUCs are commonly used to compare ML algorithms for a given data set
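A sketch of computing the curve and its area with scikit-learn (again using a built-in dataset as a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]   # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # one point per threshold
print(roc_auc_score(y_test, y_score))               # area under the curve
```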
Precision-Recall Curve
Graph the rate of precision vs recall. Up to the right is better.
Used when an ROC curve is already pretty good (80s or 90s) and you want a more sensitive graph, and have a large class imbalance (like fraud). We use these a lot at Kount.
How Good Is Your Model?
Usually a complex and multi-faceted answer
"99% AUPRC on the training, but 50% on the test, then you’re massively overfitting the training set and need to regularise the model or simplify the hyperparameters." - Matthew Jones, Senior Data Scientist at Kount
Lots of metrics, use them all to paint a more comprehensive picture
Requires "hunches"... this doesn't sound very data-sciency! It takes experience
The data scientist makes the final call -- based on tolerance for risk from the business case
"Just knowing the area under the curve is not enough — it’s important to always keep in mind the training and test data that you were using. If the distributions don’t match up then you can have a really high metric but it won’t actually evaluate anything very well in the real world." - Matthew Jones, Senior Data Scientist at Kount Inc
Improving Your Model -- Feature Engineering
Data Science is mostly Feature Engineering
Choose your sample carefully
Most data science suffers from a lack of data.
Use a representative sample of what you're trying to learn -- get more data if you need
If your target class is imbalanced, you may need to drop samples to make it more 50/50
Get better labels, if they're inadequate or incorrect.
Real world example: at Kount, if we want to catch fraud every day of the week, we wouldn't just use samples from weekends
Real world example: at Kount, sometimes we drop non-fraud samples in our training set, so the fraud samples stand out a bit more
Real world example: at Kount, chargebacks (our label) are under-reported, late, and sometimes dishonest (this is called friendly fraud, when someone claims fraud but they actually made the order)
Feature selection
Feature selection is primarily focused on removing non-informative or redundant predictors from the model
Rank features, throw out ones that don’t help - they just add noise
How do you rank features? Many algorithms let you list feature importance - like we did in the Logistic Regression lecture with coefficients.
Check for collinearities. Two variables are perfectly collinear if there is an exact linear relationship between them.
For example: temperature in C and temperature in F. They go up at the same rate, no need for both features.
It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model
Get better features
Derive new features out of the old ones.
This is VERY domain specific, and requires creativity
Let's brainstorm for the current homework assignment
Pick features that you believe have the strongest relationship with the target variable
Look for more data sets
Feature Scaling
Machine learning is like making a mixed fruit juice. If we want to get the best-mixed juice, we need to mix all fruit not by their size but based on their right proportion.
In many machine learning algorithms, to bring all features to the same standing, we need to do scaling so that one feature doesn't dominate the model just because of its large magnitude.
For example: salary (25,000 - 250,000) vs age (19 - 65)
The most common techniques of feature scaling are Normalization and Standardization.
Normalization is used when we want to bound our values between two numbers, typically, between [0,1] or [-1,1]
Standardization transforms the data to zero mean and unit standard deviation. Good for features with a Gaussian distribution, and handles outliers.
There are LOTS of methods for scaling.
Like most other machine learning steps, feature scaling too is a trial and error process, not a single silver bullet.
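The two most common techniques, sketched with scikit-learn (the salary/age numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25000, 19], [80000, 42], [250000, 65]])   # salary vs age

print(MinMaxScaler().fit_transform(X))     # normalization: bound values to [0, 1]
print(StandardScaler().fit_transform(X))   # standardization: zero mean, unit variance
```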
Jupyter Notebook: Feature Engineering
Numeric vs Categorical
In most programming languages there are a handful of primitive data types: int, float, boolean, string, etc.
In machine learning there are two: numeric (continuous or discrete), and categorical (string)
Numeric: measurements like temperature, height, weight, speed, count, etc. Typically an int or a float
Categorical: low cardinality sets such as "eye color: blue, green, brown". Typically a boolean or a string
Many machine learning algorithms cannot handle categorical features, so they must be converted to numeric features.
Converting Categorical Features
One hot encoding turns categorical features into booleans - each possible value in a given categorical feature becomes its own feature!
Careful! This can cause a feature EXPLOSION!
One hot encoding is for when you have a handful of categories.
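A minimal pandas sketch of one hot encoding:

```python
import pandas as pd

df = pd.DataFrame({"eye_color": ["blue", "green", "brown", "blue"]})
print(pd.get_dummies(df, columns=["eye_color"]))
# eye_color becomes three boolean features:
# eye_color_blue, eye_color_brown, eye_color_green
```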
Dealing with Missing Data
Data science practitioners must eventually deal with gaps in their data - null features, blanks, empty strings, etc
Sometimes the data is curated and cleaned - neat and tidy
But often the data is gathered from faulty sensors, or systems that return null, or incomplete questionnaires
For example: gathering information about devices visiting your website for analysis, but 10% of the JavaScript information isn't present (what causes this?)
Maybe a weather station thermometer was malfunctioning for a few hours, but you still got all of the other readings.
There is NO universally good way to deal with missing data - sometimes missing data means something - sometimes it's an error with the observation
When does missing data "mean something"? Maybe an online survey has an optional question, and it's always null.
That could mean it's a sensitive question, a poorly worded question, or there are responses but a bug prevented the data from getting saved.
These are significant for DIFFERENT reasons in your model. Don't assume, or you could be learning a bug.
Missing Data Pitfalls
In the real world, beautifully formatted and curated data is rarely given to you.
Real data is messy.
Data will always pass through several systems. Each hand-off affects the format, and thus, the meaning. It's like the game of "telephone"
Data types matter a lot in data science. Unfortunately, sometimes we lose important type information when moving data - e.g. converting Python objects to JSON, or DateTime objects in MySQL to a timeseries
Boolean "false" getting implicitly cast to 0, or empty string "" cast to null.
null and empty string "" are not the same thing - treat them differently
Imputing
Most Machine Learning algorithms cannot deal with null values, so you must fill them in, or drop the feature or sample
Dropping columns, or entire samples because some of the features are null can lose valuable information
Imputing means "filling in missing data with a meaningful substitute". For example: the mean, median, or mode of all present values in that feature
Imputation for missing values in machine learning
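A sketch of imputing with scikit-learn (the feature values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0], [np.nan], [9.0], [8.0]])   # a feature with a gap

print(SimpleImputer(strategy="mean").fit_transform(X))   # fill with the mean
# strategy can also be "median" or "most_frequent" (the mode)
```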
Numbers aren't always Numeric
Machine Learning Algorithms expect that if something looks like a number, it can be multiplied, added, divided, subtracted, plotted, etc.
1 and 2 are closer than 1 and 5. There's a numeric relationship based on proximity.
Just because something looks like a number, doesn't mean it should be treated like one.
Consider a zip code. There's a loose correlation that zip codes starting with the same number are geographically near each other. But you don't perform math functions on zip codes. They are essentially strings. Categories. Which means you have to be explicit about that in your feature types.
Ask yourself, "Is this number measuring something?" If not, it's probably a categorical feature.
Binning ... aka Bucketing
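Binning converts a numeric feature into a categorical one by grouping values into ranges. A minimal pandas sketch (the bucket edges and labels are arbitrary):

```python
import pandas as pd

ages = pd.Series([19, 34, 42, 65, 71])
print(pd.cut(ages, bins=[0, 30, 60, 100], labels=["young", "middle", "senior"]))
```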
No Free Lunch Theorem
The “No Free Lunch” theorem states that there is no one model algorithm that works best for every problem
The assumptions of a great model for one problem may not hold for another problem
It's common in machine learning to try multiple models and find one that works best for a particular problem
Remember your algorithms class, where you learned about various types of sorting algorithms? Each has its strengths and weaknesses
Class Imbalance
Imbalanced classes are a common problem in machine learning classification, where there is a disproportionate ratio of observations in each class.
Class imbalance can be found in many different areas including medical diagnosis, spam filtering, and fraud detection.
There are many ways to deal with class imbalance:
Change the algorithm
Oversample minority class
Undersample majority class
Generate synthetic samples
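A sketch of the oversampling option using scikit-learn's resample (the data below is made up and tiny):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"amount": [10, 15, 12, 900, 11, 14],
                   "is_fraud": [0, 0, 0, 1, 0, 0]})   # imbalanced classes

minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# oversample the minority class until the two classes are balanced
upsampled = resample(minority, replace=True,
                     n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, upsampled])
```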
Lifecycle of a Model
What problem are you trying to solve?
It's not finding the right answer, but the right question.
This is often the most difficult step, especially from a business perspective.
Get the Data
This is the main reason it's very difficult to create a startup business in Machine Learning. They don't have any data.
Collect it yourself?
There are tools for this. AWS Ground Truth. Web development screen scraping. APIs (easy if you took my CS 401 class!)
How do you find novel data?
Google, Kaggle, public record databases, you might even purchase a data set.
Most companies are unwilling to donate data sets due to privacy policies (PCI, FERPA, HIPAA).
Medical records, student records, e-commerce transactions, experiments involving humans or children (toy design, interface design, etc) are illegal to share without written consent.
Depending on what you’re trying to do, this data can be easy to find and free, impossible to find, and/or illegal to obtain.
Industry and academia partnerships are tricky. For example: if a company shares data with a university and a discovery is made, who gets the patent?
Anonymizing the data (removing names and other personally identifiable data) might work, but may also defeat the purpose of what your algorithm is trying to learn.
Inventing your own random data typically doesn’t work. You want to learn things from the real world!
Scatterplot the Features
Get familiar with the data. Wade around in it, get intimately familiar with each feature.
Plot the distributions. Are they normal? Random? Categorical? Are they full of nulls? How would you impute the missing data? Make sure you know the types — are they binary? What are the upper/lower limits of each feature?
Averages are rarely useful. Distributions are king.
Can numeric features be converted to categorical features, or vice versa? Can they be “bucketed”?
Understand the source of the data. Did it pass through several pipelines? Did it convert from JSON to a Python dictionary? Was any data lost or transformed in the process?
Was it curated or is it wild? Was it collected by a sensor, or entered by a grad student? Was the sensor functioning properly?
Food for thought: manually entered datasets were probably input by a grad student.
Feature Engineering
You might not have enough data.
You might have too much data (features)!
Get rid of features that don’t contribute to learning your problem.
Your data might be imbalanced - what do you do? Get rid of samples?
Should you scale or resize the data to put numeric range features on equal footing?
Can you use features to create MORE features?
Creating more derived features is very common. Always look for ways to do this.
Try Various Algorithms
Hypothesize, due to the nature of the data, which algorithm would/should work?
Cross validation - various data splits
sklearn makes this step easy. Loop data through a handful of algorithms, and evaluate the results.
Iterate over hyperparameters
Evaluate
How good are the models?
This is a difficult step. How do you know when “good” is “good enough”?
It should at LEAST be better than a weighted coin flip.
Use the metrics discussed in the “Model Evaluation Metrics” lecture.
There is no magic number that tells you when a model is ready. This is a judgement call.
Is the model TOO good? This was likely caused by a bug — you accidentally included your test data in your training set. Oops! This has happened to literally every ML engineer and data scientist, ever. Common mistake!
If you think you can do better, go back to step 4, or maybe even step 2, and iterate through this loop until you’re satisfied.
Deploy your model
It needs to be put in the right place to make it useful in the real world.
Many call this "inference", or putting the "inference" into production
Collect predictions, and compare them against actual outcomes
Support Vector Machine
Guest Lecture References
Ryan Pacheco's PowerPoint Presentation
Jupyter Notebook: SVM
Video: SVM Clearly Explained
Support Vector Machine - The Kernel Hack
Guest Lecture References
Josh Johnston's Jupyter Notebook
Decision Trees
A Decision Tree is a Flowchart Structure
A decision tree generates a (generally) binary tree where each node contains a rule on a given feature targeting a label.
That rule "splits" the data into two parts. For example: if a feature is greater than, equal to, or less than.
The highest node on the tree separates as many samples as possible
As you descend the tree, the nodes get more and more specific.
The leaf nodes are the classification for that given sample.
The leaf nodes also contain metadata about their respective "path", such as the ratio of training samples matching that path.
If you get to a leaf node and there are still samples from both classes, it's called "impure".
Impure leaf nodes are much more common than pure leaf nodes.
Video: StatQuest Decision Trees
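A sketch of growing a tree with scikit-learn and printing its flowchart of rules (the iris dataset stands in for class data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

print(export_text(tree))   # the learned chain of if/else splits
```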
Decision Trees are like a game of 20 questions
When you formulate questions, you start with broad strokes, like "animal, mineral, or vegetable"?
It's sort of like a binary search.
The entire training set is considered at the root node. Samples are distributed recursively on the basis of feature values.
Decision Trees figure out how to ask the right questions at each stage, giving maximum separation.
Decision Trees are Powerful
Features can be repeated at any stage with any splitting criteria.
Though decision trees are deceptively simple, they tend to rank very high in accuracy when going head-to-head against other algorithms.
Decision trees are typically used as supervised classifiers.
Decision Trees can grow to an arbitrary depth, or can be trimmed to a maximum depth via a hyperparameter.
Decision trees can become massive.
Decision Trees are Flexible
Decision trees can be built with a combination of categorical and numeric data!
Scaling is not required!
Missing data doesn't affect building the tree!
Decision trees have many possible ways of being generated.
How a tree is generated is called its "splitting criteria", such as Gini, Entropy, Information Gain, etc.
Decision Trees are Interpretable
Jupyter Notebook: Decision Trees
It's easy to visualize a decision tree! You can display the entire tree and see how it makes decisions.
From a Computer Science perspective, a decision tree is an auto-generated chain of if-then-else statements.
Most Machine Learning models are a black box - but decision tree models are easy to read.
They're intuitive. You start at the top, and work your way down.
Decision Trees tend to work well on data that is non-linear.
Pruning
"Pruning" is a technique used to reduce the size of a decistion tree without reducing its predictive accuracy.
If we reduce the complexity of the tree, it’s accuracy can be even higher.
To prune: remove the branches that make use of features having low importance.
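In scikit-learn, one knob for this is cost-complexity pruning (a sketch; the ccp_alpha value here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# a larger ccp_alpha prunes more aggressively, yielding a smaller tree
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)
print(pruned.get_n_leaves())
```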
Decision Tree Weaknesses
A small change in the data can cause a large change in the structure of the decision tree causing instability.
Representation can take a lot of memory compared to simpler algorithms, like Logistic Regression.
Decision Trees: Splitting Criteria
What makes a good split?
Purity/Impurity
Impurity is symmetrical
We need a way to measure the "impurity" of a set.
Gini Impurity
This measure \( I_G(X) \) is defined as the probability of making a mistake when drawing and labeling an element from a set.
Randomly choose an element. Probability is based on the composition of the set.
Randomly choose a label, using the same probability distribution.
The probability of a mistake is that of choosing an element and labeling it with anything but the correct label.
\[
\begin{align*}
I_G(X) &= \sum_{x} P(x)(1-P(x)) \\
&= \sum_{x} P(x)-P(x)^2 \\
&= \sum_{x} P(x)-\sum_{x}P(x)^2 \\
&= 1 - \sum_{x}P(x)^2
\end{align*}
\]
Entropy
The information content of an event \(x\) from sample space \(X\) is defined as:
\[ I(x) = -\log_2 P(x) \]
The entropy of a random variable \(X\) is the expected value of the information content of that random variable.
\[
\begin{align*}
H(X) &= \sum_{x}P(x)I(x) \\
&= \sum_{x}P(x)(-\log_2 P(x)) \\
&= -\sum_x P(x)\log_2 P(x)
\end{align*}
\]
Gini vs Entropy
If we multiply Gini by 2 so that it’s scaled the same as Entropy, you can see that they’re not very different.
| Gini | Entropy |
| --- | --- |
| Range: \( [0,0.5] \) | Range: \( [0,1] \) |
Determining the Best Split
Given the set of samples \( S \) and the set of attributes \( A \):
The values of one attribute are \( V(a \in A) \)
The subset of samples with a given value \( v \in V(a) \) for attribute \( a \in A \) is \( S[a=v] \)
We measure impurity of a set with \( I_G(X) \), but you could swap it for Entropy (or maybe even something else you find.)
The weighted impurity average for a split on attribute \(a \in A\):
\[
W(S,a)= \sum_{v \in V(a)} \frac{|S[a=v]|}{|S|} I_G(S[a=v])
\]
The best attribute to split is the one that minimizes the impurity:
\[
B(S,A) = \underset{a \in A}{\operatorname{argmin}} W(S,a)
\]
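Those formulas translate almost directly into Python. A sketch, assuming samples is a list of (features_dict, label) pairs:

```python
from collections import Counter

def gini(labels):
    """I_G(X) = 1 - sum over x of P(x)^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_impurity(samples, attribute):
    """W(S, a): impurity of each subset S[a=v], weighted by subset size."""
    total = len(samples)
    subsets = {}
    for features, label in samples:
        subsets.setdefault(features[attribute], []).append(label)
    return sum(len(lbls) / total * gini(lbls) for lbls in subsets.values())

def best_split(samples, attributes):
    """B(S, A): the attribute that minimizes the weighted impurity."""
    return min(attributes, key=lambda a: weighted_impurity(samples, a))
```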
Ensemble Models - Random Forests
Guest Lecture References
Dr Matthew Jones' PowerPoint Presentation
Dr Matthew Jones' Jupyter Notebook
Sequential Modeling
Guest Lecture References
Arthur Putnam PDF Presentation
Bayesian Networks
Reinforcement Learning
Guest Lecture References
Dr Casey Kennington's PDF Presentation
KMeans Clustering
Guest Lecture References
See lecture
Anomaly Detection
Guest Lecture References
Dr Nate Monnig's PDF Presentation
Principal Component Analysis
Guest Lecture References
Dr Divy Murli's PDF Presentation
Deep Learning - Perceptron
Neural Networks
Inspired by the brain. Contains "neurons"
(Don't take this analogy too far)
In AI, a neuron is a thing that takes inputs and holds a number
Whether it "activates" is governed by its "activation threshold" or "activation function"
Neural Networks are "layers" of neurons
Input neurons point to additional neurons in another "layer". If enough "activated" neurons point to another neuron, that neuron "activates", and so on.
The number of neuron layers, and how they are organized often determines the "type" of neural network
There are many variants in Neural Networks
Video: But what is a Neural Network
Perceptron
The concept of a Perceptron dates back to 1957 and Frank Rosenblatt
It was an amazing breakthrough, but notoriously oversold
"The embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
After a long "AI Winter", perceptrons regained interest in the 1980s.
Adding additional layers helped perceptrons become the building block of neural networks
A Perceptron is essentially one layer of a neural network
It's a list of inputs with weights; the higher the weight, the more influence it has on the result
Sum the products of inputs and weights, and pass it to the threshold function
Since the inputs and weights are each vectors, you can take the dot product to compute the result
Perceptrons are good with data that is linearly separable - it's a binary classifier
A Perceptron is a convex optimization problem
Perceptron Learning Steps
If a point is classified correctly, do nothing.
If a point is misclassified, adjust the Perceptron’s decision boundary until the point is classified correctly.
Do this for all points, until settling on a decision boundary which minimises the number of misclassified points, possibly zero of them.
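Those steps, as a minimal numpy sketch (assumes X is a numpy array, y is labeled +1/-1, and the data is linearly separable):

```python
import numpy as np

def train_perceptron(X, y, learning_rate=0.1, epochs=100):
    w = np.zeros(X.shape[1])   # one weight per input feature
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            prediction = 1 if xi @ w + b > 0 else -1   # step function
            if prediction != target:                   # misclassified:
                w += learning_rate * target * xi       # nudge the boundary
                b += learning_rate * target
    return w, b
```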
Perceptron vs Logistic Regression
Similarities
Both require linear separability (indeed they may both come up with the same decision boundary)
Both are convex optimization problems
Both use coefficients to weight their inputs
Both use learning rates to optimize their weights
Differences
Perceptrons use a step function, while Logistic Regression is a probabilistic range
The main problem with the Perceptron is that it's limited to linear data - a neural network fixes that.
A Perceptron is essentially a single layer neural network - add layers to represent more information and complexity
Neural Networks
A Neural network consists of nodes and connections between the nodes
Neural networks are rows of neurons, each connected to every neuron in the next row, and so forth.
Typically, the row of neurons on the far left is the input layer.
The row of neurons on the far right is your output layer.
Any number of optional middle rows are called "hidden layers"
Each connection between neurons has a weight
If a neuron has enough weighted inputs, it's activated
When you build a neural network, one of the first things you do is decide how many hidden layers you want
Lots of layers (3+) makes it a "deep" neural network
Video: Statquest on Neural Networks
Activation Function
The function in each neuron that decides whether a minimum threshold of weighted inputs was reached
Can be a sigmoid (like Logistic Regression), curved lines like SoftPlus, or a bent line like ReLU
When you build a neural network, you have to decide what kind of activation function you will use
Most use sigmoid as a starting point
In practice, SoftPlus and ReLU are common too
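The three activation functions mentioned above, in numpy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))    # smooth S-curve between 0 and 1

def softplus(z):
    return np.log(1 + np.exp(z))   # smooth curved line

def relu(z):
    return np.maximum(0, z)        # bent line: 0 below zero, z above
```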
Backpropagation
How do you train the weights and biases between each neuron? Backpropagation!
Conceptually, it starts with the last parameter, and works backwards
Use the chain rule to calculate derivatives
Plug the derivatives into gradient descent
This uses the sum of squared residuals -- very similar to linear regression
Gradient descent drives each weight and bias toward a minimum of the cost function
Repeat this process until you find the optimum weight and bias for each connection
Video: Statquest on Backpropagation
Jupyter Notebook: Deep Learning - Perceptron
Many Types of Neural Networks
Feedforward Neural Network
Radial Basis Function Neural Network
Convolutional Neural Network
Recurrent Neural Network
Probabilistic Neural Network
Autoencoder
etc...
Deep Learning - Image Classification
Guest Lecture References
Gerardo Caracas Presentation
Model Drift Factors
What is model drift?
Also called "Concept Drift" or "Model Decay"
Model accuracy goes down over time
Eventually models become obsolete
What Causes Model Drift?
New data comes in and needs to be incorporated into the model
Trends change in the data. Perhaps a better term is “data drift”
Seasonal changes, expanding model capabilities, cataclysmic events
Real World Examples
COVID caused hospitality and travel transactions to nearly disappear - that changed the global trends
Kount Inc blocks fraudsters, and then fraudsters try a new strategy to commit fraud
New carcinogens are introduced into the environment, altering the occurrence of cancer
New funds are added to the Stock Market, altering return predictions
A new football season means new players and new statistics, altering the likelihood of winning games
How to Address Model Drift?
Do nothing
Periodically re-train
How often to re-train?
Every “cycle”, depending on how often the data changes
How do you measure model drift? Monitoring, multiple pipelines, and side-by-side comparisons.
Monitoring tools? Jupyter, Splunk, DataDog, Dash, Tableau, etc.
Where should your pipeline live? Highly model dependent.
Batch vs Real Time?
At what point are you unhappy with the accuracy?
How much does the model improve after re-training?
Judgement call — lots of things to balance.
Have a plan in advance.
Considerations in Re-training
Cost (data scientists' time, CPU resources)
Should you throw away outliers?
Does it make sense to incorporate once-every-30-year events like "Snowmageddon" for future snowfall predictions?
Does re-training cause overfitting to recent unusual trends?
Model Interoperability
Training Environment vs Inference Environment
The training environment for a model refers to the code and library used to train the model.
The inference environment is where the model is loaded into memory and used for evaluations.
Thus far, you've used scikit-learn in a Jupyter notebook for both training and inference. In industry, training and inference are treated as separate environments
Training is still typically done in a Jupyter Notebook, but then the model is persisted to disk and hosted in a production environment
The model is typically wrapped in a RESTful endpoint, hosted on a production server in the cloud
Persisting a Model
If you want to use a model later without re-training it, you must persist it to disk
In scikit, you can save a model using the pickle.dumps function
More commonly, the joblib library is used to persist the model to a file - it's more efficient for models that carry large numpy arrays
Jupyter Notebook: Persisting a Model to Disk
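A sketch of the round trip with joblib (the filename is arbitrary):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")     # persist to disk after training
loaded = joblib.load("model.joblib")   # later, in the inference environment
print(loaded.predict(X[:5]))
```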
Model Formats
Scikit models have a big weakness: they can only be used in scikit
This is true of virtually every model training framework -- the training library must also be used in the inference environment
There are MANY other frameworks and languages for machine learning that CANNOT use a scikit model
Examples of other model frameworks and formats:
Apache Spark
Apache Mahout
Java ML Library
R Model Persistence
GoLearn Machine Learning Library in Go
Keras Model Format
PMML (Predictive Model Markup Language)
PMML format examples
This is a big problem if you ever try to SHARE your model
This is a big problem if you ever try to PRODUCTIONIZE your model
What does this imply?
"Building ML models is hard. Deploying them in real business environments is harder."
Machine Learning Engineers and Data Scientists work together with the business to determine how the model will be used.
Is the model a prototype? (Small, inaccurate, but good enough for integration or as a proof of concept)
Is the model complex or simple? Small or large? (10k, 5mb, 500mb?)
Will evaluations be in batches? (Latency isn't as important)
Will evaluations be one sample at a time?
What are the latency requirements?
What are the memory requirements?
Scaling Machine Learning For Production Systems
Large-scale Machine Learning
Imagine building massive models beyond toy data sets
Billions of Samples
Thousands of Models
Millions of Predictions
How do you bring your model to the masses?
Training your model was only the beginning...
Production Machine Learning
An interdisciplinary approach to hosting ML models in production
Distributed systems - designing a complex network of dependencies
Parallel computing - collecting data simultaneously
Enterprise architecture - determining the best approach for middleware
Quality Assurance - the production pipeline must match accuracy from research
Resource Management - Building a cluster is expensive, how to maximize usage?
Deployment - Moving the model to production using a non-obtrusive strategy - i.e. minimal downtime
Monitoring - logging and aggregating logs, creating alerts and on-call schedules
Marketing and Sales - sometimes ignored by engineers, but a vital part of the process
Post Production Support - Does it work? How often should it be refreshed? Customer phone calls?
Machine Learning in the Cloud
AWS SageMaker
Microsoft Azure
Latency
Track latency by logging every request in a log file
Use software to aggregate and create a time chart
Examples of log visualization software: Grafana, Splunk, DataDog
If latency matters, consider hosting your model in a stand-alone container
MLeap allows you to host a Spark model in a REST API
Tracking Experiments
A pillar of scientific research is tracking experiments
Can you reproduce an experiment from scratch?
The code might be the same, but what about the data?
What were the hyperparameters?
How do you go back to previous experiments?
Do you use version control?
Can you defend the decision you made to use a given model in production?
MLFlow is an open source platform for the machine learning lifecycle
Jupyter Notebook: MLFlow experiment tracking
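A taste of what tracking looks like with MLflow (a sketch; the parameter, metric, and file names are illustrative):

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # hyperparameters used
    mlflow.log_metric("auc", 0.93)            # results of the experiment
    mlflow.log_artifact("model.joblib")       # the persisted model file itself
```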
Ethics in Machine Learning
Guest Lecture References
Dr Ekstrand's PowerPoint Presentation