An Introduction to Machine Learning Topics

Hello all!

So after my post last week, I received some feedback saying that I should better explain the concepts I was talking about, and why and how we use them. So in this post I’m going to attempt to explain most of the concepts I used in my last post.

To start off I’m just gonna break things down and list out the terms I’ll be defining.  In order to do machine learning you should usually have at least two sets of data, a training set and a testing set.  Machine learning is also usually broken down into two main forms: supervised and unsupervised learning.  These then break out into three common types of machine learning problems.  Underneath supervised learning we have classification and regression problems, and underneath unsupervised learning we have clustering problems.  There’s a handy infographic I found to represent this.

As scikit-learn puts it, “Machine learning is about learning some properties of a data set and then testing those properties against another data set.”  In this way, we can define our two data sets.  Our training set is the data the computer learns those properties from, and our testing set is the data we ask it to predict or classify using the properties it learned.
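To make the two sets concrete, here is a minimal sketch of splitting one dataset into a training set and a testing set. It assumes scikit-learn's train_test_split helper and the built-in digits dataset; the 25% test size and the random_state value are arbitrary choices of mine, not anything special.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold back 25% of the samples as the testing set.
# test_size and random_state are arbitrary choices for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# The training set is what the model learns from,
# the testing set is what we check its predictions against.
print(X_train.shape, X_test.shape)
```

Shuffling before splitting (which train_test_split does by default) helps make sure both sets contain a mix of every digit.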

Now we can move on to the two main types of machine learning, supervised and unsupervised learning.  Supervised learning is defined as a problem in which the data comes with additional attributes (labels or target values) that we want to predict.  We feed the program a training set where those attributes are known, then give it a testing set with the attributes hidden and task it with predicting them.

Underneath supervised learning, we have classification and regression.  Classification is when we feed the program a set of already labelled data and use that as our training set.  We then feed the program some unlabelled data, and have it predict which label each item should get based on our labelled training set.  In my previous post this is what I was doing with handwriting recognition.  Regression is feeding the program a set of data where the value we want is a continuous variable, and having it learn the relationship between the input variables and the results observed so it can predict that value for new inputs.  This task is a bit weird to envision, but I find I can understand it better if I think of an example.  The one that makes the most sense to me is a set of data with three salmon variables: length, age, and weight.  A regression problem using this data would be having the computer predict the length of a salmon based on its age and weight.
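Here is a rough sketch of what the salmon regression could look like. The measurements are made up by me purely for illustration, and I'm assuming scikit-learn's LinearRegression as the model; a real salmon dataset might call for something different.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up salmon measurements, for illustration only.
# Each row is one salmon: [age in years, weight in kg]
features = np.array([[1, 0.5], [2, 1.5], [3, 3.0], [4, 4.5], [5, 6.0]])
# The value we want to predict: length in cm
lengths = np.array([30.0, 45.0, 60.0, 70.0, 80.0])

# Fit a linear model that maps (age, weight) to length
model = LinearRegression()
model.fit(features, lengths)

# Predict the length of a hypothetical salmon: 3.5 years old, 3.8 kg
predicted = model.predict(np.array([[3.5, 3.8]]))
print(predicted)
```

The key difference from classification is that the output is a number on a continuous scale rather than one of a fixed set of labels.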

Unsupervised learning is defined as a problem in which our training set consists of a set of input values, but no corresponding target values.  This means our program will be finding common factors in the data and reacting based on the absence or presence of them.  A common approach to this is clustering, in which you feed the computer a set of data, and it separates the data into groups that share similar characteristics.
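As a small sketch of clustering, here is k-means grouping a handful of points without being told any labels. The points are made up for illustration, and both the choice of KMeans and the number of clusters (2) are my assumptions; other clustering algorithms would work too.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two loose groups, for illustration only.
points = np.array([
    [1.0, 1.1], [0.9, 1.3], [1.2, 0.8],
    [8.0, 8.2], [7.9, 7.7], [8.3, 8.1],
])

# Ask for two clusters; note that no target values are passed in.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Points in the same group get the same cluster number.
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a number.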

I hope this clarifies some of the things from my last post on classification that might have been a bit unclear. Feel free to leave a comment if you would like any clarification or if I made an error somewhere.

Thanks for reading and have a wonderful day!
~ Corbin

 

Diving into Machine Learning

Hello all!

So lately I’ve been messing with machine learning because I’ve always been interested in it and it’s just very cool to me.  I’d like to talk a bit about what I’ve been doing and struggling with and show some examples. I will be working with scikit-learn for Python, which comes with a few standard datasets. Iris and Digits are for classification, and Boston House Prices is for regression. Simply put, classification is identifying which category something belongs to, like recognizing which number a handwritten digit is, and regression is predicting a continuous value, which in the simplest case means finding a line of best fit for a dataset.  I still have a lot to learn about sklearn and machine learning in general, but I find it really interesting nonetheless and thought you guys would too.
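For a quick look at what these datasets contain, here is a small sketch that loads the two classification datasets and prints their shapes. I've left Boston House Prices out of the sketch because newer scikit-learn releases no longer include its loader.

```python
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

# .data holds one row per sample and one column per feature
print(iris.data.shape)      # 150 flowers, 4 measurements each
print(digits.data.shape)    # 1797 digits, 64 pixel values each

# .images holds the same digits as 8x8 grids instead of flat rows
print(digits.images.shape)
```

The 64 features in digits.data are just the 8x8 image flattened into a single row, which is the form the estimator expects.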

So my code begins with the import of a bunch of libraries.  The only ones I use in my example here are sklearn and matplotlib, the others are simply either dependencies or libraries I plan to use in the future.

import sklearn
from sklearn import datasets
import numpy as np
import pandas as pd
import quandl
import matplotlib.pyplot as plt
from sklearn import svm

In this import, sklearn is the main library I’m using to fit my data and predict things, and sklearn.datasets provides standard datasets such as Iris, Digits, and Boston House Prices.  I don’t know much about sklearn.svm, but I do know that it provides support vector machines, which learn how to separate the training data into classes, so when we input testing data it can determine what number we have written. NumPy is a science / math library that adds support for large multidimensional arrays and matrices. Pandas is a library for data analysis. Quandl is a financial data library that lets me pull a lot of data that I can use for linear regression in the future. And matplotlib and its sub-library pyplot allow me to display the handwriting data.
So far my code for the recognition looks like this:

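# Load the handwritten digits dataset that ships with scikit-learn
digits = datasets.load_digits()

# Train on every digit except the last one
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(digits.data[:-1], digits.target[:-1])
# Predict the held-out last digit (an interactive session displays the result)
clf.predict(digits.data[-1:])
# Draw the held-out digit so we can check the prediction by eye
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()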

Although my understanding is rudimentary, I can explain a little bit of what this does. clf is our estimator, which is the actual machine that is learning, and that is what we pass our training data through with clf.fit().  clf.fit() lets us pass data into the SVM that clf was built from, and it trains our machine to know what the numbers should look like.  I am passing all digits except for the last one through this function, because we will be testing with the last one.  We then pass that last digit, a known handwritten 8, through clf using clf.predict().  Our object clf then outputs array([8]), which means that it has predicted our inputted number as 8.  To determine if it was correct, we can print out digits.target[-1:] or look at the image itself. We do the latter using our 3 lines from matplotlib that create the figure, draw the image, and then show it. The figure we get is this:
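Checking a single digit is fun, but a rough sketch of a slightly bigger check looks like this. It uses the same estimator settings as above, but holds back the last 100 digits instead of just one (100 is an arbitrary number I picked) and uses clf.score() to report the fraction predicted correctly.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Hold back the last 100 digits for testing (an arbitrary choice for this sketch)
n_test = 100
X_train, y_train = digits.data[:-n_test], digits.target[:-n_test]
X_test, y_test = digits.data[-n_test:], digits.target[-n_test:]

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X_train, y_train)

# Compare the first few predictions against the true labels
predicted = clf.predict(X_test)
print("predicted:", predicted[:10])
print("actual:   ", y_test[:10])

# score() returns the fraction of test digits that were classified correctly
accuracy = clf.score(X_test, y_test)
print("accuracy:", accuracy)
```

Testing on more than one sample gives a much better idea of how well the estimator has actually learned the digits.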

It’s a very low resolution, but it’s an 8! I think that this is brilliant, and I definitely need to learn more about what is happening here with my code.  So far I’m learning some of the basic elements like how to fit and predict things, how training and testing sets work, and a lot of the vocabulary that is used when talking about machine learning.  I can now actually talk about things like supervised and unsupervised learning, or classification and regression methods.  Along with this, I’m also learning more about other libraries like matplotlib, and how to write more pythonic (readable) code.

For anyone who wants to try this themselves, there’s a lot of really cool stuff online, but I’m using some of the resources from hangtwenty’s GitHub repo dive-into-machine-learning.  It can be found here: https://github.com/hangtwenty/dive-into-machine-learning Hopefully by my next post I will have built a basic understanding of linear regression and can create some cool examples using it. In that post I will also attempt to give my explanation of how fitting, predicting, and training actually work.

Thanks for reading and have a wonderful day!
~ Corbin