PDP 5

Hey all! This update is coming in a little late, since I’ve been working through a bunch of interviews and projects for companies recently. After I get the definitive yes/no from those companies, I’ll see if I can post the projects here, but until then, here’s this week’s update.

This week, I learned in depth how to combat overfitting and faulty initialization, preprocess data, and a few state-of-the-art learning rate and gradient descent rules (including AdaGrad, RMSProp, and Adam). I also read some original ML research, and got started on doing my ML “Hello World”: the MNIST problem.

The section on overfitting was complete and explained the subject well, but the bit on initialization left me questioning a few things. For example, why do we use a sigmoid activation function if a lot of the possible values it can take (near 1, 0, and 0.5) are practically linear? Well, the answer, from the cutting-edge research, seems to be “we shouldn’t”. Xavier Glorot’s paper, Understanding the difficulty of training deep feedforward neural networks, explored a number of activation functions, and found that the sigmoid was one of the least useful, compared to the hyperbolic tangent and the softsign. To quote the paper, “We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. […] We find that a new non-linearity that saturates less can often be beneficial.”

Within the course I’m using, section 41 deals with the state-of-the-art gradient descent rules. It’s exceedingly math-heavy, and took me a while to get through and understand. I found it helpful to copy down on paper all the relevant formulas, label the variables, and explain in short words what the different rules were for. Here’s part of a page of my notes.

I did teach myself enough calculus to understand the concepts of instantaneous rate of motion and partial derivative, which is all I’ve needed so far for ML. Here was the PDF I learned from, and which I will return to if I need to learn more.

The sections on preprocessing weren’t difficult to understand, but they did gloss over a decent amount of the detailed process, and I anticipate having a few minor difficulties when I start actually trying to preprocess real data. The part I don’t anticipate having any trouble with is deciding when to use binary versus one-hot encoding: they explain that bit relatively well. (Binary encoding involves sequential ordering of the categories, then converting those categories to binary and storing the 1 or 0 in an individual variable. One-hot encoding involves giving each individual item a 1 in a specific spot along a long list of length corresponding to the number of categories. You’d use binary encoding for large numbers of categories, but one-hot encoding for smaller numbers.)

The last thing I did was get started with MNIST. For anyone who hasn’t heard of it before, the MNIST data set is a large, preprocessed set of handwritten digits which can be categorically sorted by an ML algorithm into ten categories (the digits 0-9). I don’t have a lot to say about my process for doing this circa this week, but I’ll have an in-depth update on it next week when I finish it.

PDP 4

This week, I learned about deep learning and neural networks, and I wrote a handful of blog posts relating to concepts I learned last week.

The most poignant of these posts was Language: A Cluster Analysis of Reality. Taking inspiration from Eliezer Yudkowsky’s essay series A Human’s Guide To Words, and pieces of what I learned last week about cluster analyses, I created an abstract comparison between human language and cluster analyses done on n-dimensional reality-space.

Besides this, I started learning in depth about machine learning. I learned about the common loss functions, L2-norm and cross-entropy. I learned about the concept of deep neural nets: not just the theory, but the practice, all the way down to the math. I figured out what gradient descent is, and I’m getting started with TensorFlow. I’ll have more detail on all of this next week: there’s a lot I still don’t understand, and I don’t want to give a partially misinformed synopsis.

The most unfortunate part of this week was certainly that in order to fully understand deep neural networks, you need calculus, because a decent portion of the math relies on partial derivatives. I did statistics instead of calculus in high school, since I dramatically prefer probability theory to differential equations, so I don’t actually have all that much in the way of calculus, and there was an upper bound on how much of the math I actually got. I think that I’ll give myself a bit of remedial calculus in the next week.

The most fortunate part of this week was the discovery of how legitimately useful my favorite book is. Around four or five years ago, I read Rationality: From AI to Zombies. It’s written by a dude who’s big on AI, so obviously it contains rather a lot referencing that subject. When I first read it, I knew absolutely nothing about AI, so I just kind of skimmed over it, except to the extent that I was able to understand the fundamental theory by osmosis. However, I’ve been recently rereading Rationality for completely unrelated reasons, and the sections on AI are making a lot more sense to me now. The sections on AI are scattered through books 3, 4, and 5: The Machine in the Ghost, Mere Reality, and Mere Goodness.

And the most unexpected part of this week was that I had a pretty neat idea for a project, entirely unrelated to any of this other stuff I’ve been learning. I think I’ll program it in JavaScript over the next week, on top of this current project. It’s not complicated, so it shouldn’t get in the way of any of my higher-priority goals, but I had the idea because I would personally find it very useful. (Needless to say, I’ll be documenting that project on this blog, too.)

Language: A Cluster Analysis of Reality

Cluster analysis is the process of quantitatively grouping data in such a way that observations in the same group are more similar to each other than to those in other groups. This image should clear it up.

Whenever you do a cluster analysis, you do it on a specific set of variables: for example, I could cluster a set of customers against the two variables of satisfaction and brand loyalty. In that analysis, I might identify four clusters: (loyalty:high, satisfaction:low), (loyalty:low, satisfaction:low), (loyalty:high, satisfaction:high), and (loyalty:low, satisfaction:high). I might then label these four clusters to identify their characteristics for easy reference: “supporters”, “alienated”, “fans” and “roamers”, respectively.

What does that have to do with language?

Let’s take a word, “human”. If I define “human” as “featherless biped”, I’m effectively doing three things. One, I’m clustering an n-dimensional “reality-space”, which contains all the things in the universe graphed according to their properties, against the two variables ‘feathered’ and ‘bipedal’. Two, I’m pointing to the cluster of things which are (feathered:false, bipedal:true). Three, I’m labeling that cluster “human”.

This, the Aristotelian definition of “human”, isn’t very specific. It’s only clustering reality-space on two variables, so it ends up including some things that shouldn’t actually belong in the cluster, like apes and plucked chickens. Still, it’s good enough for most practical purposes, and assuming there aren’t any apes or plucked chickens around, it’ll help you to identify humans as separate from other things, like houses, vases, sandwiches, cats, colors, and mathematical theorems.

If we wanted to be more specific with our “human” definition, we could add a few more dimensions to our cluster analysis—add a few more attributes to our definition—and remove those outliers. For example, we might define “human” as “featherless bipedal mammals with red blood and 23 pairs of chromosomes, who reproduce sexually and use syntactical combinatorial language”. Now, we’re clustering reality-space against seven dimensions, instead of just two, and we get a more accurate analysis.

Despite this, we really can’t create a complete list of all the things that most real categories have in common. Our generalizations are leaky in some way, around the edges: our analyses aren’t perfect. (This is absolutely the case with every other cluster analysis, too.) There are always observations at the edges that might be in any number of clusters. Take a look at the graph above in this post. Those blue points at the top left edge, should they really be blue, or red or green instead? Are there really three clusters, or would it be more useful to say there are two, or four, or seven?

We make these decisions when we define words, too. Deciding which cluster to place an observation happens all the time with colors: is it red or orange, blue or green? Splitting one cluster into many happens when we need to split a word in order to convey more specific meaning: for example, “person” trisects into “human”, “alien”, and “AI”. Maybe you could split the “person” cluster even further than that. On the other end, you combine two categories into one when sub-cluster distinctions don’t matter for a certain purpose. The base-level category “table” substitutes more specific terms like “dining table” and “kotatsu” when the specifics don’t matter.

You can do a cluster analysis objectively wrong. There is math, and if the math says you’re wrong, you’re wrong. If your WCSS is so high that you have a cluster that you can’t label more distinctly than “everything else”, or if it’s so low you’ve segregated your clusters beyond the point of usefulness, then you’ve done it wrong.

Many people think “you can define a word any way you like”, but this doesn’t make sense. Words are cluster analyses of reality-space, and if cluster analyses can be wrong, words can also be wrong.


This post is a summary of / is based on Eliezer Yudkowsky’s essay sequence, “A Human’s Guide to Words“.

PDP 3

This week, I went even further in depth into doing statistical analyses in Python. I learned how to do logistic regressions and cluster analyses using k-means. I got a refresher on linear algebra, then used it to learn about the NumPy data type “ndarray”.

Logistic regressions are a bit complicated. The course I used explains it in a kind of strange way, which probably didn’t help. Fortunately, my mom knows a decent amount about statistical analyses (she used to be a researcher), so she was able to clear things up for me.

You do a logistic regression on a binary dependent variable. It ends up looking like a stretched-out S, either forwards or backwards. Data points are graphed on one of two lines, either y=0 or y=1. The regression line basically demonstrates a probability: how likely is it that you’ll pass an exam, given a certain number of study hours? How likely is it that you’ll get admitted to a college, given a certain SAT score? Practically, we care most about the tipping point, 50% probability, or y=0.5, and what values fall above and below that tipping point.

This can be slightly confusing since regression lines (or curves, for nonlinear regressions) usually predict values, but since there are only two possible values for a binary variable, the logistic regression line predicts a probability that a certain value will occur.

After I finished that, I moved on to K-means clustering, which is actually surprisingly easy. You randomly generate a number of centroids (generic term for the center of something, be it a line, polygon, cluster, etc.) corresponding to the number of clusters you want, and you assign points to centroids based on least Euclidean distance, move the centroids to the center of those new clusters, then assign the points to the centroids a second time.

Linear algebra is a little harder to understand, especially if your intuition isn’t visual like mine is. In essence, the basic object of linear algebra is a “tensor”, of which all other objects are types. A “scalar” is just an ordinary integer; a “vector” is a one-dimensional list of integers, and a “matrix” is a two-dimensional plane of integers, or a list of lists. These are tensors of type 0, 1, and 2, respectively. There are also tensors of type 3, which have no special name, as well as higher-order types.

I learned some basic linear algebra in school, but I figured it was a bit pointless. As it turns out though, linear algebra is incredibly useful for creating fast algorithms for multivariate algorithms, with many variables, many weights, and many constants. If you use standard integers (scalars) only, you’d need a formula like:
y1 + y2 + … + yk = (w1x1 + w2x2 + … + wkxk) + (b1 + b2 + … + bk).
But if you let all relevant variables be tensors, you can simplify that formula to:
y = wx + b

There are a handful of other awesome, useful ways to implement tensors. For example, image recognition. In order to represent an image as something the computer can do stuff with, we have to turn it into numbers. A type 3 tensor of the form 3xAxB, where AxB is the pixel dimension of the image in question, works perfectly. (Why use a third dimension of 3? Because images are commonly represented using the RGB, or Red/Green/Blue, color schema. In this, every color is represented with different values of R/G/B, between 0 and 255.)

Tensors, in the context of NumPy, which has a specific object type which is designed to handle them, are implemented using “ndarray”, or n-dimensional array. They’re not difficult to implement, and the notation is for once pretty straightforward. (It’s square brackets, similar to the mathematical notation.)

This should teach me to think of mathematical concepts as “pointless”. Computers think in math, so no matter how esoteric or silly the math seems, it’s part of how the computer thinks and I should probably learn it, for the same reasons I’ve devoted a lot of time to learning about all humans’ miscellaneous cognitive biases.

I’ve asked a handful of the statisticians I know if they wouldn’t mind providing some data for me to do some analyses of, since that would be a neat thing to do. But if I don’t do that, this coming week I’ll be learning in depth about AI, which my brain is already teeming with ideas for projects on. I’ve loved AI for a long time, and I’ve known how it works in theory for ages, but now I get to actually make one myself! I’m excited!

PDP 2

This week, I rehashed all the basics of Python. Since I haven’t studied it at all in ten years, this was a very useful refresher. (Basically, it seems to me that Python is essentially Java structure with something like JavaScript syntax. This is a huge oversimplification, but hey, it’s an extremely high-level language that I’m using it in an object-oriented way for this purpose. There are demonstrable similarities.)

The course I’m currently using doesn’t go over Python in any great detail, so if you’d like to supplement the Python they teach you, or you’d like to add to your knowledge of the language (since this course teaches only a very limited scope of Python), I highly recommend Learn Python The Hard Way. Python was my first programming language ever, and this was the course I used. It gives you a solid grasp of not just Python but how programming works in general.

In addition to the general Python refresher, I learned about all the libraries that I’ll need to use it to do data science: namely, NumPy, Pandas, SciPy, StatsModels API, MatPlotLib, Seaborn, and SciKitLearn. In combination, these libraries add methods that can import data from a variety of sources including Excel spreadsheets, conveniently calculate and tabulate relevant statistical data, do a variety of regressions and cluster analyses, and display elegant and understandable graphs.

This week, I learned how to do a simple linear regression (least squares). Next week, I’ll learn how to do multiple regressions and cluster analyses! And after that, the real fun begins with deep learning and AI. I’m looking forward to it!

In the future, expect me to start creating some little projects. I can’t do much with what I’ve learned this week, but by next week, I’ll absolutely have something at least moderately interesting, and I’ll absolutely do a nice write-up for it.

PDP 1

(Professional Development Project. Week 1: first update.)

Over the past year, as I finished up my AS in Computer Science, as I’ve been a participant in the Praxis program, I spent a good deal of time gaining entry-level skills in a variety of technological areas: SQL, systems analysis, web development skills including HTML, CSS, and JavaScript, etc. After all that, I wanted to pick something to focus on for continuing professional development.

As I considered what I should do next, I realized that while I have some decent skills in web development, I have a stronger aptitude in analytical areas. Further, I really enjoy solving problems and doing analyses, so I decided that I would go ahead and start doing that.

Ultimately, I decided on a data science course from Udemy, which I’ve now been working on for a week.

So far, I’ve completed about 130 out of 470 total segments (each includes a lecture and an accompanying quiz). This was basically the first two major sections: an overview including definitions of industry jargon, and an in-depth section on descriptive and inferential statistics. Given that I just finished a class in statistics as one of the final classes for my degree, I was able to skim through the second section.

Besides the refresher on statistics, what I basically learned this week was a lot of jargon and technical terms. I learned the distinctions between analysis and analytics; between business analysis, systems analysis, and data analysis; between neural networks and deep learning.

From here on out, you can expect weekly updates every Sunday, detailing what I’ve done that week. When it becomes applicable, I’ll be doing some coding projects as demonstrations of what I’ve learned by that point. I’ll post those here too, giving each project its own write-up, and I’ll then link back to those project posts at the end of each week in the wrap-up post.