PDP 5

Hey all! This update is coming in a little late, since I’ve been working through a bunch of interviews and projects for companies recently. After I get the definitive yes/no from those companies, I’ll see if I can post the projects here, but until then, here’s this week’s update.

This week, I learned in depth how to combat overfitting and faulty initialization, how to preprocess data, and a few state-of-the-art learning-rate and gradient descent rules (including AdaGrad, RMSProp, and Adam). I also read some original ML research and got started on my ML “Hello World”: the MNIST problem.

The section on overfitting was thorough and explained the subject well, but the bit on initialization left me questioning a few things. For example, why do we use a sigmoid activation function when so much of its range is unhelpful: nearly linear around 0.5, and saturated (practically flat) as it approaches 0 or 1? Well, the answer, from the cutting-edge research, seems to be “we shouldn’t”. Xavier Glorot and Yoshua Bengio’s paper, Understanding the difficulty of training deep feedforward neural networks, explored a number of activation functions and found that the sigmoid was one of the least useful, compared to the hyperbolic tangent and the softsign. To quote the paper, “We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. […] We find that a new non-linearity that saturates less can often be beneficial.”
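
If you want to see that saturation for yourself rather than take the paper’s word for it, here’s a minimal numpy sketch (my own, not from the course or the paper) comparing the three activations and their gradients. Notice how tiny the sigmoid’s gradient gets once the input is a few units away from zero.

```python
import numpy as np

# Three of the activations discussed in Glorot & Bengio, plus their derivatives,
# so you can see where each one saturates (i.e. where the gradient goes to ~0).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    return x / (1.0 + np.abs(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

print("sigmoid  ", sigmoid(x))     # squashes into (0, 1); mean around 0.5, not 0
print("tanh     ", np.tanh(x))     # squashes into (-1, 1); zero-centered
print("softsign ", softsign(x))    # like tanh, but approaches +/-1 more slowly

# Gradients: the sigmoid's is already ~0.007 at |x| = 5, which is the
# saturation problem the paper describes.
print("sigmoid' ", sigmoid(x) * (1 - sigmoid(x)))
print("tanh'    ", 1 - np.tanh(x) ** 2)
print("softsign'", 1 / (1 + np.abs(x)) ** 2)
```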

Within the course I’m using, section 41 deals with the state-of-the-art gradient descent rules. It’s exceedingly math-heavy, and took me a while to get through and understand. I found it helpful to copy down on paper all the relevant formulas, label the variables, and explain in short words what the different rules were for. Here’s part of a page of my notes.
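
To give a flavor of what those formulas look like once you translate them off paper, here’s a small sketch of the Adam update rule (the standard textbook formulation, not copied from the course), with comments marking which pieces come from the momentum and RMSProp/AdaGrad ideas:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are running averages of the gradient and the
    squared gradient; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (RMSProp/AdaGrad-like)
    m_hat = m / (1 - beta1 ** t)                 # correct the bias from starting
    v_hat = v / (1 - beta2 ** t)                 # both averages at zero
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([3.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
print(w)  # should end up near [0, 0]
```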

I did teach myself enough calculus to understand the concepts of instantaneous rate of change and partial derivative, which is all I’ve needed so far for ML. Here’s the PDF I learned from, and the one I’ll return to if I need to learn more.
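
The one concept that really matters for the backpropagation math is the partial derivative: the rate of change with respect to one variable while holding the others fixed. A toy numerical check of that idea (my own example, not from the PDF):

```python
# Partial derivatives by finite differences: nudge one input, hold the rest fixed.
def f(x, y):
    return x ** 2 * y + 3 * y          # toy function of two variables

def partial_x(x, y, h=1e-6):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(x, y, h=1e-6):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

# Analytically: df/dx = 2xy = 12 and df/dy = x^2 + 3 = 7 at (x, y) = (2, 3).
print(partial_x(2.0, 3.0))  # ~12.0
print(partial_y(2.0, 3.0))  # ~7.0
```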

The sections on preprocessing weren’t difficult to understand, but they did gloss over a decent amount of the detailed process, and I anticipate a few minor difficulties when I start actually trying to preprocess real data. The part I don’t anticipate any trouble with is deciding when to use binary versus one-hot encoding: they explain that bit relatively well. (Binary encoding numbers the categories sequentially, converts each number to binary, and stores each bit in its own variable, so N categories only need about log2(N) columns. One-hot encoding gives each item a vector as long as the list of categories, with a 1 in the spot for its category and 0s everywhere else. You’d use binary encoding for large numbers of categories, but one-hot encoding for smaller numbers; there’s a quick sketch of both below.)
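
Here’s that sketch, with a made-up four-category example (mine, not the course’s):

```python
# Toy example: encode 4 categories both ways.
categories = ["red", "green", "blue", "yellow"]
index = {c: i for i, c in enumerate(categories)}   # sequential ordering: 0..3

def one_hot(label):
    vec = [0] * len(categories)                    # one slot per category
    vec[index[label]] = 1
    return vec

def binary(label, bits=2):                         # 2 bits cover 4 categories
    i = index[label]
    return [(i >> b) & 1 for b in reversed(range(bits))]

print(one_hot("blue"))   # [0, 0, 1, 0]  -- length grows with the category count
print(binary("blue"))    # [1, 0]        -- length grows with log2 of the count
```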

The last thing I did was get started with MNIST. For anyone who hasn’t heard of it before, the MNIST data set is a large, preprocessed set of images of handwritten digits, and the standard task is to train an ML algorithm to sort them into ten categories (the digits 0–9). I don’t have a lot to say about my process so far, but I’ll have an in-depth update on it next week when I finish it.
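
For anyone who wants to poke at it themselves, here’s roughly what getting started looks like (a minimal sketch using the copy of the data set that ships with TensorFlow, not my finished code):

```python
import tensorflow as tf

# MNIST ships with tf.keras: 60,000 training images and 10,000 test images,
# each a 28x28 grayscale picture of a single digit, paired with its label (0-9).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel values from 0-255 down to 0-1 before feeding them to a network.
x_train, x_test = x_train / 255.0, x_test / 255.0

print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
print(x_test.shape, y_test.shape)    # (10000, 28, 28) (10000,)
```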

PDP 4

This week, I learned about deep learning and neural networks, and I wrote a handful of blog posts relating to concepts I learned last week.

The most poignant of these posts was Language: A Cluster Analysis of Reality. Taking inspiration from Eliezer Yudkowsky’s essay series A Human’s Guide To Words, and pieces of what I learned last week about cluster analyses, I created an abstract comparison between human language and cluster analyses done on n-dimensional reality-space.

Besides this, I started learning in depth about machine learning. I learned about two common loss functions, the L2-norm and cross-entropy. I learned about the concept of deep neural nets: not just the theory, but the practice, all the way down to the math. I figured out what gradient descent is, and I’m getting started with TensorFlow. I’ll have more detail on all of this next week: there’s a lot I still don’t understand, and I don’t want to give a partially misinformed synopsis.
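
As a small taste in the meantime, here’s what those two loss functions look like in numpy (my own sketch, not the course’s notation):

```python
import numpy as np

def l2_loss(predictions, targets):
    """L2-norm loss: sum of squared differences between outputs and targets."""
    return np.sum((predictions - targets) ** 2)

def cross_entropy(probabilities, one_hot_targets, eps=1e-12):
    """Cross-entropy: penalizes confident wrong answers much more heavily."""
    return -np.sum(one_hot_targets * np.log(probabilities + eps))

# A 3-class example: the true class is the second one.
target = np.array([0.0, 1.0, 0.0])
confident_right = np.array([0.05, 0.90, 0.05])
confident_wrong = np.array([0.90, 0.05, 0.05])

print(cross_entropy(confident_right, target))  # ~0.105
print(cross_entropy(confident_wrong, target))  # ~3.0
print(l2_loss(confident_right, target))        # ~0.015
print(l2_loss(confident_wrong, target))        # ~1.72
```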

The most unfortunate part of this week was certainly that fully understanding deep neural networks requires calculus, because a decent portion of the math relies on partial derivatives. I did statistics instead of calculus in high school, since I dramatically prefer probability theory to differential equations, so I don’t have much calculus background, and that put an upper bound on how much of the math I actually got. I think I’ll give myself a bit of remedial calculus over the next week.

The most fortunate part of this week was the discovery of how legitimately useful my favorite book is. Around four or five years ago, I read Rationality: From AI to Zombies. It’s written by a dude who’s big on AI, so naturally it contains rather a lot of material on that subject. When I first read it, I knew absolutely nothing about AI, so I mostly skimmed those parts, except to the extent that I absorbed the fundamental theory by osmosis. However, I’ve recently been rereading Rationality for completely unrelated reasons, and the sections on AI are making a lot more sense to me now. They’re scattered through books 3, 4, and 5: The Machine in the Ghost, Mere Reality, and Mere Goodness.

And the most unexpected part of this week was that I had a pretty neat idea for a project, entirely unrelated to any of this other stuff I’ve been learning. I think I’ll program it in JavaScript over the next week, on top of this current project. It’s not complicated, so it shouldn’t get in the way of any of my higher-priority goals, but I had the idea because I would personally find it very useful. (Needless to say, I’ll be documenting that project on this blog, too.)