Neural Network – notes

This is not yet another "neural network from scratch" post, nor a detailed treatment of neural networks. These are notes I would like to keep from my Coursera days.

NNs are more of a brute-force algorithm than a purely mathematical one like SVM or PCA (except for the partial derivatives in back propagation). Again, everything starts with \(y = mX + C\); since an NN is built from logistic units, we write \(z = Wx + b\), where \(W\) holds the weights and \(b\) is the bias. Most importantly, the activation \(h(z)\) is non-linear: it could be either the sigmoid \(g(z) = \frac{1}{1+e^{-z}}\) or \(\tanh(z)\).

There is a nice joke: “When you torture the data long enough, it will start confessing to everything!” That is exactly what we do in NNs. We pass the data through multiple layers, and at each layer we calculate the sigmoid, also called the activation \(a_j^l\), for each node. The number of nodes in the output layer should be the same as the number of labels (one in the case of binary classification).

Getting to the logic,


  • Layer 1 is the input layer, where each feature is passed to its own node; the number of nodes is equal to the number of features (plus 1 for the bias)
  • Hidden layer(s): no restriction on the number of hidden layers or nodes
  • Output layer: nodes equal to the number of labels, and one in the case of binary classification
  • The shape of each weight matrix is based on the number of input and output nodes of that layer. Say a layer has 3 inputs and 5 outputs, then its weight matrix would be 3×5. Here is a simple example with a 10×3 input (10 samples x 3 features), one hidden layer with 5 nodes, and an output of one node.
    • Input [10,3], 5 nodes in the hidden layer, so w1 is [3,5] and the weights for the output layer are w2 [5,1]. The matrix multiplication of 10×3 by 3×5 yields 10×5, which multiplied by 5×1 gives the output: an array of 10 predictions. Details below
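The shape bookkeeping above is easy to check with NumPy. This is only a sketch with random numbers: the seed and the variable names `hidden` and `output` are my own assumptions, and the activation is left out so that only the shapes are in play.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, an assumption for repeatability

X = rng.random((10, 3))    # 10 samples x 3 features
w1 = rng.random((3, 5))    # input (3 features) -> hidden layer (5 nodes)
w2 = rng.random((5, 1))    # hidden layer (5 nodes) -> output (1 node)

hidden = X @ w1            # (10, 3) @ (3, 5) -> (10, 5)
output = hidden @ w2       # (10, 5) @ (5, 1) -> (10, 1)

print(hidden.shape, output.shape)   # (10, 5) (10, 1)
```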

Activation Functions

  • Activation functions are nonlinear functions that squash the output into a fixed range, 0 to 1 or -1 to 1, like sigmoid, \(\tanh\), etc.
  • The choice is based on the labels. If the labels are 0 and 1 then sigmoid is the better fit, while for -1 and 1, \(\tanh\) is the option.
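To see the two output ranges side by side, here is a small sketch. The sample inputs in `z` are arbitrary values I picked for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])   # arbitrary sample inputs

s = sigmoid(z)    # stays between 0 and 1, and is exactly 0.5 at z = 0
t = np.tanh(z)    # stays between -1 and 1, and is exactly 0 at z = 0

print(s)
print(t)
```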

The flow – Forward propagation, Cost Function and Backward propagation

Forward propagation

  • Initialize the network weights for each layer, based on its input nodes and output nodes, and add the bias.
  • Calculate the activation (sigmoid) of the weighted sum for each node, i.e. \(h(w^T x)\), called \(a\)
  • Use \(a\) of layer 1 as the input and calculate the activation (sigmoid) of the weighted sum for each node in layer 2, \(h(w^T a)\)
  • Repeat until the output layer. This is called forward propagation
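The steps above can be sketched as one small loop over the layers. The function name `forward`, the separate `biases` list (instead of an extra bias node), and the random toy shapes are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Return the activations of every layer, starting with the input itself."""
    activations = [X]
    a = X
    for W, b in zip(weights, biases):
        z = a @ W + b          # weighted sum for every node in this layer
        a = sigmoid(z)         # activation of this layer feeds the next one
        activations.append(a)
    return activations

rng = np.random.default_rng(0)                        # assumed seed
X = rng.random((10, 3))                               # 10 samples x 3 features
weights = [rng.random((3, 5)), rng.random((5, 1))]    # hidden layer, output layer
biases = [np.zeros(5), np.zeros(1)]

acts = forward(X, weights, biases)
print([a.shape for a in acts])   # [(10, 3), (10, 5), (10, 1)]
```

Keeping every layer's activation around is deliberate: back propagation needs them again later.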

Cost Function

It’s the same one used for logistic regression

\[ J = -\frac{1}{n}\sum_{i=1}^{n}\left[ y_{i}\log(a^l_i) + (1-y_{i})\log(1-a^l_i) \right] \]

In the forward propagation we get the predicted values \(a^l\); comparing them with the actual \(y\) gives the cost of the current weights.
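Here is a minimal sketch of that cost in NumPy, assuming the usual logistic-regression form with a leading minus sign and an average over the samples. The function name `cross_entropy`, the `eps` clipping (to avoid `log(0)`), and the toy predictions are my own additions.

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    """Logistic cost between labels y (0 or 1) and predictions a in (0, 1)."""
    a = np.clip(a, eps, 1 - eps)   # keep log() away from 0
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([[0], [1], [1], [0]])
good = np.array([[0.1], [0.9], [0.8], [0.2]])   # predictions close to y
bad = np.array([[0.9], [0.1], [0.2], [0.8]])    # predictions far from y

print(cross_entropy(y, good), cross_entropy(y, bad))
```

Predictions close to the labels give a small cost, predictions far from them give a large one.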

Backward propagation

What we do here is traverse from the output layer to the input layer, and at each layer calculate the update value for the weights. This is again a partial derivative, now through \(h(z)\), so we use the derivative of the activation function; for sigmoid it is \( h'(z) = sigmoid(z)*(1-sigmoid(z))\)
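A quick way to convince yourself of the sigmoid derivative is to compare it with a numerical slope. This is just a sanity-check sketch; `sigmoid_prime`, the grid `z`, and the step `eps` are names and values I chose.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Analytic derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-4, 4, 9)
eps = 1e-6

# Central finite difference as an independent estimate of the slope
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_prime(z)))

print(max_gap)   # tiny, so the two agree
```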

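  • Compute \(h'(z)\) for each node in each layer and update the weights, for example \(w2 += h(z^{l-1})*2*(y-h(z^l))* h'(z^l) \); please note we use the previous layer's sigmoid and the current layer's sigmoid. The factor \(2*(y-h(z^l))\) is the negative derivative of a squared-error cost, which is what the code snippet further down uses.
  • Similarly compute the weight updates for the other layers, except the input layer 😉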
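The update rule above can be written out for a network with one hidden layer and checked against a numerical gradient. This is a sketch under my own assumptions: a squared-error loss (matching the factor 2 in the formula), no separate bias term, and hypothetical names `loss`, `updates`, `delta1`, `delta2` plus random toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w1, w2):
    """Squared error of the network output (assumed loss for this sketch)."""
    a1 = sigmoid(sigmoid(X @ w1) @ w2)
    return np.sum((y - a1) ** 2)

def updates(X, y, w1, w2):
    """Weight updates (negative gradients) for both layers."""
    layer1 = sigmoid(X @ w1)                           # hidden activations
    a1 = sigmoid(layer1 @ w2)                          # output activations
    delta2 = 2 * (y - a1) * a1 * (1 - a1)              # output layer term
    delta1 = (delta2 @ w2.T) * layer1 * (1 - layer1)   # pushed back to hidden layer
    return X.T @ delta1, layer1.T @ delta2             # dw1, dw2

rng = np.random.default_rng(0)   # assumed seed
X = rng.random((4, 3))
y = np.array([[0.0], [1.0], [1.0], [0.0]])
w1 = rng.random((3, 5))
w2 = rng.random((5, 1))
dw1, dw2 = updates(X, y, w1, w2)

# Finite-difference slope of the loss with respect to one weight of w2
eps = 1e-6
w2_shift = w2.copy()
w2_shift[0, 0] += eps
numeric = (loss(X, y, w1, w2_shift) - loss(X, y, w1, w2)) / eps

print(dw2[0, 0], -numeric)   # the update is the negative of the slope
```

Because the update is the negative gradient, adding it to the weights moves them downhill on the loss.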

Putting it all together in a loop!

initialize network weights
do until epochs=10000:
       forward propagation;
       backward propagation;
       calculate cost;
       update weights;

Here is the code snippet

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x is expected to be a sigmoid output already, so this is s * (1 - s)
    return x * (1 - x)

X = np.array([[0,0,1],
y = np.array([[0],[1],[1],[0],[0]])
# for a 3 layer network (input, one hidden and output layer) we need two weight matrices
# X is 5 x 3 and y has 2 labels; let's use a hidden layer with 5 nodes
nodes = 5
epochs = 10000
n = X.shape[1]  # number of input features
w1 = np.random.rand(n, nodes)
w2 = np.random.rand(nodes, 1)
for i in range(epochs):
    # forward propagation
    layer1 = sigmoid(np.dot(X, w1))
    a1 = sigmoid(np.dot(layer1, w2))
    # back propagation
    dw2 = np.dot(layer1.T, (2*(y - a1) * sigmoid_derivative(a1)))
    dw1 = np.dot(X.T, (np.dot(2*(y - a1) * sigmoid_derivative(a1), w2.T)
                       * sigmoid_derivative(layer1)))
    w1 += dw1
    w2 += dw2
print(a1)  # actual result. we can do predict with w1 and w2 as well

epochs = 10000
array([[0.0068427 ], [0.99415196], [0.99178343], [0.00575189], [0.00715451]])

epochs = 100000
array([[1.87429403e-03], [9.98297263e-01], [9.98299114e-01], [1.84753683e-03], [1.58679526e-04]])

Yeah! It works! With more epochs, the predictions get much closer to the actual labels.
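Since prediction only needs `w1` and `w2`, here is a self-contained sketch of training followed by a `predict` call. Please note the assumptions: the toy data is hypothetical (not the data used for the results above), the seed is fixed for repeatability, and `predict` and `squared_error` are helper names I made up.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(X_new, w1, w2):
    """Forward pass only: hidden activations, then the output activation."""
    return sigmoid(sigmoid(X_new @ w1) @ w2)

def squared_error(y, a):
    return np.sum((y - a) ** 2)

# Hypothetical toy data (an assumption, not the data from this post):
# the label copies the first feature; the third column acts as a bias input.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [0], [1], [1]], dtype=float)

rng = np.random.default_rng(0)        # fixed seed, also an assumption
w1 = rng.random((X.shape[1], 5))      # 3 inputs -> 5 hidden nodes
w2 = rng.random((5, 1))               # 5 hidden nodes -> 1 output

initial_loss = squared_error(y, predict(X, w1, w2))
for _ in range(10000):
    # forward propagation
    layer1 = sigmoid(X @ w1)
    a1 = sigmoid(layer1 @ w2)
    # back propagation (both terms use the weights from before the update)
    delta2 = 2 * (y - a1) * a1 * (1 - a1)
    delta1 = (delta2 @ w2.T) * layer1 * (1 - layer1)
    w2 += layer1.T @ delta2
    w1 += X.T @ delta1
final_loss = squared_error(y, predict(X, w1, w2))

print(initial_loss, final_loss)
print(predict(np.array([[1.0, 0.0, 1.0]]), w1, w2))   # prediction for one new row
```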