This is not yet another Neural Network from scratch post, nor a detailed treatment of Neural Networks. These are more like notes I would like to recollect from my Coursera days.

NNs are more of a brute-force algorithm than a purely mathematical one like SVM or PCA (except for the partial derivatives in back propagation). Still, everything starts with \(y=mX+C\); since an NN is essentially a logistic model, we use \(y = Wx + b\), where \(W\) is the weights vector and \(b\) the bias, and most importantly \(h(z)\) is non-linear, either the sigmoid \(g(z) = \frac 1 {1+e^{-z}}\) or \(tanh(z)\).
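Here is a minimal single-neuron sketch of that idea. The values are made up, just to show the \(Wx + b\) plus sigmoid pattern:

```
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # one sample with 3 features
W = np.array([0.1, 0.4, -0.3])   # one weight per feature
b = 0.2                          # bias

z = np.dot(W, x) + b             # linear part, like y = mX + C
a = sigmoid(z)                   # non-linear activation, squashed into (0, 1)
print(z, a)
```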

There is a nice joke, “When you torture the data long enough, it will start confessing to everything!”, and that’s exactly what we do in NNs. We pass the data through multiple layers, and at each layer we compute the sigmoid, also called the activation \(a_j^l\), for each node. The number of nodes in the output layer should match the number of labels (one in the case of binary classification).

Getting to the logic,

**Model**

- Layer 1 is the input layer, where each feature is passed through a node; the number of nodes equals the number of features (plus 1 for bias)
- Hidden layer(s), no restriction on number of hidden layers or nodes
- Output layer, with nodes equal to the number of labels, and one in the case of binary classification
- The weight matrix for each layer is sized by that layer’s input and output nodes. If a layer has 3 inputs and 5 outputs, its weight matrix is 3×5. Here is a simple example for a 10×3 input (10 samples x 3 features), one hidden layer with 5 nodes, and a single output node.
- Input [10, 3] and 5 nodes in the hidden layer means w1 is [3, 5] and the layer 2 weights w2 are [5, 1]; the matrix multiplication 10×3 by 3×5 yields 10×5, which multiplied by 5×1 gives an array of 10 predictions. A quick shape check follows below.
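A quick shape check for that 10×3 example (random values, only the shapes matter here):

```
import numpy as np

X = np.random.rand(10, 3)        # 10 samples x 3 features
w1 = np.random.rand(3, 5)        # input layer -> hidden layer (5 nodes)
w2 = np.random.rand(5, 1)        # hidden layer -> output layer (1 node)

hidden = np.dot(X, w1)           # (10, 3) x (3, 5) -> (10, 5)
output = np.dot(hidden, w2)      # (10, 5) x (5, 1) -> (10, 1)
print(hidden.shape, output.shape)   # (10, 5) (10, 1)
```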

**Activation Functions**

- Activation functions are non-linear functions whose output is squashed to 0 to 1, or -1 to 1, like sigmoid, \(tanh\), etc.
- The choice depends on the labels: if the labels are 0 and 1, sigmoid is better, while for -1 to 1, \(tanh\) is the option (see the small check below).
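A small sanity check of those output ranges:

```
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # all values fall in (0, 1)
print(np.tanh(z))   # all values fall in (-1, 1)
```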

**The flow – Forward propagation, Cost Function and Backward propagation**

**Forward propagation**

- Initialize network weights for each layer, based on input nodes and output nodes, add bias.
- Calculate the activation (sigmoid of the weighted sum) for each node, i.e. \(a = h(w^T x)\)
- Use the \(a\) of Layer 1 and calculate the activation (sigmoid of the weighted sum) for each node in Layer 2, \(h(w^T a)\)
- Repeat until the output layer. This is called Forward propagation (a minimal sketch follows below)
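A minimal sketch of these steps for one hidden layer, assuming random weights and made-up data (the full snippet at the end does the same thing inside a training loop):

```
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.random.rand(10, 3)                            # 10 samples, 3 features
w1, b1 = np.random.rand(3, 5), np.random.rand(5)     # layer 1 weights and bias
w2, b2 = np.random.rand(5, 1), np.random.rand(1)     # layer 2 weights and bias

a1 = sigmoid(np.dot(X, w1) + b1)    # activations of the hidden layer
a2 = sigmoid(np.dot(a1, w2) + b2)   # activations of the output layer = predictions
print(a2.shape)                     # (10, 1)
```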

**Cost Function**

It’s the same one used for logistic regression:

\[ J = -\sum_{i=1}^{n}\left[\,y_{i}\log(a^l_i) + (1-y_{i})\log(1-a^l_i)\,\right] \]

In forward propagation we get the predicted values \(a^l\), and together with the actual \(y\) we get the cost for the current weights.
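As a quick numerical illustration, the same cost computed with numpy on made-up labels and predictions:

```
import numpy as np

y = np.array([0, 1, 1, 0, 0])              # actual labels
a = np.array([0.1, 0.9, 0.8, 0.2, 0.05])   # predicted activations from forward propagation

J = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))
print(J)   # smaller J means the predictions are closer to the labels
```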

**Backward propagation**

What we do here is traverse from the output layer to the input layer, and at each layer we calculate the update value for the weights. Again this uses partial derivatives of \(h(z)\): we need the derivative of the activation function, which for the sigmoid is \( h'(z) = sigmoid(z)(1-sigmoid(z))\).

- Compute \(h'(z)\) for each node and each layer, then update the weights, for example \(w_2 \mathrel{+}= h(z^{l-1}) \cdot 2\,(y - h(z^l)) \cdot h'(z^l)\); note that we use the previous layer’s sigmoid and the current layer’s sigmoid (see the sketch after this list)
- Similarly compute the weight updates for the other layers, except the input layer 😉
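Here is a rough sketch of that update for the output-layer weights, using the same shapes as the snippet at the end (it mirrors the dw2 line there; the data is made up):

```
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # a is already sigmoid(z), so this is sigmoid(z) * (1 - sigmoid(z))
    return a * (1 - a)

X = np.random.rand(5, 3)
y = np.random.randint(0, 2, size=(5, 1))
w1, w2 = np.random.rand(3, 5), np.random.rand(5, 1)

a1 = sigmoid(np.dot(X, w1))    # previous layer sigmoid
a2 = sigmoid(np.dot(a1, w2))   # current layer sigmoid

# gradient step for w2: previous activations x error x derivative of current activation
dw2 = np.dot(a1.T, 2 * (y - a2) * sigmoid_derivative(a2))
w2 += dw2
```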

Putting it all together in a loop!

- initialize network weights
- do until epochs = 10000:
    - forward propagation
    - backward propagation
    - calculate cost
    - update weights
- end

Here is the code snippet

```
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x here is already sigmoid(z), so this is sigmoid(z) * (1 - sigmoid(z))
    return x * (1 - x)

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([[0], [1], [1], [0], [0]])

# For a 3-layer network (input, one hidden and output layer) we need two weight matrices.
# X is 5 x 3 and y is a single binary label; let's use a hidden layer with 5 nodes.
nodes = 5
epochs = 10000
m, n = X.shape
w1 = np.random.rand(n, nodes)   # 3 x 5
w2 = np.random.rand(nodes, 1)   # 5 x 1

for i in range(epochs):
    # forward propagation
    layer1 = sigmoid(np.dot(X, w1))
    a1 = sigmoid(np.dot(layer1, w2))
    # back propagation
    dw2 = np.dot(layer1.T, 2 * (y - a1) * sigmoid_derivative(a1))
    dw1 = np.dot(X.T, np.dot(2 * (y - a1) * sigmoid_derivative(a1), w2.T)
                 * sigmoid_derivative(layer1))
    w1 += dw1
    w2 += dw2

print(a1)  # trained predictions; we can also predict on new data with w1 and w2
```

Output of `print(a1)` for two epoch settings:

```
epochs = 10000
array([[0.0068427 ],
       [0.99415196],
       [0.99178343],
       [0.00575189],
       [0.00715451]])

epochs = 100000
array([[1.87429403e-03],
       [9.98297263e-01],
       [9.98299114e-01],
       [1.84753683e-03],
       [1.58679526e-04]])
```

Yeah! It works! With more epochs, the predictions get much closer to the actual labels.