# Reading Notes: Improving the Way Neural Networks Learn

(This article is based on the book *Neural Networks and Deep Learning*, at neuralnetworksanddeeplearning.com. Translated by Neil Zhu, 简书 ID Not_GOD, founder & Chief Scientist of University AI.)

In earlier chapters we chose the initial weights and biases using independent Gaussian random variables, normalized to mean $$0$$ and standard deviation $$1$$. That choice was ad hoc, so we need to look for better ways of setting our network's initial weights and biases; doing so is valuable for speeding up the network's learning.

### Weight initialization

Suppose half of the input neurons are $$0$$ and the other half are $$1$$. The weighted input to a hidden neuron is $$z=\sum_j{w_j x_j}+b$$. Since half of the $$x_j=0$$, $$z$$ is effectively a sum of $$501$$ standardized Gaussian random variables: the $$500$$ surviving weight terms, plus the bias. So $$z$$ itself is Gaussian with mean $$0$$ and standard deviation $$\sqrt{501} \approx 22.4$$. That is a very "wide" distribution:

When $$|z|$$ is large the hidden neuron is saturated: $$\sigma'(z)$$ tends to $$0$$ for $$|z| > 1$$, so gradient descent can barely update the parameters. Earlier we used the cross-entropy cost function to solve the problem of slow learning in the output layer, but that does nothing for saturated hidden layers. Moreover, if the outputs of one hidden layer also follow such a wide Gaussian distribution, the hidden layers after it will saturate in turn.
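The spread of $$z$$, and the saturation it causes, are easy to check numerically. The sketch below (my own illustration, not part of the book's code) samples $$z$$ for the 1000-input, half-active setup described above under the old standard-Gaussian initialization:

```python
import numpy as np

rng = np.random.default_rng(0)

n_on = 500        # half of the 1000 inputs are 1, the rest are 0
trials = 5000

# Old initialization: every weight and the bias ~ N(0, 1).
# z = sum of the 500 active weight terms + bias -> 501 standard Gaussians.
w_sum = rng.normal(0, 1, size=(trials, n_on)).sum(axis=1)
b = rng.normal(0, 1, size=trials)
z = w_sum + b

print("std of z: %.1f" % z.std())          # close to sqrt(501) ~ 22.4
# sigma'(z) = sigma(z)(1 - sigma(z)) is tiny when |z| >> 1,
# so almost every sample lands in the saturated regime.
saturated = np.mean(np.abs(z) > 2)
print("fraction with |z| > 2: %.2f" % saturated)
```

Nearly all sampled neurons start out saturated, which is exactly the slow-learning problem described above.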

Suppose instead that we have a neuron with $$n_{in}$$ input weights, and we initialize those weights as Gaussian random variables with mean $$0$$ and standard deviation $$1/\sqrt{n_{in}}$$, keeping a standardized Gaussian for the bias. Then, with $$500$$ inputs equal to $$0$$ and $$500$$ equal to $$1$$, the new Gaussian distribution for $$z$$ has mean $$0$$ and standard deviation $$\sqrt{3/2}=1.22\ldots$$, as shown below:

Suppose the network has a large number of input neurons, say $$1000$$. And suppose we have already used standardized Gaussians to initialize the weights connecting to the first hidden layer. For now I will concentrate on those connection weights and ignore the rest of the network:

### How to choose hyper-parameters

#### Broad strategy

When first attacking a problem, simplify things so that you can debug quickly (you can't wait ten-odd minutes for a result after every adjustment). I take the same approach when debugging: strip the network and the training data down until each experiment gives fast feedback.

half of the inputs are $$0$$ and the other half are $$1$$. The idea above can be applied more broadly, but you can get the underlying insight from this special case. Consider the weighted input $$z=\sum_j w_j x_j + b$$ to a hidden neuron, in which $$500$$ of the terms vanish. Then $$z$$ follows a Gaussian distribution with mean $$0$$ and standard deviation $$\sqrt{501}\approx 22.4$$, so $$z$$ is very likely to be large in magnitude.

#### Learning rate

First, estimate the threshold value of the learning rate at which the cost on the training data begins to oscillate. You can start with $$\eta = 0.1$$; if that value is too large, lower it by an order of magnitude to $$0.01$$, or even $$0.001\ldots$$ If the cost shows no sign of "oscillation" during learning, raise the learning rate appropriately, say from $$0.1$$ to $$0.2$$ or $$0.5\ldots$$ but in the end it must not exceed the threshold that causes oscillation.
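The oscillation threshold can be seen on a toy problem. This sketch (my own illustration, not from the book) runs gradient descent on the one-variable cost $$C(w)=w^2$$, where the update multiplies $$w$$ by $$(1-2\eta)$$ and the threshold is exactly $$\eta=1$$:

```python
def gradient_descent(eta, steps=50, w0=1.0):
    """Run gradient descent on C(w) = w^2 (gradient 2w); return the iterates."""
    w, history = w0, []
    for _ in range(steps):
        w = w - eta * 2 * w      # each step multiplies w by (1 - 2*eta)
        history.append(w)
    return history

small = gradient_descent(eta=0.1)    # factor 0.8: smooth decay toward 0
large = gradient_descent(eta=1.1)    # factor -1.2: sign-flipping blow-up
print("eta=0.1 final |w|: %.2e" % abs(small[-1]))
print("eta=1.1 final |w|: %.2e" % abs(large[-1]))
```

Below the threshold the cost shrinks smoothly; above it, the iterates flip sign every step and grow, which is the "oscillation" the heuristic looks for.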

Use early stopping to determine the number of training epochs: compute the classification accuracy on the validation data at the end of each epoch, and terminate training when it stops improving.

#### Learning rate schedule

A common schedule is to hold the learning rate constant until the validation accuracy plateaus, then lower it (for example, start with a learning rate of $$0.1$$; when the accuracy rises to 70%, lower it to $$0.01$$; when it rises to 90%, keep lowering it; stop once the learning rate is only a thousandth of its initial value).
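That schedule is easy to express as a loop. The sketch below is a hypothetical illustration (the accuracy numbers are simulated stand-ins, not output of network2.py):

```python
def schedule(accuracies, eta=0.1, decays=3):
    """Walk through per-epoch validation accuracies, dividing eta by 10
    whenever the accuracy fails to improve, and stopping after the rate
    has been lowered `decays` times (i.e. to a thousandth of its start)."""
    etas, best, lowered = [], 0.0, 0
    for acc in accuracies:
        if acc <= best:            # no improvement: decay the rate
            eta /= 10
            lowered += 1
            if lowered > decays:   # rate is far below its start: stop
                break
        best = max(best, acc)
        etas.append(eta)
    return etas

# Simulated validation accuracies that plateau twice
etas = schedule([0.60, 0.70, 0.70, 0.85, 0.90, 0.90, 0.90, 0.90])
print(etas)
```

The returned list shows the learning rate used at each epoch, stepping down by a factor of ten at every plateau.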

If the neuron's output saturates near $$0$$ or $$1$$, learning will likewise be quite slow.

#### Regularization parameter

Start with $$\lambda = 0.0$$; once the learning rate is settled and the network trains normally, then set $$\lambda$$. There is no universal rule for the exact value; you can only judge from the situation at hand. It could be $$1.0$$, or $$0.1$$, or $$10.0$$. In short, judge by the accuracy on the validation set.

Suppose we have a neuron with $$n_{in}$$ input weights. We will initialize those weights as Gaussian random variables with mean $$0$$ and standard deviation $$1/\sqrt{n_{in}}$$, and we will continue to initialize the bias as a Gaussian with mean $$0$$ and standard deviation $$1$$. With these choices the weighted sum $$z=\sum_j w_j x_j + b$$ is still a Gaussian random variable with mean $$0$$, but much more sharply peaked than before: in the example above, with $$500$$ inputs on, it has standard deviation $$\sqrt{3/2} = 1.22\ldots$$

### Exercise

• Verify that the standard deviation of $$z=\sum_j w_j x_j + b$$ in the paragraph above is $$\sqrt{3/2}$$. The following two facts may help: the variance of a sum of independent random variables is the sum of their individual variances, and the variance is the square of the standard deviation.
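The two hints make this a two-line calculation: each of the $$500$$ active weight terms contributes variance $$1/1000$$, and the bias contributes variance $$1$$, for a total of $$3/2$$. A quick numerical check:

```python
import math

n_in = 1000                           # inputs to the neuron
active = 500                          # inputs equal to 1
var_weights = active * (1.0 / n_in)   # each weight has std 1/sqrt(1000), variance 1/1000
var_bias = 1.0                        # bias ~ N(0, 1)
std_z = math.sqrt(var_weights + var_bias)
print(std_z)                          # sqrt(3/2) ~ 1.2247
```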

Incidentally, we could also initialize all the biases to $$0$$ and rely on gradient descent to learn appropriate biases. But since it makes little difference, we will continue to initialize the biases in the way described above.

…with regularization parameter $$\lambda=5.0$$ and the cross-entropy cost function. We lower the learning rate from $$\eta=0.5$$ to $$0.1$$, since that makes the results show up more clearly in the graphs. We first train using the old initialization method:

```python
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.1, lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)
```


We can also train using the new approach to initializing the weights. The only change is that we no longer make the net.large_weight_initializer() call:

```python
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.SGD(training_data, 30, 10, 0.1, lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True)
```


In the first case, the classification accuracy levels off a little below 87%, while the new method already reaches almost 93%. It looks as though the new approach to weight initialization carries training to a new regime, letting us obtain good results much more quickly. The same phenomenon also shows up with $$100$$ hidden neurons:

In both cases the $$1/\sqrt{n_{in}}$$ weight initialization speeds up learning; it can sometimes also improve the network's final performance.

Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio.

• Connecting regularization and the improved method of weight initialization. L2 regularization sometimes automatically gives us something similar to the new approach to weight initialization. Suppose we are using the old approach. Sketch a heuristic argument that: (1) supposing $$\lambda$$ is not too small, the first epochs of training will be dominated almost entirely by weight decay; (2) provided $$\eta\lambda \ll n$$, the weights will decay by a factor of $$\exp(-\eta\lambda/m)$$ per epoch; (3) supposing $$\lambda$$ is not too large, the weight decay will tail off when the weights are down to a size around $$1/\sqrt{n}$$, where $$n$$ is the total number of weights in the network. Argue that these conditions are all satisfied in the examples given in this section.
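To see where the factor $$\exp(-\eta\lambda/m)$$ in point (2) comes from: each update multiplies the weights by $$(1-\eta\lambda/n)$$, and an epoch contains $$n/m$$ mini-batch updates, so compounding gives approximately $$\exp(-\eta\lambda/m)$$. A quick check with illustrative values (the specific $$n$$, $$m$$, $$\eta$$, $$\lambda$$ below are just examples):

```python
import math

eta, lmbda = 0.1, 5.0
n, m = 50000, 10                     # training-set size, mini-batch size

per_step = 1 - eta * lmbda / n       # weight-decay multiplier per update
updates_per_epoch = n // m           # n/m updates in one epoch
compounded = per_step ** updates_per_epoch
approx = math.exp(-eta * lmbda / m)

print(compounded)                    # compounded per-epoch decay
print(approx)                        # the exp(-eta*lmbda/m) approximation
```

The two numbers agree to several decimal places, confirming the heuristic.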

Let's implement these ideas in network2.py, an improved version of network.py. If you haven't looked closely at network.py, you may want to reread the earlier discussion of that code. It is only $$74$$ lines of code at its core, and is easily understood.

As with network.py, the centerpiece is the Network class, which we use to represent our neural networks:

```python
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost = cost
```


Most of the __init__ method is the same as in network.py. What's new is the call to default_weight_initializer, which initializes each weight using a Gaussian distribution with mean $$0$$ and standard deviation $$1/\sqrt{n}$$, where $$n$$ is the number of weights connecting into the same neuron:

```python
def default_weight_initializer(self):
    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
    self.weights = [np.random.randn(y, x)/np.sqrt(x)
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]
```


To understand it you may need to recall that np.random.randn comes from Numpy and generates Gaussian random variables with mean $$0$$ and standard deviation $$1$$. Note also that we do not initialize any biases for the first layer of neurons. The first layer is really the input layer, so there is no need to introduce any biases. We did exactly the same thing in network.py.
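A quick way to see what the initializer produces is to run the same list comprehensions directly on a hypothetical [784, 30, 10] architecture (a standalone sketch, mirroring the method above):

```python
import numpy as np

sizes = [784, 30, 10]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x)/np.sqrt(x)
           for x, y in zip(sizes[:-1], sizes[1:])]

print([b.shape for b in biases])    # no biases for the input layer
print([w.shape for w in weights])   # one matrix per pair of adjacent layers
print("std of first-layer weights: %.3f" % weights[0].std())  # near 1/sqrt(784)
```

The weight matrix into a layer has shape (neurons in that layer, neurons in the previous layer), and its entries are scaled down by the square root of the fan-in.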

We also include a large_weight_initializer method, which initializes the weights and biases using the old approach from Chapter 1, with both drawn from Gaussians with mean $$0$$ and standard deviation $$1$$:

```python
def large_weight_initializer(self):
    self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
    self.weights = [np.random.randn(y, x)
                    for x, y in zip(self.sizes[:-1], self.sizes[1:])]
```


```python
class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        return (a-y)
```


Notice that the cross-entropy cost is represented as a Python class, not a Python function. Why is that? The cost plays two different roles in our network.

The obvious role is as a measure of how well an output activation $$a$$ matches the desired output $$y$$. That role is played by the CrossEntropyCost.fn method. (Note that the np.nan_to_num call ensures that Numpy correctly handles logarithms of numbers very close to $$0$$.)
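The numerical issue is easy to trigger: if $$a$$ and $$y$$ are both $$1.0$$ in the same slot, the term $$(1-y)\log(1-a)$$ evaluates to $$0\cdot(-\infty)=\mathrm{nan}$$, and np.nan_to_num converts it back to the correct value $$0.0$$:

```python
import numpy as np

a = np.array([1.0, 0.5])
y = np.array([1.0, 0.5])

# Suppress the log(0) warnings that arise in the first slot.
with np.errstate(divide='ignore', invalid='ignore'):
    raw = -y*np.log(a) - (1-y)*np.log(1-a)

print(raw)                    # first slot is nan: 0 * log(0)
cost = np.sum(np.nan_to_num(raw))
print(cost)                   # the nan is replaced by 0.0; only log(2) remains
```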

```python
class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        return (a-y) * sigmoid_prime(z)
```
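QuadraticCost.delta is just $$\partial C/\partial z = (a-y)\,\sigma'(z)$$, and we can sanity-check that formula against a finite-difference derivative (a standalone sketch reusing the same formulas for a single scalar output):

```python
import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

def quadratic_fn(a, y):
    return 0.5*np.linalg.norm(a-y)**2

z, y = 0.3, 1.0
a = sigmoid(z)
delta = (a-y) * sigmoid_prime(z)          # QuadraticCost.delta

# Central finite difference of C(sigmoid(z)) with respect to z
eps = 1e-6
numeric = (quadratic_fn(sigmoid(z+eps), y) -
           quadratic_fn(sigmoid(z-eps), y)) / (2*eps)

print(delta, numeric)   # the two agree to several decimal places
```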


```python
"""network2.py
~~~~~~~~~~~~~~

An improved version of network.py, implementing the stochastic
gradient descent learning algorithm for a feedforward neural network.
Improvements include the addition of the cross-entropy cost function,
regularization, and better initialization of network weights.  Note
that I have focused on making the code simple, easily readable, and
easily modifiable.  It is not optimized, and omits many desirable
features.

"""

#### Libraries
# Standard library
import json
import random
import sys

# Third-party libraries
import numpy as np


#### Define the quadratic and cross-entropy cost functions

class QuadraticCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired
        output ``y``.

        """
        return 0.5*np.linalg.norm(a-y)**2

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer."""
        return (a-y) * sigmoid_prime(z)


class CrossEntropyCost(object):

    @staticmethod
    def fn(a, y):
        """Return the cost associated with an output ``a`` and desired
        output ``y``.  Note that np.nan_to_num is used to ensure
        numerical stability.  In particular, if both ``a`` and ``y``
        have a 1.0 in the same slot, then the expression
        (1-y)*np.log(1-a) returns nan.  The np.nan_to_num ensures that
        that is converted to the correct value (0.0).

        """
        return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))

    @staticmethod
    def delta(z, a, y):
        """Return the error delta from the output layer.  Note that the
        parameter ``z`` is not used by the method.  It is included in
        the method's parameters in order to make the interface
        consistent with the delta method for other cost classes.

        """
        return (a-y)


#### Main Network class
class Network(object):

    def __init__(self, sizes, cost=CrossEntropyCost):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using
        self.default_weight_initializer (see docstring for that
        method).

        """
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.default_weight_initializer()
        self.cost = cost

    def default_weight_initializer(self):
        """Initialize each weight using a Gaussian distribution with mean 0
        and standard deviation 1 over the square root of the number of
        weights connecting to the same neuron.  Initialize the biases
        using a Gaussian distribution with mean 0 and standard
        deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)/np.sqrt(x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def large_weight_initializer(self):
        """Initialize the weights using a Gaussian distribution with mean 0
        and standard deviation 1.  Initialize the biases using a
        Gaussian distribution with mean 0 and standard deviation 1.

        Note that the first layer is assumed to be an input layer, and
        by convention we won't set any biases for those neurons, since
        biases are only ever used in computing the outputs from later
        layers.

        This weight and bias initializer uses the same approach as in
        Chapter 1, and is included for purposes of comparison.  It
        will usually be better to use the default weight initializer
        instead.

        """
        self.biases = [np.random.randn(y, 1) for y in self.sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(self.sizes[:-1], self.sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            lmbda=0.0,
            evaluation_data=None,
            monitor_evaluation_cost=False,
            monitor_evaluation_accuracy=False,
            monitor_training_cost=False,
            monitor_training_accuracy=False):
        """Train the neural network using mini-batch stochastic gradient
        descent.  The ``training_data`` is a list of tuples ``(x, y)``
        representing the training inputs and the desired outputs.  The
        other non-optional parameters are self-explanatory, as is the
        regularization parameter ``lmbda``.  The method also accepts
        ``evaluation_data``, usually either the validation or test
        data.  We can monitor the cost and accuracy on either the
        evaluation data or the training data, by setting the
        appropriate flags.  The method returns a tuple containing four
        lists: the (per-epoch) costs on the evaluation data, the
        accuracies on the evaluation data, the costs on the training
        data, and the accuracies on the training data.  All values are
        evaluated at the end of each training epoch.  So, for example,
        if we train for 30 epochs, then the first element of the tuple
        will be a 30-element list containing the cost on the
        evaluation data at the end of each epoch.  Note that the lists
        are empty if the corresponding flag is not set.

        """
        if evaluation_data: n_data = len(evaluation_data)
        n = len(training_data)
        evaluation_cost, evaluation_accuracy = [], []
        training_cost, training_accuracy = [], []
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(
                    mini_batch, eta, lmbda, len(training_data))
            print "Epoch %s training complete" % j
            if monitor_training_cost:
                cost = self.total_cost(training_data, lmbda)
                training_cost.append(cost)
                print "Cost on training data: {}".format(cost)
            if monitor_training_accuracy:
                accuracy = self.accuracy(training_data, convert=True)
                training_accuracy.append(accuracy)
                print "Accuracy on training data: {} / {}".format(
                    accuracy, n)
            if monitor_evaluation_cost:
                cost = self.total_cost(evaluation_data, lmbda, convert=True)
                evaluation_cost.append(cost)
                print "Cost on evaluation data: {}".format(cost)
            if monitor_evaluation_accuracy:
                accuracy = self.accuracy(evaluation_data)
                evaluation_accuracy.append(accuracy)
                print "Accuracy on evaluation data: {} / {}".format(
                    self.accuracy(evaluation_data), n_data)
            print
        return evaluation_cost, evaluation_accuracy, \
            training_cost, training_accuracy

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini batch.  The
        ``mini_batch`` is a list of tuples ``(x, y)``, ``eta`` is the
        learning rate, ``lmbda`` is the regularization parameter, and
        ``n`` is the total size of the training data set.

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the gradient
        for the cost function C_x.  ``nabla_b`` and ``nabla_w`` are
        layer-by-layer lists of numpy arrays, similar to
        ``self.biases`` and ``self.weights``.

        """
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = (self.cost).delta(zs[-1], activations[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def accuracy(self, data, convert=False):
        """Return the number of inputs in ``data`` for which the neural
        network outputs the correct result.  The neural network's
        output is assumed to be the index of whichever neuron in the
        final layer has the highest activation.

        The flag ``convert`` should be set to False if the data set is
        validation or test data (the usual case), and to True if the
        data set is the training data.  The need for this flag arises
        due to differences in the way the results ``y`` are
        represented in the different data sets.  In particular, it
        flags whether we need to convert between the different
        representations.  It may seem strange to use different
        representations for the different data sets.  Why not use the
        same representation for all three data sets?  It's done for
        efficiency reasons -- the program usually evaluates the cost
        on the training data and the accuracy on other data sets.
        These are different types of computations, and using different
        representations speeds things up.  More details on the
        representations can be found in
        mnist_loader.load_data_wrapper.

        """
        if convert:
            results = [(np.argmax(self.feedforward(x)), np.argmax(y))
                       for (x, y) in data]
        else:
            results = [(np.argmax(self.feedforward(x)), y)
                       for (x, y) in data]
        return sum(int(x == y) for (x, y) in results)

    def total_cost(self, data, lmbda, convert=False):
        """Return the total cost for the data set ``data``.  The flag
        ``convert`` should be set to False if the data set is the
        training data (the usual case), and to True if the data set is
        the validation or test data.  See comments on the similar (but
        reversed) convention for the ``accuracy`` method, above.

        """
        cost = 0.0
        for x, y in data:
            a = self.feedforward(x)
            if convert: y = vectorized_result(y)
            cost += self.cost.fn(a, y)/len(data)
        cost += 0.5*(lmbda/len(data))*sum(
            np.linalg.norm(w)**2 for w in self.weights)
        return cost

    def save(self, filename):
        """Save the neural network to the file ``filename``."""
        data = {"sizes": self.sizes,
                "weights": [w.tolist() for w in self.weights],
                "biases": [b.tolist() for b in self.biases],
                "cost": str(self.cost.__name__)}
        f = open(filename, "w")
        json.dump(data, f)
        f.close()


#### Loading a Network
def load(filename):
    """Load a neural network from the file ``filename``.  Returns an
    instance of Network.

    """
    f = open(filename, "r")
    data = json.load(f)
    f.close()
    cost = getattr(sys.modules[__name__], data["cost"])
    net = Network(data["sizes"], cost=cost)
    net.weights = [np.array(w) for w in data["weights"]]
    net.biases = [np.array(b) for b in data["biases"]]
    return net


#### Miscellaneous functions
def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the j'th position
    and zeroes elsewhere.  This is used to convert a digit (0...9)
    into a corresponding desired output from the neural network.

    """
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
```


The other changes are minor, mostly bookkeeping such as passing the regularization parameter lmbda through to various methods, chiefly Network.SGD.

The monitoring flags default to False, but in our example they have been set to True in order to monitor the Network's performance:

```python
>>> evaluation_cost, evaluation_accuracy, \
... training_cost, training_accuracy = net.SGD(training_data, 30, 10, 0.5,
... lmbda = 5.0,
... evaluation_data=validation_data,
... monitor_evaluation_accuracy=True,
... monitor_evaluation_cost=True,
... monitor_training_accuracy=True,
... monitor_training_cost=True)
```


• Modify the code above to implement L1 regularization, and use it with a $$30$$ hidden-neuron network to classify MNIST digits. Can you find a regularization parameter that works better than no regularization at all?
• Take a look at the Network.cost_derivative method in network.py. That method was written for the quadratic cost. How would you rewrite it for the cross-entropy cost? Can you think of a problem that might arise in the cross-entropy version? In network2.py we have eliminated the Network.cost_derivative method entirely, instead incorporating its functionality into the CrossEntropyCost.delta method. How does that resolve the problem you just identified?