
# The Error Surface of the 2-2-1 XOR Network: Stationary Points with Infinite Weights

Ida G. Sprinkhuizen-Kuyper, Egbert J.W. Boers

Technical Report 96-10, Dept. of Computer Science, Leiden University. Available as ftp://ftp.wi.leidenuniv.nl/pub/CS/TechnicalReports/1996/tr96-10.ps.gz

## Abstract

In this paper it is proved that the error surface of the two-layer XOR network with two hidden units has a number of regions with local minima. These regions of local minima occur for combinations of the weights from the inputs to the hidden nodes such that one or both hidden nodes are saturated (give output 0 or 1) for at least two patterns. However, boundary points of these regions of local minima are saddle points. From these results it can be concluded that from each finite point in weight space a strictly decreasing path exists to a point with error zero. Furthermore we give proofs that points with error zero exist, and that points with the output unit saturated are either saddle points or (local) maxima. In [10] it is proved that stationary points with finite weights are either saddle points or absolute minima.

## 1 Introduction

A thorough investigation of the error surfaces of XOR networks is important, since Prechelt [5] found in his survey of articles on learning algorithms in neural networks that 20 articles (18%) employed the "grandfather" of all neural network problems, the XOR or n-bit parity. Many experimental results could therefore be explained better with more knowledge of the error surface of these networks. More insight into the error surfaces of a number of concrete problems gives a better understanding of the specific behaviour of the learning algorithms under investigation. In the literature we found a number of results concerning the error surface of these networks, but we did not find a complete investigation of the error surfaces of the XOR networks.
In [7, 8] we described our results for the error surface of the XOR network with one hidden node and connections directly from the inputs to the output node (see figure 1a). In this paper, together with [10], we give a complete investigation of the error surface of the two-layer XOR network with two hidden nodes (see figure 1b). The transfer function used is the usual sigmoid

$$f(x) = \frac{1}{1 + e^{-x}}.$$

We consider the quadratic error function

$$E = \frac{1}{2} \sum_\alpha \left( O_\alpha - t_\alpha \right)^2,$$

while in the literature also the "cross-entropy"

$$L = -\sum_\alpha \left[ t_\alpha \ln O_\alpha + (1 - t_\alpha) \ln (1 - O_\alpha) \right]$$

is used. The difference is that the terms $R_\alpha = (O_\alpha - t_\alpha)\, f'(I_\alpha)$, which occur in the partial derivatives of the quadratic error $E$, simplify to $O_\alpha - t_\alpha$ for the cross-entropy $L$. So the analysis for the quadratic error is more complicated than that for the cross-entropy. In particular, more stationary points (points where all partial derivatives with respect to the weights are zero, so the gradient of the error is zero in these points) occur for $E$ than for $L$. We mention the stationary points for which the output node is saturated ($I_\alpha$ equal to plus or minus infinity) for at least one of the patterns. In subsection 4.2 we will show that these points are saddle points or local maxima. Some further stationary points occur for finite weights (see [10]). Since all stationary points of the error surface with the cross-entropy form a subset of the stationary points for the quadratic error, it is easily checked that all results obtained here also hold for the cross-entropy. In particular, the regions of local minima found for the quadratic error, which are summarized in tables 2 through 11, are also regions of local minima for the cross-entropy.

Figure 1. The simplest XOR network (a) and one with two hidden nodes (b).

In this paper it is proved that the error surface of the 2-2-1 XOR network has a number of regions consisting of local minima.
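As a quick numerical illustration of the simplification mentioned above, the following sketch (in NumPy; the function names are ours, not the paper's) compares the gradient term of the quadratic error with that of the cross-entropy at the output unit:

```python
import numpy as np

def f(x):
    """The usual sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def df(x):
    """Derivative of the sigmoid: f'(x) = f(x) (1 - f(x))."""
    s = f(x)
    return s * (1.0 - s)

# Gradient contributions at the output unit for one pattern alpha:
# the quadratic error E contributes R = (O - t) f'(I),
# while the cross-entropy L contributes just (O - t).
I = 0.7          # example net input of the output unit (arbitrary value)
t = 0.9          # target output
O = f(I)
quadratic_term = (O - t) * df(I)   # R_alpha for E
crossent_term  = O - t             # simplified term for L

# The two differ exactly by the factor f'(I), which vanishes when the
# output unit saturates -- the source of the extra stationary points of E.
assert abs(quadratic_term - crossent_term * df(I)) < 1e-12
```

Note that `quadratic_term` goes to zero whenever $f'(I) \to 0$, even if the output is far from its target, which is why $E$ has stationary points that $L$ does not.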
These regions of local minima occur for combinations of the weights from the inputs to the hidden nodes such that one or both hidden nodes are saturated (give output 0 or 1, since their input is either $+\infty$ or $-\infty$) for at least two patterns. However, boundary points of these regions of local minima are saddle points. From these results it can be concluded that from each finite point in weight space a strictly decreasing path exists to a point with error zero. Furthermore we give proofs that points with error zero exist, that stationary points with finite weights are either saddle points or absolute minima, and that points with the output unit saturated are either saddle points or (local) maxima.

### Relation to previous work

Blum [1] investigated the 2-2-1 XOR network with the cross-entropy as error function. He restricted the weights to be symmetrical. For the 5 remaining independent weights he proved that exact solutions of the XOR problem exist. In section 3 we give our proof that the XOR problem can be represented exactly by the 2-2-1 network. Blum's proof is more complicated since the network he considered has fewer degrees of freedom. In the same paper [1] Blum identified a linear manifold of stationary points as local minima. In this paper it is shown that for the network with 9 independent weights this line does not contain local minima. In [9] we proved that also with symmetric constraints on the weights no local minima exist on the manifold given by Blum. Hamey [2] also found that Blum's proof was incorrect.

Lisboa and Perantonis [4] characterize the stationary points of all two-layer XOR networks with and without connections directly from the inputs to the output unit, considering the cross-entropy as error function. They also give some local minima for the 2-2-1 network and state that they checked that these points are indeed local minima by considering the second-order partial derivatives. However, they do not give details of their proofs.
Four of their local minima are found to be numerically equivalent to local minima resulting from our research (see subsection 5.7). The fifth point is not a local minimum (see [10]).

Hamey [2, 3] also investigated the 2-2-1 XOR network with the cross-entropy $L$ as error function. In [2] he shows that for all points with finite weights a finite non-ascending trajectory exists to a point with error zero. He concludes that the 2-2-1 XOR network has no (regional) local minima. He defines a regional local minimum as a local minimum that has to be reached by a non-ascending path from points in the neighbourhood. In [3] Hamey proves that all finite stationary points are saddle points. In this paper we prove that the 2-2-1 XOR network has local minima for infinite values of the weights from the inputs to the hidden nodes. However, this result does not contradict Hamey's results, since our definition of a local minimum in a point $w$ (finite or infinite) is that a local minimum is attained in $w$ if for all points $w'$ in a neighbourhood of $w$ the inequality $f(w) \le f(w')$ holds, and we accept that points with infinite weights exist. Such an infinite local minimum can trap a learning algorithm, since a decreasing path to a point with error zero will first get closer to the infinite point and will not necessarily reach a neighbourhood from which the learning algorithm can escape the region with the local minimum value.

### Contents of the paper

In section 2 the network and its parameters are introduced and the error function is given. In section 3 it is proved that the network can represent the XOR function, as specified in section 2, exactly. In [7] we introduced the notions of stable stationary points, i.e. points which are stationary points for the error of each individual pattern, and instable stationary points. The latter points are stationary points for the total error, but not for each individual pattern.
In section 4 it is proved that stable stationary points are either absolute minima with error zero, or saddle points, or (local) maxima, but not local minima. Both finite and infinite weights are investigated in this section. Logically the next section would contain the proofs that instable stationary points with finite weights cannot be local minima; we decided to publish this part separately in [10]. In section 5 the instable points with infinite weights are treated. Here it is found that local minima can only exist if at most two patterns are learned. The resulting local minima with two patterns learned are summarized in subsection 5.3, while subsection 5.5 contains the local minima with one pattern learned, and subsection 5.6 contains the local minima with all patterns giving a wrong output. In subsection 5.7 some examples of local minima found in the literature are shown to belong to one of the classes found earlier. The paper ends with section 6, containing some conclusions.

## 2 The network

In this paper we investigate the error surface of the network with two hidden units and without direct connections from the inputs to the output (see figure 2). The weights from the bias, $X_1$ and $X_2$ to the first hidden unit are $w_{01}$, $w_{11}$ and $w_{21}$, those to the second hidden unit are $w_{02}$, $w_{12}$ and $w_{22}$; the output unit has bias $u$ and weights $v_1$ and $v_2$ from the two hidden units. For the XOR function we assume that the patterns given in table 1 should be learned/represented by the network.

Figure 2. The XOR network with 2 hidden units.

Table 1: Patterns for the XOR problem

| Pattern | $X_1$ | $X_2$ | desired output |
|---------|-------|-------|----------------|
| $P_{00}$ | 0 | 0 | 0.1 |
| $P_{01}$ | 0 | 1 | 0.9 |
| $P_{10}$ | 1 | 0 | 0.9 |
| $P_{11}$ | 1 | 1 | 0.1 |

The input of the output unit for the four patterns is:

$$\begin{aligned}
A_{00} &= u + v_1 f(w_{01}) + v_2 f(w_{02}) \\
A_{01} &= u + v_1 f(w_{01} + w_{21}) + v_2 f(w_{02} + w_{22}) \\
A_{10} &= u + v_1 f(w_{01} + w_{11}) + v_2 f(w_{02} + w_{12}) \\
A_{11} &= u + v_1 f(w_{01} + w_{11} + w_{21}) + v_2 f(w_{02} + w_{12} + w_{22})
\end{aligned} \tag{2.1}$$

So the four patterns result in output values equal to $f(A_{00})$, $f(A_{01})$, $f(A_{10})$ and $f(A_{11})$, respectively. The mean square error is equal to:

$$E = \frac{1}{2} \left( f(A_{00}) - 0.1 \right)^2 + \frac{1}{2} \left( f(A_{01}) - 0.9 \right)^2 + \frac{1}{2} \left( f(A_{10}) - 0.9 \right)^2 + \frac{1}{2} \left( f(A_{11}) - 0.1 \right)^2 \tag{2.2}$$

The weight space has a number of symmetries for this problem, which we will exploit in order to reduce the number of different cases that have to be investigated. In particular, we will consider the following four transformations of the weight space:

**Transformation 2.1** (interchanging the inputs, using the symmetry of the training patterns with respect to the inputs): $w_{11}' = w_{21}$, $w_{21}' = w_{11}$, $w_{12}' = w_{22}$, $w_{22}' = w_{12}$, other weights equal.

**Transformation 2.2** (interchanging $P_{00}$ and $P_{11}$, and $P_{01}$ and $P_{10}$): $w_{01}' = w_{01} + w_{11} + w_{21}$, $w_{02}' = w_{02} + w_{12} + w_{22}$, $w_{11}' = -w_{11}$, $w_{21}' = -w_{21}$, $w_{12}' = -w_{12}$, $w_{22}' = -w_{22}$, other weights equal.

**Transformation 2.3** (using that $f(x) = 1 - f(-x)$, and interchanging patterns with desired output 0.1 and those with desired output 0.9): $u' = -u - v_1 - v_2$, $w_{01}' = -w_{01} - w_{21}$, $w_{02}' = -w_{02} - w_{22}$, $w_{11}' = -w_{11}$, $w_{12}' = -w_{12}$, other weights equal.

**Transformation 2.4** (mirroring the network): $v_1' = v_2$, $v_2' = v_1$, $w_{i1}' = w_{i2}$, $w_{i2}' = w_{i1}$, $i \in \{0, 1, 2\}$, other weights equal.

## 3 Representation

In this section we prove that the XOR function can be represented exactly by the network with two hidden units given in figure 2.
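The error (2.2) and its invariance under the symmetry transformations can be checked numerically. The following sketch (NumPy; names ours) evaluates $E$ at an arbitrary weight vector and verifies transformations 2.1 and 2.2:

```python
import numpy as np

def f(x):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-x))

def error(u, v1, v2, w01, w11, w21, w02, w12, w22):
    """Mean square error (2.2) over the four patterns of table 1."""
    A = np.array([
        u + v1 * f(w01)             + v2 * f(w02),              # A00
        u + v1 * f(w01 + w21)       + v2 * f(w02 + w22),        # A01
        u + v1 * f(w01 + w11)       + v2 * f(w02 + w12),        # A10
        u + v1 * f(w01 + w11 + w21) + v2 * f(w02 + w12 + w22),  # A11
    ])
    targets = np.array([0.1, 0.9, 0.9, 0.1])
    return 0.5 * np.sum((f(A) - targets) ** 2)

# Arbitrary example weights.
u, v1, v2 = 0.3, 1.5, -2.0
w01, w11, w21 = 0.4, 1.1, -0.7
w02, w12, w22 = -0.2, 0.8, 2.3

E = error(u, v1, v2, w01, w11, w21, w02, w12, w22)

# Transformation 2.1: interchange the inputs (w11 <-> w21, w12 <-> w22).
E1 = error(u, v1, v2, w01, w21, w11, w02, w22, w12)
assert abs(E - E1) < 1e-12

# Transformation 2.2: interchange P00 with P11 and P01 with P10.
E2 = error(u, v1, v2,
           w01 + w11 + w21, -w11, -w21,
           w02 + w12 + w22, -w12, -w22)
assert abs(E - E2) < 1e-12
```

Both transformations merely permute the four values $A_{ij}$ among patterns with equal targets, so the error is unchanged.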
The XOR function is exactly represented by the network if the weights $u$, $v_1$, $v_2$, $w_{01}$, $w_{11}$, $w_{21}$, $w_{02}$, $w_{12}$ and $w_{22}$ are such that the following equations hold:

$$f(A_{00}) = 0.1, \qquad f(A_{01}) = 0.9, \qquad f(A_{10}) = 0.9, \qquad f(A_{11}) = 0.1 \tag{3.1}$$

Application of the inverse of $f$ on both sides of (3.1), using (2.1), leads to:

$$\begin{aligned}
u + v_1 f(w_{01}) + v_2 f(w_{02}) &= f^{-1}(0.1) \\
u + v_1 f(w_{01} + w_{21}) + v_2 f(w_{02} + w_{22}) &= f^{-1}(0.9) \\
u + v_1 f(w_{01} + w_{11}) + v_2 f(w_{02} + w_{12}) &= f^{-1}(0.9) \\
u + v_1 f(w_{01} + w_{11} + w_{21}) + v_2 f(w_{02} + w_{12} + w_{22}) &= f^{-1}(0.1)
\end{aligned} \tag{3.2}$$

From these equations it follows that:

$$\begin{aligned}
u &= -v_1 f(w_{01}) - v_2 f(w_{02}) + f^{-1}(0.1) \\
  &= -v_1 f(w_{01} + w_{21}) - v_2 f(w_{02} + w_{22}) + f^{-1}(0.9) \\
  &= -v_1 f(w_{01} + w_{11}) - v_2 f(w_{02} + w_{12}) + f^{-1}(0.9) \\
  &= -v_1 f(w_{01} + w_{11} + w_{21}) - v_2 f(w_{02} + w_{12} + w_{22}) + f^{-1}(0.1)
\end{aligned} \tag{3.3}$$

leading to the following three equations for the weights with the exception of $u$:

$$\begin{aligned}
v_1 \left( f(w_{01}) - f(w_{01} + w_{11} + w_{21}) \right) + v_2 \left( f(w_{02}) - f(w_{02} + w_{12} + w_{22}) \right) &= 0 \\
v_1 \left( f(w_{01} + w_{21}) - f(w_{01} + w_{11}) \right) + v_2 \left( f(w_{02} + w_{22}) - f(w_{02} + w_{12}) \right) &= 0 \\
v_1 \left( f(w_{01}) - f(w_{01} + w_{21}) \right) + v_2 \left( f(w_{02}) - f(w_{02} + w_{22}) \right) &= -2 f^{-1}(0.9)
\end{aligned} \tag{3.4}$$

The set of equations (3.4) is a set of three linear equations in the two variables $v_1$ and $v_2$. Let us consider points with $w_{11} = w_{21} \ne 0$ and $w_{12} = w_{22} \ne 0$. For these points the second equation of (3.4) is identically zero, and for almost all values of the weights $w_{ij}$ the first and third equation will be linearly independent, so $v_1$ and $v_2$ are determined by these equations and $u$ follows from (3.3). Thus at least one region in weight space exists where the patterns $P_{00}$, $P_{01}$, $P_{10}$ and $P_{11}$ are represented exactly. The dimension of this region is at least 4 and probably 5, since the condition that the first two equations of (3.4) have to be linearly dependent results in one restriction on the 6 weights $w_{ij}$.
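The construction above is easy to carry out numerically. The following sketch (NumPy; variable names ours) picks $w_{11} = w_{21} \ne 0$ and $w_{12} = w_{22} \ne 0$, solves the first and third equation of (3.4) for $v_1$ and $v_2$, takes $u$ from (3.3), and confirms that the error (2.2) vanishes:

```python
import numpy as np

def f(x):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-x))

def finv(y):
    """Inverse sigmoid (logit): f^{-1}(y) = ln(y / (1 - y))."""
    return np.log(y / (1.0 - y))

# Free choices following section 3: w11 = w21 != 0 and w12 = w22 != 0.
w01, w11 = 1.0, -2.0   # hidden unit 1; w21 = w11
w02, w12 = 0.0, 1.0    # hidden unit 2; w22 = w12
w21, w22 = w11, w12

# First and third equation of (3.4) as a 2x2 linear system for (v1, v2).
M = np.array([
    [f(w01) - f(w01 + w11 + w21), f(w02) - f(w02 + w12 + w22)],
    [f(w01) - f(w01 + w21),       f(w02) - f(w02 + w22)],
])
rhs = np.array([0.0, -2.0 * finv(0.9)])
v1, v2 = np.linalg.solve(M, rhs)

# The bias of the output unit follows from (3.3).
u = -v1 * f(w01) - v2 * f(w02) + finv(0.1)

# All four patterns are now represented exactly: the error (2.2) is zero.
A = np.array([
    u + v1 * f(w01)             + v2 * f(w02),
    u + v1 * f(w01 + w21)       + v2 * f(w02 + w22),
    u + v1 * f(w01 + w11)       + v2 * f(w02 + w12),
    u + v1 * f(w01 + w11 + w21) + v2 * f(w02 + w12 + w22),
])
targets = np.array([0.1, 0.9, 0.9, 0.1])
E = 0.5 * np.sum((f(A) - targets) ** 2)
assert E < 1e-12
```

For these particular weight choices the 2×2 system is well-conditioned; other choices work as long as the two equations stay linearly independent.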
## 4 Stable stationary points

Let us introduce

$$\begin{aligned}
R_{00} &= \left( f(A_{00}) - 0.1 \right) f'(A_{00}) \\
R_{01} &= \left( f(A_{01}) - 0.9 \right) f'(A_{01}) \\
R_{10} &= \left( f(A_{10}) - 0.9 \right) f'(A_{10}) \\
R_{11} &= \left( f(A_{11}) - 0.1 \right) f'(A_{11})
\end{aligned} \tag{4.1}$$

Stable stationary points are obtained when the gradient of the error is zero for each of the four patterns separately, thus if

$$R_{00} = R_{01} = R_{10} = R_{11} = 0. \tag{4.2}$$

The cases with all weights finite and with one or more weights infinite are considered here.

### 4.1 Finite weights

If all weights are finite, the only points with all $R_{ij}$ equal to zero are the points satisfying equations (3.1); thus all patterns are learned exactly and the error is zero.

### 4.2 Output 0 or 1

We have to investigate those points where one or more of the terms $A_{ij}$ are infinite and the other terms result in the desired output. Let us consider points in weight space in the neighbourhood of such a stable stationary point. We will show that it is not possible that an infinite value of $A_{00}$ corresponds to a local minimum. The other cases, $A_{01}$, $A_{10}$ and/or $A_{11}$ tending to plus or minus infinity, are treated by transformations of the weight space. First let us try to keep $A_{01}$, $A_{10}$ and $A_{11}$ constant.
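The factors $R_{ij}$ of (4.1) are exactly the derivatives of the per-pattern errors with respect to the output unit's input, and they vanish under saturation even when the output is wrong. Both facts can be checked numerically (NumPy sketch; names ours):

```python
import numpy as np

def f(x):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-x))

def df(x):
    """Derivative of the sigmoid: f'(x) = f(x) (1 - f(x))."""
    s = f(x)
    return s * (1.0 - s)

def R(A, t):
    """Per-pattern gradient factor (4.1): d/dA of the error ½ (f(A) - t)²."""
    return (f(A) - t) * df(A)

# Finite-difference check that R is the derivative of the per-pattern
# error with respect to the output unit's input A.
A, t, h = 0.8, 0.1, 1e-6
err = lambda a: 0.5 * (f(a) - t) ** 2
numeric = (err(A + h) - err(A - h)) / (2 * h)
assert abs(R(A, t) - numeric) < 1e-9

# Saturation: as |A| grows, f'(A) -> 0, so R -> 0 even though the output
# f(A) ≈ 1 is far from the target 0.1 -- the situation of subsection 4.2.
assert abs(R(40.0, 0.1)) < 1e-15
```

This is why points with the output unit saturated are stationary without the patterns being learned, and must be examined separately.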
By (2.1) the effect of small variations of $v_1$, $w_{01}$, $w_{11}$ and $w_{21}$ on $A_{00}$, $A_{01}$, $A_{10}$ and $A_{11}$ is:

$$\begin{aligned}
\Delta A_{00} &= f(w_{01})\,\Delta v_1 + v_1 f'(w_{01})\,\Delta w_{01} \\
\Delta A_{01} &= f(w_{01} + w_{21})\,\Delta v_1 + v_1 f'(w_{01} + w_{21})\,(\Delta w_{01} + \Delta w_{21}) \\
\Delta A_{10} &= f(w_{01} + w_{11})\,\Delta v_1 + v_1 f'(w_{01} + w_{11})\,(\Delta w_{01} + \Delta w_{11}) \\
\Delta A_{11} &= f(w_{01} + w_{11} + w_{21})\,\Delta v_1 + v_1 f'(w_{01} + w_{11} + w_{21})\,(\Delta w_{01} + \Delta w_{11} + \Delta w_{21})
\end{aligned} \tag{4.3}$$

Solving $\Delta w_{01}$, $\Delta w_{11}$ and $\Delta w_{21}$ from the equations for $\Delta A_{01} = \Delta A_{10} = \Delta A_{11} = 0$ results in:

$$\Delta A_{00} = f'(w_{01})\,\Delta v_1 \left( \frac{f(w_{01})}{f'(w_{01})} - \frac{f(w_{01} + w_{21})}{f'(w_{01} + w_{21})} - \frac{f(w_{01} + w_{11})}{f'(w_{01} + w_{11})} + \frac{f(w_{01} + w_{11} + w_{21})}{f'(w_{01} + w_{11} + w_{21})} \right) = f'(w_{01})\,\Delta v_1\, e^{w_{01}} \left( 1 - e^{w_{11}} \right) \left( 1 - e^{w_{21}} \right),$$

using $f(x)/f'(x) = 1 + e^{x}$. Thus if $w_{11} \ne 0$ and $w_{21} \ne 0$, then it is possible to vary the weights $v_1$, $w_{01}$, $w_{11}$ and $w_{21}$ such that $A_{00}$ becomes closer to the desired value, while $A_{01}$, $A_{10}$ and $A_{11}$ remain constant. The effect is that the error decreases when the weights are altered in a direction away from the stationary point. So if $w_{11} \ne 0$ and $w_{21} \ne 0$, then $A_{00} = \pm\infty$ will never result in a local minimum. Similarly it is proved that for $w_{12} \ne 0$ and $w_{22} \ne 0$ it is possible to vary the weights $v_2$, $w_{02}$, $w_{12}$ and $w_{22}$ such that $A_{00}$ becomes closer to the desired value, while $A_{01}$, $A_{10}$ and $A_{11}$ remain constant; also in this case $A_{00} = \pm\infty$ will not yield a local minimum.

The cases that have to be investigated further are those where both $w_{11} = 0$ or $w_{21} = 0$, and $w_{12} = 0$ or $w_{22} = 0$. This leads to the four cases:

- $w_{11} = 0$ and $w_{12} = 0$,
- $w_{11} = 0$ and $w_{22} = 0$,
- $w_{21} = 0$ and $w_{12} = 0$,
- $w_{21} = 0$ and $w_{22} = 0$.

In the first case ($w_{11} = 0$ and $w_{12} = 0$) equation (2.1) becomes:

$$\begin{aligned}
A_{00} = A_{10} &= u + v_1 f(w_{01}) + v_2 f(w_{02}) \\
A_{01} = A_{11} &= u + v_1 f(w_{01} + w_{21}) + v_2 f(w_{02} + w_{22})
\end{aligned} \tag{4.4}$$

Stable stationary points can only be found in this case i