## I Introduction

In recent years, deep learning has become the mainstream approach to machine learning. Since AlexNet [1], increasingly advanced neural networks [2-6] have been proposed, such as GoogLeNet, ResNet, DenseNet, GAN and their variants, enabling practical performance comparable to or beyond human performance in computer vision [7], speech recognition [8], language processing [9], game playing [10], medical imaging [11-13], and so on. A heuristic explanation of why these deep learning models are so successful is that they represent knowledge hierarchically and facilitate high-dimensional non-linear function fitting. It seems that deeper structures are correlated with a greater capacity to approximate more complicated functions.

The generic representation power of neural networks has been rigorously studied since the 1980s. The first result is that a network with a single hidden layer can approximate a continuous function to any accuracy given an infinite number of neurons [14]. That is, the network can be extremely wide. With the emergence of deep neural networks, studies have been performed on the theoretical benefits of these deep models over shallow ones [15-22]. One approach [15, 19] is to construct a special class of functions that are easy to approximate with deep networks but hard to approximate with shallow networks. It has been reported in [16] that a fully-connected network with ReLU activation can approximate any Lebesgue-integrable function in the $L^1$-norm sense, provided a sufficient depth and at most $n+4$ neurons in each layer, where $n$ is the number of inputs. Through a similar analysis, it is reported in [22] that ResNet with a single neuron per layer is a universal approximator. Moreover, it is demonstrated in [15] that a special class of functions is hard to approximate with a conventional network with a single hidden layer unless an exponential number of neurons are used.

In our previous studies [23-25], we proposed quadratic/second-order neurons and deep quadratic neural networks. In a quadratic neuron, the inner product of a vector of data and the corresponding weights in a conventional neuron is replaced with a quadratic function. The resultant quadratic neuron enjoys an enhanced expressive capability over the conventional neuron. For example, a single quadratic neuron can implement the famous XOR logic [23]. Furthermore, each quadratic neuron can be viewed as a generalized fuzzy logic gate, and a deep quadratic network is nothing but a deep fuzzy logic system [25]. Note that high-order neurons [26-27] were considered in the early stage of artificial intelligence, but they were not connected into deep networks and suffered from a combinatorial explosion of parameters due to high-order terms. In contrast, our quadratic neurons use a limited number of parameters (triple that of a conventional neuron) and deliver the utility of high-order neurons in a deep network architecture. For this type of novel quadratic deep network, we developed a general backpropagation algorithm [24] to make them trainable, paving the way for quadratic neurons to be applied to deep learning.

However, how quadratic neurons improve the expressive capability of a deep quadratic network, particularly relative to that of a conventional neural network, has not been studied until now. In this paper, we ask three basic questions regarding the expressive capability of a quadratic network:
(1) For the one-hidden-layer network structure, is there any function that a quadratic network can approximate much more efficiently than a conventional network?
(2) For the same multi-layer network structure, is there any function that can be expressed by a quadratic network but cannot be expressed with conventional neurons in the same structure?
(3) Does a quadratic network give a new insight into universal approximation?
If the answers to these questions are favorable, quadratic networks should be significantly more powerful in many machine learning tasks.

In this paper, we present three theorems addressing the above three questions respectively and positively, thereby establishing the intrinsic advantages of quadratic networks over conventional networks. More specifically, these theorems characterize the merits of a quadratic network in terms of expressive efficiency, unique capability, and compact architecture respectively. The first theorem answers the first question: given a network with only one hidden layer, there exists a function that a quadratic network can approximate with a polynomial number of neurons, whereas a conventional network needs an exponential number of neurons to reach the same level of approximation.
The second theorem answers the second question: any radial function can be approximated by a quadratic network with no more than four neurons in each layer, but the function cannot be approximated by a conventional network of the same structure. Finally, the third theorem provides a new insight into universal approximation from the perspective of the Fundamental Theorem of Algebra. Without introducing complex numbers, a univariate polynomial of degree $N$ can be uniquely factorized into a product of linear and quadratic terms. Since a quadratic network can represent any univariate polynomial in this way, by the Weierstrass theorem and the Kolmogorov theorem that multivariate functions can be represented through summation of univariate functions, we can approximate any multivariate function with a well-structured quadratic neural network, justifying the universal approximation power of the quadratic network.

To further elaborate, several additional comments are in order. For the first theorem, it is commonly known that a conventional neural network with one hidden layer is a universal approximator. Hence, the best we can do to justify the quadratic network is to find a class of functions that can be approximated more efficiently by a quadratic network than by a conventional network.
As related to the second question, previous studies demonstrate that any function of $n$ variables that is not constant along any direction cannot be well represented by a fully-connected ReLU network with no more than $n-1$ neurons in each layer [16]. Breaking this lower bound on network width, our second theorem states that when a radial function is not constant in any direction, a network width of four is sufficient for a quadratic network to approximate the function accurately.
Our third theorem is the most interesting: a general polynomial function can be exactly expressed by a quadratic network through a novel data-driven, network-based algebraic factorization.
As implied by the Weierstrass theorem, the quadratic network is a universal approximator. Unlike the analyses of conventional neural networks with ReLU activation as universal approximators [14,16], which are either infinitely wide or infinitely deep, the depth and width of our quadratic network are both finite for a perfect representation of a multivariate polynomial of a given order; we call this a size-bounded universal approximator. Notably, by factoring a generic polynomial globally, a quadratic network can match the functional structure effectively and then be trained efficiently, avoiding brute-force piecewise-linear fitting of a target function.

There are prior papers related to but different from our contributions [41-45]. Motivated by a need for more powerful activation, Livni et al. [41] proposed to use the quadratic activation $\sigma(z) = z^2$ in the neuron. Although the names are similar, networks with quadratic activation and our proposed networks that consist of quadratic neurons have fundamental differences. At the cellular level, a neuron with quadratic activation is still characterized by a linear decision boundary, while our quadratic neuron allows a quadratic decision boundary. In [41], the authors demonstrated that networks with quadratic activation are as expressive as networks with threshold activation, and that constant-depth networks with quadratic activation can be learned in polynomial time. In contrast, our work goes further, showing that the expressibility of the quadratic network is superior to that of conventional networks; for example, a single quadratic neuron can implement the XOR gate, and a quadratic network of finite width and depth can represent a finite-order polynomial up to any desirable accuracy. In [42], Du et al. showed that over-parametrization and weight decay are instrumental to the optimization aided by quadratic activation. [43] reported how a neural network can provably learn a low-degree polynomial with gradient descent search from scratch, with an emphasis on the effectiveness of the gradient descent method. [44] shows that layers of binary units and ReLU units can approximate with closeness of . In contrast, our Theorem 3 is based on the Fundamental Theorem of Algebra to provide an exact representation of any finite-order polynomial. [45] concerns the factorization machine (FM), which is dedicated to combining high-order features and is clearly different from the polynomial factorization we propose to perform using a quadratic network.

In the next section, we introduce some preliminaries. In the third section, we describe our three theorems, and include the corresponding lemmas and proofs. Numerical examples are also used for illustration. Finally, in the last section we discuss relevant issues and conclude the paper with some conjectures and future work.

## II Preliminaries

Quadratic/Second-order Neuron: The $n$-input function of a quadratic/second-order neuron before being nonlinearly processed is expressed as:

$$h(\mathbf{x}) = \left(\sum_{i=1}^{n} w_{ir} x_i + b_r\right)\left(\sum_{i=1}^{n} w_{ig} x_i + b_g\right) + \sum_{i=1}^{n} w_{ib} x_i^2 + c, \tag{1}$$

where $\mathbf{x} = (x_1, \ldots, x_n)$ denotes the input vector, and the other variables are defined in [23].
Our definition of the quadratic function only utilizes $3n$ parameters, which is more compact than the general second-order representation requiring $n(n+1)/2$ parameters. While our quadratic neuron design is unique, other designs of quadratic neurons have also appeared in the later literature; for example, [28] proposed a type of neuron with paraboloid decision boundaries. We emphasize that the focus of our work is not only on individual quadratic neurons but also on deep quadratic networks in general.
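The following minimal Python sketch illustrates the pre-activation of a single quadratic neuron and the XOR claim cited from [23]. It assumes the form used in [23], namely a product of two affine terms plus a weighted sum of squared inputs plus a bias; the function names and the particular weights chosen for XOR are hypothetical and serve only as an illustration.

```python
def quadratic_neuron(x, w_r, b_r, w_g, b_g, w_b, c):
    """Pre-activation of one quadratic neuron (assumed form, see the lead-in).

    Computes (w_r . x + b_r) * (w_g . x + b_g) + w_b . (x * x) + c.
    """
    lin_r = sum(w * xi for w, xi in zip(w_r, x)) + b_r
    lin_g = sum(w * xi for w, xi in zip(w_g, x)) + b_g
    square = sum(w * xi * xi for w, xi in zip(w_b, x))
    return lin_r * lin_g + square + c


def xor_neuron(x):
    # Hypothetical weights: (x1 - x2) * (x1 - x2) equals XOR on {0, 1}^2.
    return quadratic_neuron(x, w_r=[1, -1], b_r=0,
                            w_g=[1, -1], b_g=0,
                            w_b=[0, 0], c=0)


# Truth table of the single neuron over binary inputs.
xor_table = {(a, b): xor_neuron([a, b]) for a in (0, 1) for b in (0, 1)}
```

With these weights no activation function is needed: the neuron output is already 0 for equal inputs and 1 for different inputs, which a single conventional neuron with a linear decision boundary cannot achieve.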

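One-hidden-layer Networks: The function represented by a one-hidden-layer conventional network with $k$ hidden neurons is as follows:

$$f_1(\mathbf{x}) = \sum_{i=1}^{k} a_i\,\sigma\big(\langle \mathbf{w}_i, \mathbf{x}\rangle + b_i\big). \tag{2}$$

In contrast, the function represented by a one-hidden-layer quadratic network is:

$$f_2(\mathbf{x}) = \sum_{i=1}^{k} a_i\,\sigma\Big(\big(\langle \mathbf{w}_{ir}, \mathbf{x}\rangle + b_{ir}\big)\big(\langle \mathbf{w}_{ig}, \mathbf{x}\rangle + b_{ig}\big) + \langle \mathbf{w}_{ib}, \mathbf{x}^2\rangle + c_i\Big), \tag{3}$$

where $\sigma$ is the activation function and $\mathbf{x}^2$ denotes the element-wise square of $\mathbf{x}$.

In our Theorem 1 below, we will compare the representation capability of a quadratic network and that of a conventional network assuming that both networks have the same one-hidden-layer structure.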

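$L$-Lipschitz Function: An $L$-Lipschitz function $f$ from $\mathbb{R}^d$ to $\mathbb{R}$ is defined by the following property: $|f(\mathbf{x}) - f(\mathbf{y})| \leq L\,\|\mathbf{x} - \mathbf{y}\|$ for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$.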

Radial Function: A radial function only depends on the norm of its input vector, generically denoted as $f(\|\mathbf{x}\|)$. The functions mentioned in Theorems 1 and 2 are all radial functions. By its nature, the quadratic neuron is well suited to modeling a radial function. On the other hand, a general function can be regarded as a mixture of radial functions; for instance, radial basis function networks can be used for universal approximation.

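Fourier Transform: For a function $f: \mathbb{R}^d \to \mathbb{R}$, its Fourier transform is:

$$\hat{f}(\mathbf{w}) = \int_{\mathbb{R}^d} \exp\big(-2\pi i \langle \mathbf{x}, \mathbf{w}\rangle\big)\, f(\mathbf{x})\, d\mathbf{x}.$$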

Euclidean Unit-volume Ball: In a $d$-dimensional space, let $R_d$ be the radius of a Euclidean ball such that $R_d B_d$ has the unit volume, where $B_d$ is the $d$-dimensional unit ball.

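The First Kind Bessel Function: The first kind Bessel function of order $\nu$ is denoted as $J_\nu$ and given by:

$$J_\nu(x) = \sum_{m=0}^{\infty} \frac{(-1)^m}{m!\,\Gamma(m+\nu+1)} \left(\frac{x}{2}\right)^{2m+\nu}.$$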

Lebesgue-integrable Function: A non-negative measurable function is called Lebesgue-integrable if its Lebesgue integral is finite. An arbitrary measurable function is integrable if its positive part and negative part are both Lebesgue-integrable.

Bernstein Polynomial: The $n$-th Bernstein polynomial of $f$ in $[0,1]$ is defined as

$$B_n(f; x) = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1-x)^{n-k}.$$

Bernstein polynomials have been used to prove the Weierstrass theorem.
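As a small numerical illustration of how Bernstein polynomials approach a continuous function, the sketch below evaluates the standard $n$-th Bernstein polynomial on $[0,1]$, i.e. the sum over $k$ of $f(k/n)\binom{n}{k}x^k(1-x)^{n-k}$. The target function, the sample grid, and the helper names are assumptions chosen only for this example.

```python
from math import comb, pi, sin


def bernstein(f, n, x):
    """Value at x of the n-th Bernstein polynomial of f on [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))


def max_error(f, n, samples=201):
    """Largest absolute gap between f and its Bernstein polynomial on a grid."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in grid)


def target(x):
    # Hypothetical smooth test function on [0, 1].
    return sin(2 * pi * x)


err_10 = max_error(target, 10)    # coarse approximation
err_100 = max_error(target, 100)  # finer approximation
```

The error shrinks as the degree grows, although slowly (roughly in proportion to $1/n$ for smooth targets), which is why Bernstein polynomials are mainly a proof device rather than a practical approximation scheme.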

## III Three Theorems

First, we present three theorems, and then give their proofs.

Theorem 1:

For an activation function

with , , and constants , and for some universal constants, there exist a probability measure

and a radial function : , where , that is bounded on [-2,2] and supported on satisfying:

1. can be approximated by a single-hidden-layer quadratic network with neurons, which is denoted as .

2. For every function expressed by a single-hidden-layer conventional network with at most neurons, we have:

for some positive constant .

Theorem 2: For any Lebesgue-integrable radial function : , and any , there exists a fully-connected ReLU quadratic network with no more than four neurons in each layer such that the corresponding function F(x) expressed by this network satisfies:

Theorem 3: For any multivariate polynomial , in which the degrees of the input components in the -th term are respectively, there is a quadratic network of width and depth that computes exactly.

### III-A Proof of Theorem 1

Key Idea for Proving Theorem 1: The proof combines the observation from [15] and the utility of quadratic neurons to approximate a radial function. For convenience and consistency, we use some definitions and notations in [15]. The form of functions represented by a single-hidden-layer conventional network is . It is observed that the distribution of the Fourier transform of is supported on a finite collection of lines. The support covered by these finitely many lines is sparse in the Fourier space, especially in high dimensions and high-frequency regions, unless an exponential number of lines are involved. Thus, a possible target function to be constructed should have major components at high frequencies. A suitable candidate has been found in [15]:

where , is a polynomial function of , are radial indicator functions over disconnected intervals. Although the constructed is hard to approximate by a conventional network, it is easy to approximate by a quadratic network, because Eq. (1) contains square terms that can be rewritten as to compute the norm and then the radial function. Consequently, it is feasible for a single-hidden-layer quadratic network to approximate the radial function with a polynomial number of neurons. Note that is discontinuous, and cannot be perfectly expressed by a neural network with continuous activation functions. Here we use a probability measure . With , can be approximated by represented by a network in the sense of .

Proposition 1 in [15] has demonstrated that cannot be well approximated by a single-hidden-layer conventional network with a polynomial number of neurons. We restate this proposition here as Lemma 1 for readability and coherence of our paper.

Lemma 1: For a fixed dimension d, suppose that , and , and k an integer satisfying with universal constants , there exists a function , where such that for any function of the form: with for some , we have

where is a universal constant.

To show that is approximable with a quadratic network, we note from Lemma 12 of [15] that a continuous Lipschitz function can approximate ; what remains for us to do is to use a quadratic network with a polynomial number of neurons to approximate .

Lemma 2: Given a proper activation function $\sigma$, there is a constant $c_\sigma \geq 1$ (depending on $\sigma$ and other parameters) such that for any $L$-Lipschitz function $f: \mathbb{R} \to \mathbb{R}$, which is constant outside a bounded interval $[r, R]$, and any , there exist scalars $a$, { }, , with which we have

satisfies

Proof: Without loss of generality, we assume that the nonlinear activation function is ReLU: $\sigma(z) = \max\{0, z\}$. However, our proof is also applicable to other nonlinear activation functions.

If , then can be trivially constructed by setting it to be the zero function. Otherwise, we have . We assume that there is an integer m satisfying , dividing into intervals . We set , , and

Then, by such construction, the lemma holds. Here, the number of the used neurons is , which is no more than , where is the floor function.

As we know, Eq.(1) represents a quadratic network with a single hidden layer. Lemma 2 confirms that a Lipschitz radial function can be well expressed by such a single-hidden layer network.
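To make the idea behind Lemma 2 concrete, the sketch below builds a sum of ReLU units that linearly interpolates a Lipschitz univariate function and feeds it the squared norm of the input, so that every hidden unit has the form ReLU($\|\mathbf{x}\|^2 - \beta_i$), which a quadratic neuron can compute with its square terms alone. This is a generic piecewise-linear construction under stated assumptions, not necessarily the exact construction of the proof; the target $\sin(\|\mathbf{x}\|)$, the interval, the Lipschitz constant 0.5, and all names are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def fit_relu_sum(g, lo, hi, m):
    """Return (a, alphas, betas) such that a + sum(alpha_i * relu(t - beta_i))
    linearly interpolates g at m + 1 equally spaced knots on [lo, hi]
    and is constant outside [lo, hi]."""
    h = (hi - lo) / m
    knots = [lo + i * h for i in range(m + 1)]
    slopes = [(g(knots[i + 1]) - g(knots[i])) / h for i in range(m)]
    alphas = [slopes[0]]
    alphas += [slopes[i] - slopes[i - 1] for i in range(1, m)]
    alphas.append(-slopes[-1])  # cancels the last slope beyond hi
    return g(lo), alphas, knots


def radial_net(x, a, alphas, betas):
    # t is the squared norm, i.e. the square terms of a quadratic neuron.
    t = sum(xi * xi for xi in x)
    return a + sum(al * relu(t - be) for al, be in zip(alphas, betas))


def g(t):
    # Univariate profile in the squared norm: sin(||x||) = g(||x||^2).
    return math.sin(math.sqrt(t))


lipschitz, lo, hi, delta = 0.5, 1.0, 9.0, 0.01   # |g'| <= 0.5 on [1, 9]
m = math.ceil(lipschitz * (hi - lo) / delta)     # number of linear pieces
params = fit_relu_sum(g, lo, hi, m)

# Check the sup-norm error on points with 1 <= ||x|| <= 3.
points = [((1 + 2 * i / 500) * math.cos(0.7), (1 + 2 * i / 500) * math.sin(0.7))
          for i in range(501)]
max_err = max(abs(radial_net(p, *params) - math.sin(math.hypot(*p)))
              for p in points)
```

For an $L$-Lipschitz profile, linear interpolation with pieces of width $h$ has error at most $Lh/2$, so the number of units grows only linearly in $L(R-r)/\delta$ and does not depend on the input dimension.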

Lemma 3: There are universal constants and such that, for and any choice of , there exists a function expressed by a single-hidden-layer quadratic network of width at most and with range [-2,+2] such that

Proof: In Lemma 2, we make the following substitutions: , , , . Thus, is expressible by a single-hidden-layer quadratic network with at most neurons. Coupled with Lemma 12 of [15], Lemma 3 is immediately obtained.

Proof of Theorem 1: By the combination of Lemmas 1 and 3, the proof for Theorem 1 is straightforward. In Lemma 1, by choosing , , we have

Let , to approximate we need the number of quadratic neurons being at most

such that

Therefore, we have . The proof is completed.

Classification Example: To demonstrate the exponential difference between the conventional and quadratic networks claimed by Theorem 1, we constructed an example for separation of two concentric rings. In this example, there are 60 instances in each of the two rings representing two classes. With only one quadratic neuron in a single hidden layer, the rings were totally separated, while at least six conventional neurons are required to complete the same task, as shown in Fig. 1.
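The separation of two concentric rings by one quadratic neuron can be reproduced with a few lines of Python. The sketch below uses 60 instances per ring as in the example above, but the ring radii, the noise level, the random seed, and the threshold are assumptions made only for illustration; the weights are set by hand rather than trained.

```python
import math
import random

random.seed(0)


def ring(radius, n, noise=0.05):
    """Sample n points near a circle of the given radius (hypothetical data)."""
    points = []
    for _ in range(n):
        theta = random.uniform(0.0, 2.0 * math.pi)
        r = radius + random.uniform(-noise, noise)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


inner = ring(1.0, 60)   # class 0
outer = ring(2.0, 60)   # class 1


def quadratic_neuron(x, threshold=1.5 ** 2):
    # Only the square terms and the bias are used: x1^2 + x2^2 - threshold.
    return x[0] ** 2 + x[1] ** 2 - threshold


predictions = [quadratic_neuron(p) > 0 for p in inner + outer]
labels = [False] * 60 + [True] * 60
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

The decision boundary of this neuron is the circle of radius 1.5, which lies between the two rings, so a single unit suffices for this toy data.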

### III-B Theorem 2

Key Idea for Proving Theorem 2: It was proved in [16] that an $n$-input function that is not constant along any direction cannot be well approximated by a conventional network with no more than $n-1$ neurons in each layer. However, when such a function is radially defined, it becomes feasible to approximate the function by a quadratic network of width 4, which breaks the lower width bound claimed in [16].

To compute a radial function, we need to find the norm and then evaluate the function at the norm. With a quadratic neuron, the norm is naturally found. With respect to the norm, the radial function is intrinsically univariate. Therefore, heuristically speaking, a quadratic network with no more than four neurons in each layer could approximate a radial function very well, even if the function is not constant along any direction.

The trick of approximation by a deep conventional network is adopted here to study a deep quadratic network, in the same spirit of approaching a function via composition layer by layer. Specifically, we use one quadratic neuron for the squared norm of an input vector, and a neuron to form a truncated-parabola function as a building block of the radial function. In every interval, a truncated-parabola function can approximate a piecewise constant function in the $L^1$ sense. Also, two neurons are needed to store the truncated-parabola function by the pair of neurons . We encapsulate these four neurons, spanning three layers in total, as a module. By connecting these modules properly, we can express a piecewise trapezoid-like function and approximate any univariate function accurately.

Assume that the input variable is x, an interval is , the network utilizes the ReLU function as the activation function, and . We use to represent the output of the -th neuron in the -th layer (ignoring the neuron that computes the squared norm); then we have

where , signifies the ReLU function, is the expected output for the interval , and As shown in Fig. 2, the output is a truncated-parabola piecewise function. Also,

(4) |

where we have:

In this way, each module will correspond to a truncated-parabola function over a unique interval. By decreasing , our truncated-parabola function approximates the target piecewise constant function with increasing accuracy. Since piecewise constant functions can approximate any Lebesgue-integrable function in the $L^1$ sense, a deep quadratic network with four neurons in each layer can approximate any Lebesgue-integrable radial function. Our final network structure is shown in Fig. 3. A blue rectangle is a neuron, and a red rectangle forms a layer. Depending on the interval width, our network can be very deep.
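The following Python sketch shows one plausible way to realize a truncated-parabola building block; it is an illustration under assumptions, not a reproduction of the module weights used in the proof. A steep downward parabola over an interval $[a, b]$ is passed through ReLU and then capped at the desired height with a second ReLU, giving a trapezoid-like bump. The parameter `delta` (the width of the rising and falling edges) and all function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def capped_bump(t, a, b, height, delta):
    """Trapezoid-like bump on [a, b]: zero outside, equal to `height`
    on [a + delta, b - delta], with parabolic edges in between."""
    # Scale chosen so that the parabola reaches `height` at t = a + delta.
    scale = height / (delta * (b - a - delta))
    parabola = relu(scale * (t - a) * (b - t))   # a quadratic neuron followed by ReLU
    return height - relu(height - parabola)      # equals min(height, parabola)


def l1_error(a, b, height, delta, steps=20000):
    """Midpoint-rule estimate of the L1 distance between the bump and the
    piecewise constant function height * 1_[a, b]."""
    h = (b - a) / steps
    return sum(abs(height - capped_bump(a + (i + 0.5) * h, a, b, height, delta)) * h
               for i in range(steps))


a, b, height = 1.0, 2.0, 3.0       # hypothetical interval and plateau value
err_wide = l1_error(a, b, height, delta=0.1)
err_narrow = l1_error(a, b, height, delta=0.01)
```

Because the bump differs from the indicator only on the two edges, its $L^1$ error is bounded by $2 \cdot \text{delta} \cdot \text{height}$ and therefore vanishes as the edges are made narrower.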

In the following, we prove Theorem 2 formally. In the proof, the closeness between two functions is measured in the $L^1$ sense, and for convenience we partially adopt the notations used in [16].

Notations: For a Lebesgue-integrable radial function , suppose that is supported on ; we define:

Because is Lebesgue-integrable, and are measurable. Then, there exists a series of cylindrical tubes having the property:

(5) |

where is the Lebesgue measure. For cylindrical tubes , we assume , where , and is the corresponding height of a cylindrical tube. Then, we define the corresponding indicator function as , and we have the following lemma.

Lemma 4: For any and , the weighted sum of the indicator functions, which represent the cylindrical tubes, satisfies:

Proof: Lemma 4 is evident by Eq. (5).

By stacking the modules gradually, we reconstruct the target function over more and more intervals until the approximation is complete. As we discussed, the function generated by our composite three-layer modules is positive. If the corresponding cylindrical tube is positive, then it can be used to approximate directly. If the corresponding cylindrical tube is negative, we need to subtract the output of the module when its output is transmitted into the next layer. The network will eventually produce a function with many trapezoid-like pieces. We denote such a function as , and let and be the positive and negative parts of , which are actually and . Then, we have

Because is Lebesgue-integrable, its cover is Lebesgue-integrable as well. Therefore, . Let us denote and ; we can approximate by our quadratic network up to a given accuracy .

(6) | ||||

Lemma 5:

Proof: Because of Eq. (6), we apply the triangle inequality:

Proof of Theorem 2: For any radial function whose support is bounded and any given closeness , there is a network with no more than four neurons in each layer of the quadratic network as shown in Fig. 3. Applying the triangle inequality again, with Lemmas 4 and 5, the function represented by the quadratic network satisfies:

Analytic Example: The function to approximate is , where x is supported on , as shown in Fig. 4(left). We approximate by stacking three modules with 9 layers in total, as shown in Fig. 4(right). The resultant function is:

which divides into three pieces , and . If the support of is divided into more intervals, the closeness of and will be further improved.

### III-C Theorem 3

Key Idea for Proving Theorem 3: For universal approximation by neural networks, the current mainstream strategies are all based on piecewise approximation in terms of , , or other distances. For such piecewise approximation, the functional space is divided into numerous hypercubes, which are intervals in the one-dimensional case, according to a specified accuracy to approximate a target function in every hypercube. With quadratic neurons, we can instead use a global approximating method, which can be much more efficient. At the same time, the quadratic network structure is neither too wide nor too deep, and can be regarded as a size-bounded universal approximator, in contrast to [14][16]. In [14], a single-hidden-layer network may have an infinite width. On the other hand, in [16] the network width is restricted to be no more than $n+4$ but the network depth goes to infinity. Moreover, aided by the Fundamental Theorem of Algebra, our novel proof reveals the uniqueness and facility of our proposed quadratic networks, which cannot be elegantly matched by networks with quadratic activation.

First, we show that any univariate polynomial of degree $N$ can be exactly expressed by a quadratic network with a complexity of in depth. Next, we refer to the result [37] regarding Hilbert's thirteenth problem that multivariate functions can be represented with a group of separable functions, and then finalize the proof.

Lemma 6: Any univariate polynomial of degree can be perfectly computed by a quadratic network with depth of and width of no more than .

Proof: According to the Fundamental Theorem of Algebra [38], a general univariate polynomial of degree can be expressed as , where . Similarly, we set as the output of the -th unit in the -th layer. The network we construct is shown in Fig. 7. Every neuron in the first layer computes , or ; then the second layer uses only half as many neurons as the first layer to combine the outputs of the first layer. By repeating this process, with the depth of the quadratic network can exactly represent .
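The layer-by-layer multiplication described in the proof can be sketched as a binary product tree. In the Python example below, the first layer evaluates the real linear and quadratic factors of a polynomial, and every later layer multiplies neighbouring outputs in pairs, each product being expressible by one quadratic neuron whose two affine parts each read one input. The example polynomial, the factor lists, and the function names are hypothetical, and the factorization is assumed to be given rather than learned.

```python
def affine(t, w, b):
    return w * t + b


def quadratic_factor(t, p, q):
    # t^2 + p*t + q written as one quadratic neuron: (t) * (t + p) + q.
    return affine(t, 1.0, 0.0) * affine(t, 1.0, p) + q


def product_layer(values):
    """Multiply neighbours in pairs; an odd leftover is passed through unchanged."""
    out = [values[i] * values[i + 1] for i in range(0, len(values) - 1, 2)]
    if len(values) % 2:
        out.append(values[-1])
    return out


def evaluate(t, lead, linear_roots, quad_coeffs):
    """Evaluate lead * prod(t - r) * prod(t^2 + p*t + q) and report the depth."""
    layer = [affine(t, 1.0, -r) for r in linear_roots]
    layer += [quadratic_factor(t, p, q) for p, q in quad_coeffs]
    depth = 1
    while len(layer) > 1:
        layer = product_layer(layer)
        depth += 1
    return lead * layer[0], depth


# Hypothetical degree-6 example: 2 (t - 1)(t + 2)(t^2 + 1)(t^2 + t + 1).
linear_roots = [1.0, -2.0]
quad_coeffs = [(0.0, 1.0), (1.0, 1.0)]
value, depth = evaluate(0.5, 2.0, linear_roots, quad_coeffs)
```

Since the number of active units halves at every layer, the depth grows only logarithmically with the number of factors, while the width is set by the first layer.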

The following lemma shows that any univariate function continuous in can be approximated with a Bernstein polynomial up to any accuracy, which is the Bernstein version of the proof for the Weierstrass theorem.

Lemma 7: Let be a continuous function over ; then we have

Proof: It is well known; please refer to [30].

Corollary 1: Any continuous univariate function supported on can be approximated by a quadratic network up to any accuracy.

Lemma 8: Every continuous -variable function on can be represented in the form:

Proof: This is a classical theorem due to Kolmogorov and his student Arnold. Please refer to [39].

Corollary 2: The quadratic network is a universal approximator.

Proof: Referring to Lemmas 6, 7 and 8, it can be shown that the quadratic network is a universal approximator. Specifically, assuming that is a polynomial of degree , and is a polynomial of degree

, the size of a quadratic network can be estimated to approximate

well. The representation of requires a network with a width of and a depth of . Then, the representation of demands an additional configuration with a width of and a depth of . Therefore, the structure used for is of width and depth . Integrating all terms in parallel, the final quadratic network architecture will be of at most width and depth . Also, aided by the concept of partially separable functions, the complexity of the quadratic network can be further reduced, such as in the case of computing an separable function. By the separable function, we mean that is separable, defined as follows:

In practice, almost all continuous functions can be represented as separable functions, which are of low ranks at the same time.

Let us look at the following numerical example to illustrate our novel quadratic-network-based factorization method. First, 100 instances were sampled of a function, in [-1,0]. Instead of taking opposite weights and biases to create linearity as used in the proof for clarity, here we incorporated shortcuts to make our factorization method trainable in terms of adaptive offsets. Using the ReLU activation function, we trained a four-layer network 4-3-1-1 with shortcuts to factorize this polynomial, as shown in Fig. 6. The parameters in connections marked by green symbols are fixed to perform multiplication. The shortcut connections and vanilla connections are denoted by green and red lines respectively. The neurons in the first layer will learn the shifted factors , with unknown constant offsets . The whole network will be combined in pairs to form in the next layer. Then, the neurons in the third layer will multiply and to obtain . Finally, the neuron in the output layer is a linear one that will be trained to undo the effect of the constant offsets aided by the shortcuts. We trained the network to learn the factorization by initializing the parameters multiple times, with 600 iterations and a learning rate of 2.0e-3. The final average error is less than 0.0051. In this way, the function was learned to be , which agrees well with the target function .

Mathematically, we can handle the general factorization representation problem as follows. Let us denote , which are generic factors for . The question becomes whether can always be represented as a linear combination of and a constant bias. If the answer is positive, then our above-illustrated factorization method can be extended to the factorization of any univariate polynomial. Here we offer a proof by mathematical induction.

For , we have . Assume that we can represent with a linear combination of and a constant, denoted as . Then,
