We study the improper learning of multi-layer neural networks. Suppose that the neural network to be learned has k hidden layers and that the ℓ1-norm of the incoming weights of any neuron is bounded by L. We present a kernel-based method that, with probability at least 1-δ, learns a predictor whose generalization error is at most ϵ worse than that of the neural network. The sample complexity and the time complexity of the method are polynomial in the input dimension and in (1/ϵ, log(1/δ), F(k, L)), where F(k, L) is a function that depends on (k, L) and on the activation function but is independent of the number of neurons. The algorithm applies to both sigmoid-like and ReLU-like activation functions. This implies that any sufficiently sparse neural network is learnable in polynomial time.
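
The guarantee stated above can be written out in symbols. This is only a sketch: the names \hat{f} (the learned predictor), f^* (the target network), \ell (the loss), d (the input dimension), n (the sample size), and T (the running time) are notation assumed here for illustration, not taken from the text.

```latex
% Assumed notation: \hat{f} learned predictor, f^* target network with k hidden
% layers and per-neuron incoming \ell_1-norm at most L, \ell a loss function.
\[
  \Pr\!\Big[\, \mathbb{E}\big[\ell(\hat{f}(x), y)\big]
      \;\le\; \mathbb{E}\big[\ell(f^{*}(x), y)\big] + \epsilon \,\Big]
  \;\ge\; 1 - \delta ,
\]
% Sample size n and running time T are polynomial in the listed quantities;
% F(k, L) depends on the activation function but not on the number of neurons.
\[
  n,\; T \;=\; \mathrm{poly}\!\big(d,\; 1/\epsilon,\; \log(1/\delta),\; F(k, L)\big).
\]
```

Because the learning is improper, \hat{f} need not itself be a neural network; it only has to match the target network's generalization error up to ϵ.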