TY - JOUR
T1 - Learning curves for stochastic gradient descent in linear feedforward networks
AU - Werfel, Justin
AU - Xie, Xiaohui
AU - Seung, H. Sebastian
N1 - Funding Information:
We thank Ila Fiete and Gert Cauwenberghs for useful discussions and comments. This work was supported in part by a Packard Foundation Fellowship (to H.S.S.) and NIH grants (GM07484 to MIT and MH60651 to H.S.S.).
PY - 2005/12
Y1 - 2005/12
N2 - Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are sometimes used to overcome these difficulties. We analyze three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. Learning speed is defined as the rate of exponential decay in the learning curves. When the scalar parameter that controls the size of weight updates is chosen to maximize learning speed, node perturbation is slower than direct gradient descent by a factor equal to the number of output units; weight perturbation is slower still by an additional factor equal to the number of input units. Parallel perturbation allows faster learning than sequential perturbation, by a factor that does not depend on network size. We also characterize how uncertainty in quantities used in the stochastic updates affects the learning curves. This study suggests that in practice, weight perturbation may be slow for large networks, and node perturbation can have performance comparable to that of direct gradient descent when there are few output units. However, these statements depend on the specifics of the learning problem, such as the input distribution and the target function, and are not universally applicable.
AB - Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are sometimes used to overcome these difficulties. We analyze three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. Learning speed is defined as the rate of exponential decay in the learning curves. When the scalar parameter that controls the size of weight updates is chosen to maximize learning speed, node perturbation is slower than direct gradient descent by a factor equal to the number of output units; weight perturbation is slower still by an additional factor equal to the number of input units. Parallel perturbation allows faster learning than sequential perturbation, by a factor that does not depend on network size. We also characterize how uncertainty in quantities used in the stochastic updates affects the learning curves. This study suggests that in practice, weight perturbation may be slow for large networks, and node perturbation can have performance comparable to that of direct gradient descent when there are few output units. However, these statements depend on the specifics of the learning problem, such as the input distribution and the target function, and are not universally applicable.
UR - http://www.scopus.com/inward/record.url?scp=27144462270&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=27144462270&partnerID=8YFLogxK
U2 - 10.1162/089976605774320539
DO - 10.1162/089976605774320539
M3 - Article
C2 - 16212768
AN - SCOPUS:27144462270
SN - 0899-7667
VL - 17
SP - 2699
EP - 2718
JO - Neural Computation
JF - Neural Computation
IS - 12
ER -
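
Below is a minimal, illustrative sketch of the three update rules the abstract compares (direct gradient descent, node perturbation, weight perturbation) for a linear perceptron y = W x trained on squared error. It is not the authors' implementation; the layer sizes, step sizes, noise scale, and the noiseless teacher setup are assumptions chosen only to make the demo run, not values taken from the paper.

# Sketch of the three online update rules described in the abstract.
# All numeric choices below are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 20, 2                                  # assumed layer sizes
W_target = rng.standard_normal((n_out, n_in))        # assumed "teacher" weights


def loss(W, x, d):
    """Squared error 0.5 * ||d - W x||^2 for a single example."""
    e = d - W @ x
    return 0.5 * float(e @ e)


def direct_gradient_step(W, x, d, eta):
    """Exact online gradient step: dE/dW = -(d - W x) x^T."""
    e = d - W @ x
    return W + eta * np.outer(e, x)


def node_perturbation_step(W, x, d, eta, sigma=0.1):
    """Perturb the output units and use the resulting change in error.

    To first order, E[(E_perturbed - E) * xi] ~ sigma^2 * dE/dy,
    so dividing by sigma^2 estimates the output-side gradient.
    """
    base = loss(W, x, d)
    xi = sigma * rng.standard_normal(n_out)           # noise on the outputs
    e_pert = d - (W @ x + xi)
    delta = 0.5 * float(e_pert @ e_pert) - base
    return W - eta * (delta / sigma**2) * np.outer(xi, x)


def weight_perturbation_step(W, x, d, eta, sigma=0.1):
    """Perturb every weight and use only the scalar change in error."""
    base = loss(W, x, d)
    psi = sigma * rng.standard_normal(W.shape)        # noise on the weights
    delta = loss(W + psi, x, d) - base
    return W - eta * (delta / sigma**2) * psi


# Usage example: train each rule on the same kind of random input stream.
# The smaller step sizes for the stochastic rules reflect the abstract's
# point that their maximal usable updates are smaller.
for name, step, eta in [("direct gradient", direct_gradient_step, 0.02),
                        ("node perturbation", node_perturbation_step, 0.01),
                        ("weight perturbation", weight_perturbation_step, 0.001)]:
    W = np.zeros((n_out, n_in))
    for _ in range(5000):
        x = rng.standard_normal(n_in)
        d = W_target @ x                              # noiseless target
        W = step(W, x, d, eta)
    x_test = rng.standard_normal(n_in)
    print(f"{name:>20}: test error {loss(W, x_test, W_target @ x_test):.2e}")

Under these assumptions the final test errors typically come out ordered as the abstract predicts: direct gradient descent lowest, node perturbation close behind with few output units, and weight perturbation noticeably slower.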