Abstract
Network performance problems are notoriously tricky to diagnose, and the difficulty is magnified when applications are split into multiple tiers of components spread across thousands of servers in a data center. Problems often arise in the communication between the tiers, where either the application or the network (or both!) could be to blame. In this paper, we present SNAP, a scalable network-application profiler that guides developers in identifying and fixing performance problems. SNAP passively collects TCP statistics and socket-call logs with low computation and storage overhead, and correlates across shared resources (e.g., host, link, switch) and connections to pinpoint the location of the problem (e.g., send buffer mismanagement, TCP/application conflicts, application-generated microbursts, or network congestion). Our one-week deployment of SNAP in a production data center (with over 8,000 servers and over 700 application components) has already helped developers uncover 15 major performance problems in application software, the network stack on the server, and the underlying network.
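
The idea of correlating connections across shared resources can be illustrated with a minimal sketch. It is not SNAP's actual algorithm or data schema: the sample records, the problem labels, the function name `shared_resource_suspects`, and the `min_connections` threshold are all hypothetical, chosen only to show how grouping per-connection observations by host and link can surface a resource that several affected connections have in common.

```python
from collections import defaultdict

# Hypothetical per-connection observations:
# (connection id, host, link, suspected problem class or None).
# Field names and labels are illustrative assumptions, not SNAP's schema.
SAMPLES = [
    ("c1", "host-a", "link-1", "send_buffer"),
    ("c2", "host-a", "link-1", "send_buffer"),
    ("c3", "host-a", "link-2", "network"),
    ("c4", "host-b", "link-2", "network"),
    ("c5", "host-b", "link-2", "network"),
    ("c6", "host-b", "link-3", None),
]


def shared_resource_suspects(samples, min_connections=2):
    """Group affected connections by (resource, problem) and keep the groups
    where at least `min_connections` connections share the same resource."""
    groups = defaultdict(set)
    for conn, host, link, problem in samples:
        if problem is None:
            continue  # connection shows no suspected problem
        groups[(host, problem)].add(conn)
        groups[(link, problem)].add(conn)
    return {
        key: sorted(conns)
        for key, conns in groups.items()
        if len(conns) >= min_connections
    }


suspects = shared_resource_suspects(SAMPLES)
for (resource, problem), conns in sorted(suspects.items()):
    print(resource, problem, conns)
```

In this toy data, three connections from two different hosts report a network problem on the same link, which points at the link rather than at either host; a real profiler would need far richer statistics and thresholds than this sketch assumes.
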
Original language | English (US) |
---|---|
Pages | 57-70 |
Number of pages | 14 |
State | Published - Jan 1 2011 |
Event | 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2011 - Boston, United States; Duration: Mar 30 2011 → Apr 1 2011 |
Conference
Conference | 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2011 |
---|---|
Country/Territory | United States |
City | Boston |
Period | 3/30/11 → 4/1/11 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Control and Systems Engineering