TY - GEN
T1 - Scaling up classifiers to Cloud Computers
AU - Moretti, Christopher
AU - Steinhaeuser, Karsten
AU - Thain, Douglas
AU - Chawla, Nitesh V.
PY - 2008
Y1 - 2008
N2 - As the size of available datasets has grown from Megabytes to Gigabytes and now into Terabytes, machine learning algorithms and computing infrastructures have continuously evolved in an effort to keep pace. But at large scales, mining for useful patterns still presents challenges in terms of data management as well as computation. These issues can be addressed by dividing both data and computation to build ensembles of classifiers in a distributed fashion, but trade-offs in cost, performance, and accuracy must be considered when designing or selecting an appropriate architecture. In this paper, we present an abstraction for scalable data mining that allows us to explore these tradeoffs. Data and computation are distributed to a computing cloud with minimal effort from the user, and multiple models for data management are available depending on the workload and system configuration. We demonstrate the performance and scalability characteristics of our ensembles using a wide variety of datasets and algorithms on a Condor-based pool with Chirp to handle the storage.
AB - As the size of available datasets has grown from Megabytes to Gigabytes and now into Terabytes, machine learning algorithms and computing infrastructures have continuously evolved in an effort to keep pace. But at large scales, mining for useful patterns still presents challenges in terms of data management as well as computation. These issues can be addressed by dividing both data and computation to build ensembles of classifiers in a distributed fashion, but trade-offs in cost, performance, and accuracy must be considered when designing or selecting an appropriate architecture. In this paper, we present an abstraction for scalable data mining that allows us to explore these tradeoffs. Data and computation are distributed to a computing cloud with minimal effort from the user, and multiple models for data management are available depending on the workload and system configuration. We demonstrate the performance and scalability characteristics of our ensembles using a wide variety of datasets and algorithms on a Condor-based pool with Chirp to handle the storage.
UR - http://www.scopus.com/inward/record.url?scp=67149116890&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=67149116890&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2008.99
DO - 10.1109/ICDM.2008.99
M3 - Conference contribution
AN - SCOPUS:67149116890
SN - 9780769535029
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 472
EP - 481
BT - Proceedings - 8th IEEE International Conference on Data Mining, ICDM 2008
T2 - 8th IEEE International Conference on Data Mining, ICDM 2008
Y2 - 15 December 2008 through 19 December 2008
ER -