Nanometer fabrication technologies have made it feasible to integrate multiple processors on a single chip. Heterogeneous multi-processor systems-on-chip (MPSoCs), in which different processors are customized for specific tasks, can provide high levels of efficiency in performance and power consumption, while maintaining programma-bility. However, in order to best exploit processor heterogeneity, designers are still required to manually customize each processor, while mapping the application tasks to them, so that the overall performance and/or power requirements are satisfied. In this paper, we propose a methodology to automatically synthesize a custom (heterogeneous) architecture, consisting of multiple extensible processors, to best speed up a given application. Our methodology simultaneously customizes the instruction set of, and assigns application tasks to, each processor in the multiprocessor system, while scheduling their execution. We motivate the need for such an integrated approach by demonstrating that custom instruction selection has complex interdependencies with task assignment and scheduling, and performing these steps independently often results in significant degra-dation in the quality of the synthesized multiprocessor architecture. Our methodology uses an iterative improvement algorithm to assign and schedule tasks on processors and select custom instructions along the critical path in an interleaved manner. It utilizes the concept of "expected execution time" to better integrate these two steps. It not only considers the currently selected custom instructions for the current task assignment and schedule, but also the possibility of better custom instructions being selected in future iterations. We also enhance our methodology to integrate task-level software pipelining to further increase the parallelism and provide opportunities for multiprocessing. We have implemented the proposed heterogeneous multiprocessor synthesis methodology in the context of a commercial extensible processor design flow, using the Xtensa™platform from Tensilica Inc. We have evaluated our tool by automatically generating custom multiprocessor architectures for several complex embedded software benchmarks. The results show that architectures synthesized by the proposed methodology demonstrate an average speedup of 2.0 × (up to 3.1 ×) compared to symmetric multiprocessor architectures in which the processors have not been augmented with custom instructions. To the best of our knowledge, this is the first tool for the synthesis of custom MPSoCs using extensible processors.