Recently, the microprocessor industry has moved toward chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with light-weight scalar communication mechanisms [21, 23, 29]. Despite such support, most existing compiler multi-threading techniques have generally demonstrated little effectiveness in extracting parallelism from non-scientific applications [14, 15, 22]. The main reason for this is that such techniques are mostly restricted to extracting parallelism within straight-line regions of code. In this paper, we first propose a framework that enables global multi-threaded instruction scheduling in general. We then describe GREMIO, a scheduler built using this framework. GREMIO operates at a global scope, at the procedure level, and uses control dependence analysis to extract non-speculative thread-level parallelism from sequential codes. Using a fully automatic compiler implementation of GREMIO and a validated processor model, this paper demonstrates gains for a dual-core CMP model running a variety of codes. Our experiments demonstrate the advantage of exploiting global scheduling for multi-threaded architectures, and present gains in a detailed comparison with the Decoupled Software Pipelining (DSWP) multi-threading technique . Furthermore, our experiments show that adding GREMIO to a compiler with DSWP improves the average speedup from 16.5% to 32.8% for important benchmark functions when utilizing two cores, indicating the importance of this technique in making compilers extract threads effectively.