A simple way to avoid idle processors in implementing the multigrid method based on a sequence of nested grids, \Omega_j for j = 1, 2,..., l with \Omega_1 being the coarsest grid, on a parallel computer is to select a proper grid \Omega_j with 1 < j < l as the new coarsest grid. The aim of this paper is to show that such an approach is an efficient way to implement the multigrid method on a large scale, multiprocessor computer. In particular, the variant of the V-cycle generated by this approach, which is called the U-cycle, is studied in this paper. It is shown that the convergence rate \rho(j) of the U-cycle with grid \Omega_j being the coarsest grid is a decreasing function of j, and the coarsest grid equations of the U-cycle can be solved approximately without increasing the total number of U-cycle iterations. Then, a parallel U-cycle is defined by using domain partitioning techniques, which can be implemented on a MIMD multiprocessor computer without any idle processors. An analysis of the time complexity of the parallel U-cycle shows that the parallel U-cycle is fully scalable, and can have super-linear speedup in comparison to the original V-cycle. Further, the performance of the parallel U-cycle in the memory-constrained case is discussed. Numerical results are presented showing that the U-cycle can have a better performance than the V-cycle on a sequential computer while the parallel U-cycle has a high efficiency on a large scale, MIMD multiprocessor computer. Experiments are presented for both the Intel Paragon and the IBM SP2.