An Architecture towards Sharing FPU across Cores





Abstract—The multithreading and multicore techniques are
widely adopted in the design of modern high-performance CPUs.
Multithreading allows multiple threads to share the functional
units (FUs) within a core for better utilization of the FUs. As a
result, threads may conflict over the use of some FUs, the
floating-point unit (FPU) for instance. In such a case, some
floating-point instructions are stalled until the FPU becomes
available. The multicore technique implements a small-scale
multiprocessor on a chip. A thread that runs on one core cannot
use the FUs of other cores. This results in poor utilization of the
FPUs in cores whose threads contain no floating-point
instructions at all, while in other cores the threads are
struggling to compete for the FPU. Unlike traditional
multiprocessors, which are implemented with multiple CPU chips,
a multicore CPU implements the multiprocessor on a single chip,
so it becomes possible to let the threads in a core group share
all the FPUs in the group. When a conflict over the use of an FPU
occurs, some floating-point operations can be redirected to cores
of the same group whose FPUs are idle, so that the overall
performance of the multicore CPU is improved. This paper
investigates such a group architecture and reports its
performance improvement over the traditional multicore
architecture. Our experimental results show that, on average
for the floating-point benchmarks, performance improvements of
4.25%, 7.34%, and 7.45% can be achieved by redirecting
floating-point operations to other cores within the group with
group sizes of two, four, and eight, respectively, under the

are referenced in turn. In Fig. 3, a CPU composed of eight cores

implementation requires many modifications. Multi2Sim, on the other hand,

and in particular for the FPU, the number of units implemented, the
issue latency, and the operation latency are shown in Table 1






FU name      Number   IssueLat   OpLat
FP COMPARE   1        2          3
FP ADD       1        2          5
FP MUL       1        4          8
FP DIV       1        20         40
FP COMPLEX   1        50         100
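
The timing parameters in Table 1 can be made concrete with a minimal Python sketch. It assumes that the issue latency is the number of cycles a unit stays occupied before it can accept the next operation, and that the operation latency is the number of cycles from issue until the result is available; the surviving text does not define either term. The names `FuTiming`, `TABLE_1`, and `finish_cycle` are hypothetical and are not taken from the simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuTiming:
    count: int      # number of units of this type per core
    issue_lat: int  # assumed: cycles before the unit accepts another operation
    op_lat: int     # assumed: cycles from issue until the result is available


# Values copied from Table 1.
TABLE_1 = {
    "FP COMPARE": FuTiming(1, 2, 3),
    "FP ADD": FuTiming(1, 2, 5),
    "FP MUL": FuTiming(1, 4, 8),
    "FP DIV": FuTiming(1, 20, 40),
    "FP COMPLEX": FuTiming(1, 50, 100),
}


def finish_cycle(fu: str, n_ops: int) -> int:
    """Cycle at which the last of n_ops independent operations completes.

    All operations are ready at cycle 0 and are spread evenly over the
    available units of the given type on a single core.
    """
    if n_ops <= 0:
        return 0
    t = TABLE_1[fu]
    ops_on_busiest_unit = -(-n_ops // t.count)  # ceiling division
    return (ops_on_busiest_unit - 1) * t.issue_lat + t.op_lat
```

Under these assumptions, four independent FP MUL operations on one core finish at cycle 20 (three issue slots of 4 cycles plus one 8-cycle operation), which illustrates why a single FPU per core becomes a bottleneck for floating-point-heavy threads.
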
VII. Simulation Results
The 8-core, 8-thread configuration used in the experiments, with superscalar instruction

Table VII shows, for each group size, the percentage of waiting time for each FU







Group size   1        2        4        8
FP ADD       23.97%   19.33%   4.38%    0.25%
FP COMP      8.64%    8.04%    0.60%    0.01%
FP MUL       54.38%   45.70%   8.01%    0.66%
FP DIV       5.67%    4.53%    1.07%    0.08%
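
The effect of the group size on waiting time can be summarized as a relative reduction. The Python sketch below assumes that the group-size-1 column is the configuration without sharing; the names `WAIT_RATIO`, `GROUP_SIZES`, and `reduction_vs_baseline` are hypothetical, and the numbers are copied from the table above.

```python
GROUP_SIZES = [1, 2, 4, 8]

# Percentage of waiting time per FU and group size, copied from the table.
WAIT_RATIO = {
    "FP ADD": [23.97, 19.33, 4.38, 0.25],
    "FP COMP": [8.64, 8.04, 0.60, 0.01],
    "FP MUL": [54.38, 45.70, 8.01, 0.66],
    "FP DIV": [5.67, 4.53, 1.07, 0.08],
}


def reduction_vs_baseline(fu: str) -> dict:
    """Relative reduction (in percent) of the waiting ratio for each
    group size, measured against the group-size-1 column."""
    base = WAIT_RATIO[fu][0]
    return {
        size: round((base - ratio) / base * 100, 1)
        for size, ratio in zip(GROUP_SIZES, WAIT_RATIO[fu])
    }
```

For FP MUL, for example, the waiting ratio drops from 54.38% at group size 1 to 0.66% at group size 8, so almost all of the waiting observed without sharing disappears in the largest group.
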

FP ADD       2.49%    6.94%
FP COMP      0.06%    0.78%
FP MUL       7.22%    24.14%
FP DIV       0.29%    2.38%
FP COMPLEX   0.15%    2.13%
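The architecture in which instructions are not delegated to other cores
for execution is taken as the baseline architecture. In contrast, with the group size set to 2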

Group size   2       4       8
Overhead 0   4.25%   7.34%   7.45%
Overhead 1   3.38%   6.28%   6.36%
Overhead 2   2.49%   3.88%   4.33%
Overhead 3   2.14%   2.80%   4.02%
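
The interplay between group size and redirection overhead can be illustrated with a toy model. The Python sketch below is not the simulator used in the experiments. It assumes one FPU per core, a single operation type, requests that are independent of each other, and a simple policy: use the local FPU if it is free, otherwise redirect to the first idle FPU in the same group at a cost of `overhead` extra cycles, otherwise wait for the local FPU. The function name `makespan` and all of its parameters are hypothetical.

```python
def makespan(requests, n_cores, group_size, issue_lat, op_lat, overhead):
    """Cycle at which the last floating-point result becomes available.

    requests is a list of (cycle, core) pairs: at the given cycle, the
    given core wants to issue one floating-point operation.
    """
    free_at = [0] * n_cores  # cycle at which each core's FPU is free again
    last_done = 0
    for cycle, core in sorted(requests):
        first = (core // group_size) * group_size
        group = range(first, min(first + group_size, n_cores))
        if free_at[core] <= cycle:
            # Local FPU is idle: issue immediately.
            target, start = core, cycle
        else:
            idle = [c for c in group if c != core and free_at[c] <= cycle]
            if idle:
                # Redirect to an idle FPU in the group, paying the overhead.
                target, start = idle[0], cycle + overhead
            else:
                # No idle FPU in the group: wait for the local one.
                target, start = core, free_at[core]
        free_at[target] = start + issue_lat
        last_done = max(last_done, start + op_lat)
    return last_done


# Core 0 wants to issue four operations at cycle 0; core 1 has none.
burst = [(0, 0)] * 4
```

With the FP MUL latencies from Table 1 (issue latency 4, operation latency 8), `makespan(burst, 2, 1, 4, 8, 0)` gives 20 cycles without sharing, and `makespan(burst, 2, 2, 4, 8, 0)` gives 16 cycles when the two cores form one group. In this toy model a larger overhead delays the redirected operation, which is consistent in direction with the smaller improvements reported for larger overheads in the table above.
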

[1] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous
multithreading: Maximizing on-chip parallelism,” in ACM SIGARCH Computer
Architecture News, vol. 23, no. 2. ACM, 1995, pp. 392–403.
[2] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer,
B. Sano, S. Smith, R. Stets, and B. Verghese, Piranha: A scalable
architecture based on single-chip multiprocessing. ACM, 2000, vol. 28,
no. 2.
[3] D. Processor, “CMP implementation in systems based on the Intel®
Core™,” Intel® Centrino® Duo Mobile Technology, vol. 10, no. 2,
pp. 99–108, 2006.
[4] P. Kongetira, K. Aingaran, and K. Olukotun, “Niagara: A 32-way
multithreaded SPARC processor,” Micro, IEEE, vol. 25, no. 2, pp. 21–29,
2005.
[5] M. Butler, “AMD “Bulldozer” core: A new approach to multithreaded
compute performance for maximum efficiency and throughput,” in IEEE
HotChips Symposium on High-Performance Chips (HotChips 2010),
2010.
[6] M. R. Kakoee, I. Loi, and L. Benini, “A shared-FPU architecture for
ultra-low power MPSoCs,” in Proceedings of the ACM International
Conference on Computing Frontiers. ACM, 2013, p. 3.
[7] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: A
simulation framework for CPU-GPU computing,” in Proc. of the 21st
International Conference on Parallel Architectures and Compilation
Techniques, Sep. 2012.
[8] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An infrastructure for
computer system modeling,” Computer, vol. 35, no. 2, pp. 59–67, 2002.
[9] S. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, “The SPLASH-2
programs: Characterization and methodological considerations,” in Proc.
of the 22nd International Symposium on Computer Architecture, June
1995.
[10] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: A
tool for evaluating and synthesizing multimedia and communications
systems,” in Proc. of the 30th Int’l Symposium on Microarchitecture,
Dec. 1997.
[11] K. Takaki, T. Kurihara, and Y. Li, “On the performance improvement
of an architecture towards sharing FPUs across cores for the design of
multithreading multicore CPUs,” in The 3rd International Workshop on
Computer Systems and Architectures. IEEE, Dec. 2015, pp. 408–411.
[12] A. Fog, “Instruction tables: Lists of instruction latencies, throughputs
and micro-operation breakdowns for Intel, AMD and VIA CPUs,” Technical
University of Denmark, 2014.
