YOKOKAWA Mitsuo | |
Organization for Advanced and Integrated Research | |
Professor | |
Engineering / Other Field |
Jun. 2014 The Institute of Electronics, Information and Communication Engineers, Achievement Award 2013, Research and Development of the K Computer
The K computer is a large-scale distributed-memory parallel supercomputer with 88,128 CPUs (705,024 cores) that makes full use of new technologies, including the HPC-ACE extensions for fast scientific computation, the direct interconnection network called Tofu, and power-efficient packaging and cooling. It was developed and manufactured by RIKEN and Fujitsu Limited from FY2006 to FY2012. In October 2011 it achieved the target LINPACK performance of 10.51 petaflops and was ranked first in the TOP500. In the HPC Challenge benchmarks it also achieved world-leading results in each category: HPL 9.796 PFLOPS, Global RandomAccess 472 GUPS, EP STREAM 3,857 TByte/s, and Global FFT 205.9 TFLOPS. As originally scheduled, the system was completed in June 2012.
Japan society
Feb. 2011 The Japan Society of Fluid Mechanics, Award for Outstanding Paper in Fluid Mechanics, Energy dissipation rate and energy spectrum in high resolution direct numerical simulations of turbulence in a periodic box
Using the Earth Simulator, the authors performed a large-scale DNS (4096^3) of homogeneous isotropic turbulence and, by numerically realizing turbulence at high Reynolds numbers that are difficult to reach even experimentally, constructed a turbulence database containing a sufficiently wide inertial subrange. They clearly demonstrated fundamental findings such as that the energy dissipation rate becomes independent of the viscosity in the high-Reynolds-number limit and that the energy spectrum in the inertial subrange deviates significantly from Kolmogorov's -5/3 law. Their computation, which achieved a sustained 16.4 TFlops with a spectral method on the Earth Simulator, was awarded the 2002 Gordon Bell Prize (Special). Furthermore, through careful and detailed analysis of the resulting turbulence database, the authors' group has produced numerous findings on turbulence statistics, which were summarized in a 2009 Annu. Rev. Fluid Mech. article.
Japan society
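For reference, the Kolmogorov inertial-range form that the deviation mentioned above is measured against can be written as follows, where ε is the mean energy dissipation rate and C a dimensionless constant; the correction exponent δ is only a schematic way to express "a small departure from -5/3" and its value is not reproduced here.

```latex
E(k) = C\,\varepsilon^{2/3} k^{-5/3} \quad \text{(Kolmogorov inertial-range scaling)},
\qquad
E(k) \propto \varepsilon^{2/3} k^{-5/3-\delta}, \quad \delta > 0 \quad \text{(schematic small departure)}.
```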
Nov. 2011 Association for Computing Machinery, ACM Gordon Bell Prize, First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer
Nov. 2005 SC|05 International Conference for High Performance Computing, Networking and Storage (ACM/IEEE), Best Technical Paper Award, Full Electron Calculation Beyond 20,000 Atoms: Ground Electronic State of Photosynthetic Proteins
Nov. 2002 Association for Computing Machinery, ACM Gordon Bell Prize (Special), 16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator
May 2004 Information Processing Society of Japan, IPSJ Industrial Achievement Award, Development of the Earth Simulator, a Large-Scale Parallel Vector Supercomputer
Nov. 2002 Satoru Shingu, Yoshinori Tsuda, Wataru Ohfuchi, Kiyoshi Otsuka, Earth Simulator Center, Japan Marine Science and Technology Center; Hiroshi Takahara, Takashi Hagiwara, Shin-ichi Habata, NEC Corporation; Hiromitsu Fuchigami, Masayuki Yamada, Yuji Sasaki, Kazuo Kobayashi, NEC Informatec Systems; Mitsuo Yokokawa, National Institute of Advanced Industrial Science and Technology; Hiroyuki Itoh, ACM Gordon Bell Prize (Special award for language), 14.9 Tflop/s Three-dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator
Nov. 2002 Association for Computing Machinery, ACM Gordon Bell Prize (Peak performance), A 26.58 Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator
In book
[Refereed]
Scientific journal
[Refereed]
International conference proceedings
To evaluate the aerodynamic instability of buildings considering their high-order oscillation modes and torsional oscillation mode, for which the usual wind tunnel experiments are difficult to perform, a fluid-structure interaction (FSI) analysis code combining a multi-degree-of-freedom structure model and large-eddy simulation was developed. The calculation results obtained in this study were compared with the results of wind tunnel tests conducted in a previous study using a building model (average breadth 0.14 m x length 0.28 m x height 1 m; scale: 1/600) with a multi-degree-of-freedom structure. The comparison showed that the FSI analysis results corresponded well with those of the wind tunnel tests: for each spectrum of response displacement (along-wind, across-wind, and torsional directions) at the top mass node, the frequency distribution and the peak frequencies of the first and second oscillation modes agreed well between the calculation and the wind tunnel tests. As for the amplitude of the top displacement of the building, the results for both the along- and across-wind directions showed good correspondence. For the torsional direction, the calculation reproduced a torsional flutter oscillation at a slightly lower wind speed than that observed in the wind tunnel experiments.
Elsevier BV, Feb. 2020, Journal of Wind Engineering and Industrial Aerodynamics, 197, 104052 - 104052, English[Refereed]
Scientific journal
[Refereed]
In Japan, where earthquakes are frequent, buildings are required to have high seismic resistance. To examine the seismic performance of buildings, numerical simulations of the seismic response of the building and the ground are performed. The simulation code treated in this study discretizes the building and the ground with a three-dimensional finite element method and applies the average acceleration method to the equations of motion formulated at each node. The resulting system of linear equations is solved with the PSCCG method, a kind of preconditioned conjugate gradient method. In this study, we applied process parallelism to the PSCCG solver, which accounts for most of the simulation's execution time. Fixing the number of threads to one and increasing the number of processes, we evaluated the process-parallel implementation. As a result, taking the execution time with two processes as the baseline, a speedup of up to 3.2 times was obtained with eight processes.
Information Processing Society of Japan, Dec. 2018, 情報処理学会第167回ハイパフォーマンスコンピューティング研究会, 2018-HPC-167 (28), 1 - 5, JapaneseSymposium
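The PSCCG preconditioner used in the entry above is not spelled out here; as a hedged illustration of the preconditioned conjugate gradient family it belongs to, the following sketch runs a generic PCG with a simple Jacobi (diagonal) preconditioner on a toy symmetric positive definite system. The matrix, problem size, and tolerance are illustrative assumptions only.

```python
import numpy as np

def preconditioned_cg(A, b, M_inv, tol=1e-8, max_iter=1000):
    """Generic preconditioned conjugate gradient for a SPD matrix A.

    A     : (n, n) symmetric positive definite matrix (dense here for brevity)
    b     : (n,) right-hand side
    M_inv : callable applying the preconditioner inverse, z = M^{-1} r
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy SPD system with a Jacobi (diagonal) preconditioner as a stand-in.
n = 200
A = (np.diag(np.full(n, 4.0))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))
b = np.ones(n)
d = np.diag(A)
x = preconditioned_cg(A, b, lambda r: r / d)
print("residual norm:", np.linalg.norm(A @ x - b))
```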
Many problems in numerical simulation reduce to solving systems of linear equations obtained by discretizing partial differential equations, and in many cases the time spent solving these systems accounts for most of the total simulation time. Solving linear systems quickly is therefore very important. This study deals with the Cholesky factorization, which is applicable to symmetric positive definite matrices. Several methods exist for Cholesky factorization of sparse matrices; in this paper we use the relaxed supernodal multifrontal method. In this method, the relaxation parameter, the upper limit on the number of zero entries that are treated as nonzeros when two supernodes are merged, strongly affects performance. Aiming to determine the optimal value of this parameter, on one core each of an Intel Xeon (Ivy Bridge-EX) and an Intel Xeon Phi (Knights Landing, KNL) ...
Information Processing Society of Japan, Dec. 2018, 情報処理学会第167回ハイパフォーマンスコンピューティング研究会, 2018-HPC-167 (28), 1 - 8, JapaneseSymposium
Computational life science is an interdisciplinary research field, advancing rapidly in recent years, that fuses computational science with medicine, agriculture, engineering, and the natural sciences. As the research is expected to spread to various fields and to industry, there is demand for opportunities to acquire comprehensive basic knowledge. The Center for Computational Science Education at Kobe University, in cooperation with related institutions, has delivered the remote interactive lecture series "Fundamentals of Computational Life Science" nationwide since 2014, and last year accepted 600 registrations. This paper reports on "Fundamentals of Computational Life Science IV" held in FY2017 and on a deep learning tutorial held as a special edition focusing on AI and deep learning, which have recently attracted attention. The number of participants continues to grow year by year, and the series has been highly rated in questionnaires.
Nov. 2018, 大学ICT推進協議会2018年度年次大会論文集, 1 - 4, JapaneseSymposium
In direct numerical simulation (DNS) of homogeneous isotropic turbulence by the Fourier spectral method, most of the computation time is spent on three-dimensional discrete Fourier transforms (3D-FFTs). Although high performance through parallelization is expected, the all-to-all communication of a parallel 3D-FFT is not well suited to the network topologies of recent supercomputers, so much of the computation time becomes communication time, and the large-scale, high-Reynolds-number turbulence DNS needed to clarify the universal statistical laws of turbulence becomes practically difficult. This study aims to shorten the computation time of turbulence DNS by reducing the number of 3D-FFT invocations, using various time-integration schemes as alternatives to the fourth-order Runge-Kutta scheme commonly used in DNS. In this paper, we evaluate the statistical quantities of the turbulence fields obtained with these schemes in order to assess the schemes' ...
Information Processing Society of Japan, Sep. 2018, 情報処理学会第166回ハイパフォーマンスコンピューティング研究会, 2018-HPC-166 (8), 1 - 7, JapaneseSymposium
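As a rough, hedged illustration of why changing the time-integration scheme reduces 3D-FFT work in a pseudo-spectral DNS: the number of transforms per time step scales with the number of right-hand-side (nonlinear term) evaluations per step. The per-evaluation FFT count below is an assumed placeholder, not a figure from the entry above.

```python
# Hedged back-of-the-envelope: in a typical pseudo-spectral Navier-Stokes step,
# each evaluation of the nonlinear term costs several 3D-FFTs; the exact number
# depends on the formulation, and 9 transforms per evaluation is assumed here
# purely for illustration.
ffts_per_rhs = 9                      # assumed cost of one nonlinear-term evaluation
stages = {"RK4": 4, "RK3": 3, "RK2": 2}

for name, s in stages.items():
    print(f"{name}: {s * ffts_per_rhs} 3D-FFTs per time step")
# Fewer stages per step means fewer all-to-all-heavy 3D-FFTs, which is the saving
# the entry above investigates, at the cost of accuracy/stability trade-offs.
```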
For parallel processing of the three-dimensional fast Fourier transform (FFT), we derived communication costs from a communication-cost model for one-axis decomposition, two-axis decomposition, and two variants of three-axis decomposition with different communication methods, and determined which decomposition is optimal. We also evaluated the performance of 3D FFTs with the different decompositions on the K computer and Oakforest-PACS.
Information Processing Society of Japan, Mar. 2018, 情報処理学会第162回ハイパフォーマンスコンピューティング研究会, 2017-HPC-163 (29), 1 - 7, JapaneseSymposium
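A minimal communication-volume model, under the assumptions stated in the comments (it is not the cost model of the entry above), can convey the slab-versus-pencil trade-off that the entry quantifies.

```python
# Toy model (illustrative assumptions): each process ships essentially its whole
# local share of the N^3 grid in every all-to-all transpose phase. A slab
# (1-axis) decomposition needs one global transpose; a pencil (2-axis)
# decomposition needs two transposes, each within a row/column of processes.
def words_sent_per_process(N, P, decomposition):
    local = N**3 / P                      # grid points owned by one process
    if decomposition == "slab":           # 1-axis split (requires P <= N)
        return local                      # one global all-to-all transpose
    if decomposition == "pencil":         # 2-axis split on a sqrt(P) x sqrt(P) grid
        return 2 * local                  # two smaller all-to-all transposes
    raise ValueError(decomposition)

N, P = 2048, 128
for d in ("slab", "pencil"):
    print(f"{d:6s}: {words_sent_per_process(N, P, d):.3e} grid points per process")
# The real trade-off also depends on message counts, network topology, and the
# fact that a slab split cannot use more than N processes, which is why the
# entry above compares measured performance on the K computer and Oakforest-PACS.
```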
In numerical libraries for sparse matrix operations, there are many tuning parameters related to implementation selection. Selection of different tuning parameters could result in totally different performance. Moreover, the optimal implementation depends on the sparse matrices to be operated on. It is difficult to find the optimal implementation without executing each implementation and thereby examining its performance on a given sparse matrix. In this study, we propose an implementation selection method for sparse iterative algorithms and preconditioners in a numerical library using deep learning. The proposed method uses full-color images to represent the features of a sparse matrix. We present an image generation method for partitioning a given matrix (to generate its feature image) so that the value of each matrix element is considered in the implementation selection. We then evaluate the effectiveness of the proposed method by conducting a numerical experiment. In this experiment, the accuracy of implementation selection is evaluated. The training data comprise pairs of a sparse matrix and its optimal implementation. The optimal implementation of each sparse matrix in the training data is obtained in advance by executing every implementation and selecting the best one. The experimental results obtained using the proposed method show that the accuracy of selecting the optimal implementation of each sparse matrix is 79.5%.
IEEE, 2018, 2018 SIXTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING WORKSHOPS (CANDARW 2018), 2018, 257 - 262, English[Refereed]
International conference proceedings
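The exact image encoding used in the entry above is not reproduced here; the following is a hedged sketch of the general idea of turning a sparse matrix into a small multi-channel "feature image" by tiling it into blocks. The channel definitions (nonzero count, mean magnitude, and maximum magnitude per block) are chosen purely for illustration.

```python
import numpy as np
import scipy.sparse as sp

def matrix_to_feature_image(A, size=64):
    """Encode a sparse matrix as a size x size RGB array by tiling it into blocks.
    Channel choices here are illustrative assumptions, not the paper's encoding."""
    A = sp.coo_matrix(A)
    n = max(A.shape)
    img = np.zeros((size, size, 3))
    rows = (A.row * size // n).astype(int)
    cols = (A.col * size // n).astype(int)
    vals = np.abs(A.data)
    for r, c, v in zip(rows, cols, vals):
        img[r, c, 0] += 1.0                       # nonzero count in the block
        img[r, c, 1] += v                         # accumulate magnitudes
        img[r, c, 2] = max(img[r, c, 2], v)       # maximum magnitude in the block
    counts = np.maximum(img[:, :, 0], 1.0)
    img[:, :, 1] /= counts                        # mean magnitude per block
    peak = img[:, :, 0].max()
    img[:, :, 0] /= peak if peak > 0 else 1.0     # normalize the count channel
    return img

A = sp.random(1000, 1000, density=0.01, format="coo", random_state=0)
print(matrix_to_feature_image(A).shape)           # (64, 64, 3), ready for a small CNN
```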
A new SX-Aurora TSUBASA vector supercomputer has been released, and it features a new system architecture and a new execution model to achieve high sustained performance, especially for memory-intensive applications. In SX-Aurora TSUBASA, the vector host (VH) of a standard x86 Linux node is attached to the vector engine (VE) of the newly developed vector processor. An application is executed on the VE, and only system calls are offloaded to the VH. This new execution model can avoid redundant data transfers between the VH and VE that can easily become a bottleneck in the conventional execution model. This paper examines the potential of SX-Aurora TSUBASA. First, the basic performance is clarified by evaluating benchmark programs. Then, the effectiveness of the new execution model is examined by using a microbenchmark. Finally, the potential of SX-Aurora TSUBASA is clarified through evaluations of practical applications.
ASSOC COMPUTING MACHINERY, 2018, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE, AND ANALYSIS (SC'18), 2018, 1 - 12, English[Refereed]
International conference proceedings
In order to evaluate the high-order oscillation modes and the torsional oscillation mode of buildings, for which it is difficult to perform normal wind tunnel experiments, the authors have developed an FSI analysis code combining a multi-degree-of-freedom model and LES, and have compared the FSI analysis results with those of wind tunnel tests conducted in a previous study using a
日本流体力学会, Dec. 2017, 第31回数値流体力学シンポジウム講演論文集, 2017, 1 - 5, JapaneseSymposium
Taking the two-dimensional Poisson equation as a model problem for the time-dependent Schrödinger equation, we report performance evaluation results for several sparse-matrix storage formats in a parallel direct solver for linear systems whose coefficient matrices are symmetric positive definite sparse matrices ordered by one-way dissection. We also propose a new skyline storage format and confirm its effectiveness.
Information Processing Society of Japan, Dec. 2017, 情報処理学会第162回ハイパフォーマンスコンピューティング研究会, 2017-HPC-162 (19), 1 - 10, JapaneseSymposium
With a view toward 3D FFT applications on the second-generation Intel Xeon Phi, Knights Landing (KNL), which has attracted attention in recent years, we first investigate the memory performance of KNL using basic kernel codes such as the STREAM benchmark, a C port of FFTE, and FFTW. We then analyze the parallel performance, in particular the scalability, of 3D-FFT processing on KNL.
Information Processing Society of Japan, Sep. 2017, 情報処理学会第161回ハイパフォーマンスコンピューティング研究会, 2017-HPC-161 (16), 1 - 7, JapaneseSymposium
Focusing on the reverse Cuthill-McKee (RCM) method, we proposed the multiple-initial-point RCM (MIP-RCM) method, which starts the initial coloring from multiple nodes, as an improvement to reduce parallelization overhead. Comparing the computation times of MIP-RCM and the conventional RCM for the conjugate residual method with symmetric Gauss-Seidel preconditioning, MIP-RCM was faster on eight problems. In particular, in a comparison of the iteration times in thread-parallel computation for the thermal2 problem, MIP-RCM with 12 threads achieved a reduction of about 2.93 times over the shortest RCM time (with 2 threads), confirming the effectiveness of the proposed method.
Information Processing Society of Japan, Apr. 2017, 情報処理学会第159回ハイパフォーマンスコンピューティング研究会, 2017-HPC-159 (3), 1 - 6, JapaneseSymposium
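The MIP-RCM variant proposed above is not available in standard libraries, but plain RCM is; the sketch below uses SciPy's reverse_cuthill_mckee on a random symmetric pattern simply to show the bandwidth reduction that RCM-type orderings provide.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Standard (serial) RCM reordering as a point of reference; the MIP-RCM variant
# proposed above is not in SciPy, so this only illustrates what RCM itself does
# to the matrix bandwidth.
def bandwidth(A):
    A = sp.coo_matrix(A)
    return int(np.abs(A.row - A.col).max())

A = sp.random(500, 500, density=0.01, random_state=1)
A = (A + A.T).tocsr()                         # make the sparsity pattern symmetric

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm, :][:, perm]                   # apply the permutation to rows and columns

print("bandwidth before:   ", bandwidth(A))
print("bandwidth after RCM:", bandwidth(A_rcm))
```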
In recent years, as HPC systems have grown in scale, simulation results have also become enormous. Visualization is often used as a means of efficiently analyzing these huge results, and systems equipped with dedicated visualization hardware are frequently used. In that case, data must be exchanged with the system on which the simulation was performed. If both servers are installed at the same site, this can be handled by shared storage, but when using systems installed at different sites, the data must be transferred over the network, and fast data transfer is required. We directly connected the K computer and the visualization compute server "π-VizStudio", installed at the adjacent computational science center of Kobe University, with a dedicated network, and report the results of a data-transfer performance evaluation.
Information Processing Society of Japan, Mar. 2017, 情報処理学会第158回ハイパフォーマンスコンピューティング研究会, 2017-HPC-158 (14), 1 - 6, JapaneseSymposium
Symposium
As CPUs become many-core and the parallelism of supercomputers increases, domain decomposition in space alone is becoming insufficient as a source of parallelism. Time-parallel computation methods have therefore attracted attention in recent years. In this study we applied the Parareal method, one of the useful such solvers, to a diffusion problem as an example of a parabolic partial differential equation, and investigated the behavior of the parallel speedup. In an evaluation on the K computer, with 100-way parallelism in the time direction we obtained a measured speedup of 8.5 times against a maximum of 14 times estimated from a speedup model. We also evaluated large-scale parallel performance combining spatial parallelism by domain decomposition with time parallelism by the Parareal method. When 64-way spatial parallelism was run on 64 nodes, a 13.5 times speedup over the sequential computation was obtained, and furthermore 100-way parallelism by the Parareal method ...
Information Processing Society of Japan, Dec. 2016, The 157th IPSJ SIGHPC Meeting, 2016-HPC-152 (19), 1 - 7, JapaneseSymposium
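A minimal Parareal sketch, assuming a scalar decay equation and backward-Euler coarse and fine propagators (all choices illustrative and unrelated to the diffusion problem and node counts above), shows the structure of the method: one sequential coarse sweep plus corrections whose fine solves are independent across time slices.

```python
import numpy as np

# Minimal Parareal sketch for du/dt = -lam * u on [0, T]. The coarse propagator
# G takes one backward-Euler step per time slice; the fine propagator F takes
# many small backward-Euler steps. Problem, step counts, and iteration count
# are illustrative assumptions.
lam, T, N = 2.0, 1.0, 10            # decay rate, horizon, number of time slices
dt = T / N

def F(u, nsub=100):                 # fine propagator over one slice
    h = dt / nsub
    for _ in range(nsub):
        u = u / (1 + lam * h)
    return u

def G(u):                           # coarse propagator over one slice
    return u / (1 + lam * dt)

U = np.zeros(N + 1)
U[0] = 1.0
for n in range(N):                  # initial coarse sweep
    U[n + 1] = G(U[n])

for k in range(5):                  # Parareal corrections (F calls are parallel over slices)
    F_vals = np.array([F(U[n]) for n in range(N)])
    G_old = np.array([G(U[n]) for n in range(N)])
    for n in range(N):              # sequential coarse update with fine correction
        U[n + 1] = G(U[n]) + F_vals[n] - G_old[n]

print(U[-1], np.exp(-lam * T))      # Parareal result vs exact solution
```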
We developed an Eclipse plug-in tool named STView for visualizing the program structures of Fortran source codes to help improve the performance of the program on a supercomputer. To create a tree that represents program structures, STView uses an abstract syntax tree (AST) generated by Photran and filters the tree because the AST has many nontrivial nodes for tokens. While ord
ACM, Nov. 2016, The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), 1 - 2, Japanese[Refereed]
International conference proceedings
Symposium
Tuning techniques that raise the sustained performance of programs on a single processor are very important in high performance computing on modern supercomputers. Therefore, prediction of the marginal performance of programs is a big concern in knowing to what extent programs can be tuned. The roofline model can estimate the marginal performance of programs well if the performance is limited by effective memory bandwidth. The model, however, cannot be applied to performance prediction when L2 cache accesses increase. In this paper, we propose a new prediction model of the marginal performance that can be applied when accesses to the L2 cache increase. It is found that the new model works well for marginal performance prediction by applying it to actual programs on the K computer and other systems.
Information Processing Society of Japan, 14 Jul. 2016, 情報処理学会論文誌コンピューティングシステム(ACS), 9 (2), 1 - 14, Japanese[Refereed]
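The starting point of the entry above is the classic roofline bound; a tiny sketch of that bound is given below, with rough illustrative per-node numbers (the 128 GFLOPS peak is stated elsewhere in this list; the memory bandwidth value is an assumption), before the paper's cache-aware extension.

```python
# Classic roofline bound: attainable performance is limited by either the peak
# FLOP rate or memory bandwidth times operational intensity. Hardware numbers
# below are rough illustrative values for one K computer node, not
# authoritative specifications of the extended model in the entry above.
def roofline_gflops(operational_intensity, peak_gflops=128.0, mem_bw_gbs=64.0):
    """operational_intensity in Flop/Byte; returns attainable GFLOP/s."""
    return min(peak_gflops, mem_bw_gbs * operational_intensity)

for oi in (0.25, 1.0, 2.0, 8.0):
    print(f"OI = {oi:4.2f} Flop/Byte -> {roofline_gflops(oi):6.1f} GFLOP/s")
# The paper's extension additionally accounts for L2 cache traffic, where this
# simple memory-only bound over-predicts measured performance.
```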
For evaluating the prediction accuracy of wind pressure on a high-rise building in an urban area using large-eddy simulation (LES), the results of LES calculation and wind tunnel experiment for a high-rise building with a complex surface shape comprising inner balcony and corner cut were compared. Approximately 140 million calculation meshes on the K computer, the fourth fastes
Northern University, Jun. 2016, Proceedings of the 8th International Colloquium on Bluff Body Aerodynamics and Applications (BBAA VIII), 1 - 10, English[Refereed]
Symposium
Ensemble computing, which is an instance of capacity computing, is an effective computing scenario for exascale parallel supercomputers. In ensemble computing, there are multiple linear systems associated with a common coefficient matrix. We improve the performance of iterative solvers for multiple vectors by solving them at the same time, that is, by solving for the product of the matrices. We implemented several iterative methods and compared their performance. The maximum performance on Sparc VIIIfx was 7.6 times higher than that of a naive implementation. Finally, to deal with the different convergence processes of linear systems, we introduced a control method to eliminate the calculation of already converged vectors.
TAYLOR & FRANCIS LTD, 2016, International Journal of Computational Fluid Dynamics, 30 (6), 395 - 401, English[Refereed]
Scientific journal
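As a hedged illustration of the "one matrix, many right-hand sides" idea in the entry above, the sketch below applies a simple block Jacobi iteration to a block of vectors so the coefficient matrix is streamed once per sweep, and drops already-converged columns, loosely mirroring the convergence-control idea. The solver and matrix are illustrative, not those of the paper.

```python
import numpy as np

def jacobi_block(A, B, sweeps=200, tol=1e-10):
    """Jacobi iteration on a block of right-hand sides B (one column per system)."""
    D = np.diag(A)
    R = A - np.diag(D)
    X = np.zeros_like(B)
    active = np.ones(B.shape[1], dtype=bool)          # columns not yet converged
    for _ in range(sweeps):
        X[:, active] = (B[:, active] - R @ X[:, active]) / D[:, None]
        res = np.linalg.norm(A @ X - B, axis=0) / np.linalg.norm(B, axis=0)
        active = res > tol                            # stop updating converged columns
        if not active.any():
            break
    return X

n, m = 300, 8                                          # one matrix, 8 right-hand sides
A = (np.diag(np.full(n, 4.0))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))
B = np.random.default_rng(0).random((n, m))
X = jacobi_block(A, B)
print("block residual norm:", np.linalg.norm(A @ X - B))
```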
A study is made about the energy spectrum E(k) of turbulence on the basis of high-resolution direct numerical simulations (DNSs) of forced incompressible turbulence in a periodic box using a Fourier spectral method with the number of grid points and the Taylor scale Reynolds number R-lambda up to 12288^3 and approximately 2300, respectively. The DNS data show that there is a wave-number range (approximately 5 x 10^(-3) < k eta < 2 x 10^(-2)) in which E(k) fits approximately well to Kolmogorov's k^(-5/3) scaling, where eta is the Kolmogorov length scale. However, a close inspection shows that the exponent is a little smaller than -5/3, and E(k) in the range fits to E(k)/[
[Refereed]
Scientific journal
We report our operational experience of improving the energy efficiency of the power supply and cooling facilities of the K computer. By optimizing the number of active air handlers, the blowout temperature and the number of fans, the power consumption of the air handlers was reduced by approximately 40 %. We next considered improvements to the energy efficiency of the gas turbine power generators. After analyzing the long-term power generation profile, we found that the efficiency of the gas turbine power generators could be improved by more than 50 %. To increase the energy efficiency of the cooling towers, we considered a range of ways of improving the ventilation around the cooling towers and finally decided to remove a section of wall panels from the chiller building. Preliminary results suggest that the modification has had a positive effect on efficiency.
SPRINGER HEIDELBERG, 2016, Computer Science-Research and Development, 31 (4), 235 - 243, English[Refereed]
Scientific journal
In this report, we applied various iterative solvers to real problems and evaluated their performance. Numerical experiments on large symmetric sparse matrices with different condition numbers, arising in the seismic response analysis of the ground and large structures on it, showed that the effective iterative method differs depending on the matrix.
Information Processing Society of Japan, Dec. 2015, 152回ハイパフォーマンスコンピューティング研究会, 152, JapaneseSymposium
High vorticity regions obtained by the large-scale direct numerical simulations with up to 12288^3 grid points, of forced incompressible turbulence in a periodic box are visualized using the method developed for handling such large-scale data. The visualization shows that in high Reynolds number turbulence, strong micro-scale vortices are dense in clusters of various sizes up to
日本流体力学会, Dec. 2015, 第29回数値流体力学シンポジウム講演論文集, 2015, 1 - 5, JapaneseSymposium
Capacity computing is a promising scenario for improving performance on upcoming exascale supercomputers. Ensemble computing is an instance of it and has multiple linear systems associated with a common coefficient matrix. We reduce the load cost of the coefficient matrix by solving the systems at the same time, improving the performance of several iterative solvers. The maximum performance o
日本流体力学会, Dec. 2015, 第29回数値流体力学シンポジウム講演論文集, 2015, 1 - 5, JapaneseSymposium
To evaluate the prediction accuracy of wind pressure on a high-rise building in an urban area using large-eddy simulation (LES), LES calculations and wind tunnel experiments were compared for a high-rise building with a complex surface shape consisting of inner balconies and corner cuts. As a result of resolving the complex surface shape, the complex flow feature inside th
日本流体力学会, Dec. 2015, 第29回数値流体力学シンポジウム講演論文集, 2015, 1 - 5, JapaneseSymposium
To estimate the performance limit of programs, the roofline model has been proposed, parameterized by the processor's peak performance, memory bandwidth, and operational intensity (Flop/Byte). The estimates of the roofline model agree well with measured performance for memory-bound programs, but as cache accesses increase, the estimated and measured performance diverge. In this report, we propose a method for estimating sustained performance based on the coding of kernel programs in which cache accesses increase. Evaluating the sustained performance of several kernel loops on the K computer, we show that the proposed method can be applied to sustained-performance estimation.
Information Processing Society of Japan, Aug. 2015, Forum on Information Technology 2015, 2015, 13 - 19, Japanese[Refereed]
Research society
Before tuning or parallelizing a program, it is necessary to understand its structure. In particular, if the structure of the loops, branches, and procedure calls in a program can be understood, grasping the structure of the whole program becomes easy. In this study, aiming to improve the efficiency of understanding the structure of programs written in FORTRAN 77 and Fortran 90, we built a support tool (STView) on the free integrated development environment Eclipse that extracts loops, branches, and procedure calls from a program and visualizes them as a tree.
Information Processing Society of Japan, 12 May 2015, ハイパフォーマンスコンピューティングと計算科学シンポジウム2015 論文集, 2015 (2015), 94 - 94, Japanese[Refereed]
We propose a new tridiagonalization method for real symmetric matrices (the D-W method), which is derived from Dongarra's blocked method and Wilkinson's techniques. Dongarra's blocked method can speed up tridiagonalization by replacing half of the operations with matrix-matrix multiplications, but the remaining half (matrix-vector multiplications) remains, and that restricts the perfo
Information Processing Society of Japan, May 2015, ハイパフォーマンスコンピューティングと計算科学シンポジウム2015 論文集, 2015, 19 - 28, Japanese[Refereed]
Research society
Statistics on the motion of small heavy (inertial) particles in turbulent flows with a high Reynolds number are physically fundamental to understanding realistic turbulent diffusion phenomena. An accurate parallel algorithm for tracking particles in large-scale direct numerical simulations (DNSs) of turbulence in a periodic box has been developed to extract accurate statistics on the motion of inertial particles. The tracking accuracy of the particle motion is known to primarily depend on the spatial resolution of the DNS for the turbulence and the accuracy of the interpolation scheme used to calculate the fluid velocity at the particle position. In this study, a DNS code based on the Fourier spectral method and two-dimensional domain decomposition method was developed and optimised for the K computer. An interpolation scheme based on cubic splines is implemented by solving tridiagonal matrix problems in parallel.
SPRINGER-VERLAG BERLIN, 2015, Parallel Computing Technologies (Pact 2015), 9251, 522 - 527, English[Refereed]
International conference proceedings
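The entry above implements cubic-spline interpolation via tridiagonal solves; the serial Thomas algorithm below is the basic building block (the paper's parallel tridiagonal solver and spline details are not reproduced, and the grid and data are toy assumptions).

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal, d = rhs."""
    n = len(b)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Natural cubic spline second derivatives on a uniform grid (toy example):
# h*M_{i-1} + 4h*M_i + h*M_{i+1} = 6*(y_{i+1} - 2*y_i + y_{i-1})/h at interior nodes.
n, h = 32, 1.0 / 32
y = np.sin(2 * np.pi * np.arange(n + 1) * h)
rhs = 6.0 * (y[2:] - 2 * y[1:-1] + y[:-2]) / h
a = np.full(n - 1, h); b = np.full(n - 1, 4 * h); c = np.full(n - 1, h)
m_interior = thomas(a, b, c, rhs)       # spline curvatures at interior nodes
print(m_interior[:3])
```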
The roofline model has been proposed in order to estimate the marginal performance of programs based on features of computer systems such as peak performance, memory bandwidth, and operational intensity. The performance estimated by the model is in good agreement with the measured performance in the case that programs access memory devices directly. However, a difference between the estimated and measured performance appears when cache accesses of the program increase. In this paper, we extend the roofline model to a new one that can be applied to performance estimation of programs in which many cache accesses occur. It is shown that the new model can estimate the sustained performance of various kernel loops on the K computer by comparison with measured performance.
Information Processing Society of Japan (IPSJ), 02 Dec. 2014, IPSJ SIG Notes, 2014 (30), 1 - 9, Japanese
To realize extremely large-scale direct numerical simulations (DNS) of homogeneous isotropic turbulence on the K computer, currently the fastest supercomputer in Japan, we ported and optimized for the K computer a DNS code for homogeneous isotropic turbulence based on the Fourier spectral method that was originally developed for the Earth Simulator. In porting, the data distribution was changed from the conventional one-dimensional decomposition to a two-dimensional decomposition, which is expected to allow more efficient all-to-all communication. As a result, we succeeded in running an extremely large DNS with up to 12288^3 grid points using 192×128 nodes of the K computer.
IPSJ, 02 Dec. 2014, IPSJ SIG Technical Report, 2014 (17), 1 - 5, Japanese
This study investigated the influence of the actual urban block area on the wind pressure prediction of a target high-rise building using large-eddy simulation (LES), and introduced the utilization of high-performance computing (HPC) for LES. First, four LES cases were carried out using 3072-way parallel calculation and compared with four wind tunnel experiments, respectively; (1) No-
Dec. 2014, 第28回数値流体力学シンポジウム講演論文集, 1 - 8, JapaneseSymposium
Silicon nanowires are potentially useful in next-generation field-effect transistors, and it is important to clarify the electron states of silicon nanowires to know the behavior of new devices. Computer simulations are promising tools for calculating electron states. Real-space density functional theory (RSDFT) code performs first-principles electronic structure calculations. To obtain higher performance, we applied various optimization techniques to the code: multi-level parallelization, load balance management, sub-mesh/torus allocation, and a message-passing interface library tuned for the K computer. We measured and evaluated the performance of the modified RSDFT code on the K computer. A 5.48 petaflops (PFLOPS) sustained performance was measured for an iteration of a self-consistent field calculation for a 107,292-atom Si nanowire simulation using 82,944 compute nodes, which is 51.67% of the K computer's peak performance of 10.62 PFLOPS. This scale of simulation enables analysis of the behavior of a silicon nanowire with a diameter of 10-20 nm.
SAGE PUBLICATIONS LTD, 2014, International Journal of High Performance Computing Applications, 28 (3), 335 - 355, English[Refereed]
Scientific journal
The K computer, released on September 29, 2012, is a large-scale parallel supercomputer system consisting of 82,944 compute nodes. We have been able to resolve a significant number of operation issues since its release. Some system software components have been fixed and improved to obtain higher stability and utilization. We achieved 94% service availability because of a low hardware failure rate and approximately 80% node utilization by careful adjustment of operation parameters. We found that the K computer is an extremely stable and high utilization system.
ELSEVIER SCIENCE BV, 2014, 2014 International Conference on Computational Science, 29, 576 - 585, English[Refereed]
International conference proceedings
We describe the configuration and evaluation of the K computer. The K computer is a 10 PFLOPS-class supercomputer intended to be used across a wide range of fields. Our design concepts were ① adoption of a general-purpose CPU architecture and realization of high single-CPU performance, ② dedicated development of a highly scalable interconnect, ③ introduction of technologies to counter the explosion of parallelism, and ⑤ realization of high reliability, flexible operability, and low power consumption; the system was completed in 2011. We developed the HPC-oriented CPU SPARC64 VIIIfx and the highly scalable Tofu interconnect specifically for this system, and implemented VISIMPACT as a technology against the explosion of parallelism. High reliability, flexible operability, and power saving were realized through cooling, the job manager, and other measures. The K computer was ranked first in the TOP500 in June and November 2011. In addition, for several applications, high sustained ...
The Institute of Electronics, Information and Communication Engineers, Oct. 2013, The IEICE transactions on information and systems (Japanese edition), 96 (10), 2118 - 2129, Japanese[Refereed]
Lattice QCD is a first-principles calculation that solves the dynamics between quarks and gluons based on the strong interaction. The calculation is performed on a four-dimensional space-time discretized into a lattice, and requires a huge number of inversions of the sparse matrix derived from the Wilson-Dirac equation. In this study, the Lattice QCD code LDDHMC uses domain decomposition HMC
Information Processing Society of Japan, 25 Sep. 2013, 情報処理学会論文誌コンピューティングシステム(ACS), 6 (3), 43 - 57, Japanese[Refereed]
This paper proposes the design of ultra scalable MPI collective communication for the K computer, which consists of 82,944 computing nodes and is the world's first system over 10 PFLOPS. The nodes are connected by a Tofu interconnect that introduces six dimensional mesh/torus topology. Existing MPI libraries, however, perform poorly on such a direct network system since they assume typical cluster environments. Thus, we design collective algorithms optimized for the K computer. On the design of the algorithms, we place importance on collision-freeness for long messages and low latency for short messages. The long-message algorithms use multiple RDMA network interfaces and consist of neighbor communication in order to gain high bandwidth and avoid message collisions. On the other hand, the short-message algorithms are designed to reduce software overhead, which comes from the number of relaying nodes. The evaluation results on up to 55,296 nodes of the K computer show the new implementation outperforms the existing one for long messages by a factor of 4 to 11 times. It also shows the short-message algorithms complement the long-message ones. © 2012 Springer-Verlag.
Springer-Verlag, May 2013, Computer Science - Research and Development, 28 (2-3), 147 - 155, English[Refereed]
Scientific journal
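The Tofu-specific algorithms of the entry above are not reproduced here; as a hedged illustration of the underlying idea that collectives built from nearest-neighbour exchanges avoid message collisions on a direct network, the sketch below implements a textbook ring allreduce with mpi4py and checks it against MPI.Allreduce. It could be run with, e.g., `mpiexec -n 4 python ring_allreduce.py` (file name assumed).

```python
from mpi4py import MPI
import numpy as np

# A generic ring allreduce built only from nearest-neighbour Sendrecv calls.
# This is NOT the Trinaryx3 / Tofu-specific algorithm from the paper above,
# just a textbook ring used to illustrate neighbour-only communication.
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.full(4, float(rank))           # each rank contributes its own data
acc = local.copy()                        # running sum
passing = local.copy()                    # block currently travelling the ring
buf = np.empty_like(local)

for _ in range(size - 1):
    comm.Sendrecv(passing, dest=(rank + 1) % size,
                  recvbuf=buf, source=(rank - 1) % size)
    passing = buf.copy()
    acc += passing

# Every rank now holds the sum over all ranks; compare with the library call.
ref = np.empty_like(local)
comm.Allreduce(local, ref, op=MPI.SUM)
assert np.allclose(acc, ref)
if rank == 0:
    print("ring allreduce matches MPI.Allreduce:", acc)
```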
For stable and user-friendly use of massively parallel computers, the fault tolerance of user-developed applications is important in addition to the fault tolerance of the system itself. We are studying a master-worker programming model equipped with fault tolerance on the K computer. This report describes the status of the following three items: (1) realization of fault tolerance by extending the K computer's job management system and MPI library, (2) realization of a master-worker programming model based on MPI dynamic process creation and remote procedure calls, and (3) implementation and evaluation of a fragment molecular orbital code based on this model.
Information Processing Society of Japan, Feb. 2013, IPSJ SIG Technical Report, 2013-HPC-138 (26), 1 - 6, JapaneseSymposium
Lattice QCD is a first-principles calculation that solves the dynamics between quarks and gluons based on the strong interaction. The calculation is performed on a four-dimensional space-time discretized into a lattice, and requires a huge number of inversions of the sparse matrix derived from the Wilson-Dirac equation. In this study, the Lattice QCD code LDDHMC uses domain decomposition HMC
Information Processing Society of Japan, 08 Jan. 2013, HPCS2013, 2013 (2013), 34 - 43, Japanese
Single-particle coherent X-ray diffraction imaging using an X-ray free-electron laser has the potential to reveal the three-dimensional structure of a biological supra-molecule at sub-nanometer resolution. In order to realise this method, it is necessary to analyze as many as 1 x 10^6 noisy X-ray diffraction patterns, each for an unknown random target orientation. To cope with the severe quantum noise, patterns need to be classified according to their similarities and similar patterns averaged to improve the signal-to-noise ratio. A high-speed scalable scheme has been developed to carry out classification on the K computer, a 10 PFLOPS supercomputer at RIKEN Advanced Institute for Computational Science. It is designed to work on a real-time basis with the experimental diffraction pattern collection at the X-ray free-electron laser facility SACLA so that the result of classification can be fed back for optimizing experimental parameters during the experiment. The present status of our effort developing the system and also a result of application to a set of simulated diffraction patterns are reported. About 1 x 10^6 diffraction patterns were successfully classified by running 255 separate 1 h jobs in 385-node mode.
WILEY-BLACKWELL, 2013, Journal of Synchrotron Radiation, 20 (6), 899 - 904, English[Refereed]
Scientific journal
This paper describes the design of high-performance MPI communication that enables high-performance communication with minimized memory usage on the 82,944-node K computer. The Tofu interconnect of the K computer uses a six-dimensional torus/mesh direct topology to realize higher performance and availability on a hundred-thousand-node-class system. However, in such an ultra-scale system, c
Information Processing Society of Japan, 09 May 2012, SACSIS2012, 2012 (2012), 237 - 244, Japanese
This paper reports a method of speeding up MPI collective communication on the K computer, which consists of 82,944 computing nodes connected by a 6D direct network named the Tofu interconnect. Existing MPI libraries, however, do not have topology-aware algorithms that perform well on such a direct network. Thus, an Allreduce collective algorithm named Trinaryx3 is designed and implemented in the MPI library for the K computer. The algorithm is optimized for a torus network and enables utilizing multiple RDMA engines, one of the strengths of the K computer. The evaluation results show the new implementation achieves five times higher bandwidth than the existing one.
Information Processing Society of Japan, 09 May 2012, 情報処理学会論文誌コンピューティングシステム(ACS), 5 (2012), 245 - 253, Japanese
Given that scientific computer programs are becoming larger and more complicated, high-performance application developers routinely examine the program structure of their source code to improve their performance. We have developed K-scope, a source code analysis tool that can be used to improve code performance. K-scope has a graphical user interface that visualizes the program structures of Fortran 90 and FORTRAN 77 source code and enables static data-flow analysis. To develop the tool, we adopted the filtered abstract syntax tree (filtered-AST) model with Java to visualize the program structure efficiently. Filtered-AST, which extends the AST in the structured programming model by abstract block structuring, is suitable for visualizing program structures. Based on this model, K-scope has been developed as an experimental implementation. It constructs filtered-AST objects from both source and intermediate code generated by the front-end of the XcalableMP compiler. We provide illustrations of the graphical user interface and give detailed examples of the tool applied to an actual application code. © 2012 IEEE.
2012, Proceedings of the International Conference on Parallel Processing Workshops, 434 - 443, English[Refereed]
International conference proceedings
RIKEN and Fujitsu have been working together to develop the K computer, with the aim of beginning shared use by the fall of 2012, as a part of the High-Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology (MEXT). Since the K computer involves over 80 000 compute nodes, building it with lower power consumption and high reliability was important from the availability point of view. This paper describes the K computer system and the measures taken for reducing power consumption and achieving high reliability and high availability. It also presents the results of implementing those measures.
FUJITSU LTD, 2012, Fujitsu Scientific & Technical Journal, 48 (3), 255 - 265, English[Refereed]
Scientific journal
In this paper, we propose an implementation of a parallel one-dimensional fast Fourier transform (FFT) on the K computer. The proposed algorithm is based on the six-step FFT algorithm, which can be altered into the recursive six-step FFT algorithm to reduce the number of cache misses. The recursive six-step FFT algorithm improves performance by utilizing the cache memory effectively. We use the recursive six-step FFT algorithm to implement the parallel one-dimensional FFT algorithm. The performance results of one-dimensional FFTs on the K computer are reported. We successfully achieved a performance of over 18 TFlops on 8192 nodes of the K computer (82944 nodes, 128 GFlops/node, 10.6 PFlops peak performance) for a 2^41-point FFT.
IEEE COMPUTER SOC, 2012, 2012 Ieee 14th International Conference on High Performance Computing and Communications & 2012 Ieee 9th International Conference on Embedded Software and Systems (Hpcc-Icess), 344 - 350, English[Refereed]
Scientific journal
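A compact NumPy sketch of the six-step FFT data flow referenced above (reshape, short FFTs, twiddle multiplication, transpose, short FFTs, transpose) is given below; it illustrates the algorithm only and says nothing about the recursive, cache-blocked K computer implementation.

```python
import numpy as np

def six_step_fft(x, n1, n2):
    """1-D FFT of length n1*n2 via the six-step algorithm:
    reshape -> row FFTs -> twiddle -> transpose -> row FFTs -> transpose.
    np.fft.fft performs the short row transforms; this is only a sketch of the
    data-movement pattern, not the K computer implementation."""
    assert x.size == n1 * n2
    a = x.reshape(n2, n1).T                          # step 1: a[j1, j2] = x[j1 + n1*j2]
    b = np.fft.fft(a, axis=1)                        # step 2: length-n2 FFTs
    j1 = np.arange(n1)[:, None]
    k2 = np.arange(n2)[None, :]
    b *= np.exp(-2j * np.pi * j1 * k2 / (n1 * n2))   # step 3: twiddle factors
    c = b.T                                          # step 4: transpose
    d = np.fft.fft(c, axis=1)                        # step 5: length-n1 FFTs
    return d.T.ravel()                               # step 6: transpose back

x = np.random.default_rng(0).random(1024) + 0j
assert np.allclose(six_step_fft(x, 32, 32), np.fft.fft(x))
print("six-step FFT matches np.fft.fft")
```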
The K computer is a distributed memory supercomputer system with 82,944 compute nodes and 5,184 I/O nodes that was jointly developed by RIKEN and Fujitsu as a Japanese national project. Its development began in 2006 and was completed in June 2012. By achieving a LINPACK performance of 10.51 peta-FLOPS, the K computer ranked first on two consecutive TOP500 lists, in June and November 2011. During its adjustment, part of the K computer was made available, with gradually increasing computing resources, to experts in computational science on a trial basis and was used for performance optimization of users' application codes.
IEEE, 2012, 2012 Third International Conference on Networking and Computing (Icnc 2012), 21 - 22, English[Refereed]
Scientific journal
[Refereed]
Scientific journal
Real space DFT (RSDFT) is a simulation technique most suitable for massively-parallel architectures to perform first-principles electronic-structure calculations based on density functional theory. We here report unprecedented simulations on the electron states of silicon nanowires with up to 107,292 atoms carried out during the initial performance evaluation phase of the K computer being developed at RIKEN. The RSDFT code has been parallelized and optimized so as to make effective use of the various capabilities of the K computer. Simulation results for the self-consistent electron states of a silicon nanowire with 10,000 atoms were obtained in a run lasting about 24 hours and using 6,144 cores of the K computer. A 3.08 peta-flops sustained performance was measured for one iteration of the SCF calculation in a 107,292-atom Si nanowire calculation using 442,368 cores, which is 43.63% of the peak performance of 7.07 peta-flops. Copyright 2011 ACM.
2011, Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, English[Refereed]
International conference proceedings
Advanced Institute for Computational Science (AICS) was created in July 2010 at RIKEN under the supervision of Japanese Ministry of Education, Culture, Sports, Science, and Technology (MEXT) in order to establish the national center of excellence (COE) for high-performance computing and to operate the 10 petaflops class supercomputer called "K", manufactured by Fujitsu, Ltd. This paper presents AICS in the context of the national high-performance computing infrastructure of Japan. Furthermore, some notable simulation research using the K computer will be briefly discussed. Copyright 2011 ACM.
2011, State of the Practice Reports, SC'11, English[Refereed]
International conference proceedings
The K computer is a distributed memory supercomputer system consisting of more than 80,000 compute nodes which is being developed by RIKEN as a Japanese national project. Its performance is aimed at achieving 10 peta-flops sustained in the LINPACK benchmark. The system is under installation and adjustment. The whole system will be operational in 2012. © 2011 IEEE.
2011, Proceedings of the International Symposium on Low Power Electronics and Design, 371 - 372, English[Refereed]
International conference proceedings
All electron calculations were performed on the photosynthetic reaction center of Blastochloris viridis, using the fragment molecular orbital (FMO) method. The protein complex of 20,581 atoms and 77,754 electrons was divided into 1398 fragments, and the two-body expansion of FMO/6-31G* was applied to calculate the ground state. The excited electronic states of the embedded electron transfer system were separately calculated by the configuration interaction singles approach with the multilayer FMO method. Despite the structural symmetry of the system, asymmetric excitation energies were observed, especially on the bacteriopheophytin molecules. The asymmetry was attributed to electrostatic interaction with the surrounding proteins, in which the cytoplasmic side plays a major role. (C) 2009 Wiley Periodicals, Inc. J Comput Chem 31: 447-454, 2010
WILEY, Jan. 2010, JOURNAL OF COMPUTATIONAL CHEMISTRY, 31 (2), 447 - 454, English[Refereed]
Scientific journal
[Refereed]
Scientific journal
A GridFMO application was developed by combining the fragment molecular orbital (FMO) method of GAMESS with Grid technology. With GridFMO, quantum calculations of macromolecules become possible by using a large amount of computational resources collected from many moderate-sized cluster computers. A new middleware suite was developed based on Ninf-G, whose fault tolerance and flexible resource management were found to be indispensable for long-term calculations. GridFMO was used to draw ab initio potential energy curves of a protein motor system with 16,664 atoms. For the calculations, 10 cluster computers around the Pacific rim were used, sharing the resources with other users via batch queue systems on each machine. A series of 14 GridFMO calculations were conducted over 70 days, coping with more than 100 problems that cropped up. The FMO curves were compared against molecular mechanics (MM), and it was confirmed that (1) the FMO method is capable of drawing smooth curves despite several cut-off approximations, and that (2) the MM method is reliable enough for molecular modeling.
IEEE, 2007, 2007 8th Ieee/acm International Conference on Grid Computing, 50 - +, English[Refereed]
Scientific journal
One-point statistics of velocity gradients and Eulerian and Lagrangian accelerations are studied by analysing the data from high-resolution direct numerical simulations (DNS) of turbulence in a periodic box, with up to 4096^3 grid points. The DNS consist of two series of runs; one is with k_max*eta ~ 1 (Series 1) and the other is with k_max*eta ~ 2 (Series 2), where k_max is the maximum wavenumber and eta the Kolmogorov length scale. The maximum Taylor-microscale Reynolds number R_lambda in Series 1 is about 1130, and it is about 675 in Series 2. Particular attention is paid to the possible Reynolds number (Re) dependence of the statistics. The visualization of the intense vorticity regions shows that the turbulence field at high Re consists of clusters of small intense vorticity regions, and their structure is to be distinguished from those of small eddies. The possible dependence on Re of the probability distribution functions of velocity gradients is analysed through the dependence on R_lambda of the skewness and flatness factors (S and F). The DNS data suggest that the R_lambda dependence of S and F of the longitudinal velocity gradients fits well with a simple power law: S ~ -0.32 R_lambda^0.11 and F ~ 1.14 R_lambda^0.34, in fairly good agreement with previous experimental data. They also suggest that all the fourth-order moments of velocity gradients scale with R_lambda similarly to each other at R_lambda > 100, in contrast to R_lambda < 100. Regarding the statistics of time derivatives, the second-order time derivatives of turbulent velocities are more intermittent than the first-order ones for both the Eulerian and Lagrangian velocities, and the Lagrangian time derivatives of turbulent velocities are more intermittent than the Eulerian time derivatives, as would be expected. The flatness factor of the Lagrangian acceleration is as large as 90 at R_lambda of approximately 430. The flatness factors of the Eulerian and Lagrangian accelerations increase with R_lambda approximately proportionally to R_lambda^(alpha_E) and R_lambda^(alpha_L), respectively, where alpha_E is approximately 0.5 and alpha_L approximately 1.0, while those of the second-order time derivatives of the Eulerian and Lagrangian velocities increase approximately proportionally to R_lambda^(beta_E) and R_lambda^(beta_L), respectively, where beta_E is approximately 1.5 and beta_L approximately 3.0.
CAMBRIDGE UNIV PRESS, 2007, Journal of Fluid Mechanics, 592, 335 - 366, English[Refereed]
Scientific journal
A full electron calculation for the photosynthetic reaction center of Rhodopseudomonas viridis was performed by using the fragment molecular orbital (FMO) method on a massive cluster computer. The target system contains 20,581 atoms and 77,754 electrons, which was divided into 1,398 fragments. According to the FMO prescription, the calculations of the fragments and pairs of the fragments were conducted to obtain the electronic state of the system. The calculation at the RHF/6-31G* level of theory took 72.5 hours with 600 CPUs. The CPUs were grouped into several workers, to which the calculations of the fragments were dispatched. An uneven CPU grouping, where two types of workers are generated, was shown to be efficient. © 2005 IEEE.
2005, Proceedings of the ACM/IEEE 2005 Supercomputing Conference, SC'05, 2005, English[Refereed]
International conference proceedings
This chapter discusses a direct numerical simulation (DNS) study of canonical turbulence. Incompressible turbulence obeying the Navier-Stokes (N-S) equation under periodic boundary conditions (BC) is widely regarded as one of the most canonical types of turbulence. It keeps the essence of the turbulence dynamics: (1) the nonlinear convection effect associated with the fluid motion, (2) dissipativity, and (3) mass conservation, which is equivalent to the incompressibility or the so-called solenoidal condition in incompressible fluid. Underlying the study of turbulence in such a simple geometry is the idea of the Kolmogorov hypotheses, according to which the small-scale statistics in fully developed turbulence at sufficiently high Reynolds number Re are universal and insensitive to the details of the large-scale conditions. The DNS of incompressible homogeneous turbulence was performed under periodic boundary conditions with the number of grid points up to 1024^3 on the VPP5000 system at the Information Technology Center, Nagoya University, and DNS with up to 4096^3 grid points on the Earth Simulator (ES). The DNS is based on a spectral method free from alias error. A sustained performance of 16.4 Tflops was achieved in the DNS with 2048^3 grid points and double precision arithmetic on the ES. © 1996 Elsevier B.V. All rights reserved.
Elsevier B.V., 2005, Parallel Computational Fluid Dynamics: Multidisciplinary Applications, 23 - 32, English[Refereed]
Scientific journal
The Earth Simulator (ES), developed under the Japanese government's initiative "Earth Simulator project", is a highly parallel vector supercomputer system. In this paper, an overview of ES, its architectural features, hardware technology and the result of performance evaluation are described. In May 2002, the ES was acknowledged to be the most powerful computer in the world: 35.86 teraflop/s for the LINPACK HPC benchmark and 26.58 teraflop/s for an atmospheric general circulation code (AFES). Such a remarkable performance may be attributed to the following three architectural features vector processor, shared-memory and high-bandwidth non-blocking interconnection crossbar network. The ES consists of 640 processor nodes (PN) and an interconnection network (IN), which are housed in 320 PN cabinets and 65 IN cabinets. The ES is installed in a specially designed building, 65m long, 50m wide and 17m high. In order to accomplish this advanced system, many kinds of hardware technologies have been developed, such as a high-density and high-frequency LSI, a high-frequency signal transmission, a high-density packaging, and a high-efficiency cooling and power supply system with low noise so as to reduce whole volume of the ES and total power consumption. For highly parallel processing, a special synchronization means connecting all nodes, Global Barrier Counter (GBC), has been introduced. © 2004 Elsevier B.V. All rights reserved.
2004, Parallel Computing, 30 (12), 1287 - 1313, English[Refereed]
Scientific journal
[Refereed]
Scientific journal
[Refereed]
Scientific journal
The Earth Simulator (ES) is an ultra high-speed supercomputer. The research and development of the ES was initiated in 1997 as one of the goals in the Earth Simulator project aiming at the promotion of research for understanding and prediction of global environmental changes. The ES is a parallel computer system of the distributed-memory type, and consists of 640 processor nodes connected by 640 × 640 single-stage crossbar switches. Each processor node is a shared memory system composed of eight vector processors. The total peak performance and main memory capacity are 40 Tflop/s and 10 TB, respectively. The LSI technology of 0.15 µm CMOS has been adopted for its one-chip vector processor. Its development was successfully completed in February 2002, achieving a remarkable sustained performance of 35.86 Tflop/s (a ratio of 88% to the peak) in the Linpack benchmark program.
Elsevier Inc., 2003, Parallel Computational Fluid Dynamics: New Frontiers and Multi-Disciplinary Applications, Proceedings, 131 - 138, English[Refereed]
Scientific journal
An atmospheric general circulation model (AGCM) for climate studies was developed for the Earth Simulator (ES). The model, called AFES, is based on the CCSR/NIES AGCM and is a global three-dimensional hydrostatic model using the spectral transform method. AFES is optimized for the architecture of the ES. We achieved high sustained performance by executing AFES with T1279L96 resolution on the ES. A performance of 26.58 Tflops was achieved in the execution of the main time-step loop using all 5120 processors (640 nodes) of the ES. This performance corresponds to 64.9% of the theoretical peak performance of 40.96 Tflops. The T1279 resolution, equivalent to about 10 km grid intervals at the equator, is very close to the highest resolution at which the hydrostatic approximation is valid. To the best of our knowledge, no other model simulation of the global atmosphere has ever been performed with such super-high resolution. Currently, such a simulation is possible only on the ES with AFES. In this paper we describe the optimization method, computational performance, and calculated results of the test runs.
Elsevier Inc., 2003, Parallel Computational Fluid Dynamics: New Frontiers and Multi-Disciplinary Applications, Proceedings, 79 - 86, English[Refereed]
Scientific journal
[Refereed]
Scientific journal
Parallel programming is essential for large-scale scientific simulations, and MPI is intended to be a de facto standard API for this kind of programming. Since MPI has several functions that exhibit similar behaviors, programmers often have difficulty in choosing the appropriate function for their programs. An MPI benchmark program library named MBL has been developed for gathering performance data for various MPI functions. It measures the performance of MPI-1 and MPI-2 functions under several communication patterns. MBL has been applied to the measurement of MPI performance in the Earth Simulator. It is confirmed that a maximum throughput of 11.7GB/s is obtained in inter-node communications in the Earth Simulator. © 2002 Springer Berlin Heidelberg.
2002, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2327, 219 - 230, English[Refereed]
International conference proceedings
The Earth Simulator is an ultra high-speed supercomputer. The research and development of the Earth Simulator was initiated in 1997 as one of the approaches in the Earth Simulator project which aims at promotion of research and development for understanding and prediction of global environmental changes. The Earth Simulator is a distributed memory parallel system which consists of 640 processor nodes connected by a single-stage full crossbar switch. Each processor node is a shared memory system composed of eight vector processors. The total peak performance and main memory capacity are 40Tflop/s and 10TB, respectively. In this paper, a concept of the Earth Simulator and an outline of the Earth Simulator system are described.
IEEE Computer Society, 2000, Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems, 2001-, 93 - 99, English[Refereed]
International conference proceedings
[Refereed]
Scientific journal
The Earth Simulator is an ultra high-speed supercomputer. The research and development of the Earth Simulator started in 1997 as one of the approaches in the Earth Simulator project which aims at the promotion of research and development for understanding and prediction for global environment change. Conceptual design and basic design of the Earth Simulator have been finished so far. According to the design, the Earth Simulator is a distributed memory parallel system which consists of 640 processor nodes connected by an internode crossbar switch. Each processor node is a shared memory system composed of eight vector processors. The total peak performance and main memory capacity are 40Tflop/s and 10TB, respectively. In this paper, the concepts of the Earth Simulator system and the outline of the basic design are presented.
Springer Verlag, 1999, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1615, 269 - 280, English[Refereed]
International conference proceedings
[Refereed]
Scientific journal
In response to the preceding Comment by Garcia, Baras, and Mansour [Phys. Rev. E 51, 3784 (1995)], we evaluate the Rayleigh number by taking the temperature jump at the wall into consideration. It is shown that a good agreement between the direct simulation Monte Carlo results and the linear stability theory is obtained by using the diffuse boundary condition, while there is a slight discrepancy in the case of the semislip boundary condition. © 1995 The American Physical Society.
1995, Physical Review E, 51 (4), 3786 - 3787, English[Refereed]
Scientific journal
[Refereed]
Scientific journal
The transition between heat conduction and convection in the two-dimensional Rayleigh-Bénard system is simulated using the direct simulation Monte Carlo method. Long-range correlations of temperature fluctuations are found to grow in the transition. © 1995 The American Physical Society.
1995, Physical Review E, 52 (2), 1601 - 1605[Refereed]
Scientific journal
The Rayleigh-Bénard instability, in which a transition from heat conduction to convective heat transfer occurs, was investigated by the direct simulation Monte Carlo method. The governing equations and the computational method are described in detail, and it is shown that the critical Rayleigh number obtained from molecular-level computation agrees with the value derived from the linear stability theory of the macroscopic hydrodynamic equations. Furthermore, it is shown that even when the flow field appears to be in the conduction state during the transition near the critical Rayleigh number, the spatial correlation of temperature fluctuations already indicates a shift toward the convection state, and that the characteristic distance over which fluctuations exert influence is small in the stable conduction and convection states and becomes large only during the transition.
01 Oct. 1994, Therm. Sci. Eng., 2 (4), 17 - 24, Japanese
The transition from conduction to convection in the two-dimensional Rayleigh-Bénard system has been simulated using the direct simulation Monte Carlo method, where the diffuse reflection boundary conditions are strictly applied at the top and bottom walls. It is shown that the determined critical Rayleigh number agrees well with that obtained by the macroscopic hydrodynamic equations. © 1994 The American Physical Society.
1994, Physical Review E, 49 (5), 4060 - 4064, English[Refereed]
Scientific journal
[Refereed]
Scientific journal
[Refereed]
Scientific journal
In high-Reynolds-number turbulence the spatial distribution of velocity fluctuation at small scales is strongly non-uniform. In accordance with the non-uniformity, the distributions of the inertial and viscous forces are also non-uniform. According to direct numerical simulation (DNS) of forced turbulence of an incompressible fluid obeying the Navier-Stokes equation in a periodic box at the Taylor microscale Reynolds number R-lambda approximate to 1100, the average < R-loc > over the space of the 'local Reynolds number' R-loc, which is defined as the ratio of inertial to viscous forces at each point in the flow, is much smaller than the conventional 'Reynolds number' given by Re = UL/nu, where U and L are the characteristic velocity and length of the energy-containing eddies, and nu is the kinematic viscosity. While both conditional averages of the inertial and viscous forces for a given squared vorticity omega(2) increase with omega(2) at large omega(2), the conditional average of R-loc is almost independent of omega(2). A comparison of the DNS field with a random structureless velocity field suggests that the increase in the conditional average of R-loc with omega(2) at large omega(2) is suppressed by the Navier-Stokes dynamics. Something similar is also true for the conditional averages for a given local energy dissipation rate per unit mass. Certain features of intermittency effects such as that on the Re dependence of < R-loc > are explained by a multi-fractal model by Dubrulle (J. Fluid Mech., vol. 867, 2019, P1).
CAMBRIDGE UNIV PRESS, Oct. 2021, JOURNAL OF FLUID MECHANICS, 929, English[Refereed]
Scientific journal
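For readability, the two quantities contrasted in the abstract above can be written as follows; the conventional Reynolds number is standard, while the pointwise form of R_loc is one natural reading of "the ratio of inertial to viscous forces at each point" and may differ in detail from the paper's exact definition.

```latex
Re = \frac{U L}{\nu},
\qquad
R_{\mathrm{loc}}(\boldsymbol{x},t) =
\frac{\bigl|(\boldsymbol{u}\cdot\nabla)\boldsymbol{u}\bigr|}
     {\bigl|\nu\,\nabla^{2}\boldsymbol{u}\bigr|}.
```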
This work introduces a new idea of batched 3D-FFT with a survey of data decomposition methods and a review of state-of-the-art high-performance parallel FFT libraries. It is also argued that the particular usage of multiple FFTs has been associated with batched execution. The batched 3D-FFT kernel, executed on the K computer, shows a 45.9% speedup when N and P are 2048^3 and 128, respectively. Batched FFT allows the developer to take advantage of a flexible internal data layout and scheduling to improve the total performance.
IOS PRESS, 2020, PARALLEL COMPUTING: TECHNOLOGY TRENDS, 36, 169 - 178, English[Refereed]
International conference proceedings
Introduction scientific journal
To achieve high-performance computation on recent supercomputer architectures with numerous cores, it is necessary to incorporate additional forms of parallelism into parallel simulations in order to break the performance saturation that results from employing only the domain decomposition approach. Parallel-in-Time (PinT) is one of the most promising candidates for such issues. In this paper, the authors demonstrate that the Parareal method is a simple but effective implementation of PinT and that its pipelined version shows higher performance. A performance evaluation was conducted on the K computer with up to 8,000 nodes, and the pipelined Parareal method combined with the domain decomposition method achieved a speedup of 222 times over sequential time integration.
日本計算工学会, May 2017, 計算工学講演会論文集 Proceedings of the Conference on Computational Engineering and Science, 22, 4p, JapaneseIntroduction scientific journal
The X-ray free-electron laser facility SACLA delivers ultra-short-pulse, highly brilliant, coherent X-rays. X-ray coherent diffraction imaging, one of the scientific applications of SACLA, aims at the structural analysis of noncrystalline proteins. To reconstruct the three-dimensional structure of proteins, several million diffraction images are to be analyzed using supercomputers such as the K computer. To achieve cooperative experiments, we performed data-transfer tests from SACLA to the FOCUS and e-Science supercomputers.
The Institute of Electronics, Information and Communication Engineers, 05 Oct. 2012, IEICE technical report. Internet Architecture, 112 (236), 25 - 30, Japanese
Large-scale systems such as the K computer and the Earth Simulator adopt a two-level file system to secure file I/O performance for the compute nodes, and a file-staging mechanism that moves files between the file systems as part of job execution is built into job scheduling. This paper reports an evaluation, using a software job simulator, of the impact of file staging on job scheduling.
26 Sep. 2012, 研究報告ハイパフォーマンスコンピューティング(HPC), 2012 (22), 1 - 6, Japanese
Development of the next-generation supercomputer (nicknamed the K computer) started in FY2006 as a seven-year plan under the goal of "developing and using the world's most advanced, highest-performance next-generation supercomputer and developing and disseminating its usage technologies". This article gives an overview of the project up to system completion, including the development policy, the development history, the determination and revision of the system configuration, and the manufacturing and performance demonstration of the K computer.
Information Processing Society of Japan, 15 Jul. 2012, 情報処理, 53 (8), 754 - 758, JapaneseIntroduction scientific journal
Report scientific journal
This paper reports a method of speeding up MPI collective communication on the K computer, which consists of more than 80 thousand computing nodes connected by a direct network. Almost all existing MPI libraries implement only algorithms optimized for indirect networks. However, such algorithms perform poorly on a direct network because of collisions of the messages. Thus, in order to achieve high performance on a direct network, it is necessary to implement collective algorithms optimized for the network topology. In this paper, the Trinaryx3 Allreduce algorithm is designed and implemented in the MPI library for the K computer. The algorithm is optimized for a torus network and enables utilizing multiple RDMA engines, one of the strengths of the K computer. The evaluation result shows that the new implementation achieves five times higher bandwidth than the existing one, which is optimized for an indirect network.
Information Processing Society of Japan (IPSJ), 21 Nov. 2011, IPSJ SIG Notes, 2011 (6), 1 - 10, Japanese
The Advanced Institute for Computational Science (AICS) was established in Kobe on July 1, 2010. This new organization is responsible for operating the Next-Generation Supercomputer, named "K" after the character 京, which stands for 10 to the 16th power, and for carrying out R&D in computational science and technology. Our mission is to get the maximum potential use out of the K computer to propel Japan to a leading position in the world of computational science and technology. Petascale computing hardware is just around the corner, and petascale resources will enable us to enter a new era of modeling. The supercomputer is an essential tool for contemporary science and technology. The potential it offers for expanding basic research in the study of the universe, elementary particles, materials science, and the life sciences is clear. But the supercomputer is equally essential to a wide range of advanced science and technology directly related to our daily lives. We are in the midst of a fierce global competition to develop and use the most advanced supercomputers. AICS will work to further science and technology, buttress Japan's competitive edge in science, and respond to the needs of both the Japanese people and the global community. Our vision is for AICS to become a mecca of computational science, a converging point of global knowledge attracting scientists from around the world. We hope to produce exciting results that will amaze the world and the Japanese people.
The Physical Society of Japan, 2011, Butsuri, 66 (7), 524 - 528, Japanese
A supercomputer is a computer that performs scientific and engineering computations at high speed and is an indispensable infrastructure tool of computational science for the future development of science and technology. Since fiscal 2006, RIKEN has been developing, within the next-generation supercomputer development project, a general-purpose supercomputer of world-leading speed exceeding a LINPACK performance of 10 petaflops (nicknamed the K computer). This article gives an overview of the K computer, whose manufacture has already started.
Atomic Energy Society of Japan, 01 Dec. 2010, Journal of the Atomic Energy Society of Japan, 52 (12), 782 - 786, Japanese
Grid technology was used to develop the GridFMO application, which performs quantum chemical calculations in a distributed parallel environment. The Fragment Molecular Orbital (FMO) method was employed to obtain accurate electronic states of proteins. To support long-running calculations, Grid middleware with high fault tolerance and flexible resource management was developed on top of the lower-level middleware Ninf-G. Uniting 10 cluster computers around the Pacific Rim, 14 GridFMO calculations were conducted over a period of 70 days while the machines were shared with other users through the batch queue system of each machine. The importance of fault tolerance and resource management was demonstrated through this experiment.
Information Processing Society of Japan (IPSJ), 15 May 2007, 情報処理学会論文誌コンピューティングシステム(ACS), 48 (8), 83 - 93, Japanese
A data repository system called EDgrid Central was designed for storing the huge amount of experiment data produced by a 3-D full-scale earthquake testing facility. EDgrid Central provides large storage capacity and implements a data model for shake tests in the backend. The frontend is a portal through which users retrieve the stored data by metadata search and bulk download. The system is based on NEEScentral, developed by the NEES project in the United States, with its search and download functionality enhanced according to the EDgrid users' requirements. EDgrid Central gives facility sites a permanent repository of shaking-table experiments and also enables civil engineering researchers to share their data and reports in their daily activities.
Information Processing Society of Japan (IPSJ), 27 Feb. 2006, IPSJ SIG Notes, 2006 (20), 115 - 120, Japanese
When a large-scale cluster has to be stopped safely, for example because of a power failure, the loss of air conditioning during the shutdown also poses a problem. In this paper, the two-dimensional distribution and time trend of the room temperature are surveyed, and the safety of shutting the system down without backup air conditioning is confirmed.
Information Processing Society of Japan (IPSJ), 07 Mar. 2005, IPSJ SIG Notes, 162 (19), 127 - 132, Japanese
The energy spectrum in the near-dissipation range of turbulence is studied by analyzing the data of a series of high-resolution direct numerical simulations of incompressible homogeneous turbulence in a periodic box, with the Taylor-microscale Reynolds number R_λ and resolution ranging up to about 675 and 4096^3, respectively. The spectra in this Reynolds number range fit well to the form C (kη)^α exp(−β kη) in the wavenumber range 0.5 ≲ kη ≲ 1.5, where η is the Kolmogorov dissipation length scale and C, α and β are constants independent of k. The values of α and β decrease monotonically with R_λ and are consistent with the conjecture that they approach constants as R_λ → ∞, but the approach, especially that of β, is slow.
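A small illustration (synthetic data and parameter values are assumptions, not DNS results) of how the near-dissipation-range form C (kη)^α exp(−β kη) could be fitted, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(k_eta, C, alpha, beta):
    """Near-dissipation-range form  E(k) = C * (k*eta)^alpha * exp(-beta * k*eta)."""
    return C * k_eta**alpha * np.exp(-beta * k_eta)

# Synthetic "spectrum" on 0.5 <= k*eta <= 1.5 (illustration only, not DNS data).
k_eta = np.linspace(0.5, 1.5, 50)
true = model(k_eta, C=0.05, alpha=3.3, beta=7.1)
noisy = true * (1.0 + 0.02 * np.random.default_rng(2).standard_normal(k_eta.size))

# Least-squares fit of C, alpha, beta over the chosen wavenumber window.
popt, _ = curve_fit(model, k_eta, noisy, p0=(0.1, 3.0, 7.0))
C_fit, alpha_fit, beta_fit = popt
```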
PHYSICAL SOC JAPAN, 2005, Journal of the Physical Society of Japan, 74 (5), 1464 - 1471, English[Refereed]
The statistics of energy transfer are studied by using the data of a series of high-resolution direct numerical simulations of incompressible homogeneous turbulence in a periodic box, with the Taylor-microscale Reynolds number R_λ and the number of grid points up to approximately 1130 and 4096^3, respectively. The data show that the energy transfer T across the wavenumber k is highly intermittent, and that the skewness S and flatness F of T increase with k approximately as S ∝ (kL)^{α_S} and F ∝ (kL)^{α_F} in the inertial subrange, where α_S ≈ 2/3, α_F ≈ 1 and L is the characteristic length scale of the energy-containing eddies. A comparison between the statistics of T, the energy dissipation rate ε and its average ε_r over a domain of scale r shows that T is less intermittent than ε, while there is a certain similarity between the probability distribution functions of T and ε_r.
PHYSICAL SOC JAPAN, 2005, Journal of the Physical Society of Japan, 74 (12), 3202 - 3212, English[Refereed]
The AIST supercluster installed at the Grid Technology Research Center consists of three systems: P-32, M-64, and F-32. We performed the Linpack benchmark on the P-32 cluster, the largest of the three systems. The measured performance of 6.155 Tflop/s corresponds to 75% of the theoretical peak performance. This paper reports how an appropriate combination of HPL parameters was found efficiently, based on an analysis of the computation and communication time behavior of the HPL program for large-scale problems, and then discusses the effect of the NUMA capability of the Linux kernel on Linpack performance, as revealed through the benchmark.
Information Processing Society of Japan (IPSJ), 30 Jul. 2004, IPSJ SIG Notes, 99 (81(HPC-99)), 163 - 168, Japanese
High-resolution direct numerical simulations (DNSs) of incompressible turbulence based on an alias-free spectral method were performed on the Earth Simulator. Statistics of turbulence are studied by a DNS on 1024^3 grid points, with special emphasis on the spectra of moments of fourth order in the velocity. A brief review is given of some results of the preliminary analysis of the data of DNSs with up to 2048^3 grid points.
SPRINGER, 2004, Iutam Symposium on Reynolds Number Scaling in Turbulent Flow, 74, 155 - 162, English[Refereed]
The Earth Simulator (ES) is an SMP cluster system. There are two types of parallel programming models available on the ES. One is a flat programming model, in which a parallel program is implemented with MPI interfaces only, both within an SMP node and among nodes. The other is a hybrid programming model, in which a parallel program is written using thread programming within an SMP node and MPI programming among nodes simultaneously. It is generally known to be difficult to obtain with the hybrid programming model the same high level of performance as can be achieved with the flat programming model. In this paper, we have evaluated the scalability of a code for direct numerical simulation of the Navier-Stokes equations on the ES. The hybrid programming model achieves a sustained performance of 346.9 Gflop/s, while the flat programming model achieves 296.4 Gflop/s with 16 PNs of the ES for a DNS problem size of 256^3. For small-scale problems, however, the hybrid programming model is not as efficient because of microtasking overhead. It is shown that the hybrid programming model has an advantage on the ES for larger problems. (C) 2004 Elsevier B.V. All rights reserved.
ELSEVIER SCIENCE BV, 2004, Parallel Computing, 30 (12), 1329 - 1343, English[Refereed]
To carry out large-scale scientific computations, we have to access a computer center that has supercomputers or large-scale cluster systems. However, executing jobs at such a center is not always familiar to us, and research may be delayed as a result. A portal site for scientific researchers can be built to provide a friendly computational environment. In this study, we have constructed a portal site for molecular dynamics (MD) simulations with two application components by using the Grid PSE Builder, a framework under development for grid-enabled problem solving environments (Grid PSE). The two components are an MD simulation component using the parallel MD Stencil and an image generator for snapshots of MD simulations. PSE users can choose one of the components and then move to the job submission page, the job status page, and the results download page of the portal. The image generator provides a feature for monitoring a simulation as an animation in the web browser. We have confirmed the effect of the portal by applying it to a simulation of the intrinsic transformation of a vacancy dislocation loop in a copper crystal.
Information Processing Society of Japan (IPSJ), 16 Oct. 2003, IPSJ SIG Notes, 96, 55 - 60, Japanese
The Earth Simulator is an ultra-high-speed supercomputer developed for global environmental change simulations. To achieve high-performance computing on large-scale distributed-memory parallel computers such as the Earth Simulator, optimization of the communication processing in user applications is required, and that optimization needs an evaluation of the performance of the communication methods. On the Earth Simulator, the Message Passing Interface (MPI) is supported as the communication method. We have evaluated the performance of the MPI-1/MPI-2 functions on the Earth Simulator in detail using MBL, which was developed for measuring MPI performance on various parallel computers. The results show that the maximum throughputs of ping-pong communication using MPI_Send are 14.8 GB/s within a node and 11.8 GB/s between two nodes. The latencies of MPI_Send and MPI_Put are 5.58 and 6.36 microseconds, respectively. With one MPI process per node on 512 nodes, the latencies of MPI_Barrier and MPI_Win_fence are 3.25 and 223.75 microseconds, respectively. We found that the MPI implementation on the Earth Simulator has excellent performance.
Information Processing Society of Japan (IPSJ), 15 Jan. 2003, 情報処理学会論文誌. ハイパフォーマンスコンピューティングシステム, 44 (6), 24 - 34, Japanese
The spectra of quantities quadratic in the velocity, including the energy dissipation rate ε per unit mass, the enstrophy ω² and the pressure p, were measured using data obtained from direct numerical simulations (DNSs) of incompressible turbulence in a periodic box with up to 2048^3 grid points. These simulations were performed on the Earth Simulator computing system. The spectra of ε, ω² and p exhibit a wavenumber range in which they scale with the wavenumber k as k^(−a). The exponent a for p is about 1.81, in good agreement with the value obtained by assuming the joint probability distribution of the velocity field to be Gaussian, while the values of a for ε and ω² are about 2/3 and very different from the Gaussian-approximation values.
PHYSICAL SOC JAPAN, 2003, Journal of the Physical Society of Japan, 72 (5), 983 - 986, English[Refereed]
High-resolution direct numerical simulations (DNSs) of incompressible homogeneous turbulence in a periodic box with up to 4096^3 grid points were performed on the Earth Simulator computing system. DNS databases, including the present results, suggest that the normalized mean energy dissipation rate per unit mass tends to a constant, independent of the fluid kinematic viscosity ν as ν → 0. The DNS results also suggest that the energy spectrum in the inertial subrange almost follows the Kolmogorov k^(−5/3) scaling law, where k is the wavenumber, but the exponent is steeper than −5/3 by about 0.1. (C) 2003 American Institute of Physics.
AMER INST PHYSICS, 2003, Physics of Fluids, 15 (2), L21 - L24, English[Refereed]
High-resolution direct numerical simulations (DNSs) of incompressible turbulence with up to 2048^3 grid points have been executed on the Earth Simulator (ES). The DNSs are based on the Fourier spectral method, so the equation for mass conservation is solved accurately. In DNSs based on the spectral method, most of the computation time is consumed in calculating the three-dimensional Fast Fourier Transform (3D FFT). In this paper, we tuned the 3D-FFT algorithm for the Earth Simulator and achieved a DNS sustaining 16.4 TFLOPS on 2048^3 grid points.
NEC CORP, 2003, Nec Research & Development, 44 (1), 115 - 120, English[Refereed]
There are two programming models on the shared-memory architecture. One, called flat programming, uses MPI only, and the other, called hybrid programming, uses MPI and shared-memory programming simultaneously. In general, it is difficult for hybrid programming to outperform flat programming. In this study, we evaluated the scalability of large-scale direct numerical simulations of the Navier-Stokes equations on the Earth Simulator. As a result, hybrid programming could outperform flat programming on the Earth Simulator. We also discuss tuning strategies for obtaining higher performance on the Earth Simulator.
Information Processing Society of Japan (IPSJ), 21 Aug. 2002, IPSJ SIG Notes, 91, 55 - 60, Japanese
The Earth Simulator has 640 processor nodes and its peak performance is 40 Tflop/s. Each node has 8 vector processors, each with a peak performance of 8 Gflop/s, and 16 GByte of shared main memory. There are two programming methods on a node of the Earth Simulator. One, called flat programming, uses MPI on the shared-memory architecture; the other, called hybrid programming, uses "microtask" processing through automatic parallelization by the compiler. In this study, we have evaluated three basic aspects of performance. The first is the calculation time within a node of 8 vector processors, including the microtask start and stop overhead or the MPI barrier. The second is the data transfer time between two nodes with 1-by-1 or 8-by-8 MPI processes. The last is the time for an application combining calculation and data transfer. Finally, we evaluated an application program performing large-scale direct numerical simulations of the Navier-Stokes equations, in which most of the calculation time is spent in the 3-dimensional FFT. The total running times with 8 nodes (64 APs) are 4 and 30 seconds for the 256^3 and 512^3 problem sizes, respectively. Since the difference between the two programming models is about 1 second, hybrid programming can achieve the same performance as flat programming.
Information Processing Society of Japan (IPSJ), 27 May 2002, IPSJ SIG Notes, 90, 19 - 24, Japanese
We implemented a feature for irregular problems, called HALO, in the HPF/ES compiler on the Earth Simulator. HALO supports irregular access to and communication of an array, and makes it possible to write efficient parallel programs for irregular problems easily. This paper describes the usage and implementation of HALO and shows its evaluation results on the Earth Simulator. A finite element method benchmark program parallelized with HALO executed more than 10 times faster than the version parallelized without HALO on the Earth Simulator.
Information Processing Society of Japan (IPSJ), 27 May 2002, IPSJ SIG Notes, 90, 61 - 66, Japanese[Refereed]
Summary national conference
Technical report
Technical report
Technical report
The Earth Simulator, which is under development, has 640 processor nodes and its peak performance is 40 Tflop/s. Each node has 8 vector processors, each with a peak performance of 8 Gflop/s, and 16 GByte of shared main memory. In this study, we have evaluated the performance of a solid molecular dynamics simulation on an SMP node of the Earth Simulator. In a molecular dynamics simulation, each particle is influenced by all particles within a cut-off region, and these particle pairs are represented by a matrix. Two matrix representations, the compressed row form and the jagged diagonal form, are considered for vectorization. The jagged diagonal form outperforms the compressed row form on a vector processor for the force calculation over all pairs, because its vector length is longer. However, the computational cost of converting the normal matrix form to the jagged diagonal form is quite high, so the total performance obtained with the jagged diagonal form is low. The speedup by parallelization with the compressed row form is 2.4 to 2.7 with 8 vector processors.
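A small sketch (NumPy assumed; toy matrix, not the simulation code) contrasting the two storage formats: in compressed-row (CSR) form the inner loop length is one particle's neighbour count, while in jagged-diagonal (JAD) form the rows are sorted by length and swept one "jagged diagonal" at a time, giving vector lengths close to the number of particles but requiring a conversion step.

```python
import numpy as np

def to_csr(A):
    indptr, indices, data = [0], [], []
    for row in A:
        nz = np.nonzero(row)[0]
        indices.extend(nz); data.extend(row[nz]); indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def to_jad(A):
    counts = (A != 0).sum(axis=1)
    perm = np.argsort(-counts, kind="stable")        # rows sorted, longest first
    rows = [np.nonzero(A[i])[0] for i in perm]
    jd_ptr, indices, data = [0], [], []
    for j in range(counts.max()):
        for k, i in enumerate(perm):
            if j < counts[i]:
                col = rows[k][j]
                indices.append(col); data.append(A[i, col])
        jd_ptr.append(len(indices))
    return perm, np.array(jd_ptr), np.array(indices), np.array(data)

def csr_matvec(indptr, indices, data, x):
    # Inner loop over one (short) row: vector length = neighbour count.
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        s, e = indptr[i], indptr[i + 1]
        y[i] = data[s:e] @ x[indices[s:e]]
    return y

def jad_matvec(perm, jd_ptr, indices, data, x, n):
    # Inner loop over one jagged diagonal: vector length ~ number of rows.
    y = np.zeros(n)
    for j in range(len(jd_ptr) - 1):
        s, e = jd_ptr[j], jd_ptr[j + 1]
        y[perm[:e - s]] += data[s:e] * x[indices[s:e]]
    return y

rng = np.random.default_rng(3)
A = np.where(rng.random((6, 6)) < 0.4, rng.random((6, 6)), 0.0)
A[np.diag_indices(6)] = 1.0                          # no empty rows
x = rng.random(6)
indptr, ind, dat = to_csr(A)
perm, jd_ptr, jind, jdat = to_jad(A)
assert np.allclose(csr_matvec(indptr, ind, dat, x), A @ x)
assert np.allclose(jad_matvec(perm, jd_ptr, jind, jdat, x, 6), A @ x)
```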
Information Processing Society of Japan (IPSJ), 26 Oct. 2001, IPSJ SIG Notes, 88, 67 - 72, Japanese
MPI is one of the major message communication interfaces for application programs. MPI consists of MPI-1 as the basic specification and MPI-2 as extensions. Several benchmark programs for MPI-1 have already been proposed; benchmark programs for MPI-2, however, are few and their measurements are limited. We have developed an MPI benchmark program library for MPI-2 (MBL2) that measures the detailed performance of the MPI-I/O and RMA functions of MPI-2. In this report, we describe MBL2 and the MPI-2 performance data we measured with it on the VPP5000 and SX-5.
Information Processing Society of Japan (IPSJ), 25 Jul. 2001, IPSJ SIG Notes, 87, 67 - 72, Japanese
The Earth Simulator is a distributed-memory parallel system that consists of 640 processor nodes connected by a full crossbar network. Each processor node is a shared-memory system composed of eight vector processors. The total peak performance and main memory capacity are 40 Tflops and 10 TB, respectively. A performance prediction system for the Earth Simulator, GS3, has been developed to estimate the sustained performance of programs. To validate the accuracy of vector performance prediction by GS3, the processing times of three groups of kernel loops estimated by GS3 were compared with those measured on an SX-4. The absolute relative errors of the processing time are 0.89%, 1.42% and 6.81% on average for the three groups. The sustained performance of the three groups on a processor of the Earth Simulator has been estimated by GS3 to be 5.94 Gflops, 3.76 Gflops and 2.17 Gflops on average.
JAPAN SOCIETY FOR COMPUTATIONAL ENGINEERING AND SCIENCE, 2001, Transactions of the Japan Society for Computational Engineering and Science, 2001 (0), 20010040 - 20010040, Japanese
As part of a project that promotes research on predicting global environmental change through the trinity of process (basic science) research, observation and computer simulation, the Science and Technology Agency is developing the ultra-high-speed parallel computer "Earth Simulator" for atmospheric general circulation simulations. This article outlines the hardware, system software and application software of the Earth Simulator.
Information Processing Society of Japan (IPSJ), 15 Apr. 2000, IPSJ Magazine, 41 (4), 369 - 374, Japanese
As part of a project that promotes research on predicting global environmental change, the Science and Technology Agency is developing the ultra-high-speed parallel computer "Earth Simulator" for atmospheric general circulation simulations. This article explains the necessity of developing the Earth Simulator, its application targets, the requirements it imposes on the computer, the development schedule, and its position among the world's high-performance computer development programs.
Information Processing Society of Japan (IPSJ), 15 Mar. 2000, IPSJ Magazine, 41 (3), 249 - 254, Japanese
High Performance Fortran (HPF) is considered one of the major parallel programming interfaces, along with the Message Passing Interface (MPI). HPF is a high-level data parallel language designed to provide a clear and easily understood programming interface. Users can parallelize their sequential programs mainly by inserting directives that specify data mapping onto distributed memories. We plan to adopt HPF as a common parallel programming interface on the Earth Simulator, a distributed-memory parallel system under development that mainly targets earth science. In this paper, we evaluated the efficiency and descriptive power of HPF by parallelizing two application programs originally developed for sequential execution. The evaluation results showed that users can obtain good scalability with HPF programming with relatively small effort.
Information Processing Society of Japan (IPSJ), 03 Dec. 1999, IPSJ SIG Notes, 79, 49 - 54, Japanese
The Earth Simulator is a distributed-memory parallel system that consists of 640 processor nodes connected by a crossbar network. Each processor node is a shared-memory system composed of eight vector processors. The total peak performance and main memory are 40 Tflop/s and 10 TB, respectively. A software simulator (GSSS) for the Earth Simulator and similar computers has been developed to estimate the sustained performance of programs. To validate the accuracy of the software simulator, the processing times of some kernel DO loops estimated by the GSSS were compared with those measured on an SX-4. The absolute relative error of the processing time is about 1% on average. The sustained performance of the kernel loops on the Earth Simulator has been estimated by the GSSS, and a performance of 4.18 Gflop/s on average is obtained.
Information Processing Society of Japan (IPSJ), 04 Mar. 1999, IPSJ SIG Notes, 132 (21), 55 - 60, Japanese
In recent years, as the development of faster processor elements has approached its limit, parallelism has become one of the solutions for massive numerical simulations, and efficient, portable parallel libraries are needed. Using MPI or PVM, which are not limited to a particular architecture, we are developing parallel subroutine libraries intended for parallel vector processors. We report here the development of a parallel subroutine library for the eigenvalue problem of a real symmetric matrix, based on the Householder transformation and the bisection method. The matrix is partitioned into columns by a column-wise cyclic decomposition scheme, and all elements of the symmetric matrix are stored in order to reduce data exchanges among processors. For the Householder transformation using eight processors on the Paragon, a speedup ratio of 6.0 has been achieved for a matrix of 2000×2000 elements. For a matrix of 4000×4000 elements, the ratio is 4.2 on the VPP300.
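A serial textbook sketch of the two building blocks named above (NumPy assumed; helper names are hypothetical, and the library itself distributes the columns cyclically): Householder reflections reduce the symmetric matrix to tridiagonal form, and a Sturm-sequence count then drives the bisection method for the eigenvalues.

```python
import numpy as np

def householder_tridiag(A):
    """Reduce a real symmetric matrix to tridiagonal form by Householder
    similarity transforms (simple O(n^3) textbook form)."""
    T = A.astype(float).copy()
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k + 1:, k]
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        if np.linalg.norm(v) == 0:
            continue
        v /= np.linalg.norm(v)
        H = np.eye(n)
        H[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)
        T = H @ T @ H
    return T

def count_eigs_below(d, e, x):
    """Sturm-sequence count of eigenvalues of the tridiagonal matrix
    (diagonal d, off-diagonal e) that are smaller than x; this count is
    what the bisection method brackets with."""
    count, q = 0, 1.0
    for i in range(len(d)):
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:
            q = 1e-300
        if q < 0:
            count += 1
    return count

rng = np.random.default_rng(4)
S = rng.standard_normal((6, 6)); S = 0.5 * (S + S.T)
T = householder_tridiag(S)
d, e = np.diag(T), np.diag(T, -1)
assert count_eigs_below(d, e, 1e6) == 6      # all six eigenvalues lie below 1e6
```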
Information Processing Society of Japan (IPSJ), 28 Aug. 1996, IPSJ SIG Notes, 62, 129 - 134, Japanese
The successive overrelaxation method, also called the SOR method, is one of the iterative methods for solving a linear system of equations and has been used in many programs. With the advent of vector processors, the SOR method is executed efficiently with the red-black or hyperplane ordering. In this paper, a parallel scheme termed the 4-color SOR method is revised and compared with the natural and red-black SOR methods on a multiprocessor system. The 4-color SOR method shows the highest parallel performance of the three. The parallel-vector calculation of the 4-color method with 4 processors is about 10 times faster than the scalar calculation with one processor.
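A minimal sketch (NumPy assumed; red-black rather than the paper's 4-colour ordering) of why coloured orderings parallelize: within one colour, no two updated points are stencil neighbours, so a whole half-sweep can be done as a single vector operation.

```python
import numpy as np

def redblack_sor(f, h, omega=1.8, sweeps=200):
    """Red-black SOR for -Laplace(u) = f on a uniform grid with zero
    Dirichlet boundaries.  Points of one colour couple only to the other
    colour, so each half-sweep updates independent unknowns; a 4-colour
    ordering extends the same idea to other stencils."""
    u = np.zeros_like(f, dtype=float)
    n, m = f.shape
    ii, jj = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
    interior = (ii > 0) & (ii < n - 1) & (jj > 0) & (jj < m - 1)
    for _ in range(sweeps):
        for color in (0, 1):
            mask = interior & ((ii + jj) % 2 == color)
            gs = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                         np.roll(u, 1, 1) + np.roll(u, -1, 1) + h * h * f)
            u[mask] = (1.0 - omega) * u[mask] + omega * gs[mask]
    return u

# Tiny usage example on a 33x33 grid (values are illustrative only).
n = 33
h = 1.0 / (n - 1)
u = redblack_sor(np.ones((n, n)), h)
```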
The Japan Society of Mechanical Engineers, 1990, TRANSACTIONS OF THE JAPAN SOCIETY OF MECHANICAL ENGINEERS Series B, 56 (524), 1062 - 1065, Japanese
Many benchmark problems for the numerical analysis of fluid flow have been proposed. In this paper, a lid-driven cavity flow, one of the most famous benchmark problems, is examined. Four numerical schemes are compared with each other in terms of grid dependency and accuracy of the solutions. From the viewpoint of the interaction between neighboring computational cells, we choose the CONDIF (Controlled Numerical Diffusion with Internal Feedback) approach developed by Runchal and check this scheme on the cavity flow problem. It is found that the CONDIF approach is very stable with respect to the grid Peclet number condition, and that the time step is not sensitive to the Courant condition.
The Japan Society of Mechanical Engineers, 1989, TRANSACTIONS OF THE JAPAN SOCIETY OF MECHANICAL ENGINEERS Series B, 55 (515), 1823 - 1828, Japanese
Three-dimensional laminar flows in various curved pipes (i.e., Rc/a = 2 and 9 for curved 180° pipes and Rc/a = 2 for curved 90° pipes) are numerically simulated using the full Navier-Stokes equations. The boundary-fitted coordinate system is used in order to treat complicated boundary configurations such as a strongly curved pipe. The obtained flow patterns and the magnitude of the secondary flow in the case of Rc/a = 9 for a curved 180° pipe are in good agreement with the experimental results. The elliptic nature of the basic equations is very important in simulating these recirculating flows, especially in the case of small curvature. Two separation regions, one occurring on the outside wall near the inlet and the other on the inside wall at the outlet, are found in the case of Rc/a = 2. The crossover point between the shear stress maxima on the inside and outside of the pipe is independent not only of the curvature and the bending angle of the pipe, but also of the initial profile in the straight pipe.
The Japan Society of Mechanical Engineers, 1989, TRANSACTIONS OF THE JAPAN SOCIETY OF MECHANICAL ENGINEERS Series B, 55 (518), 3011 - 3018, Japanese
Three-dimensional laminar flows in various curved pipes (i.e., Rc/a = 2, 4, 7 and 9 for curved 180-degree pipes and Rc/a = 2 and 4 for curved 90-degree pipes) were numerically simulated using the time-dependent incompressible Navier-Stokes equations. The boundary-fitted coordinate system was used in order to treat the complicated boundary configurations of strongly curved pipes. The obtained flow patterns and the magnitude of the secondary flow in the case of Rc/a = 9 for the curved 180-degree pipe were in good agreement with the experimental results. The ellipticity of the basic equations was very important for simulating these recirculating flows, especially when the curvature ratio was small. The velocity and pressure fields were visualized by a pseudo-color technique on a workstation.
THE VISUALIZATION SOCIETY OF JAPAN, 1988, JOURNAL OF THE FLOW VISUALIZATION SOCIETY OF JAPAN, 8 (30), 201 - 204, Japanese
This paper describes the reduction of the computation time for the large sparse linear equations obtained by finite-difference discretization of a three-dimensional Poisson equation. The equations arise from wind field calculations, which are needed for evaluating the environmental consequences of radioactive effluents.
Various iterative methods, such as the ICCG, MICCG, ILUCR and MILUCR methods, are applied to solving the linear equations and are compared with the SOR method. The optimum value of the acceleration factor of the SOR method can be obtained numerically according to the atmospheric stability at each nuclear site, and the number of iterations is minimized by using this optimum value.
The computation time of the MICCG and MILUCR methods is half that of the SOR method. The ILUCR method is also better than the SOR method because it needs no acceleration factor and its computation time is shorter. The use of a vector computer drastically reduces the computation time, and all the iterative methods are applicable. On a scalar computer, however, the MILUCR and MICCG methods are preferred because they take half the computation time of the SOR method.
Thanks to faster computers and improved algorithms, linear mathematical programming problems such as linear programming and mixed-integer programming problems can now be solved in little computation time, except for some large-scale problems. However, because there has been no language in which mathematical programming problems can be described easily, many users spend a great deal of time preparing input data for the computer. The authors therefore developed PDL/MP, an extremely simple, scientific-computing-style description language for mathematical programming problems. Because the language allows a problem to be written in a form close to its mathematical formulation, learning the language and describing and modifying problems take very little time. The PDL/MP processing system interprets a problem written in PDL/MP and automatically generates the input data for MPS-type software, which is widely used around the world. This paper describes an overview of PDL/MP, its processing system, and application examples.
Information Processing Society of Japan (IPSJ), 15 Sep. 1986, IPSJ Journal, 27 (9), 880 - 891, Japanese
In recent years, with the diversification and growth of computer use, more and more users install several computers at a single site. This paper focuses on the configuration design of such single-site multi-computer systems and describes a mixed-integer programming model that finds the minimum-cost computer configuration able to process a given computational demand. Compared with existing models, this model has the following features: (i) vector computers can be treated as candidate machines; (ii) by using a two-level objective function, the optimal configuration and the optimal job-load allocation under that configuration can be obtained simultaneously; (iii) several categories of operation modes, operating hours and so on can be defined, so that various operational constraints can be described; (iv) because the model is linear, it can easily be solved with widely used general-purpose mathematical programming software.
Information Processing Society of Japan (IPSJ), 15 Sep. 1985, IPSJ Journal, 26 (5), 807 - 814, Japanese
Scholarly book
Scholarly book
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Nominated symposium
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
Oral presentation
[Invited]
Invited oral presentation
Public discourse
Oral presentation
Keynote oral presentation
Invited oral presentation
Public discourse
[Invited]
Invited oral presentation
The Japan Society of Fluid Mechanics
The Japan Society for Industrial and Applied Mathematics
Information Processing Society of Japan
The objective of this research is to develop and evaluate a fast solution method on multicore systems for a coupled building-ground seismic response simulation code. To examine the earthquake resistance of buildings, numerical simulations are carried out to obtain the seismic response of the ground and the building standing on it. Because such simulations target complicated domains, the equations of motion are usually set up at each point of a three-dimensional finite element mesh, and the large system of linear equations obtained by discretizing them is solved. The global matrix obtained with the finite element method is sparse and irregular; it has been solved sequentially with a conjugate gradient method (the PSCCG method) that exploits the properties of the submatrix corresponding to the building and the submatrix corresponding to the ground, applying partial preconditioning by a Cholesky factorization to the former and partial preconditioning by scaling to the latter. The Cholesky part of the preconditioner is a direct method and therefore difficult to parallelize, but the other parts can be executed in parallel. In this research, the computation time of this solver is reduced by parallelizing it on multicore systems. In fiscal 2018, to examine the feasibility of the approach, a diffusion equation on a two-dimensional computational domain was discretized with the finite difference method as a model problem; by varying the diffusion coefficient from one part of the domain to another, a global matrix containing submatrices with large condition numbers was constructed. At the same time, a process-parallel implementation was made for the PSCCG computation in the existing time-evolution simulation code for coupled building-ground seismic analysis, and the parallel performance was evaluated with the number of threads fixed at 1 while increasing the number of processes.
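A generic preconditioned conjugate gradient sketch (NumPy assumed; the diagonal-scaling preconditioner below is a stand-in for the partial Cholesky plus scaling preconditioner described above, not the simulation code itself):

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, maxiter=500):
    """Preconditioned conjugate gradient for A x = b, with the preconditioner
    supplied as a function M_inv(r) ~ M^{-1} r."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small SPD test problem (illustrative only).
rng = np.random.default_rng(5)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)
b = rng.standard_normal(50)
diag_inv = 1.0 / np.diag(A)                  # simple scaling preconditioner
x = pcg(A, b, lambda r: diag_inv * r)
assert np.allclose(A @ x, b, atol=1e-5)
```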
Competitive research funding
(1) For the systems of linear equations that appear in the electric potential computation of plasma simulations, a highly parallel incomplete-LU-type preconditioning program based on the block red-black ordering method was developed and implemented on GPUs. Performance evaluation showed an acceleration of more than 10 times compared with execution on a multicore processor, and a speedup of about 3 times over the GPU solver MAGMA for large problems.
(2) The in-house middleware EigenKernel (https://github.com/eigenkernel/) for massively parallel generalized eigenvalue problems was extended to realize (i) benchmarking on Oakforest-PACS and (ii) strong-scaling performance prediction (extrapolation) by Markov chain Monte Carlo Bayesian inference. In addition, a selective eigenpair computation algorithm with excellent scalability was proposed, applied to real problems in electronic structure calculation to demonstrate its usefulness, and released as code (https://github.com/lee-djl/k-ep). As an application made possible only by massively parallel eigenvalue solvers, calculations of organic device materials (pentacene thin films, systems on the 10-nanometer scale) were performed, yielding the quasi-localized wave functions that directly govern device performance.
(3) For the large systems of linear equations that appear in seismic response simulations of buildings, a process-parallel implementation of the CG method preconditioned with a partially incomplete Cholesky decomposition was developed and its parallel performance was evaluated. In addition, for systems of linear equations with symmetric positive definite sparse coefficient matrices, a Cholesky factorization based on the relaxed supernodal multifrontal method was applied; the computational performance was evaluated with respect to the parameter that determines what is treated as a supernode, and a guideline for the optimal relaxation parameter was obtained.
This work has demonstrated case studies of effectively using machine learning techniques to support High-Performance Computing (HPC) programming. Various problems in code optimization can be solved by converting them into problems that machine learning has already been shown to solve. This work also clarified the importance of analyzing the target problems before applying machine learning, because a sufficient amount of training data is rarely available in code optimization problems. Like HPC programming, machine learning requires the knowledge and experience of human experts; in machine learning, however, the problem is already parameterized and can therefore be solved if sufficient computing performance is available.
In this project, we aimed at developing solvers for numerical linear algebra functions used in electronic structure calculations. The main achievements of this project are as follows. (1) We performed an error analysis of the CholeskyQR2 method, which is a promising communication-avoiding algorithm for the QR decomposition, and proved its numerical stability. (2) We developed a parallel linear equation solver based on the one-way dissection for quantum wave dynamics. (3) We developed a generalized eigenvalue solver EigenKernel, which is a hybrid solver that combines three eigenvalue solvers for dense matrices. These results will be useful for accelerating electronic structure calculations on many-core processors.
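A compact serial sketch of the CholeskyQR2 idea mentioned in (1) (NumPy assumed; the project's solver is a parallel implementation): each pass forms the small Gram matrix A^T A, which in a distributed setting needs only one global reduction, and the second pass restores the orthogonality lost in the first.

```python
import numpy as np

def cholesky_qr2(A):
    """CholeskyQR2: two passes of CholeskyQR for a tall-skinny matrix."""
    def cholesky_qr(X):
        G = X.T @ X                      # in parallel: local gemm + one allreduce
        R = np.linalg.cholesky(G).T      # upper-triangular factor
        Q = np.linalg.solve(R.T, X.T).T  # Q = X R^{-1}
        return Q, R
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)              # second pass re-orthogonalizes
    return Q, R2 @ R1

# Tall-skinny example (illustrative only).
rng = np.random.default_rng(6)
A = rng.standard_normal((1000, 20))
Q, R = cholesky_qr2(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(20), atol=1e-12)
```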
A parallel direct numerical simulation (DNS) code was developed on the K computer for solving the Navier-Stokes equations in a box with periodic boundary conditions along the three orthogonal axes. The purpose of the code is to simulate the behavior of homogeneous, isotropic, incompressible turbulence, the canonical turbulent flow and the most standard turbulent flow without boundary walls. A pseudo-spectral method is used for the discretization in three-dimensional space, and the fourth-order Runge-Kutta method is used for temporal discretization. Hybrid parallelization with both OpenMP and MPI is adopted in the code. A DNS with 12288^3 grid points was carried out on the K computer with a sustained performance of 2.2% on 37,376 compute nodes, and simulation data for a Reynolds number of about 2300 were obtained.
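A one-dimensional analogue (NumPy assumed; the viscous Burgers equation stands in for the 3-D Navier-Stokes equations, so this is an illustration only, not the K computer code) of the two ingredients named above: a pseudo-spectral evaluation of the nonlinear term with 2/3-rule dealiasing and classical fourth-order Runge-Kutta time stepping.

```python
import numpy as np

# 1-D viscous Burgers equation  u_t + u u_x = nu u_xx  solved pseudo-spectrally.
N = 256
nu = 0.05
L = 2.0 * np.pi
x = np.linspace(0.0, L, N, endpoint=False)
k = np.fft.fftfreq(N, d=L / N) * 2.0 * np.pi          # angular wavenumbers
dealias = np.abs(k) < (2.0 / 3.0) * np.max(np.abs(k)) # 2/3-rule mask

def rhs(u_hat):
    """d(u_hat)/dt: nonlinear term formed in physical space (pseudo-spectral),
    dealiased, plus the spectral viscous term."""
    u = np.fft.ifft(u_hat).real
    ux = np.fft.ifft(1j * k * u_hat).real
    nonlinear = np.fft.fft(u * ux) * dealias
    return -nonlinear - nu * k**2 * u_hat

def rk4_step(u_hat, dt):
    """Classical fourth-order Runge-Kutta step."""
    k1 = rhs(u_hat)
    k2 = rhs(u_hat + 0.5 * dt * k1)
    k3 = rhs(u_hat + 0.5 * dt * k2)
    k4 = rhs(u_hat + dt * k3)
    return u_hat + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

u_hat = np.fft.fft(np.sin(x))
dt = 1e-3
for _ in range(1000):
    u_hat = rk4_step(u_hat, dt)
u = np.fft.ifft(u_hat).real   # velocity field at t = 1
```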
Competitive research funding