
Name: Chen Yang Wang
Cell: +886 932-230-966
Email: me@youn.gg
Links: youn.gg

Technical Skills

Languages: C++ (17/20), Rust, Python, Bash, CMake
HPC & GPU: HIP (ROCm), CUDA, Kokkos, MPI, OpenMP
Systems & Tools: Linux, Docker, Git, Slurm, Proxmox VE

Education

National Ilan University, Department of Computer Science and Information Engineering (2023/09 - Present)
National Tainan Industrial High School, Information Technology (2020/09 - 2023/06)

Work Experience

HPC Engineering Intern, National Center for High-performance Computing (NCHC)

Remote, during semester (2025/11 - Present)

Focusing on software deployment and performance validation within the AMD ROCm ecosystem.
Deployed a single-node, multi-GPU (8× MI325X) inference environment using Docker Compose with vLLM and LMCache, improving large language model inference efficiency through prefill-decode (PD) disaggregation.

Projects

Libraries

hippp - Write GPU program with RAII

Developed a modern C++ header-only library leveraging RAII to manage GPU resources on the ROCm stack.
Simplified GPU programming workflow by reducing boilerplate code and preventing memory leaks.

Algorithms

ROCm-mini-nbody - A simple gravitational N-body simulation with ROCm optimizations

Ported mini-nbody, a simple gravitational N-body simulation, to HIP/ROCm using hipify-perl and CMake, enabling execution on AMD GPUs.

rocOdyssey - ROCm version of Odyssey: a public, GPU-based GRRT

Ported Odyssey, a General Relativistic Ray-Tracing code, from CUDA to HIP/ROCm to enable black hole simulations on AMD GPUs.
Tuned kernel launch parameters for AMD CDNA/RDNA architectures, setting the wavefront size to 64 and the thread block size to 256 to maximize compute unit occupancy.

Use Kokkos to Accelerate AE-QTS Algorithm

Accelerated the AE-QTS algorithm by migrating from single-threaded Python to the Kokkos performance portability framework.
Enabled cross-platform execution on both AMD and NVIDIA GPUs, achieving performance parity with the native CUDA implementation.

System Applications

Competitions

4th HiPAC High-Performance Application Competition - Honorable Mention

Accelerated LAMMPS molecular dynamics simulations in a multi-node environment (2 nodes × 8 NVIDIA V100 GPUs).
Devised a resource scheduling strategy: utilized off-peak hours for high-load testing and optimized execution scripts to minimize runtime, maximizing the team’s testing window.

National Center for High-Performance Computing International Student Cluster Competition (SCC)

Selected for National Team Training: Undergoing intensive HPC training to represent Taiwan in international supercomputing competitions.
Large-Scale Cluster Benchmarking: Achieved 11.48 PFLOPS on HPL using 256 NVIDIA H200 GPUs across 32 nodes.
Optimized problem size (N=1853440) and block size (NB=2048) with a 16x16 process grid, maximizing compute efficiency under constraints of standard Ethernet connectivity and topology-unaware Slurm scheduling.

HPL Benchmark Log (11.48 PFLOPS)

================================================================================
T/V                N    NB     P     Q         Time          Gflops (   per GPU)
--------------------------------------------------------------------------------
WR0          1853440  2048    16    16       369.72       1.148e+07 ( 4.485e+04)

HPL_pdgesv() start time Sat Mar  7 02:41:45 2026
HPL_pdgesv() end time   Sat Mar  7 02:47:54 2026

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   0.000264582858 ...... PASSED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . = 0.0000001085912040
||A||_oo . . . . . . . . . . . . . . . . . . . = 464354.0734328660182655
||x||_oo . . . . . . . . . . . . . . . . . . . = 4.2953119208323303
||b||_oo . . . . . . . . . . . . . . . . . . . = 0.9923124922204719
================================================================================
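
As a cross-check of the figures in the log above, the reported rate can be reproduced from N and the wall time with the standard HPL operation count, 2/3·N³ + 3/2·N². The sketch below is illustrative only: the function names are hypothetical, and it assumes one MPI rank per GPU (so P × Q = 256 GPUs) and 8-byte double-precision matrix entries.

```python
# Values copied from the HPL log above.
N = 1_853_440      # problem size (matrix order)
P, Q = 16, 16      # process grid
TIME_S = 369.72    # wall time reported by HPL, in seconds


def hpl_flops(n: int) -> float:
    """Operation count HPL credits for the LU solve: 2/3*n^3 + 3/2*n^2."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


def summarize(n: int, p: int, q: int, time_s: float) -> dict:
    """Derive aggregate and per-GPU rates from the log values."""
    gpus = p * q  # assumption: one MPI rank per GPU
    gflops = hpl_flops(n) / time_s / 1e9
    return {
        "gpus": gpus,
        "gflops_total": gflops,
        "pflops_total": gflops / 1e6,
        "gflops_per_gpu": gflops / gpus,
        # Assumption: the N x N matrix of 8-byte doubles is split evenly.
        "matrix_gb_per_gpu": n * n * 8 / gpus / 1e9,
    }


if __name__ == "__main__":
    for key, value in summarize(N, P, Q, TIME_S).items():
        print(f"{key}: {value:,.2f}")
```

Under these assumptions the derived aggregate and per-GPU rates should agree with the `Gflops` and `per GPU` columns of the log to the printed precision; the last entry estimates how much of the matrix each GPU holds.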

ISC 2026 Student Cluster Competition (Online) - Team Leader (In Progress)

Contributions

HPC Community Contributions

GitHub Linguist - Add .slurm and .sbatch as shell extensions: Fixed language detection for Slurm scripts (.slurm and .sbatch), improving the accuracy of code statistics for HPC projects.

ROCm Ecosystem Contributions

ROCm rocHPL - Add gfx1201 and ROCm 7.0 support: Updated CMake build logic to support the latest gfx1201 (RDNA4) architecture, resolving hardware compatibility issues for benchmark testing.
GitHub Linguist - Add HIP language support: Added syntax highlighting support for HIP, improving code statistics and readability for HIP projects.
GitHub gitignore - Add HIP.gitignore: Standardized the .gitignore template for HIP projects to improve the developer experience.
zed-industries Zed - Recognize HIP files as C++: Made the Zed editor treat HIP files as C++ for better syntax highlighting and code analysis, improving the development experience for HIP programmers.

Certifications

Community

TANET & NCS 2025 - Conference Staff
SITCON X - Speaker (Topic: Project Introduction & System Programming)
SCIST S3 Algorithm Course - Online Teaching Assistant
Jianbei Electrical Engineering Club - Club Instructor
Southern Nine Schools Information Club - Team Mentor (Joint Tea Party & Winter Training)
National Tainan Industrial High School Web Design Club - President & Instructor
