Software - AI Software Engineer (Generative AI)
Seoul, South Korea (On-site)
About the job
FuriosaAI is seeking a Software Engineer to join our Platform Software Team.
This team conducts focused research and engineering to develop a cutting-edge, end-to-end LLM serving solution and to deliver a streamlined software development kit (SDK) for the FuriosaAI Tensor Contraction Processor (TCP) architecture.
We are looking for an AI Software Engineer who will contribute to full-stack, real-world AI product development, from analyzing and researching to implementing inference and serving methods for Generative AI models.
Responsibilities
1. Design & Optimization of Generative AI Model Inference
Parallelism strategies: data/pipeline/tensor/sequence/context/expert parallelism, and new parallelism methods
Serving strategies: Selective Batching, Sarathi-Serve, Dynamic SplitFuse, dynamic MoE expert loading, etc.
Inference acceleration techniques: Speculative Decoding, KV-cache dropping, Sparse Attention, Hybrid Linear Attention (e.g., MiniMax-01), etc.
LLM reasoning inference techniques: search-based methods (MCTS, MCTSr and variants, Best-of-N, etc.) in combination with Chain-of-Thought, Tree-of-Thought, and Forest-of-Thought.
Research Generative AI models beyond LLMs (e.g., Diffusion Models).
2. Generative AI Model & System Co-Design
Co-design Generative AI models and systems while considering Furiosa's Tensor Contraction Processor (TCP) architecture and software stack (Compiler, Runtime, Serving Stack).
Conduct performance modeling of various Generative AI models and systems on GPU/NPU to optimize inference techniques tailored for RNGD.
Implement optimized Generative AI model inference methods in the FuriosaAI SDK.
3. Analysis & Research of Existing Inference Frameworks
Analyze the features and source code of existing Generative AI model inference frameworks such as vLLM, TensorRT-LLM, and DeepSpeed-MII.
Research and analyze state-of-the-art Generative AI model inference and system architectures, focusing on optimizing them for Furiosa's TCP architecture.
Use profiling tools like Nsight to analyze GPU execution and study CUDA/Triton kernel performance.
Minimum Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent industry experience.
Proficiency in at least one of the following programming languages: Python, C++, Rust, or CUDA.
Hands-on experience with deep learning frameworks such as PyTorch or TensorFlow.
Strong understanding of Computer Science concepts, particularly networking, multi-processing, multi-threading, and/or distributed systems.
Effective communication skills for discussing project requirements and technical issues.
Preferred Qualifications
Experience using LLM inference frameworks such as vLLM, TensorRT-LLM, or DeepSpeed-MII.
Experience in developing or analyzing large-scale open-source models and projects.
Hands-on experience in developing and researching efficient LLM inference methods.
Deep understanding of Transformer-based model inference.
Strong intellectual curiosity about various deep learning algorithms and applications.
Strong proficiency in Rust programming language.