cv

Basics

Name Xingjian Zhang
Label Master's Student
Email xingjian.zhang [at] qq.c0m
Url https://thuzxj.github.io/

Publications

  • 2025
    Minder: Faulty Machine Detection for Large-scale Distributed Model Training
    NSDI 2025
    Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect the distinctive monitoring metric patterns of a faulty machine, which can last for a period before the entire training task comes to a halt. Minder has been deployed in a production environment for over nine months, monitoring daily distributed training tasks that each involve up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893.
  • 2023
    fKPISelect: Fault-Injection Based Automated KPI Selection for Practical Multivariate Anomaly Detection
    ISSRE 2023
    1. We investigate the issue of KPI selection in multivariate Time Series Anomaly Detection (TSAD) and point out its necessity for practical anomaly detection. 2. We investigate the performance loss of Autoencoder-based multivariate TSAD models when they are applied to real-world cloud systems, in particular when there are a large number of KPIs. 3. We propose fKPISelect, a fault-injection-based automated KPI selection mechanism that bridges the gap between existing models and practical anomaly detection tasks.
  • 2022
    Modeling Composition of Cloud Services with Complex Dependencies for Availability Assessment
    DSN 2022 (Fast Abstract)
    We propose a modeling technique that represents the heterogeneous dependencies among cloud services with a Bayesian network in order to assess the availability of those services.

Work

  • 2023.07 - 2023.10
    Research Intern
    RDMA group at ByteDance
    Work on the reliability engineering system for RDMA systems.
    • Design, implement, and deploy faulty-component localization algorithms for RDMA systems based on mining monitoring data.
    • Design and implement a system to collect, aggregate, and analyze large volumes of fine-grained monitoring data from RDMA systems running large-model training workloads.
    • Develop the next-generation monitoring platform for RDMA systems.
  • 2023.03 - 2023.07
    Research Intern
    Network group at China Telecom Tianyi Cloud
    Research anomaly detection on monitoring data.
  • 2021.09 - 2022.09
    Software Development Intern
    Charging service group at Nio
    Design and develop the backend and frontend web services for Nio's car charging service app.
  • 2021.07 - 2021.09
    Software Development Intern
    Network group at Kwai/Kuaishou Technology
    Develop a Pingmesh-based monitoring system for the network infrastructure.

Education

  • 2022.09 - 2025.07

    Beijing, China

    Master
    Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing, China
    Cloud Computing and Network
  • 2018.09 - 2022.07

    Beijing, China

    Bachelor
    Department of Computer Science and Technology, Tsinghua University, Beijing, China
    Computer Science and Technology
    • GPA: 3.7/4.0 (Top 40%)

Projects

  • 2022.09 - 2022.12
    Automatic Creation and Operation of Experimental Cloud Systems
    An automatic experimental cloud system for collecting data and testing new designs
    • Using Vagrant and Puppet to create server clusters
    • Running microservices and workloads with Kubernetes and Locust
    • Using Ansible, netem, and stress-ng for fault injection
  • 2022.07 - 2022.09
    Deterministic Builds by eBPF
    Achieve deterministic builds by intercepting syscalls with eBPF, for safer binary distribution; a project in the Inclavare Containers community.
    • Built on libbpf; provides command-line tools and Docker environments.
    • Passed the kernel compilation test.

Skills

Web Service Development
Vue
Flask
Django
Django REST framework
Spring Boot
SQL
Docker
Cloud Computing Operation
OpenStack
Kubernetes
FaaS Platforms
Docker
Vagrant
Ansible
Grafana
eBPF
Machine Learning
Basics of Data Science
Basics of Deep Learning
Basics of NLP
PyTorch
Security
Some Reverse Engineering

Languages

Chinese
Native speaker
English
Fluent