cv
Basics
Name | Xingjian Zhang |
Label | Master's Student |
Email | xingjian.zhang [at] qq.c0m |
Url | https://thuzxj.github.io/ |
Publications
-
2025 Minder: Faulty Machine Detection for Large-scale Distributed Model Training
NSDI 2025
Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of the time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect faulty distinctive monitoring metric patterns, which could last for a period before the entire training task comes to a halt. Minder has been deployed in a production environment for over nine months, monitoring daily distributed training tasks where each involves up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and F1-score of 0.893.
-
2023 fKPISelect: Fault-Injection Based Automated KPI Selection for Practical Multivariate Anomaly Detection
ISSRE 2023
1. We investigate the issue of KPI selection in multivariate Time Series Anomaly Detection (TSAD) and point out its necessity for practical anomaly detection. 2. We investigate the performance loss of Autoencoder-based multivariate TSAD models when they are applied to real-world cloud systems, particularly when there are a large number of KPIs. 3. We propose fKPISelect, a fault-injection-based automated KPI selection mechanism that bridges the gap between existing models and practical anomaly detection tasks.
-
2022 Modeling Composition of Cloud Services with Complex Dependencies for Availability Assessment
DSN 2022 Fast Abstract
We propose a modeling technique that represents heterogeneous dependencies among cloud services with a Bayesian network in order to assess the availability of those services.
Work
- 2023.07 - 2023.10
Research Intern
RDMA group in ByteDance
Work on the reliability engineering system for RDMA systems.
- Design, implement, and deploy faulty-component localization algorithms for RDMA systems by mining monitoring data.
- Design and implement a system to collect, aggregate, and analyze large volumes of fine-grained monitoring data from RDMA systems running large-model training workloads.
- Develop the next-generation monitoring platform for RDMA systems.
- 2023.03 - 2023.07
Research Intern
Network group in China Telecom Tianyi Cloud
Conduct research on anomaly detection for monitoring data.
- 2021.09 - 2022.09
Software Development Intern
Charging service group in NIO
Design and develop the backend and frontend web services for NIO's car charging service app.
- 2021.07 - 2021.09
Software Development Intern
Network group in Kwai/Kuaishou Technology
Develop a Pingmesh-based monitoring system for network infrastructure.
Education
-
2022.09 - 2025.07 Beijing, China
Master
Institute for Network Sciences and Cyberspace, Tsinghua University, Beijing, China
Cloud Computing and Network
-
2018.09 - 2022.07 Beijing, China
Bachelor
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Computer Science and Technology
- GPA: 3.7/4.0 (Top 40%)
Awards
- 2022
Outstanding Graduate
Department of CST, Tsinghua University
Projects
- 2022.09 - 2022.12
Automatic Creation and Operation of Experimental Cloud Systems
An automatic experimental cloud system for collecting data and testing new designs
- Using Vagrant and Puppet for creating server clusters
- Running microservices and workloads with Kubernetes and Locust
- Using Ansible, netem and stress-ng for fault injection
- 2022.07 - 2022.09
Deterministic Builds by eBPF
Achieve deterministic builds by intercepting syscalls with eBPF for safer binary distribution; a project in the Inclavare Containers community.
- Provide libbpf-based command-line tools and Docker environments.
- Validated by successfully compiling the kernel.
Skills
Web Service Development | |
Vue | |
Flask | |
Django | |
Django REST framework | |
Spring Boot | |
SQL | |
Docker |
Cloud Computing Operation | |
OpenStack | |
Kubernetes | |
FaaS Platforms | |
Docker | |
Vagrant | |
Ansible | |
Grafana | |
eBPF |
Machine Learning | |
Basics of Data Science | |
Basics of Deep Learning | |
Basics of NLP | |
PyTorch |
Security | |
Some Reverse Engineering |
Languages
Chinese | |
Native speaker |
English | |
Fluent |