Abstract: Detecting anomalous behavior in an engineered system or its components is important for improving reliability, safety, and efficiency across engineering applications.
Abstract: Distributed deep learning (DL) training constitutes a significant portion of the workloads in modern data centers equipped with high-capacity compute hardware such as GPU servers.