Cluster-Wide Task Slowdown Detection in Cloud System

Feiyi Chen, chenfeiyi@zju.edu.cn, Zhejiang University, Alibaba Group, Hangzhou, China
Yingying Zhang, congrong.zyy@alibaba-inc.com, Alibaba Group, Hangzhou, China
Lunting Fan, lunting.fan@taobao.com, Alibaba Group, Hangzhou, China
Yuxuan Liang, yuxliang@outlook.com, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Guansong Pang, gspang@smu.edu.sg, Singapore Management University, Singapore, Singapore
Qingsong Wen, qingsongedu@gmail.com, Squirrel AI, Bellevue, USA
Shuiguang Deng, dengsg@zju.edu.cn, Zhejiang University, Hangzhou, China
(2024)
Abstract.

Slow task detection is a critical problem in cloud operation and maintenance since it is highly related to user experience and can incur substantial liquidated damages. Most anomaly detection methods detect slowdowns at the level of a single task. However, with millions of concurrent tasks in large-scale cloud computing clusters, this becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a malfunction of a cluster, because task duration fluctuates violently in a virtual environment. Thus, we shift our attention to cluster-wide task slowdowns by utilizing the duration time distribution of tasks across a cluster, so that the computational complexity is independent of the number of tasks. The task duration time distribution often exhibits compound periodicity and local exceptional fluctuations over time. Though transformer-based methods are among the most powerful methods for capturing these normal variation patterns in time series, we empirically find and theoretically explain a flaw of the standard attention mechanism: it poorly reconstructs subperiods with low amplitude when dealing with compound periodicity. To tackle these challenges, we propose SORN (i.e., Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism to reconstruct the compound periodicity and a Neural Optimal Transport module to distinguish cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in a practical scenario, we propose a picky loss function, which adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets.

Task slowdown detection, Time series, Unsupervised anomaly detection, AIOps
Copyright: ACM licensed. Journal year: 2024. Conference: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain. DOI: 10.1145/3637528.3671936. ISBN: 979-8-4007-0490-1/24/08. CCS: Computing methodologies → Anomaly detection; Computer systems organization → Cloud computing.

1. Introduction

Slow task detection is a critical issue in cloud operations and maintenance, as it directly impacts user experience and can lead to significant penalties for service level agreement violations (Upadhyay and Sikka, 2020). Most existing anomaly detection methods focus on detecting task slowdowns at the individual task level (Ma et al., 2021; Yang et al., 2023; Zhang et al., 2019; Su et al., 2019). However, with millions of tasks running concurrently (Ma et al., 2021; Zhang et al., 2021) in large-scale cloud computing clusters, these approaches become impractical and inefficient. Moreover, single-task slowdowns are common and may not indicate a cluster malfunction, given the random and dramatic fluctuations in task duration time within a virtual environment. To address these challenges, we pivot towards detecting slowdowns on a cluster-wide scale, which are more indicative of cluster malfunctions and can be identified without examining each individual task. Furthermore, unlike the random fluctuations observed in single-task duration time, the duration time of cluster-wide tasks exhibits more regular patterns, making slowdown detection more feasible. Particularly, we detect cluster-wide task slowdowns using the duration time distribution of a cluster, as illustrated in Fig. 1(a), in which for each time slot we partition the range of task duration time into intervals and calculate the proportion of tasks falling into each interval. This strategic shift not only significantly reduces the computational complexity of our algorithm, making it independent of the number of tasks, but also enhances the accuracy of cluster malfunction detection.

[Figure 1 panels: (a) The task duration time distribution; (b) The compound periodicity; (c) The reconstruction series of attention.]
Figure 1. (a) At each time slot, we use a stacked histogram bar to plot the frequency distribution of the duration time at that slot. We use a darker color to denote the interval requiring more duration time. The stacked histogram bars are arranged in time order. (b) The compound periodicity of task duration time. (c) The original series and the series reconstructed by standard attention are plotted in one figure, where the subperiod with low amplitude cannot be well reconstructed.

Nonetheless, the distribution of normal task duration time is not stable but varies over time. Hence, there arises a necessity to discern the patterns of distribution variation and differentiate routine slowdowns from anomalies. Among the various methods for extracting normal patterns, transformer-based methods stand out as one of the most powerful and effective unsupervised anomaly detection approaches, resulting in numerous distinguished methods (Li et al., 2023; Xu et al., 2022; Tuli et al., 2022a; Yang et al., 2023). Despite the abundance of powerful neural networks available for normal pattern extraction, several challenges persist:

  • Compound periodicity: The distribution of cluster-wide task duration time often exhibits compound periodic variation patterns. Since different tasks exhibit different periodicity, the periodicity of the cluster-wide task duration time distribution is compound and complicated. For example, the series in Fig. 1(b) shows periodicity on both a weekly and a daily basis. As depicted in Fig. 1(c), when two periodicities with different amplitudes and frequencies are integrated into a unified representation, the attention mechanism performs poorly in reconstructing the subperiodicity with relatively low amplitude.

  • Non-slowing exceptional fluctuations: The temporal evolution of task duration time within the cluster is periodic on a global scale, interspersed with localized non-periodic exceptional fluctuations. Among these exceptional fluctuations, only a small fraction corresponds to cluster-wide slowdowns, while the others are not the focus of our work (e.g., we are not concerned about exceptional task speedups). However, traditional anomaly detection methods cannot reconstruct these exceptional fluctuations well and therefore detect all of them as anomalies. To distinguish cluster-wide task slowdowns, it is imperative to accurately reconstruct the other exceptional fluctuations while excluding the cluster-wide slowdowns.

  • Anomalies in the training set: In consideration of the substantial costs linked to manually labeling anomalies, our methodology has been intentionally crafted to function in an unsupervised manner. Nevertheless, it is noteworthy that several unsupervised methods operate on the assumption that anomalies are infrequent within the training set, a premise that tends to be overly optimistic in practical scenarios.

Addressing these challenges is imperative for improving the detection accuracy of compound periodic time series and enhancing model robustness against anomaly contamination in the training set. Therefore, we propose SORN, which Skims Off the subperiodicity with different amplitudes layer by layer and selectively Reconstructs the Non-slowing fluctuations excluding the cluster-wide task slowdowns. It contains three innovative mechanisms to tackle the aforementioned three issues correspondingly: Skimming Attention, Neural Optimal Transport (OT), and Picky Loss.

Specifically, we first theoretically prove that the standard attention mechanism tends to allocate more attention to subperiods with higher amplitudes in compound periodic time series. This bias prevents it from effectively reconstructing subperiods with relatively low amplitudes. Building on this analysis, we introduce a skimming attention mechanism to capture the compound periodicity pattern, where we sequentially skim off subperiods from the original sequence in descending order of amplitudes and reconstruct iteratively from the remaining series. In this way, the subperiods with higher amplitudes are initially well reconstructed and skimmed off from the original time series. After that, the subperiods with low amplitudes in the original series become subperiods with relatively high amplitudes in the remaining series and can be better reconstructed.
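The skimming idea described above can be illustrated with a small sketch. It is not the Skimming Attention mechanism itself: as a stand-in for the learned reconstruction, it uses an FFT to pick the single largest-amplitude sinusoid, subtracts it, and repeats on the remainder. The series, its length `n_slots`, and the helper `dominant_component` are all hypothetical choices made for illustration.

```python
import numpy as np

def dominant_component(series):
    """Return (FFT bin, time-domain sinusoid) of the largest-amplitude non-constant component."""
    spectrum = np.fft.rfft(series)
    spectrum[0] = 0.0                      # ignore the mean
    k = int(np.argmax(np.abs(spectrum)))
    single = np.zeros_like(spectrum)
    single[k] = spectrum[k]                # keep only the dominant bin
    return k, np.fft.irfft(single, n=len(series))

# Hypothetical compound periodic series: a strong slow subperiod plus a weak fast one.
n_slots = 240
t = np.arange(n_slots)
series = 3.0 * np.cos(2 * np.pi * 2 * t / n_slots) + 1.0 * np.sin(2 * np.pi * 10 * t / n_slots)

remaining = series.copy()
skimmed_bins = []
for _ in range(2):
    k, component = dominant_component(remaining)
    skimmed_bins.append(k)                 # subperiods come out in descending amplitude order
    remaining = remaining - component      # skim the component off before the next pass

print(skimmed_bins)
```

In this toy setting the weak subperiod becomes the dominant one in the remaining series once the strong subperiod has been removed, which is the effect the paragraph above relies on.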

Subsequently, we use a Neural OT module to adjust the reconstructed series of skimming attention, where we innovatively transform the traditional optimal transport problem into a neural network, and by intricately designing a transportation cost matrix, we can selectively reconstruct the non-slowing fluctuations.

Furthermore, to mitigate the negative effect of anomaly contamination in the training set, we design a novel picky loss function, which allocates different weights to time slots in the loss function according to their reliability.

Accordingly, this work presents several novel and distinctive contributions to the field of cluster-wide slow task detection:

  • We present the first attempt to formalize the cluster-wide slowdown problem with the identification of the problem specifications and relevant challenges.

  • We provide a theoretical explanation for the limitations of the standard attention mechanism in effectively reconstructing subperiods with low amplitude in compound periodicity. Moreover, we introduce a novel skimming attention mechanism designed to extract subperiodic components with varying amplitudes and aggregate them to ensure accurate reconstruction of both high and low-amplitude subperiods.

  • We introduce a novel Neural OT module tailored to reconstruct the normal non-periodic fluctuations observed in the duration time distribution, while effectively filtering out the cluster-wide slow-down anomalies.

  • We propose a picky loss function that assigns higher weights to reliable time slots within the loss function.

Besides, we conducted extensive experiments and demonstrated that our method outperforms the state-of-the-art (SOTA) methods in F1 score on real-world industrial datasets.

2. Preliminary

2.1. Optimal Transport (OT)

We are given a set of value intervals $I=\{(s_{1},s_{2}],(s_{2},s_{3}],\dots,(s_{n-1},s_{n}]\}$ and two distributions $\mathbf{a}\in R^{N}$ and $\mathbf{b}\in R^{N}$, where $\mathbf{a}_{i}=P(s_{i}<x\leq s_{i+1}),\ x\sim\mathbf{a}$. Similarly, $\mathbf{b}_{i}=P(s_{i}<x\leq s_{i+1}),\ x\sim\mathbf{b}$. The Optimal Transport problem aims at transforming distribution $\mathbf{a}$ to $\mathbf{b}$ by moving a fraction of the amount in each interval of $\mathbf{a}$ to another interval. Moving a unit from the $j^{th}$ interval to the $i^{th}$ interval costs a price $C_{i,j}$. The Optimal Transport problem seeks the transport strategy $P$ with the lowest total cost, where $P_{i,j}$ denotes the amount moved from the $j^{th}$ interval to the $i^{th}$ interval, as shown in Eq. 1, where $\langle P,C\rangle$ denotes the Frobenius dot-product.

$$\begin{split}\min_{P}\ &\langle P,C\rangle,\\ \text{s.t.}\ &P\cdot\vec{1}=\mathbf{b},\ P^{T}\cdot\vec{1}=\mathbf{a}.\end{split}\tag{1}$$
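To make the constraints of Eq. 1 concrete, the following is a minimal sketch for a toy case. It assumes the illustrative cost $C_{i,j}=|i-j|$ (not the cost matrix designed in this paper); for such a cost on ordered intervals, the monotone "north-west corner" plan is a standard optimal solution. The function name `monotone_plan` and the example distributions are hypothetical.

```python
import numpy as np

def monotone_plan(a, b, eps=1e-12):
    """Greedy monotone coupling. P[i, j] is the mass moved from interval j of a to interval i of b."""
    a = np.array(a, dtype=float)           # np.array copies, so the inputs stay untouched
    b = np.array(b, dtype=float)
    P = np.zeros((len(b), len(a)))
    i = j = 0
    while i < len(b) and j < len(a):
        moved = min(a[j], b[i])            # move as much as both sides allow
        P[i, j] += moved
        a[j] -= moved
        b[i] -= moved
        if b[i] <= eps:                    # target interval i is full
            i += 1
        if a[j] <= eps:                    # source interval j is empty
            j += 1
    return P

a = np.array([0.5, 0.3, 0.2])              # hypothetical source distribution
b = np.array([0.2, 0.3, 0.5])              # hypothetical target distribution
n = len(a)
C = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]).astype(float)  # assumed cost |i - j|

P = monotone_plan(a, b)
cost = float(np.sum(P * C))                # Frobenius dot-product <P, C>
print(P.sum(axis=1), P.sum(axis=0), cost)  # row sums match b, column sums match a
```

Row sums reproduce $\mathbf{b}$ and column sums reproduce $\mathbf{a}$, matching the two constraints of Eq. 1 under the convention that $P_{i,j}$ moves mass from interval $j$ to interval $i$.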

2.2. Problem Setup

Definition 1. $f_{t}$ and $f_{t}^{*}$ are used to denote the real-time distribution and expected distribution at time slot $t$. $f_{t}(\alpha)$ and $f_{t}^{*}(\alpha)$ are used to denote the $\alpha$-quantile of distribution $f_{t}$ and $f_{t}^{*}$. $\mathcal{T}$ is used to denote the threshold for tolerable fluctuation range of duration time distribution.
Definition 2. If there is a slowdown at time slot $t$, then $\max_{\alpha}f_{t}(\alpha)-f^{*}_{t}(\alpha)>\mathcal{T},\forall\alpha$.
Definition 3. (Input data & output data) Given a set of intervals $I=\{[s_{1},s_{2}),[s_{2},s_{3}),\dots,[s_{D},s_{D+1})\}$, the input data is a $T$-length and $D$-dimensional multivariate time series $x\in R^{T\times D}$, where $x[t,d]$ is the number of tasks whose duration time falls into the $d^{th}$ interval $[s_{d},s_{d+1})$. The reconstruction series of SORN is denoted by $\dot{\tilde{x}}$.
Problem Formalization. We use a SORN to obtain a reconstruction series $\dot{\tilde{x}}$ from the original input data $x$. Subsequently, we use an anomaly score function $\operatorname{AnomalyScore}(\dot{\tilde{x}},x,I)$. We aim to maximize the anomaly score gap between the slow-down time slots and the others.
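As an illustration of Definition 3, the sketch below builds the input series $x$ from raw task durations. The interval edges, the sample durations, and the function name `duration_histogram` are hypothetical; durations outside $[s_{1},s_{D+1})$ are simply dropped here, which is an assumption the definition does not spell out.

```python
import numpy as np

def duration_histogram(durations_per_slot, edges):
    """x[t, d] = number of tasks in slot t whose duration lies in [edges[d], edges[d+1])."""
    edges = np.asarray(edges, dtype=float)
    n_bins = len(edges) - 1
    x = np.zeros((len(durations_per_slot), n_bins), dtype=int)
    for t, durations in enumerate(durations_per_slot):
        # digitize returns i with edges[i-1] <= value < edges[i]; shift to 0-based bins
        idx = np.digitize(np.asarray(durations, dtype=float), edges) - 1
        idx = idx[(idx >= 0) & (idx < n_bins)]   # drop out-of-range durations (assumption)
        x[t] = np.bincount(idx, minlength=n_bins)
    return x

edges = [0.0, 1.0, 2.0, 4.0]                     # hypothetical interval boundaries s_1..s_{D+1}
durations_per_slot = [
    [0.5, 1.5, 1.0, 3.9],                        # tasks observed in time slot 0
    [0.1, 0.2, 4.0],                             # time slot 1; 4.0 falls outside [0, 4)
]
x = duration_histogram(durations_per_slot, edges)
proportions = x / np.maximum(x.sum(axis=1, keepdims=True), 1)
print(x)
print(proportions)
```

The size of `x` depends only on the number of time slots and intervals, not on the number of tasks, which is the property the problem setup relies on.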

3. Methodology

The overview of SORN is depicted in Fig. 2(a). We sequentially mask each time slot in x𝑥xitalic_x and employ a multi-layer Skimming Attention mechanism to reconstruct the time slot by leveraging compound periodic information. Subsequently, we utilize Neural OT to fine-tune the reconstructed series obtained from Skimming Attention, capturing aperiodic but typical fluctuations in the time series. Finally, we apply the picky loss function to assign higher weights to normal time slots while assigning lower weights to occasional anomalous slots in the loss function.

[Figure 2 panels: (a) The model architecture of SORN; (b) A layer of Skimming Attention.]
Figure 2. The model architecture of the proposed SORN algorithm.

[Figure 3 panels: (a) The amplitude of different periods; (b) Attention weights along different subperiods; (c) The series from different Skimming Attention layers; (d) The reconstruction and loss weight of SORN.]
Figure 3. (a) The figure shows different amplitudes of different subperiods; (b) The figure shows attention weight along different subperiods in $f(t)$. The width of the shadow is the value of the attention weight divided by 100 at the corresponding time slot. To distinguish the positive attention weight and negative attention weight we plot them in different colors and denote them by $\mathcal{A}^{+}$ and $\mathcal{A}^{-}$ respectively. (c) & (d) The visualization of SORN.

3.1. Skimming Attention

The duration time distribution usually exhibits compound periodic fluctuations, as shown in Fig. 1(b). In a compound periodic series, different subperiods usually have different amplitudes (i.e., variation ranges) (Wen et al., 2020), as shown in Fig. 3(a). When dealing with this kind of compound periodicity, the standard attention mechanism falls short in reconstructing the subperiod with low amplitude. Fig. 1(c) illustrates this: when we fuse two periodicities with different amplitudes and frequencies, the standard attention reconstructs only the one with the higher amplitude well. We theoretically explain this phenomenon in Theorem 1 and Theorem 2, where we prove that a self-attention mechanism pays more attention to the subperiod with relatively higher amplitude in a compound periodic series, which degrades the reconstruction of the subperiods with lower amplitudes. Thus, we propose a skimming attention that masks each time slot in turn and aims at reconstructing it from compound periodic information. There are two challenges to achieving this. On the one hand, we need to prevent each layer from reconstructing time slots merely by leveraging the similarity of adjacent time slots while neglecting the periodic information. On the other hand, we need to reconstruct every subperiod well rather than just those with high amplitudes.

We deduce Theorems 1-2 using the same setting as the self-attention mechanism in a patching transformer (Nie et al., 2023), where a time series is split into a set of $p$-length patches, which constitute the query, key, and value vectors of a self-attention mechanism. We start with a simple case and generalize it to a general situation. In the derivation, we omit the final step of applying softmax to the attention weights, as softmax does not alter the relative order of the attention weights assigned to different time slots in the sequence and will not affect the conclusion.
Definition 4. Given $a,b\in Z,\ a\neq b$, we set the patch length $p$ to $\text{lcm}(a,b)$, where $\text{lcm}(a,b)$ denotes the least common multiple of $a$ and $b$. We are given a series with compound periodicity $f(t)=c_{1}\cos(\omega_{1}t)+c_{2}\sin(\omega_{2}t)$, where $\omega_{1}=\frac{2a\pi}{p}$, $\omega_{2}=\frac{2b\pi}{p}$ and $c_{1},c_{2}\in R,\ c_{1}>c_{2}$. There are two subperiod components in $f(t)$: $f_{1}(t)=c_{1}\cos(\omega_{1}t)$ and $f_{2}(t)=c_{2}\sin(\omega_{2}t)$. We use $T_{1}$ and $T_{2}$ to denote the period lengths of $f_{1}$ and $f_{2}$ respectively.
Theorem 1. In $f(t)$, when taking the patch starting from the $t_{1}^{th}$ time slot as the query, the attention weight of the patch starting from the $t_{2}^{th}$ time slot is $\frac{p}{2}[c_{1}^{2}\cos\omega_{1}\Delta t+c_{2}^{2}\cos\omega_{2}\Delta t]$, where $\Delta t=(t_{2}-t_{1})$.
Proof. Please refer to Appendix A for more details.

Taking a further look at the attention weight $\frac{p}{2}[c_{1}^{2}\cos\omega_{1}\Delta t+c_{2}^{2}\cos\omega_{2}\Delta t]$, it is a linear combination of $\cos\omega_{1}\Delta t$ and $\cos\omega_{2}\Delta t$. The first term distributes attention weight according to the periodicity of $f_{1}$: it assigns the highest attention weight to the time slots that are $nT_{1}$ slots apart from the query time slot, where $n\in Z$ (i.e., when $\Delta t=nT_{1}$, $\cos\omega_{1}\Delta t$ reaches its maximum value). Similarly, the second term distributes attention weight according to the periodicity of $f_{2}$ and assigns the highest attention weight to the time slots that are $nT_{2}$ slots apart from the query time slot. Moreover, the impact of each term on the attention weight is decided by the amplitude of its corresponding subperiod. Since $c_{1}>c_{2}$, $\cos\omega_{1}\Delta t$ contributes more to the attention weight. Thus, the periodic information of $f_{1}$ obtains higher attention weight and $f_{1}$ will be reconstructed better. As shown in Fig. 3(b), the highest attention weights show up at the time slots that are $nT_{1}$ slots apart from the query slot, regardless of the subperiod with period length $T_{2}$.
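The closed form in Theorem 1 can be checked numerically with a short sketch. The values of `p`, `a`, `b`, `c1`, and `c2` below are hypothetical. As a discretization assumption that differs from Definition 4, `p` is taken as a patch of 60 samples containing `a` and `b` whole cycles (with `a`, `b` below `p/2`), so that $\omega_{1}=\frac{2a\pi}{p}$ and $\omega_{2}=\frac{2b\pi}{p}$ still hold and the sampled sums match the formula exactly. Softmax and learned projections are omitted, as in the derivation.

```python
import numpy as np

p, a, b = 60, 2, 5                       # hypothetical patch length and cycles per patch
c1, c2 = 3.0, 1.0                        # amplitudes with c1 > c2
w1, w2 = 2 * a * np.pi / p, 2 * b * np.pi / p
T1, T2 = p // a, p // b                  # period lengths of f1 and f2 (30 and 12 slots)

def f(t):
    """Compound periodic series from Definition 4."""
    return c1 * np.cos(w1 * t) + c2 * np.sin(w2 * t)

def raw_attention(t1, t2):
    """Dot product between the p-length patches starting at t1 (query) and t2 (key)."""
    offsets = np.arange(p)
    return float(f(t1 + offsets) @ f(t2 + offsets))

def theorem1_weight(dt):
    """Closed form of Theorem 1 for a shift dt = t2 - t1."""
    return p / 2 * (c1 ** 2 * np.cos(w1 * dt) + c2 ** 2 * np.cos(w2 * dt))

for t1, t2 in [(0, 7), (3, 33), (11, 23)]:
    print(t1, t2, raw_attention(t1, t2), theorem1_weight(t2 - t1))

# A shift of T1 gets a large weight; a shift of T2 does not, because c1**2 dominates.
print(theorem1_weight(T1), theorem1_weight(T2))
```

With these values the weight at a shift of $T_{1}$ is large and positive while the weight at a shift of $T_{2}$ is negative, which mirrors the observation that the low-amplitude subperiod receives little attention.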

To generalize Theorem 1 to a general situation, given a time series $\tilde{f}(t)$ with compound periodicity, we use trigonometric series to decompose it as defined in Definition 5.
Definition 5. Given a compound periodic time series $\tilde{f}(t)$ with period length $p$, we set the patch length to $p$. We decompose $\tilde{f}(t)$ into a linear combination of trigonometric series as: $\tilde{f}(t)=\frac{a_{0}}{2}+\sum_{n=1}^{\infty}(a_{n}\cos{\omega_{n}t}+b_{n}\sin{\omega_{n}t})$, where $\omega_{n}=\frac{2n\pi}{p}$ and $a_{n}$ and $b_{n}$ are the coefficients of the trigonometric series.
Theorem 2. In $\tilde{f}(t)$, when taking the patch starting from the $t_{1}^{th}$ time slot as the query, the attention weight of the patch starting from the $t_{2}^{th}$ time slot is $\frac{a_{0}^{2}p}{4}+\frac{p}{2}\sum_{n=1}^{\infty}(a_{n}^{2}+b_{n}^{2})\cos\omega_{n}\Delta t$, where $\Delta t=(t_{2}-t_{1})$.
Proof. Please refer to Appendix B for more details.

Similar to the analysis of $f(t)$, the attention weight of $\tilde{f}(t)$ is a linear combination of $\cos\omega_{n}\Delta t$. The subperiods with higher amplitudes are more decisive for the attention weight distribution, and the periodic information of these subperiods obtains higher attention weights. Thus, the subperiods with higher amplitudes are more likely to be reconstructed well, while the subperiods with low amplitudes can be poorly reconstructed.
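Theorem 2 can be checked in the same way on a truncated trigonometric series. The sketch below is illustrative only: the patch length `p`, the number of harmonics, and the randomly drawn coefficients are hypothetical, and the series is truncated to a few harmonics below `p/2` so that the sampled sums match the closed form.

```python
import numpy as np

p = 64                                           # hypothetical period and patch length
n_harmonics = 5                                  # truncate the infinite sum (assumption)
rng = np.random.default_rng(0)
a0 = rng.normal()
a = rng.normal(size=n_harmonics)                 # cosine coefficients a_n
b = rng.normal(size=n_harmonics)                 # sine coefficients b_n
omega = 2 * np.pi * np.arange(1, n_harmonics + 1) / p

def f_tilde(t):
    """Truncated trigonometric series from Definition 5, evaluated at an array of slots t."""
    t = np.asarray(t, dtype=float)[..., None]
    return a0 / 2 + np.sum(a * np.cos(omega * t) + b * np.sin(omega * t), axis=-1)

def patch_dot(t1, t2):
    """Dot product between the p-length patches starting at t1 (query) and t2 (key)."""
    offsets = np.arange(p)
    return float(f_tilde(t1 + offsets) @ f_tilde(t2 + offsets))

def theorem2_weight(dt):
    """Closed form of Theorem 2 for a shift dt = t2 - t1."""
    return float(a0 ** 2 * p / 4 + p / 2 * np.sum((a ** 2 + b ** 2) * np.cos(omega * dt)))

for t1, t2 in [(0, 5), (4, 40), (9, 73)]:
    print(t1, t2, patch_dot(t1, t2), theorem2_weight(t2 - t1))
```

Each harmonic enters the weight through $a_{n}^{2}+b_{n}^{2}$, so a component with twice the amplitude has four times the influence on where attention concentrates.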

We show the architecture of each skimming attention layer in Fig. 2(b), which aims at preventing the attention mechanism from directly reconstructing time slots by making use of the similarity of adjacent time slots. As shown in Fig. 2(b), we first use a sliding window with padding to extend the input data $x\in R^{T\times D}$ to $\bar{x}\in R^{T\times(p+1)\times D}$, where $p+1$ denotes the window length of the sliding window. Subsequently, we take each dimension separately (taking the $d^{th}$ dimension as an example) and, consistent with Eq. 3-Eq. 5, use the first $p$-length series in each window as the queries and keys and the final time slot in each window as the values. This process is shown in Eq. 2-Eq. 5, where $[:p+1]$ denotes the time slices from the beginning to the $p^{th}$ one.

(2) $\bar{x} = \operatorname{SlidingWindow}(x, p+1, 1),$
(3) $q_d = \bar{x}[:, :p+1, d],$
(4) $k_d = \bar{x}[:, :p+1, d],$
(5) $v_d = \bar{x}[:, p+1, d].$
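The windowing in Eq. 2-Eq. 5 can be sketched in NumPy as follows. The padding scheme is not specified in the text, so this sketch assumes zero left-padding such that the window of time slot $t$ ends at slot $t$; the helper names `sliding_window` and `split_qkv` are hypothetical, and the third argument of SlidingWindow is assumed to be a stride of 1.

```python
import numpy as np

def sliding_window(x, p):
    """Extend x of shape (T, D) to x_bar of shape (T, p + 1, D).

    Assumption: zero left-padding, so window t covers slots t - p .. t.
    """
    T, D = x.shape
    padded = np.concatenate([np.zeros((p, D)), x], axis=0)
    # idx[t] lists the p + 1 padded positions that belong to window t (stride 1).
    idx = np.arange(T)[:, None] + np.arange(p + 1)[None, :]
    return padded[idx]

def split_qkv(x_bar, d):
    """Take dimension d: the first p slots form q and k, the final slot is v."""
    p = x_bar.shape[1] - 1
    q = x_bar[:, :p, d]   # Eq. 3, shape (T, p)
    k = x_bar[:, :p, d]   # Eq. 4, shape (T, p)
    v = x_bar[:, p, d]    # Eq. 5, shape (T,)
    return q, k, v

x = np.arange(12.0).reshape(6, 2)   # toy input with T = 6, D = 2
x_bar = sliding_window(x, p=3)
q, k, v = split_qkv(x_bar, d=0)
print(x_bar.shape, q.shape, v.shape)
```

Under this padding assumption, the value of window $t$ is simply the original observation at slot $t$, and its query and key are the $p$ slots that precede it.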

Afterward, as shown in Eq. 6, we apply a standard attention mechanism to the queries, keys, and values and obtain the attention weights $\mathcal{A} \in R^{T\times T}$, where $\mathcal{A}_{i,j}$ denotes the $j^{th}$ attention weight for the $i^{th}$ query. Then, in Eq. 7, we multiply $\mathcal{A}$ by a gate curve $G \in R^{T\times T}$, where $G[i,j] = 1 - \exp\left(-\frac{(i-j)^2}{\sigma^2}\right)$, $\sigma$ is a learnable parameter, and $*$ denotes element-wise multiplication. In this way, the attention weights of time slots that are closer to the query are harder to pass through the gate, while those of farther time slots pass easily. Consequently, we can force the attention mechanism to put more weight on the hopping time slots. Finally, we obtain the reconstruction series of this layer as in Eq. 8, where $\tilde{x}_l$ denotes the reconstruction series of the $l^{th}$ skimming attention layer:

(6) $\mathcal{A} = q_d k_d^T,$
(7) $\tilde{\mathcal{A}} = \operatorname{softmax}(\mathcal{A} * G),$
(8) $\tilde{x}_l[:, d] = \tilde{\mathcal{A}} v_d.$
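A minimal NumPy sketch of Eq. 6-Eq. 8 for a single dimension is given below. It treats $\sigma$ as a fixed number, whereas the method learns it; the function names and the random toy inputs are assumptions made only for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gate_curve(T, sigma):
    """G[i, j] = 1 - exp(-(i - j)^2 / sigma^2): 0 on the diagonal, near 1 far from it."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return 1.0 - np.exp(-((i - j) ** 2) / sigma**2)

def gated_attention(q, k, v, sigma):
    """Gated attention for one dimension; sigma is fixed here (learnable in the method)."""
    A = q @ k.T                                            # Eq. 6, shape (T, T)
    A_tilde = softmax(A * gate_curve(len(v), sigma), -1)   # Eq. 7, gate applied before softmax
    return A_tilde @ v, A_tilde                            # Eq. 8

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 3))      # toy queries/keys: T = 8 windows of length p = 3
v = rng.normal(size=8)           # toy values: final slot of each window
recon, A_tilde = gated_attention(q, q, v, sigma=2.0)
print(recon.shape, A_tilde.sum(axis=1))
```

Note that the gate scales the raw scores before the softmax, so a score between a slot and itself is set to zero rather than removed; scores of distant slots pass almost unchanged.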

We organize different layers of skimming attention as follows to deal with compound periodic information:

(9) $\tilde{x}_l = \operatorname{SkimmingAttentionLayer}(x_l), \quad x_{l+1} = x_l - \tilde{x}_l,$

where $x_l$ is the input data of the $l^{th}$ layer and $x_0 = x$. There are two benefits to organizing the skimming attention layers in this way. On the one hand, each skimming attention layer skims off the subperiod with the highest amplitude in $x_l$, and the next layer can pay more attention to the subperiod with relatively low amplitude in the remaining series. Consequently, the subperiods with different amplitudes can be reconstructed well. We show the reconstruction series of different skimming attention layers in Fig. 3(c), where the subperiods are reconstructed in descending amplitude order. On the other hand, it can also prevent the vanishing gradient problem as ResNet does, since the input of every layer can also be written as $x_l = x - \sum_{k=0}^{l-1}\tilde{x}_k$.
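The layer organization of Eq. 9 can be sketched as a simple residual loop. The function `skim` and the stand-in layer below are hypothetical: any callable that maps a series to its reconstruction can be plugged in, and the stand-in (which reconstructs half of its input) is used only to check the bookkeeping.

```python
import numpy as np

def skim(x, layer_fn, num_layers):
    """Apply Eq. 9: each layer reconstructs part of the remaining series and
    passes the residual on to the next layer."""
    residual = np.array(x, dtype=float)
    reconstructions = []
    for _ in range(num_layers):
        x_tilde = layer_fn(residual)       # reconstruction of the l-th layer
        reconstructions.append(x_tilde)
        residual = residual - x_tilde      # x_{l+1} = x_l - x_tilde_l
    return reconstructions, residual

# Stand-in layer (an assumption for illustration, not a skimming attention layer).
half = lambda r: 0.5 * r

x = np.sin(np.linspace(0.0, 6.0, 50))
recons, rest = skim(x, half, num_layers=3)
print(np.allclose(x, sum(recons) + rest))
```

By construction, the input always equals the sum of all layer reconstructions plus the final residual, which is the identity used in the text to relate $x_l$ to $x$.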

3.2. Neural OT

Besides the periodic patterns, there are still aperiodic but normal fluctuations in the task duration time distribution. Since we only pay attention to slow-down anomalies but not others (e.g., the execution speed of jobs has significantly increased), we aim to model these aperiodic fluctuations while hindering only the reconstruction of slow-down anomalies. Inspired by the Optimal Transport (OT) algorithm, we transform a standard OT problem into a neural network and embed it into our model so that our model becomes end-to-end.

We first establish an OT problem and then transform it into a neural network. For each time slot $t$, we take its reconstructed duration time distribution $\tilde{x}[t] \in R^{1\times d}$ as a source distribution and its original duration time distribution $x[t] \in R^{1\times d}$ as a target distribution. The OT problem seeks a transport strategy $P$ that transforms the source distribution into the target distribution with a minimum cost $\langle P * \tilde{x}[t], C\rangle$, where $P[d,s]$ denotes the ratio of $\tilde{x}[t,s]$ transported to $\tilde{x}[t,d]$, $C[d,s]$ denotes the cost of transporting a unit from $\tilde{x}[t,s]$ to $\tilde{x}[t,d]$, and $*$ denotes element-wise multiplication. According to the definition of $P$, $P\tilde{x}[t]$ denotes the distribution after applying the transport strategy $P$ to the source distribution $\tilde{x}[t]$, which should approach the target distribution $x[t]$, and the sum of each column of $P$ should be $1$. Thus, we formulate $|P\tilde{x}[t] - x[t]|$ as an optimization goal and $P^T\vec{1} = \vec{1}$ as a constraint in our OT problem. To reconstruct anomalies except the slow ones, we set $C$ as follows:

(10) $C_{i,j} = \begin{cases} M[i] - M[j], & i > j \\ 0, & \text{else}, \end{cases}$

where $M[i]$ is the midpoint of the $i^{th}$ interval in $I$ ($I$ is defined in Definition 3). In this way, only the slow-down distribution shift is penalized by the transport cost. Based on the setting above, we formulate an OT problem as:

(11) $\min\ \lambda\langle P * \tilde{x}[t], C\rangle + \left|P\tilde{x}[t] - x[t]\right|_2, \quad \text{s.t.}\ P^T\vec{1} = \vec{1},$

where $\lambda$ is a hyperparameter belonging to $[0,1]$.
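The asymmetric cost of Eq. 10 and the objective of Eq. 11 can be illustrated with a small NumPy sketch. The interval midpoints, the two toy transport plans, and the function names are assumptions chosen for illustration; the element-wise product $P * \tilde{x}[t]$ is interpreted here as scaling column $s$ of $P$ by the source mass $\tilde{x}[t,s]$.

```python
import numpy as np

def slowdown_cost(M):
    """Eq. 10: C[i, j] = M[i] - M[j] if i > j (mass moves to a slower bin), else 0."""
    M = np.asarray(M, dtype=float)
    diff = M[:, None] - M[None, :]
    return np.tril(diff, k=-1)             # keep only the strictly lower triangle

def ot_objective(P, x_tilde_t, x_t, C, lam):
    """Eq. 11 for one time slot: lam * <P * x_tilde, C> + ||P x_tilde - x||_2."""
    transport = np.sum(P * x_tilde_t * C)  # broadcasting scales column s by x_tilde_t[s]
    fit = np.linalg.norm(P @ x_tilde_t - x_t)
    return lam * transport + fit

M = [5.0, 15.0, 30.0]                      # hypothetical interval midpoints (minutes)
C = slowdown_cost(M)

# Slow-down: all mass moves from the fastest bin to the slowest bin.
P_slow = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
slow = ot_objective(P_slow, np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]), C, lam=0.5)

# Speed-up: all mass moves from the slowest bin to the fastest bin.
P_fast = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
fast = ot_objective(P_fast, np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0]), C, lam=0.5)
print(slow, fast)
```

Both plans match their targets exactly, but only the slow-down plan pays a transport cost, which is what lets the model reconstruct speed-ups freely while leaving slow-downs hard to reconstruct.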

Furthermore, we transform it into a neural network. We take $P$ as a trainable parameter. To meet its constraint in the OT problem, we manipulate $P$ as $\operatorname{softmax}(P^T)^T$, and the neural layer is specified as:

(12) $\dot{\tilde{x}} = \operatorname{softmax}(P^T)^T\tilde{x}$

Besides, we also introduce the optimization objective of the OT problem to the loss function.
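The layer in Eq. 12 can be sketched as follows, assuming the softmax is taken over the last axis so that $\operatorname{softmax}(P^T)^T$ has columns that sum to 1. The raw parameter is a random array here rather than a trained one, and the function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def neural_ot_layer(P_raw, x_tilde):
    """Eq. 12 applied to every time slot of x_tilde with shape (T, D).

    softmax(P_raw^T)^T normalizes each column of P, which enforces P^T 1 = 1.
    """
    P = softmax(P_raw.T, axis=-1).T
    # Row t of the result equals P @ x_tilde[t].
    return x_tilde @ P.T, P

rng = np.random.default_rng(0)
P_raw = rng.normal(size=(4, 4))            # stands in for the trainable parameter
x_tilde = rng.random((5, 4))               # toy reconstruction: T = 5, D = 4
x_dot, P = neural_ot_layer(P_raw, x_tilde)
print(P.sum(axis=0), x_dot.shape)
```

Because every column of $P$ sums to 1, the layer only redistributes mass between intervals: the total mass of each time slot is unchanged.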

3.3. Picky Loss Function

Reconstruction-based methods assume that there are no anomalies in the training set. However, anomalies in the training set are inevitable in unsupervised learning scenarios. Thus, we propose a picky loss function, which adaptively assigns a weight $\mathcal{W} \in R^{T}$ to the loss of each time slot according to its trustworthiness. The more trustworthy a time slot is, the higher its weight is. AnomalyTransformer (Xu et al., 2022) points out that normal points can establish broad, informative associations along the whole series in the attention mechanism, whereas anomalies concentrate only on adjacent time slots. Inspired by this, we utilize the attention weight $\mathcal{A}$ in Section 3.1 to obtain the weight $\mathcal{W}$. We use a trainable gate curve $\hat{G} \in R^{T\times T}$ to filter out the attention weight in the adjacent part, where $\hat{G}[i,j] = 1 - \exp\left(-\frac{(i-j)^2}{\hat{\sigma}^2}\right)$ and $\hat{\sigma}$ is a trainable parameter, and obtain $\mathcal{W}$ via:

(13) $\mathcal{W} = \operatorname{softmax}[(\mathcal{A} * \hat{G})\vec{1}].$

We obtain the final loss function by assigning the weight $\mathcal{W}$ to each time slot and fusing the optimization objective in Section 3.2:

(14) $\mathcal{L} = \sum_{t=0}^{T}\mathcal{W}[t]\left(\left|\dot{\tilde{x}}[t] - x[t]\right|_2 + \lambda\left\langle P * \tilde{x}[t], C\right\rangle\right).$

As shown in Fig. 3(d), the picky loss function renders lower weights to the anomaly time slots.
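Eq. 13 and Eq. 14 can be sketched in NumPy as below. The gate parameter $\hat{\sigma}$ is fixed instead of trained, the inputs are toy arrays, and the function names are hypothetical; the sketch only illustrates how the weights and the weighted loss are assembled.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def picky_weights(A, sigma_hat):
    """Eq. 13: gate the attention scores, sum each row, and normalize over time."""
    T = A.shape[0]
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    G_hat = 1.0 - np.exp(-((i - j) ** 2) / sigma_hat**2)
    return softmax((A * G_hat).sum(axis=1))

def picky_loss(W, x_dot, x_tilde, x, P, C, lam):
    """Eq. 14: per-slot reconstruction error plus transport cost, weighted by W."""
    recon = np.linalg.norm(x_dot - x, axis=1)                 # shape (T,)
    transport = np.einsum("ds,ts,ds->t", P, x_tilde, C)       # <P * x_tilde[t], C> per slot
    return float(np.sum(W * (recon + lam * transport)))

T, D = 6, 3
W = picky_weights(np.zeros((T, T)), sigma_hat=2.0)            # uninformative scores
x = np.ones((T, D))
x_dot = x + np.array([3.0, 4.0, 0.0])                         # error of norm 5 in every slot
loss = picky_loss(W, x_dot, x, x, np.eye(D), np.zeros((D, D)), lam=0.5)
print(W, loss)
```

With uninformative attention scores the weights are uniform, so the loss reduces to the average per-slot error; time slots whose gated attention mass is small receive proportionally less weight.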

3.4. Anomaly Score

Since the duration time of different tasks is not uniformly distributed, we split the distribution intervals $I$ according to the distribution density of task duration time. This leads to heterogeneous importance of the reconstruction errors for different intervals. However, the trivial anomaly score, which adds the reconstruction errors of different intervals together, ignores this heterogeneity. Thus, we use the difference between the task duration time expectations of the original distribution and the reconstructed one as the anomaly score:

(15) $\operatorname{AnomalyScore}[t] = \mathbb{E}(\bar{T}(x[t])) - \mathbb{E}(\bar{T}(\dot{\tilde{x}}[t])) = \sum_{d=0}^{D} x[t,d] * M[d] - \sum_{d=0}^{D} \dot{\tilde{x}}[t,d] * M[d],$

where $\operatorname{AnomalyScore}[t]$ denotes the anomaly score of the $t^{th}$ time slot, and $\bar{T}(x[t])$ and $\bar{T}(\dot{\tilde{x}}[t])$ denote two random variables: the task duration time drawn from distributions $x[t]$ and $\dot{\tilde{x}}[t]$, respectively.
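As a small worked example of Eq. 15, the sketch below computes the score from interval midpoints. The midpoints and the two distributions are made-up numbers, and `anomaly_score` is a hypothetical helper name.

```python
import numpy as np

def anomaly_score(x, x_dot, M):
    """Eq. 15: expected duration under the observed distribution minus the
    expected duration under the reconstructed one, for every time slot."""
    M = np.asarray(M, dtype=float)
    return x @ M - x_dot @ M

M = [5.0, 15.0, 30.0]                      # hypothetical interval midpoints (minutes)
x = np.array([[0.2, 0.3, 0.5]])            # observed: most tasks fall in the slowest interval
x_dot = np.array([[0.5, 0.3, 0.2]])        # reconstruction: most tasks in the fastest interval
score = anomaly_score(x, x_dot, M)
print(score)                               # 20.5 - 13.0 = 7.5 minutes
```

A positive score means that tasks take longer on average than the model expects, and its magnitude is expressed in the same unit as the task duration.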

4. Experiment

We conduct extensive experiments on four datasets to verify the following claims:

  • SORN can achieve the best performances on the four datasets, compared with the SOTA methods.

  • SORN consumes tolerable time and memory overhead.

  • SORN is parameter insensitive.

  • SORN is resistant to noise and lax periodicity.

  • Each module in SORN has contributed to the performance.

4.1. Experiment Setup

Baseline Methods. We compare SORN with the SOTA anomaly detection methods: DCdetector (Yang et al., 2023), TranAD (Tuli et al., 2022a), AnomalyTransformer (Xu et al., 2022), VQRAE (Kieu et al., 2022), OmniAnomaly (Su et al., 2019), MSCRED (Zhang et al., 2019). Furthermore, we compare SORN with a method specifically designed for slow-down detection: IASO (Panda et al., 2019) and a method designed for distribution shift detection, feature-shift detection (Kulinski et al., 2020).

Datasets. We perform our experiments on four datasets. Two of them (Ali1, Ali2) are monitoring data of industrial cloud clusters from Alibaba. One of them (Mustang) is disclosed by Carnegie Mellon Parallel Data Laboratory, and we label the slow-down anomalies in it manually. To further verify the impact of different factors on the performance, such as noise, periodicity, slow tasks ratio, and the average task slow-down time, we also introduce a synthetic dataset (Sync) so that we can keep every factor under control. We summarize key statistics of different datasets in Table 1.

Besides, different datasets exhibit different periodicity strictness. To measure it, for each subset that exhibits periodicity, we calculate its autocorrelation coefficient at a lag of its period length as its periodicity strictness level; otherwise, we set its periodicity strictness to $0$. We show the periodicity strictness level distribution of subsets in each dataset in Fig. 4(a). Ali1 and Ali2 show relatively strict periodicity. Mustang shows lax periodicity. Some subsets of Sync show strict periodicity, while others are aperiodic.
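The periodicity strictness measure can be sketched as a lag-$k$ autocorrelation coefficient. How the period length of a subset is determined is not described here, so the sketch assumes it is given (or `None` for an aperiodic subset); the function name and the sine example are illustrative.

```python
import numpy as np

def periodicity_strictness(series, period):
    """Autocorrelation coefficient at a lag of one period; 0 for aperiodic series.

    `period` is assumed to be known in advance (None means no periodicity).
    """
    if period is None:
        return 0.0
    s = np.asarray(series, dtype=float)
    s = s - s.mean()
    return float(np.dot(s[:-period], s[period:]) / np.dot(s, s))

t = np.arange(400)
strict = periodicity_strictness(np.sin(2 * np.pi * t / 20), period=20)
aperiodic = periodicity_strictness(np.random.default_rng(0).normal(size=400), period=None)
print(strict, aperiodic)
```

With this common estimator, a perfectly periodic series of 400 points and period 20 scores about 0.95 rather than exactly 1, because the lagged product covers only 380 of the 400 points.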

For more data preprocessing details, please refer to Appendix C.

Table 1. Statistics of different datasets.

| | Ali1 | Ali2 | Mustang | Sync |
|---|---|---|---|---|
| Dimension | 14 | 14 | 17 | 14 |
| Anomaly ratio (%) | 3.71 | 6.06 | 7.75 | 1 |
| Subsets | 25 | 25 | 1 | 10 |

Table 2. The hyperparameters.

| Hyperparameter | Value | Hyperparameter | Value |
|---|---|---|---|
| Batch Size | 100 | Learning Rate | 0.001 |
| Skimming Layer of Ali1 | 10 | Patch Size of Ali1 | 2 |
| Skimming Layer of Ali2 | 6 | Patch Size of Ali2 | 2 |
| Skimming Layer of Mut | 6 | Patch Size of Mut | 2 |
| Skimming Layer of Sync | 6 | Patch Size of Sync | 10 |

Hyperparameters. We show some important hyperparameters in Table 2, where we use Mut to stand for Mustang.

Figure 4. (a) Periodicity strictness of each dataset: the autocorrelation coefficient distribution at the interval of period length for subsets in every dataset. (b) Time and memory overhead of SORN and baselines on the Sync dataset. We use the first two characters to stand for each method. (c) Parameter sensitivity: the sensitivity to the number of skimming layers and the patch size on the Sync dataset.

Figure 5. (a) The impact of noise and slow task ratio on performance. We add noise to the original synthetic time series, whose standard deviation is the maximum amplitude of the original time series multiplied by the "noise" shown in the legend. Then, we test the performance of SORN for different slow task ratios. (b) The impact of periodicity and slow task ratio on performance. For each period in a periodic time series, we extend it by a scalar randomly sampled from $(1, 1+R]$. In this way, the original time series will have a lax periodicity. Then, we test the performance of SORN for different slow task ratios. (c) The impact of noise and average slow-down time on performance. Using the same noise setting as (a), we test the performance of SORN for different average slow-down times. (d) The impact of periodicity and average slow-down time on performance. Using the same period setting as (b), we test the performance of SORN for different average slow-down times.

Evaluation Metrics. Following prior work (Li et al., 2023; Chen et al., 2021, 2022a), we choose three of the most widely used metrics to measure the performance of our method: precision, recall, and F1 score.
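For reference, the three metrics can be computed from point-wise binary labels as in the sketch below. Whether any adjustment of predictions (for example, over contiguous anomaly segments) is applied before scoring is not stated here, so this sketch assumes plain point-wise counting; the function name and the toy labels are illustrative.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Point-wise precision, recall, and F1 for binary anomaly labels (1 = anomaly)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)           # anomalies correctly detected
    fp = np.sum(~y_true & y_pred)          # false alarms
    fn = np.sum(y_true & ~y_pred)          # missed anomalies
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return float(precision), float(recall), float(f1)

pre, rec, f1 = precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(pre, rec, f1)
```

In this toy case there are two true positives, one false alarm, and one miss, so all three metrics equal 2/3.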

4.2. Prediction Accuracy

We take 70% of each subset as the training set and the remaining 30% as the testing set. For each subset, we train a unique model. This training strategy is also adopted by prior works, such as (Zhang et al., 2019; Su et al., 2019; Chen et al., 2022a). We show the performance of SORN and baselines in Table 3, where we use "Pre" and "Rec" to stand for precision and recall, respectively, and highlight the best performance in boldface. When SORN achieves the best performance, we underline the best performance among baselines. Otherwise, we underline the second-best performance among all methods. SORN achieves the best F1 scores on all datasets compared with the state-of-the-art methods. Comparing the performance of our method on the four datasets, we observe that it performs best on the Ali1 and Ali2 datasets, followed by Mustang and Sync. This suggests that the effectiveness of our method is positively correlated with the strictness of periodicity in the datasets. It achieves impressive performance on datasets with strict periodicity, while also demonstrating competitive results on datasets with lax periodicity or aperiodic characteristics. We further discuss the impact of periodicity strictness in Section 4.5.

Table 3. Average performance of SORN and baselines on subsets of four datasets.

| Method | Ali1 Pre | Ali1 Rec | Ali1 F1 | Ali2 Pre | Ali2 Rec | Ali2 F1 | Mustang Pre | Mustang Rec | Mustang F1 | Sync Pre | Sync Rec | Sync F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSCRED | 0.841 | 0.981 | 0.878 | 0.928 | 0.988 | 0.954 | 0.871 | 0.960 | 0.896 | 0.717 | 0.874 | 0.779 |
| Omni | 0.681 | 0.981 | 0.782 | 0.814 | 0.987 | 0.890 | 0.812 | 0.968 | 0.878 | 0.655 | 0.997 | 0.787 |
| AnomalyTr | 1.000 | 0.870 | 0.923 | 0.999 | 0.763 | 0.857 | 1.000 | 0.891 | 0.935 | 1.000 | 0.680 | 0.809 |
| TranAD | 0.784 | 0.989 | 0.864 | 0.827 | 0.968 | 0.877 | 0.865 | 0.918 | 0.867 | 0.247 | 0.568 | 0.313 |
| DCdetector | 0.984 | 0.728 | 0.806 | 0.994 | 0.723 | 0.818 | 0.968 | 0.718 | 0.799 | 0.936 | 0.406 | 0.567 |
| VQRAE | 0.811 | 0.981 | 0.853 | 0.966 | 0.992 | 0.978 | 0.871 | 0.959 | 0.905 | 0.624 | 0.794 | 0.648 |
| IASO | 0.492 | 0.943 | 0.618 | 0.611 | 0.907 | 0.708 | 0.420 | 0.899 | 0.524 | 0.389 | 0.910 | 0.533 |
| feature-shift | 0.533 | 1.000 | 0.647 | 0.744 | 0.953 | 0.790 | 0.511 | 1.000 | 0.629 | 0.594 | 0.081 | 0.142 |
| SORN† | 0.891 | 0.989 | 0.897 | 0.955 | 0.997 | 0.968 | 0.895 | 0.996 | 0.916 | 0.963 | 0.893 | 0.919 |
| SORN‡ | 0.944 | 0.969 | 0.939 | 0.960 | 0.967 | 0.955 | 0.912 | 0.971 | 0.919 | 0.939 | 0.832 | 0.874 |
| SORN§ | 0.878 | 1.000 | 0.891 | 0.950 | 0.997 | 0.965 | 0.925 | 0.996 | 0.938 | 0.935 | 0.763 | 0.826 |
| SORN | 1.000 | 0.966 | 0.979 | 0.980 | 1.000 | 0.989 | 0.952 | 0.974 | 0.958 | 0.956 | 0.926 | 0.932 |

4.3. Time and Memory Overhead

We evaluate both time and memory overhead on a server equipped with 32 Intel(R) Xeon(R) CPU E5-2620 @ 2.10GHz CPUs and 2 K80 GPUs. We use the checkpoint size to represent the memory overhead of the neural network methods and the time of training a model for one epoch to represent the time overhead. For the non-neural-network methods, IASO and feature-shift detection, we use the maximum memory consumption during inference as the memory overhead. We show the time and memory overhead of different methods in Fig. 4(b). SORN introduces only marginal time and memory overhead compared with lightweight methods such as OmniAnomaly, yet achieves better performance on all the datasets. Compared with transformer-based methods such as AnomalyTransformer and DCdetector, SORN uses less memory yet achieves better accuracy. Thus, SORN can better meet the real-time requirements of the cloud center.

4.4. Hyperparameter Sensitivity

We test the performance of SORN over the Cartesian product of {1,3,5,7,9} for the number of skimming layers and {1,3,5,7,9,11,13} for the patch size. We show the results in Fig. 4(c). Overall, the performance of SORN is insensitive to these hyperparameters. As the number of skimming attention layers and the patch size increase, the performance of SORN tends to increase, with fluctuations.

4.5. The Impact of Dataset Property

We investigate the impact of four factors on the performance of SORN on the Sync dataset: the noise, periodicity strictness, slow task ratio, and average slow-down time in slow-down anomalies. The noise introduced into the Sync data is a random variable with a mean of $0$ and a standard deviation of $noise * \mathcal{A}$, where $\mathcal{A}$ is the amplitude of the original time series. To manipulate the periodicity strictness, we distort each period of the original series by extending it with a scalar randomly sampled from the interval $(1, 1+R]$. When we test the impact of the noise, we make the time series strictly periodic before introducing noise, and vice versa. The results are displayed in Fig. 5(a)-Fig. 5(d). Generally, when the time series is strictly periodic without any noise, SORN achieves excellent performance on the Sync dataset. When the noise becomes stronger and the periodicity is more severely distorted, the performance degrades, but SORN remains sensitive and accurate: SORN achieves an F1 score over 0.9 as long as the slow task ratio exceeds 10% in all conditions of the noise and periodicity strictness explored in our experiment, and it achieves an F1 score over 0.9 as long as the average slow-down time exceeds 60 minutes in all conditions of the periodicity strictness and most conditions of the noise. It is worth noting that 60 minutes is slightly over the maximum interval length in $I$ (50 minutes). Since the maximum interval length is 50 minutes, a slow task with a slow-down time less than that may not change $x$. Thus, our model cannot distinguish such tasks. If there is a need to improve the sensitivity of SORN to the average slow-down time, we can simply substitute the interval division $I$ with a finer-grained one.
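The two perturbations can be sketched as follows. Several details are assumptions of this sketch rather than statements of the experimental setup: the noise is drawn from a Gaussian, the amplitude is taken as the maximum absolute value of the series, the scalar is sampled uniformly from $(1, 1+R]$, and a period is extended by linear interpolation in time. The function names are hypothetical.

```python
import numpy as np

def add_noise(series, noise, rng):
    """Add zero-mean noise with standard deviation noise * amplitude (Gaussian assumed)."""
    amplitude = np.max(np.abs(series))
    return series + rng.normal(0.0, noise * amplitude, size=series.shape)

def relax_periodicity(one_period, num_periods, R, rng):
    """Repeat one period, stretching each copy by a scalar sampled from (1, 1 + R]."""
    base_pos = np.arange(len(one_period))
    pieces = []
    for _ in range(num_periods):
        scale = 1.0 + R * (1.0 - rng.random())        # rng.random() is in [0, 1)
        new_len = int(round(len(one_period) * scale))
        # Resample the period onto new_len points (linear interpolation assumed).
        pos = np.linspace(0, len(one_period) - 1, new_len)
        pieces.append(np.interp(pos, base_pos, one_period))
    return np.concatenate(pieces)

rng = np.random.default_rng(0)
base = np.sin(2 * np.pi * np.arange(20) / 20)         # one strict period of 20 slots
noisy = add_noise(np.tile(base, 5), noise=0.1, rng=rng)
lax = relax_periodicity(base, num_periods=5, R=0.2, rng=rng)
print(len(noisy), len(lax))
```

Setting `noise=0` or `R=0` recovers the strictly periodic, noise-free series, which matches the protocol of varying one factor while holding the other fixed.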

4.6. Ablation Study

To evaluate the contribution of each module in SORN, we remove each submodule in turn and test the performance of the remaining model. Specifically, we denote SORN without skimming attention as SORN†, SORN without neural OT as SORN‡, and SORN with the picky loss replaced by MSE as SORN§. When removing the skimming attention mechanism, we replace it with standard attention. When removing the picky loss, we substitute it with MSE. As shown in Table 3, the complete SORN achieves the best performance. Thus, each submodule of SORN contributes to the performance.

5. Related Work

To the best of our knowledge, we are the first to investigate the issue of cluster-wide task slowdowns. While numerous works delve into slow query detection (Ma et al., 2020; Zhou et al., 2021) and disk fail-slow detection (Lu et al., 2023, 2022), they primarily focus on detecting slowdowns at the level of individual SQL queries or disks rather than considering the overall aspect. However, detecting slow tasks at the individual level can be unreliable in cloud virtual environments, where task duration time fluctuates randomly and significantly. Single-task slowdowns are common and do not necessarily indicate a cluster malfunction.

Moreover, time series anomaly detection is another relevant area, as we need to capture the normal variation pattern and time dependencies of time series (Zhang et al., 2023; Jin et al., 2024). Time series anomaly detection methods can be broadly categorized into three classes: classical methods (Pang et al., 2015; Barz et al., 2018; Nakamura et al., 2020; Gao et al., 2020), signal-processing-based methods (Zhao et al., 2019; Alarcon-Aquino and Barria, 2001; Ma et al., 2021), and deep learning-based methods (Zhang et al., 2022; Hundman et al., 2018; Zong et al., 2018; Sasal et al., 2022; Chen et al., 2024; Sun et al., 2023; Xu et al., 2024, 2023). Classical methods typically rely on statistical approaches and have relatively low computational overhead. However, they often make specific assumptions that limit their robustness in detecting anomalies in cloud environments (Ma et al., 2021). Signal-processing-based methods leverage the sparsity inherent in the frequency domain to reduce computational overhead. However, they may overlook local subtle features (Alarcon-Aquino and Barria, 2001) or struggle to handle heavy traffic loads in real-time scenarios (Ma et al., 2021). Deep learning-based anomaly detection methods have reported promising performance and diversified into various approaches, including prediction-based (Hundman et al., 2018; Zong et al., 2018; Sasal et al., 2022; Chen et al., 2022c), reconstruction-based (Chen et al., 2022b; You et al., 2022; Jiang et al., 2022; Shen et al., 2021; Tian et al., 2019; Deng and Hooi, 2021; Ho and Armanfard, 2023; Chen et al., 2024), classification-based (Grathwohl et al., 2020; Ruff et al., 2018; Shen et al., 2020; Xu et al., 2024; Sun et al., 2023), and perturbation-based methods (Cai and Fan, 2022; Stadler et al., 2021). 
Among them, reconstruction-based methods have shown strong advantages over others (Kieu et al., 2022), in which the transformer-based methods have demonstrated good performance recently (Tuli et al., 2022b; Xu et al., 2022; Potter et al., 2022). However, as we mentioned earlier, the standard attention mechanism may struggle to reconstruct compound periodic time series effectively.

6. Conclusion

In this study, we introduce SORN as a method for detecting cluster-wide task slowdowns in cloud clusters, offering three distinctive features: 1) Skimming Attention, where we provide a theoretical explanation for the limitations of standard attention mechanisms in reconstructing compound periodicity and propose a method to separately reconstruct subperiodic components to ensure accurate reconstruction of both high and low amplitude subperiods; 2) Neural OT, which selectively reconstructs non-slowing exceptional fluctuations; 3) Picky Loss, which assigns weights to time slots in the loss function based on their reliability. Additionally, extensive experiments demonstrate that SORN outperforms state-of-the-art methods in real-world datasets. In the future, we will use large language models for further analysis of the causes of slow-down tasks based on this foundation and employ multi-agent systems for automatic recovery.

Acknowledgements.
This work was supported by the National Science Foundation of China under Grants 62125206 and U20A20173, and in part by Alibaba Group through Alibaba Research Intern Program.

References

  • Alarcon-Aquino and Barria (2001) Vicente Alarcon-Aquino and Javier A Barria. 2001. Anomaly detection in communication networks using wavelets. IEEE Proceedings-Communications 148, 6 (2001), 355–362.
  • Amvrosiadis et al. (2018) George Amvrosiadis, Jun Woo Park, Gregory R Ganger, Garth A Gibson, Elisabeth Baseman, and Nathan DeBardeleben. 2018. On the diversity of cluster workloads and its impact on research results. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 533–546.
  • Barz et al. (2018) Björn Barz, Erik Rodner, Yanira Guanche Garcia, and Joachim Denzler. 2018. Detecting regions of maximal divergence for spatio-temporal anomaly detection. IEEE transactions on pattern analysis and machine intelligence 41, 5 (2018), 1088–1101.
  • Cai and Fan (2022) Jinyu Cai and Jicong Fan. 2022. Perturbation learning based anomaly detection. Advances in Neural Information Processing Systems 35 (2022).
  • Chen et al. (2022c) Chengwei Chen, Yuan Xie, Shaohui Lin, Angela Yao, Guannan Jiang, Wei Zhang, Yanyun Qu, Ruizhi Qiao, Bo Ren, and Lizhuang Ma. 2022c. Comprehensive regularization in a bi-directional predictive network for video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 230–238.
  • Chen et al. (2024) Feiyi Chen, Zhen Qin, Mengchu Zhou, Yingying Zhang, Shuiguang Deng, Lunting Fan, Guansong Pang, and Qingsong Wen. 2024. LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised Time Series Anomaly Detection. In Proceedings of the ACM on Web Conference 2024. 4138–4149.
  • Chen et al. (2022a) Wenchao Chen, Long Tian, Bo Chen, Liang Dai, Zhibin Duan, and Mingyuan Zhou. 2022a. Deep variational graph convolutional recurrent network for multivariate time series anomaly detection. In International Conference on Machine Learning. PMLR, 3621–3633.
  • Chen et al. (2022b) Wenchao Chen, Long Tian, Bo Chen, Liang Dai, Zhibin Duan, and Mingyuan Zhou. 2022b. Deep Variational Graph Convolutional Recurrent Network for Multivariate Time Series Anomaly Detection. In International Conference on Machine Learning, ICML 2022 (Proceedings of Machine Learning Research, Vol. 162). 3621–3633.
  • Chen et al. (2021) Xuanhao Chen, Liwei Deng, Feiteng Huang, Chengwei Zhang, Zongquan Zhang, Yan Zhao, and Kai Zheng. 2021. Daemon: Unsupervised anomaly detection and interpretation for multivariate time series. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2225–2230.
  • Deng and Hooi (2021) Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4027–4035.
  • Gao et al. (2020) Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, and Huan Xu. 2020. Robusttad: Robust time series anomaly detection via decomposition and convolutional neural networks. arXiv preprint arXiv:2002.09545 (2020).
  • Grathwohl et al. (2020) Will Grathwohl, Kuan-Chieh Wang, Jorn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. 2020. Your classifier is secretly an energy based model and you should treat it like one. In 8th International Conference on Learning Representations, ICLR 2020.
  • Ho and Armanfard (2023) Thi Kieu Khanh Ho and Narges Armanfard. 2023. Self-supervised learning for anomalous channel detection in EEG graphs: application to seizure analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 7866–7874.
  • Hundman et al. (2018) Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. 2018. Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 387–395.
  • Jiang et al. (2022) Xi Jiang, Jianlin Liu, Jinbao Wang, Qiang Nie, Kai Wu, Yong Liu, Chengjie Wang, and Feng Zheng. 2022. Softpatch: Unsupervised anomaly detection with noisy data. Advances in Neural Information Processing Systems 35 (2022), 15433–15445.
  • Jin et al. (2024) Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. 2024. Time-llm: Time series forecasting by reprogramming large language models. In International Conference on Learning Representations.
  • Kieu et al. (2022) Tung Kieu, Bin Yang, Chenjuan Guo, Razvan-Gabriel Cirstea, Yan Zhao, Yale Song, and Christian S Jensen. 2022. Anomaly detection in time series with robust variational quasi-recurrent autoencoders. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1342–1354.
  • Kulinski et al. (2020) Sean Kulinski, Saurabh Bagchi, and David I Inouye. 2020. Feature shift detection: Localizing which features have shifted via conditional distribution tests. Advances in neural information processing systems 33 (2020), 19523–19533.
  • Li et al. (2023) Yuxin Li, Wenchao Chen, Bo Chen, Dongsheng Wang, Long Tian, and Mingyuan Zhou. 2023. Prototype-oriented unsupervised anomaly detection for multivariate time series. In International Conference on Machine Learning. PMLR, 19407–19424.
  • Lu et al. (2023) Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, et al. 2023. Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems. In 21st USENIX Conference on File and Storage Technologies (FAST 23). 49–64.
  • Lu et al. (2022) Ruiming Lu, Erci Xu, Yiming Zhang, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Minglu Li, and Jiesheng Wu. 2022. NVMe SSD failures in the field: the Fail-Stop and the Fail-Slow. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 1005–1020.
  • Ma et al. (2020) Minghua Ma, Zheng Yin, Shenglin Zhang, Sheng Wang, Christopher Zheng, Xinhao Jiang, Hanwen Hu, Cheng Luo, Yilin Li, Nengjun Qiu, et al. 2020. Diagnosing root causes of intermittent slow queries in cloud databases. Proceedings of the VLDB Endowment 13, 8 (2020), 1176–1189.
  • Ma et al. (2021) Minghua Ma, Shenglin Zhang, Junjie Chen, Jim Xu, Haozhe Li, Yongliang Lin, Xiaohui Nie, Bo Zhou, Yong Wang, and Dan Pei. 2021. Jump-Starting multivariate time series anomaly detection for online service systems. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 413–426.
  • Nakamura et al. (2020) Takaaki Nakamura, Makoto Imamura, Ryan Mercer, and Eamonn Keogh. 2020. Merlin: Parameter-free discovery of arbitrary length anomalies in massive time series archives. In 2020 IEEE international conference on data mining (ICDM). IEEE, 1190–1195.
  • Nie et al. (2023) Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2023. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023.
  • Panda et al. (2019) Biswaranjan Panda, Deepthi Srinivasan, Huan Ke, Karan Gupta, Vinayak Khot, and Haryadi S Gunawi. 2019. IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 47–62.
  • Pang et al. (2015) Guansong Pang, Kai Ming Ting, and David Albrecht. 2015. LeSiNN: Detecting anomalies by identifying least similar nearest neighbours. In 2015 IEEE international conference on data mining workshop (ICDMW). IEEE, 623–630.
  • Potter et al. (2022) İlkay Yıldız Potter, George Zerveas, Carsten Eickhoff, and Dominique Duncan. 2022. Unsupervised Multivariate Time-Series Transformers for Seizure Identification on EEG. In 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 1304–1311.
  • Ruff et al. (2018) Lukas Ruff, Nico Görnitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Robert A. Vandermeulen, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep One-Class Classification. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018 (Proceedings of Machine Learning Research, Vol. 80). 4390–4399.
  • Sasal et al. (2022) Lena Sasal, Tanujit Chakraborty, and Abdenour Hadid. 2022. W-Transformers: A Wavelet-based Transformer Framework for Univariate Time Series Forecasting. In 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 671–676.
  • Shen et al. (2020) Lifeng Shen, Zhuocong Li, and James Kwok. 2020. Timeseries anomaly detection using temporal hierarchical one-class network. Advances in Neural Information Processing Systems 33 (2020), 13016–13026.
  • Shen et al. (2021) Lifeng Shen, Zhongzhong Yu, Qianli Ma, and James T Kwok. 2021. Time series anomaly detection with multiresolution ensemble decoding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 9567–9575.
  • Stadler et al. (2021) Maximilian Stadler, Bertrand Charpentier, Simon Geisler, Daniel Zügner, and Stephan Günnemann. 2021. Graph posterior network: Bayesian predictive uncertainty for node classification. Advances in Neural Information Processing Systems 34 (2021), 18033–18048.
  • Su et al. (2019) Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 2828–2837.
  • Sun et al. (2023) Yuting Sun, Guansong Pang, Guanhua Ye, Tong Chen, Xia Hu, and Hongzhi Yin. 2023. Unraveling the Anomaly in Time Series Anomaly Detection: A Self-supervised Tri-domain Solution. arXiv preprint arXiv:2311.11235 (2023).
  • Tian et al. (2019) Kai Tian, Shuigeng Zhou, Jianping Fan, and Jihong Guan. 2019. Learning competitive and discriminative reconstructions for anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5167–5174.
  • Tuli et al. (2022a) Shreshth Tuli, Giuliano Casale, and Nicholas R Jennings. 2022a. Tranad: Deep transformer networks for anomaly detection in multivariate time series data. arXiv preprint arXiv:2201.07284 (2022).
  • Tuli et al. (2022b) Shreshth Tuli, Giuliano Casale, and Nicholas R. Jennings. 2022b. TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data. Proc. VLDB Endow. 15, 6 (2022), 1201–1214.
  • Upadhyay and Sikka (2020) Utsav Upadhyay and Geeta Sikka. 2020. STDADS: an efficient slow task detection algorithm for deadline schedulers. Big Data 8, 1 (2020), 62–69.
  • Wen et al. (2020) Qingsong Wen, Zhe Zhang, Yan Li, and Liang Sun. 2020. Fast RobustSTL: Efficient and robust seasonal-trend decomposition for time series with complex patterns. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2203–2213.
  • Xu et al. (2023) Hongzuo Xu, Guansong Pang, Yijie Wang, and Yongjun Wang. 2023. Deep isolation forest for anomaly detection. IEEE Transactions on Knowledge and Data Engineering (2023).
  • Xu et al. (2024) Hongzuo Xu, Yijie Wang, Songlei Jian, Qing Liao, Yongjun Wang, and Guansong Pang. 2024. Calibrated one-class classification for unsupervised time series anomaly detection. IEEE Transactions on Knowledge and Data Engineering (2024).
  • Xu et al. (2022) Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2022. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
  • Yang et al. (2023) Yiyuan Yang, Chaoli Zhang, Tian Zhou, Qingsong Wen, and Liang Sun. 2023. DCdetector: Dual Attention Contrastive Representation Learning for Time Series Anomaly Detection. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2203–2213.
  • You et al. (2022) Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le. 2022. A unified model for multi-class anomaly detection. Advances in Neural Information Processing Systems 35 (2022), 4571–4584.
  • Zhang et al. (2019) Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V Chawla. 2019. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 1409–1416.
  • Zhang et al. (2022) Chaoli Zhang, Tian Zhou, Qingsong Wen, and Liang Sun. 2022. TFAD: A decomposition time series anomaly detection architecture with time-frequency analysis. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2497–2507.
  • Zhang et al. (2023) Kexin Zhang, Qingsong Wen, Chaoli Zhang, Rongyao Cai, Ming Jin, Yong Liu, James Zhang, Yuxuan Liang, Guansong Pang, Dongjin Song, et al. 2023. Self-Supervised Learning for Time Series Analysis: Taxonomy, Progress, and Prospects. arXiv preprint arXiv:2306.10125 (2023).
  • Zhang et al. (2021) Yingying Zhang, Zhengxiong Guan, Huajie Qian, Leili Xu, Hengbo Liu, Qingsong Wen, Liang Sun, Junwei Jiang, Lunting Fan, and Min Ke. 2021. CloudRCA: A root cause analysis framework for cloud computing platforms. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4373–4382.
  • Zhao et al. (2019) Nengwen Zhao, Jing Zhu, Yao Wang, Minghua Ma, Wenchi Zhang, Dapeng Liu, Ming Zhang, and Dan Pei. 2019. Automatic and generic periodicity adaptation for kpi anomaly detection. IEEE Transactions on Network and Service Management 16, 3 (2019), 1170–1183.
  • Zhou et al. (2021) Xuanhe Zhou, Lianyuan Jin, Ji Sun, Xinyang Zhao, Xiang Yu, Jianhua Feng, Shifu Li, Tianqing Wang, Kun Li, and Luyang Liu. 2021. Dbmind: A self-driving platform in opengauss. Proceedings of the VLDB Endowment 14, 12 (2021), 2743–2746.
  • Zong et al. (2018) Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Dae-ki Cho, and Haifeng Chen. 2018. Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection. In 6th International Conference on Learning Representations, ICLR 2018.

Appendix A Proof of Theorem 1

In the following, we use $\operatorname{AttentionWeight}[t_1,t_2]$ to denote the attention weight of the patch starting from the $t_2$-th time slot when the patch starting from the $t_1$-th time slot is used as the query. We use the orthogonality of trigonometric functions to derive Eq. 17 from Eq. 16. Since $\cos\omega_1 t\cos\omega_1(t+\Delta t)=\frac{1}{2}\left[\cos(\omega_1 t+\omega_1(t+\Delta t))+\cos(\omega_1 t-\omega_1(t+\Delta t))\right]$, $\sin\omega_2 t\sin\omega_2(t+\Delta t)=-\frac{1}{2}\left[\cos(\omega_2 t+\omega_2(t+\Delta t))-\cos(\omega_2 t-\omega_2(t+\Delta t))\right]$, and $\int_{t_1}^{t_1+p}\cos(2\omega_1 t+\omega_1\Delta t)\,dt=0$ (because $p$ is an integer multiple of the period length of $\cos(2\omega_1 t+\omega_1\Delta t)$, and likewise for the $\omega_2$ term), we derive Eq. 18 from Eq. 17. Since $\Delta t$ is a constant that does not depend on $t$, we derive Eq. 19 from Eq. 18.

\begin{align}
\operatorname{AttentionWeight}[t_1,t_2] &= \int_{t_1}^{t_1+p}(c_1\cos\omega_1 t+c_2\sin\omega_2 t)\,[c_1\cos\omega_1(t+\Delta t)+c_2\sin\omega_2(t+\Delta t)]\,dt \tag{16}\\
&= \int_{t_1}^{t_1+p} c_1^2\cos(\omega_1 t)\cos\omega_1(t+\Delta t)+c_2^2\sin(\omega_2 t)\sin\omega_2(t+\Delta t)\,dt \tag{17}\\
&= \int_{t_1}^{t_1+p}\frac{1}{2}c_1^2\cos(\omega_1\Delta t)+\frac{1}{2}c_2^2\cos(\omega_2\Delta t)\,dt \tag{18}\\
&= \frac{p}{2}\left(c_1^2\cos\omega_1\Delta t+c_2^2\cos\omega_2\Delta t\right) \tag{19}
\end{align}
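As a sanity check on the derivation, the integral in Eq. 16 can be evaluated numerically and compared with the closed form in Eq. 19. The sketch below is illustrative only: the amplitudes, frequencies, patch length, and offsets are assumed values chosen so that $p$ is an integer multiple of every period involved, and the function names are hypothetical rather than taken from the SORN code.

```python
import math

def attention_weight(c1, w1, c2, w2, t1, dt, p, steps=20000):
    """Midpoint-rule approximation of the integral in Eq. 16."""
    h = p / steps
    total = 0.0
    for k in range(steps):
        t = t1 + (k + 0.5) * h
        query = c1 * math.cos(w1 * t) + c2 * math.sin(w2 * t)
        key = c1 * math.cos(w1 * (t + dt)) + c2 * math.sin(w2 * (t + dt))
        total += query * key
    return total * h

def closed_form(c1, w1, c2, w2, dt, p):
    """Right-hand side of Eq. 19."""
    return 0.5 * p * (c1 ** 2 * math.cos(w1 * dt) + c2 ** 2 * math.cos(w2 * dt))

# Assumed example: subperiods of length 12 and 4, patch length p = 12.
c1, c2 = 5.0, 0.5                            # high- and low-amplitude subperiods
w1, w2 = 2 * math.pi / 12, 2 * math.pi / 4   # angular frequencies
p, t1, dt = 12.0, 3.0, 2.0

numeric = attention_weight(c1, w1, c2, w2, t1, dt, p)
exact = closed_form(c1, w1, c2, w2, dt, p)
print(numeric, exact)
```

With these assumed values the closed form evaluates to 73.5: the high-amplitude term contributes 75 and the low-amplitude term only -1.5, which illustrates how the weight is dominated by the subperiod with the larger amplitude.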

Appendix B Proof of Theorem 2

We prove Theorem 2 in a similar way as in Theorem 1.

\begin{equation}
\begin{split}
\operatorname{AttentionWeight}[t_1,t_2] &= \int_{t_1}^{t_1+p}\Big(\frac{a_0}{2}+\sum_{n=0}^{\infty}a_n\cos\omega_n t+b_n\sin\omega_n t\Big)\cdot\Big[\frac{a_0}{2}+\sum_{n=0}^{\infty}a_n\cos\omega_n(t+\Delta t)+b_n\sin\omega_n(t+\Delta t)\Big]\,dt\\
&= \frac{a_0^2 p}{4}+\sum_{n=0}^{\infty}\int_{t_1}^{t_1+p}a_n^2\cos\omega_n t\cos\omega_n(t+\Delta t)+b_n^2\sin\omega_n t\sin\omega_n(t+\Delta t)\,dt\\
&= \frac{a_0^2 p}{4}+\frac{p}{2}\sum_{n=0}^{\infty}(a_n^2+b_n^2)\cos\omega_n\Delta t
\end{split}
\tag{20}
\end{equation}

Appendix C Data preprocessing

The code and some datasets are available at https://github.com/gyhswtxnc/SORN.

  • Ali1 & Ali2 (periodic): We collect these datasets by tracing 25 industrial cloud clusters from Alibaba for 15 days. Most of the labels in these two datasets are assigned manually according to the experience of our engineers. Some of the labels are assigned according to our customer’s feedback. These two datasets were collected on server clusters in different regions, and there is a significant difference in the anomaly proportion between them. Each subset in Ali1 and Ali2 stands for a cluster.

  • Mustang (lax periodic) (Amvrosiadis et al., 2018): Mustang is a dataset that records task duration time over 5 years. We preprocess the original dataset as shown in Appendix C and label the slow-down anomalies manually. Then, we divide the five years of tracing data equally into 35 intervals, which constitute 35 subsets.

  • Sync (mixture of periodic and aperiodic): We synthesize this dataset by combining cosine waves with different frequencies and amplitudes. Then, we manually insert noise, distorted periods, and slow-down anomalies.

For every dataset, we compute the task duration time distribution $I$ at each time slot and divide $I$ into intervals according to the distribution density of the execution time. We show the interval division for every dataset in Tab. 4.

Table 4. The interval division for each dataset.

| Dataset | Edges of $I$ |
|---|---|
| Ali1 | {0, 10, 20, 30, 40, 70, 110, 150, 190, 230, 280, 330, 380, 430} |
| Ali2 | {0, 10, 20, 30, 40, 70, 110, 150, 190, 230, 280, 330, 380, 430} |
| Mustang | {0, 5, 10, 20, 30, 40, 70, 110, 150, 190, 230, 280, 330, 380, 430, 900, 1200, 9000} |
| Sync | {0, 10, 20, 30, 40, 70, 110, 150, 190, 230, 280, 330, 380, 430} |
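The binning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released preprocessing code: the input format (pairs of start time and duration), the assignment of a task to a time slot by its start time, the normalization to fractions, and the open-ended last interval for durations at or above the final edge are all assumptions, and `duration_distribution` is a hypothetical helper name. The edges are those listed for Ali1; the time unit is whatever unit the edges are expressed in.

```python
from bisect import bisect_right
from collections import defaultdict

# Interval edges listed for Ali1 in Table 4.
ALI_EDGES = [0, 10, 20, 30, 40, 70, 110, 150, 190, 230, 280, 330, 380, 430]

def duration_distribution(tasks, edges, slot_length):
    """Map each time slot to the fraction of its tasks falling in each interval.

    tasks: iterable of (start_time, duration) pairs (assumed input format).
    Interval i covers [edges[i], edges[i+1]); the last interval is assumed
    to collect every duration >= edges[-1].
    """
    n_bins = len(edges)
    counts = defaultdict(lambda: [0] * n_bins)
    for start, duration in tasks:
        slot = int(start // slot_length)               # slot chosen by start time
        idx = max(bisect_right(edges, duration) - 1, 0)
        counts[slot][idx] += 1
    return {slot: [c / sum(row) for c in row] for slot, row in counts.items()}

# Toy usage with made-up tasks and a slot length of 60 time units.
toy_tasks = [(0, 5), (10, 15), (30, 500), (70, 35)]
dist = duration_distribution(toy_tasks, ALI_EDGES, slot_length=60)
print(dist[0])
print(dist[1])
```

Because each time slot is summarized by a fixed-length vector, the cost of the downstream detector depends on the number of intervals and time slots rather than on the number of tasks.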

Appendix D Hyperparameter searching space

We use grid search to find the optimal hyperparameter settings. We list the search ranges for important hyperparameters in Tab. 5.

Table 5. The search ranges for important hyperparameters.

| Hyperparameter | Search Range |
|---|---|
| Skimming layers | {1, 3, 5, 7} |
| Patch size | {2, 3, 4, 5, 7, 9, 11, 15} |
| Window length | {10, 20, 30, 40, 50, 80} |
| Learning rate | {0.0001, 0.001, 0.01} |
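For reference, the grid over these four ranges can be enumerated as in the sketch below. It only illustrates the search procedure: the dictionary keys and the `grid_search` helper are hypothetical names, and `evaluate` stands for whatever validation metric is used (assumed here to be higher-is-better), which the paper does not specify in this appendix.

```python
from itertools import product

# Search ranges from Table 5 (key names are illustrative).
SEARCH_SPACE = {
    "skimming_layers": [1, 3, 5, 7],
    "patch_size": [2, 3, 4, 5, 7, 9, 11, 15],
    "window_length": [10, 20, 30, 40, 50, 80],
    "learning_rate": [0.0001, 0.001, 0.01],
}

def grid(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def grid_search(space, evaluate):
    """Return (best_config, best_score) for a user-supplied scoring callable."""
    best_config, best_score = None, float("-inf")
    for config in grid(space):
        score = evaluate(config)
        if score > best_score:  # strict '>' keeps the first of tied configs
            best_config, best_score = config, score
    return best_config, best_score

# Toy usage: a stand-in score that prefers patch size 5 and window length 30.
n_configs = sum(1 for _ in grid(SEARCH_SPACE))
best, score = grid_search(
    SEARCH_SPACE,
    lambda c: -abs(c["patch_size"] - 5) - abs(c["window_length"] - 30),
)
print(n_configs, best)
```

The full grid contains 4 × 8 × 6 × 3 = 576 configurations, so an exhaustive search trains and evaluates the model 576 times per dataset.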

Appendix E Baselines introduction

  • DCdetector: DCdetector is a state-of-the-art anomaly detection method that combines a dual-attention asymmetric design with a purely contrastive loss.

  • TranAD: TranAD is an influential anomaly detection method that is assisted by meta-learning and achieves high detection accuracy.

  • AnomalyTransformer: AnomalyTransformer is one of the first methods to introduce deep transformers into anomaly detection, and it has shown strong performance.

  • VQRAE: VQRAE is an anomaly detection method that also addresses the problem of anomalies in the training set. We therefore include it among our baselines.

  • OmniAnomaly: OmniAnomaly is one of the most widely recognized and widely used anomaly detection methods, with small time and memory overhead.

  • MSCRED: MSCRED is a widely noted anomaly detection method that considers not only temporal correlation but also the interdependency between features.

  • IASO: IASO is a method specifically designed to detect the slow-down data retrieval of disks.

  • Feature-shift detection: The feature-shift detection method is designed to detect whether the distribution of features has shifted.