PoSeiDon: Platform for Explainable Distributed Infrastructure
DOE science workflows are increasingly executed on federated service infrastructures that are complex and managed by different organizations, domains, and communities. As a result, the operators of these infrastructures and the scientists who use them have limited global visibility and, consequently, an incomplete understanding of the behavior of the full set of resources that science workflows span. This limited visibility makes it extremely difficult to predict performance, to detect and diagnose anomalies in the infrastructure (e.g., network congestion, I/O bottlenecks), and to understand their impact on the scientists’ workflows. PoSeiDon will provide an integrated platform consisting of algorithms, methods, tools, and services that help facility operators and scientists improve the overall end-to-end science workflow by (1) predicting the performance of complex workflows; (2) detecting and classifying infrastructure and workflow anomalies and “explaining” the sources of these anomalies; and (3) suggesting performance optimizations.
Funding Agency: DOE