Until now, workflow performance studies have focused on modeling and measuring the performance of individual tasks, primarily considering the behavior of computational tasks and often ignoring data management jobs. In turn, much of the work on workflow scheduling and resource provisioning for workflow applications has focused on managing computations while, to some degree, ignoring data movement and storage. Today’s workflow monitoring tools likewise focus primarily on task monitoring and treat data management tasks as black boxes. At the same time, network monitoring tools have focused on low-level network performance, which cannot easily be correlated with the workflows that use the network.
The main questions this work aims to address are: 1) how to develop analytical models that can predict the behavior of complex, data-aware scientific workflows executing on extreme-scale infrastructures; 2) what monitoring information and analysis are needed for performance prediction and anomaly detection in scientific workflow execution; and 3) how to adapt the workflow execution and the infrastructure to achieve the potential performance predicted by the models. These questions will be addressed in the context of two DOE applications: Accelerated Climate Modeling for Energy (ACME), which processes large amounts of community data, and the Spallation Neutron Source (SNS), which produces rich experimental data used in a variety of complex analyses.
The program of research focuses on data-aware workflow performance modeling, monitoring, and analysis, and integrates this diverse information into knowledge about workflow behavior that can inform scientists and infrastructure providers about observed performance issues and their causes. This work will develop end-to-end, workflow-level analytical models that capture the performance of workflow tasks on a variety of systems, as well as workflow data movement and storage across different networks and devices. The analytical models will be coupled with simulation-based models to increase the fidelity of predictions in dynamic environments. The models will be validated through experiments on DOE infrastructures (such as the ESnet testbed and production infrastructure and the ORNL facilities), on distributed testbeds such as ExoGENI, and through simulations.
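To make the idea of an end-to-end, data-aware model concrete, the following is a minimal sketch, not the project's actual model. It assumes a workflow expressed as a DAG listed in topological order, a fixed compute-time estimate per task, and a simple linear transfer model (latency plus size over bandwidth) for every edge. All names (`Task`, `transfer_time`, `predict_makespan`, `example_workflow`) and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    compute_s: float  # assumed compute-time estimate, in seconds
    # parent task name -> bytes that must be moved from that parent
    inputs: dict = field(default_factory=dict)


def transfer_time(nbytes, bandwidth_bps, latency_s):
    """Assumed linear transfer model: latency plus size over bandwidth."""
    return latency_s + nbytes * 8 / bandwidth_bps


def predict_makespan(tasks, bandwidth_bps, latency_s=0.0):
    """Predict finish times for a DAG whose tasks are in topological order.

    A task starts once the data from all of its parents has arrived,
    so data movement contributes to the critical path alongside compute.
    """
    finish = {}
    for task in tasks:
        ready = 0.0
        for parent, nbytes in task.inputs.items():
            arrival = finish[parent] + transfer_time(nbytes, bandwidth_bps, latency_s)
            ready = max(ready, arrival)
        finish[task.name] = ready + task.compute_s
    return max(finish.values()), finish


def example_workflow():
    """Hypothetical four-task workflow: stage, two analyses, merge."""
    return [
        Task("stage", 10.0),
        Task("analyze_a", 20.0, {"stage": 1_000_000_000}),
        Task("analyze_b", 5.0, {"stage": 2_000_000_000}),
        Task("merge", 3.0, {"analyze_a": 100_000_000, "analyze_b": 100_000_000}),
    ]


if __name__ == "__main__":
    makespan, finish = predict_makespan(example_workflow(), bandwidth_bps=8e9)
    print(f"predicted makespan: {makespan:.1f} s")
    for name, t in finish.items():
        print(f"  {name}: finishes at {t:.1f} s")
```

Even in this simplified form, lowering `bandwidth_bps` shifts the predicted critical path from compute to data movement, which is the kind of effect a task-only model would miss.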
The work will result in analytical models, workflow-level monitoring tools, and monitoring recommendations for existing tools that capture not only the behavior of computational tasks but also that of the data transfer and storage activities in the workflow. An analysis capability will correlate workflow monitoring information with resource performance measurements to provide a better understanding of which resources contributed to the observed behavior. The analytical models will be used to guide anomaly detection and diagnosis, resource management and adaptation, and infrastructure design and planning.
Evaluation: The project will develop workflows for the target applications, along with synthetic workflows, to evaluate the accuracy and performance of the models and tools. The experiments will measure workflow performance at various levels of detail and compare it to the model predictions. Failures and load will be introduced into the system, and their effects on the accuracy of anomaly detection and diagnosis will be measured.
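One way such an evaluation could be scored is sketched below. This is an illustrative assumption rather than the project's stated methodology: it flags a task as anomalous when its measured runtime deviates from the model prediction by more than a relative threshold, then compares the flagged tasks against the set of tasks where failures or load were deliberately injected. The function names, the threshold, and the sample values are all hypothetical.

```python
def relative_error(measured, predicted):
    """Relative deviation of a measurement from the model prediction."""
    return abs(measured - predicted) / predicted


def flag_anomalies(measured, predicted, threshold=0.5):
    """Return the task names whose relative error exceeds the threshold."""
    return {
        name
        for name, value in measured.items()
        if relative_error(value, predicted[name]) > threshold
    }


def precision_recall(flagged, injected):
    """Score flagged tasks against the tasks with injected failures or load."""
    true_positives = len(flagged & injected)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(injected) if injected else 1.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical runtimes in seconds; load was injected only into task "b".
    predicted = {"a": 10.0, "b": 20.0, "c": 5.0}
    measured = {"a": 11.0, "b": 45.0, "c": 9.0}
    flagged = flag_anomalies(measured, predicted)
    precision, recall = precision_recall(flagged, injected={"b"})
    print(sorted(flagged), precision, recall)
```

In this sample, a task with no injected fault is also flagged, so the score separates model inaccuracy (a false positive) from a genuinely detected anomaly.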