Workflow optimization in distributed computing environment for stream-based data processing model / Saima Gulzar Ahmad
| Main Author: | Saima Gulzar Ahmad |
|---|---|
| Format: | Thesis |
| Published: | 2017 |
| Subjects: | |
| Online Access: | http://studentsrepo.um.edu.my/7761/ http://studentsrepo.um.edu.my/7761/2/All.pdf http://studentsrepo.um.edu.my/7761/1/thesis.pdf |
| Summary: | Advances in science and technology allow numerous complex scientific applications
to be executed in heterogeneous computing environments. However, efficient scheduling
algorithms remain the bottleneck. Such complex applications can be expressed in
the form of workflows. Geographically distributed heterogeneous resources can execute
such workflows in parallel, which speeds up workflow execution. In data-intensive
workflows, large volumes of data move across the execution nodes, causing high communication
overhead. Many techniques have been used to avoid such overheads; this
thesis uses a stream-based data processing model, in which data is processed as
continuous instances of data items. Data-intensive workflow optimization is an active
research area because numerous applications produce huge amounts of data, and that volume is
increasing exponentially day by day.
This thesis proposes data-intensive workflow optimization algorithms. The first algorithm
consists of two phases: (a) workflow partitioning and (b) partition mapping.
Partitions are formed so that minimal data moves across partitions.
Because each partition is mapped to one execution node, heavy data processing happens locally
on the same node, which avoids high communication costs. In the
mapping phase, each partition is mapped to the execution node that offers the minimum execution
time, and the workflow is then executed. The second algorithm is a variation
of the first in which data parallelism is introduced in each partition: the most
compute-intensive task in each partition is identified and data parallelism is applied to that
task, reducing its execution time. The simulation results
show that the proposed algorithms outperform state-of-the-art algorithms for a variety
of workflows. The datasets used for performance evaluation include synthesized workflows as well as
workflows derived from real-world applications, namely Montage and CyberShake. Synthesized workflows were generated
with different sizes, shapes and densities to evaluate the proposed algorithms. The simulation
results show 60% lower latency and a 47% improvement in throughput.
Similarly, when data parallelism is introduced in the algorithm, performance
improves further, by 12% in latency and 17% in throughput compared to
the PDWA algorithm. Experiments in a real-time stream processing framework were
performed using STORM with a use-case data-intensive workflow (EURExpressII). These experiments
show that PDWA achieves better workflow execution time across
different input data sizes. |
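The two phases described in the summary (partitioning so that little data crosses partitions, then mapping each partition to the node with the lowest execution time) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the greedy heaviest-edge merge, the function names (`partition_workflow`, `cut_size`, `map_partitions`, `hottest_task`), and the toy workflow, work, and speed values are all assumptions made for illustration, and the sketch assumes there are at least as many nodes as partitions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (producer task, consumer task, data size)


def partition_workflow(tasks: List[str], edges: List[Edge], k: int) -> List[List[str]]:
    """Hypothetical heuristic: merge tasks joined by the heaviest edges until k partitions remain."""
    parent = {t: t for t in tasks}

    def find(t: str) -> str:
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    count = len(tasks)
    for u, v, _size in sorted(edges, key=lambda e: e[2], reverse=True):
        if count <= k:
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[rv] = ru  # heavy edge becomes internal to one partition
            count -= 1

    groups: Dict[str, List[str]] = {}
    for t in tasks:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())


def cut_size(partitions: List[List[str]], edges: List[Edge]) -> float:
    """Total data that still crosses partition boundaries."""
    owner = {t: i for i, p in enumerate(partitions) for t in p}
    return sum(s for u, v, s in edges if owner[u] != owner[v])


def map_partitions(partitions: List[List[str]], work: Dict[str, float],
                   speed: Dict[str, float]) -> Dict[int, str]:
    """Map each partition (heaviest first) to the free node with the lowest estimated time."""
    free = dict(speed)
    loads = [sum(work[t] for t in p) for p in partitions]
    mapping: Dict[int, str] = {}
    for i in sorted(range(len(partitions)), key=lambda i: -loads[i]):
        node = min(free, key=lambda n: loads[i] / free[n])
        mapping[i] = node
        del free[node]  # one partition per execution node
    return mapping


def hottest_task(partition: List[str], work: Dict[str, float]) -> str:
    """Pick the most compute-intensive task, the candidate for data parallelism."""
    return max(partition, key=lambda t: work[t])


# Toy workflow (assumed values): a -> b -> c -> d with data sizes on the edges.
tasks = ["a", "b", "c", "d"]
edges = [("a", "b", 100.0), ("b", "c", 5.0), ("c", "d", 80.0)]
work = {"a": 10.0, "b": 30.0, "c": 20.0, "d": 5.0}
speed = {"n1": 1.0, "n2": 2.0}

partitions = partition_workflow(tasks, edges, k=2)
mapping = map_partitions(partitions, work, speed)
print(partitions, cut_size(partitions, edges), mapping)
```

In this toy case the two heavy edges end up inside partitions, so only the light `b -> c` edge crosses nodes, and the heavier partition is placed on the faster node. A real scheduler would also need to account for dependencies between partitions and for streaming rates, which this sketch ignores.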