Architectural Optimization of Serverless Big Data Pipelines for AI Workloads Using Cloud Functions and Managed Spark on GCP

Amandeep Singh Arora; Thulasiram Yachamaneni; Uttam Kotadiya

doi:10.63282/3050-9246.IJETCSIT-V5I1P107

Authors

Amandeep Singh Arora, Senior Engineer I, USA
Thulasiram Yachamaneni, Senior Engineer II, USA
Uttam Kotadiya, Software Engineer II, USA

Keywords:

Cloud Functions, Managed Spark, Dataproc, GCP, Serverless, Big Data, AI Workloads

Abstract

The growing use of Artificial Intelligence (AI) and Machine Learning (ML) in data-intensive environments creates a need for scalable, efficient, and cost-effective data processing architectures. Legacy monolithic systems are giving way to distributed, cloud-native, and serverless systems. This paper presents an architectural optimization of serverless big data pipelines for executing AI workloads on Google Cloud Platform (GCP), specifically with Google Cloud Functions and Managed Spark (Dataproc). The architecture addresses the main challenges of scalability, fault tolerance, data latency, and cost optimization through a modular, event-driven approach. The pattern loosely couples the storage, compute, and orchestration layers to maximize resource efficiency and operational flexibility. The proposed model covers the full AI/ML data pipeline: ingestion, transformation, model training, and deployment. Performance analyses show substantial reductions in operational overhead, compute idle time, and processing latency while maintaining model accuracy. In addition, the paper presents specific architectural patterns, deployment strategies, and optimization strategies for serverless and Spark-native designs. Comparisons with more traditional pipeline models indicate up to a 35 percent gain in execution efficiency and a 45 percent decrease in cost. These insights are intended to help data engineers and AI practitioners who build next-generation data systems.
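The event-driven pattern described in the abstract might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cluster name, bucket names, script URI, and the `raw/` prefix are all hypothetical, and the actual submission to Dataproc is deliberately left out so the sketch runs without cloud credentials. It shows a lightweight function that reacts to a Cloud Storage upload event and decides whether to hand work to a Spark job.

```python
from typing import Any, Dict, Optional

# Hypothetical values for illustration; none of these names come from the paper.
CLUSTER_NAME = "example-cluster"
MAIN_PYSPARK_URI = "gs://example-code-bucket/jobs/transform.py"
INPUT_PREFIX = "raw/"


def build_spark_job(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a Cloud Storage upload event to a Dataproc-style PySpark job spec.

    Returns None when the object lies outside the ingestion prefix, so the
    function can exit early without requesting any Spark resources.
    """
    bucket = event.get("bucket")
    name = event.get("name") or ""
    if not bucket or not name.startswith(INPUT_PREFIX):
        return None
    return {
        "placement": {"cluster_name": CLUSTER_NAME},
        "pyspark_job": {
            "main_python_file_uri": MAIN_PYSPARK_URI,
            "args": [f"--input=gs://{bucket}/{name}"],
        },
    }


def on_object_finalized(event: Dict[str, Any], context: Any = None) -> str:
    """Entry point written in the style of a storage-triggered function."""
    job = build_spark_job(event)
    if job is None:
        return "skipped"
    # Assumption: a real deployment would submit `job` to Dataproc at this
    # point through the official client library; omitted to keep this offline.
    return "submit:" + job["pyspark_job"]["args"][0]
```

Keeping the function limited to routing and job construction reflects the decoupling the abstract emphasizes: the short-lived function handles the event, while the heavy transformation and training work stays in Spark.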

