Cost-Aware Autoscaling for Batch vs. Online Inference

Authors

  • Rohit Reddy Gaddam, Sr. Site Reliability Engineer, USA

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V3I4P113

Keywords:

Autoscaling, Cloud Computing, Machine Learning, Batch Inference, Online Inference, Cost Optimization, Elastic Scaling, Resource Management, Latency-Aware Systems, Cloud Economics

Abstract

Autoscaling is a core capability of modern machine learning deployments, yet it remains difficult to apply to inference workloads, where performance, reliability, and cost must be balanced simultaneously. Online inference, which serves requests in real time, must scale rapidly to absorb sudden spikes in demand, whereas batch inference, which processes large volumes of data at scheduled intervals, follows a more predictable scaling pattern. The trade-off between the two is clear: online inference demands low latency, for which cloud providers charge a premium, while batch inference is far more cost-efficient but less responsive to real-time needs. Existing autoscaling methods rely chiefly on throughput and latency metrics and treat cost-awareness as a secondary concern, especially when a workload must shift between batch and online modes without interruption. This work addresses that gap by introducing a cost-aware autoscaling mechanism that responds not only to demand but also to cost-performance trade-offs, combining workload profiling, predictive scaling policies, and adaptive scheduling to balance efficiency against responsiveness. A study of a production-scale machine learning system demonstrates how the framework can reduce operational costs by differentiating between batch and online inference while continuing to meet all required service levels.
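The core idea the abstract describes — size latency-sensitive online capacity for responsiveness while sizing batch capacity only to meet a deadline on cheaper instances — can be illustrated with a minimal sketch. This is not the paper's mechanism: all parameters (service rates, instance prices, the utilization target) are hypothetical, and the helper names are invented for illustration.

```python
import math

def required_replicas(arrival_rate, service_rate, utilization_target=0.7):
    """Replicas needed to keep per-replica utilization under the target.

    Online serving keeps headroom (utilization_target < 1) so latency
    stays low under bursty arrivals.
    """
    return max(1, math.ceil(arrival_rate / (service_rate * utilization_target)))

def scaling_decision(online_rps, batch_backlog, deadline_s,
                     online_rate=50.0, batch_rate=200.0,
                     on_demand_price=0.40, spot_price=0.12):
    """Pick replica counts and hourly cost for a mixed batch/online workload.

    Online replicas run on on-demand capacity and are sized for latency;
    batch replicas only need to drain the backlog (items) before the
    deadline (seconds), so they can use cheaper preemptible capacity
    at full utilization.
    """
    online = required_replicas(online_rps, online_rate)
    # Batch throughput needed = backlog / deadline, rounded up to replicas.
    batch = max(0, math.ceil(batch_backlog / (batch_rate * deadline_s)))
    cost_per_hour = online * on_demand_price + batch * spot_price
    return {"online": online, "batch": batch, "cost_per_hour": cost_per_hour}
```

For example, 100 requests/s of online traffic with a per-replica rate of 50 req/s and 70% target utilization yields 3 online replicas, while a 720,000-item backlog due in one hour needs only 1 batch replica at 200 items/s — the cost asymmetry the abstract points to.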


References

[1] Wang, Zhaoxing, et al. "Jily: Cost-aware AutoScaling of heterogeneous GPU for DNN inference in public cloud." 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC). IEEE, 2019.

[2] Zhang, Chengliang, et al. "Enabling cost-effective, slo-aware machine learning inference serving on public cloud." IEEE Transactions on Cloud Computing 10.3 (2020): 1765-1779.

[3] Gujarati, Arpan, et al. "Swayam: distributed autoscaling to meet slas of machine learning inference services with resource efficiency." Proceedings of the 18th ACM/IFIP/USENIX middleware conference. 2017.

[4] Hu, Yitao, Rajrup Ghosh, and Ramesh Govindan. "Scrooge: A cost-effective deep learning inference system." Proceedings of the ACM Symposium on Cloud Computing. 2021.

[5] Parakala, Adityamallikarjunkumar, and Aaron Bell. "How Citizen Developers Changed the Game." American International Journal of Computer Science and Technology 3.5 (2021): 14-24.

[6] Tang, Xuehai, et al. "Nanily: A QoS-aware scheduling for DNN inference workload in clouds." 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, 2019.

[7] Tian, Huangshi, Minchen Yu, and Wei Wang. "Continuum: A platform for cost-aware, low-latency continual learning." Proceedings of the ACM Symposium on Cloud Computing. 2018.

[8] Guntupalli, Bhavitha. "The Evolution of ETL: From Informatica to Modern Cloud Tools." International Journal of AI, BigData, Computational and Management Studies 2.2 (2021): 66-75.

[9] Mao, Ming, and Marty Humphrey. "Auto-scaling to minimize cost and meet application deadlines in cloud workflows." Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 2011.

[10] Goldstein, Orpaz. Real-time cost-aware machine learning at the edge. University of California, Los Angeles, 2021.

[11] Qu, Chenhao, Rodrigo N. Calheiros, and Rajkumar Buyya. "Auto-scaling web applications in clouds: A taxonomy and survey." ACM Computing Surveys (CSUR) 51.4 (2018): 1-33.

[12] Parakala, Adityamallikarjunkumar. "Integrating Salesforce and UiPath: Cross-System Intelligent Automation." International Journal of Emerging Trends in Computer Science and Information Technology 3.4 (2022): 88-99.

[13] Kumar, Pramod. "Inferall: coordinated optimization for machine learning inference serving in public cloud." (2021).

[14] Al-Dulaimy, Auday, et al. "Multiscaler: A multi-loop auto-scaling approach for cloud-based applications." IEEE Transactions on Cloud Computing 10.4 (2020): 2769-2786.

[15] Mahallat, Iran. "A Cost-Aware Approach Based on Learning Automata for Resource Auto-Scaling in Cloud Computing Environment." International Journal of Hybrid Information Technology 8.7 (2015): 389-398.

[16] Guntupalli, Bhavitha. "My Approach to Data Validation and Quality Assurance in ETL Pipelines." International Journal of Artificial Intelligence, Data Science, and Machine Learning 2.3 (2021): 62-73.

[17] Moldovan, Daniel, Hong-Linh Truong, and Schahram Dustdar. "Cost-aware scalability of applications in public clouds." 2016 IEEE international conference on cloud engineering (IC2E). IEEE, 2016.

[18] Kriushanth, M., and L. Arockiam. "Cost Aware Dynamic Rule based Auto-scaling of Infrastructure as a Service in Cloud Environment." International Journal of Computer Applications 975.8887 (2014): 19458-6047.

[19] Lesch, Veronika. "Self-Aware Multidimensional Auto-Scaling." Würzburg Software Engineering Award sponsored by Bosch Rexroth, Master Thesis, University of Würzburg, Am Hubland, Informatikgebäude 97074 (2017).

Published

2022-12-30

Section

Articles

How to Cite

Gaddam RR. Cost-Aware Autoscaling for Batch vs. Online Inference. IJETCSIT [Internet]. 2022 Dec. 30 [cited 2026 Mar. 8];3(4):134-43. Available from: https://www.ijetcsit.org/index.php/ijetcsit/article/view/577
