Enhancing Data Quality and Consistency in Large-Scale Analytical Systems through AI-Driven Engineering Workflows

Authors

  • Dinesh Babu Govindarajulunaidu Sambath Narayanan, Independent Researcher, USA

DOI:

https://doi.org/10.63282/3050-9246.IJETCSIT-V6I3P114

Keywords:

Data quality, Data contracts, Active metadata, Anomaly detection, Schema evolution, Consistency enforcement

Abstract

Large-scale analytical systems integrate heterogeneous, fast-evolving data from operational databases, event streams, and third-party sources, conditions that routinely introduce schema drift, missing values, semantic inconsistencies, and latency spikes. This paper introduces an AI-driven engineering workflow that turns data quality and consistency into a proactive, quantifiable discipline. The framework combines declarative data contracts and active metadata with learning-based observability to detect freshness, volume, schema, and distributional anomalies along both batch and streaming paths. A policy-aware remediation module applies deduplication, imputation, and type harmonization fixes, governed by the anomaly context, lineage, and downstream blast radius. Schema management and consistency enforcement re-check flagged outputs against canonical definitions; versioned governance makes every change auditable, reversible, and scoped in its impact. A feedback and continuous learning loop, driven by incident outcomes and consumer feedback, retrains detectors, recalibrates thresholds, and revises contracts. On a reference lakehouse stack (CDC + streaming ingestion, Spark transformations, contract checks, lineage capture, and MLflow-managed models), the evaluation shows lower error rates and shorter recovery times, improved timeliness and validity, and closer report alignment across analytical layers. Collectively, these results indicate that embedding AI within robust DataOps/MLOps practices can deliver durable reliability, faster incident resolution, and consistent semantics at scale, without sacrificing governance or cost control.
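
As a rough illustration of two ideas named in the abstract, the sketch below pairs a declarative data contract check with a simple volume anomaly test. It is not the paper's implementation: the names FieldRule, check_contract, and volume_anomaly, the z-score rule, and the threshold of 3.0 are all assumptions chosen for this example, and a learning-based detector as described in the abstract would replace the fixed statistical rule.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class FieldRule:
    """One declarative contract rule (hypothetical structure for illustration)."""
    name: str
    dtype: type
    nullable: bool = False


def check_contract(rows, rules):
    """Return (row_index, field, problem) tuples for every contract violation."""
    violations = []
    for i, row in enumerate(rows):
        for rule in rules:
            if rule.name not in row:
                violations.append((i, rule.name, "missing field"))
                continue
            value = row[rule.name]
            if value is None:
                if not rule.nullable:
                    violations.append((i, rule.name, "null not allowed"))
            elif not isinstance(value, rule.dtype):
                violations.append((i, rule.name, "wrong type"))
    return violations


def volume_anomaly(history, current, z_threshold=3.0):
    """Flag a batch row count that deviates strongly from past counts.

    Assumption: a plain z-score stands in for a learned detector.
    """
    if len(history) < 2:
        return False  # not enough history to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return current != history[0]
    return abs(current - mean(history)) / sigma > z_threshold


# Example usage with made-up records and row counts.
rules = [FieldRule("order_id", int), FieldRule("amount", float, nullable=True)]
rows = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": "2", "amount": None},  # order_id has the wrong type
    {"amount": 1.0},                    # order_id is missing
]
violations = check_contract(rows, rules)
is_anomalous = volume_anomaly([1000, 1020, 990, 1010], 200)
```

In a workflow of the kind the abstract describes, violations and anomaly flags like these would feed the remediation step, with lineage deciding which downstream consumers are affected.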

Published

2025-09-30

Section

Articles

How to Cite

1. Govindarajulunaidu Sambath Narayanan DB. Enhancing Data Quality and Consistency in Large-Scale Analytical Systems through AI-Driven Engineering Workflows. IJETCSIT [Internet]. 2025 Sep. 30 [cited 2025 Dec. 17];6(3):85-93. Available from: https://www.ijetcsit.org/index.php/ijetcsit/article/view/500
