ORGANIZATION OF DATA PROTECTION DURING PROCESSING IN THE APACHE SPARK APPLICATION
DOI: https://doi.org/10.31673/2409-7292.2025.014076
Abstract
This paper examines data security in Apache Spark, focusing on access control, protection of confidential
information, and prevention of attacks at the data processing level. One of the main threats identified is data
leakage caused by misconfigured cluster access or unauthorized task execution. Attacks at the data serialization
level are a further danger, since they can exploit vulnerabilities in the mechanisms that transfer data between nodes.
Threats arising from third-party libraries, which may contain malicious code or known vulnerabilities, must also be
considered. Addressing these issues helps Apache Spark users secure their computing environments and minimize the
risk of data leakage. Secure authentication and authorization mechanisms, together with encryption of transmitted
data, significantly reduce the likelihood of unauthorized access. Security policies at the cluster configuration
level and isolation of execution environments limit the impact of potentially malicious processes. Regular
monitoring and auditing of system activity make it possible to detect and respond to suspicious actions in a timely
manner. The main threats to data security were analyzed on the basis of the problems most commonly faced by
companies and users of Apache Spark. The study examined the known vulnerabilities CVE-2023-22946, CVE-2022-31777,
CVE-2022-33891, CVE-2021-38296 and CVE-2020-9480, each of which could lead to data leakage, unauthorized code
execution, or other threats to the integrity and confidentiality of information. The analysis showed that the key
problems are typically associated with incorrect access management, insufficient validation of input data, and
vulnerabilities in request processing mechanisms. Recommendations were developed to eliminate these threats and
minimize risks: using up-to-date authentication and authorization mechanisms, regularly updating software, and
isolating work environments significantly reduces the likelihood that known vulnerabilities are exploited. In
addition, monitoring system logs and analyzing query behavior helps detect suspicious actions and respond quickly
to potential attacks.
Keywords: Hadoop, Apache Spark, HDFS, RDD, Spark cluster, AES, TLS/SSL, data security, logs, authentication,
access control.
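
The recommendations on authentication, encryption of transmitted data and access control summarized in the abstract can be illustrated with a small configuration audit. The sketch below is not taken from the paper: the helper name audit_spark_conf and the choice of which settings to require are assumptions made for illustration, while the property names themselves are standard Spark security settings. It only inspects a dictionary of configuration values and does not connect to a cluster.

```python
# Hypothetical audit helper (not from the paper). The property names are
# standard Spark security settings; treating all of them as mandatory is an
# assumption made for this illustration.
REQUIRED_SETTINGS = {
    "spark.authenticate": "true",            # shared-secret authentication between Spark processes
    "spark.network.crypto.enabled": "true",  # AES-based encryption of RPC traffic
    "spark.io.encryption.enabled": "true",   # encryption of shuffle and spill files on local disk
    "spark.ssl.enabled": "true",             # TLS for the web UIs
    "spark.acls.enable": "true",             # access control lists for viewing/modifying applications
}


def audit_spark_conf(conf):
    """Return (property, actual_value) pairs that are missing or not set as required."""
    findings = []
    for key, expected in REQUIRED_SETTINGS.items():
        actual = conf.get(key)
        # A missing property is reported with the value None.
        if actual is None or str(actual).strip().lower() != expected:
            findings.append((key, actual))
    # Sort by property name so the report is stable across runs.
    return sorted(findings, key=lambda item: item[0])


if __name__ == "__main__":
    # Example input: an illustrative, partially hardened configuration.
    example_conf = {
        "spark.authenticate": "true",
        "spark.ssl.enabled": "false",
    }
    for key, actual in audit_spark_conf(example_conf):
        print(f"{key}: expected 'true', found {actual!r}")
```

A check of this kind complements, but does not replace, regular software updates: configuration settings only help on a release in which the vulnerabilities listed in the abstract have been patched.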