Spark Optimization: A Column Recommendation System for Data Partitioning and Z-Ordering on ETL Platforms
Master thesis
Permanent lenke
https://hdl.handle.net/11250/3089848Utgivelsesdato
2023Metadata
Vis full innførselSamlinger
- Studentoppgaver (TN-IDE) [823]
Sammendrag
In this thesis, we present a solution for the challenge of optimizing the retrieval of data in Spark. Our column recommendation system is based on Spark's event logs and finds influential columns for Z-ordering and partitioning. The column recommendation system consists of four methods, each looking for different query patterns and query characteristics. From the recommendation system experiment, we managed to improve the run time by 17% compared to the baseline. This improvement demonstrates our column recommendation system's potential for optimizing data retrieval in Spark. Our system was developed on an ETL platform and is a flexible solution for ETL platforms utilizing Spark.