May 4, 2023
Thanks for your question!
While catalyst does provide optimizations in joins via joins reordering and so on, it may not always get it right. It uses statistical techniques to do so, and in case of data skew, these statistics may not always be accurate.
Further, if you have complex queries or subqueries, the catalyst may not be able to accurately gauge the steps that need to be taken and may rely on heuristics.
I believe its always better to know the data and take some steps from our side to assist the catalyst in making better optimizations. :)