Thanks for your question!

While catalyst does provide optimizations in joins via joins reordering and so on, it may not always get it right. It uses statistical techniques to do so, and in case of data skew, these statistics may not always be accurate.

Further, if you have complex queries or subqueries, the catalyst may not be able to accurately gauge the steps that need to be taken and may rely on heuristics.

I believe its always better to know the data and take some steps from our side to assist the catalyst in making better optimizations. :)

Samyukta Hariharan
Samyukta Hariharan

Written by Samyukta Hariharan

Research Engineer in AI/Data. Learning and writing about all things Data.

No responses yet