Post by Prateek Jalgaonkar
Lead Analytics Engineer @ Cigna Evernorth | Building Scalable Healthcare Analytics Systems
✨ Learning PySpark window functions as a beginner? Here's one concept that confuses a lot of people:

What's the difference between using sum() over a window and adding rowsBetween() to the window spec?

Here's the simple explanation 👇

💡 1. sum() as a window function
When you write:
sum("sales").over(window_spec)
Spark aggregates over whatever frame window_spec defines. If the spec has an orderBy and no explicit frame, the default frame runs from the start of the partition up to the current row, so you get a cumulative sum. That default is a RANGE frame, so rows that tie on the ordering column all receive the same total. If the spec has no orderBy, the frame is the whole partition and every row gets the partition total.

💡 2. rowsBetween() inside the window spec
When you add:
rowsBetween(Window.unboundedPreceding, Window.currentRow)
you are telling Spark explicitly what the frame is, counted in physical rows. This gives you control over exactly which rows are included in the calculation.

🔥 Why does this matter?
As a data engineer / analyst, window functions let you calculate:
Rolling metrics
Running totals
Moving averages
Sessionization and per-partition metrics
Trend analysis

Understanding rowsBetween gives you full control over your time-based or sequential calculations.

📘 If you're learning PySpark / SQL, remember:
👉 SUM tells Spark what to calculate.
👉 ROWS BETWEEN tells it which rows to include.

Happy learning 🔥

#PySpark #DataEngineering #BeginnerLearning #SparkWindowFunctions
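
P.S. Here is a rough sketch of what the frame actually changes. It is plain Python that only emulates the frames for a single partition, not Spark code; the equivalent PySpark expression is noted in a comment above each function. The sample data, the `store` and `day` column names, and the helper function names are all made up for illustration.

```python
# One partition, already sorted by the ordering column `day`.
# Each tuple is (day, sales). Day 2 appears twice on purpose (a tie).
rows = [(1, 10), (2, 20), (2, 30), (3, 40)]


def whole_partition(rows):
    # Emulates: F.sum("sales").over(Window.partitionBy("store"))
    # No orderBy, so the frame is the entire partition.
    total = sum(sales for _, sales in rows)
    return [total for _ in rows]


def range_to_current(rows):
    # Emulates: F.sum("sales").over(Window.partitionBy("store").orderBy("day"))
    # With orderBy and no explicit frame, the default is a RANGE frame from
    # the partition start to the current row, so tied days share one total.
    return [sum(s for d, s in rows if d <= day) for day, _ in rows]


def rows_to_current(rows):
    # Emulates the same spec plus
    # .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    # The frame is counted in physical rows, so ties get different totals.
    out, running = [], 0
    for _, sales in rows:
        running += sales
        out.append(running)
    return out


print(whole_partition(rows))   # [100, 100, 100, 100]
print(range_to_current(rows))  # [10, 60, 60, 100]
print(rows_to_current(rows))   # [10, 30, 60, 100]
```

The two day-2 rows are where the difference shows up: the default frame gives both of them 60, while the explicit rowsBetween frame gives 30 and then 60. In real Spark the order of tied rows is not guaranteed, so adding a tie-breaker column to orderBy is usually a good idea when you use a ROWS frame.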