Narrowing and widening Pandas dataframes

If you have a dataframe in wide record format, melting or stacking it can convert it to a narrower, longer table. And if you have a narrow dataframe, pivoting and unstacking it can convert it to a wider, shorter table. But when should you use stack versus melt and unstack versus pivot? - **Stack/unstack**: When you already have defined all proper (row) indexes and columns, you can move the lowest level of the columns to the lowest level row index via `stack(...)` (making tables longer & narrower) and vice versa via `unstack(...)` (making tables shorter & wider) while making sure that the values stay correctly attached to their index position. - **Pivot**: When you have loaded the raw dataframe, but still need to index and unstack the table to a wider, record-level format, use `df.pivot(columns=..., index=..., [values=...])` - **Melt**: You should rarely need this method. However, if you want to reshape the table before you set proper indices on your dataframe, `df.melt(id_vars=..., var_name='...', value_name='...')` can come in handy. Once you unstack or stack a dataframe to the point where you remove the last remaining level in the column or row index, the dataframe will "degrade" to a multi-index *series*. You can "recover" the dataframe nature by transforming the series back to a dataframe and giving the column that will represent the series values a name, by applying `.to_frame(name='...')` to the series. ## Example This is a narrow & fully indexed dataframe; Notice that is has two row index levels, *Item* and *Metal*: | | | EUR | | ----- | ---------- | ---- | | Item | Metal | | | **0** | **Bronze** | 0.31 | | | **Gold** | 2.20 | | **1** | **Gold** | 3.20 | | | **Silver** | 1.10 | And this would be an wide & indexed version of the same; It has two column index levels, *EUR* and *Metal* and introduces `NaN` values where rows were "missing" in the narrow table (e.g., a `pivot(...)` or `unstack()`-ing of the narrow table would introduce those `NaN` values): | | EUR | | | | ----- | ---------- | -------- | ---------- | | Metal | **Bronze** | **Gold** | **Silver** | | Item | | | | | **0** | 0.31 | 2.2 | NaN | | **1** | NaN | 3.2 | 1.1 | Notice how the *Metal* index level moves between the row and column index. If you pivot a dataframe that only has a single value column (like in the above example), and you are not planning on stacking the dataframe to a narrow table, then you should also explicitly set the `values` parameter in the `pivot(...)` call: If you do not, you get a top level column index with only the name of that value column (see the *EUR* top row in the example table above). That happens, because `pivot` by default prepends all value columns in the column index, just above the `columns` that you pick when calling `pivot(...)` (see the two-values example further below, too). For example, if you have one value column, pivoting without a `values` parameter gives you the above multi-indexed, wide table: ```python raw = { 'Item': [0, 0, 1, 1], 'Metal': ['Gold', 'Bronze', 'Gold', 'Silver'], 'EUR': [2.2, 0.31, 3.2, 1.1], } pd.DataFrame(raw).pivot(index='Item', columns='Metal') ``` (See the last table above for the shape of this dataframe.) To get rid of the *EUR* multi-index level (i.e., the top row in the above table), use `values` as follows: ```python pd.DataFrame(raw).pivot(index='Item', columns='Metal', values='EUR') ``` | Metal | **Bronze** | **Gold** | **Silver** | | ----- | ---------- | -------- | ---------- | | Item | | | | | **0** | 0.31 | 2.2 | NaN | | **1** | NaN | 3.2 | 1.1 | If you now `.stack()` this pivoted dataframe, you get the following series back, because there are no more column index levels left to move to the row index: ```plain Item Metal 0 Bronze 0.31 Gold 2.20 1 Gold 3.20 Silver 1.10 dtype: float64 ``` (You would end with a similar series if you were to `.unstack()` this dataframe, as there is only one row index level available to move.) To elevate such a series back to a dataframe again, you can use `.to_frame(name='...')`: ```python pd.DataFrame(raw).pivot( index='Item', columns='Metal', values='EUR' ).stack().to_frame(name='EUR') ``` | | | EUR | | ----- | ---------- | ---- | | Item | Metal | | | **0** | **Bronze** | 0.31 | | | **Gold** | 2.20 | | **1** | **Gold** | 3.20 | | | **Silver** | 1.10 | Notice that the top-level *EUR* column index has "snuck back in". If we instead look at an example where you have multiple value columns in the narrow shape, the `.pivot(..., values=...)` behavior becomes more obvious: ```python # "narrow" raw data with two value columns, EUR and USD: raw = { 'Item': [0, 0, 1, 1], 'Metal': ['Gold', 'Bronze', 'Gold', 'Silver'], 'EUR': [2.2, 0.31, 3.2, 1.1], 'USD': [2.0, 0.27, 2.7, 1.0], } pd.DataFrame(raw).pivot(index='Item', columns='Metal') ``` This gets you the following wide shape, by default tracking any remaining columns not listed in `index` or `columns` as **multi-indexed value columns** (here, *EUR* and *USD*): | | | | EUR | | | USD | | ----- | ---------- | -------- | ---------- | ---------- | -------- | ---------- | | Metal | **Bronze** | **Gold** | **Silver** | **Bronze** | **Gold** | **Silver** | | Item | | | | | | | | **0** | 0.31 | 2.2 | NaN | 0.27 | 2.0 | NaN | | **1** | NaN | 3.2 | 1.1 | NaN | 2.7 | 1.0 | And for completeness, here is what happens if you `stack()` that above table: ```python raw = { 'Item': [0, 0, 1, 1], 'Metal': ['Gold', 'Bronze', 'Gold', 'Silver'], 'EUR': [2.2, 0.31, 3.2, 1.1], 'USD': [2.0, 0.27, 2.7, 1.0], } pd.DataFrame(raw).pivot(index='Item', columns='Metal').stack() ``` | | | **EUR** | **USD** | | ----- | ---------- | ------- | ------- | | Item | Metal | | | | **0** | **Bronze** | 0.31 | 0.27 | | | **Gold** | 2.20 | 2.00 | | **1** | **Gold** | 3.20 | 2.70 | | | **Silver** | 1.10 | 1.00 | Which is a bit of a contrived example, as in this particular situation, you could directly define the relevant index to get the dataframe set up as above: ```python pd.DataFrame(raw).set_index(['Item', 'Metal']) ``` (This produces the exact same dataframe as shown in the table just above.) The key thing to take home is how `stack()` and `unstack()` can move the respective last index level (*Metal*, in this case) from the columns to the rows and vice versa: ```python pd.DataFrame(raw).set_index(['Item', 'Metal']).unstack() ``` | | | | EUR | | | USD | | ----- | ---------- | -------- | ---------- | ---------- | -------- | ---------- | | Metal | **Bronze** | **Gold** | **Silver** | **Bronze** | **Gold** | **Silver** | | Item | | | | | | | | **0** | 0.31 | 2.2 | NaN | 0.27 | 2.0 | NaN | | **1** | NaN | 3.2 | 1.1 | NaN | 2.7 | 1.0 | Therefore, after you either `set_index(...)` on your dataframes or, if necssary, `pivot(...)`-ed them to the right shape **and** index, you should only need `stack()` and `unstack()` to go between narrow and wide versions of your dataframe.