{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "convert_data_wide_to_long.ipynb", "provenance": [], "collapsed_sections": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "KGmrniFLZ_rx" }, "source": [ "# Convert Data from Wide to Long Format" ] }, { "cell_type": "markdown", "metadata": { "id": "RIC-iofnaMyn" }, "source": [ "`xlogit` processes data only in long format. However, converting data from wide to long format is a relatively easy task and `xlogit` provides a convinient tool for this purpose. In this tutorial, we'll explain the reshaping using two examples. Let's start by importing the `pandas` and `numpy` python packages." ] }, { "cell_type": "code", "metadata": { "id": "fueNeIAbv8sW" }, "source": [ "import pandas as pd\n", "import numpy as np" ], "execution_count": 1, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "w8tpAUVvBajn" }, "source": [ "In case you haven't installed `xlogit`, you can install it using `pip` as follows:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bc5G_n5yBhyL", "outputId": "3eb7d947-2d8a-4847-e1f8-ee28613b25f8" }, "source": [ "!pip install xlogit" ], "execution_count": 2, "outputs": [ { "output_type": "stream", "text": [ "Collecting xlogit\n", " Downloading https://files.pythonhosted.org/packages/60/5f/9bc576d180c366af77bc04e268536e9e34be23c52a520918aa0cb56b438e/xlogit-0.1.3-py3-none-any.whl\n", "Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.4.1)\n", "Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.19.5)\n", "Installing collected packages: xlogit\n", "Successfully installed xlogit-0.1.3\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "pHFbKUIkafhX" }, "source": [ "## Basic example" ] }, { "cell_type": "markdown", "metadata": { "id": "kpaNiP4VahUp" }, "source": [ "This example provides a quick intuition about the format conversion using `pandas`." ] }, { "cell_type": "markdown", "metadata": { "id": "bLhROdletgFE" }, "source": [ "![image.png]()" ] }, { "cell_type": "markdown", "metadata": { "id": "-Lbu6mGro4Xl" }, "source": [ "Let's create some artificial data in wide format for our example. Let's assume that our dataset contains choices among three transportation modes (i.e. car, train, and bus) for two individuals identified by `id`. We assume we collected the income of the individuals, and the time and cost of the transportation modes and saved the data in the `df_wide` dataframe as follows:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "cHVYGTFjZ4km", "outputId": "33d409ed-77e8-4816-9c4c-e47a2713ab4e" }, "source": [ "# Create example dataset in wide format\n", "df_wide = pd.DataFrame({'id': [1, 2],\n", " 'income': [4, 2],\n", " 'time_car': [2, 1],\n", " 'time_train': [2, 4],\n", " 'time_bus': [4, 5],\n", " 'cost_car': [3, 7],\n", " 'cost_train': [6, 8],\n", " 'cost_bus': [9, 2],\n", " 'choice': ['bus', 'car']\n", " })\n", "df_wide" ], "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idincometime_cartime_traintime_buscost_carcost_traincost_buschoice
014224369bus
122145782car
\n", "
" ], "text/plain": [ " id income time_car time_train ... cost_car cost_train cost_bus choice\n", "0 1 4 2 2 ... 3 6 9 bus\n", "1 2 2 1 4 ... 7 8 2 car\n", "\n", "[2 rows x 9 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 3 } ] }, { "cell_type": "markdown", "metadata": { "id": "noXqdOh0crsq" }, "source": [ "The dataframe `df_wide` is in wide format because every row represent a sample in the dataset. The variables `time` and `cost` have one column for each choice alternative, as these as alternative-specific variables. After parsing these data to long format, we want the alternative-specific variables to be in rows and not in columns; therefore, in long format every row is associated with an alternative." ] }, { "cell_type": "markdown", "metadata": { "id": "F_JmscYExU6j" }, "source": [ "To convert `df_wide` to long format we use the `wide_to_long` function provided by `xlogit`." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 237 }, "id": "Pyg0jFaBjw5p", "outputId": "e29909cd-b410-4167-e07c-440c57e116e4" }, "source": [ "from xlogit.utils import wide_to_long\n", "df_long = wide_to_long(df_wide, id_col='id', alt_list=['car', 'train', 'bus'],\n", " varying=['time', 'cost'], alt_name='alt', sep='_')\n", "\n", "df_long" ], "execution_count": 4, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idalttimecostincomechoice
01bus494bus
11car234bus
21train264bus
32bus522car
42car172car
52train482car
\n", "
" ], "text/plain": [ " id alt time cost income choice\n", "0 1 bus 4 9 4 bus\n", "1 1 car 2 3 4 bus\n", "2 1 train 2 6 4 bus\n", "3 2 bus 5 2 2 car\n", "4 2 car 1 7 2 car\n", "5 2 train 4 8 2 car" ] }, "metadata": { "tags": [] }, "execution_count": 4 } ] }, { "cell_type": "markdown", "metadata": { "id": "6WgA2bnxnTQQ" }, "source": [ "In the long format shown above, every three rows represent an individual and every row represents an alternative. An additional column called `alt` was created to indicate the alternative associated to a row, and now the variables `time` and `cost` have their values in different rows insted of columns. The rest of the variables (i.e. `id`, `choice`, and `income`) are simply repeated across rows as they are associated to the same individual. For instance, the individual with `id` equal to `1` has an `income` of `4`, which is repeated throughout the rows of this individual. This data in long format can be now processed by `xlogit`.\n", "\n", "The `wide_to_long` function takes as input the following parameters:\n", "- `dataframe`: The wide-format DataFrame.\n", "- `id_col` : Name of the column that uniquely identifies each sample.\n", "- `alt_list`: List of alternatives.\n", "- `alt_name`: Name of the alternatives column in returned dataset.\n", "- `varying`: List of column names that vary across alternatives.\n", "- `sep`: Separator of column names that vary across alternatives.\n", "- `alt_is_prefix`: True if alternative is prefix of the variable name or False if it is suffix.\n", "- `empty_val`: Value to fill when alternative not available for a certain variable.\n", "\n", "The current version of `xlogit` only supports wide to long conversion for `Pandas` dataframes. For additional details about this function, consult the documentation at: https://xlogit.readthedocs.io/en/latest/api/utils.html#xlogit.utils.wide_to_long" ] }, { "cell_type": "markdown", "metadata": { "id": "L_AvJKDSt7f3" }, "source": [ "## Admissions dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "DXCXSB73wmMw" }, "source": [ "This is a dataset used in a binary logit example in [UCLA IDRE's website](https://stats.idre.ucla.edu/stata/dae/logistic-regression/). The idea is to predict model the admission to a program based on the candidate's gre, gpa and school's rank. The outcome variable is `admit`, which is equal to 1 when the student was admited or zero otherwise. Note that we added an `id` column to uniquely identify each row.\n", "> Note that this dataset does not contain alternative-specific variables. All the explanatory variables are individual specific." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "NE08CpzBvulM", "outputId": "8cea8abf-2c8a-4009-e5f2-fa37fea3bfab" }, "source": [ "import pandas as pd\n", "\n", "df_wide = pd.read_stata(\"https://stats.idre.ucla.edu/stat/stata/dae/binary.dta\")\n", "df_wide['id'] = np.arange(len(df_wide)) # Add unique identifier column\n", "df_wide" ], "execution_count": 5, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
admitgregparankid
00.0380.03.613.00
11.0660.03.673.01
21.0800.04.001.02
31.0640.03.194.03
40.0520.02.934.04
..................
3950.0620.04.002.0395
3960.0560.03.043.0396
3970.0460.02.632.0397
3980.0700.03.652.0398
3990.0600.03.893.0399
\n", "

400 rows × 5 columns

\n", "
" ], "text/plain": [ " admit gre gpa rank id\n", "0 0.0 380.0 3.61 3.0 0\n", "1 1.0 660.0 3.67 3.0 1\n", "2 1.0 800.0 4.00 1.0 2\n", "3 1.0 640.0 3.19 4.0 3\n", "4 0.0 520.0 2.93 4.0 4\n", ".. ... ... ... ... ...\n", "395 0.0 620.0 4.00 2.0 395\n", "396 0.0 560.0 3.04 3.0 396\n", "397 0.0 460.0 2.63 2.0 397\n", "398 0.0 700.0 3.65 2.0 398\n", "399 0.0 600.0 3.89 3.0 399\n", "\n", "[400 rows x 5 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 5 } ] }, { "cell_type": "markdown", "metadata": { "id": "eWJ6qflTyoDL" }, "source": [ "Now we can proceed to convert the data to long format as follows:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "XxOGIS8pyrg_", "outputId": "ad16bf39-059b-49d3-9d31-6b77258e2f58" }, "source": [ "df = wide_to_long(df_wide, id_col='id', alt_list=[0, 1], alt_name='alt')\n", "df" ], "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idaltadmitgregparank
0000.0380.03.613.0
1010.0380.03.613.0
2101.0660.03.673.0
3111.0660.03.673.0
4201.0800.04.001.0
.....................
79539710.0460.02.632.0
79639800.0700.03.652.0
79739810.0700.03.652.0
79839900.0600.03.893.0
79939910.0600.03.893.0
\n", "

800 rows × 6 columns

\n", "
" ], "text/plain": [ " id alt admit gre gpa rank\n", "0 0 0 0.0 380.0 3.61 3.0\n", "1 0 1 0.0 380.0 3.61 3.0\n", "2 1 0 1.0 660.0 3.67 3.0\n", "3 1 1 1.0 660.0 3.67 3.0\n", "4 2 0 1.0 800.0 4.00 1.0\n", ".. ... ... ... ... ... ...\n", "795 397 1 0.0 460.0 2.63 2.0\n", "796 398 0 0.0 700.0 3.65 2.0\n", "797 398 1 0.0 700.0 3.65 2.0\n", "798 399 0 0.0 600.0 3.89 3.0\n", "799 399 1 0.0 600.0 3.89 3.0\n", "\n", "[800 rows x 6 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 6 } ] }, { "cell_type": "markdown", "metadata": { "id": "to9eEKhfFog1" }, "source": [ "Before fitting the model, let's create dummy representations of the `rank` variable to include them in the model." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "lunx66smGy5d", "outputId": "f5358725-260d-465f-df7b-dca1cfd5de3d" }, "source": [ "df['rank'] = df['rank'].astype(int) # Convert to int first for better column naming\n", "df = df.join(pd.get_dummies(df['rank'], prefix='rank'))\n", "df" ], "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idaltadmitgregparankrank_1rank_2rank_3rank_4
0000.0380.03.6130010
1010.0380.03.6130010
2101.0660.03.6730010
3111.0660.03.6730010
4201.0800.04.0011000
.................................
79539710.0460.02.6320100
79639800.0700.03.6520100
79739810.0700.03.6520100
79839900.0600.03.8930010
79939910.0600.03.8930010
\n", "

800 rows × 10 columns

\n", "
" ], "text/plain": [ " id alt admit gre gpa rank rank_1 rank_2 rank_3 rank_4\n", "0 0 0 0.0 380.0 3.61 3 0 0 1 0\n", "1 0 1 0.0 380.0 3.61 3 0 0 1 0\n", "2 1 0 1.0 660.0 3.67 3 0 0 1 0\n", "3 1 1 1.0 660.0 3.67 3 0 0 1 0\n", "4 2 0 1.0 800.0 4.00 1 1 0 0 0\n", ".. ... ... ... ... ... ... ... ... ... ...\n", "795 397 1 0.0 460.0 2.63 2 0 1 0 0\n", "796 398 0 0.0 700.0 3.65 2 0 1 0 0\n", "797 398 1 0.0 700.0 3.65 2 0 1 0 0\n", "798 399 0 0.0 600.0 3.89 3 0 0 1 0\n", "799 399 1 0.0 600.0 3.89 3 0 0 1 0\n", "\n", "[800 rows x 10 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "cell_type": "markdown", "metadata": { "id": "UOYVPLTc20up" }, "source": [ "Now we can use these data in `xlogit`, as shown below. Note that the base category for `rank`, called `rank_1`, was not included in the model." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QqKxaMwgz7bq", "outputId": "25f39957-e9f9-47b6-9ea0-2d37cb619e2f" }, "source": [ "from xlogit import MultinomialLogit\n", "\n", "model = MultinomialLogit()\n", "varnames = ['gpa', 'gre', 'rank_2', 'rank_3', 'rank_4']\n", "model.fit(X=df[varnames],\n", " y=df['admit'],\n", " varnames=varnames,\n", " isvars=varnames,\n", " ids=df['id'],\n", " alts=df['alt'],\n", " fit_intercept=True)\n", "model.summary()" ], "execution_count": 8, "outputs": [ { "output_type": "stream", "text": [ "Estimation time= 0.0 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "_intercept.1 -3.9901156 1.0668058 -3.7402454 0.00081 ***\n", "gpa.1 0.8040565 0.2998838 2.6812267 0.0224 * \n", "gre.1 0.0022645 0.0009793 2.3124578 0.0556 . \n", "rank_2.1 -0.6754427 0.3038699 -2.2228022 0.068 . \n", "rank_3.1 -1.3402258 0.3329743 -4.0250129 0.000278 ***\n", "rank_4.1 -1.5514728 0.4055016 -3.8260580 0.000591 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n", "\n", "Log-Likelihood= -229.259\n", "AIC= 470.517\n", "BIC= 494.466\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "oo903wY5JLbX" }, "source": [ "Note that the estimation results are identical to those displayed in the [original example in UCLA IDRE's website](https://stats.idre.ucla.edu/stata/dae/logistic-regression/).\n", "The `.1` suffix for the coefficient names represents the alternative that they are associated with. Given that in this case it is a binary logit, the coefficients are interpreted as being relative to the base alternative `0`." ] } ] }