Convert Data from Wide to Long Format¶

xlogit processes data only in long format. However, converting data from wide to long format is a relatively easy task and xlogit provides a convinient tool for this purpose. In this tutorial, we’ll explain the reshaping using two examples. Let’s start by importing the pandas and numpy python packages.

[1]:

import pandas as pd
import numpy as np

In case you haven’t installed xlogit, you can install it using pip as follows:

[2]:

!pip install xlogit

Collecting xlogit
  Downloading https://files.pythonhosted.org/packages/60/5f/9bc576d180c366af77bc04e268536e9e34be23c52a520918aa0cb56b438e/xlogit-0.1.3-py3-none-any.whl
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.4.1)
Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.19.5)
Installing collected packages: xlogit
Successfully installed xlogit-0.1.3

Basic example¶

This example provides a quick intuition about the format conversion using pandas.

Let’s create some artificial data in wide format for our example. Let’s assume that our dataset contains choices among three transportation modes (i.e. car, train, and bus) for two individuals identified by id. We assume we collected the income of the individuals, and the time and cost of the transportation modes and saved the data in the df_wide dataframe as follows:

[3]:

# Create example dataset in wide format
df_wide = pd.DataFrame({'id': [1, 2],
                        'income': [4, 2],
                        'time_car': [2, 1],
                        'time_train': [2, 4],
                        'time_bus': [4, 5],
                        'cost_car': [3, 7],
                        'cost_train': [6, 8],
                        'cost_bus': [9, 2],
                        'choice': ['bus', 'car']
                        })
df_wide

[3]:

	id	income	time_car	time_train	time_bus	cost_car	cost_train	cost_bus	choice
0	1	4	2	2	4	3	6	9	bus
1	2	2	1	4	5	7	8	2	car

The dataframe df_wide is in wide format because every row represent a sample in the dataset. The variables time and cost have one column for each choice alternative, as these as alternative-specific variables. After parsing these data to long format, we want the alternative-specific variables to be in rows and not in columns; therefore, in long format every row is associated with an alternative.

To convert df_wide to long format we use the wide_to_long function provided by xlogit.

[4]:

from xlogit.utils import wide_to_long
df_long = wide_to_long(df_wide, id_col='id', alt_list=['car', 'train', 'bus'],
                       varying=['time', 'cost'], alt_name='alt', sep='_')

df_long

[4]:

	id	alt	time	cost	income	choice
0	1	bus	4	9	4	bus
1	1	car	2	3	4	bus
2	1	train	2	6	4	bus
3	2	bus	5	2	2	car
4	2	car	1	7	2	car
5	2	train	4	8	2	car

In the long format shown above, every three rows represent an individual and every row represents an alternative. An additional column called alt was created to indicate the alternative associated to a row, and now the variables time and cost have their values in different rows insted of columns. The rest of the variables (i.e. id, choice, and income) are simply repeated across rows as they are associated to the same individual. For instance, the individual with id equal to 1 has an income of 4, which is repeated throughout the rows of this individual. This data in long format can be now processed by xlogit.

The wide_to_long function takes as input the following parameters: - dataframe: The wide-format DataFrame. - id_col : Name of the column that uniquely identifies each sample. - alt_list: List of alternatives. - alt_name: Name of the alternatives column in returned dataset. - varying: List of column names that vary across alternatives. - sep: Separator of column names that vary across alternatives. - alt_is_prefix: True if alternative is prefix of the variable name or False if it is suffix. - empty_val: Value to fill when alternative not available for a certain variable.

The current version of xlogit only supports wide to long conversion for Pandas dataframes. For additional details about this function, consult the documentation at: https://xlogit.readthedocs.io/en/latest/api/utils.html#xlogit.utils.wide_to_long

Admissions dataset¶

This is a dataset used in a binary logit example in UCLA IDRE’s website. The idea is to predict model the admission to a program based on the candidate’s gre, gpa and school’s rank. The outcome variable is admit, which is equal to 1 when the student was admited or zero otherwise. Note that we added an id column to uniquely identify each row. > Note that this dataset does not contain alternative-specific variables. All the explanatory variables are individual specific.

[5]:

import pandas as pd

df_wide = pd.read_stata("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta")
df_wide['id'] = np.arange(len(df_wide))  # Add unique identifier column
df_wide

[5]:

	admit	gre	gpa	rank	id
0	0.0	380.0	3.61	3.0	0
1	1.0	660.0	3.67	3.0	1
2	1.0	800.0	4.00	1.0	2
3	1.0	640.0	3.19	4.0	3
4	0.0	520.0	2.93	4.0	4
...	...	...	...	...	...
395	0.0	620.0	4.00	2.0	395
396	0.0	560.0	3.04	3.0	396
397	0.0	460.0	2.63	2.0	397
398	0.0	700.0	3.65	2.0	398
399	0.0	600.0	3.89	3.0	399

400 rows × 5 columns

Now we can proceed to convert the data to long format as follows:

[6]:

df = wide_to_long(df_wide, id_col='id', alt_list=[0, 1], alt_name='alt')
df

[6]:

	id	alt	admit	gre	gpa	rank
0	0	0	0.0	380.0	3.61	3.0
1	0	1	0.0	380.0	3.61	3.0
2	1	0	1.0	660.0	3.67	3.0
3	1	1	1.0	660.0	3.67	3.0
4	2	0	1.0	800.0	4.00	1.0
...	...	...	...	...	...	...
795	397	1	0.0	460.0	2.63	2.0
796	398	0	0.0	700.0	3.65	2.0
797	398	1	0.0	700.0	3.65	2.0
798	399	0	0.0	600.0	3.89	3.0
799	399	1	0.0	600.0	3.89	3.0

800 rows × 6 columns

Before fitting the model, let’s create dummy representations of the rank variable to include them in the model.

[7]:

df['rank'] = df['rank'].astype(int)  # Convert to int first for better column naming
df = df.join(pd.get_dummies(df['rank'], prefix='rank'))
df

[7]:

	id	alt	admit	gre	gpa	rank	rank_1	rank_2	rank_3	rank_4
0	0	0	0.0	380.0	3.61	3	0	0	1	0
1	0	1	0.0	380.0	3.61	3	0	0	1	0
2	1	0	1.0	660.0	3.67	3	0	0	1	0
3	1	1	1.0	660.0	3.67	3	0	0	1	0
4	2	0	1.0	800.0	4.00	1	1	0	0	0
...	...	...	...	...	...	...	...	...	...	...
795	397	1	0.0	460.0	2.63	2	0	1	0	0
796	398	0	0.0	700.0	3.65	2	0	1	0	0
797	398	1	0.0	700.0	3.65	2	0	1	0	0
798	399	0	0.0	600.0	3.89	3	0	0	1	0
799	399	1	0.0	600.0	3.89	3	0	0	1	0

800 rows × 10 columns

Now we can use these data in xlogit, as shown below. Note that the base category for rank, called rank_1, was not included in the model.

[8]:

from xlogit import MultinomialLogit

model = MultinomialLogit()
varnames = ['gpa', 'gre', 'rank_2', 'rank_3', 'rank_4']
model.fit(X=df[varnames],
          y=df['admit'],
          varnames=varnames,
          isvars=varnames,
          ids=df['id'],
          alts=df['alt'],
          fit_intercept=True)
model.summary()

Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
_intercept.1           -3.9901156     1.0668058    -3.7402454       0.00081 ***
gpa.1                   0.8040565     0.2998838     2.6812267        0.0224 *
gre.1                   0.0022645     0.0009793     2.3124578        0.0556 .
rank_2.1               -0.6754427     0.3038699    -2.2228022         0.068 .
rank_3.1               -1.3402258     0.3329743    -4.0250129      0.000278 ***
rank_4.1               -1.5514728     0.4055016    -3.8260580      0.000591 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -229.259
AIC= 470.517
BIC= 494.466

Note that the estimation results are identical to those displayed in the original example in UCLA IDRE’s website. The .1 suffix for the coefficient names represents the alternative that they are associated with. Given that in this case it is a binary logit, the coefficients are interpreted as being relative to the base alternative 0.