Convert Data from Wide to Long Format¶
xlogit
processes data only in long format. However, converting data from wide to long format is a relatively easy task and xlogit
provides a convinient tool for this purpose. In this tutorial, we’ll explain the reshaping using two examples. Let’s start by importing the pandas
and numpy
python packages.
[1]:
import pandas as pd
import numpy as np
In case you haven’t installed xlogit
, you can install it using pip
as follows:
[2]:
!pip install xlogit
Collecting xlogit
Downloading https://files.pythonhosted.org/packages/60/5f/9bc576d180c366af77bc04e268536e9e34be23c52a520918aa0cb56b438e/xlogit-0.1.3-py3-none-any.whl
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.4.1)
Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.19.5)
Installing collected packages: xlogit
Successfully installed xlogit-0.1.3
Basic example¶
This example provides a quick intuition about the format conversion using pandas
.
Let’s create some artificial data in wide format for our example. Let’s assume that our dataset contains choices among three transportation modes (i.e. car, train, and bus) for two individuals identified by id
. We assume we collected the income of the individuals, and the time and cost of the transportation modes and saved the data in the df_wide
dataframe as follows:
[3]:
# Create example dataset in wide format
df_wide = pd.DataFrame({'id': [1, 2],
'income': [4, 2],
'time_car': [2, 1],
'time_train': [2, 4],
'time_bus': [4, 5],
'cost_car': [3, 7],
'cost_train': [6, 8],
'cost_bus': [9, 2],
'choice': ['bus', 'car']
})
df_wide
[3]:
id | income | time_car | time_train | time_bus | cost_car | cost_train | cost_bus | choice | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 4 | 2 | 2 | 4 | 3 | 6 | 9 | bus |
1 | 2 | 2 | 1 | 4 | 5 | 7 | 8 | 2 | car |
The dataframe df_wide
is in wide format because every row represent a sample in the dataset. The variables time
and cost
have one column for each choice alternative, as these as alternative-specific variables. After parsing these data to long format, we want the alternative-specific variables to be in rows and not in columns; therefore, in long format every row is associated with an alternative.
To convert df_wide
to long format we use the wide_to_long
function provided by xlogit
.
[4]:
from xlogit.utils import wide_to_long
df_long = wide_to_long(df_wide, id_col='id', alt_list=['car', 'train', 'bus'],
varying=['time', 'cost'], alt_name='alt', sep='_')
df_long
[4]:
id | alt | time | cost | income | choice | |
---|---|---|---|---|---|---|
0 | 1 | bus | 4 | 9 | 4 | bus |
1 | 1 | car | 2 | 3 | 4 | bus |
2 | 1 | train | 2 | 6 | 4 | bus |
3 | 2 | bus | 5 | 2 | 2 | car |
4 | 2 | car | 1 | 7 | 2 | car |
5 | 2 | train | 4 | 8 | 2 | car |
In the long format shown above, every three rows represent an individual and every row represents an alternative. An additional column called alt
was created to indicate the alternative associated to a row, and now the variables time
and cost
have their values in different rows insted of columns. The rest of the variables (i.e. id
, choice
, and income
) are simply repeated across rows as they are associated to the same individual. For instance, the individual with id
equal to 1
has an income
of 4
, which is repeated throughout the rows of this individual. This data in long format can be now processed by xlogit
.
The wide_to_long
function takes as input the following parameters: - dataframe
: The wide-format DataFrame. - id_col
: Name of the column that uniquely identifies each sample. - alt_list
: List of alternatives. - alt_name
: Name of the alternatives column in returned dataset. - varying
: List of column names that vary across alternatives. - sep
: Separator of column names that vary across alternatives. - alt_is_prefix
: True if alternative is prefix of the variable name
or False if it is suffix. - empty_val
: Value to fill when alternative not available for a certain variable.
The current version of xlogit
only supports wide to long conversion for Pandas
dataframes. For additional details about this function, consult the documentation at: https://xlogit.readthedocs.io/en/latest/api/utils.html#xlogit.utils.wide_to_long
Admissions dataset¶
This is a dataset used in a binary logit example in UCLA IDRE’s website. The idea is to predict model the admission to a program based on the candidate’s gre, gpa and school’s rank. The outcome variable is admit
, which is equal to 1 when the student was admited or zero otherwise. Note that we added an id
column to uniquely identify each row. > Note that this dataset does not contain alternative-specific variables. All the
explanatory variables are individual specific.
[5]:
import pandas as pd
df_wide = pd.read_stata("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta")
df_wide['id'] = np.arange(len(df_wide)) # Add unique identifier column
df_wide
[5]:
admit | gre | gpa | rank | id | |
---|---|---|---|---|---|
0 | 0.0 | 380.0 | 3.61 | 3.0 | 0 |
1 | 1.0 | 660.0 | 3.67 | 3.0 | 1 |
2 | 1.0 | 800.0 | 4.00 | 1.0 | 2 |
3 | 1.0 | 640.0 | 3.19 | 4.0 | 3 |
4 | 0.0 | 520.0 | 2.93 | 4.0 | 4 |
... | ... | ... | ... | ... | ... |
395 | 0.0 | 620.0 | 4.00 | 2.0 | 395 |
396 | 0.0 | 560.0 | 3.04 | 3.0 | 396 |
397 | 0.0 | 460.0 | 2.63 | 2.0 | 397 |
398 | 0.0 | 700.0 | 3.65 | 2.0 | 398 |
399 | 0.0 | 600.0 | 3.89 | 3.0 | 399 |
400 rows × 5 columns
Now we can proceed to convert the data to long format as follows:
[6]:
df = wide_to_long(df_wide, id_col='id', alt_list=[0, 1], alt_name='alt')
df
[6]:
id | alt | admit | gre | gpa | rank | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0.0 | 380.0 | 3.61 | 3.0 |
1 | 0 | 1 | 0.0 | 380.0 | 3.61 | 3.0 |
2 | 1 | 0 | 1.0 | 660.0 | 3.67 | 3.0 |
3 | 1 | 1 | 1.0 | 660.0 | 3.67 | 3.0 |
4 | 2 | 0 | 1.0 | 800.0 | 4.00 | 1.0 |
... | ... | ... | ... | ... | ... | ... |
795 | 397 | 1 | 0.0 | 460.0 | 2.63 | 2.0 |
796 | 398 | 0 | 0.0 | 700.0 | 3.65 | 2.0 |
797 | 398 | 1 | 0.0 | 700.0 | 3.65 | 2.0 |
798 | 399 | 0 | 0.0 | 600.0 | 3.89 | 3.0 |
799 | 399 | 1 | 0.0 | 600.0 | 3.89 | 3.0 |
800 rows × 6 columns
Before fitting the model, let’s create dummy representations of the rank
variable to include them in the model.
[7]:
df['rank'] = df['rank'].astype(int) # Convert to int first for better column naming
df = df.join(pd.get_dummies(df['rank'], prefix='rank'))
df
[7]:
id | alt | admit | gre | gpa | rank | rank_1 | rank_2 | rank_3 | rank_4 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0.0 | 380.0 | 3.61 | 3 | 0 | 0 | 1 | 0 |
1 | 0 | 1 | 0.0 | 380.0 | 3.61 | 3 | 0 | 0 | 1 | 0 |
2 | 1 | 0 | 1.0 | 660.0 | 3.67 | 3 | 0 | 0 | 1 | 0 |
3 | 1 | 1 | 1.0 | 660.0 | 3.67 | 3 | 0 | 0 | 1 | 0 |
4 | 2 | 0 | 1.0 | 800.0 | 4.00 | 1 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
795 | 397 | 1 | 0.0 | 460.0 | 2.63 | 2 | 0 | 1 | 0 | 0 |
796 | 398 | 0 | 0.0 | 700.0 | 3.65 | 2 | 0 | 1 | 0 | 0 |
797 | 398 | 1 | 0.0 | 700.0 | 3.65 | 2 | 0 | 1 | 0 | 0 |
798 | 399 | 0 | 0.0 | 600.0 | 3.89 | 3 | 0 | 0 | 1 | 0 |
799 | 399 | 1 | 0.0 | 600.0 | 3.89 | 3 | 0 | 0 | 1 | 0 |
800 rows × 10 columns
Now we can use these data in xlogit
, as shown below. Note that the base category for rank
, called rank_1
, was not included in the model.
[8]:
from xlogit import MultinomialLogit
model = MultinomialLogit()
varnames = ['gpa', 'gre', 'rank_2', 'rank_3', 'rank_4']
model.fit(X=df[varnames],
y=df['admit'],
varnames=varnames,
isvars=varnames,
ids=df['id'],
alts=df['alt'],
fit_intercept=True)
model.summary()
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient Estimate Std.Err. z-val P>|z|
---------------------------------------------------------------------------
_intercept.1 -3.9901156 1.0668058 -3.7402454 0.00081 ***
gpa.1 0.8040565 0.2998838 2.6812267 0.0224 *
gre.1 0.0022645 0.0009793 2.3124578 0.0556 .
rank_2.1 -0.6754427 0.3038699 -2.2228022 0.068 .
rank_3.1 -1.3402258 0.3329743 -4.0250129 0.000278 ***
rank_4.1 -1.5514728 0.4055016 -3.8260580 0.000591 ***
---------------------------------------------------------------------------
Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood= -229.259
AIC= 470.517
BIC= 494.466
Note that the estimation results are identical to those displayed in the original example in UCLA IDRE’s website. The .1
suffix for the coefficient names represents the alternative that they are associated with. Given that in this case it is a binary logit, the coefficients are interpreted as being relative to the base alternative 0
.