Multinomial Logit¶
This is a step-by-step guide to estimating Multinomial Logit models using the xlogit
package. You can execute the code in this guide interactively by opening it in Google Colab.
Install xlogit package¶
Install xlogit using pip as shown below.
[ ]:
!pip install xlogit
Collecting xlogit
Downloading https://files.pythonhosted.org/packages/60/5f/9bc576d180c366af77bc04e268536e9e34be23c52a520918aa0cb56b438e/xlogit-0.1.3-py3-none-any.whl
Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.19.5)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.4.1)
Installing collected packages: xlogit
Successfully installed xlogit-0.1.3
Route Choice Dataset¶
This dataset contains the choices of 151 commuters among three home-to-work route alternatives: arterial, rural, and freeway roads. The dataset was taken from Example 13.1 of the book "Statistical and Econometric Methods for Transportation Data Analysis" (Washington et al., 2011).
Read data¶
We start by importing the data using pandas and renaming the columns of interest (choice, distance, male, and vehicle model). In addition, we create a column with the name of the alternatives and a column that uniquely identifies every observation in the dataset. Note that this dataset is in long format.
[ ]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://engineering.purdue.edu/~flm/StatEcon-Files/Ex13-1.txt",
                 sep="\t", header=None, prefix="x")
df.rename(columns={'x0': 'choice', 'x6': 'dist', 'x10': 'male', 'x14': 'vehmodel'},
          inplace=True)  # Rename columns of interest
df['alt'] = np.tile(['arterial', 'rural', 'freeway'], len(df)//3)  # Add column with alternatives
df['ids'] = np.repeat(np.arange(len(df)//3), 3)  # Add column with unique ids
df['vehage'] = 86 - df['vehmodel']  # Vehicle age (model year coded as two digits, e.g., 86 = 1986)
df
choice | x1 | x2 | x3 | x4 | x5 | dist | x7 | x8 | x9 | male | x11 | x12 | x13 | vehmodel | x15 | x16 | alt | ids | vehage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 0 | 460 | 14 | 48 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 86 | 0 | 28 | arterial | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 440 | 7 | 44 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 86 | 0 | 28 | rural | 0 | 0 |
2 | 0 | 0 | 0 | 1 | 130 | 7 | 61 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 86 | 0 | 28 | freeway | 0 | 0 |
3 | 1 | 1 | 0 | 0 | 595 | 13 | 59 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 85 | 0 | 27 | arterial | 1 | 1 |
4 | 0 | 0 | 1 | 0 | 515 | 13 | 70 | 1 | 0 | 2 | 1 | 1 | 0 | 2 | 85 | 0 | 27 | rural | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
448 | 0 | 0 | 1 | 0 | 325 | 10 | 70 | 0 | 1 | 2 | 1 | 1 | 0 | 1 | 84 | 1 | 24 | rural | 149 | 2 |
449 | 1 | 0 | 0 | 1 | 200 | 5 | 74 | 0 | 1 | 2 | 1 | 1 | 0 | 1 | 84 | 1 | 24 | freeway | 149 | 2 |
450 | 0 | 1 | 0 | 0 | 900 | 14 | 51 | 1 | 0 | 2 | 1 | 1 | 0 | 3 | 83 | 1 | 18 | arterial | 150 | 3 |
451 | 0 | 0 | 1 | 0 | 550 | 7 | 47 | 1 | 0 | 2 | 1 | 1 | 0 | 3 | 83 | 1 | 18 | rural | 150 | 3 |
452 | 1 | 0 | 0 | 1 | 220 | 7 | 64 | 1 | 0 | 2 | 1 | 1 | 0 | 3 | 83 | 1 | 18 | freeway | 150 | 3 |
453 rows × 20 columns
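Before building the specification, it can be useful to check the observed choice share of each alternative in a long-format dataset. The sketch below uses a toy frame that mirrors the structure above; the values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Toy long-format data: 3 choice situations x 3 alternatives (hypothetical values)
toy = pd.DataFrame({
    'ids':    [0, 0, 0, 1, 1, 1, 2, 2, 2],
    'alt':    ['arterial', 'rural', 'freeway'] * 3,
    'choice': [1, 0, 0, 0, 0, 1, 0, 0, 1],
})

# Share of choice situations in which each alternative was chosen
shares = toy.groupby('alt')['choice'].mean()
```

The same one-liner applied to `df` gives the market share of each route.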
Create model specification¶
The following code creates the model specification by adding columns to the dataframe to accommodate the specification needs. The newly added variables are simply products of existing variables with dummy variables for the different alternatives. Note that the specification in the code below corresponds to the following utility formulation:
\begin{equation}
\begin{array}{llllll}
V_{arterial} & = & & \beta_3 DIST_{arterial} & & \\
V_{rural} & = & \beta_1 ASC_{rural} & + \beta_4 DIST_{rural} & + \beta_6 VEHAGE_{rural} & \\
V_{freeway} & = & \beta_2 ASC_{freeway} & + \beta_5 DIST_{freeway} & + \beta_7 VEHAGE_{freeway} & + \beta_8 MALE_{freeway}
\end{array}
\end{equation}
[ ]:
# Alternative specific constants
df['asc_rural'] = np.ones(len(df)) * (df['alt'] == 'rural')
df['asc_freeway'] = np.ones(len(df)) * (df['alt'] == 'freeway')
# Distance
df['dist_arterial'] = df['dist'] * (df['alt'] == 'arterial')
df['dist_rural'] = df['dist'] * (df['alt'] == 'rural')
df['dist_freeway'] = df['dist'] * (df['alt'] == 'freeway')
# Vehicle age
df['vehage_rural'] = df['vehage'] * (df['alt'] == 'rural')
df['vehage_freeway'] = df['vehage'] * (df['alt'] == 'freeway')
# Male driver
df['male_freeway'] = df['male'] * (df['alt'] == 'freeway')
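The repetitive column products above can also be generated with a small loop. The sketch below uses a hypothetical `add_alt_interactions` helper and toy data; it is not part of xlogit.

```python
import pandas as pd

# Toy long-format frame (hypothetical values)
toy = pd.DataFrame({
    'alt':  ['arterial', 'rural', 'freeway'] * 2,
    'dist': [14, 7, 7, 13, 13, 13],
})

def add_alt_interactions(frame, var, alts):
    """Hypothetical helper: interact `var` with a dummy for each alternative in `alts`."""
    for a in alts:
        frame[f'{var}_{a}'] = frame[var] * (frame['alt'] == a)
    return frame

add_alt_interactions(toy, 'dist', ['rural', 'freeway'])
```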
Estimate model¶
After creating the model specification, we can use xlogit to estimate the model as follows:
[ ]:
from xlogit import MultinomialLogit
varnames = ['asc_rural', 'asc_freeway', 'dist_arterial', 'dist_rural',
            'dist_freeway', 'vehage_rural', 'vehage_freeway', 'male_freeway']
model = MultinomialLogit()
model.fit(X=df[varnames], y=df['choice'], varnames=varnames,
          ids=df['ids'], alts=df['alt'])
model.summary()
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient Estimate Std.Err. z-val P>|z|
---------------------------------------------------------------------------
asc_rural 2.8131275 1.1504189 2.4453072 0.0416 *
asc_freeway -2.6868817 2.2817329 -1.1775619 0.398
dist_arterial -0.1229102 0.0240053 -5.1201358 4.14e-06 ***
dist_rural -0.1773579 0.0279818 -6.3383224 1.3e-08 ***
dist_freeway -0.0956391 0.0369681 -2.5870738 0.0295 *
vehage_rural 0.1236721 0.0535597 2.3090535 0.057 .
vehage_freeway 0.2268642 0.0755401 3.0032267 0.00969 **
male_freeway 0.5990000 0.6202114 0.9657996 0.499
---------------------------------------------------------------------------
Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood= -94.440
AIC= 204.881
BIC= 229.019
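Under the Multinomial Logit model, the probability of choosing alternative j is P_j = exp(V_j) / Σ_k exp(V_k). A minimal numpy sketch of this formula follows; the utility values are purely illustrative and not computed from the estimates above.

```python
import numpy as np

# MNL choice probabilities: P_j = exp(V_j) / sum_k exp(V_k)
# Utilities for the three routes of one commuter (illustrative values)
V = np.array([-1.7, -0.3, -2.5])   # arterial, rural, freeway
P = np.exp(V) / np.exp(V).sum()
```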
Swissmetro Dataset¶
In this example, we estimate a Multinomial Logit model where each alternative has a different utility specification. The swissmetro dataset is an SP/RP survey dataset widely used in Biogeme and PyLogit examples. The dataset is available at http://transp-or.epfl.ch/data/swissmetro.dat, and Bierlaire et al. (2001) provide a detailed discussion of the data as well as its context and collection process. Note that the dataset is available in wide format; therefore, we need to convert it to long format for xlogit.
Read data¶
The dataset is imported into the Python environment using pandas. Then, records with a trip purpose other than commuting or business, as well as records with an unknown choice, are filtered out. The original dataset contains 10,729 records; after filtering, 6,768 records remain for the following analysis. Finally, a new column that uniquely identifies each sample is added to the dataframe, and the CHOICE column, which originally contains a numerical coding of the choices, is mapped to descriptions consistent with the alternatives in the column names.
[ ]:
import pandas as pd
import numpy as np
df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep='\t')
# Keep only observations for commute and business purposes that contain known choices
df_wide = df_wide[(df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0))]
df_wide['custom_id'] = np.arange(len(df_wide)) # Add unique identifier
df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})
df_wide
GROUP | SURVEY | SP | ID | PURPOSE | FIRST | TICKET | WHO | LUGGAGE | AGE | MALE | INCOME | GA | ORIGIN | DEST | TRAIN_AV | CAR_AV | SM_AV | TRAIN_TT | TRAIN_CO | TRAIN_HE | SM_TT | SM_CO | SM_HE | SM_SEATS | CAR_TT | CAR_CO | CHOICE | custom_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 112 | 48 | 120 | 63 | 52 | 20 | 0 | 117 | 65 | SM | 0 |
1 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 103 | 48 | 30 | 60 | 49 | 10 | 0 | 117 | 84 | SM | 1 |
2 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 130 | 48 | 60 | 67 | 58 | 30 | 0 | 117 | 52 | SM | 2 |
3 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 103 | 40 | 30 | 63 | 52 | 20 | 0 | 72 | 52 | SM | 3 |
4 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 130 | 36 | 60 | 63 | 42 | 20 | 0 | 90 | 84 | SM | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8446 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 108 | 13 | 30 | 50 | 17 | 30 | 0 | 130 | 64 | TRAIN | 6763 |
8447 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 108 | 12 | 30 | 53 | 16 | 10 | 0 | 80 | 80 | TRAIN | 6764 |
8448 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 108 | 16 | 60 | 50 | 16 | 20 | 0 | 80 | 64 | TRAIN | 6765 |
8449 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 128 | 16 | 30 | 53 | 17 | 30 | 0 | 80 | 104 | TRAIN | 6766 |
8450 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 108 | 13 | 60 | 53 | 21 | 30 | 0 | 100 | 80 | TRAIN | 6767 |
6768 rows × 29 columns
Reshape data¶
Given that xlogit requires the dataset in long format, we reshape it using the wide_to_long utility provided by xlogit. This function takes as input the column that uniquely identifies each sample (id_col), the name of the column where the alternatives will be saved (alt_name), the list of alternatives (alt_list), the columns that vary across alternatives (varying), and whether the alternative names are a prefix in the column names (alt_is_prefix). The wide_to_long function fills with NaN the columns that do not exist for certain alternatives (e.g., SEATS and HE). Depending on your specification needs, you can keep the NaN values or replace them; in this case we replaced them with zeros using the empty_val parameter. Additional details about the wide_to_long function can be found in xlogit's documentation.
[ ]:
from xlogit.utils import wide_to_long
df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',
                  alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,
                  varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)
df
custom_id | alt | TT | CO | HE | AV | SEATS | GROUP | SURVEY | SP | ID | PURPOSE | FIRST | TICKET | WHO | LUGGAGE | AGE | MALE | INCOME | GA | ORIGIN | DEST | CHOICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | CAR | 117 | 65 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM |
1 | 0 | SM | 63 | 52 | 20 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM |
2 | 0 | TRAIN | 112 | 48 | 120 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM |
3 | 1 | CAR | 117 | 84 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM |
4 | 1 | SM | 60 | 49 | 10 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20299 | 6766 | SM | 53 | 17 | 30 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN |
20300 | 6766 | TRAIN | 128 | 16 | 30 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN |
20301 | 6767 | CAR | 100 | 80 | 0 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN |
20302 | 6767 | SM | 53 | 21 | 30 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN |
20303 | 6767 | TRAIN | 108 | 13 | 60 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN |
20304 rows × 23 columns
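For reference, the reshape that wide_to_long performs can be mimicked with plain pandas on a toy frame. The column names below follow the dataset, but the logic is only a sketch, not xlogit's implementation.

```python
import pandas as pd

# Toy wide frame: the alternative name is a prefix in the varying columns
wide = pd.DataFrame({
    'custom_id': [0, 1],
    'TRAIN_TT': [112, 103], 'SM_TT': [63, 60], 'CAR_TT': [117, 117],
})

# Melt to long, then split the 'ALT_VAR' labels into alternative and variable
long_df = wide.melt(id_vars='custom_id')
long_df[['alt', 'var']] = long_df['variable'].str.split('_', n=1, expand=True)
long_df = (long_df.pivot_table(index=['custom_id', 'alt'], columns='var',
                               values='value').reset_index())
```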
We then scale some variables following the examples in Biogeme and PyLogit. The travel time and headway variables are converted to hours, the cost is scaled and set to zero for travellers holding an annual pass, and the new variables single_luggage, multiple_luggage, regular_class, and train_survey are created to accommodate the model specification as shown below:
[ ]:
# Scale travel time and headway variables to hours
df['time'] = df['TT'] / 60.0
df['headway'] = df['HE'] / 60.0
df['cost'] = df['CO'] / 100.0
# We set the cost as zero for individuals with an annual pass paid by employer
annual_pass = (df['alt'].isin(['TRAIN', 'SM'])) & ((df["GA"] == 1) | (df["WHO"] == 2))
df["cost"] = df["cost"] * (~annual_pass)
# Travellers carrying a single piece of luggage
df["single_luggage"] = (df["LUGGAGE"] == 1).astype(int)
# Travellers carrying multiple pieces of luggage
df["multiple_luggage"] = (df["LUGGAGE"] == 3).astype(int)
# Travellers travelling in classes other than First class
df["regular_class"] = 1 - df["FIRST"]
# Travellers who responded to the survey while on a train
df["train_survey"] = 1 - df["SURVEY"]
df
custom_id | alt | TT | CO | HE | AV | SEATS | GROUP | SURVEY | SP | ID | PURPOSE | FIRST | TICKET | WHO | LUGGAGE | AGE | MALE | INCOME | GA | ORIGIN | DEST | CHOICE | time | headway | cost | single_luggage | multiple_luggage | regular_class | train_survey | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | CAR | 117 | 65 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM | 1.950000 | 0.000000 | 0.65 | 0 | 0 | 1 | 1 |
1 | 0 | SM | 63 | 52 | 20 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM | 1.050000 | 0.333333 | 0.52 | 0 | 0 | 1 | 1 |
2 | 0 | TRAIN | 112 | 48 | 120 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM | 1.866667 | 2.000000 | 0.48 | 0 | 0 | 1 | 1 |
3 | 1 | CAR | 117 | 84 | 0 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM | 1.950000 | 0.000000 | 0.84 | 0 | 0 | 1 | 1 |
4 | 1 | SM | 60 | 49 | 10 | 1 | 0 | 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 2 | 1 | SM | 1.000000 | 0.166667 | 0.49 | 0 | 0 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20299 | 6766 | SM | 53 | 17 | 30 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN | 0.883333 | 0.500000 | 0.17 | 1 | 0 | 0 | 0 |
20300 | 6766 | TRAIN | 128 | 16 | 30 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN | 2.133333 | 0.500000 | 0.16 | 1 | 0 | 0 | 0 |
20301 | 6767 | CAR | 100 | 80 | 0 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN | 1.666667 | 0.000000 | 0.80 | 1 | 0 | 0 | 0 |
20302 | 6767 | SM | 53 | 21 | 30 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN | 0.883333 | 0.500000 | 0.21 | 1 | 0 | 0 | 0 |
20303 | 6767 | TRAIN | 108 | 13 | 60 | 1 | 0 | 3 | 1 | 1 | 939 | 3 | 1 | 7 | 3 | 1 | 5 | 1 | 2 | 0 | 1 | 2 | TRAIN | 1.800000 | 1.000000 | 0.13 | 1 | 0 | 0 | 0 |
20304 rows × 30 columns
Create model specification¶
By operating on the dataframe columns, highly flexible utility specifications can be modeled in xlogit. As shown below, alternative-specific constants or coefficients can be included in the specification by creating new columns and setting their values depending on the alternative. This flexibility even allows the specification of one coefficient per alternative or per group of alternatives.
[ ]:
# Create model specification
# Alternative Specific Constants
df['asc_train'] = np.ones(len(df))*(df['alt'] == 'TRAIN')
df['asc_sm'] = np.ones(len(df))*(df['alt'] == 'SM')
# Travel cost (One coefficient per alternative)
df['cost_train'] = df['cost']*(df['alt'] == 'TRAIN')
df['cost_sm'] = df['cost']*(df['alt'] == 'SM')
df['cost_car'] = df['cost']*(df['alt'] == 'CAR')
# Travel time (One coefficient for train and sm and other for car)
df['time_train_sm'] = df['time']*((df['alt'] == 'TRAIN') | (df['alt'] == 'SM'))
df['time_car'] = df['time']*(df['alt'] == 'CAR')
# Headway (One coefficient per alternative, except for car)
df['headway_train'] = df['headway']*(df['alt'] == 'TRAIN')
df['headway_sm'] = df['headway']*(df['alt'] == 'SM')
# Seat config (Coefficient only for swissmetro)
df['seatconf_sm'] = df['SEATS']*(df['alt'] == 'SM')
# Train survey (Coefficient for train and Swissmetro)
df['survey_train_sm'] = df['train_survey']*((df['alt'] == 'TRAIN') | (df['alt'] == 'SM'))
# Regular class (Coefficient only for train)
df['regular_class_sm'] = df['regular_class']*(df['alt'] == 'TRAIN')
# Luggage (Coefficient only for car)
df['single_lug_car'] = df['single_luggage']*(df['alt'] == 'CAR')
df['multip_lug_car'] = df['multiple_luggage']*(df['alt'] == 'CAR')
Estimate model¶
The swissmetro dataset contains unbalanced choice situations (i.e., some choice situations do not include all alternatives). The avail argument enables estimation for such datasets by taking values that indicate the availability of each alternative in every choice situation. Once the model specification is complete, the model is estimated as follows:
[ ]:
from xlogit import MultinomialLogit
varnames = ['asc_train', 'asc_sm', 'time_train_sm', 'time_car', 'cost_train',
            'cost_sm', 'cost_car', 'headway_train', 'headway_sm', 'seatconf_sm',
            'survey_train_sm', 'regular_class_sm', 'single_lug_car',
            'multip_lug_car']
model = MultinomialLogit()
model.fit(X=df[varnames],
          y=df['CHOICE'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['custom_id'],
          avail=df['AV'])
model.summary()
Estimation time= 0.1 seconds
---------------------------------------------------------------------------
Coefficient Estimate Std.Err. z-val P>|z|
---------------------------------------------------------------------------
asc_train -1.2929512 0.1237556 -10.4476139 2.43e-24 ***
asc_sm -0.5026152 0.1032927 -4.8659312 5.87e-06 ***
time_train_sm -0.6990098 0.0396510 -17.6290608 8.13e-67 ***
time_car -0.7229887 0.0442625 -16.3340968 1.18e-57 ***
cost_train -0.5618773 0.0807075 -6.9618968 2.59e-11 ***
cost_sm -0.2816843 0.0417373 -6.7489902 1.1e-10 ***
cost_car -0.5139009 0.0970406 -5.2957304 6.66e-07 ***
headway_train -0.3143519 0.0505955 -6.2130370 3.49e-09 ***
headway_sm -0.3773753 0.1652542 -2.2836046 0.0589 .
seatconf_sm -0.7824379 0.0758912 -10.3100010 9.91e-24 ***
survey_train_sm 2.5424946 0.0921336 27.5957408 1.5e-157 ***
regular_class_sm 0.5650259 0.0652226 8.6630441 4.93e-17 ***
single_lug_car 0.4227658 0.0611684 6.9115077 3.66e-11 ***
multip_lug_car 1.4141058 0.2373032 5.9590672 1.62e-08 ***
---------------------------------------------------------------------------
Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood= -5159.258
AIC= 10346.517
BIC= 10441.996
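Conceptually, availability conditions of this kind simply exclude unavailable alternatives from the choice probabilities. A minimal numpy sketch with illustrative utilities (not tied to the estimates above):

```python
import numpy as np

# Availability masking: unavailable alternatives get zero probability
V = np.array([-1.0, -0.5, -2.0])   # utilities for TRAIN, SM, CAR (illustrative)
avail = np.array([1, 1, 0])        # CAR unavailable in this choice situation

expV = np.exp(V) * avail           # zero out unavailable alternatives
P = expV / expV.sum()
```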
The estimates are identical to those provided by PyLogit (Brathwaite and Walker, 2018).
Fishing Dataset¶
The following example illustrates the estimation of a Multinomial Logit model for the sport-fishing mode choices of 1,182 individuals using xlogit. The goal is to analyze the market shares of four alternatives (beach, pier, boat, and charter) based on their cost and fish catch rate. Cameron and Trivedi (2005) provide additional details about this dataset. The following code illustrates how to use xlogit to estimate the model parameters.
Read data¶
The data can be imported into Python using any preferred method. In this example, the data in CSV format was imported using the popular pandas package. However, it is worth highlighting that xlogit does not depend on pandas, as xlogit can take any array-like structure as input. This is an additional advantage, because xlogit can be used with any preferred dataframe library, not only pandas.
[ ]:
import pandas as pd
df = pd.read_csv("https://raw.github.com/arteagac/xlogit/master/examples/data/fishing_long.csv")
df
id | alt | choice | income | price | catch | |
---|---|---|---|---|---|---|
0 | 1 | beach | 0 | 7083.33170 | 157.930 | 0.0678 |
1 | 1 | boat | 0 | 7083.33170 | 157.930 | 0.2601 |
2 | 1 | charter | 1 | 7083.33170 | 182.930 | 0.5391 |
3 | 1 | pier | 0 | 7083.33170 | 157.930 | 0.0503 |
4 | 2 | beach | 0 | 1249.99980 | 15.114 | 0.1049 |
... | ... | ... | ... | ... | ... | ... |
4723 | 1181 | pier | 0 | 416.66668 | 36.636 | 0.4522 |
4724 | 1182 | beach | 0 | 6250.00130 | 339.890 | 0.2537 |
4725 | 1182 | boat | 1 | 6250.00130 | 235.436 | 0.6817 |
4726 | 1182 | charter | 0 | 6250.00130 | 260.436 | 2.3014 |
4727 | 1182 | pier | 0 | 6250.00130 | 339.890 | 0.1498 |
4728 rows × 6 columns
Estimate the model¶
Once the data is in the Python environment, xlogit can be used to fit the model, as shown below. The MultinomialLogit class is imported from xlogit, and its constructor is used to initialize a new model. The fit method estimates the model using the input data and the estimation criteria provided as arguments. The arguments of the fit method are described in xlogit's documentation (https://xlogit.readthedocs.io/en/latest/api/).
[ ]:
from xlogit import MultinomialLogit
varnames = ['income', 'price', 'catch']
model = MultinomialLogit()
model.fit(X=df[varnames],
          y=df['choice'],
          varnames=varnames,
          isvars=['income'],
          ids=df['id'],
          alts=df['alt'],
          fit_intercept=True)
model.summary()
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient Estimate Std.Err. z-val P>|z|
---------------------------------------------------------------------------
_intercept.boat 0.5273413 0.2017396 2.6139700 0.0264 *
_intercept.charter 1.6943827 0.2096186 8.0831712 1.2e-14 ***
_intercept.pier 0.7779899 0.2051062 3.7931076 0.000622 ***
income.boat 0.0000894 0.0000473 1.8906643 0.134
income.charter -0.0000333 0.0000480 -0.6936462 0.627
income.pier -0.0001276 0.0000467 -2.7335425 0.0192 *
price -0.0251161 0.0015240 -16.4800784 5.89e-54 ***
catch 0.3578374 0.0985344 3.6315992 0.00113 **
---------------------------------------------------------------------------
Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood= -1215.138
AIC= 2446.275
BIC= 2486.875
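The isvars argument tells xlogit that income is an individual-specific variable (it does not vary across alternatives), so it receives one coefficient per alternative relative to a base alternative. Conceptually this is equivalent to the manual column expansion used in the earlier examples; a sketch on toy data with hypothetical values:

```python
import pandas as pd

# Toy long-format frame for two anglers (income values hypothetical)
toy = pd.DataFrame({
    'alt':    ['beach', 'boat', 'charter', 'pier'] * 2,
    'income': [7083.3] * 4 + [1250.0] * 4,
})

# Expand the individual-specific variable: one column (and hence one
# coefficient) per alternative, with 'beach' as the base alternative
for a in ['boat', 'charter', 'pier']:
    toy[f'income_{a}'] = toy['income'] * (toy['alt'] == a)
```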
Heating Dataset¶
For this example, we use the Heating dataset from R's mlogit package (Croissant 2020), which contains the choices of heating systems of 900 California households along with several explanatory variables. More information about the dataset can be found at https://cran.r-project.org/web/packages/mlogit/vignettes/e1mlogit.html.
Read data¶
[ ]:
import pandas as pd
import numpy as np
df_wide = pd.read_csv("https://raw.github.com/arteagac/xlogit/master/examples/data/heating_wide.csv")
df_wide
idcase | depvar | ic.gc | ic.gr | ic.ec | ic.er | ic.hp | oc.gc | oc.gr | oc.ec | oc.er | oc.hp | income | agehed | rooms | region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | gc | 866.00 | 962.64 | 859.90 | 995.76 | 1135.50 | 199.69 | 151.72 | 553.34 | 505.60 | 237.88 | 7 | 25 | 6 | ncostl |
1 | 2 | gc | 727.93 | 758.89 | 796.82 | 894.69 | 968.90 | 168.66 | 168.66 | 520.24 | 486.49 | 199.19 | 5 | 60 | 5 | scostl |
2 | 3 | gc | 599.48 | 783.05 | 719.86 | 900.11 | 1048.30 | 165.58 | 137.80 | 439.06 | 404.74 | 171.47 | 4 | 65 | 2 | ncostl |
3 | 4 | er | 835.17 | 793.06 | 761.25 | 831.04 | 1048.70 | 180.88 | 147.14 | 483.00 | 425.22 | 222.95 | 2 | 50 | 4 | scostl |
4 | 5 | er | 755.59 | 846.29 | 858.86 | 985.64 | 883.05 | 174.91 | 138.90 | 404.41 | 389.52 | 178.49 | 2 | 25 | 6 | valley |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
895 | 896 | gc | 766.39 | 877.71 | 751.59 | 869.78 | 942.70 | 142.61 | 136.21 | 474.48 | 420.65 | 203.00 | 6 | 20 | 4 | mountn |
896 | 897 | gc | 1128.50 | 1167.80 | 1047.60 | 1292.60 | 1297.10 | 207.40 | 213.77 | 705.36 | 551.61 | 243.76 | 7 | 45 | 7 | scostl |
897 | 898 | gc | 787.10 | 1055.20 | 842.79 | 1041.30 | 1064.80 | 175.05 | 141.63 | 478.86 | 448.61 | 254.51 | 5 | 60 | 7 | scostl |
898 | 899 | gc | 860.56 | 1081.30 | 799.76 | 1123.20 | 1218.20 | 211.04 | 151.31 | 495.20 | 401.56 | 246.48 | 5 | 50 | 6 | scostl |
899 | 900 | gc | 893.94 | 1119.90 | 967.88 | 1091.70 | 1387.50 | 175.80 | 180.11 | 518.68 | 458.53 | 245.13 | 2 | 65 | 4 | ncostl |
900 rows × 16 columns
Reshape data¶
The dataset is available in wide format. Since xlogit requires the data in long format, we convert it as shown below:
[ ]:
from xlogit.utils import wide_to_long
df = wide_to_long(df_wide, id_col='idcase', alt_name='alt', varying=['ic', 'oc'],
                  alt_list=['ec', 'er', 'gc', 'gr', 'hp'], sep='.', empty_val=0)
df
idcase | alt | ic | oc | depvar | income | agehed | rooms | region | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | ec | 859.90 | 553.34 | gc | 7 | 25 | 6 | ncostl |
1 | 1 | er | 995.76 | 505.60 | gc | 7 | 25 | 6 | ncostl |
2 | 1 | gc | 866.00 | 199.69 | gc | 7 | 25 | 6 | ncostl |
3 | 1 | gr | 962.64 | 151.72 | gc | 7 | 25 | 6 | ncostl |
4 | 1 | hp | 1135.50 | 237.88 | gc | 7 | 25 | 6 | ncostl |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4495 | 900 | ec | 967.88 | 518.68 | gc | 2 | 65 | 4 | ncostl |
4496 | 900 | er | 1091.70 | 458.53 | gc | 2 | 65 | 4 | ncostl |
4497 | 900 | gc | 893.94 | 175.80 | gc | 2 | 65 | 4 | ncostl |
4498 | 900 | gr | 1119.90 | 180.11 | gc | 2 | 65 | 4 | ncostl |
4499 | 900 | hp | 1387.50 | 245.13 | gc | 2 | 65 | 4 | ncostl |
4500 rows × 9 columns
Estimate the model¶
We now import MultinomialLogit from xlogit and estimate the model as shown below:
[ ]:
from xlogit import MultinomialLogit
varnames = ['ic', 'oc']
model = MultinomialLogit()
model.fit(X=df[varnames],
          y=df['depvar'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['idcase'])
model.summary()
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient Estimate Std.Err. z-val P>|z|
---------------------------------------------------------------------------
ic -0.0062318 0.0003516 -17.7222802 2.16e-59 ***
oc -0.0045800 0.0003208 -14.2767999 9.14e-41 ***
---------------------------------------------------------------------------
Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood= -1095.237
AIC= 2194.474
BIC= 2204.079
Note that these results are identical to the ones estimated by the R mlogit package: https://cran.r-project.org/web/packages/mlogit/vignettes/e1mlogit.html
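As a closing check, the reported fit statistics can be reproduced from the log-likelihood using AIC = 2K - 2LL and BIC = K ln(N) - 2LL, where K is the number of estimated parameters and N the number of choice situations:

```python
import numpy as np

# Reported values for the Heating model: K = 2 parameters, N = 900 choice situations
LL, K, N = -1095.237, 2, 900
AIC = 2 * K - 2 * LL
BIC = K * np.log(N) - 2 * LL
print(round(AIC, 3), round(BIC, 3))  # 2194.474 2204.079
```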
References¶
Bierlaire, M. (2018). PandasBiogeme: a short introduction. EPFL (Transport and Mobility Laboratory, ENAC).
Brathwaite, T., & Walker, J. L. (2018). Asymmetric, closed-form, finite-parameter models of multinomial choice. Journal of Choice Modelling, 29, 78–112.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.
Croissant, Y. (2020). Estimation of Random Utility Models in R: The mlogit Package. Journal of Statistical Software, 95(1), 1-41.
Washington, S., Karlaftis, M., Mannering, F., & Anastasopoulos, P. (2020). Statistical and econometric methods for transportation data analysis. Chapman and Hall/CRC.