Multinomial Logit

This is a step-by-step guide on how to estimate Multinomial Logit models using the xlogit package. You can interactively execute the code in this guide by opening it Google Colab using the following link:

Open In Colab

Install xlogit package

Install xlogit using pip as shown below.

[ ]:
!pip install xlogit
Collecting xlogit
  Downloading https://files.pythonhosted.org/packages/60/5f/9bc576d180c366af77bc04e268536e9e34be23c52a520918aa0cb56b438e/xlogit-0.1.3-py3-none-any.whl
Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.19.5)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.4.1)
Installing collected packages: xlogit
Successfully installed xlogit-0.1.3

Route Choice Dataset

This dataset contains choices of 151 commuters among three Home-to-work route alternatives. The three alternatives are arterial, rural, and freeway roads. This dataset was taken from Example 13.1 of the book “Statistical and econometric methods for transportation data analysis” (Washintong et. al., 2011).

Read data

We start by importing the data using pandas and renaming the columns of interest (choice, distance, male, and vehicle model). In addition, we create a column with the name of the alternatives and a column that uniquely identifies every observation in the dataset. Note that this dataset is long format.

[ ]:
import pandas as pd
import numpy as np
df = pd.read_csv("https://engineering.purdue.edu/~flm/StatEcon-Files/Ex13-1.txt",
                 sep="\t", header=None, prefix="x")
df.rename(columns={'x0': 'choice', 'x6': 'dist', 'x10': 'male', 'x14': 'vehmodel'},
          inplace=True)  # Rename columns of interest
df['alt'] = np.tile(['arterial', 'rural', 'freeway'], len(df)//3)  # Add column with alternatives
df['ids'] = np.repeat(np.arange(len(df)//3), 3)  # Add column with unique ids
df['vehage'] = 86 - df['vehmodel']
df
choice x1 x2 x3 x4 x5 dist x7 x8 x9 male x11 x12 x13 vehmodel x15 x16 alt ids vehage
0 1 1 0 0 460 14 48 0 0 2 0 1 0 1 86 0 28 arterial 0 0
1 0 0 1 0 440 7 44 0 0 2 0 1 0 1 86 0 28 rural 0 0
2 0 0 0 1 130 7 61 0 0 2 0 1 0 1 86 0 28 freeway 0 0
3 1 1 0 0 595 13 59 1 0 2 1 1 0 2 85 0 27 arterial 1 1
4 0 0 1 0 515 13 70 1 0 2 1 1 0 2 85 0 27 rural 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
448 0 0 1 0 325 10 70 0 1 2 1 1 0 1 84 1 24 rural 149 2
449 1 0 0 1 200 5 74 0 1 2 1 1 0 1 84 1 24 freeway 149 2
450 0 1 0 0 900 14 51 1 0 2 1 1 0 3 83 1 18 arterial 150 3
451 0 0 1 0 550 7 47 1 0 2 1 1 0 3 83 1 18 rural 150 3
452 1 0 0 1 220 7 64 1 0 2 1 1 0 3 83 1 18 freeway 150 3

453 rows × 20 columns

Create model specification

The following code creates the model specification by including additional variables in the dataframe to accomodate the specification needs. The newly added variables are simply the product of existing variables with dummy variables for the different alternatives. Note that the specification in the code below corresponds to the following utility maximization formulation:

\begin{equation} \begin{array}{} V_{arterial} & = & \quad \beta_3DIST_{arterial} \\ V_{rural} & = \beta_1ASC_{rural} & + \beta_4DIST_{rural} & + \beta_6VEHAGE_{rural} \\ V_{freeway} & = \beta_2ASC_{freew} & + \beta_5DIST_{freeway} & + \beta_7{VEHAGE_{freeway}} & + \beta_8MALE_{freeway} \end{array} \end{equation}

[ ]:
# Alternative specific constants
df['asc_rural'] = np.ones(len(df)) * (df['alt'] == 'rural')
df['asc_freeway'] = np.ones(len(df)) * (df['alt'] == 'freeway')

# Distance
df['dist_arterial'] = df['dist'] * (df['alt'] == 'arterial')
df['dist_rural'] = df['dist'] * (df['alt'] == 'rural')
df['dist_freeway'] = df['dist'] * (df['alt'] == 'freeway')

# Vehicle age
df['vehage_rural'] = df['vehage'] * (df['alt'] == 'rural')
df['vehage_freeway'] = df['vehage'] * (df['alt'] == 'freeway')

# Male driver
df['male_freeway'] = df['male'] * (df['alt'] == 'freeway')

Estimate model

After creating the model specification, we can use xlogit to estimate the model as follows:

[ ]:
from xlogit import MultinomialLogit
varnames=['asc_rural', 'asc_freeway', 'dist_arter', 'dist_rural',
          'dist_freew', 'vehage_rural', 'vehage_freew', 'male_freew']
model = MultinomialLogit()
model.fit(X=df[varnames], y=df['choice'], varnames=varnames,
          ids=df['ids'], alts=df['alt'])
model.summary()
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
asc_rural               2.8131275     1.1504189     2.4453072        0.0416 *
asc_freeway            -2.6868817     2.2817329    -1.1775619         0.398
dist_arterial          -0.1229102     0.0240053    -5.1201358      4.14e-06 ***
dist_rural             -0.1773579     0.0279818    -6.3383224       1.3e-08 ***
dist_freeway           -0.0956391     0.0369681    -2.5870738        0.0295 *
vehage_rural            0.1236721     0.0535597     2.3090535         0.057 .
vehage_freeway          0.2268642     0.0755401     3.0032267       0.00969 **
male_freeway            0.5990000     0.6202114     0.9657996         0.499
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -94.440
AIC= 204.881
BIC= 229.019

Swissmetro Dataset

In this example, we will estimate a Multinomial Logit where each alternative is defined with a different utility specification. The swissmetro dataset is an SP/RP survey dataset popularly used in Biogeme and Pylogit examples. The dataset is available at http://transp-or.epfl.ch/data/swissmetro.dat and Bierlaire et. al., (2001) provides a detailed discussion of the data as wells as its context and collection process. . Note that the dataset is available in wide format; therefore, we need to convert it to long format for xlogit.

Read data

The dataset is imported to the Python environment using pandas. Then, two types of samples, ones with a trip purpose different to commute or business and ones with an unknown choice, are filtered out. The original dataset contains 10,729 records, but after filtering, 6,768 records remain for following analysis. Finally, a new column that uniquely identifies each sample is added to the dataframe and the CHOICE column, which originally contains a numerical coding of the choices, is mapped to a description that is consistent with the alternatives in the column names.

[ ]:
import pandas as pd
import numpy as np

df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep='\t')

# Keep only observations for commute and business purposes that contain known choices
df_wide = df_wide[(df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0))]

df_wide['custom_id'] = np.arange(len(df_wide))  # Add unique identifier
df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})
df_wide
GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE MALE INCOME GA ORIGIN DEST TRAIN_AV CAR_AV SM_AV TRAIN_TT TRAIN_CO TRAIN_HE SM_TT SM_CO SM_HE SM_SEATS CAR_TT CAR_CO CHOICE custom_id
0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 112 48 120 63 52 20 0 117 65 SM 0
1 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 103 48 30 60 49 10 0 117 84 SM 1
2 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 130 48 60 67 58 30 0 117 52 SM 2
3 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 103 40 30 63 52 20 0 72 52 SM 3
4 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 130 36 60 63 42 20 0 90 84 SM 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8446 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 1 1 1 108 13 30 50 17 30 0 130 64 TRAIN 6763
8447 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 1 1 1 108 12 30 53 16 10 0 80 80 TRAIN 6764
8448 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 1 1 1 108 16 60 50 16 20 0 80 64 TRAIN 6765
8449 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 1 1 1 128 16 30 53 17 30 0 80 104 TRAIN 6766
8450 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 1 1 1 108 13 60 53 21 30 0 100 80 TRAIN 6767

6768 rows × 29 columns

Reshape data

Given that xlogit requires the dataset to be provided in the long format, we reshape the dataset using the wide_to_long utility provided by xlogit. This function takes as input the column that uniquely identifies each sample (id_col), the name of column to save the alternatives (alt_name), the list of alternatives (alt_list), the columns that vary across alternatives (varying), and whether the alternative names are prefix in the column names (alt_is_prefix).The wide_to_long method fills with NaN the columns that do not have certain alternatives (e.g. SEATS and HE). Depending on your specification needs, you can ignore the NaN or replace them with zeros. In this case we replaced them with zeros using the empty_val parameter. Additional details about the wide_to_long function can be found in the xlogit’s documentation.

[ ]:
from xlogit.utils import wide_to_long

df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',
                  alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,
                  varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)
df
custom_id alt TT CO HE AV SEATS GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE MALE INCOME GA ORIGIN DEST CHOICE
0 0 CAR 117 65 0 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM
1 0 SM 63 52 20 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM
2 0 TRAIN 112 48 120 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM
3 1 CAR 117 84 0 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM
4 1 SM 60 49 10 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20299 6766 SM 53 17 30 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN
20300 6766 TRAIN 128 16 30 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN
20301 6767 CAR 100 80 0 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN
20302 6767 SM 53 21 30 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN
20303 6767 TRAIN 108 13 60 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN

20304 rows × 23 columns

We then scale some variables as per the examples in Biogeme and Pylogit. The time and headway variables are converted to hours, the price is scaled, and the new variables single_luggage, free ticket, multiple_luggage, regular_class and train_survey are created to accommodate the model specification as shown below:

[ ]:
# Scale travel time and headway variables to hours
df['time'] = df['TT'] / 60.0
df['headway'] = df['HE'] / 60.0
df['cost'] = df['CO'] / 100.0

# We set the cost as zero for individuals with an annual pass paid by employer
annual_pass = (df['alt'].isin(['TRAIN', 'SM'])) & ((df["GA"] == 1) | (df["WHO"] == 2))
df["cost"] = df["cost"] * (~annual_pass)

#Travellers carrying only single luggage
df["single_luggage"] = (df["LUGGAGE"] == 1).astype(int)

#Travellers carrying more than one luggage
df["multiple_luggage"] = (df["LUGGAGE"] == 3).astype(int)

# Travellers travelling in classes other than First class
df["regular_class"] = 1 - df["FIRST"]

# Travellers who responded to the survey while on a train
df["train_survey"] = 1 - df["SURVEY"]
df
custom_id alt TT CO HE AV SEATS GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE MALE INCOME GA ORIGIN DEST CHOICE time headway cost single_luggage multiple_luggage regular_class train_survey
0 0 CAR 117 65 0 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM 1.950000 0.000000 0.65 0 0 1 1
1 0 SM 63 52 20 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM 1.050000 0.333333 0.52 0 0 1 1
2 0 TRAIN 112 48 120 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM 1.866667 2.000000 0.48 0 0 1 1
3 1 CAR 117 84 0 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM 1.950000 0.000000 0.84 0 0 1 1
4 1 SM 60 49 10 1 0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 SM 1.000000 0.166667 0.49 0 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20299 6766 SM 53 17 30 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN 0.883333 0.500000 0.17 1 0 0 0
20300 6766 TRAIN 128 16 30 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN 2.133333 0.500000 0.16 1 0 0 0
20301 6767 CAR 100 80 0 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN 1.666667 0.000000 0.80 1 0 0 0
20302 6767 SM 53 21 30 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN 0.883333 0.500000 0.21 1 0 0 0
20303 6767 TRAIN 108 13 60 1 0 3 1 1 939 3 1 7 3 1 5 1 2 0 1 2 TRAIN 1.800000 1.000000 0.13 1 0 0 0

20304 rows × 30 columns

Create model specification

By operating the dataframe columns, highly-flexible utility specifications can be modeled in xlogit. As shown below, alternative specific constants or coefficients can be included in the specification by strategically creating new columns and setting their values depending on the alternative. This flexibility allows even the specification of one or multiple coefficients per alternative or group of alternatives.

[ ]:
# Create model specification
# Alternative Specific Constants
df['asc_train'] = np.ones(len(df))*(df['alt'] == 'TRAIN')
df['asc_sm'] = np.ones(len(df))*(df['alt'] == 'SM')

# Travel cost (One coefficient per alternative)
df['cost_train'] = df['cost']*(df['alt'] == 'TRAIN')
df['cost_sm'] = df['cost']*(df['alt'] == 'SM')
df['cost_car'] = df['cost']*(df['alt'] == 'CAR')

# Travel time (One coefficient for train and sm and other for car)
df['time_train_sm'] = df['time']*((df['alt'] == 'TRAIN') | (df['alt'] == 'SM'))
df['time_car'] = df['time']*(df['alt'] == 'CAR')

# Headway (One coefficient per alternative, except for car)
df['headway_train'] = df['headway']*(df['alt'] == 'TRAIN')
df['headway_sm'] = df['headway']*(df['alt'] == 'SM')

# Seat config (Coefficient only for swissmetro)
df['seatconf_sm'] = df['SEATS']*(df['alt'] == 'SM')

# Train Survey (Coefficient only for swissmetro)
df['survey_train_sm'] = df['train_survey']* ((df['alt'] == 'TRAIN') | (df['alt'] == 'SM'))

# Regular class (Coefficient only for swissmetro)
df['regular_class_sm'] = df['regular_class']*(df['alt'] == 'TRAIN')

# Luggage (Coefficient only for car)
df['single_lug_car'] = df['single_luggage']*(df['alt'] == 'CAR')
df['multip_lug_car'] = df['multiple_luggage']*(df['alt'] == 'CAR')

Estimate model

The swissmetro dataset contains unbalanced choice situations across individuals (i.e., some individuals do not have observations for all alternatives). The avail option enables estimation for such datasets. avail takes the values that indicate the availability of each alternative across individuals. Once the model specification is complete, the model is estimated as follows:

[ ]:
from xlogit import MultinomialLogit

varnames=['asc_train', 'asc_sm', 'time_train_sm', 'time_car', 'cost_train',
          'cost_sm', 'cost_car', 'headway_train', 'headway_sm', 'seatconf_sm',
          'survey_train_sm', 'regular_class_sm', 'single_lug_car',
          'multip_lug_car']
model = MultinomialLogit()
model.fit(X=df[varnames],
          y=df['CHOICE'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['custom_id'],
          avail=df['AV'])
model.summary()
Estimation time= 0.1 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
asc_train              -1.2929512     0.1237556   -10.4476139      2.43e-24 ***
asc_sm                 -0.5026152     0.1032927    -4.8659312      5.87e-06 ***
time_train_sm          -0.6990098     0.0396510   -17.6290608      8.13e-67 ***
time_car               -0.7229887     0.0442625   -16.3340968      1.18e-57 ***
cost_train             -0.5618773     0.0807075    -6.9618968      2.59e-11 ***
cost_sm                -0.2816843     0.0417373    -6.7489902       1.1e-10 ***
cost_car               -0.5139009     0.0970406    -5.2957304      6.66e-07 ***
headway_train          -0.3143519     0.0505955    -6.2130370      3.49e-09 ***
headway_sm             -0.3773753     0.1652542    -2.2836046        0.0589 .
seatconf_sm            -0.7824379     0.0758912   -10.3100010      9.91e-24 ***
survey_train_sm         2.5424946     0.0921336    27.5957408      1.5e-157 ***
regular_class_sm        0.5650259     0.0652226     8.6630441      4.93e-17 ***
single_lug_car          0.4227658     0.0611684     6.9115077      3.66e-11 ***
multip_lug_car          1.4141058     0.2373032     5.9590672      1.62e-08 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -5159.258
AIC= 10346.517
BIC= 10441.996

The estimates are identical to those provided by PyLogit (Brathwaite and Walker 2018) as shown in this link

Fishing Dataset

The following example illustrates the estimation of a Multinomial Logit model for choices of 1,182 individuals for sport fishing modes using xlogit. The goal is to analyze the market shares of four alternatives (i.e., beach, pier, boat, and charter) based on their cost and fish catch rate. Cameron (2005) provides additional details about this dataset. The following code illustrates how to use xlogit to estimate the model parameters.

Read data

The data to be analyzed can be imported to Python using any preferred method. In this example, the data in CSV format was imported using the popular pandas Python package. However, it is worth highlighting that xlogit does not depend on the pandas package, as xlogit can take any array-like structure as input. This represents an additional advantage because xlogit can be used with any preferred dataframe library, and not only with pandas.

[ ]:
import pandas as pd
df = pd.read_csv("https://raw.github.com/arteagac/xlogit/master/examples/data/fishing_long.csv")
df
id alt choice income price catch
0 1 beach 0 7083.33170 157.930 0.0678
1 1 boat 0 7083.33170 157.930 0.2601
2 1 charter 1 7083.33170 182.930 0.5391
3 1 pier 0 7083.33170 157.930 0.0503
4 2 beach 0 1249.99980 15.114 0.1049
... ... ... ... ... ... ...
4723 1181 pier 0 416.66668 36.636 0.4522
4724 1182 beach 0 6250.00130 339.890 0.2537
4725 1182 boat 1 6250.00130 235.436 0.6817
4726 1182 charter 0 6250.00130 260.436 2.3014
4727 1182 pier 0 6250.00130 339.890 0.1498

4728 rows × 6 columns

Estimate the model

Once the data is in the Python environment, xlogit can be used to fit the model, as shown below. The MultinomialLogit class is imported from xlogit, and its constructor is used to initialize a new model. The fit method estimates the model using the input data and estimation criteria provided as arguments to the method’s call. The arguments of the fit methods are described in `xlogit’s documentation <https://https://xlogit.readthedocs.io/en/latest/api/>`__.

[ ]:
from xlogit import MultinomialLogit

varnames = ['income','price', 'catch']
model = MultinomialLogit()
model.fit(X=df[varnames],
          y=df['choice'],
          varnames=varnames,
          isvars=['income'],
          ids=df['id'],
          alts=df['alt'],
          fit_intercept=True)
model.summary()
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
_intercept.boat         0.5273413     0.2017396     2.6139700        0.0264 *
_intercept.charter      1.6943827     0.2096186     8.0831712       1.2e-14 ***
_intercept.pier         0.7779899     0.2051062     3.7931076      0.000622 ***
income.boat             0.0000894     0.0000473     1.8906643         0.134
income.charter         -0.0000333     0.0000480    -0.6936462         0.627
income.pier            -0.0001276     0.0000467    -2.7335425        0.0192 *
price                  -0.0251161     0.0015240   -16.4800784      5.89e-54 ***
catch                   0.3578374     0.0985344     3.6315992       0.00113 **
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -1215.138
AIC= 2446.275
BIC= 2486.875

Heating Dataset

For this example, we use the Heating dataset from R’s mlogit package (Croissant 2020), which contains choice of heating systems in California house. The dataset contains 90 observations, with 8 explanatory variables and more information can be found in https://cran.r-project.org/web/packages/mlogit/vignettes/e1mlogit.html.

Read data

[ ]:
import pandas as pd
import numpy as np

df_wide = pd.read_csv("https://raw.github.com/arteagac/xlogit/master/examples/data/heating_wide.csv")
df_wide
idcase depvar ic.gc ic.gr ic.ec ic.er ic.hp oc.gc oc.gr oc.ec oc.er oc.hp income agehed rooms region
0 1 gc 866.00 962.64 859.90 995.76 1135.50 199.69 151.72 553.34 505.60 237.88 7 25 6 ncostl
1 2 gc 727.93 758.89 796.82 894.69 968.90 168.66 168.66 520.24 486.49 199.19 5 60 5 scostl
2 3 gc 599.48 783.05 719.86 900.11 1048.30 165.58 137.80 439.06 404.74 171.47 4 65 2 ncostl
3 4 er 835.17 793.06 761.25 831.04 1048.70 180.88 147.14 483.00 425.22 222.95 2 50 4 scostl
4 5 er 755.59 846.29 858.86 985.64 883.05 174.91 138.90 404.41 389.52 178.49 2 25 6 valley
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
895 896 gc 766.39 877.71 751.59 869.78 942.70 142.61 136.21 474.48 420.65 203.00 6 20 4 mountn
896 897 gc 1128.50 1167.80 1047.60 1292.60 1297.10 207.40 213.77 705.36 551.61 243.76 7 45 7 scostl
897 898 gc 787.10 1055.20 842.79 1041.30 1064.80 175.05 141.63 478.86 448.61 254.51 5 60 7 scostl
898 899 gc 860.56 1081.30 799.76 1123.20 1218.20 211.04 151.31 495.20 401.56 246.48 5 50 6 scostl
899 900 gc 893.94 1119.90 967.88 1091.70 1387.50 175.80 180.11 518.68 458.53 245.13 2 65 4 ncostl

900 rows × 16 columns

Reshape data

The dataset is available in wide format. Since xlogit requires the data in long format, we convert it as shown below:

[ ]:
from xlogit.utils import wide_to_long
df = wide_to_long(df_wide, id_col='idcase', alt_name='alt', varying=['ic', 'oc'],
                  alt_list=['ec', 'er', 'gc', 'gr', 'hp'], sep='.', empty_val=0)
df
idcase alt ic oc depvar income agehed rooms region
0 1 ec 859.90 553.34 gc 7 25 6 ncostl
1 1 er 995.76 505.60 gc 7 25 6 ncostl
2 1 gc 866.00 199.69 gc 7 25 6 ncostl
3 1 gr 962.64 151.72 gc 7 25 6 ncostl
4 1 hp 1135.50 237.88 gc 7 25 6 ncostl
... ... ... ... ... ... ... ... ... ...
4495 900 ec 967.88 518.68 gc 2 65 4 ncostl
4496 900 er 1091.70 458.53 gc 2 65 4 ncostl
4497 900 gc 893.94 175.80 gc 2 65 4 ncostl
4498 900 gr 1119.90 180.11 gc 2 65 4 ncostl
4499 900 hp 1387.50 245.13 gc 2 65 4 ncostl

4500 rows × 9 columns

Estimate the model

We now import MultinomialLogit from xlogit and estimate the model as shown below:

[ ]:
from xlogit import MultinomialLogit

varnames = ['ic', 'oc']
model = MultinomialLogit()
model.fit(X=df[varnames],
          y=df['depvar'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['idcase'])
model.summary()
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
ic                     -0.0062318     0.0003516   -17.7222802      2.16e-59 ***
oc                     -0.0045800     0.0003208   -14.2767999      9.14e-41 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -1095.237
AIC= 2194.474
BIC= 2204.079

Note that these results are identical to the ones estimated by the R mlogit package: https://cran.r-project.org/web/packages/mlogit/vignettes/e1mlogit.html

References

  • Bierlaire, M. (2018). PandasBiogeme: a short introduction. EPFL (Transport and Mobility Laboratory, ENAC).

  • Brathwaite, T., & Walker, J. L. (2018). Asymmetric, closed-form, finite-parameter models of multinomial choice. Journal of Choice Modelling, 29, 78–112.

  • Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.

  • Croissant, Y. (2020). Estimation of Random Utility Models in R: The mlogit Package. Journal of Statistical Software, 95(1), 1-41.

  • Washington, S., Karlaftis, M., Mannering, F., & Anastasopoulos, P. (2020). Statistical and econometric methods for transportation data analysis. Chapman and Hall/CRC.