Mixed Logit¶

The following examples provide step-by-step instructions to estimate mixed logit models using the xlogit package. You can interactively execute the code in this guide by opening it in Google Colab using the following link:

Install and import `xlogit` package¶

Install xlogit using pip as shown below. In addition, import the package and check if GPU processing is available.

[1]:

!pip install xlogit
from xlogit import MixedLogit
MixedLogit.check_if_gpu_available()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xlogit
  Downloading xlogit-0.2.0-py3-none-any.whl (35 kB)
Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.21.6)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.4.1)
Installing collected packages: xlogit
Successfully installed xlogit-0.2.0
1 GPU device(s) available. xlogit will use GPU processing

[1]:

True

Swissmetro Dataset¶

The swissmetro dataset contains stated preferences for three alternative transportation modes: car, train, and a newly introduced mode, the swissmetro. This dataset is commonly used for estimation examples with the Biogeme and PyLogit packages. The dataset is available at http://transp-or.epfl.ch/data/swissmetro.dat and Bierlaire et al. (2001) provide a detailed discussion of the data as well as its context and collection process. The explanatory variables in this example include the travel time (TT) and cost (CO) for each of the three alternative modes.

Read data¶

The dataset is imported to the Python environment using pandas. Then, two types of samples are filtered out: those with a trip purpose other than commute or business, and those with an unknown choice. The original dataset contains 10,729 records, but after filtering, 6,768 records remain for the following analysis. Finally, a new column that uniquely identifies each sample is added to the dataframe, and the CHOICE column, which originally contains a numerical coding of the choices, is mapped to a description that is consistent with the alternatives in the column names.

[2]:

import pandas as pd
import numpy as np

df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep='\t')

# Keep only observations for commute and business purposes that contain known choices
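# (.copy() makes the filtered frame independent, so the column assignments below do not trigger SettingWithCopyWarning)
df_wide = df_wide[df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0)].copy()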

df_wide['custom_id'] = np.arange(len(df_wide))  # Add unique identifier
df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})
df_wide

[2]:

	GROUP	SURVEY	SP	ID	PURPOSE	FIRST	TICKET	WHO	LUGGAGE	AGE	...	TRAIN_CO	TRAIN_HE	SM_TT	SM_CO	SM_HE	SM_SEATS	CAR_TT	CAR_CO	CHOICE	custom_id
0	2	0	1	1	1	0	1	1	0	3	...	48	120	63	52	20	0	117	65	SM	0
1	2	0	1	1	1	0	1	1	0	3	...	48	30	60	49	10	0	117	84	SM	1
2	2	0	1	1	1	0	1	1	0	3	...	48	60	67	58	30	0	117	52	SM	2
3	2	0	1	1	1	0	1	1	0	3	...	40	30	63	52	20	0	72	52	SM	3
4	2	0	1	1	1	0	1	1	0	3	...	36	60	63	42	20	0	90	84	SM	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8446	3	1	1	939	3	1	7	3	1	5	...	13	30	50	17	30	0	130	64	TRAIN	6763
8447	3	1	1	939	3	1	7	3	1	5	...	12	30	53	16	10	0	80	80	TRAIN	6764
8448	3	1	1	939	3	1	7	3	1	5	...	16	60	50	16	20	0	80	64	TRAIN	6765
8449	3	1	1	939	3	1	7	3	1	5	...	16	30	53	17	30	0	80	104	TRAIN	6766
8450	3	1	1	939	3	1	7	3	1	5	...	13	60	53	21	30	0	100	80	TRAIN	6767

6768 rows × 29 columns

Reshape data¶

The imported dataframe is in wide format, and it needs to be reshaped to long format for processing by xlogit, which offers the convenient wide_to_long utility for this reshaping process. The user needs to specify the column that uniquely identifies each sample, the names of the alternatives, the columns that vary across alternatives, and whether the alternative names are a prefix or suffix of the column names. Additionally, the user can specify a value (empty_val) to be used by default when an alternative is not available for a certain variable. Additional usage examples for the wide_to_long function are available in xlogit’s documentation at https://xlogit.readthedocs.io/en/latest/notebooks/convert_data_wide_to_long.html. Also, details about the function parameters are available at the API reference.

[3]:

from xlogit.utils import wide_to_long

df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',
                  alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,
                  varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)
df

[3]:

	custom_id	alt	TT	CO	HE	AV	SEATS	GROUP	SURVEY	SP	...	TICKET	WHO	LUGGAGE	AGE	MALE	INCOME	GA	ORIGIN	DEST	CHOICE
0	0	TRAIN	112	48	120	1	0	2	0	1	...	1	1	0	3	0	2	0	2	1	SM
1	0	SM	63	52	20	1	0	2	0	1	...	1	1	0	3	0	2	0	2	1	SM
2	0	CAR	117	65	0	1	0	2	0	1	...	1	1	0	3	0	2	0	2	1	SM
3	1	TRAIN	103	48	30	1	0	2	0	1	...	1	1	0	3	0	2	0	2	1	SM
4	1	SM	60	49	10	1	0	2	0	1	...	1	1	0	3	0	2	0	2	1	SM
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20299	6766	SM	53	17	30	1	0	3	1	1	...	7	3	1	5	1	2	0	1	2	TRAIN
20300	6766	CAR	80	104	0	1	0	3	1	1	...	7	3	1	5	1	2	0	1	2	TRAIN
20301	6767	TRAIN	108	13	60	1	0	3	1	1	...	7	3	1	5	1	2	0	1	2	TRAIN
20302	6767	SM	53	21	30	1	0	3	1	1	...	7	3	1	5	1	2	0	1	2	TRAIN
20303	6767	CAR	100	80	0	1	0	3	1	1	...	7	3	1	5	1	2	0	1	2	TRAIN

20304 rows × 23 columns
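
To make the reshaping step concrete, the following sketch reproduces the same idea with plain pandas on a tiny frame. It is only an illustration of what a wide-to-long conversion does, not xlogit's implementation; the frame `wide` holds the TT and CO values of the first two samples shown above, and the variable names are chosen here for the example.

```python
import pandas as pd

# Tiny wide-format frame: one row per sample, one column per (alternative, variable)
wide = pd.DataFrame({
    "custom_id": [0, 1],
    "TRAIN_TT": [112, 103], "SM_TT": [63, 60], "CAR_TT": [117, 117],
    "TRAIN_CO": [48, 48], "SM_CO": [52, 49], "CAR_CO": [65, 84],
    "CHOICE": ["SM", "SM"],
})

alts = ["TRAIN", "SM", "CAR"]
varying = ["TT", "CO"]

pieces = []
for alt in alts:
    # Select this alternative's columns and strip the "ALT_" prefix
    cols = {f"{alt}_{v}": v for v in varying}
    piece = wide[["custom_id", "CHOICE", *cols]].rename(columns=cols)
    piece.insert(1, "alt", alt)  # record which alternative the row describes
    pieces.append(piece)

# Stack the per-alternative pieces; a stable sort keeps the TRAIN, SM, CAR order
long = pd.concat(pieces, ignore_index=True)
long = long.sort_values("custom_id", kind="mergesort").reset_index(drop=True)
print(long)
```

Each sample now occupies one row per alternative, which mirrors the layout of the reshaped dataframe above.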

Create model specification¶

Following the reshaping, users can create or update the dataset’s columns to accommodate their model specification needs. The code below shows how the columns ASC_TRAIN and ASC_CAR were created to incorporate alternative-specific constants in the model. In addition, the example illustrates an effective way of establishing variable interactions (e.g., trip costs for commuters with an annual pass) by updating existing columns conditional on the values of other columns. Although apparently simple, column operations provide users with an intuitive and highly flexible mechanism to incorporate model specification aspects, such as variable transformations, interactions, and alternative-specific coefficients and constants. As this example shows, a wide range of utility specifications can be accommodated in xlogit by operating on the dataframe columns.

[4]:

df['ASC_TRAIN'] = np.ones(len(df))*(df['alt'] == 'TRAIN')
df['ASC_CAR'] = np.ones(len(df))*(df['alt'] == 'CAR')
df['TT'], df['CO'] = df['TT']/100, df['CO']/100  # Scale variables
annual_pass = (df['GA'] == 1) & (df['alt'].isin(['TRAIN', 'SM']))
df.loc[annual_pass, 'CO'] = 0  # Cost zero for pass holders
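
The same column-operation pattern extends to alternative-specific coefficients, which are mentioned above but not used in this specification. The sketch below is a hypothetical illustration on a toy frame (the column names TT_TRAIN, TT_SM, and TT_CAR are invented here): each new column equals TT for rows of one alternative and zero elsewhere, so including the new columns in varnames would yield one travel-time coefficient per alternative.

```python
import pandas as pd

# Toy long-format frame: two samples, three alternatives each
toy = pd.DataFrame({
    "alt": ["TRAIN", "SM", "CAR"] * 2,
    "TT": [1.12, 0.63, 1.17, 1.03, 0.60, 1.17],
})

# One travel-time column per alternative, nonzero only on that alternative's rows
for alt in ["TRAIN", "SM", "CAR"]:
    toy[f"TT_{alt}"] = toy["TT"] * (toy["alt"] == alt)

print(toy)
```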

Estimate model parameters¶

The fit method estimates the model by taking as input the data from the previous step along with additional specification criteria, such as the distribution of the random parameters (randvars), the number of random draws (n_draws), and the availability of alternatives for the choice situations (avail). We set the optimization method as L-BFGS-B as this is a robust routine that usually helps solve convergence issues. Once the estimation routine is completed, the summary method can be used to display the estimation results.

[5]:

from xlogit import MixedLogit
varnames=['ASC_CAR', 'ASC_TRAIN', 'CO', 'TT']
model = MixedLogit()
model.fit(X=df[varnames], y=df['CHOICE'], varnames=varnames,
          alts=df['alt'], ids=df['custom_id'], avail=df['AV'],
          panels=df["ID"], randvars={'TT': 'n'}, n_draws=1500,
          optim_method='L-BFGS-B')
model.summary()

GPU processing enabled.
Optimization terminated successfully.
    Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
    Iterations: 14
    Function evaluations: 15
Estimation time= 17.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
ASC_CAR                 0.2831085     0.0560480     5.0511797      2.35e-06 ***
ASC_TRAIN              -0.5722790     0.0794780    -7.2004737      4.84e-12 ***
CO                     -1.6601703     0.0778870   -21.3151016      2.52e-96 ***
TT                     -3.2289850     0.1749807   -18.4533828      5.58e-73 ***
sd.TT                   3.6485337     0.1683459    21.6728359      1.88e-99 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -4359.218
AIC= 8728.436
BIC= 8762.536

The negative signs for the cost and time coefficients suggest that decision makers experience a general disutility from alternatives that have higher travel times and costs, which conforms to the underlying decision-making theory. Note that these estimates are highly consistent with those returned by Biogeme (https://biogeme.epfl.ch/examples/swissmetro/05normalMixtureIntegral.html).
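
Because TT was specified with a normally distributed coefficient, the estimated mean (TT) and standard deviation (sd.TT) also imply that part of the population has a positive time coefficient. The sketch below works through that arithmetic using only Python's standard library and the estimates printed in the summary; it is a back-of-the-envelope check, not an xlogit feature.

```python
from statistics import NormalDist

# Estimates copied from the summary above (rows TT and sd.TT)
tt_mean, tt_sd = -3.2289850, 3.6485337

# Implied distribution of the travel-time coefficient across decision makers
tt_coef = NormalDist(mu=tt_mean, sigma=tt_sd)

# Share of the population whose coefficient is positive, P(beta_TT > 0)
share_positive = 1 - tt_coef.cdf(0)
print(f"Share with a positive TT coefficient: {share_positive:.3f}")
```

The share comes out a little under one fifth. This is a consequence of the normal distribution's unbounded support; a distribution restricted to one sign is a common alternative when a positive time coefficient is considered implausible.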

Electricity Dataset¶

The electricity dataset contains 4,308 choices among four electricity suppliers based on the attributes of the offered plans, which include prices (pf), contract lengths (cl), time-of-day rates (tod), and seasonal rates (seas), as well as attributes of the suppliers, which include whether the supplier is local (loc) and well known (wk). The data was collected through a survey where 12 different choice situations were presented to each participant. The multiple responses per participant were organized into panels. Given that some participants answered fewer than 12 of the choice situations, some panels are unbalanced, which xlogit is able to handle. Revelt and Train (1999) provide a detailed description of this dataset.

Read data¶

The dataset is already in long format, so no reshaping is necessary and it can be used directly in xlogit.

[6]:

import pandas as pd

df = pd.read_csv("https://raw.github.com/arteagac/xlogit/master/examples/data/electricity_long.csv")
df

[6]:

	choice	id	alt	pf	cl	loc	wk	tod	seas	chid
0	0	1	1	7	5	0	1	0	0	1
1	0	1	2	9	1	1	0	0	0	1
2	0	1	3	0	0	0	0	0	1	1
3	1	1	4	0	5	0	1	1	0	1
4	0	1	1	7	0	0	1	0	0	2
...	...	...	...	...	...	...	...	...	...	...
17227	0	361	4	0	1	1	0	0	1	4307
17228	1	361	1	9	0	0	1	0	0	4308
17229	0	361	2	7	0	0	0	0	0	4308
17230	0	361	3	0	1	0	1	0	1	4308
17231	0	361	4	0	5	1	0	1	0	4308

17232 rows × 10 columns

Fit the model¶

Note that the panels parameter is passed to the fit method so that the panel structure of this dataset is taken into account during estimation.

[7]:

from xlogit import MixedLogit

varnames = ['pf', 'cl', 'loc', 'wk', 'tod', 'seas']
model = MixedLogit()
model.fit(X=df[varnames],
          y=df['choice'],
          varnames=varnames,
          ids=df['chid'],
          panels=df['id'],
          alts=df['alt'],
          n_draws=600,
          randvars={'pf': 'n', 'cl': 'n', 'loc': 'n',
                    'wk': 'n', 'tod': 'n', 'seas': 'n'})
model.summary()

GPU processing enabled.
Optimization terminated successfully.
    Message: The gradients are close to zero
    Iterations: 25
    Function evaluations: 27
Estimation time= 3.5 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
pf                     -0.9972102     0.0361699   -27.5701452      7.2e-153 ***
cl                     -0.2196812     0.0143577   -15.3006166      2.44e-50 ***
loc                     2.2901807     0.0891457    25.6903002     3.37e-134 ***
wk                      1.6943247     0.0719012    23.5646304     2.87e-114 ***
tod                    -9.6752279     0.3117174   -31.0384572     1.15e-189 ***
seas                   -9.6961836     0.3111496   -31.1624480     4.94e-191 ***
sd.pf                   0.2207255     0.0126686    17.4230731      1.54e-64 ***
sd.cl                   0.4115547     0.0200678    20.5082053       5.5e-88 ***
sd.loc                  1.7840255     0.0974403    18.3089162      6.13e-71 ***
sd.wk                   1.2296227     0.0838150    14.6706716      1.93e-46 ***
sd.tod                  2.2757059     0.1257114    18.1026193      2.01e-69 ***
sd.seas                 1.4862206     0.1316147    11.2922079      4.06e-28 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -3888.465
AIC= 7800.930
BIC= 7877.349

The xlogit estimates are similar to those obtained with R’s mlogit package (https://cran.r-project.org/web/packages/mlogit/vignettes/e3mxlogit.html). With GPU-enabled estimation, xlogit estimates the model in less than 10 seconds, significantly faster than open-source packages such as mlogit and pylogit. This speed can be beneficial when fitting models for large datasets with multiple explanatory variables that have random coefficients.
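
To illustrate why the panel structure matters, the following NumPy sketch simulates the choice probability of a single hypothetical individual with and without panel treatment. It shows the general simulated-likelihood idea for panel mixed logit under made-up data and parameters (`x`, `chosen`, `mean`, and `sd` are assumptions for this example), and it is not xlogit's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1 individual, 2 choice situations, 3 alternatives, 1 attribute
x = np.array([[1.0, 0.5, 2.0],
              [0.8, 1.5, 0.3]])        # shape (situations, alternatives)
chosen = np.array([1, 2])              # index of the chosen alternative per situation

mean, sd = -1.0, 0.5                   # assumed parameters of the normal coefficient
n_draws = 1000
betas = mean + sd * rng.standard_normal(n_draws)   # one coefficient per draw

# Utilities and logit probabilities for every draw: shape (draws, situations, alternatives)
v = betas[:, None, None] * x[None, :, :]
expv = np.exp(v - v.max(axis=2, keepdims=True))    # numerically stable softmax
probs = expv / expv.sum(axis=2, keepdims=True)

# Probability of the chosen alternative in each situation, per draw
p_chosen = probs[:, np.arange(len(chosen)), chosen]   # shape (draws, situations)

# Panel: multiply across the individual's situations within each draw, then average
panel_prob = p_chosen.prod(axis=1).mean()
# No panel: average each situation over the draws separately, then multiply
cross_section_prob = p_chosen.mean(axis=0).prod()

print(panel_prob, cross_section_prob)
```

The two numbers differ because, with panels, the same draw of the coefficient is shared by all of an individual's choice situations, which captures the correlation between that individual's responses.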

Fishing Dataset¶

This example illustrates the estimation of a mixed logit model with xlogit for the sport fishing mode choices of 1,182 individuals. The goal is to analyze the market shares of four alternatives (i.e., beach, pier, boat, and charter) based on their cost and fish catch rate. Cameron and Trivedi (2005) provide additional details about this dataset. The following code illustrates how to use xlogit to estimate the model parameters.

Read data¶

The data to be analyzed can be imported to Python using any preferred method. In this example, the data in CSV format was imported using the popular pandas Python package. However, it is worth highlighting that xlogit does not depend on the pandas package, as xlogit can take any array-like structure as input. This represents an additional advantage because xlogit can be used with any preferred dataframe library, and not only with pandas.

[8]:

import pandas as pd
df = pd.read_csv("https://raw.github.com/arteagac/xlogit/master/examples/data/fishing_long.csv")
df

[8]:

	id	alt	choice	income	price	catch
0	1	beach	0	7083.33170	157.930	0.0678
1	1	boat	0	7083.33170	157.930	0.2601
2	1	charter	1	7083.33170	182.930	0.5391
3	1	pier	0	7083.33170	157.930	0.0503
4	2	beach	0	1249.99980	15.114	0.1049
...	...	...	...	...	...	...
4723	1181	pier	0	416.66668	36.636	0.4522
4724	1182	beach	0	6250.00130	339.890	0.2537
4725	1182	boat	1	6250.00130	235.436	0.6817
4726	1182	charter	0	6250.00130	260.436	2.3014
4727	1182	pier	0	6250.00130	339.890	0.1498

4728 rows × 6 columns

Fit model¶

Once the data is in the Python environment, xlogit can be used to fit the model, as shown below. The MixedLogit class is imported from xlogit, and its constructor is used to initialize a new model. The fit method estimates the model using the input data and estimation criteria provided as arguments to the method’s call. The arguments of the fit method are described in xlogit’s documentation (https://xlogit.readthedocs.io/en/latest/api/).

[9]:

from xlogit import MixedLogit
varnames = ['price', 'catch']
model = MixedLogit()
model.fit(X=df[varnames],
          y=df['choice'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['id'],
          n_draws=1000,
          randvars={'price': 'n', 'catch': 'n'})
model.summary()

GPU processing enabled.
Optimization terminated successfully.
    Message: The gradients are close to zero
    Iterations: 19
    Function evaluations: 31
Estimation time= 1.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
price                  -0.0272460     0.0024982   -10.9062938      1.86e-25 ***
catch                   1.3271150     0.1724869     7.6940046      2.23e-13 ***
sd.price                0.0102130     0.0022025     4.6370078      1.87e-05 ***
sd.catch               -1.5706825     0.5732798    -2.7398182        0.0189 *
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -1300.511
AIC= 2609.023
BIC= 2629.323
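
The information criteria in the summary can be checked by hand. The sketch below assumes the standard definitions, AIC = 2k - 2LL and BIC = k ln(n) - 2LL, with k the number of estimated parameters and n the number of choice observations; xlogit's own formula is not shown in this guide, so treat this as a consistency check rather than a description of its internals.

```python
import math

log_lik = -1300.511   # Log-Likelihood printed in the summary above
n_params = 4          # price, catch, sd.price, sd.catch
n_obs = 1182          # individuals in the dataset, one choice each

aic = 2 * n_params - 2 * log_lik
bic = n_params * math.log(n_obs) - 2 * log_lik
print(f"AIC = {aic:.3f}, BIC = {bic:.3f}")
```

Both values agree with the summary up to the rounding of the printed log-likelihood.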

Prediction¶

xlogit also provides a convenient set of post-estimation tools for prediction or forecasting. The predict function uses estimated parameters and a new or updated dataset to compute predicted choices. By including the return_proba and return_freq parameters in the function’s call, users can also obtain the predicted probabilities and frequency of the chosen alternatives. The following code illustrates the prediction functionality to forecast changes in market shares (choice frequency) for fishing modes caused by an increase in price for the “boat” mode. First, base market shares are computed by running predict on the original dataset. Then, an increase of 20% in the price for the “boat” alternative is applied to the dataset and the updated shares are predicted.

[10]:

choices, freq = model.predict(X=df[varnames], varnames=varnames, ids=df['id'],
                              alts=df['alt'], return_freq=True)
print(f"base: {freq}")

df.loc[df['alt']=='boat', 'price'] *= 1.2  # 20 percent price increase
choices, freq = model.predict(X=df[varnames], varnames=varnames, ids=df['id'],
                              alts=df['alt'], return_freq=True)
print(f"updated: {freq}")

GPU processing enabled.
base: {'beach': 0.223, 'boat': 0.461, 'charter': 0.228, 'pier': 0.089}
GPU processing enabled.
updated: {'beach': 0.238, 'boat': 0.379, 'charter': 0.278, 'pier': 0.105}

The output shows that the 20% price increase would reduce the market share of the “boat” alternative by about 8 percentage points (from 46.1% to 37.9%), with the lost share redistributed among the other three alternatives.
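
The change in shares can be tabulated directly from the two printed dictionaries. This short sketch uses only the values shown in the output above and distinguishes the change in percentage points from the relative change.

```python
# Market shares copied from the output above
base = {'beach': 0.223, 'boat': 0.461, 'charter': 0.228, 'pier': 0.089}
updated = {'beach': 0.238, 'boat': 0.379, 'charter': 0.278, 'pier': 0.105}

# Change in percentage points and relative change (percent of the base share)
change_pp = {alt: (updated[alt] - base[alt]) * 100 for alt in base}
change_rel = {alt: (updated[alt] / base[alt] - 1) * 100 for alt in base}

for alt in base:
    print(f"{alt:8s} {change_pp[alt]:+6.1f} pp  {change_rel[alt]:+6.1f} %")
```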

Car Dataset¶

The fourth example uses a stated-preference panel dataset on car choice. Three alternatives are considered, with up to 6 choice situations per individual. This is again an unbalanced panel, as some individuals responded to fewer than 6 situations. The dataset contains 8 explanatory variables: price, operating cost, range, and binary indicators for whether the car is electric, gas, or hybrid, and whether its performance is high or medium. This dataset was taken from Kenneth Train’s MATLAB codes for estimation of mixed logit models, available at https://eml.berkeley.edu/Software/abstracts/train1006mxlmsl.html

Read data¶

[11]:

import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.github.com/arteagac/xlogit/master/examples/data/car100_long.csv")

Since price and operating cost need to be estimated with negative coefficients, we reverse the variable signs in the dataframe.

[12]:

df['price'] = -df['price']/10000
df['opcost'] = -df['opcost']
df

[12]:

	person_id	choice_id	alt	choice	price	opcost	range	ev	gas	hybrid	hiperf	medhiperf
0	1	1	1	0	-4.6763	-47.43	0.0	0	0	1	0	0
1	1	1	2	1	-5.7209	-27.43	1.3	1	0	0	1	1
2	1	1	3	0	-8.7960	-32.41	1.2	1	0	0	0	1
3	1	2	1	1	-3.3768	-4.89	1.3	1	0	0	1	1
4	1	2	2	0	-9.0336	-30.19	0.0	0	0	1	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...
4447	100	1483	2	0	-2.8036	-14.45	1.6	1	0	0	0	0
4448	100	1483	3	0	-1.9360	-54.76	0.0	0	1	0	1	1
4449	100	1484	1	1	-2.4054	-50.57	0.0	0	1	0	0	0
4450	100	1484	2	0	-5.2795	-21.25	0.0	0	0	1	0	1
4451	100	1484	3	0	-6.0705	-25.41	1.4	1	0	0	0	0

4452 rows × 12 columns

Fit the model¶

[13]:

from xlogit import MixedLogit

varnames = ['hiperf', 'medhiperf', 'price', 'opcost', 'range', 'ev', 'hybrid']
model = MixedLogit()
model.fit(X=df[varnames],
          y=df['choice'],
          varnames=varnames,
          alts=df['alt'],
          ids=df['choice_id'],
          panels=df['person_id'],
          randvars = {'price': 'ln', 'opcost': 'n',
                      'range': 'ln', 'ev':'n', 'hybrid': 'n'},
          n_draws = 100)
model.summary()

GPU processing enabled.
Optimization terminated successfully.
    Message: The gradients are close to zero
    Iterations: 31
    Function evaluations: 34
Estimation time= 1.4 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
hiperf                  0.1058410     0.0971974     1.0889290         0.441
medhiperf               0.5604997     0.0977352     5.7348796      6.81e-08 ***
price                  -0.7871346     0.1048151    -7.5097475      7.49e-13 ***
opcost                  0.0110846     0.0041762     2.6542027        0.0237 *
range                  -0.6857957     0.4362769    -1.5719276         0.232
ev                     -1.5574339     0.3250817    -4.7908996      8.97e-06 ***
hybrid                  0.6883966     0.1467451     4.6911049      1.43e-05 ***
sd.price                0.8666443     0.0926631     9.3526350      2.72e-19 ***
sd.opcost               0.0401550     0.0047037     8.5369142      2.77e-16 ***
sd.range               -0.5527011     0.2453684    -2.2525353        0.0633 .
sd.ev                   0.9139887     0.2038702     4.4831898      3.66e-05 ***
sd.hybrid               0.7099238     0.1513756     4.6898169      1.44e-05 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -1302.683
AIC= 2629.366
BIC= 2692.995
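
For the coefficients specified as log-normal ('ln'), the reported values are harder to read directly. The sketch below assumes that the 'ln' option models the coefficient as the exponential of a normal variable and that the summary reports the mean and standard deviation of that underlying normal; this guide does not state that explicitly, so check xlogit's documentation before relying on it. Under that assumption, the median and mean of the price coefficient follow from the standard log-normal formulas.

```python
import math

# 'price' and 'sd.price' from the summary above, read as parameters
# of the underlying normal distribution (an assumption, see the text)
mu, sigma = -0.7871346, 0.8666443

median_coef = math.exp(mu)                  # median of a log-normal
mean_coef = math.exp(mu + sigma**2 / 2)     # mean of a log-normal
print(f"median = {median_coef:.3f}, mean = {mean_coef:.3f}")
```

A log-normal coefficient is always positive, which is why the sign of the price column was reversed before estimation: a positive coefficient on the negated price corresponds to a negative effect of price on utility.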

References¶

Bierlaire, M. (2018). PandasBiogeme: a short introduction. EPFL (Transport and Mobility Laboratory, ENAC).
Brathwaite, T., & Walker, J. L. (2018). Asymmetric, closed-form, finite-parameter models of multinomial choice. Journal of Choice Modelling, 29, 78–112.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.
Croissant, Y. (2020). Estimation of Random Utility Models in R: The mlogit Package. Journal of Statistical Software, 95(1), 1-41.
Revelt, D., & Train, K. (1999). Customer-specific taste parameters and mixed logit. University of California, Berkeley.
