{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "mixed_logit_model.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "display_name": "Python 3 (Spyder)", "language": "python3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "TJHlxbR5kEe-" }, "source": [ "# Mixed Logit" ] }, { "cell_type": "markdown", "metadata": { "id": "qXO6ZtU_F2b4" }, "source": [ "The following examples provide step-by-step instructions to estimate mixed logit models using the xlogit package. You can interactively execute the code in this guide by opening it Google Colab using the following link:\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arteagac/xlogit/blob/master/examples/mixed_logit_model.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "mra0NiIOFSie" }, "source": [ "## Install and import `xlogit` package" ] }, { "cell_type": "markdown", "metadata": { "id": "rvLSzJP1GfP1" }, "source": [ "Install `xlogit` using `pip` as shown below. In addition, import the package and check if GPU processing is available." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NQbZt7CVh8f_", "outputId": "b823e80f-fd47-4dd1-8656-3fd0d6a1e26a" }, "source": [ "!pip install xlogit\n", "from xlogit import MixedLogit\n", "MixedLogit.check_if_gpu_available()" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Collecting xlogit\n", " Downloading xlogit-0.2.0-py3-none-any.whl (35 kB)\n", "Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.21.6)\n", "Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from xlogit) (1.4.1)\n", "Installing collected packages: xlogit\n", "Successfully installed xlogit-0.2.0\n", "1 GPU device(s) available. xlogit will use GPU processing\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "True" ] }, "metadata": {}, "execution_count": 1 } ] }, { "cell_type": "markdown", "metadata": { "id": "MP77ezqVfvRI" }, "source": [ "## Swissmetro Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "BOWB3Lffg5Qc" }, "source": [ "\n", "The swissmetro dataset contains stated-preferences for three alternative transportation modes that include car, train and a newly introduced mode: the swissmetro. This dataset is commonly used for estimation examples with the `Biogeme` and `PyLogit` packages. The dataset is available at http://transp-or.epfl.ch/data/swissmetro.dat and [Bierlaire et. al., (2001)](https://transp-or.epfl.ch/documents/proceedings/BierAxhaAbay01.pdf) provides a detailed discussion of the data as wells as its context and collection process. The explanatory variables in this example include the travel time (`TT`) and cost `CO` for each of the three alternative modes." ] }, { "cell_type": "markdown", "metadata": { "id": "n4No84MAeFOM" }, "source": [ "### Read data" ] }, { "cell_type": "markdown", "metadata": { "id": "TEzmVzYDdLS8" }, "source": [ "The dataset is imported to the Python environment using `pandas`. Then, two types of samples, ones with a trip purpose different to commute or business and ones with an unknown choice, are filtered out. The original dataset contains 10,729 records, but after filtering, 6,768 records remain for following analysis. Finally, a new column that uniquely identifies each sample is added to the dataframe and the `CHOICE` column, which originally contains a numerical coding of the choices, is mapped to a description that is consistent with the alternatives in the column names. " ] }, { "cell_type": "code", "metadata": { "id": "4jqERhnWhGCc", "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "outputId": "6bbdca2a-1670-4836-c0d5-d16915ee9597" }, "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df_wide = pd.read_table(\"http://transp-or.epfl.ch/data/swissmetro.dat\", sep='\\t')\n", "\n", "# Keep only observations for commute and business purposes that contain known choices\n", "df_wide = df_wide[(df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0))]\n", "\n", "df_wide['custom_id'] = np.arange(len(df_wide)) # Add unique identifier\n", "df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})\n", "df_wide" ], "execution_count": 2, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE ... \\\n", "0 2 0 1 1 1 0 1 1 0 3 ... \n", "1 2 0 1 1 1 0 1 1 0 3 ... \n", "2 2 0 1 1 1 0 1 1 0 3 ... \n", "3 2 0 1 1 1 0 1 1 0 3 ... \n", "4 2 0 1 1 1 0 1 1 0 3 ... \n", "... ... ... .. ... ... ... ... ... ... ... ... \n", "8446 3 1 1 939 3 1 7 3 1 5 ... \n", "8447 3 1 1 939 3 1 7 3 1 5 ... \n", "8448 3 1 1 939 3 1 7 3 1 5 ... \n", "8449 3 1 1 939 3 1 7 3 1 5 ... \n", "8450 3 1 1 939 3 1 7 3 1 5 ... \n", "\n", " TRAIN_CO TRAIN_HE SM_TT SM_CO SM_HE SM_SEATS CAR_TT CAR_CO \\\n", "0 48 120 63 52 20 0 117 65 \n", "1 48 30 60 49 10 0 117 84 \n", "2 48 60 67 58 30 0 117 52 \n", "3 40 30 63 52 20 0 72 52 \n", "4 36 60 63 42 20 0 90 84 \n", "... ... ... ... ... ... ... ... ... \n", "8446 13 30 50 17 30 0 130 64 \n", "8447 12 30 53 16 10 0 80 80 \n", "8448 16 60 50 16 20 0 80 64 \n", "8449 16 30 53 17 30 0 80 104 \n", "8450 13 60 53 21 30 0 100 80 \n", "\n", " CHOICE custom_id \n", "0 SM 0 \n", "1 SM 1 \n", "2 SM 2 \n", "3 SM 3 \n", "4 SM 4 \n", "... ... ... \n", "8446 TRAIN 6763 \n", "8447 TRAIN 6764 \n", "8448 TRAIN 6765 \n", "8449 TRAIN 6766 \n", "8450 TRAIN 6767 \n", "\n", "[6768 rows x 29 columns]" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
GROUPSURVEYSPIDPURPOSEFIRSTTICKETWHOLUGGAGEAGE...TRAIN_COTRAIN_HESM_TTSM_COSM_HESM_SEATSCAR_TTCAR_COCHOICEcustom_id
02011101103...48120635220011765SM0
12011101103...4830604910011784SM1
22011101103...4860675830011752SM2
32011101103...403063522007252SM3
42011101103...366063422009084SM4
..................................................................
8446311939317315...1330501730013064TRAIN6763
8447311939317315...123053161008080TRAIN6764
8448311939317315...166050162008064TRAIN6765
8449311939317315...1630531730080104TRAIN6766
8450311939317315...1360532130010080TRAIN6767
\n", "

6768 rows × 29 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 2 } ] }, { "cell_type": "markdown", "metadata": { "id": "-GRMhgM2eIPz" }, "source": [ "### Reshape data" ] }, { "cell_type": "markdown", "metadata": { "id": "r9OxW-yNhcal" }, "source": [ "The imported dataframe is in wide format, and it needs to be reshaped to long format for processing by `xlogit`, which offers the convenient `wide_to_long` utility for this reshaping process. The user needs to specify the column that uniquely identifies each sample, the names of the alternatives, the columns that vary across alternatives, and whether the alternative names are a prefix or suffix of the column names. Additionally, the user can specify a value (`empty_val`) to be used by default when an alternative is not available for a certain variable. Additional usage examples for the `wide_to_long` function are available in xlogit's documentation at https://xlogit.readthedocs.io/en/latest/notebooks/convert_data_wide_to_long.html. Also, details about the function parameters are available at the [API reference ](https://xlogit.readthedocs.io/en/latest/api/utils.html#xlogit.utils.wide_to_long)." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "1KM-BvFvhWed", "outputId": "33a6bacf-9674-4fec-eeca-90ab763a3308" }, "source": [ "from xlogit.utils import wide_to_long\n", "\n", "df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',\n", " alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,\n", " varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)\n", "df" ], "execution_count": 3, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " custom_id alt TT CO HE AV SEATS GROUP SURVEY SP ... \\\n", "0 0 TRAIN 112 48 120 1 0 2 0 1 ... \n", "1 0 SM 63 52 20 1 0 2 0 1 ... \n", "2 0 CAR 117 65 0 1 0 2 0 1 ... \n", "3 1 TRAIN 103 48 30 1 0 2 0 1 ... \n", "4 1 SM 60 49 10 1 0 2 0 1 ... \n", "... ... ... ... ... ... .. ... ... ... .. ... \n", "20299 6766 SM 53 17 30 1 0 3 1 1 ... \n", "20300 6766 CAR 80 104 0 1 0 3 1 1 ... \n", "20301 6767 TRAIN 108 13 60 1 0 3 1 1 ... \n", "20302 6767 SM 53 21 30 1 0 3 1 1 ... \n", "20303 6767 CAR 100 80 0 1 0 3 1 1 ... \n", "\n", " TICKET WHO LUGGAGE AGE MALE INCOME GA ORIGIN DEST CHOICE \n", "0 1 1 0 3 0 2 0 2 1 SM \n", "1 1 1 0 3 0 2 0 2 1 SM \n", "2 1 1 0 3 0 2 0 2 1 SM \n", "3 1 1 0 3 0 2 0 2 1 SM \n", "4 1 1 0 3 0 2 0 2 1 SM \n", "... ... ... ... ... ... ... .. ... ... ... \n", "20299 7 3 1 5 1 2 0 1 2 TRAIN \n", "20300 7 3 1 5 1 2 0 1 2 TRAIN \n", "20301 7 3 1 5 1 2 0 1 2 TRAIN \n", "20302 7 3 1 5 1 2 0 1 2 TRAIN \n", "20303 7 3 1 5 1 2 0 1 2 TRAIN \n", "\n", "[20304 rows x 23 columns]" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
custom_idaltTTCOHEAVSEATSGROUPSURVEYSP...TICKETWHOLUGGAGEAGEMALEINCOMEGAORIGINDESTCHOICE
00TRAIN1124812010201...110302021SM
10SM63522010201...110302021SM
20CAR11765010201...110302021SM
31TRAIN103483010201...110302021SM
41SM60491010201...110302021SM
..................................................................
202996766SM53173010311...731512012TRAIN
203006766CAR80104010311...731512012TRAIN
203016767TRAIN108136010311...731512012TRAIN
203026767SM53213010311...731512012TRAIN
203036767CAR10080010311...731512012TRAIN
\n", "

20304 rows × 23 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "markdown", "metadata": { "id": "PhLjuzaSeVjE" }, "source": [ "### Create model specification" ] }, { "cell_type": "markdown", "metadata": { "id": "dgJP2WdQeXiY" }, "source": [ "Following the reshaping, users can create or update the dataset's columns in order to accommodate their model specification needs, if necessary. The code below shows how the columns `ASC_TRAIN` and `ASC_CAR` were created to incorporate alternative-specific constants in the model. In addition, the example illustrates an effective way of establishing variable interactions (e.g., trip costs for commuters with an annual pass) by updating existing columns conditional on values of other columns. Although apparently simple, column operations provide users with an intuitive and highly-flexible mechanism to incorporate model specification aspects, such as variable transformations, interactions, and alternative specific coefficients and constants. By operating the dataframe columns, any utility specification can be accommodated in `xlogit`. As shown in [this specification example](https://xlogit.readthedocs.io/en/latest/notebooks/multinomial_model.html#Create-model-specification), highly-flexible utility specifications can be modeled in `xlogit` by operating the dataframe columns." ] }, { "cell_type": "code", "metadata": { "id": "MsSu2jqKeoz-" }, "source": [ "df['ASC_TRAIN'] = np.ones(len(df))*(df['alt'] == 'TRAIN')\n", "df['ASC_CAR'] = np.ones(len(df))*(df['alt'] == 'CAR')\n", "df['TT'], df['CO'] = df['TT']/100, df['CO']/100 # Scale variables\n", "annual_pass = (df['GA'] == 1) & (df['alt'].isin(['TRAIN', 'SM']))\n", "df.loc[annual_pass, 'CO'] = 0 # Cost zero for pass holders" ], "execution_count": 4, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "JgjEi8QLexj6" }, "source": [ "### Estimate model parameters" ] }, { "cell_type": "markdown", "metadata": { "id": "gzuSO2UBe99t" }, "source": [ "The `fit` method estimates the model by taking as input the data from the previous step along with additional specification criteria, such as the distribution of the random parameters (`randvars`), the number of random draws (`n_draws`), and the availability of alternatives for the choice situations (`avail`). We set the optimization method as `L-BFGS-B` as this is a robust routine that usually helps solve convergence issues. Once the estimation routine is completed, the `summary` method can be used to display the estimation results." ] }, { "cell_type": "code", "metadata": { "id": "dctkrAPBez4T", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "eaacd08d-fec7-4b8d-b0dd-757e4f3eac15" }, "source": [ "from xlogit import MixedLogit\n", "varnames=['ASC_CAR', 'ASC_TRAIN', 'CO', 'TT']\n", "model = MixedLogit()\n", "model.fit(X=df[varnames], y=df['CHOICE'], varnames=varnames,\n", " alts=df['alt'], ids=df['custom_id'], avail=df['AV'],\n", " panels=df[\"ID\"], randvars={'TT': 'n'}, n_draws=1500,\n", " optim_method='L-BFGS-B')\n", "model.summary()" ], "execution_count": 5, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'\n", " Iterations: 14\n", " Function evaluations: 15\n", "Estimation time= 17.0 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "ASC_CAR 0.2831085 0.0560480 5.0511797 2.35e-06 ***\n", "ASC_TRAIN -0.5722790 0.0794780 -7.2004737 4.84e-12 ***\n", "CO -1.6601703 0.0778870 -21.3151016 2.52e-96 ***\n", "TT -3.2289850 0.1749807 -18.4533828 5.58e-73 ***\n", "sd.TT 3.6485337 0.1683459 21.6728359 1.88e-99 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n", "\n", "Log-Likelihood= -4359.218\n", "AIC= 8728.436\n", "BIC= 8762.536\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "iwsZACbNgJwd" }, "source": [ "The negative signs for the cost and time coefficients suggest that decision makers experience a general disutility with alternatives that have higher waiting times and costs, which conforms to the underlying decision making theory. Note that these estimates are highly consistent with those returned by Biogeme (https://biogeme.epfl.ch/examples/swissmetro/05normalMixtureIntegral.html)" ] }, { "cell_type": "markdown", "metadata": { "id": "dt6rAYtH3Djj" }, "source": [ "## Electricity Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "_k8_iTHPn7l9" }, "source": [ "The electricity dataset contains 4,308 choices among four electricity suppliers based on the attributes of the offered plans, which include prices(pf), contract lengths(cl), time of day rates (tod), seasonal rates(seas), as well as attributes of the suppliers, which include whether the supplier is local (loc) and well-known (wk). The data was collected through a survey where 12 different choice situations were presented to each participant. The multiple responses per participants were organized into panels. Given that some participants answered less than 12 of the choice situations, some panels are unbalanced, which `xlogit` is able to handle. [Revelt and Train (1999)](https://escholarship.org/content/qt1900p96t/qt1900p96t.pdf) provide a detailed description of this dataset. " ] }, { "cell_type": "markdown", "metadata": { "id": "wOjSrftv3Gtm" }, "source": [ "### Read data" ] }, { "cell_type": "markdown", "metadata": { "id": "ymoa7_h4oZo_" }, "source": [ "The dataset is already in long format so no reshaping is necessary, it can be used directly in xlogit." ] }, { "cell_type": "code", "metadata": { "id": "fLgHickp3IJw", "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "outputId": "d5873903-e4b9-4278-cc97-a7f96c016b2a" }, "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"https://raw.github.com/arteagac/xlogit/master/examples/data/electricity_long.csv\")\n", "df" ], "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " choice id alt pf cl loc wk tod seas chid\n", "0 0 1 1 7 5 0 1 0 0 1\n", "1 0 1 2 9 1 1 0 0 0 1\n", "2 0 1 3 0 0 0 0 0 1 1\n", "3 1 1 4 0 5 0 1 1 0 1\n", "4 0 1 1 7 0 0 1 0 0 2\n", "... ... ... ... .. .. ... .. ... ... ...\n", "17227 0 361 4 0 1 1 0 0 1 4307\n", "17228 1 361 1 9 0 0 1 0 0 4308\n", "17229 0 361 2 7 0 0 0 0 0 4308\n", "17230 0 361 3 0 1 0 1 0 1 4308\n", "17231 0 361 4 0 5 1 0 1 0 4308\n", "\n", "[17232 rows x 10 columns]" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
choiceidaltpfcllocwktodseaschid
00117501001
10129110001
20130000011
31140501101
40117001002
.................................
17227036140110014307
17228136119001004308
17229036127000004308
17230036130101014308
17231036140510104308
\n", "

17232 rows × 10 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "markdown", "metadata": { "id": "gFUpTIpU3-Oi" }, "source": [ "### Fit the model" ] }, { "cell_type": "markdown", "metadata": { "id": "0p7isMbqoIYz" }, "source": [ "Note that the parameter `panels` was included in the `fit` function in order to take into account panel structure of this dataset during estimation." ] }, { "cell_type": "code", "metadata": { "id": "It7kU3GE3XV0", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "5445c792-bea2-4449-c8b8-63ac63d9e8f1" }, "source": [ "from xlogit import MixedLogit\n", "\n", "varnames = ['pf', 'cl', 'loc', 'wk', 'tod', 'seas']\n", "model = MixedLogit()\n", "model.fit(X=df[varnames],\n", " y=df['choice'],\n", " varnames=varnames,\n", " ids=df['chid'],\n", " panels=df['id'],\n", " alts=df['alt'],\n", " n_draws=600,\n", " randvars={'pf': 'n', 'cl': 'n', 'loc': 'n',\n", " 'wk': 'n', 'tod': 'n', 'seas': 'n'})\n", "model.summary()" ], "execution_count": 7, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: The gradients are close to zero\n", " Iterations: 25\n", " Function evaluations: 27\n", "Estimation time= 3.5 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "pf -0.9972102 0.0361699 -27.5701452 7.2e-153 ***\n", "cl -0.2196812 0.0143577 -15.3006166 2.44e-50 ***\n", "loc 2.2901807 0.0891457 25.6903002 3.37e-134 ***\n", "wk 1.6943247 0.0719012 23.5646304 2.87e-114 ***\n", "tod -9.6752279 0.3117174 -31.0384572 1.15e-189 ***\n", "seas -9.6961836 0.3111496 -31.1624480 4.94e-191 ***\n", "sd.pf 0.2207255 0.0126686 17.4230731 1.54e-64 ***\n", "sd.cl 0.4115547 0.0200678 20.5082053 5.5e-88 ***\n", "sd.loc 1.7840255 0.0974403 18.3089162 6.13e-71 ***\n", "sd.wk 1.2296227 0.0838150 14.6706716 1.93e-46 ***\n", "sd.tod 2.2757059 0.1257114 18.1026193 2.01e-69 ***\n", "sd.seas 1.4862206 0.1316147 11.2922079 4.06e-28 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n", "\n", "Log-Likelihood= -3888.465\n", "AIC= 7800.930\n", "BIC= 7877.349\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "GWm-ydF3kxSr" }, "source": [ "The xlogit estimates are similar to those estimated using R's mlogit package (https://cran.r-project.org/web/packages/mlogit/vignettes/e3mxlogit.html). With GPU-enabled estimations, xlogit estimates the model in less than 10 seconds, significantly faster than open-source pacakges such as mlogit and pylogit. This feature can be beneficial while fitting models for large datasets with multiple explanatory variables to be estimated with random coefficients." ] }, { "cell_type": "markdown", "metadata": { "id": "r9gxNL0XePRc" }, "source": [ "## Fishing Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "Innu2ypbmKui" }, "source": [ "This example illustrates the estimation of a Mixed Logit model for choices of 1,182 individuals for sport fishing modes using `xlogit`. The goal is to analyze the market shares of four alternatives (i.e., beach, pier, boat, and charter) based on their cost and fish catch rate. [Cameron (2005)](http://cameron.econ.ucdavis.edu/mmabook/mma.html) provides additional details about this dataset. The following code illustrates how to use `xlogit` to estimate the model parameters." ] }, { "cell_type": "markdown", "metadata": { "id": "cqBJWh8eOQDp" }, "source": [ "### Read data" ] }, { "cell_type": "markdown", "metadata": { "id": "GqCeQywgozbK" }, "source": [ "The data to be analyzed can be imported to Python using any preferred method. In this example, the data in CSV format was imported using the popular `pandas` Python package. However, it is worth highlighting that `xlogit` does not depend on the `pandas` package, as `xlogit` can take any array-like structure as input. This represents an additional advantage because `xlogit` can be used with any preferred dataframe library, and not only with `pandas`." ] }, { "cell_type": "code", "metadata": { "id": "9jDr3PIveaG8", "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "outputId": "490ab844-e202-4ba3-cdcf-cf64c54fc698" }, "source": [ "import pandas as pd\n", "df = pd.read_csv(\"https://raw.github.com/arteagac/xlogit/master/examples/data/fishing_long.csv\")\n", "df" ], "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " id alt choice income price catch\n", "0 1 beach 0 7083.33170 157.930 0.0678\n", "1 1 boat 0 7083.33170 157.930 0.2601\n", "2 1 charter 1 7083.33170 182.930 0.5391\n", "3 1 pier 0 7083.33170 157.930 0.0503\n", "4 2 beach 0 1249.99980 15.114 0.1049\n", "... ... ... ... ... ... ...\n", "4723 1181 pier 0 416.66668 36.636 0.4522\n", "4724 1182 beach 0 6250.00130 339.890 0.2537\n", "4725 1182 boat 1 6250.00130 235.436 0.6817\n", "4726 1182 charter 0 6250.00130 260.436 2.3014\n", "4727 1182 pier 0 6250.00130 339.890 0.1498\n", "\n", "[4728 rows x 6 columns]" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idaltchoiceincomepricecatch
01beach07083.33170157.9300.0678
11boat07083.33170157.9300.2601
21charter17083.33170182.9300.5391
31pier07083.33170157.9300.0503
42beach01249.9998015.1140.1049
.....................
47231181pier0416.6666836.6360.4522
47241182beach06250.00130339.8900.2537
47251182boat16250.00130235.4360.6817
47261182charter06250.00130260.4362.3014
47271182pier06250.00130339.8900.1498
\n", "

4728 rows × 6 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "metadata": { "id": "rffV7cx8ORpP" }, "source": [ "### Fit model" ] }, { "cell_type": "markdown", "metadata": { "id": "UFcmUz8Xo6UT" }, "source": [ "Once the data is in the `Python` environment, `xlogit` can be used to fit the model, as shown below. The `MultinomialLogit` class is imported from `xlogit`, and its constructor is used to initialize a new model. The `fit` method estimates the model using the input data and estimation criteria provided as arguments to the method's call. The arguments of the `fit` methods are described in [`xlogit`'s documentation](https://https://xlogit.readthedocs.io/en/latest/api/).\n" ] }, { "cell_type": "code", "metadata": { "id": "FIZwBe0zedfh", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "46561fa7-3339-4fb1-9c93-3c2a4a646c96" }, "source": [ "from xlogit import MixedLogit\n", "varnames = ['price', 'catch']\n", "model = MixedLogit()\n", "model.fit(X=df[varnames],\n", " y=df['choice'],\n", " varnames=varnames,\n", " alts=df['alt'],\n", " ids=df['id'],\n", " n_draws=1000,\n", " randvars={'price': 'n', 'catch': 'n'})\n", "model.summary()" ], "execution_count": 9, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: The gradients are close to zero\n", " Iterations: 19\n", " Function evaluations: 31\n", "Estimation time= 1.0 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "price -0.0272460 0.0024982 -10.9062938 1.86e-25 ***\n", "catch 1.3271150 0.1724869 7.6940046 2.23e-13 ***\n", "sd.price 0.0102130 0.0022025 4.6370078 1.87e-05 ***\n", "sd.catch -1.5706825 0.5732798 -2.7398182 0.0189 * \n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n", "\n", "Log-Likelihood= -1300.511\n", "AIC= 2609.023\n", "BIC= 2629.323\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "FnxV4Ekzq5nr" }, "source": [ "### Prediction" ] }, { "cell_type": "markdown", "metadata": { "id": "fR6GZoF2z3n-" }, "source": [ "`xlogit` also provides a convenient set of post-estimation tools for prediction or forecasting. The `predict` function uses estimated parameters and a new or updated dataset to compute predicted choices. By including the `return_proba` and `return_freq` parameters in the function's call, users can also obtain the predicted probabilities and frequency of the chosen alternatives. The following code illustrates the prediction functionality to forecast changes in market shares (choice frequency) for fishing modes caused by an increase in price for the \"boat\" mode. First, base market shares are computed by running `predict` on the original dataset. Then, an increase of 20% in the price for the \"boat\" alternative is applied to the dataset and the updated shares are predicted. " ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KsXw9BK3q7BO", "outputId": "90ea510a-ee71-4c68-92ef-2b38aef9be88" }, "source": [ "choices, freq = model.predict(X=df[varnames], varnames=varnames, ids=df['id'],\n", " alts=df['alt'], return_freq=True)\n", "print(f\"base: {freq}\")\n", "\n", "df.loc[df['alt']=='boat', 'price'] *= 1.2 # 20 percent price increase\n", "choices, freq = model.predict(X=df[varnames], varnames=varnames, ids=df['id'],\n", " alts=df['alt'], return_freq=True)\n", "print(f\"updated: {freq}\")" ], "execution_count": 10, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "base: {'beach': 0.223, 'boat': 0.461, 'charter': 0.228, 'pier': 0.089}\n", "GPU processing enabled.\n", "updated: {'beach': 0.238, 'boat': 0.379, 'charter': 0.278, 'pier': 0.105}\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "Nzxkhkwz0Q5d" }, "source": [ "The output shows that the 20% price increase would result in a decrease of almost 10% in market share for the \"boat\" alternative." ] }, { "cell_type": "markdown", "metadata": { "id": "mWU80LmcODPY" }, "source": [ "## Car Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "q1zgiBGKouPr" }, "source": [ "The fourth example uses a stated preference panel dataset for choice of car. Three alternatives are considered, with upto 6 choice situations per individual. This again is an unbalanced panel with responses of some individuals less than 6 situations. The dataset contains 8 explanaotry variables: price, operating cost, range, and binary indicators to indicate whether the car is electric, hybrid, and if performance is high or medium respectively. This dataset was taken from Kenneth Train's MATLAB codes for estimation of Mixed Logit models as shown in this link: https://eml.berkeley.edu/Software/abstracts/train1006mxlmsl.html" ] }, { "cell_type": "markdown", "metadata": { "id": "SoSyQfjqkNU3" }, "source": [ "### Read data" ] }, { "cell_type": "code", "metadata": { "id": "v8AAMruCj8tt" }, "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv(\"https://raw.github.com/arteagac/xlogit/master/examples/data/car100_long.csv\")" ], "execution_count": 11, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0HY7mT__Lj5b" }, "source": [ "Since price and operating cost need to be estimated with negative coefficients, we reverse the variable signs in the dataframe. " ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "TQ33gsZZLkP5", "outputId": "7acb35c7-38b4-4017-b0ac-ae1dd2ef8b59" }, "source": [ "df['price'] = -df['price']/10000\n", "df['opcost'] = -df['opcost']\n", "df" ], "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " person_id choice_id alt choice price opcost range ev gas \\\n", "0 1 1 1 0 -4.6763 -47.43 0.0 0 0 \n", "1 1 1 2 1 -5.7209 -27.43 1.3 1 0 \n", "2 1 1 3 0 -8.7960 -32.41 1.2 1 0 \n", "3 1 2 1 1 -3.3768 -4.89 1.3 1 0 \n", "4 1 2 2 0 -9.0336 -30.19 0.0 0 0 \n", "... ... ... ... ... ... ... ... .. ... \n", "4447 100 1483 2 0 -2.8036 -14.45 1.6 1 0 \n", "4448 100 1483 3 0 -1.9360 -54.76 0.0 0 1 \n", "4449 100 1484 1 1 -2.4054 -50.57 0.0 0 1 \n", "4450 100 1484 2 0 -5.2795 -21.25 0.0 0 0 \n", "4451 100 1484 3 0 -6.0705 -25.41 1.4 1 0 \n", "\n", " hybrid hiperf medhiperf \n", "0 1 0 0 \n", "1 0 1 1 \n", "2 0 0 1 \n", "3 0 1 1 \n", "4 1 0 1 \n", "... ... ... ... \n", "4447 0 0 0 \n", "4448 0 1 1 \n", "4449 0 0 0 \n", "4450 1 0 1 \n", "4451 0 0 0 \n", "\n", "[4452 rows x 12 columns]" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
person_idchoice_idaltchoicepriceopcostrangeevgashybridhiperfmedhiperf
01110-4.6763-47.430.000100
11121-5.7209-27.431.310011
21130-8.7960-32.411.210001
31211-3.3768-4.891.310011
41220-9.0336-30.190.000101
.......................................
4447100148320-2.8036-14.451.610000
4448100148330-1.9360-54.760.001011
4449100148411-2.4054-50.570.001000
4450100148420-5.2795-21.250.000101
4451100148430-6.0705-25.411.410000
\n", "

4452 rows × 12 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 12 } ] }, { "cell_type": "markdown", "metadata": { "id": "_ZQf9DFKFE5j" }, "source": [ "### Fit the model" ] }, { "cell_type": "code", "metadata": { "id": "_MhfvmWgFCX6", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "28de1f2f-2e7c-46be-aa60-105df6b669bc" }, "source": [ "from xlogit import MixedLogit\n", "\n", "varnames = ['hiperf', 'medhiperf', 'price', 'opcost', 'range', 'ev', 'hybrid'] \n", "model = MixedLogit()\n", "model.fit(X=df[varnames],\n", " y=df['choice'],\n", " varnames=varnames,\n", " alts=df['alt'],\n", " ids=df['choice_id'],\n", " panels=df['person_id'],\n", " randvars = {'price': 'ln', 'opcost': 'n', \n", " 'range': 'ln', 'ev':'n', 'hybrid': 'n'}, \n", " n_draws = 100) \n", "model.summary()" ], "execution_count": 13, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: The gradients are close to zero\n", " Iterations: 31\n", " Function evaluations: 34\n", "Estimation time= 1.4 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "hiperf 0.1058410 0.0971974 1.0889290 0.441 \n", "medhiperf 0.5604997 0.0977352 5.7348796 6.81e-08 ***\n", "price -0.7871346 0.1048151 -7.5097475 7.49e-13 ***\n", "opcost 0.0110846 0.0041762 2.6542027 0.0237 * \n", "range -0.6857957 0.4362769 -1.5719276 0.232 \n", "ev -1.5574339 0.3250817 -4.7908996 8.97e-06 ***\n", "hybrid 0.6883966 0.1467451 4.6911049 1.43e-05 ***\n", "sd.price 0.8666443 0.0926631 9.3526350 2.72e-19 ***\n", "sd.opcost 0.0401550 0.0047037 8.5369142 2.77e-16 ***\n", "sd.range -0.5527011 0.2453684 -2.2525353 0.0633 . \n", "sd.ev 0.9139887 0.2038702 4.4831898 3.66e-05 ***\n", "sd.hybrid 0.7099238 0.1513756 4.6898169 1.44e-05 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n", "\n", "Log-Likelihood= -1302.683\n", "AIC= 2629.366\n", "BIC= 2692.995\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "pEiWWuCciEJ6" }, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": { "id": "UkfFmr7fiFxc" }, "source": [ "- Bierlaire, M. (2018). PandasBiogeme: a short introduction. EPFL (Transport and Mobility Laboratory, ENAC).\n", "\n", "- Brathwaite, T., & Walker, J. L. (2018). Asymmetric, closed-form, finite-parameter models of multinomial choice. Journal of Choice Modelling, 29, 78–112. \n", "\n", "- Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge university press.\n", "\n", "- Croissant, Y. (2020). Estimation of Random Utility Models in R: The mlogit Package. Journal of Statistical Software, 95(1), 1-41.\n", "\n", "- Revelt, D., & Train, K. (1999). Customer-specific taste parameters and mixed logit. University of California, Berkeley.\n", "\n" ] } ] }