{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "source": [ "# WTP space models" ], "metadata": { "id": "NaIEHmukcv9b" } }, { "cell_type": "markdown", "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/arteagac/xlogit/blob/master/examples/wtp_space.ipynb)" ], "metadata": { "id": "vtaWwgdMen-3" } }, { "cell_type": "markdown", "source": [ "## Import xlogit" ], "metadata": { "id": "w-W6OfpsRak0" } }, { "cell_type": "markdown", "source": [ "The cell below installs and imports xlogit and checks whether a GPU is available. A GPU is not strictly required, but it can speed up the estimation of WTP space models with random parameters." ], "metadata": { "id": "YueGuKfoSCM8" } }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MjMa8cK_n9N3", "outputId": "343e0a99-2720-4511-9c9e-19f4b0d3d899" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: xlogit in /usr/local/lib/python3.10/dist-packages (0.2.7)\n", "Requirement already satisfied: numpy>=1.13.1 in /usr/local/lib/python3.10/dist-packages (from xlogit) (1.23.5)\n", "Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from xlogit) (1.11.4)\n", "1 GPU device(s) available. 
xlogit will use GPU processing\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "True" ] }, "metadata": {}, "execution_count": 1 } ], "source": [ "!pip install xlogit\n", "from xlogit import MixedLogit\n", "MixedLogit.check_if_gpu_available()" ] }, { "cell_type": "markdown", "source": [ "## Yogurt Dataset" ], "metadata": { "id": "F8qGYPH7SfIP" } }, { "cell_type": "markdown", "source": [ "This dataset contains revealed-preference data on 2,412 choices among four yogurt brands (`dannon`, `hiland`, `weight`, and `yoplait`). It has a panel structure, with multiple choice situations observed for each of 100 households. Because the number of choice situations varies across households, the panels are unbalanced, which xlogit is able to handle. The dataset was originally introduced by Jain et al. (1994) and was ported from the [logitr package for R](https://github.com/jhelvy/logitr/)." ], "metadata": { "id": "88neKwKWShmD" } }, { "cell_type": "markdown", "source": [ "### Read data" ], "metadata": { "id": "b007lcxSWWL2" } }, { "cell_type": "code", "source": [ "import pandas as pd\n", "import numpy as np\n", "df = pd.read_csv(\"https://raw.githubusercontent.com/arteagac/xlogit/master/examples/data/yogurt_long.csv\")\n", "df" ], "metadata": { "id": "J-dXIf9mpA0B", "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "outputId": "9c8f1603-1dcc-4f53-885f-8d049ca3e665" }, "execution_count": 2, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " id choice feat price chid alt\n", "0 1 0 0 8.1 1 dannon\n", "1 1 0 0 6.1 1 hiland\n", "2 1 1 0 7.9 1 weight\n", "3 1 0 0 10.8 1 yoplait\n", "4 1 1 0 9.8 2 dannon\n", "... ... ... ... ... ... ...\n", "9643 100 0 0 12.2 2411 yoplait\n", "9644 100 0 0 8.6 2412 dannon\n", "9645 100 0 0 4.3 2412 hiland\n", "9646 100 1 0 7.9 2412 weight\n", "9647 100 0 0 10.8 2412 yoplait\n", "\n", "[9648 rows x 6 columns]" ], "text/html": [ "\n", "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>id</th><th>choice</th><th>feat</th><th>price</th><th>chid</th><th>alt</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>0</td><td>8.1</td><td>1</td><td>dannon</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>0</td><td>0</td><td>6.1</td><td>1</td><td>hiland</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>1</td><td>0</td><td>7.9</td><td>1</td><td>weight</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>0</td><td>0</td><td>10.8</td><td>1</td><td>yoplait</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>1</td><td>0</td><td>9.8</td><td>2</td><td>dannon</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>9643</th><td>100</td><td>0</td><td>0</td><td>12.2</td><td>2411</td><td>yoplait</td></tr>\n",
"    <tr><th>9644</th><td>100</td><td>0</td><td>0</td><td>8.6</td><td>2412</td><td>dannon</td></tr>\n",
"    <tr><th>9645</th><td>100</td><td>0</td><td>0</td><td>4.3</td><td>2412</td><td>hiland</td></tr>\n",
"    <tr><th>9646</th><td>100</td><td>1</td><td>0</td><td>7.9</td><td>2412</td><td>weight</td></tr>\n",
"    <tr><th>9647</th><td>100</td><td>0</td><td>0</td><td>10.8</td><td>2412</td><td>yoplait</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>9648 rows × 6 columns</p>\n",
"</div>\n",
"
\n" ] }, "metadata": {}, "execution_count": 2 } ] }, { "cell_type": "markdown", "source": [ "## Convert column to dummy representation" ], "metadata": { "id": "Z2goyAp7W7Jr" } }, { "cell_type": "markdown", "source": [ "The `alt` column, which identifies the yogurt brand, needs to be converted to a dummy representation in order to be included in the model. The `dannon` brand is left out as the reference level." ], "metadata": { "id": "N2hL4QOeXEfk" } }, { "cell_type": "code", "source": [ "df[\"brand_yoplait\"] = 1*(df[\"alt\"] == \"yoplait\")\n", "df[\"brand_hiland\"] = 1*(df[\"alt\"] == \"hiland\")\n", "df[\"brand_weight\"] = 1*(df[\"alt\"] == \"weight\")" ], "metadata": { "id": "mG2aEmGLo7bR" }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Estimate model in WTP space" ], "metadata": { "id": "ijm-XFgKYyLF" } }, { "cell_type": "markdown", "source": [ "Note that a model in WTP space requires a `scale_factor`, which here is the `price` column. For models in WTP space, xlogit uses the negative of the `price` column." 
], "metadata": { "id": "jltoIItVXP4u" } }, { "cell_type": "code", "source": [ "varnames = [\"feat\", \"brand_yoplait\", \"brand_hiland\", \"brand_weight\"]\n", "wtp = MixedLogit()\n", "wtp.fit(X=df[varnames],\n", " y=df[\"choice\"],\n", " varnames=varnames,\n", " ids=df[\"chid\"],\n", " alts=df[\"alt\"],\n", " panels=df['id'],\n", " randvars={\"feat\": \"n\", \"brand_yoplait\": \"n\", \"brand_hiland\": \"n\", \"brand_weight\": \"n\"},\n", " scale_factor=df[\"price\"],\n", " n_draws=1000\n", " )\n", "\n", "wtp.summary()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mrLJEsANXmQP", "outputId": "1e6440e4-9f86-488b-ce3d-d4dfab812f13" }, "execution_count": 4, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH\n", " Iterations: 84\n", " Function evaluations: 104\n", "Estimation time= 20.2 seconds\n", 
"---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "feat 2.3717675 0.6195132 3.8284377 0.000132 ***\n", "brand_yoplait 1.9544038 0.6661769 2.9337607 0.00338 ** \n", "brand_hiland -12.1382781 1.4991326 -8.0968678 8.85e-16 ***\n", "brand_weight -8.6411789 1.5387448 -5.6157323 2.18e-08 ***\n", "sd.feat 2.3618763 0.6012260 3.9284337 8.79e-05 ***\n", "sd.brand_yoplait 8.2849425 1.0365091 7.9931208 2.02e-15 ***\n", "sd.brand_hiland 6.1774711 1.1621855 5.3153917 1.16e-07 ***\n", "sd.brand_weight 8.2961161 1.1133869 7.4512428 1.28e-13 ***\n", "_scale_factor 0.4564795 0.0408925 11.1629027 3.01e-28 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n", "\n", "Log-Likelihood= -1247.199\n", "AIC= 2512.398\n", "BIC= 2564.492\n" ] } ] }, { "cell_type": "markdown", "source": [ "### Provide alternative starting values" ], "metadata": { "id": "bfMEOMqgZBKt" } }, { "cell_type": "markdown", "source": [ "You can pass starting values to xlogit's `fit` method using the `init_coeff` argument. The starting values must follow the same order in which xlogit lists the parameters in the summary table: `varnames` + `sd of varnames` + `scale_factor`. An easy way to confirm this order is to run a test estimation and read the coefficients off the summary table.\n", "\n", "The code below first estimates a model in preference space and then uses the estimated parameters as initial values for the model in WTP space." ], "metadata": { "id": "XaWXq23zZMty" } }, { "cell_type": "markdown", "source": [ "#### Estimate model in preference space" ], "metadata": { "id": "O2h-zs9jaLFb" } }, { "cell_type": "code", "source": [ "# Step 3. 
Estimating a mixed logit model to obtain starting values for WTP Space Model\n", "varnames = [\"price\", \"feat\", \"brand_yoplait\", \"brand_hiland\", \"brand_weight\"]\n", "ml = MixedLogit()\n", "ml.fit(X=df[varnames],\n", " y=df[\"choice\"],\n", " varnames=varnames,\n", " ids=df[\"chid\"],\n", " alts=df[\"alt\"],\n", " panels=df['id'],\n", " randvars={\"feat\": \"n\", \"brand_yoplait\": \"n\", \"brand_hiland\": \"n\", \"brand_weight\": \"n\"},\n", " n_draws=1000)\n", "\n", "ml.summary()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W-8qErKfpTjx", "outputId": "94662302-8771-41e8-dd01-36c93304e19a" }, "execution_count": 5, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: The gradients are close to zero\n", " Iterations: 48\n", " Function evaluations: 54\n", "Estimation time= 6.1 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "price -0.4564785 0.0397931 -11.4712971 1.07e-29 ***\n", "feat 1.0826681 0.2101995 5.1506685 2.81e-07 ***\n", "brand_yoplait 0.8921554 0.1375538 6.4858654 1.07e-10 ***\n", "brand_hiland -5.5409249 0.4190916 -13.2212731 1.41e-38 ***\n", "brand_weight -3.9445356 0.2405401 -16.3986622 2.22e-57 ***\n", "sd.feat 1.0781550 0.2361639 4.5652820 5.24e-06 ***\n", "sd.brand_yoplait 3.7818750 0.1950585 19.3884147 6.1e-78 ***\n", "sd.brand_hiland 2.8199068 0.3515460 8.0214458 1.61e-15 ***\n", "sd.brand_weight 3.7870137 0.1869526 20.2565465 2.2e-84 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n", "\n", "Log-Likelihood= -1247.199\n", "AIC= 2512.398\n", "BIC= 2564.492\n" ] } ] }, { "cell_type": "markdown", "source": [ "#### Obtain starting values by formatting estimates from the mixed logit (ML) model" ], "metadata": { "id": "o-aRNTcUaQK8" } }, { "cell_type": "code", "source": [ "# Divide all estimates by the negative of the price coefficient\n", "coef = ml.coeff_\n", "coef = coef / -coef[0]\n", "\n", "# Add 1 as the starting value of the scale parameter\n", "coef = np.append(coef, 1)\n", "\n", "# Drop the price coefficient\n", "coef = coef[1:]" ], "metadata": { "id": "lPH_b32Op2we" }, "execution_count": 6, "outputs": [] }, { "cell_type": "markdown", "source": [ "#### Estimate the WTP space model using the starting values" ], "metadata": { "id": "DYn4ZghDak9Z" } }, { "cell_type": "code", "source": [ "# Use init_coeff to provide starting values\n", "# Use scale_factor to specify price as the scale parameter\n", "varnames = [\"feat\", \"brand_yoplait\", \"brand_hiland\", \"brand_weight\"]\n", "wtp = MixedLogit()\n", "wtp.fit(X=df[varnames],\n", " y=df[\"choice\"],\n", " varnames=varnames,\n", " ids=df[\"chid\"],\n", " alts=df[\"alt\"],\n", " panels=df['id'],\n", " randvars={\"feat\": \"n\", \"brand_yoplait\": \"n\", \"brand_hiland\": \"n\", \"brand_weight\": \"n\"},\n", " init_coeff=coef,\n", " scale_factor=df[\"price\"],\n", " n_draws=4000)\n", "\n", "wtp.summary()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Yp-J0k0aqOcd", "outputId": "add87345-c2e6-468e-8e0f-ee277aa01861" }, "execution_count": 7, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH\n", " Iterations: 73\n", " Function evaluations: 85\n", "Estimation time= 50.4 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. 
z-val P>|z|\n", "---------------------------------------------------------------------------\n", "feat 2.2219166 0.6459974 3.4395134 0.000593 ***\n", "brand_yoplait 1.9045552 0.7059636 2.6978096 0.00703 ** \n", "brand_hiland -12.0965444 1.4722644 -8.2162851 3.38e-16 ***\n", "brand_weight -7.3143202 1.1454247 -6.3856841 2.04e-10 ***\n", "sd.feat 2.6308963 0.6602489 3.9847040 6.96e-05 ***\n", "sd.brand_yoplait 8.3661417 1.0923714 7.6586971 2.7e-14 ***\n", "sd.brand_hiland 5.6314080 1.0507816 5.3592564 9.15e-08 ***\n", "sd.brand_weight 7.3543321 0.8909297 8.2546716 2.48e-16 ***\n", "_scale_factor 0.4638383 0.0414612 11.1872810 2.32e-28 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n", "\n", "Log-Likelihood= -1247.314\n", "AIC= 2512.627\n", "BIC= 2564.721\n" ] } ] }, { "cell_type": "markdown", "source": [ "#### Use a large number of random draws" ], "metadata": { "id": "plTDNhZkbOg_" } }, { "cell_type": "markdown", "source": [ "xlogit supports estimation with a very large number of random draws. If the number of draws is so large that the data do not fit in GPU memory, use the `batch_size` parameter to split the processing into multiple batches. For instance, with `batch_size=1000`, xlogit processes 1,000 random draws at a time. This avoids overflowing the GPU memory because xlogit processes one batch at a time, computes the log-likelihoods, and averages them at the end, which does not affect the final estimates or the log-likelihood. You can also increase `batch_size` depending on the size of your GPU memory. The example below estimates a model with 10,000 draws using batches of 2,000 random draws." 
], "metadata": { "id": "53YXXGSKbRmY" } }, { "cell_type": "code", "source": [ "varnames = [\"feat\", \"brand_yoplait\", \"brand_hiland\", \"brand_weight\"]\n", "wtp = MixedLogit()\n", "wtp.fit(X=df[varnames],\n", " y=df[\"choice\"],\n", " varnames=varnames,\n", " ids=df[\"chid\"],\n", " alts=df[\"alt\"],\n", " panels=df['id'],\n", " randvars={\"feat\": \"n\", \"brand_yoplait\": \"n\", \"brand_hiland\": \"n\", \"brand_weight\": \"n\"},\n", " n_draws=10000,\n", " batch_size=2000,\n", " init_coeff=coef,\n", " scale_factor= df[\"price\"],\n", " verbose=2)\n", "\n", "wtp.summary()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QSVn0cs34HHF", "outputId": "6cbe1bd2-027a-4fe6-d0fc-397cfa5ffb1e" }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "GPU processing enabled.\n", "Optimization terminated successfully.\n", " Message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH\n", " Iterations: 40\n", " Function evaluations: 58\n", "Estimation time= 139.6 seconds\n", "---------------------------------------------------------------------------\n", "Coefficient Estimate Std.Err. z-val P>|z|\n", "---------------------------------------------------------------------------\n", "feat 2.1505100 0.6509423 3.3036876 0.000968 ***\n", "brand_yoplait 1.3023788 0.7384744 1.7636072 0.0779 . \n", "brand_hiland -11.8644846 1.4151361 -8.3839882 8.58e-17 ***\n", "brand_weight -6.8533816 1.2996173 -5.2733846 1.46e-07 ***\n", "sd.feat 2.6531236 0.7187810 3.6911432 0.000228 ***\n", "sd.brand_yoplait 8.0006492 1.0914629 7.3302070 3.12e-13 ***\n", "sd.brand_hiland 5.6846516 1.1605500 4.8982392 1.03e-06 ***\n", "sd.brand_weight 8.9836146 1.2620149 7.1184695 1.43e-12 ***\n", "_scale_factor 0.4610077 0.0417999 11.0289098 1.25e-27 ***\n", "---------------------------------------------------------------------------\n", "Significance: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n", "\n", "Log-Likelihood= -1245.232\n", "AIC= 2508.464\n", "BIC= 2560.558\n" ] } ] }, { "cell_type": "markdown", "source": [ "## References" ], "metadata": { "id": "VaU0AR2DV-T-" } }, { "cell_type": "markdown", "source": [ "Jain, Dipak C, Naufel J Vilcassim, and Pradeep K Chintagunta. 1994. “A Random-Coefficients Logit Brand-Choice Model Applied to Panel Data.” Journal of Business & Economic Statistics 12 (3): 317–28." ], "metadata": { "id": "bvIWtJhQcUtr" } }, { "cell_type": "markdown", "source": [ "The Yogurt dataset example was kindly developed by [@chevrotin](https://github.com/chevrotin)." ], "metadata": { "id": "iS7Zhgpjc2te" } } ] }