{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "stupid-think",
   "metadata": {},
   "source": [
    "# pyg.base.dictable\n",
    "\n",
    "dictable is a table, a collection of iterable records. It is also a dict with each key's value being a column. \n",
    "Why not use a pandas.DataFrame? pd.DataFrame leads a dual life: "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "encouraging-driver",
   "metadata": {},
   "source": [
    "* by day an index-based optimized numpy array supporting e.g. timeseries analytics etc.\n",
    "* by night, a table with keys supporting filtering, aggregating, pivoting on keys as well as inner/outer joining on keys."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "architectural-cursor",
   "metadata": {},
   "source": [
    "As a result, the pandas interface is somewhat cumbersome. Further, the DataFrame isn't really designed for containing more complicated objects within it.\n",
    "<br> Conversely, dictable only tries to do the latter and is designed precisely for holding entire research process in one place. You can think of dictable as 'one level up' on a DataFrame: a dictable will handle thousands of data frames within it with ease.\n",
    "Indeed, dictable should be thought of as an 'organiser of research flow' rather than as an array of primitives.\n",
    "<br> In general, each row will contain some keys indexing the experiment, while some keys will contain complicated objects: a pd.DataFrame, a timeseries, yield_curves, machine-learning experiments etc.\n",
    "<br> The interface is succinct and extremely intuitive, allowing the user to concentrate on logic of the calculations rather than boilerplate.\n",
    "\n",
    "## Motivation: dictable as an organiser of research flow\n",
    "We start with a simple motivating example. Here is a typical workflow:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "roman-congo",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyg import *; import pandas as pd; import numpy as np\n",
    "import yfinance as yf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "middle-recorder",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "\n",
      "1 Failed download:\n",
      "- BAD_SYMBOL: No data found, symbol may be delisted\n",
      "[*********************100%***********************]  1 of 1 completed\n"
     ]
    }
   ],
   "source": [
    "symbols = ['MSFT', 'WMT', 'TSLA', 'AAPL', 'BAD_SYMBOL', 'C']\n",
    "history = [yf.download(symbol) for symbol in symbols]\n",
    "prices = [h['Adj Close'] for h in history]\n",
    "rtns = [p.diff() for p in prices]\n",
    "vols = [r.ewm(30).std() for r in rtns]\n",
    "zscores = [r/v for r,v in zip(rtns, vols)]\n",
    "zavgs = [z.mean() for z in zscores]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "speaking-matrix",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.06238896915574035,\n",
       " 0.045634555332148996,\n",
       " 0.0676156301672513,\n",
       " 0.053189575669227614,\n",
       " nan,\n",
       " 0.027297252361386543]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zavgs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "universal-trace",
   "metadata": {},
   "source": [
    "At this point we ask ourselves: Why do we have a **nan**? Which ticker was it, and when did it go wrong? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "circular-outline",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['BAD_SYMBOL']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bad_symbols = [s for s, z in zip(symbols, zavgs) if np.isnan(z)]; bad_symbols"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "pending-trailer",
   "metadata": {},
   "source": [
    "Great, how do we remove bad symbols from all our other variables?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "thick-research",
   "metadata": {},
   "outputs": [],
   "source": [
    "vols = [v for s, v in zip(symbols, vols) if s not in bad_symbols]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "strange-fence",
   "metadata": {},
   "source": [
    "Now we can calculate some stuff with rtns and vols perhaps?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "great-colony",
   "metadata": {},
   "outputs": [],
   "source": [
    "ewmas = [r.ewm(n).mean()/v for r,v in zip(rtns, vols) for n in [10, 20, 30]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "accepted-placement",
   "metadata": {},
   "source": [
    "Things went wrong and went wrong silently too: "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "greek-wildlife",
   "metadata": {},
   "source": [
    "* We forgot to remove bad data from rtns as well as from vols so our zip function is zipping the wrong stocks together\n",
    "* It is nearly impossible to discover what item in the list belong to what n and what stock"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "systematic-bruce",
   "metadata": {},
   "source": [
    "If you ever dealt with real data, the mess described above must be familiar."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "severe-projector",
   "metadata": {},
   "source": [
    "## Same code, in dictable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "breathing-colony",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "[*********************100%***********************]  1 of 1 completed\n",
      "\n",
      "1 Failed download:\n",
      "- BAD_SYMBOL: No data found, symbol may be delisted\n",
      "[*********************100%***********************]  1 of 1 completed\n"
     ]
    }
   ],
   "source": [
    "from pyg import *\n",
    "import yfinance as yf\n",
    "s = dictable(symbol = ['MSFT', 'WMT', 'TSLA', 'AAPL', 'BAD_SYMBOL', 'C'])\n",
    "s = s(history = lambda symbol: yf.download(symbol))\n",
    "s = s(price = lambda history: history['Adj Close'])\n",
    "s = s(rtn = lambda price: price.diff())\n",
    "s = s(vol = lambda rtn: rtn.ewm(30).std())\n",
    "s = s(zscore = lambda rtn, vol: rtn/vol)\n",
    "s = s(zavg = lambda zscore: zscore.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sexual-period",
   "metadata": {},
   "source": [
    "dictable __s__ contains all our data. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "interesting-hindu",
   "metadata": {},
   "source": [
    "* each row contains all the variables associated with a specific symbol\n",
    "* each column corresponds to a variable\n",
    "* adding a new variable is declarative and free of boiler-plate loop and zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "likely-anniversary",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[6 x 4]\n",
       "symbol    |history                                           |vol                                        |zavg                \n",
       "MSFT      |                  Open        High         Low    |Date                                       |0.06238896915574035 \n",
       "          |Date                                              |1986-03-13         NaN                     |                    \n",
       "          |1986-03-13    0.088542    0.101563    0.088542    |1986-03-14         NaN                     |                    \n",
       "          |1986-03-14    0.097222    0.102431    0.097222    |1986-03-17    0.000779                     |                    \n",
       "          |1986-03-17    0.100694    0.103299    0.100694    |1986-03-18    0.001997                     |                    \n",
       "WMT       |                  Open        High         Low    |Date                                       |0.045634555332148996\n",
       "          |Date                                              |1972-08-25         NaN                     |                    \n",
       "          |1972-08-25    0.063477    0.064697    0.063477    |1972-08-28         NaN                     |                    \n",
       "          |1972-08-28    0.064453    0.064941    0.064209    |1972-08-29    0.000198                     |                    \n",
       "          |1972-08-29    0.063965    0.063965    0.063477    |1972-08-30    0.000215                     |                    \n",
       "TSLA      |                  Open        High         Low    |Date                                       |0.0676156301672513  \n",
       "          |Date                                              |2010-06-29          NaN                    |                    \n",
       "          |2010-06-29    3.800000    5.000000    3.508000    |2010-06-30          NaN                    |                    \n",
       "          |2010-06-30    5.158000    6.084000    4.660000    |2010-07-01     0.255972                    |                    \n",
       "          |2010-07-01    5.000000    5.184000    4.054000    |2010-07-02     0.274164                    |                    \n",
       "AAPL      |                  Open        High         Low    |Date                                       |0.053189575669227614\n",
       "          |Date                                              |1980-12-12         NaN                     |                    \n",
       "          |1980-12-12    0.128348    0.128906    0.128348    |1980-12-15         NaN                     |                    \n",
       "          |1980-12-15    0.122210    0.122210    0.121652    |1980-12-16    0.001242                     |                    \n",
       "          |1980-12-16    0.113281    0.113281    0.112723    |1980-12-17    0.004938                     |                    \n",
       "BAD_SYMBOL|Empty DataFrame                                   |Series([], Name: Adj Close, dtype: float64)|nan                 \n",
       "          |Columns: [Open, High, Low, Close, Adj Close, Volum|                                           |                    \n",
       "          |Index: []                                         |                                           |                    \n",
       "C         |                 Open       High        Low      C|Date                                       |0.027297252361386543\n",
       "          |Date                                              |1977-01-03         NaN                     |                    \n",
       "          |1977-01-03  16.133125  16.236876  16.133125  16.23|1977-01-04         NaN                     |                    \n",
       "          |1977-01-04  16.236876  16.288750  16.184999  16.28|1977-01-05    0.053720                     |                    \n",
       "          |1977-01-05  16.288750  16.288750  16.133125  16.18|1977-01-06    0.043500                     |                    "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s[['symbol', 'history', 'vol', 'zavg']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "wicked-fairy",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.06238896915574035,\n",
       " 0.045634555332148996,\n",
       " 0.0676156301672513,\n",
       " 0.053189575669227614,\n",
       " nan,\n",
       " 0.027297252361386543]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.zavg"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "literary-eligibility",
   "metadata": {},
   "source": [
    "### Oh, no, we have a bad symbol, how do we remove it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "dated-vintage",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.06238896915574035,\n",
       " 0.045634555332148996,\n",
       " 0.0676156301672513,\n",
       " 0.053189575669227614,\n",
       " 0.027297252361386543]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = s.exc(zavg = np.nan); s.zavg"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sound-bowling",
   "metadata": {},
   "source": [
    "### Now if we want to calculate something per symbol and window...\n",
    "We want to create a new table, now keyed on two values: symbol and window n, so we create a bigger table using cross product:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "legislative-segment",
   "metadata": {},
   "outputs": [],
   "source": [
    "sn = s * dict(n = [10,20,30]) ## each row is now unique per symbol and window n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "injured-floor",
   "metadata": {},
   "outputs": [],
   "source": [
    "sn = sn(ewma = lambda rtn, n, vol: rtn.ewm(n).mean()/vol)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "mounted-convention",
   "metadata": {},
   "source": [
    "And here is Citibank's three ewma..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "single-plaza",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 2]\n",
       "n |ewma                  \n",
       "10|Date                  \n",
       "  |1977-01-03         NaN\n",
       "  |1977-01-04         NaN\n",
       "  |1977-01-05   -0.269415\n",
       "  |1977-01-06   -0.636750\n",
       "20|Date                  \n",
       "  |1977-01-03         NaN\n",
       "  |1977-01-04         NaN\n",
       "  |1977-01-05   -0.252990\n",
       "  |1977-01-06   -0.610388\n",
       "30|Date                  \n",
       "  |1977-01-03         NaN\n",
       "  |1977-01-04         NaN\n",
       "  |1977-01-05   -0.247336\n",
       "  |1977-01-06   -0.601208"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sn.inc(symbol = 'C')[['n', 'ewma']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "racial-dining",
   "metadata": {},
   "source": [
    "Here is a pivot table of the average of each ewma per symbol and window... Note that again, we can access functions of variables and not just the existing keys in the dictable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "latest-minority",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[5 x 4]\n",
       "symbol|10                    |20                    |30                    \n",
       "AAPL  |[0.05186739363086048] |[0.048216725636102103]|[0.044663941035992756]\n",
       "C     |[0.027514106874753336]|[0.026447710541888325]|[0.02513228496004374] \n",
       "MSFT  |[0.061144190254947106]|[0.058806269926441355]|[0.056647557919386575]\n",
       "TSLA  |[0.05809192519536144] |[0.0513719449570986]  |[0.0461212842741293]  \n",
       "WMT   |[0.04534668336293025] |[0.04406462686679985] |[0.0426962176790935]  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sn.pivot('symbol', 'n', lambda ewma: ewma.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "first-engine",
   "metadata": {},
   "source": [
    "## dictable functionality\n",
    "### construction\n",
    "dictable is quite flexible on constuctions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "efficient-atmosphere",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[4 x 2]\n",
       "a|b\n",
       "1|a\n",
       "2|b\n",
       "3|c\n",
       "4|d"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = dictable(a = [1,2,3,4], b = ['a', 'b', 'c', 'd']); d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "existing-cedar",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[4 x 4]\n",
       "symbol|exchange|a|b\n",
       "MSFT  |NYSE    |1|a\n",
       "AAPL  |NYSE    |2|b\n",
       "APA   |NYSE    |3|c\n",
       "MMM   |NYSE    |4|d"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = dictable(dict(a = [1,2,3,4], b = ['a', 'b', 'c', 'd']), symbol = ['MSFT', 'AAPL', 'APA', 'MMM'], exchange = 'NYSE'); d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "direct-security",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(d) # can instantiate a DataFrame from a dictable with no code and vice versa..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "civic-reality",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[4 x 4]\n",
       "symbol|exchange|a|b\n",
       "MSFT  |NYSE    |1|a\n",
       "AAPL  |NYSE    |2|b\n",
       "APA   |NYSE    |3|c\n",
       "MMM   |NYSE    |4|d"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = dictable(df); d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "hydraulic-tuition",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 2]\n",
       "a|b\n",
       "1|3\n",
       "2|4\n",
       "3|5"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = dictable([(1,3), (2,4), (3,5)], ['a', 'b']); d # construction from records as tuples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "sophisticated-variation",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 4]\n",
       "b|c        |a|d         \n",
       "3|None     |1|None      \n",
       "4|None     |2|new column\n",
       "5|also here|3|None      "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = dictable([dict(a = 1, b = 3), dict(a = 2, b = 4, d = 'new column'), dict(a = 3, b = 5, c = 'also here')]); d # construction from records as dicts, mismatching on keys is fine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "understanding-material",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[6 x 3]\n",
       "Symbol|Name               |Sector                \n",
       "MMM   |3M Company         |Industrials           \n",
       "AOS   |A.O. Smith Corp    |Industrials           \n",
       "ABT   |Abbott Laboratories|Health Care           \n",
       "ABBV  |AbbVie Inc.        |Health Care           \n",
       "ABMD  |ABIOMED Inc        |Health Care           \n",
       "ACN   |Accenture plc      |Information Technology"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = dictable(read_csv('d:/dropbox/yoav/python/pyg/docs/constituents_csv.csv')); d = d[:6]; d"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "formed-pitch",
   "metadata": {},
   "source": [
    "### row access"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "aware-cincinnati",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Symbol': 'MMM', 'Name': '3M Company', 'Sector': 'Industrials'}"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d[0] #returns a record"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "recreational-checkout",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[2 x 3]\n",
       "Symbol|Name           |Sector     \n",
       "MMM   |3M Company     |Industrials\n",
       "AOS   |A.O. Smith Corp|Industrials"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d[:2] ## subset rows using slice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "sharp-webcam",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'Symbol': 'MMM', 'Name': '3M Company', 'Sector': 'Industrials'}\n",
      "{'Symbol': 'AOS', 'Name': 'A.O. Smith Corp', 'Sector': 'Industrials'}\n",
      "{'Symbol': 'ABT', 'Name': 'Abbott Laboratories', 'Sector': 'Health Care'}\n",
      "{'Symbol': 'ABBV', 'Name': 'AbbVie Inc.', 'Sector': 'Health Care'}\n",
      "{'Symbol': 'ABMD', 'Name': 'ABIOMED Inc', 'Sector': 'Health Care'}\n",
      "{'Symbol': 'ACN', 'Name': 'Accenture plc', 'Sector': 'Information Technology'}\n"
     ]
    }
   ],
   "source": [
    "for row in d: # iteration is by row\n",
    "    print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "scheduled-weekly",
   "metadata": {},
   "source": [
    "### column access"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "printable-moment",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['3M Company',\n",
       " 'A.O. Smith Corp',\n",
       " 'Abbott Laboratories',\n",
       " 'AbbVie Inc.',\n",
       " 'ABIOMED Inc',\n",
       " 'Accenture plc']"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.Name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "corporate-study",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['3M Company',\n",
       " 'A.O. Smith Corp',\n",
       " 'Abbott Laboratories',\n",
       " 'AbbVie Inc.',\n",
       " 'ABIOMED Inc',\n",
       " 'Accenture plc']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d['Name']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "subtle-bacon",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('3M Company', 'Industrials'),\n",
       " ('A.O. Smith Corp', 'Industrials'),\n",
       " ('Abbott Laboratories', 'Health Care'),\n",
       " ('AbbVie Inc.', 'Health Care'),\n",
       " ('ABIOMED Inc', 'Health Care'),\n",
       " ('Accenture plc', 'Information Technology')]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d['Name', 'Sector']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "advised-price",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[6 x 2]\n",
       "Name               |Sector                \n",
       "3M Company         |Industrials           \n",
       "A.O. Smith Corp    |Industrials           \n",
       "Abbott Laboratories|Health Care           \n",
       "AbbVie Inc.        |Health Care           \n",
       "ABIOMED Inc        |Health Care           \n",
       "Accenture plc      |Information Technology"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d[['Name', 'Sector']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "broke-foundation",
   "metadata": {},
   "source": [
    "### d is a dict so supports the usual keys(), values() and items():"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "olympic-uzbekistan",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Symbol : ['MMM', 'AOS', 'ABT', 'ABBV', 'ABMD', 'ACN']\n",
      "Name : ['3M Company', 'A.O. Smith Corp', 'Abbott Laboratories', 'AbbVie Inc.', 'ABIOMED Inc', 'Accenture plc']\n",
      "Sector : ['Industrials', 'Industrials', 'Health Care', 'Health Care', 'Health Care', 'Information Technology']\n"
     ]
    }
   ],
   "source": [
    "for key, column in d.items():\n",
    "    print(key, ':', column)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "freelance-bahrain",
   "metadata": {},
   "source": [
    "access via __function__ of variables is also supported"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "friendly-impression",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['MMM, Industrials',\n",
       " 'AOS, Industrials',\n",
       " 'ABT, Health Care',\n",
       " 'ABBV, Health Care',\n",
       " 'ABMD, Health Care',\n",
       " 'ACN, Information Technology']"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d[lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "romance-tragedy",
   "metadata": {},
   "source": [
    "### column and row access are commutative"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "false-gauge",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert d[0].Name == d.Name[0] == '3M Company'\n",
    "assert d[0][lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)] == d[lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)][0] == 'MMM, Industrials'\n",
    "assert d[0]['Name'] == d['Name'][0]\n",
    "assert d[:2]['Name', 'Sector'] == d['Name', 'Sector'][:2]\n",
    "assert d[:2][['Name', 'Sector']] == d[['Name', 'Sector']][:2]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "outstanding-halloween",
   "metadata": {},
   "source": [
    "### adding records"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "bridal-desperate",
   "metadata": {},
   "outputs": [],
   "source": [
    "d = dictable(name = ['alan', 'barbara', 'chris'], surname = ['abramson', 'brown', 'cohen'], age = [1,2,3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "brilliant-subscription",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[4 x 3]\n",
       "age|name   |surname  \n",
       "1  |alan   |abramson \n",
       "2  |barbara|brown    \n",
       "3  |chris  |cohen    \n",
       "4  |david  |donaldson"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d + dict(name = 'david', surname = 'donaldson', age = 4) ## adding a single record"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "wired-findings",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[5 x 3]\n",
       "age|name   |surname  \n",
       "1  |alan   |abramson \n",
       "2  |barbara|brown    \n",
       "3  |chris  |cohen    \n",
       "4  |david  |donaldson\n",
       "5  |evan   |emmerson "
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d + [dict(name = 'david', surname = 'donaldson', age = 4), dict(name = 'evan', surname = 'emmerson', age = 5)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "arbitrary-affect",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[5 x 3]\n",
       "age|name   |surname  \n",
       "1  |alan   |abramson \n",
       "2  |barbara|brown    \n",
       "3  |chris  |cohen    \n",
       "4  |david  |donaldson\n",
       "5  |evan   |emmerson "
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d + dict(name = ['david', 'evan'], surname = ['donaldson', 'emmerson'], age = [4,5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "median-correlation",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[5 x 3]\n",
       "age|name   |surname  \n",
       "1  |alan   |abramson \n",
       "2  |barbara|brown    \n",
       "3  |chris  |cohen    \n",
       "4  |david  |donaldson\n",
       "5  |evan   |emmerson "
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d + pd.DataFrame(dict(name = ['david', 'evan'], surname = ['donaldson', 'emmerson'], age = [4,5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "micro-breakdown",
   "metadata": {},
   "source": [
    "### adding/modifying columns\n",
    "You can add a column or a constant by simply calling the dictable with the values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "stupid-regulation",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 5]\n",
       "name   |surname |age|gender|school \n",
       "alan   |abramson|1  |m     |St Paul\n",
       "barbara|brown   |2  |f     |St Paul\n",
       "chris  |cohen   |3  |m     |St Paul"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d(gender = ['m', 'f', 'm'])(school = 'St Paul')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "careful-chambers",
   "metadata": {},
   "source": [
    "More interestingly, it can be a callable function using the other variables..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "emotional-description",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 4]\n",
       "name   |surname |age|initials\n",
       "alan   |abramson|1  |aa      \n",
       "barbara|brown   |2  |bb      \n",
       "chris  |cohen   |3  |cc      "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = d(initials = lambda name, surname: name[0] + surname[0]); d"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "primary-pride",
   "metadata": {},
   "source": [
    "Given d is a dict, a more traditional way of setting a new key is by simple assignment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "passing-delaware",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 4]\n",
       "name   |surname |age|initials\n",
       "alan   |abramson|1  |aa      \n",
       "barbara|brown   |2  |bb      \n",
       "chris  |cohen   |3  |cc      "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d['initials'] = d[lambda name, surname: name[0] + surname[0]]; d"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "difficult-zealand",
   "metadata": {},
   "source": [
    "Or you can use the dict.update method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "liquid-bulgaria",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 5]\n",
       "name   |surname |age|initials|gender\n",
       "alan   |abramson|1  |aa      |m     \n",
       "barbara|brown   |2  |bb      |f     \n",
       "chris  |cohen   |3  |cc      |m     "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.update(dict(gender = ['m', 'f', 'm'])); d"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "excessive-village",
   "metadata": {},
   "source": [
    "### do\n",
    "Sometime we want to apply the same function(s) to a collection of columns. For this, 'do' will do nicely:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "explicit-muscle",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 5]\n",
       "name   |surname |age|initials|gender\n",
       "Alan   |Abramson|1  |AA      |M     \n",
       "Barbara|Brown   |2  |BB      |F     \n",
       "Chris  |Cohen   |3  |CC      |M     "
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = d.do(upper, 'initials', 'gender').do(proper, 'name', 'surname'); d"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "proprietary-affair",
   "metadata": {},
   "source": [
    "### removing columns "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "indoor-electricity",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 4]\n",
       "name   |surname |age|gender\n",
       "Alan   |Abramson|1  |M     \n",
       "Barbara|Brown   |2  |F     \n",
       "Chris  |Cohen   |3  |M     "
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = d - 'initials'; d"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "velvet-collect",
   "metadata": {},
   "source": [
    "### removing rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "significant-flavor",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[2 x 4]\n",
       "age|gender|name   |surname\n",
       "2  |F     |Barbara|Brown  \n",
       "3  |M     |Chris  |Cohen  "
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.exc(name = 'Alan')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "clean-winner",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[2 x 4]\n",
       "age|gender|name |surname \n",
       "1  |M     |Alan |Abramson\n",
       "3  |M     |Chris|Cohen   "
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.inc(name = ['Alan', 'Chris'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "expanded-phase",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[2 x 4]\n",
       "age|gender|name   |surname\n",
       "2  |F     |Barbara|Brown  \n",
       "3  |M     |Chris  |Cohen  "
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.inc(lambda age: age>1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "disabled-scroll",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[1 x 4]\n",
       "name   |surname|age|gender\n",
       "Barbara|Brown  |2  |F     "
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.exc(lambda gender: gender == 'M')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "likely-privacy",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[2 x 4]\n",
       "age|gender|name |surname \n",
       "1  |M     |Alan |Abramson\n",
       "3  |M     |Chris|Cohen   "
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.exc(lambda name, surname: len(name)>len(surname))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "annual-seafood",
   "metadata": {},
   "source": [
    "### sort"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "pointed-fetish",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 4]\n",
       "name   |surname |age|gender\n",
       "Alan   |Abramson|1  |M     \n",
       "Barbara|Brown   |2  |F     \n",
       "Chris  |Cohen   |3  |M     "
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.sort('name', 'surname')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "impaired-power",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 4]\n",
       "name   |surname |age|gender\n",
       "Barbara|Brown   |2  |F     \n",
       "Alan   |Abramson|1  |M     \n",
       "Chris  |Cohen   |3  |M     "
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d.sort(lambda name: name[::-1]) # can sort on functions of variables too"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "destroyed-prophet",
   "metadata": {},
   "source": [
    "### listby(keys)\n",
    "listby is like groupby except it returns a dictable with unique keys and the other columns are returned as a list. We find that MUCH more useful usually than groupby"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "binary-diversity",
   "metadata": {},
   "outputs": [],
   "source": [
    "grades = dictable(name = ['alan', 'barbara', 'chris'], grades = [30,90,80], subject = 'english', teacher = 'mr bennet') \\\n",
    "       + dictable(name = ['alan', 'david', 'esther'], grades = [40,50,70], subject = 'math', teacher = 'mrs ruler') \\\n",
    "       + dictable(name = ['barbara', 'chris', 'esther'], grades = [90,60,80], subject = 'french', teacher = 'dr francois')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "understanding-cooper",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 4]\n",
       "teacher    |grades      |name                          |subject                          \n",
       "dr francois|[90, 60, 80]|['barbara', 'chris', 'esther']|['french', 'french', 'french']   \n",
       "mr bennet  |[30, 90, 80]|['alan', 'barbara', 'chris']  |['english', 'english', 'english']\n",
       "mrs ruler  |[40, 50, 70]|['alan', 'david', 'esther']   |['math', 'math', 'math']         "
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grades.listby('teacher')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "blessed-brief",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[3 x 5]\n",
       "teacher    |grades      |name                          |subject                          |avg_grade         \n",
       "dr francois|[90, 60, 80]|['barbara', 'chris', 'esther']|['french', 'french', 'french']   |76.66666666666667 \n",
       "mr bennet  |[30, 90, 80]|['alan', 'barbara', 'chris']  |['english', 'english', 'english']|66.66666666666667 \n",
       "mrs ruler  |[40, 50, 70]|['alan', 'david', 'esther']   |['math', 'math', 'math']         |53.333333333333336"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grades.listby('teacher')(avg_grade = lambda grades: np.mean(grades))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "steady-rebate",
   "metadata": {},
   "source": [
    "### unlist\n",
    "unlist undoes listby() assuming it is possible..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "worldwide-cathedral",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[9 x 4]\n",
       "grades|name   |subject|teacher    \n",
       "90    |barbara|french |dr francois\n",
       "60    |chris  |french |dr francois\n",
       "80    |esther |french |dr francois\n",
       "...9 rows...\n",
       "40    |alan   |math   |mrs ruler  \n",
       "50    |david  |math   |mrs ruler  \n",
       "70    |esther |math   |mrs ruler  "
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grades.listby('teacher').unlist()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "useful-murray",
   "metadata": {},
   "source": [
    "### groupby(keys) and ungroup\n",
    "This is similar to DatFrame groupby except that instead of a new object, a dictable is returned: The name of the grouped column is given by 'grp'. ungroup allows us to get back to original."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "surprised-mexico",
   "metadata": {},
   "outputs": [],
   "source": [
    "classes = grades.groupby(['teacher', 'subject'], grp = 'class')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "compatible-pizza",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'teacher': 'dr francois',\n",
       " 'subject': 'french',\n",
       " 'class': dictable[3 x 2]\n",
       " grades|name   \n",
       " 90    |barbara\n",
       " 60    |chris  \n",
       " 80    |esther }"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classes[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "acknowledged-roommate",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[9 x 4]\n",
       "grades|name   |subject|teacher    \n",
       "90    |barbara|french |dr francois\n",
       "60    |chris  |french |dr francois\n",
       "80    |esther |french |dr francois\n",
       "...9 rows...\n",
       "40    |alan   |math   |mrs ruler  \n",
       "50    |david  |math   |mrs ruler  \n",
       "70    |esther |math   |mrs ruler  "
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classes.ungroup('class')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "mineral-psychology",
   "metadata": {},
   "source": [
    "### inner join\n",
    "The multiplication operation is overloaded for the join method. By default, if two dictables share keys, the join is an inner join on the keys "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "imported-enemy",
   "metadata": {},
   "outputs": [],
   "source": [
    "students = dictable(name = ['alan', 'barbara', 'chris', 'david', 'esthar', 'fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "australian-citation",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shared keys: ['name']\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "dictable[7 x 5]\n",
       "name   |grades|subject|teacher    |surname \n",
       "alan   |30    |english|mr bennet  |abramsom\n",
       "alan   |40    |math   |mrs ruler  |abramsom\n",
       "barbara|90    |english|mr bennet  |brown   \n",
       "...7 rows...\n",
       "chris  |80    |english|mr bennet  |cohen   \n",
       "chris  |60    |french |dr francois|cohen   \n",
       "david  |50    |math   |mrs ruler  |drummond"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('shared keys:', grades.keys() & students.keys())\n",
    "grades * students"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fitted-muslim",
   "metadata": {},
   "source": [
    "Are there students with no surname? We can do a xor or use division which is overloaded for xor: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "about-norway",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[2 x 4]\n",
       "grades|name  |subject|teacher    \n",
       "70    |esther|math   |mrs ruler  \n",
       "80    |esther|french |dr francois"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grades / students"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "intimate-lighting",
   "metadata": {},
   "source": [
    "Are there students with no grades?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "cordless-peeing",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[2 x 2]\n",
       "name  |surname   \n",
       "esthar|ecklestone\n",
       "fabian|fox       "
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "students / grades"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "acting-founder",
   "metadata": {},
   "outputs": [],
   "source": [
    "students = dictable(name = ['Alan', 'Barbara', 'Chris', 'David', 'Esther', 'Fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "classified-launch",
   "metadata": {},
   "source": [
    "We fixed Esther's spelling but introduced capitalization, that is OK, we are allowed to inner join on functions of keys too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "curious-orlando",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[9 x 5]\n",
       "name   |grades|subject|teacher    |surname   \n",
       "alan   |30    |english|mr bennet  |abramsom  \n",
       "alan   |40    |math   |mrs ruler  |abramsom  \n",
       "barbara|90    |english|mr bennet  |brown     \n",
       "...9 rows...\n",
       "david  |50    |math   |mrs ruler  |drummond  \n",
       "esther |70    |math   |mrs ruler  |ecklestone\n",
       "esther |80    |french |dr francois|ecklestone"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grades.join(students, 'name', lambda name: name.lower())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "progressive-phoenix",
   "metadata": {},
   "outputs": [],
   "source": [
    "students = dictable(first_name = ['alan', 'barbara', 'chris', 'david', 'esther', 'fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "naked-trial",
   "metadata": {},
   "source": [
    "You can inner join on different column names and both columns will be populated: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "defined-vocabulary",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[9 x 6]\n",
       "name   |grades|subject|teacher    |first_name|surname   \n",
       "alan   |30    |english|mr bennet  |alan      |abramsom  \n",
       "alan   |40    |math   |mrs ruler  |alan      |abramsom  \n",
       "barbara|90    |english|mr bennet  |barbara   |brown     \n",
       "...9 rows...\n",
       "david  |50    |math   |mrs ruler  |david     |drummond  \n",
       "esther |70    |math   |mrs ruler  |esther    |ecklestone\n",
       "esther |80    |french |dr francois|esther    |ecklestone"
      ]
     },
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grades.join(pd.DataFrame(students), 'name', 'first_name')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "pretty-cheese",
   "metadata": {},
   "source": [
    "### inner join (with other columns that match names)\n",
    "By default, if columns are shared but are not in the join, they will be returned with a tuple containing both values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "exterior-receptor",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[5 x 4]\n",
       "key|y|z|x     \n",
       "a  |4|2|(1, 4)\n",
       "b  |5|8|(2, 1)\n",
       "b  |5|9|(2, 2)\n",
       "c  |6|1|(3, 3)\n",
       "c  |7|1|(4, 3)"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = dictable(key = ['a', 'b', 'c', 'c'], x = [1,2,3,4], y = [4,5,6,7])\n",
    "y = dictable(key = ['b', 'b', 'c', 'a'], x = [1,2,3,4], z = [8,9,1,2])\n",
    "x.join(y, 'key', 'key') ## ignore x column for joining"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "transsexual-highway",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[5 x 4]\n",
       "key|y|z|x\n",
       "a  |4|2|1\n",
       "b  |5|8|2\n",
       "b  |5|9|2\n",
       "c  |6|1|3\n",
       "c  |7|1|4"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x.join(y, 'key', 'key', mode = 'left') ## grab left value"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "incoming-victoria",
   "metadata": {},
   "source": [
    "### cross join\n",
    "If no columns are shared, then a cross join is returned. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "present-roads",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[12 x 2]\n",
       "x|y\n",
       "1|1\n",
       "1|2\n",
       "1|3\n",
       "...12 rows...\n",
       "4|1\n",
       "4|2\n",
       "4|3"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = dictable(x = [1,2,3,4])\n",
    "y = dict(y = [1,2,3])\n",
    "x * y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "apart-bullet",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[12 x 2]\n",
       "x|y\n",
       "1|1\n",
       "1|2\n",
       "1|3\n",
       "...12 rows...\n",
       "4|1\n",
       "4|2\n",
       "4|3"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x.join(y, [], []) ## you can force a full outer join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "favorite-master",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x / y == x"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "killing-pillow",
   "metadata": {},
   "source": [
    "### xor (versus left and right join)\n",
    "We find left/right join actually not very useful. There is usually a genuine reason for records for which there is a match and for records for which there isn't. And the treatment of these is distinct, which means a left-join operation that joins the two outcomes together is positively harmful.\n",
    "\n",
    "The xor operator is much more useful and you can use it to recreate left/right join if we really must. Here is an example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "described-remedy",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[9 x 5]\n",
       "grades|name   |teacher    |surname |subject\n",
       "30    |alan   |mr bennet  |abramsom|english\n",
       "40    |alan   |mrs ruler  |abramsom|math   \n",
       "90    |barbara|mr bennet  |brown   |english\n",
       "...9 rows...\n",
       "50    |david  |mrs ruler  |None    |math   \n",
       "70    |esther |mrs ruler  |None    |math   \n",
       "80    |esther |dr francois|None    |french "
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "students = dictable(name = ['alan', 'barbara', 'chris'], surname = ['abramsom', 'brown', 'cohen',])\n",
    "new_students = dictable(name = ['david', 'esther', 'fabian'], surname = ['drummond', 'ecklestone', 'fox'])\n",
    "\n",
    "inner_join = grades * students ## grades with students\n",
    "left_xor = grades / students  ## grades without sudents\n",
    "\n",
    "# you can...\n",
    "left_join = grades * students + grades / students ## grades for which no surname is available will have None surname\n",
    "left_join"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "earned-playback",
   "metadata": {},
   "outputs": [],
   "source": [
    "# but really you want to do:\n",
    "student_grades = grades * students\n",
    "unmapped_grades = grades / students ## we treat this one separately...\n",
    "new_student_grades =  unmapped_grades * new_students ## and grab surnames from the new students table..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "norwegian-heating",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(unmapped_grades / new_student_grades) == 0, 'students must exist either in the students table or in the new students table'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "peripheral-reggae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[9 x 5]\n",
       "grades|name   |subject|surname   |teacher    \n",
       "30    |alan   |english|abramsom  |mr bennet  \n",
       "40    |alan   |math   |abramsom  |mrs ruler  \n",
       "90    |barbara|english|brown     |mr bennet  \n",
       "...9 rows...\n",
       "50    |david  |math   |drummond  |mrs ruler  \n",
       "70    |esther |math   |ecklestone|mrs ruler  \n",
       "80    |esther |french |ecklestone|dr francois"
      ]
     },
     "execution_count": 75,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_grades = student_grades + new_student_grades; all_grades"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "serious-sierra",
   "metadata": {},
   "source": [
    "### pivot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "dynamic-programmer",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[16 x 2]\n",
       "x|y\n",
       "1|1\n",
       "1|2\n",
       "1|3\n",
       "...16 rows...\n",
       "4|2\n",
       "4|3\n",
       "4|4"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = dictable(x = [1,2,3,4])\n",
    "y = dictable(y = [1,2,3,4])\n",
    "xy = (x * y)\n",
    "xy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "actual-scholarship",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[4 x 5]\n",
       "x|1  |2  |3   |4   \n",
       "1|[1]|[2]|[3] |[4] \n",
       "2|[2]|[4]|[6] |[8] \n",
       "3|[3]|[6]|[9] |[12]\n",
       "4|[4]|[8]|[12]|[16]"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xy.pivot('x', 'y', lambda x, y: x*y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "plain-width",
   "metadata": {},
   "source": [
    "### a few observations:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "prime-novel",
   "metadata": {},
   "source": [
    "* as per usual, can provide a function for values in table (indeed columns y) and not just keys\n",
    "* the output in the cells come back as a list. This is because sometimes there are more than one row with given x and y, and sometimes there are none:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "opponent-inclusion",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[4 x 5]\n",
       "x|1     |2     |3       |4       \n",
       "1|[1, 1]|[2, 2]|[3, 3]  |None    \n",
       "2|[2, 2]|[4, 4]|None    |[8, 8]  \n",
       "3|[3, 3]|None  |[9, 9]  |[12, 12]\n",
       "4|None  |[8, 8]|[12, 12]|[16, 16]"
      ]
     },
     "execution_count": 78,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(xy + xy).exc(lambda x,y: x+y == 5).pivot('x', 'y', lambda x, y: x*y)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "intensive-crash",
   "metadata": {},
   "source": [
    "You can apply a sequence of aggregate functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "expressed-wellington",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dictable[4 x 5]\n",
       "x|1   |2   |3   |4   \n",
       "1|2   |2   |2   |None\n",
       "2|2   |2   |None|2   \n",
       "3|2   |None|2   |2   \n",
       "4|None|2   |2   |2   "
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(xy + xy).exc(lambda x,y: x+y == 5).pivot('x', 'y', lambda x, y: x*y, lambda v: len(v))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}