{
"cells": [
{
"cell_type": "markdown",
"id": "stupid-think",
"metadata": {},
"source": [
"# pyg.base.dictable\n",
"\n",
"dictable is a table, a collection of iterable records. It is also a dict with each key's value being a column. \n",
"Why not use a pandas.DataFrame? pd.DataFrame leads a dual life: "
]
},
{
"cell_type": "markdown",
"id": "encouraging-driver",
"metadata": {},
"source": [
"* by day an index-based optimized numpy array supporting e.g. timeseries analytics etc.\n",
"* by night, a table with keys supporting filtering, aggregating, pivoting on keys as well as inner/outer joining on keys."
]
},
{
"cell_type": "markdown",
"id": "architectural-cursor",
"metadata": {},
"source": [
"As a result, the pandas interface is somewhat cumbersome. Further, the DataFrame isn't really designed for containing more complicated objects within it.\n",
"
Conversely, dictable only tries to do the latter and is designed precisely for holding entire research process in one place. You can think of dictable as 'one level up' on a DataFrame: a dictable will handle thousands of data frames within it with ease.\n",
"Indeed, dictable should be thought of as an 'organiser of research flow' rather than as an array of primitives.\n",
"
In general, each row will contain some keys indexing the experiment, while some keys will contain complicated objects: a pd.DataFrame, a timeseries, yield_curves, machine-learning experiments etc.\n",
"
The interface is succinct and extremely intuitive, allowing the user to concentrate on logic of the calculations rather than boilerplate.\n",
"\n",
"## Motivation: dictable as an organiser of research flow\n",
"We start with a simple motivating example. Here is a typical workflow:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "roman-congo",
"metadata": {},
"outputs": [],
"source": [
"from pyg import *; import pandas as pd; import numpy as np\n",
"import yfinance as yf"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "middle-recorder",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"\n",
"1 Failed download:\n",
"- BAD_SYMBOL: No data found, symbol may be delisted\n",
"[*********************100%***********************] 1 of 1 completed\n"
]
}
],
"source": [
"symbols = ['MSFT', 'WMT', 'TSLA', 'AAPL', 'BAD_SYMBOL', 'C']\n",
"history = [yf.download(symbol) for symbol in symbols]\n",
"prices = [h['Adj Close'] for h in history]\n",
"rtns = [p.diff() for p in prices]\n",
"vols = [r.ewm(30).std() for r in rtns]\n",
"zscores = [r/v for r,v in zip(rtns, vols)]\n",
"zavgs = [z.mean() for z in zscores]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "speaking-matrix",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.06238896915574035,\n",
" 0.045634555332148996,\n",
" 0.0676156301672513,\n",
" 0.053189575669227614,\n",
" nan,\n",
" 0.027297252361386543]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zavgs"
]
},
{
"cell_type": "markdown",
"id": "universal-trace",
"metadata": {},
"source": [
"At this point we ask ourselves: Why do we have a **nan**? Which ticker was it, and when did it go wrong? "
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "circular-outline",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['BAD_SYMBOL']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bad_symbols = [s for s, z in zip(symbols, zavgs) if np.isnan(z)]; bad_symbols"
]
},
{
"cell_type": "markdown",
"id": "pending-trailer",
"metadata": {},
"source": [
"Great, how do we remove bad symbols from all our other variables?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "thick-research",
"metadata": {},
"outputs": [],
"source": [
"vols = [v for s, v in zip(symbols, vols) if s not in bad_symbols]"
]
},
{
"cell_type": "markdown",
"id": "strange-fence",
"metadata": {},
"source": [
"Now we can calculate some stuff with rtns and vols perhaps?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "great-colony",
"metadata": {},
"outputs": [],
"source": [
"ewmas = [r.ewm(n).mean()/v for r,v in zip(rtns, vols) for n in [10, 20, 30]]"
]
},
{
"cell_type": "markdown",
"id": "accepted-placement",
"metadata": {},
"source": [
"Things went wrong and went wrong silently too: "
]
},
{
"cell_type": "markdown",
"id": "greek-wildlife",
"metadata": {},
"source": [
"* We forgot to remove bad data from rtns as well as from vols so our zip function is zipping the wrong stocks together\n",
"* It is nearly impossible to discover what item in the list belong to what n and what stock"
]
},
{
"cell_type": "markdown",
"id": "systematic-bruce",
"metadata": {},
"source": [
"If you ever dealt with real data, the mess described above must be familiar."
]
},
{
"cell_type": "markdown",
"id": "severe-projector",
"metadata": {},
"source": [
"## Same code, in dictable"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "breathing-colony",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"[*********************100%***********************] 1 of 1 completed\n",
"\n",
"1 Failed download:\n",
"- BAD_SYMBOL: No data found, symbol may be delisted\n",
"[*********************100%***********************] 1 of 1 completed\n"
]
}
],
"source": [
"from pyg import *\n",
"import yfinance as yf\n",
"s = dictable(symbol = ['MSFT', 'WMT', 'TSLA', 'AAPL', 'BAD_SYMBOL', 'C'])\n",
"s = s(history = lambda symbol: yf.download(symbol))\n",
"s = s(price = lambda history: history['Adj Close'])\n",
"s = s(rtn = lambda price: price.diff())\n",
"s = s(vol = lambda rtn: rtn.ewm(30).std())\n",
"s = s(zscore = lambda rtn, vol: rtn/vol)\n",
"s = s(zavg = lambda zscore: zscore.mean())"
]
},
{
"cell_type": "markdown",
"id": "sexual-period",
"metadata": {},
"source": [
"dictable __s__ contains all our data. \n"
]
},
{
"cell_type": "markdown",
"id": "interesting-hindu",
"metadata": {},
"source": [
"* each row contains all the variables associated with a specific symbol\n",
"* each column corresponds to a variable\n",
"* adding a new variable is declarative and free of boiler-plate loop and zip"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "likely-anniversary",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[6 x 4]\n",
"symbol |history |vol |zavg \n",
"MSFT | Open High Low |Date |0.06238896915574035 \n",
" |Date |1986-03-13 NaN | \n",
" |1986-03-13 0.088542 0.101563 0.088542 |1986-03-14 NaN | \n",
" |1986-03-14 0.097222 0.102431 0.097222 |1986-03-17 0.000779 | \n",
" |1986-03-17 0.100694 0.103299 0.100694 |1986-03-18 0.001997 | \n",
"WMT | Open High Low |Date |0.045634555332148996\n",
" |Date |1972-08-25 NaN | \n",
" |1972-08-25 0.063477 0.064697 0.063477 |1972-08-28 NaN | \n",
" |1972-08-28 0.064453 0.064941 0.064209 |1972-08-29 0.000198 | \n",
" |1972-08-29 0.063965 0.063965 0.063477 |1972-08-30 0.000215 | \n",
"TSLA | Open High Low |Date |0.0676156301672513 \n",
" |Date |2010-06-29 NaN | \n",
" |2010-06-29 3.800000 5.000000 3.508000 |2010-06-30 NaN | \n",
" |2010-06-30 5.158000 6.084000 4.660000 |2010-07-01 0.255972 | \n",
" |2010-07-01 5.000000 5.184000 4.054000 |2010-07-02 0.274164 | \n",
"AAPL | Open High Low |Date |0.053189575669227614\n",
" |Date |1980-12-12 NaN | \n",
" |1980-12-12 0.128348 0.128906 0.128348 |1980-12-15 NaN | \n",
" |1980-12-15 0.122210 0.122210 0.121652 |1980-12-16 0.001242 | \n",
" |1980-12-16 0.113281 0.113281 0.112723 |1980-12-17 0.004938 | \n",
"BAD_SYMBOL|Empty DataFrame |Series([], Name: Adj Close, dtype: float64)|nan \n",
" |Columns: [Open, High, Low, Close, Adj Close, Volum| | \n",
" |Index: [] | | \n",
"C | Open High Low C|Date |0.027297252361386543\n",
" |Date |1977-01-03 NaN | \n",
" |1977-01-03 16.133125 16.236876 16.133125 16.23|1977-01-04 NaN | \n",
" |1977-01-04 16.236876 16.288750 16.184999 16.28|1977-01-05 0.053720 | \n",
" |1977-01-05 16.288750 16.288750 16.133125 16.18|1977-01-06 0.043500 | "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s[['symbol', 'history', 'vol', 'zavg']]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "wicked-fairy",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.06238896915574035,\n",
" 0.045634555332148996,\n",
" 0.0676156301672513,\n",
" 0.053189575669227614,\n",
" nan,\n",
" 0.027297252361386543]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s.zavg"
]
},
{
"cell_type": "markdown",
"id": "literary-eligibility",
"metadata": {},
"source": [
"### Oh, no, we have a bad symbol, how do we remove it?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "dated-vintage",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.06238896915574035,\n",
" 0.045634555332148996,\n",
" 0.0676156301672513,\n",
" 0.053189575669227614,\n",
" 0.027297252361386543]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s = s.exc(zavg = np.nan); s.zavg"
]
},
{
"cell_type": "markdown",
"id": "sound-bowling",
"metadata": {},
"source": [
"### Now if we want to calculate something per symbol and window...\n",
"We want to create a new table, now keyed on two values: symbol and window n, so we create a bigger table using cross product:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "legislative-segment",
"metadata": {},
"outputs": [],
"source": [
"sn = s * dict(n = [10,20,30]) ## each row is now unique per symbol and window n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "injured-floor",
"metadata": {},
"outputs": [],
"source": [
"sn = sn(ewma = lambda rtn, n, vol: rtn.ewm(n).mean()/vol)"
]
},
{
"cell_type": "markdown",
"id": "mounted-convention",
"metadata": {},
"source": [
"And here is Citibank's three ewma..."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "single-plaza",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 2]\n",
"n |ewma \n",
"10|Date \n",
" |1977-01-03 NaN\n",
" |1977-01-04 NaN\n",
" |1977-01-05 -0.269415\n",
" |1977-01-06 -0.636750\n",
"20|Date \n",
" |1977-01-03 NaN\n",
" |1977-01-04 NaN\n",
" |1977-01-05 -0.252990\n",
" |1977-01-06 -0.610388\n",
"30|Date \n",
" |1977-01-03 NaN\n",
" |1977-01-04 NaN\n",
" |1977-01-05 -0.247336\n",
" |1977-01-06 -0.601208"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sn.inc(symbol = 'C')[['n', 'ewma']]"
]
},
{
"cell_type": "markdown",
"id": "racial-dining",
"metadata": {},
"source": [
"Here is a pivot table of the average of each ewma per symbol and window... Note that again, we can access functions of variables and not just the existing keys in the dictable"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "latest-minority",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[5 x 4]\n",
"symbol|10 |20 |30 \n",
"AAPL |[0.05186739363086048] |[0.048216725636102103]|[0.044663941035992756]\n",
"C |[0.027514106874753336]|[0.026447710541888325]|[0.02513228496004374] \n",
"MSFT |[0.061144190254947106]|[0.058806269926441355]|[0.056647557919386575]\n",
"TSLA |[0.05809192519536144] |[0.0513719449570986] |[0.0461212842741293] \n",
"WMT |[0.04534668336293025] |[0.04406462686679985] |[0.0426962176790935] "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sn.pivot('symbol', 'n', lambda ewma: ewma.mean())"
]
},
{
"cell_type": "markdown",
"id": "first-engine",
"metadata": {},
"source": [
"## dictable functionality\n",
"### construction\n",
"dictable is quite flexible on constuctions. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "efficient-atmosphere",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[4 x 2]\n",
"a|b\n",
"1|a\n",
"2|b\n",
"3|c\n",
"4|d"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dictable(a = [1,2,3,4], b = ['a', 'b', 'c', 'd']); d"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "existing-cedar",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[4 x 4]\n",
"symbol|exchange|a|b\n",
"MSFT |NYSE |1|a\n",
"AAPL |NYSE |2|b\n",
"APA |NYSE |3|c\n",
"MMM |NYSE |4|d"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dictable(dict(a = [1,2,3,4], b = ['a', 'b', 'c', 'd']), symbol = ['MSFT', 'AAPL', 'APA', 'MMM'], exchange = 'NYSE'); d"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "direct-security",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(d) # can instantiate a DataFrame from a dictable with no code and vice versa..."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "civic-reality",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[4 x 4]\n",
"symbol|exchange|a|b\n",
"MSFT |NYSE |1|a\n",
"AAPL |NYSE |2|b\n",
"APA |NYSE |3|c\n",
"MMM |NYSE |4|d"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dictable(df); d"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "hydraulic-tuition",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 2]\n",
"a|b\n",
"1|3\n",
"2|4\n",
"3|5"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dictable([(1,3), (2,4), (3,5)], ['a', 'b']); d # construction from records as tuples"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "sophisticated-variation",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 4]\n",
"b|c |a|d \n",
"3|None |1|None \n",
"4|None |2|new column\n",
"5|also here|3|None "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dictable([dict(a = 1, b = 3), dict(a = 2, b = 4, d = 'new column'), dict(a = 3, b = 5, c = 'also here')]); d # construction from records as dicts, mismatching on keys is fine"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "understanding-material",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[6 x 3]\n",
"Symbol|Name |Sector \n",
"MMM |3M Company |Industrials \n",
"AOS |A.O. Smith Corp |Industrials \n",
"ABT |Abbott Laboratories|Health Care \n",
"ABBV |AbbVie Inc. |Health Care \n",
"ABMD |ABIOMED Inc |Health Care \n",
"ACN |Accenture plc |Information Technology"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dictable(read_csv('d:/dropbox/yoav/python/pyg/docs/constituents_csv.csv')); d = d[:6]; d"
]
},
{
"cell_type": "markdown",
"id": "formed-pitch",
"metadata": {},
"source": [
"### row access"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "aware-cincinnati",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Symbol': 'MMM', 'Name': '3M Company', 'Sector': 'Industrials'}"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[0] #returns a record"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "recreational-checkout",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[2 x 3]\n",
"Symbol|Name |Sector \n",
"MMM |3M Company |Industrials\n",
"AOS |A.O. Smith Corp|Industrials"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[:2] ## subset rows using slice"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "sharp-webcam",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'Symbol': 'MMM', 'Name': '3M Company', 'Sector': 'Industrials'}\n",
"{'Symbol': 'AOS', 'Name': 'A.O. Smith Corp', 'Sector': 'Industrials'}\n",
"{'Symbol': 'ABT', 'Name': 'Abbott Laboratories', 'Sector': 'Health Care'}\n",
"{'Symbol': 'ABBV', 'Name': 'AbbVie Inc.', 'Sector': 'Health Care'}\n",
"{'Symbol': 'ABMD', 'Name': 'ABIOMED Inc', 'Sector': 'Health Care'}\n",
"{'Symbol': 'ACN', 'Name': 'Accenture plc', 'Sector': 'Information Technology'}\n"
]
}
],
"source": [
"for row in d: # iteration is by row\n",
" print(row)"
]
},
{
"cell_type": "markdown",
"id": "scheduled-weekly",
"metadata": {},
"source": [
"### column access"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "printable-moment",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['3M Company',\n",
" 'A.O. Smith Corp',\n",
" 'Abbott Laboratories',\n",
" 'AbbVie Inc.',\n",
" 'ABIOMED Inc',\n",
" 'Accenture plc']"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.Name"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "corporate-study",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['3M Company',\n",
" 'A.O. Smith Corp',\n",
" 'Abbott Laboratories',\n",
" 'AbbVie Inc.',\n",
" 'ABIOMED Inc',\n",
" 'Accenture plc']"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d['Name']"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "subtle-bacon",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('3M Company', 'Industrials'),\n",
" ('A.O. Smith Corp', 'Industrials'),\n",
" ('Abbott Laboratories', 'Health Care'),\n",
" ('AbbVie Inc.', 'Health Care'),\n",
" ('ABIOMED Inc', 'Health Care'),\n",
" ('Accenture plc', 'Information Technology')]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d['Name', 'Sector']"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "advised-price",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[6 x 2]\n",
"Name |Sector \n",
"3M Company |Industrials \n",
"A.O. Smith Corp |Industrials \n",
"Abbott Laboratories|Health Care \n",
"AbbVie Inc. |Health Care \n",
"ABIOMED Inc |Health Care \n",
"Accenture plc |Information Technology"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[['Name', 'Sector']]"
]
},
{
"cell_type": "markdown",
"id": "broke-foundation",
"metadata": {},
"source": [
"### d is a dict so supports the usual keys(), values() and items():"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "olympic-uzbekistan",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Symbol : ['MMM', 'AOS', 'ABT', 'ABBV', 'ABMD', 'ACN']\n",
"Name : ['3M Company', 'A.O. Smith Corp', 'Abbott Laboratories', 'AbbVie Inc.', 'ABIOMED Inc', 'Accenture plc']\n",
"Sector : ['Industrials', 'Industrials', 'Health Care', 'Health Care', 'Health Care', 'Information Technology']\n"
]
}
],
"source": [
"for key, column in d.items():\n",
" print(key, ':', column)"
]
},
{
"cell_type": "markdown",
"id": "freelance-bahrain",
"metadata": {},
"source": [
"access via __function__ of variables is also supported"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "friendly-impression",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['MMM, Industrials',\n",
" 'AOS, Industrials',\n",
" 'ABT, Health Care',\n",
" 'ABBV, Health Care',\n",
" 'ABMD, Health Care',\n",
" 'ACN, Information Technology']"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)]"
]
},
{
"cell_type": "markdown",
"id": "romance-tragedy",
"metadata": {},
"source": [
"### column and row access are commutative"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "false-gauge",
"metadata": {},
"outputs": [],
"source": [
"assert d[0].Name == d.Name[0] == '3M Company'\n",
"assert d[0][lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)] == d[lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)][0] == 'MMM, Industrials'\n",
"assert d[0]['Name'] == d['Name'][0]\n",
"assert d[:2]['Name', 'Sector'] == d['Name', 'Sector'][:2]\n",
"assert d[:2][['Name', 'Sector']] == d[['Name', 'Sector']][:2]\n"
]
},
{
"cell_type": "markdown",
"id": "outstanding-halloween",
"metadata": {},
"source": [
"### adding records"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "bridal-desperate",
"metadata": {},
"outputs": [],
"source": [
"d = dictable(name = ['alan', 'barbara', 'chris'], surname = ['abramson', 'brown', 'cohen'], age = [1,2,3])"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "brilliant-subscription",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[4 x 3]\n",
"age|name |surname \n",
"1 |alan |abramson \n",
"2 |barbara|brown \n",
"3 |chris |cohen \n",
"4 |david |donaldson"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d + dict(name = 'david', surname = 'donaldson', age = 4) ## adding a single record"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "wired-findings",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[5 x 3]\n",
"age|name |surname \n",
"1 |alan |abramson \n",
"2 |barbara|brown \n",
"3 |chris |cohen \n",
"4 |david |donaldson\n",
"5 |evan |emmerson "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d + [dict(name = 'david', surname = 'donaldson', age = 4), dict(name = 'evan', surname = 'emmerson', age = 5)]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "arbitrary-affect",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[5 x 3]\n",
"age|name |surname \n",
"1 |alan |abramson \n",
"2 |barbara|brown \n",
"3 |chris |cohen \n",
"4 |david |donaldson\n",
"5 |evan |emmerson "
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d + dict(name = ['david', 'evan'], surname = ['donaldson', 'emmerson'], age = [4,5])"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "median-correlation",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[5 x 3]\n",
"age|name |surname \n",
"1 |alan |abramson \n",
"2 |barbara|brown \n",
"3 |chris |cohen \n",
"4 |david |donaldson\n",
"5 |evan |emmerson "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d + pd.DataFrame(dict(name = ['david', 'evan'], surname = ['donaldson', 'emmerson'], age = [4,5]))"
]
},
{
"cell_type": "markdown",
"id": "micro-breakdown",
"metadata": {},
"source": [
"### adding/modifying columns\n",
"You can add a column or a constant by simply calling the dictable with the values:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "stupid-regulation",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 5]\n",
"name |surname |age|gender|school \n",
"alan |abramson|1 |m |St Paul\n",
"barbara|brown |2 |f |St Paul\n",
"chris |cohen |3 |m |St Paul"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d(gender = ['m', 'f', 'm'])(school = 'St Paul')"
]
},
{
"cell_type": "markdown",
"id": "careful-chambers",
"metadata": {},
"source": [
"More interestingly, it can be a callable function using the other variables..."
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "emotional-description",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 4]\n",
"name |surname |age|initials\n",
"alan |abramson|1 |aa \n",
"barbara|brown |2 |bb \n",
"chris |cohen |3 |cc "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = d(initials = lambda name, surname: name[0] + surname[0]); d"
]
},
{
"cell_type": "markdown",
"id": "primary-pride",
"metadata": {},
"source": [
"Given d is a dict, a more traditional way of setting a new key is by simple assignment:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "passing-delaware",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 4]\n",
"name |surname |age|initials\n",
"alan |abramson|1 |aa \n",
"barbara|brown |2 |bb \n",
"chris |cohen |3 |cc "
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d['initials'] = d[lambda name, surname: name[0] + surname[0]]; d"
]
},
{
"cell_type": "markdown",
"id": "difficult-zealand",
"metadata": {},
"source": [
"Or you can use the dict.update method:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "liquid-bulgaria",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 5]\n",
"name |surname |age|initials|gender\n",
"alan |abramson|1 |aa |m \n",
"barbara|brown |2 |bb |f \n",
"chris |cohen |3 |cc |m "
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.update(dict(gender = ['m', 'f', 'm'])); d"
]
},
{
"cell_type": "markdown",
"id": "excessive-village",
"metadata": {},
"source": [
"### do\n",
"Sometime we want to apply the same function(s) to a collection of columns. For this, 'do' will do nicely:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "explicit-muscle",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 5]\n",
"name |surname |age|initials|gender\n",
"Alan |Abramson|1 |AA |M \n",
"Barbara|Brown |2 |BB |F \n",
"Chris |Cohen |3 |CC |M "
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = d.do(upper, 'initials', 'gender').do(proper, 'name', 'surname'); d"
]
},
{
"cell_type": "markdown",
"id": "proprietary-affair",
"metadata": {},
"source": [
"### removing columns "
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "indoor-electricity",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 4]\n",
"name |surname |age|gender\n",
"Alan |Abramson|1 |M \n",
"Barbara|Brown |2 |F \n",
"Chris |Cohen |3 |M "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = d - 'initials'; d"
]
},
{
"cell_type": "markdown",
"id": "velvet-collect",
"metadata": {},
"source": [
"### removing rows"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "significant-flavor",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[2 x 4]\n",
"age|gender|name |surname\n",
"2 |F |Barbara|Brown \n",
"3 |M |Chris |Cohen "
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.exc(name = 'Alan')"
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "clean-winner",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[2 x 4]\n",
"age|gender|name |surname \n",
"1 |M |Alan |Abramson\n",
"3 |M |Chris|Cohen "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.inc(name = ['Alan', 'Chris'])"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "expanded-phase",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[2 x 4]\n",
"age|gender|name |surname\n",
"2 |F |Barbara|Brown \n",
"3 |M |Chris |Cohen "
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.inc(lambda age: age>1)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "disabled-scroll",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[1 x 4]\n",
"name |surname|age|gender\n",
"Barbara|Brown |2 |F "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.exc(lambda gender: gender == 'M')"
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "likely-privacy",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[2 x 4]\n",
"age|gender|name |surname \n",
"1 |M |Alan |Abramson\n",
"3 |M |Chris|Cohen "
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.exc(lambda name, surname: len(name)>len(surname))"
]
},
{
"cell_type": "markdown",
"id": "annual-seafood",
"metadata": {},
"source": [
"### sort"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "pointed-fetish",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 4]\n",
"name |surname |age|gender\n",
"Alan |Abramson|1 |M \n",
"Barbara|Brown |2 |F \n",
"Chris |Cohen |3 |M "
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.sort('name', 'surname')"
]
},
{
"cell_type": "code",
"execution_count": 51,
"id": "impaired-power",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 4]\n",
"name |surname |age|gender\n",
"Barbara|Brown |2 |F \n",
"Alan |Abramson|1 |M \n",
"Chris |Cohen |3 |M "
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d.sort(lambda name: name[::-1]) # can sort on functions of variables too"
]
},
{
"cell_type": "markdown",
"id": "destroyed-prophet",
"metadata": {},
"source": [
"### listby(keys)\n",
"listby is like groupby except it returns a dictable with unique keys and the other columns are returned as a list. We find that MUCH more useful usually than groupby"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "binary-diversity",
"metadata": {},
"outputs": [],
"source": [
"grades = dictable(name = ['alan', 'barbara', 'chris'], grades = [30,90,80], subject = 'english', teacher = 'mr bennet') \\\n",
" + dictable(name = ['alan', 'david', 'esther'], grades = [40,50,70], subject = 'math', teacher = 'mrs ruler') \\\n",
" + dictable(name = ['barbara', 'chris', 'esther'], grades = [90,60,80], subject = 'french', teacher = 'dr francois')"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "understanding-cooper",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 4]\n",
"teacher |grades |name |subject \n",
"dr francois|[90, 60, 80]|['barbara', 'chris', 'esther']|['french', 'french', 'french'] \n",
"mr bennet |[30, 90, 80]|['alan', 'barbara', 'chris'] |['english', 'english', 'english']\n",
"mrs ruler |[40, 50, 70]|['alan', 'david', 'esther'] |['math', 'math', 'math'] "
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades.listby('teacher')"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "blessed-brief",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[3 x 5]\n",
"teacher |grades |name |subject |avg_grade \n",
"dr francois|[90, 60, 80]|['barbara', 'chris', 'esther']|['french', 'french', 'french'] |76.66666666666667 \n",
"mr bennet |[30, 90, 80]|['alan', 'barbara', 'chris'] |['english', 'english', 'english']|66.66666666666667 \n",
"mrs ruler |[40, 50, 70]|['alan', 'david', 'esther'] |['math', 'math', 'math'] |53.333333333333336"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades.listby('teacher')(avg_grade = lambda grades: np.mean(grades))"
]
},
{
"cell_type": "markdown",
"id": "steady-rebate",
"metadata": {},
"source": [
"### unlist\n",
"unlist undoes listby() assuming it is possible..."
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "worldwide-cathedral",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[9 x 4]\n",
"grades|name |subject|teacher \n",
"90 |barbara|french |dr francois\n",
"60 |chris |french |dr francois\n",
"80 |esther |french |dr francois\n",
"...9 rows...\n",
"40 |alan |math |mrs ruler \n",
"50 |david |math |mrs ruler \n",
"70 |esther |math |mrs ruler "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades.listby('teacher').unlist()"
]
},
{
"cell_type": "markdown",
"id": "useful-murray",
"metadata": {},
"source": [
"### groupby(keys) and ungroup\n",
"This is similar to DatFrame groupby except that instead of a new object, a dictable is returned: The name of the grouped column is given by 'grp'. ungroup allows us to get back to original."
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "surprised-mexico",
"metadata": {},
"outputs": [],
"source": [
"classes = grades.groupby(['teacher', 'subject'], grp = 'class')"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "compatible-pizza",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'teacher': 'dr francois',\n",
" 'subject': 'french',\n",
" 'class': dictable[3 x 2]\n",
" grades|name \n",
" 90 |barbara\n",
" 60 |chris \n",
" 80 |esther }"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classes[0]"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "acknowledged-roommate",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[9 x 4]\n",
"grades|name |subject|teacher \n",
"90 |barbara|french |dr francois\n",
"60 |chris |french |dr francois\n",
"80 |esther |french |dr francois\n",
"...9 rows...\n",
"40 |alan |math |mrs ruler \n",
"50 |david |math |mrs ruler \n",
"70 |esther |math |mrs ruler "
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"classes.ungroup('class')"
]
},
{
"cell_type": "markdown",
"id": "mineral-psychology",
"metadata": {},
"source": [
"### inner join\n",
"The multiplication operation is overloaded for the join method. By default, if two dictables share keys, the join is an inner join on the keys "
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "imported-enemy",
"metadata": {},
"outputs": [],
"source": [
"students = dictable(name = ['alan', 'barbara', 'chris', 'david', 'esthar', 'fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "australian-citation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shared keys: ['name']\n"
]
},
{
"data": {
"text/plain": [
"dictable[7 x 5]\n",
"name |grades|subject|teacher |surname \n",
"alan |30 |english|mr bennet |abramsom\n",
"alan |40 |math |mrs ruler |abramsom\n",
"barbara|90 |english|mr bennet |brown \n",
"...7 rows...\n",
"chris |80 |english|mr bennet |cohen \n",
"chris |60 |french |dr francois|cohen \n",
"david |50 |math |mrs ruler |drummond"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('shared keys:', grades.keys() & students.keys())\n",
"grades * students"
]
},
{
"cell_type": "markdown",
"id": "fitted-muslim",
"metadata": {},
"source": [
"Are there students with no surname? We can do a xor or use division which is overloaded for xor: "
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "about-norway",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[2 x 4]\n",
"grades|name |subject|teacher \n",
"70 |esther|math |mrs ruler \n",
"80 |esther|french |dr francois"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades / students"
]
},
{
"cell_type": "markdown",
"id": "intimate-lighting",
"metadata": {},
"source": [
"Are there students with no grades?"
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "cordless-peeing",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[2 x 2]\n",
"name |surname \n",
"esthar|ecklestone\n",
"fabian|fox "
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"students / grades"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "acting-founder",
"metadata": {},
"outputs": [],
"source": [
"students = dictable(name = ['Alan', 'Barbara', 'Chris', 'David', 'Esther', 'Fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])"
]
},
{
"cell_type": "markdown",
"id": "classified-launch",
"metadata": {},
"source": [
"We fixed Esther's spelling but introduced capitalization, that is OK, we are allowed to inner join on functions of keys too."
]
},
{
"cell_type": "code",
"execution_count": 64,
"id": "curious-orlando",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[9 x 5]\n",
"name |grades|subject|teacher |surname \n",
"alan |30 |english|mr bennet |abramsom \n",
"alan |40 |math |mrs ruler |abramsom \n",
"barbara|90 |english|mr bennet |brown \n",
"...9 rows...\n",
"david |50 |math |mrs ruler |drummond \n",
"esther |70 |math |mrs ruler |ecklestone\n",
"esther |80 |french |dr francois|ecklestone"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades.join(students, 'name', lambda name: name.lower())"
]
},
{
"cell_type": "code",
"execution_count": 65,
"id": "progressive-phoenix",
"metadata": {},
"outputs": [],
"source": [
"students = dictable(first_name = ['alan', 'barbara', 'chris', 'david', 'esther', 'fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])"
]
},
{
"cell_type": "markdown",
"id": "naked-trial",
"metadata": {},
"source": [
"You can inner join on different column names and both columns will be populated: "
]
},
{
"cell_type": "code",
"execution_count": 66,
"id": "defined-vocabulary",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[9 x 6]\n",
"name |grades|subject|teacher |first_name|surname \n",
"alan |30 |english|mr bennet |alan |abramsom \n",
"alan |40 |math |mrs ruler |alan |abramsom \n",
"barbara|90 |english|mr bennet |barbara |brown \n",
"...9 rows...\n",
"david |50 |math |mrs ruler |david |drummond \n",
"esther |70 |math |mrs ruler |esther |ecklestone\n",
"esther |80 |french |dr francois|esther |ecklestone"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grades.join(pd.DataFrame(students), 'name', 'first_name')"
]
},
{
"cell_type": "markdown",
"id": "pretty-cheese",
"metadata": {},
"source": [
"### inner join (with other columns that match names)\n",
"By default, if columns are shared but are not in the join, they will be returned with a tuple containing both values"
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "exterior-receptor",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[5 x 4]\n",
"key|y|z|x \n",
"a |4|2|(1, 4)\n",
"b |5|8|(2, 1)\n",
"b |5|9|(2, 2)\n",
"c |6|1|(3, 3)\n",
"c |7|1|(4, 3)"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = dictable(key = ['a', 'b', 'c', 'c'], x = [1,2,3,4], y = [4,5,6,7])\n",
"y = dictable(key = ['b', 'b', 'c', 'a'], x = [1,2,3,4], z = [8,9,1,2])\n",
"x.join(y, 'key', 'key') ## ignore x column for joining"
]
},
{
"cell_type": "code",
"execution_count": 68,
"id": "transsexual-highway",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[5 x 4]\n",
"key|y|z|x\n",
"a |4|2|1\n",
"b |5|8|2\n",
"b |5|9|2\n",
"c |6|1|3\n",
"c |7|1|4"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x.join(y, 'key', 'key', mode = 'left') ## grab left value"
]
},
{
"cell_type": "markdown",
"id": "incoming-victoria",
"metadata": {},
"source": [
"### cross join\n",
"If no columns are shared, then a cross join is returned. "
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "present-roads",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[12 x 2]\n",
"x|y\n",
"1|1\n",
"1|2\n",
"1|3\n",
"...12 rows...\n",
"4|1\n",
"4|2\n",
"4|3"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = dictable(x = [1,2,3,4])\n",
"y = dict(y = [1,2,3])\n",
"x * y"
]
},
{
"cell_type": "code",
"execution_count": 70,
"id": "apart-bullet",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[12 x 2]\n",
"x|y\n",
"1|1\n",
"1|2\n",
"1|3\n",
"...12 rows...\n",
"4|1\n",
"4|2\n",
"4|3"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x.join(y, [], []) ## you can force a full outer join"
]
},
{
"cell_type": "code",
"execution_count": 71,
"id": "favorite-master",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x / y == x"
]
},
{
"cell_type": "markdown",
"id": "killing-pillow",
"metadata": {},
"source": [
"### xor (versus left and right join)\n",
"We find left/right join actually not very useful. There is usually a genuine reason for records for which there is a match and for records for which there isn't. And the treatment of these is distinct, which means a left-join operation that joins the two outcomes together is positively harmful.\n",
"\n",
"The xor operator is much more useful and you can use it to recreate left/right join if we really must. Here is an example"
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "described-remedy",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[9 x 5]\n",
"grades|name |teacher |surname |subject\n",
"30 |alan |mr bennet |abramsom|english\n",
"40 |alan |mrs ruler |abramsom|math \n",
"90 |barbara|mr bennet |brown |english\n",
"...9 rows...\n",
"50 |david |mrs ruler |None |math \n",
"70 |esther |mrs ruler |None |math \n",
"80 |esther |dr francois|None |french "
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"students = dictable(name = ['alan', 'barbara', 'chris'], surname = ['abramsom', 'brown', 'cohen',])\n",
"new_students = dictable(name = ['david', 'esther', 'fabian'], surname = ['drummond', 'ecklestone', 'fox'])\n",
"\n",
"inner_join = grades * students ## grades with students\n",
"left_xor = grades / students ## grades without sudents\n",
"\n",
"# you can...\n",
"left_join = grades * students + grades / students ## grades for which no surname is available will have None surname\n",
"left_join"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "earned-playback",
"metadata": {},
"outputs": [],
"source": [
"# but really you want to do:\n",
"student_grades = grades * students\n",
"unmapped_grades = grades / students ## we treat this one separately...\n",
"new_student_grades = unmapped_grades * new_students ## and grab surnames from the new students table..."
]
},
{
"cell_type": "code",
"execution_count": 74,
"id": "norwegian-heating",
"metadata": {},
"outputs": [],
"source": [
"assert len(unmapped_grades / new_student_grades) == 0, 'students must exist either in the students table or in the new students table'"
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "peripheral-reggae",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[9 x 5]\n",
"grades|name |subject|surname |teacher \n",
"30 |alan |english|abramsom |mr bennet \n",
"40 |alan |math |abramsom |mrs ruler \n",
"90 |barbara|english|brown |mr bennet \n",
"...9 rows...\n",
"50 |david |math |drummond |mrs ruler \n",
"70 |esther |math |ecklestone|mrs ruler \n",
"80 |esther |french |ecklestone|dr francois"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_grades = student_grades + new_student_grades; all_grades"
]
},
{
"cell_type": "markdown",
"id": "serious-sierra",
"metadata": {},
"source": [
"### pivot"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "dynamic-programmer",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[16 x 2]\n",
"x|y\n",
"1|1\n",
"1|2\n",
"1|3\n",
"...16 rows...\n",
"4|2\n",
"4|3\n",
"4|4"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = dictable(x = [1,2,3,4])\n",
"y = dictable(y = [1,2,3,4])\n",
"xy = (x * y)\n",
"xy"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "actual-scholarship",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[4 x 5]\n",
"x|1 |2 |3 |4 \n",
"1|[1]|[2]|[3] |[4] \n",
"2|[2]|[4]|[6] |[8] \n",
"3|[3]|[6]|[9] |[12]\n",
"4|[4]|[8]|[12]|[16]"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xy.pivot('x', 'y', lambda x, y: x*y)"
]
},
{
"cell_type": "markdown",
"id": "plain-width",
"metadata": {},
"source": [
"### a few observations:"
]
},
{
"cell_type": "markdown",
"id": "prime-novel",
"metadata": {},
"source": [
"* as per usual, can provide a function for values in table (indeed columns y) and not just keys\n",
"* the output in the cells come back as a list. This is because sometimes there are more than one row with given x and y, and sometimes there are none:"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "opponent-inclusion",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[4 x 5]\n",
"x|1 |2 |3 |4 \n",
"1|[1, 1]|[2, 2]|[3, 3] |None \n",
"2|[2, 2]|[4, 4]|None |[8, 8] \n",
"3|[3, 3]|None |[9, 9] |[12, 12]\n",
"4|None |[8, 8]|[12, 12]|[16, 16]"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(xy + xy).exc(lambda x,y: x+y == 5).pivot('x', 'y', lambda x, y: x*y)"
]
},
{
"cell_type": "markdown",
"id": "intensive-crash",
"metadata": {},
"source": [
"You can apply a sequence of aggregate functions:"
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "expressed-wellington",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dictable[4 x 5]\n",
"x|1 |2 |3 |4 \n",
"1|2 |2 |2 |None\n",
"2|2 |2 |None|2 \n",
"3|2 |None|2 |2 \n",
"4|None|2 |2 |2 "
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(xy + xy).exc(lambda x,y: x+y == 5).pivot('x', 'y', lambda x, y: x*y, lambda v: len(v))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}