pyg.base.dictable

dictable is a table, a collection of iterable records. It is also a dict with each key’s value being a column. Why not use a pandas.DataFrame? pd.DataFrame leads a dual life:

  • by day an index-based optimized numpy array supporting e.g. timeseries analytics etc.

  • by night, a table with keys supporting filtering, aggregating, pivoting on keys as well as inner/outer joining on keys.

As a result, the pandas interface is somewhat cumbersome. Further, the DataFrame isn’t really designed for containing more complicated objects within it. Conversely, dictable only tries to do the latter and is designed precisely for holding entire research process in one place. You can think of dictable as ‘one level up’ on a DataFrame: a dictable will handle thousands of data frames within it with ease. Indeed, dictable should be thought of as an ‘organiser of research flow’ rather than as an array of primitives. In general, each row will contain some keys indexing the experiment, while some keys will contain complicated objects: a pd.DataFrame, a timeseries, yield_curves, machine-learning experiments etc. The interface is succinct and extremely intuitive, allowing the user to concentrate on logic of the calculations rather than boilerplate.

Motivation: dictable as an organiser of research flow

We start with a simple motivating example. Here is a typical workflow:

[3]:
from pyg import *; import pandas as pd; import numpy as np
import yfinance as yf
[4]:
symbols = ['MSFT', 'WMT', 'TSLA', 'AAPL', 'BAD_SYMBOL', 'C']
history = [yf.download(symbol) for symbol in symbols]
prices = [h['Adj Close'] for h in history]
rtns = [p.diff() for p in prices]
vols = [r.ewm(30).std() for r in rtns]
zscores = [r/v for r,v in zip(rtns, vols)]
zavgs = [z.mean() for z in zscores]
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed

1 Failed download:
- BAD_SYMBOL: No data found, symbol may be delisted
[*********************100%***********************]  1 of 1 completed
[5]:
zavgs
[5]:
[0.06238896915574035,
 0.045634555332148996,
 0.0676156301672513,
 0.053189575669227614,
 nan,
 0.027297252361386543]

At this point we ask ourselves: Why do we have a nan? Which ticker was it, and when did it go wrong?

[6]:
bad_symbols = [s for s, z in zip(symbols, zavgs) if np.isnan(z)]; bad_symbols
[6]:
['BAD_SYMBOL']

Great, how do we remove bad symbols from all our other variables?

[7]:
vols = [v for s, v in zip(symbols, vols) if s not in bad_symbols]

Now we can calculate some stuff with rtns and vols perhaps?

[8]:
ewmas = [r.ewm(n).mean()/v for r,v in zip(rtns, vols) for n in [10, 20, 30]]

Things went wrong and went wrong silently too:

  • We forgot to remove bad data from rtns as well as from vols so our zip function is zipping the wrong stocks together

  • It is nearly impossible to discover what item in the list belong to what n and what stock

If you ever dealt with real data, the mess described above must be familiar.

Same code, in dictable

[9]:
from pyg import *
import yfinance as yf
s = dictable(symbol = ['MSFT', 'WMT', 'TSLA', 'AAPL', 'BAD_SYMBOL', 'C'])
s = s(history = lambda symbol: yf.download(symbol))
s = s(price = lambda history: history['Adj Close'])
s = s(rtn = lambda price: price.diff())
s = s(vol = lambda rtn: rtn.ewm(30).std())
s = s(zscore = lambda rtn, vol: rtn/vol)
s = s(zavg = lambda zscore: zscore.mean())
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed

1 Failed download:
- BAD_SYMBOL: No data found, symbol may be delisted
[*********************100%***********************]  1 of 1 completed

dictable s contains all our data.

  • each row contains all the variables associated with a specific symbol

  • each column corresponds to a variable

  • adding a new variable is declarative and free of boiler-plate loop and zip

[10]:
s[['symbol', 'history', 'vol', 'zavg']]
[10]:
dictable[6 x 4]
symbol    |history                                           |vol                                        |zavg
MSFT      |                  Open        High         Low    |Date                                       |0.06238896915574035
          |Date                                              |1986-03-13         NaN                     |
          |1986-03-13    0.088542    0.101563    0.088542    |1986-03-14         NaN                     |
          |1986-03-14    0.097222    0.102431    0.097222    |1986-03-17    0.000779                     |
          |1986-03-17    0.100694    0.103299    0.100694    |1986-03-18    0.001997                     |
WMT       |                  Open        High         Low    |Date                                       |0.045634555332148996
          |Date                                              |1972-08-25         NaN                     |
          |1972-08-25    0.063477    0.064697    0.063477    |1972-08-28         NaN                     |
          |1972-08-28    0.064453    0.064941    0.064209    |1972-08-29    0.000198                     |
          |1972-08-29    0.063965    0.063965    0.063477    |1972-08-30    0.000215                     |
TSLA      |                  Open        High         Low    |Date                                       |0.0676156301672513
          |Date                                              |2010-06-29          NaN                    |
          |2010-06-29    3.800000    5.000000    3.508000    |2010-06-30          NaN                    |
          |2010-06-30    5.158000    6.084000    4.660000    |2010-07-01     0.255972                    |
          |2010-07-01    5.000000    5.184000    4.054000    |2010-07-02     0.274164                    |
AAPL      |                  Open        High         Low    |Date                                       |0.053189575669227614
          |Date                                              |1980-12-12         NaN                     |
          |1980-12-12    0.128348    0.128906    0.128348    |1980-12-15         NaN                     |
          |1980-12-15    0.122210    0.122210    0.121652    |1980-12-16    0.001242                     |
          |1980-12-16    0.113281    0.113281    0.112723    |1980-12-17    0.004938                     |
BAD_SYMBOL|Empty DataFrame                                   |Series([], Name: Adj Close, dtype: float64)|nan
          |Columns: [Open, High, Low, Close, Adj Close, Volum|                                           |
          |Index: []                                         |                                           |
C         |                 Open       High        Low      C|Date                                       |0.027297252361386543
          |Date                                              |1977-01-03         NaN                     |
          |1977-01-03  16.133125  16.236876  16.133125  16.23|1977-01-04         NaN                     |
          |1977-01-04  16.236876  16.288750  16.184999  16.28|1977-01-05    0.053720                     |
          |1977-01-05  16.288750  16.288750  16.133125  16.18|1977-01-06    0.043500                     |
[11]:
s.zavg
[11]:
[0.06238896915574035,
 0.045634555332148996,
 0.0676156301672513,
 0.053189575669227614,
 nan,
 0.027297252361386543]

Oh, no, we have a bad symbol, how do we remove it?

[12]:
s = s.exc(zavg = np.nan); s.zavg
[12]:
[0.06238896915574035,
 0.045634555332148996,
 0.0676156301672513,
 0.053189575669227614,
 0.027297252361386543]

Now if we want to calculate something per symbol and window…

We want to create a new table, now keyed on two values: symbol and window n, so we create a bigger table using cross product:

[13]:
sn = s * dict(n = [10,20,30]) ## each row is now unique per symbol and window n
[14]:
sn = sn(ewma = lambda rtn, n, vol: rtn.ewm(n).mean()/vol)

And here is Citibank’s three ewma…

[15]:
sn.inc(symbol = 'C')[['n', 'ewma']]
[15]:
dictable[3 x 2]
n |ewma
10|Date
  |1977-01-03         NaN
  |1977-01-04         NaN
  |1977-01-05   -0.269415
  |1977-01-06   -0.636750
20|Date
  |1977-01-03         NaN
  |1977-01-04         NaN
  |1977-01-05   -0.252990
  |1977-01-06   -0.610388
30|Date
  |1977-01-03         NaN
  |1977-01-04         NaN
  |1977-01-05   -0.247336
  |1977-01-06   -0.601208

Here is a pivot table of the average of each ewma per symbol and window… Note that again, we can access functions of variables and not just the existing keys in the dictable

[16]:
sn.pivot('symbol', 'n', lambda ewma: ewma.mean())
[16]:
dictable[5 x 4]
symbol|10                    |20                    |30
AAPL  |[0.05186739363086048] |[0.048216725636102103]|[0.044663941035992756]
C     |[0.027514106874753336]|[0.026447710541888325]|[0.02513228496004374]
MSFT  |[0.061144190254947106]|[0.058806269926441355]|[0.056647557919386575]
TSLA  |[0.05809192519536144] |[0.0513719449570986]  |[0.0461212842741293]
WMT   |[0.04534668336293025] |[0.04406462686679985] |[0.0426962176790935]

dictable functionality

construction

dictable is quite flexible on constuctions.

[17]:
d = dictable(a = [1,2,3,4], b = ['a', 'b', 'c', 'd']); d
[17]:
dictable[4 x 2]
a|b
1|a
2|b
3|c
4|d
[18]:
d = dictable(dict(a = [1,2,3,4], b = ['a', 'b', 'c', 'd']), symbol = ['MSFT', 'AAPL', 'APA', 'MMM'], exchange = 'NYSE'); d
[18]:
dictable[4 x 4]
symbol|exchange|a|b
MSFT  |NYSE    |1|a
AAPL  |NYSE    |2|b
APA   |NYSE    |3|c
MMM   |NYSE    |4|d
[19]:
df = pd.DataFrame(d) # can instantiate a DataFrame from a dictable with no code and vice versa...
[20]:
d = dictable(df); d
[20]:
dictable[4 x 4]
symbol|exchange|a|b
MSFT  |NYSE    |1|a
AAPL  |NYSE    |2|b
APA   |NYSE    |3|c
MMM   |NYSE    |4|d
[21]:
d = dictable([(1,3), (2,4), (3,5)], ['a', 'b']); d # construction from records as tuples
[21]:
dictable[3 x 2]
a|b
1|3
2|4
3|5
[22]:
d = dictable([dict(a = 1, b = 3), dict(a = 2, b = 4, d = 'new column'), dict(a = 3, b = 5, c = 'also here')]); d # construction from records as dicts, mismatching on keys is fine
[22]:
dictable[3 x 4]
b|c        |a|d
3|None     |1|None
4|None     |2|new column
5|also here|3|None
[23]:
d = dictable(read_csv('d:/dropbox/yoav/python/pyg/docs/constituents_csv.csv')); d = d[:6]; d
[23]:
dictable[6 x 3]
Symbol|Name               |Sector
MMM   |3M Company         |Industrials
AOS   |A.O. Smith Corp    |Industrials
ABT   |Abbott Laboratories|Health Care
ABBV  |AbbVie Inc.        |Health Care
ABMD  |ABIOMED Inc        |Health Care
ACN   |Accenture plc      |Information Technology

row access

[24]:
d[0] #returns a record
[24]:
{'Symbol': 'MMM', 'Name': '3M Company', 'Sector': 'Industrials'}
[25]:
d[:2] ## subset rows using slice
[25]:
dictable[2 x 3]
Symbol|Name           |Sector
MMM   |3M Company     |Industrials
AOS   |A.O. Smith Corp|Industrials
[26]:
for row in d: # iteration is by row
    print(row)
{'Symbol': 'MMM', 'Name': '3M Company', 'Sector': 'Industrials'}
{'Symbol': 'AOS', 'Name': 'A.O. Smith Corp', 'Sector': 'Industrials'}
{'Symbol': 'ABT', 'Name': 'Abbott Laboratories', 'Sector': 'Health Care'}
{'Symbol': 'ABBV', 'Name': 'AbbVie Inc.', 'Sector': 'Health Care'}
{'Symbol': 'ABMD', 'Name': 'ABIOMED Inc', 'Sector': 'Health Care'}
{'Symbol': 'ACN', 'Name': 'Accenture plc', 'Sector': 'Information Technology'}

column access

[27]:
d.Name
[27]:
['3M Company',
 'A.O. Smith Corp',
 'Abbott Laboratories',
 'AbbVie Inc.',
 'ABIOMED Inc',
 'Accenture plc']
[28]:
d['Name']
[28]:
['3M Company',
 'A.O. Smith Corp',
 'Abbott Laboratories',
 'AbbVie Inc.',
 'ABIOMED Inc',
 'Accenture plc']
[29]:
d['Name', 'Sector']
[29]:
[('3M Company', 'Industrials'),
 ('A.O. Smith Corp', 'Industrials'),
 ('Abbott Laboratories', 'Health Care'),
 ('AbbVie Inc.', 'Health Care'),
 ('ABIOMED Inc', 'Health Care'),
 ('Accenture plc', 'Information Technology')]
[30]:
d[['Name', 'Sector']]
[30]:
dictable[6 x 2]
Name               |Sector
3M Company         |Industrials
A.O. Smith Corp    |Industrials
Abbott Laboratories|Health Care
AbbVie Inc.        |Health Care
ABIOMED Inc        |Health Care
Accenture plc      |Information Technology

d is a dict so supports the usual keys(), values() and items():

[31]:
for key, column in d.items():
    print(key, ':', column)
Symbol : ['MMM', 'AOS', 'ABT', 'ABBV', 'ABMD', 'ACN']
Name : ['3M Company', 'A.O. Smith Corp', 'Abbott Laboratories', 'AbbVie Inc.', 'ABIOMED Inc', 'Accenture plc']
Sector : ['Industrials', 'Industrials', 'Health Care', 'Health Care', 'Health Care', 'Information Technology']

access via function of variables is also supported

[32]:
d[lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)]
[32]:
['MMM, Industrials',
 'AOS, Industrials',
 'ABT, Health Care',
 'ABBV, Health Care',
 'ABMD, Health Care',
 'ACN, Information Technology']

column and row access are commutative

[33]:
assert d[0].Name == d.Name[0] == '3M Company'
assert d[0][lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)] == d[lambda Symbol, Sector: '%s, %s'%(Symbol, Sector)][0] == 'MMM, Industrials'
assert d[0]['Name'] == d['Name'][0]
assert d[:2]['Name', 'Sector'] == d['Name', 'Sector'][:2]
assert d[:2][['Name', 'Sector']] == d[['Name', 'Sector']][:2]

adding records

[34]:
d = dictable(name = ['alan', 'barbara', 'chris'], surname = ['abramson', 'brown', 'cohen'], age = [1,2,3])
[35]:
d + dict(name = 'david', surname = 'donaldson', age = 4) ## adding a single record
[35]:
dictable[4 x 3]
age|name   |surname
1  |alan   |abramson
2  |barbara|brown
3  |chris  |cohen
4  |david  |donaldson
[36]:
d + [dict(name = 'david', surname = 'donaldson', age = 4), dict(name = 'evan', surname = 'emmerson', age = 5)]
[36]:
dictable[5 x 3]
age|name   |surname
1  |alan   |abramson
2  |barbara|brown
3  |chris  |cohen
4  |david  |donaldson
5  |evan   |emmerson
[37]:
d + dict(name = ['david', 'evan'], surname = ['donaldson', 'emmerson'], age = [4,5])
[37]:
dictable[5 x 3]
age|name   |surname
1  |alan   |abramson
2  |barbara|brown
3  |chris  |cohen
4  |david  |donaldson
5  |evan   |emmerson
[38]:
d + pd.DataFrame(dict(name = ['david', 'evan'], surname = ['donaldson', 'emmerson'], age = [4,5]))
[38]:
dictable[5 x 3]
age|name   |surname
1  |alan   |abramson
2  |barbara|brown
3  |chris  |cohen
4  |david  |donaldson
5  |evan   |emmerson

adding/modifying columns

You can add a column or a constant by simply calling the dictable with the values:

[39]:
d(gender = ['m', 'f', 'm'])(school = 'St Paul')
[39]:
dictable[3 x 5]
name   |surname |age|gender|school
alan   |abramson|1  |m     |St Paul
barbara|brown   |2  |f     |St Paul
chris  |cohen   |3  |m     |St Paul

More interestingly, it can be a callable function using the other variables…

[40]:
d = d(initials = lambda name, surname: name[0] + surname[0]); d
[40]:
dictable[3 x 4]
name   |surname |age|initials
alan   |abramson|1  |aa
barbara|brown   |2  |bb
chris  |cohen   |3  |cc

Given d is a dict, a more traditional way of setting a new key is by simple assignment:

[41]:
d['initials'] = d[lambda name, surname: name[0] + surname[0]]; d
[41]:
dictable[3 x 4]
name   |surname |age|initials
alan   |abramson|1  |aa
barbara|brown   |2  |bb
chris  |cohen   |3  |cc

Or you can use the dict.update method:

[42]:
d.update(dict(gender = ['m', 'f', 'm'])); d
[42]:
dictable[3 x 5]
name   |surname |age|initials|gender
alan   |abramson|1  |aa      |m
barbara|brown   |2  |bb      |f
chris  |cohen   |3  |cc      |m

do

Sometime we want to apply the same function(s) to a collection of columns. For this, ‘do’ will do nicely:

[43]:
d = d.do(upper, 'initials', 'gender').do(proper, 'name', 'surname'); d
[43]:
dictable[3 x 5]
name   |surname |age|initials|gender
Alan   |Abramson|1  |AA      |M
Barbara|Brown   |2  |BB      |F
Chris  |Cohen   |3  |CC      |M

removing columns

[44]:
d = d - 'initials'; d
[44]:
dictable[3 x 4]
name   |surname |age|gender
Alan   |Abramson|1  |M
Barbara|Brown   |2  |F
Chris  |Cohen   |3  |M

removing rows

[45]:
d.exc(name = 'Alan')
[45]:
dictable[2 x 4]
age|gender|name   |surname
2  |F     |Barbara|Brown
3  |M     |Chris  |Cohen
[46]:
d.inc(name = ['Alan', 'Chris'])
[46]:
dictable[2 x 4]
age|gender|name |surname
1  |M     |Alan |Abramson
3  |M     |Chris|Cohen
[47]:
d.inc(lambda age: age>1)
[47]:
dictable[2 x 4]
age|gender|name   |surname
2  |F     |Barbara|Brown
3  |M     |Chris  |Cohen
[48]:
d.exc(lambda gender: gender == 'M')
[48]:
dictable[1 x 4]
name   |surname|age|gender
Barbara|Brown  |2  |F
[49]:
d.exc(lambda name, surname: len(name)>len(surname))
[49]:
dictable[2 x 4]
age|gender|name |surname
1  |M     |Alan |Abramson
3  |M     |Chris|Cohen

sort

[50]:
d.sort('name', 'surname')
[50]:
dictable[3 x 4]
name   |surname |age|gender
Alan   |Abramson|1  |M
Barbara|Brown   |2  |F
Chris  |Cohen   |3  |M
[51]:
d.sort(lambda name: name[::-1]) # can sort on functions of variables too
[51]:
dictable[3 x 4]
name   |surname |age|gender
Barbara|Brown   |2  |F
Alan   |Abramson|1  |M
Chris  |Cohen   |3  |M

listby(keys)

listby is like groupby except it returns a dictable with unique keys and the other columns are returned as a list. We find that MUCH more useful usually than groupby

[52]:
grades = dictable(name = ['alan', 'barbara', 'chris'], grades = [30,90,80], subject = 'english', teacher = 'mr bennet') \
       + dictable(name = ['alan', 'david', 'esther'], grades = [40,50,70], subject = 'math', teacher = 'mrs ruler') \
       + dictable(name = ['barbara', 'chris', 'esther'], grades = [90,60,80], subject = 'french', teacher = 'dr francois')
[53]:
grades.listby('teacher')
[53]:
dictable[3 x 4]
teacher    |grades      |name                          |subject
dr francois|[90, 60, 80]|['barbara', 'chris', 'esther']|['french', 'french', 'french']
mr bennet  |[30, 90, 80]|['alan', 'barbara', 'chris']  |['english', 'english', 'english']
mrs ruler  |[40, 50, 70]|['alan', 'david', 'esther']   |['math', 'math', 'math']
[54]:
grades.listby('teacher')(avg_grade = lambda grades: np.mean(grades))
[54]:
dictable[3 x 5]
teacher    |grades      |name                          |subject                          |avg_grade
dr francois|[90, 60, 80]|['barbara', 'chris', 'esther']|['french', 'french', 'french']   |76.66666666666667
mr bennet  |[30, 90, 80]|['alan', 'barbara', 'chris']  |['english', 'english', 'english']|66.66666666666667
mrs ruler  |[40, 50, 70]|['alan', 'david', 'esther']   |['math', 'math', 'math']         |53.333333333333336

unlist

unlist undoes listby() assuming it is possible…

[55]:
grades.listby('teacher').unlist()
[55]:
dictable[9 x 4]
grades|name   |subject|teacher
90    |barbara|french |dr francois
60    |chris  |french |dr francois
80    |esther |french |dr francois
...9 rows...
40    |alan   |math   |mrs ruler
50    |david  |math   |mrs ruler
70    |esther |math   |mrs ruler

groupby(keys) and ungroup

This is similar to DatFrame groupby except that instead of a new object, a dictable is returned: The name of the grouped column is given by ‘grp’. ungroup allows us to get back to original.

[56]:
classes = grades.groupby(['teacher', 'subject'], grp = 'class')
[57]:
classes[0]
[57]:
{'teacher': 'dr francois',
 'subject': 'french',
 'class': dictable[3 x 2]
 grades|name
 90    |barbara
 60    |chris
 80    |esther }
[58]:
classes.ungroup('class')
[58]:
dictable[9 x 4]
grades|name   |subject|teacher
90    |barbara|french |dr francois
60    |chris  |french |dr francois
80    |esther |french |dr francois
...9 rows...
40    |alan   |math   |mrs ruler
50    |david  |math   |mrs ruler
70    |esther |math   |mrs ruler

inner join

The multiplication operation is overloaded for the join method. By default, if two dictables share keys, the join is an inner join on the keys

[59]:
students = dictable(name = ['alan', 'barbara', 'chris', 'david', 'esthar', 'fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])
[60]:
print('shared keys:', grades.keys() & students.keys())
grades * students
shared keys: ['name']
[60]:
dictable[7 x 5]
name   |grades|subject|teacher    |surname
alan   |30    |english|mr bennet  |abramsom
alan   |40    |math   |mrs ruler  |abramsom
barbara|90    |english|mr bennet  |brown
...7 rows...
chris  |80    |english|mr bennet  |cohen
chris  |60    |french |dr francois|cohen
david  |50    |math   |mrs ruler  |drummond

Are there students with no surname? We can do a xor or use division which is overloaded for xor:

[61]:
grades / students
[61]:
dictable[2 x 4]
grades|name  |subject|teacher
70    |esther|math   |mrs ruler
80    |esther|french |dr francois

Are there students with no grades?

[62]:
students / grades
[62]:
dictable[2 x 2]
name  |surname
esthar|ecklestone
fabian|fox
[63]:
students = dictable(name = ['Alan', 'Barbara', 'Chris', 'David', 'Esther', 'Fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])

We fixed Esther’s spelling but introduced capitalization, that is OK, we are allowed to inner join on functions of keys too.

[64]:
grades.join(students, 'name', lambda name: name.lower())
[64]:
dictable[9 x 5]
name   |grades|subject|teacher    |surname
alan   |30    |english|mr bennet  |abramsom
alan   |40    |math   |mrs ruler  |abramsom
barbara|90    |english|mr bennet  |brown
...9 rows...
david  |50    |math   |mrs ruler  |drummond
esther |70    |math   |mrs ruler  |ecklestone
esther |80    |french |dr francois|ecklestone
[65]:
students = dictable(first_name = ['alan', 'barbara', 'chris', 'david', 'esther', 'fabian'], surname = ['abramsom', 'brown', 'cohen', 'drummond', 'ecklestone', 'fox'])

You can inner join on different column names and both columns will be populated:

[66]:
grades.join(pd.DataFrame(students), 'name', 'first_name')
[66]:
dictable[9 x 6]
name   |grades|subject|teacher    |first_name|surname
alan   |30    |english|mr bennet  |alan      |abramsom
alan   |40    |math   |mrs ruler  |alan      |abramsom
barbara|90    |english|mr bennet  |barbara   |brown
...9 rows...
david  |50    |math   |mrs ruler  |david     |drummond
esther |70    |math   |mrs ruler  |esther    |ecklestone
esther |80    |french |dr francois|esther    |ecklestone

inner join (with other columns that match names)

By default, if columns are shared but are not in the join, they will be returned with a tuple containing both values

[67]:
x = dictable(key = ['a', 'b', 'c', 'c'], x = [1,2,3,4], y = [4,5,6,7])
y = dictable(key = ['b', 'b', 'c', 'a'], x = [1,2,3,4], z = [8,9,1,2])
x.join(y, 'key', 'key') ## ignore x column for joining
[67]:
dictable[5 x 4]
key|y|z|x
a  |4|2|(1, 4)
b  |5|8|(2, 1)
b  |5|9|(2, 2)
c  |6|1|(3, 3)
c  |7|1|(4, 3)
[68]:
x.join(y, 'key', 'key', mode = 'left') ## grab left value
[68]:
dictable[5 x 4]
key|y|z|x
a  |4|2|1
b  |5|8|2
b  |5|9|2
c  |6|1|3
c  |7|1|4

cross join

If no columns are shared, then a cross join is returned.

[69]:
x = dictable(x = [1,2,3,4])
y = dict(y = [1,2,3])
x * y
[69]:
dictable[12 x 2]
x|y
1|1
1|2
1|3
...12 rows...
4|1
4|2
4|3
[70]:
x.join(y, [], []) ## you can force a full outer join
[70]:
dictable[12 x 2]
x|y
1|1
1|2
1|3
...12 rows...
4|1
4|2
4|3
[71]:
x / y == x
[71]:
True

xor (versus left and right join)

We find left/right join actually not very useful. There is usually a genuine reason for records for which there is a match and for records for which there isn’t. And the treatment of these is distinct, which means a left-join operation that joins the two outcomes together is positively harmful.

The xor operator is much more useful and you can use it to recreate left/right join if we really must. Here is an example

[80]:
students = dictable(name = ['alan', 'barbara', 'chris'], surname = ['abramsom', 'brown', 'cohen',])
new_students = dictable(name = ['david', 'esther', 'fabian'], surname = ['drummond', 'ecklestone', 'fox'])

inner_join = grades * students ## grades with students
left_xor = grades / students  ## grades without sudents

# you can...
left_join = grades * students + grades / students ## grades for which no surname is available will have None surname
left_join
[80]:
dictable[9 x 5]
grades|name   |teacher    |surname |subject
30    |alan   |mr bennet  |abramsom|english
40    |alan   |mrs ruler  |abramsom|math
90    |barbara|mr bennet  |brown   |english
...9 rows...
50    |david  |mrs ruler  |None    |math
70    |esther |mrs ruler  |None    |math
80    |esther |dr francois|None    |french
[73]:
# but really you want to do:
student_grades = grades * students
unmapped_grades = grades / students ## we treat this one separately...
new_student_grades =  unmapped_grades * new_students ## and grab surnames from the new students table...
[74]:
assert len(unmapped_grades / new_student_grades) == 0, 'students must exist either in the students table or in the new students table'
[75]:
all_grades = student_grades + new_student_grades; all_grades
[75]:
dictable[9 x 5]
grades|name   |subject|surname   |teacher
30    |alan   |english|abramsom  |mr bennet
40    |alan   |math   |abramsom  |mrs ruler
90    |barbara|english|brown     |mr bennet
...9 rows...
50    |david  |math   |drummond  |mrs ruler
70    |esther |math   |ecklestone|mrs ruler
80    |esther |french |ecklestone|dr francois

pivot

[76]:
x = dictable(x = [1,2,3,4])
y = dictable(y = [1,2,3,4])
xy = (x * y)
xy
[76]:
dictable[16 x 2]
x|y
1|1
1|2
1|3
...16 rows...
4|2
4|3
4|4
[77]:
xy.pivot('x', 'y', lambda x, y: x*y)
[77]:
dictable[4 x 5]
x|1  |2  |3   |4
1|[1]|[2]|[3] |[4]
2|[2]|[4]|[6] |[8]
3|[3]|[6]|[9] |[12]
4|[4]|[8]|[12]|[16]

a few observations:

  • as per usual, can provide a function for values in table (indeed columns y) and not just keys

  • the output in the cells come back as a list. This is because sometimes there are more than one row with given x and y, and sometimes there are none:

[78]:
(xy + xy).exc(lambda x,y: x+y == 5).pivot('x', 'y', lambda x, y: x*y)
[78]:
dictable[4 x 5]
x|1     |2     |3       |4
1|[1, 1]|[2, 2]|[3, 3]  |None
2|[2, 2]|[4, 4]|None    |[8, 8]
3|[3, 3]|None  |[9, 9]  |[12, 12]
4|None  |[8, 8]|[12, 12]|[16, 16]

You can apply a sequence of aggregate functions:

[79]:
(xy + xy).exc(lambda x,y: x+y == 5).pivot('x', 'y', lambda x, y: x*y, lambda v: len(v))
[79]:
dictable[4 x 5]
x|1   |2   |3   |4
1|2   |2   |2   |None
2|2   |2   |None|2
3|2   |None|2   |2
4|None|2   |2   |2