pyg.base.join¶
Only read this if you are a seasoned dictable user. In data science, we usually have data in multiple tables and we want to pull specific columns together for an analysis. We will first look at join function and then examine the perdictable decorator.
Join¶
Example: Using join function to transfer money to a bank¶
We begin by setting up a mini database:
[22]:
from pyg import *
customers = dictable(customer = ['alan', 'barbara', 'charles'], address = ['1 Abba Avenue', '2 Beatles Lane', '3 Corrs Close'], bank = ['allied', 'barclays', 'chase'])
products = dictable(product = ['apple', 'banana', 'cherry'], price = [1,2,3], supplier = ['grove limited', 'go banabas', 'cherry pickers'])
customer_products = dictable(customer = ['alan', 'alan', 'charles', 'charles'], product = ['apple', 'banana', 'cherry', 'apple'], amount = [1,2,3,4], purchase_date = drange(-2,1))
banks = dictable(bank = ['allied', 'barclays'], account = [5556, 2461])
print('Customers\n', customers, '\n\nProducts\n', products, '\n\nCustomer_products\n', customer_products, '\n\nBanks\n', banks)
Customers
customer|address |bank
alan |1 Abba Avenue |allied
barbara |2 Beatles Lane|barclays
charles |3 Corrs Close |chase
Products
product|price|supplier
apple |1 |grove limited
banana |2 |go banabas
cherry |3 |cherry pickers
Customer_products
customer|product|amount|purchase_date
alan |apple |1 |2021-02-23 00:00:00
alan |banana |2 |2021-02-24 00:00:00
charles |cherry |3 |2021-02-25 00:00:00
charles |apple |4 |2021-02-26 00:00:00
Banks
bank |account
allied |5556
barclays|2461
Simple join: inner join between tables¶
Suppose we want to know how much money is to be transferred from each bank. - We only care about the fields ‘bank’, ‘amount’ and ‘price’ - each field is pulled from different tables, - need to specify customer & product as the keys we will join on:
[23]:
join(dict(bank = customers, amount = customer_products, price = products), on = ['customer', 'product'])
[23]:
dictable[4 x 5]
product|customer|bank |amount|price
apple |alan |allied|1 |1
banana |alan |allied|2 |2
apple |charles |chase |4 |1
cherry |charles |chase |3 |3
Defaults for fields we want to left-join on…¶
The function we need to run to transfer money looks like this, so actually, we would like to have account details too.
[24]:
def transfer_money(bank, amount, price, account = 'default'):
## if account == 'default' transfer money slowly, else transfer quickly
## return
pass
We can grab the account details from the ‘banks’ table:
[25]:
join(dict(bank = customers, amount = customer_products, price = products, account = banks), on = ['customer', 'product', 'bank'])
[25]:
dictable[2 x 6]
bank |product|customer|amount|price|account
allied|apple |alan |1 |1 |5556
allied|banana |alan |2 |2 |5556
but we just lost Chase transactions as we dont have its account details. However, money is transfered perfectly (albeit slowly) even without account id. So instead….
[26]:
join(dict(bank = customers, amount = customer_products, price = products, account = banks),
on = ['customer', 'product', 'bank'],
defaults = dict(account = 'default'))
[26]:
dictable[4 x 6]
account|amount|bank |customer|price|product
5556 |1 |allied|alan |1 |apple
5556 |2 |allied|alan |2 |banana
default|4 |chase |charles |1 |apple
default|3 |chase |charles |3 |cherry
Renaming & calculating fields¶
We also want to ensure we don’t transfer money that we already transferred… so we need to grab an expiry column based on purchase_date in customer_product table
[27]:
join(dict(bank = customers, amount = customer_products, price = products, account = banks, expiry = customer_products),
on = ['customer', 'product', 'bank'],
renames = dict(expiry = lambda purchase_date: dt(purchase_date, '1b')), ## it takes 1 business day to transfer money
defaults = dict(account = 'default'))
[27]:
dictable[4 x 7]
account|amount|bank |customer|expiry |price|product
5556 |1 |allied|alan |2021-02-24 00:00:00|1 |apple
5556 |2 |allied|alan |2021-02-25 00:00:00|2 |banana
default|4 |chase |charles |2021-03-01 00:00:00|1 |apple
default|3 |chase |charles |2021-02-26 00:00:00|3 |cherry
Perdictable¶
perdictable takes the same operation one steps further and actually runs the function. We also use the function signature to determine the defaults parameter. Here is another example: ### Example: Oil prices In Finance, there are contracts called Futures, each Future contract has an expiry. E.g. Futures contracts for Oil are contracts agreeing the delivery of oil to a particular place in a particular month. Once that month is gone, that contract is no longer traded and the oil needs to be delivered.
[28]:
from pyg import *
oil = dictable(y = dt().year-1, m = range(3, 13, 3)) + dictable(y = dt().year, m = range(3, 13, 3))
oil = oil(ticker = lambda y, m: 'OIL_%i_%s'%(y, m if m>9 else '0%i'%m))
oil
[28]:
dictable[8 x 3]
m |y |ticker
3 |2020|OIL_2020_03
6 |2020|OIL_2020_06
9 |2020|OIL_2020_09
...8 rows...
6 |2021|OIL_2021_06
9 |2021|OIL_2021_09
12|2021|OIL_2021_12
y,m and ticker will form our primary keys
[29]:
pk = ['y', 'm', 'ticker']
expiry = perdictable(lambda y, m: dt(y,m+1,1), on = pk)(y = oil, m = oil)
expiry
[29]:
dictable[8 x 4]
y |m |ticker |data
2020|3 |OIL_2020_03|2020-04-01 00:00:00
2020|6 |OIL_2020_06|2020-07-01 00:00:00
2020|9 |OIL_2020_09|2020-10-01 00:00:00
...8 rows...
2021|6 |OIL_2021_06|2021-07-01 00:00:00
2021|9 |OIL_2021_09|2021-10-01 00:00:00
2021|12|OIL_2021_12|2022-01-01 00:00:00
[30]:
def fake_ts(ticker, expiry):
return 500 + pd.Series(np.random.normal(0,1,100), drange(dt_bump(expiry,-99), expiry)).cumsum()
To add a price for each of the futures, we first wrap fake_ts and then run it:
[31]:
price = perdictable(fake_ts, on = pk)(ticker = oil, expiry = expiry)
price
[31]:
dictable[8 x 4]
y |m |ticker |data
2020|3 |OIL_2020_03|2019-12-24 499.000139
| | |2019-12-25 500.904180
| | |2019-12-26 501.792007
| | |2019-12-27 502.410313
| | |2019-12-28 502.843697
2020|6 |OIL_2020_06|2020-03-24 500.575052
| | |2020-03-25 499.504860
| | |2020-03-26 500.558506
| | |2020-03-27 500.599754
| | |2020-03-28 500.704313
2020|9 |OIL_2020_09|2020-06-24 500.333677
| | |2020-06-25 500.974220
| | |2020-06-26 499.882500
| | |2020-06-27 500.342359
| | |2020-06-28 501.423622
...8 rows...
2021|6 |OIL_2021_06|2021-03-24 501.437903
| | |2021-03-25 500.808820
| | |2021-03-26 499.478861
| | |2021-03-27 499.203311
| | |2021-03-28 498.270609
2021|9 |OIL_2021_09|2021-06-24 499.950777
| | |2021-06-25 500.458993
| | |2021-06-26 498.564582
| | |2021-06-27 497.988147
| | |2021-06-28 498.193692
2021|12|OIL_2021_12|2021-09-24 500.320503
| | |2021-09-25 499.915132
| | |2021-09-26 498.200049
| | |2021-09-27 497.805575
| | |2021-09-28 497.879972
We have wrapped a function so that we get a price for each of these contracts. This allows us to move from operating on single timeseries, to run it on multiple rows from multiple tables
[32]:
rtn = perdictable(diff, on = pk)(a = price, expiry = expiry)
yesterday_price = perdictable(shift, on = pk)(a = price, expiry = expiry)
percentage_return = perdictable(div_, on = pk)(a = rtn, b = yesterday_price, expiry = expiry)
percentage_return
[32]:
dictable[8 x 4]
y |m |ticker |data
2020|3 |OIL_2020_03|2019-12-24 NaN
| | |2019-12-25 0.003816
| | |2019-12-26 0.001772
| | |2019-12-27 0.001232
| | |2019-12-28 0.000863
2020|6 |OIL_2020_06|2020-03-24 NaN
| | |2020-03-25 -0.002138
| | |2020-03-26 0.002109
| | |2020-03-27 0.000082
| | |2020-03-28 0.000209
2020|9 |OIL_2020_09|2020-06-24 NaN
| | |2020-06-25 0.001280
| | |2020-06-26 -0.002179
| | |2020-06-27 0.000920
| | |2020-06-28 0.002161
...8 rows...
2021|6 |OIL_2021_06|2021-03-24 NaN
| | |2021-03-25 -0.001255
| | |2021-03-26 -0.002656
| | |2021-03-27 -0.000552
| | |2021-03-28 -0.001868
2021|9 |OIL_2021_09|2021-06-24 NaN
| | |2021-06-25 0.001017
| | |2021-06-26 -0.003785
| | |2021-06-27 -0.001156
| | |2021-06-28 0.000413
2021|12|OIL_2021_12|2021-09-24 NaN
| | |2021-09-25 -0.000810
| | |2021-09-26 -0.003431
| | |2021-09-27 -0.000792
| | |2021-09-28 0.000149
perdictable and caching¶
This is nice but (a) what have we gained? and (b) why do we keep using expiry as a variable? The answer is to do with caching actually. If we rerun prices, we should get brand new data, since fake_ts just generates random prices… perdictable identifies rows that have been run and are now ‘expired’ It uses provided old data and does not recalculate. If either expiry or old values are not provided then it calculates everything.
[37]:
new_price = perdictable(fake_ts, on = pk)(ticker = oil, data = price, expiry = expiry)
(new_price.relabel(data = 'new') * price.relabel(data = 'old')).sort('y', 'm')
[37]:
dictable[8 x 5]
y |m |ticker |new |old
2020|3 |OIL_2020_03|2019-12-24 499.000139|2019-12-24 499.000139
| | |2019-12-25 500.904180|2019-12-25 500.904180
| | |2019-12-26 501.792007|2019-12-26 501.792007
| | |2019-12-27 502.410313|2019-12-27 502.410313
| | |2019-12-28 502.843697|2019-12-28 502.843697
2020|6 |OIL_2020_06|2020-03-24 500.575052|2020-03-24 500.575052
| | |2020-03-25 499.504860|2020-03-25 499.504860
| | |2020-03-26 500.558506|2020-03-26 500.558506
| | |2020-03-27 500.599754|2020-03-27 500.599754
| | |2020-03-28 500.704313|2020-03-28 500.704313
2020|9 |OIL_2020_09|2020-06-24 500.333677|2020-06-24 500.333677
| | |2020-06-25 500.974220|2020-06-25 500.974220
| | |2020-06-26 499.882500|2020-06-26 499.882500
| | |2020-06-27 500.342359|2020-06-27 500.342359
| | |2020-06-28 501.423622|2020-06-28 501.423622
...8 rows...
2021|6 |OIL_2021_06|2021-03-24 500.429724|2021-03-24 501.437903
| | |2021-03-25 501.537890|2021-03-25 500.808820
| | |2021-03-26 501.167511|2021-03-26 499.478861
| | |2021-03-27 502.611689|2021-03-27 499.203311
| | |2021-03-28 501.820261|2021-03-28 498.270609
2021|9 |OIL_2021_09|2021-06-24 499.911914|2021-06-24 499.950777
| | |2021-06-25 497.451472|2021-06-25 500.458993
| | |2021-06-26 498.190816|2021-06-26 498.564582
| | |2021-06-27 498.015362|2021-06-27 497.988147
| | |2021-06-28 497.224958|2021-06-28 498.193692
2021|12|OIL_2021_12|2021-09-24 498.129511|2021-09-24 500.320503
| | |2021-09-25 498.739546|2021-09-25 499.915132
| | |2021-09-26 499.321094|2021-09-26 498.200049
| | |2021-09-27 498.491587|2021-09-27 497.805575
| | |2021-09-28 497.057529|2021-09-28 497.879972
perdictable with the cell framework¶
We can run the function and use a cell to store the output…
[40]:
c = cell(perdictable(fake_ts, on = pk), ticker = oil, expiry = expiry)()
c.data
[40]:
dictable[8 x 4]
y |m |ticker |data
2020|3 |OIL_2020_03|2019-12-24 501.331417
| | |2019-12-25 500.332873
| | |2019-12-26 500.160526
| | |2019-12-27 496.688779
| | |2019-12-28 497.774215
2020|6 |OIL_2020_06|2020-03-24 500.899756
| | |2020-03-25 500.830490
| | |2020-03-26 501.829020
| | |2020-03-27 501.875464
| | |2020-03-28 503.241949
2020|9 |OIL_2020_09|2020-06-24 500.395880
| | |2020-06-25 500.311780
| | |2020-06-26 499.817331
| | |2020-06-27 499.780468
| | |2020-06-28 497.550235
...8 rows...
2021|6 |OIL_2021_06|2021-03-24 501.291426
| | |2021-03-25 499.592175
| | |2021-03-26 499.104934
| | |2021-03-27 497.698320
| | |2021-03-28 497.868177
2021|9 |OIL_2021_09|2021-06-24 499.978264
| | |2021-06-25 500.784927
| | |2021-06-26 501.212177
| | |2021-06-27 501.852472
| | |2021-06-28 502.035097
2021|12|OIL_2021_12|2021-09-24 500.282482
| | |2021-09-25 501.077309
| | |2021-09-26 501.005312
| | |2021-09-27 501.173168
| | |2021-09-28 502.098126
[43]:
recalculated_cell = c.go(1) ## force a recalculation
recalculated_cell.data
[43]:
dictable[8 x 4]
y |m |ticker |data
2020|3 |OIL_2020_03|2019-12-24 501.331417
| | |2019-12-25 500.332873
| | |2019-12-26 500.160526
| | |2019-12-27 496.688779
| | |2019-12-28 497.774215
2020|6 |OIL_2020_06|2020-03-24 500.899756
| | |2020-03-25 500.830490
| | |2020-03-26 501.829020
| | |2020-03-27 501.875464
| | |2020-03-28 503.241949
2020|9 |OIL_2020_09|2020-06-24 500.395880
| | |2020-06-25 500.311780
| | |2020-06-26 499.817331
| | |2020-06-27 499.780468
| | |2020-06-28 497.550235
...8 rows...
2021|6 |OIL_2021_06|2021-03-24 499.383712
| | |2021-03-25 498.812289
| | |2021-03-26 498.995159
| | |2021-03-27 499.504985
| | |2021-03-28 498.453581
2021|9 |OIL_2021_09|2021-06-24 499.310726
| | |2021-06-25 500.248325
| | |2021-06-26 500.458067
| | |2021-06-27 498.482094
| | |2021-06-28 499.704684
2021|12|OIL_2021_12|2021-09-24 501.359202
| | |2021-09-25 501.178162
| | |2021-09-26 501.846290
| | |2021-09-27 502.217393
| | |2021-09-28 500.996568
We observe that the cell, when recalculates, automatically caches the history and does not recalculate fake_ts. This is not magic. When a cell calculates its function, it provides the function with the variables it needs. Once calculated, it stores the output in data and will be able to provide data to the function next time, allowing it to avoid re-running expired calculations. Then cell will store the functions’s result back in the data key for later use and this is repeated.
perdictable API¶
Parameters on, renames and defaults parameters determine the way the data is joined. If defaults is missing, the defaults from the function are used:
[45]:
function = lambda price, quantity = 1: price * quantity
price = dictable(product = ['apple', 'banana', 'cherry'], price = [1,2,3])
quantity = dictable(product = ['apple', 'banana', 'damson'], quantity = [2,3,4])
perdictable(function, on = 'product')(price = price, quantity = quantity) ## cherry should appear with default quantity
[45]:
dictable[3 x 2]
product|data
apple |2
banana |6
cherry |3
If you want to see the full calculations and inputs to the function set include_inputs=True:
[46]:
perdictable(function, on = 'product', include_inputs = True)(price = price, quantity = quantity)
[46]:
dictable[3 x 5]
price|product|quantity|expiry|data
1 |apple |2 |None |2
2 |banana |3 |None |6
3 |cherry |1 |None |3
If you want output column to be not data, use col:
[47]:
perdictable(function, on = 'product', include_inputs = True, col = 'cost')(price = price, quantity = quantity)
[47]:
dictable[3 x 5]
price|product|quantity|expiry|cost
1 |apple |2 |None |2
2 |banana |3 |None |6
3 |cherry |1 |None |3
The if_none parameter determines how data is calculated for rows that have expired but their data is None:
[52]:
expiry = dictable(product = ['apple', 'banana', 'cherry'], expiry = [dt(-2), dt(-1), dt(1)])
previous_data = dictable(product = ['apple', 'banana', 'cherry'], data = [None, 'some value that will be kept', 'this value will be recalculated'])
perdictable(function, on = 'product', include_inputs = True, if_none = False)(price = price, quantity = quantity, expiry = expiry, data = previous_data)
[52]:
dictable[3 x 5]
expiry |price|product|quantity|data
2021-02-24 00:00:00|1 |apple |2 |None
2021-02-25 00:00:00|2 |banana |3 |some value that will be kept
2021-02-27 00:00:00|3 |cherry |1 |3
[53]:
perdictable(function, on = 'product', include_inputs = True, if_none = True)(price = price, quantity = quantity, expiry = expiry, data = previous_data)
[53]:
dictable[3 x 5]
expiry |price|product|quantity|data
2021-02-24 00:00:00|1 |apple |2 |2
2021-02-25 00:00:00|2 |banana |3 |some value that will be kept
2021-02-27 00:00:00|3 |cherry |1 |3
Some function want to receive historic data and they use it themselves. Parameter output_is_input controls this. For example: If your function is pulling historic prices from yahoo finance, you can use existing data to ask yahoo for only recent ones.
[54]:
def running_total_costs(price, quantity=1, data=0):
return data + price * quantity
previous_data = dictable(product = ['apple', 'banana', 'cherry', 'damson'], data = [10, 20, 30, 40])
perdictable(running_total_costs, on = 'product', include_inputs = True)(price = price, quantity = quantity, data = previous_data)
[54]:
dictable[3 x 5]
price|product|quantity|expiry|data
1 |apple |2 |None |12
2 |banana |3 |None |26
3 |cherry |1 |None |33
[57]:
## if you don't want existing data to be presented to the function:
perdictable(running_total_costs, on = 'product', include_inputs = True, output_is_input = False)(price = price, quantity = quantity, data = previous_data)
[57]:
dictable[3 x 5]
price|product|quantity|expiry|data
1 |apple |2 |None |2
2 |banana |3 |None |6
3 |cherry |1 |None |3
Conclusions¶
pyg.base.join allows us to create joined table with the variables we need. This is leveraged by perdictable so that the ‘atomic’ data we work with is not a single timeseries but a whole table of timeseries data indexed by some keys. We can use various perdictable parameters to control cache policy. All this is done with very little additional code, allowing us to manage quite a lot of data items with very little effort while managing caching expired items.