pyg.timeseries¶
Given pandas, why do we need this timeseries library? pandas is amazing, but pyg.timeseries offers a few features designed to enhance it.
There are three issues with pandas that pyg.timeseries tries to address:
pandas works on pandas objects (obviously) but not on numpy arrays.
pandas handles nan within a timeseries inconsistently across its functions. This makes your results sensitive to reindexing/resampling (a short illustration follows this list). E.g.:
a.expanding() and a.ewm() ignore nans in the calculation and then forward-fill the result.
a.diff() and a.rolling() include any nans in the calculation, leading to nan propagation.
pandas is great if you have the full timeseries. However, if you now want to run the same calculations in a live environment, on recent data, you have to append the recent data to the end of the DataFrame and rerun the whole history.
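A minimal illustration of the nan inconsistency, using only pandas (the toy series below is made up for the example):

import pandas as pd; import numpy as np
x = pd.Series([1.0, np.nan, 2.0])
x.expanding().sum()   # 1.0, 1.0, 3.0 -> the nan is skipped and the result is forward-filled
x.ewm(2).mean()       # 1.0, 1.0, ... -> same: the nan is ignored and the result forward-filled
x.diff()              # nan, nan, nan -> the nan propagates into both neighbouring differences
x.rolling(2).sum()    # nan, nan, nan -> any window containing a nan returns nan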
pyg.timeseries tries to address this:
pyg.timeseries agrees with pandas 100% (if there are no nans in the DataFrame) while being of comparable speed.
pyg.timeseries works seamlessly on pandas objects and on numpy arrays, with no code change.
pyg.timeseries handles nan consistently across all its functions, 'ignoring' all nans, making your results consistent regardless of reindexing/resampling.
pyg.timeseries exposes the internal state of each calculation. Exposing this state allows us to calculate the output on additional data without re-running the history. This speeds up two very common problems in finance:
risk calculations and Monte Carlo scenarios: we can run a trading strategy up to today, then generate multiple what-if scenarios without having to rerun the full history.
live versus history: pandas is designed to run a full historical simulation. However, once we reach 'today', speed is of the essence, and running a full historical simulation every time we ingest a new price is just too slow. That is why most fast trading is built around fast state machines. Of course, making sure the research and live versions do the same thing is tricky. pyg gives you the ability to run two systems in parallel with almost the same code base: run the full history overnight, then run today's code instantly, instantiated with the output of the historical simulation.
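As a short preview of the pattern developed in the 'Using pyg.timeseries to manage state' section below (both ewma_ and the state= keyword are demonstrated there in full):

from pyg import *; import pandas as pd; import numpy as np
history = pd.Series(np.random.normal(0,1,1000), drange(-1000,-1))   # the long back-history
live = pd.Series(np.random.normal(0,1,10), drange(9))               # today's new data
history_signal = ewma_(history, 10)                                 # overnight run: returns dict(data = ..., state = ...)
live_signal = ewma(live, 10, state = history_signal.state)          # live run: resumed from the saved state, no history needed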
Agreement between pyg.timeseries and pandas¶
[1]:
from pyg import *; import pandas as pd; import numpy as np
s = pd.Series(np.random.normal(0,1,10000), drange(-9999)); a = s.values
t = pd.Series(np.random.normal(0,1,10000), drange(-9999))
[2]:
assert abs(s.count() - ts_count(s))< 1e-10
assert abs(s.mean() - ts_mean(s)) < 1e-10
assert abs(s.sum() - ts_sum(s)) < 1e-10
assert abs(s.std() - ts_std(s)) < 1e-10
assert abs(s.skew() - ts_skew(s)) < 1e-10
[3]:
assert abs(ewma(s, 10) - s.ewm(10).mean()).max() < 1e-10
assert abs(ewmstd(s, 10) - s.ewm(10).std()).max() < 1e-10
assert abs(ewmvar(s, 10) - s.ewm(10).var()).max() < 1e-10
assert abs(ewmcor(s, t, 10) - s.ewm(10).corr(t)).max() < 1e-10
[4]:
assert abs(expanding_sum(s) - s.expanding().sum()).max() < 1e-10
assert abs(expanding_mean(s) - s.expanding().mean()).max() < 1e-10
assert abs(expanding_std(s) - s.expanding().std()).max() < 1e-10
assert abs(expanding_skew(s) - s.expanding().skew()).max() < 1e-10
assert abs(expanding_min(s) - s.expanding().min()).max() < 1e-10
assert abs(expanding_max(s) - s.expanding().max()).max() < 1e-10
assert abs(expanding_median(s) - s.expanding().median()).max() < 1e-10
[5]:
assert abs(rolling_sum(s,10) - s.rolling(10).sum()).max() < 1e-10
assert abs(rolling_mean(s,10) - s.rolling(10).mean()).max() < 1e-10
assert abs(rolling_std(s,10) - s.rolling(10).std()).max() < 1e-10
assert abs(rolling_skew(s,10) - s.rolling(10).skew()).max() < 1e-10
assert abs(rolling_min(s,10) - s.rolling(10).min()).max() < 1e-10
assert abs(rolling_max(s,10) - s.rolling(10).max()).max() < 1e-10
assert abs(rolling_median(s,10) - s.rolling(10).median()).max() < 1e-10
assert abs(rolling_quantile(s,10,0.3)[0.3] - s.rolling(10).quantile(0.3)).max() < 1e-10 ## The rolling_quantile returns the quantile as the header, since it supports multiple quantiles calculations: e.g. rolling_quantile(s,10,[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])
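For reference, a minimal sketch of the multi-quantile form mentioned in the comment above, assuming (as the single-quantile call suggests) that each requested quantile becomes a column header:

q = rolling_quantile(s, 10, [0.25, 0.75])   # one column per requested quantile
assert abs(q[0.25] - s.rolling(10).quantile(0.25)).max() < 1e-10
assert abs(q[0.75] - s.rolling(10).quantile(0.75)).max() < 1e-10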
Quick performance comparison¶
pyg, when run on pandas DataFrames rather than numpy arrays, is of comparable speed to pandas:
[6]:
compare = dictable(op = ['rolling_sum', 'rolling_mean', 'rolling_std', 'rolling_min', 'rolling_median'],
pyg = [rolling_sum, rolling_mean, rolling_std, rolling_min, rolling_median],
pandas = [s.rolling(10).sum, s.rolling(10).mean, s.rolling(10).std, s.rolling(10).min, s.rolling(10).median]).do(lambda v: timer(v, n = 100, time = True), 'pyg', 'pandas')(pyg = lambda pyg: pyg(s, 10))(pandas = lambda pandas: pandas())
compare += dictable(op = ['expanding_sum', 'expanding_mean', 'expanding_std', 'expanding_min', 'expanding_median'],
pyg = [expanding_sum, expanding_mean, expanding_std, expanding_min, expanding_median],
pandas = [s.expanding().sum, s.expanding().mean, s.expanding().std, s.expanding().min, s.expanding().median]).do(lambda v: timer(v, n = 100, time = True), 'pyg', 'pandas')(pyg = lambda pyg: pyg(s))(pandas = lambda pandas: pandas())
compare += dictable(op = ['ewma', 'ewmstd', 'ewmvar'],
pyg = [ewma, ewmstd, ewmvar],
pandas = [s.ewm(10).mean, s.ewm(10).std, s.ewm(10).var]).do(lambda v: timer(v, n = 100, time = True), 'pyg', 'pandas')(pyg = lambda pyg: pyg(s, 10))(pandas = lambda pandas: pandas())
print(compare(winner = lambda pyg, pandas: 'pyg' if pyg<pandas * 0.8 else 'pandas' if pyg > 1.2 * pandas else 'draw'))
2021-03-06 22:41:35,805 - pyg - INFO - TIMER:'rolling_sum' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.109934 sec
2021-03-06 22:41:35,898 - pyg - INFO - TIMER:'rolling_mean' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.087950 sec
2021-03-06 22:41:35,995 - pyg - INFO - TIMER:'rolling_std' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.090944 sec
2021-03-06 22:41:36,116 - pyg - INFO - TIMER:'rolling_min' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.119930 sec
2021-03-06 22:41:36,269 - pyg - INFO - TIMER:'rolling_median' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.153483 sec
2021-03-06 22:41:36,351 - pyg - INFO - TIMER:'sum' args:[[], []] (100 runs) took 0:00:00.079685 sec
2021-03-06 22:41:36,465 - pyg - INFO - TIMER:'mean' args:[[], []] (100 runs) took 0:00:00.112599 sec
2021-03-06 22:41:36,585 - pyg - INFO - TIMER:'std' args:[[], []] (100 runs) took 0:00:00.119877 sec
2021-03-06 22:41:36,687 - pyg - INFO - TIMER:'min' args:[[], []] (100 runs) took 0:00:00.100942 sec
2021-03-06 22:41:37,378 - pyg - INFO - TIMER:'median' args:[[], []] (100 runs) took 0:00:00.688107 sec
2021-03-06 22:41:37,467 - pyg - INFO - TIMER:'expanding_sum' args:[["<class 'pandas.core.series.Series'>[10000]"], []] (100 runs) took 0:00:00.086951 sec
2021-03-06 22:41:37,528 - pyg - INFO - TIMER:'expanding_mean' args:[["<class 'pandas.core.series.Series'>[10000]"], []] (100 runs) took 0:00:00.059966 sec
2021-03-06 22:41:37,651 - pyg - INFO - TIMER:'expanding_std' args:[["<class 'pandas.core.series.Series'>[10000]"], []] (100 runs) took 0:00:00.120938 sec
2021-03-06 22:41:37,695 - pyg - INFO - TIMER:'expanding_min' args:[["<class 'pandas.core.series.Series'>[10000]"], []] (100 runs) took 0:00:00.040991 sec
2021-03-06 22:41:37,903 - pyg - INFO - TIMER:'expanding_median' args:[["<class 'pandas.core.series.Series'>[10000]"], []] (100 runs) took 0:00:00.206892 sec
2021-03-06 22:41:37,939 - pyg - INFO - TIMER:'sum' args:[[], []] (100 runs) took 0:00:00.033967 sec
2021-03-06 22:41:38,002 - pyg - INFO - TIMER:'mean' args:[[], []] (100 runs) took 0:00:00.059981 sec
2021-03-06 22:41:38,075 - pyg - INFO - TIMER:'std' args:[[], []] (100 runs) took 0:00:00.071959 sec
2021-03-06 22:41:38,245 - pyg - INFO - TIMER:'min' args:[[], []] (100 runs) took 0:00:00.168553 sec
2021-03-06 22:41:39,523 - pyg - INFO - TIMER:'median' args:[[], []] (100 runs) took 0:00:01.277246 sec
2021-03-06 22:41:39,620 - pyg - INFO - TIMER:'ewma' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.094945 sec
2021-03-06 22:41:39,732 - pyg - INFO - TIMER:'ewmstd' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.110924 sec
2021-03-06 22:41:39,827 - pyg - INFO - TIMER:'ewmvar' args:[["<class 'pandas.core.series.Series'>[10000]", '10'], []] (100 runs) took 0:00:00.093965 sec
2021-03-06 22:41:39,855 - pyg - INFO - TIMER:'mean' args:[[], []] (100 runs) took 0:00:00.026971 sec
2021-03-06 22:41:39,953 - pyg - INFO - TIMER:'std' args:[[], []] (100 runs) took 0:00:00.096954 sec
2021-03-06 22:41:39,995 - pyg - INFO - TIMER:'var' args:[[], []] (100 runs) took 0:00:00.039983 sec
op |pandas |pyg |winner
rolling_sum |0:00:00.079685|0:00:00.109934|pandas
rolling_mean |0:00:00.112599|0:00:00.087950|pyg
rolling_std |0:00:00.119877|0:00:00.090944|pyg
rolling_min |0:00:00.100942|0:00:00.119930|draw
rolling_median |0:00:00.688107|0:00:00.153483|pyg
expanding_sum |0:00:00.033967|0:00:00.086951|pandas
expanding_mean |0:00:00.059981|0:00:00.059966|draw
expanding_std |0:00:00.071959|0:00:00.120938|pandas
expanding_min |0:00:00.168553|0:00:00.040991|pyg
expanding_median|0:00:01.277246|0:00:00.206892|pyg
ewma |0:00:00.026971|0:00:00.094945|pandas
ewmstd |0:00:00.096954|0:00:00.110924|draw
ewmvar |0:00:00.039983|0:00:00.093965|pandas
pyg and numpy arrays¶
pyg supports numpy arrays natively. Indeed, pyg is 3-5 times faster when run on numpy arrays (a quick timing check follows the equivalence cells below).
[7]:
a = s.values
assert abs(ts_count(a) - ts_count(s))< 1e-10
assert abs(ts_mean(a) - ts_mean(s)) < 1e-10
assert abs(ts_sum(a) - ts_sum(s)) < 1e-10
assert abs(ts_std(a) - ts_std(s)) < 1e-10
assert abs(ts_skew(a) - ts_skew(s)) < 1e-10
[8]:
assert abs(ewma(s, 10) - ewma(a,10)).max() < 1e-10
assert abs(ewmstd(s, 10) - ewmstd(a,10)).max() < 1e-10
assert abs(ewmvar(s, 10) - ewmvar(a,10)).max() < 1e-10
assert abs(ewmcor(s, t, 10) - ewmcor(a, t.values, 10)).max() < 1e-10
[9]:
assert abs(expanding_sum(s) - expanding_sum(a)).max() < 1e-10
assert abs(expanding_min(s) - expanding_min(a)).max() < 1e-10
assert abs(expanding_max(s) - expanding_max(a)).max() < 1e-10
assert abs(expanding_mean(s) - expanding_mean(a)).max() < 1e-10
assert abs(expanding_std(s) - expanding_std(a)).max() < 1e-10
assert abs(expanding_skew(s) - expanding_skew(a)).max() < 1e-10
assert abs(expanding_median(s) - expanding_median(a)).max() < 1e-10
[10]:
assert abs(rolling_sum(s,10) - rolling_sum(a,10)).max() < 1e-10
assert abs(rolling_min(s,10) - rolling_min(a,10)).max() < 1e-10
assert abs(rolling_max(s,10) - rolling_max(a,10)).max() < 1e-10
assert abs(rolling_mean(s,10) - rolling_mean(a,10)).max() < 1e-10
assert abs(rolling_std(s,10) - rolling_std(a,10)).max() < 1e-10
assert abs(rolling_skew(s,10) - rolling_skew(a,10)).max() < 1e-10
assert abs(rolling_median(s,10) - rolling_median(a,10)).max() < 1e-10
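A quick check of the 3-5x claim above, reusing the timer utility from the performance comparison (a sketch only; exact timings will vary by machine):

series_time = timer(ewma, n = 100, time = True)(s, 10)   # 100 runs of ewma on the pandas Series
array_time = timer(ewma, n = 100, time = True)(a, 10)    # 100 runs of ewma on the underlying numpy array
series_time, array_time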
pandas treatment of nan¶
Suppose we have weekly data that at some point we resample to daily… The two look the same…
[11]:
t0 = dt_bump('20210301', '-999w')
days = drange(t0,'20210301','1b')
weekly = pd.Series(np.random.normal(0,1,1000), drange(t0,None,'1w')); weekly.name = 'weekly'
daily = weekly.reindex(days); daily.name = 'daily'
pd.concat([weekly,daily], axis = 1)
[11]:
 | weekly | daily |
---|---|---|
2002-01-07 | 0.423187 | 0.423187 |
2002-01-08 | NaN | NaN |
2002-01-09 | NaN | NaN |
2002-01-10 | NaN | NaN |
2002-01-11 | NaN | NaN |
... | ... | ... |
2021-02-23 | NaN | NaN |
2021-02-24 | NaN | NaN |
2021-02-25 | NaN | NaN |
2021-02-26 | NaN | NaN |
2021-03-01 | 1.408439 | 1.408439 |
4996 rows × 2 columns
… but any calculation using the daily will yield a different result from a calculation on the weekly which is then resampled to daily:
[12]:
pd.concat([weekly.ewm(4).mean().reindex(days), daily.ewm(4).mean()], axis = 1) ## The result depends on what is done first...
[12]:
 | weekly | daily |
---|---|---|
2002-01-07 | 0.423187 | 0.423187 |
2002-01-08 | NaN | 0.423187 |
2002-01-09 | NaN | 0.423187 |
2002-01-10 | NaN | 0.423187 |
2002-01-11 | NaN | 0.423187 |
... | ... | ... |
2021-02-23 | NaN | 0.178687 |
2021-02-24 | NaN | 0.178687 |
2021-02-25 | NaN | 0.178687 |
2021-02-26 | NaN | 0.178687 |
2021-03-01 | 0.655222 | 1.005474 |
4996 rows × 2 columns
[13]:
pd.concat([weekly.diff().reindex(days), daily.diff()], axis = 1) ## The result depends on what is done first...
[13]:
 | weekly | daily |
---|---|---|
2002-01-07 | NaN | NaN |
2002-01-08 | NaN | NaN |
2002-01-09 | NaN | NaN |
2002-01-10 | NaN | NaN |
2002-01-11 | NaN | NaN |
... | ... | ... |
2021-02-23 | NaN | NaN |
2021-02-24 | NaN | NaN |
2021-02-25 | NaN | NaN |
2021-02-26 | NaN | NaN |
2021-03-01 | 1.644159 | NaN |
4996 rows × 2 columns
Indeed, for diff, daily.diff() is all nan: every daily difference involves at least one nan, so the nans propagate throughout.
pyg.timeseries treatment of nans¶
pyg treats nans as if they are not there, so the fact that we resampled the data and introduced lots of nans does not affect the calculations. We find this to be a more logical (and less error-prone) approach.
[14]:
nona(pd.concat([ewma(weekly, 4).reindex(days), ewma(daily,4)], axis = 1)) ## The two match exactly
[14]:
 | 0 | 1 |
---|---|---|
2002-01-07 | 0.423187 | 0.423187 |
2002-01-14 | -0.105302 | -0.105302 |
2002-01-21 | -0.019371 | -0.019371 |
2002-01-28 | 0.332137 | 0.332137 |
2002-02-04 | 0.559419 | 0.559419 |
... | ... | ... |
2021-02-01 | 0.369931 | 0.369931 |
2021-02-08 | 0.526351 | 0.526351 |
2021-02-15 | 0.642578 | 0.642578 |
2021-02-22 | 0.466918 | 0.466918 |
2021-03-01 | 0.655222 | 0.655222 |
1000 rows × 2 columns
[15]:
nona(pd.concat([diff(weekly).reindex(days), diff(daily)], axis = 1)) ## The two match exactly
[15]:
 | 0 | 1 |
---|---|---|
2002-01-14 | -0.951280 | -0.951280 |
2002-01-21 | 0.632462 | 0.632462 |
2002-01-28 | 0.913911 | 0.913911 |
2002-02-04 | 0.077887 | 0.077887 |
2002-02-11 | -2.180086 | -2.180086 |
... | ... | ... |
2021-02-01 | -0.678079 | -0.678079 |
2021-02-08 | 1.049093 | 1.049093 |
2021-02-15 | -0.044543 | -0.044543 |
2021-02-22 | -1.343206 | -1.343206 |
2021-03-01 | 1.644159 | 1.644159 |
999 rows × 2 columns
Using pyg.timeseries to manage state¶
One of the problems in timeseries analysis is writing research code that works well for analysing past data but can, ideally, also be used unchanged in a live application. One easy approach is to "stick the extra data point at the end and run it again from 1980". This leaves us with a single code base, but for many live applications (e.g. live trading) it is simply not viable.
Further, given our positions today, we may want to run simulations of "what happens next?" to understand what the system is likely to do should various events occur. Risk calculations are costly, and re-running 10k Monte Carlo scenarios, each one starting from 1980, is prohibitively slow.
Conversely, we can run research and live systems on two separate code bases. This makes the live system responsive, but six months down the line we realise the research and live code bases did not do quite the same thing.
pyg approaches this problem by exposing the internal state of each of its calculations. Each function has two versions:
function(…) returns the calculation as performed by pandas.
function_(…) returns a dictionary dict(data = ..., state = ...). The data agrees with function(…), while the state is a dict with which we can instantiate new calculations.
[16]:
from pyg import *
history = pd.Series(np.random.normal(0,1,1000), drange(-1000,-1))
history_signal = ewma_(history, 10)
history_signal # The output consists of 'data' and 'state' where data matches a normal ewma calculation
[16]:
{'data': 2018-06-10 -0.511500
2018-06-11 0.445609
2018-06-12 -0.065606
2018-06-13 -0.358735
2018-06-14 -0.069188
...
2021-03-01 -0.144503
2021-03-02 -0.066708
2021-03-03 -0.141431
2021-03-04 -0.122797
2021-03-05 -0.051610
Length: 1000, dtype: float64,
'state': {'t': nan, 't0': 0.9999999999999994, 't1': -0.05161000819451757}}
[17]:
live = pd.Series(np.random.normal(0,1,10), drange(9))
live_signal = ewma(live, 10, state = history_signal.state) ## we feed in only the live timeseries
'live: from today onwards', live_signal
[17]:
('live: from today onwards',
2021-03-06 -0.059815
2021-03-07 -0.165151
2021-03-08 -0.104525
2021-03-09 -0.160978
2021-03-10 -0.224791
2021-03-11 -0.325723
2021-03-12 -0.207468
2021-03-13 -0.233642
2021-03-14 -0.228141
2021-03-15 -0.244483
dtype: float64)
[18]:
joint_data = pd.concat([history, live])
joint_signal = ewma(joint_data, 10)
assert eq(live_signal, joint_signal[dt(0):]) # The live signal is the same, even though it only received live data for its calculation.
joint_signal[dt(0):]
[18]:
2021-03-06 -0.059815
2021-03-07 -0.165151
2021-03-08 -0.104525
2021-03-09 -0.160978
2021-03-10 -0.224791
2021-03-11 -0.325723
2021-03-12 -0.207468
2021-03-13 -0.233642
2021-03-14 -0.228141
2021-03-15 -0.244483
dtype: float64
This allows us to set up three parallel pipelines that share a virtually identical codebase:
workflow | historic data | live data | risk analysis
---|---|---|---
when run? | research/overnight | live | overnight
data source? | ts = long timeseries | a = short ts/array | 1000's of sims
speed? | slow, non-critical | instantaneous | quick
apply f to data | x_ = f_(ts) | x = f(a, **x_) | same as live
apply g | y_ = g_(ts, x_) | y = g(a, x, **y_) | same as live
final result h | z_ = h_(ts, x_, y_) | z = h(a, x, y, **z_) | same as live
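As a concrete sketch of the table above, using ewma as f and ewmrms as g, and passing the saved states via the state= keyword shown earlier rather than the **x_ shorthand (this assumes ewmrms accepts state= in the same way ewma does):

ts = pd.Series(np.random.normal(0,1,1000), drange(-1000,-1))   # long history, run overnight
a = np.random.normal(0,1,10)                                   # short live array, run today

x_ = ewma_(ts, 10)                     # overnight: full history, keep both data and state
y_ = ewmrms_(x_.data, 50)              # chain g on f's historical output

x = ewma(a, 10, state = x_.state)      # live: new data only, instantiated with f's state
y = ewmrms(x, 50, state = y_.state)    # assumed: ewmrms supports state= like ewma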
Note that for live trading or risk analysis we tend to switch and run on numpy arrays rather than pandas objects. This speeds up the calculations while introducing no code change. In the example below we explore how to create state-aware functions within pyg. The paradigm is that for most functions, function_ returns not just the timeseries output but also the state.
Example: creating a function exposing its state¶
Suppose we want to write an ewma crossover function (the difference of two ewmas), normalized by its own volatility. Traditionally we would write:
[19]:
def pandas_crossover(a, fast, slow, vol):
    fast_ewma = a.ewm(fast).mean()
    slow_ewma = a.ewm(slow).mean()
    raw_signal = fast_ewma - slow_ewma
    signal_rms = (raw_signal**2).ewm(vol).mean()**0.5
    signal_rms[signal_rms==0] = np.nan
    normalized = raw_signal/signal_rms
    return normalized
a = pd.Series(np.random.normal(0,1,10000), drange(-9999)); fast = 10; slow = 30; vol = 50
pandas_x = pandas_crossover(a, fast, slow, vol)
pandas_x
[19]:
1993-10-20 NaN
1993-10-21 -1.407264
1993-10-22 -1.714259
1993-10-23 1.177760
1993-10-24 -1.220600
...
2021-03-02 -1.767405
2021-03-03 -1.183420
2021-03-04 -1.764486
2021-03-05 -2.458497
2021-03-06 -2.242366
Length: 10000, dtype: float64
We can quickly rewrite it using pyg:
[28]:
def crossover(a, fast, slow, vol):
    fast_ewma = ewma(a, fast)
    slow_ewma = ewma(a, slow)
    raw_signal = fast_ewma - slow_ewma
    signal_rms = ewmrms(raw_signal, vol)
    signal_rms = v2na(signal_rms)
    normalized = raw_signal/signal_rms
    return normalized
x = crossover(a, fast, slow, vol)
assert abs(x-pandas_x).max()<1e-10
x
[28]:
1993-10-20 -1.000000
1993-10-21 -1.407264
1993-10-22 -1.714259
1993-10-23 1.177760
1993-10-24 -1.220600
...
2021-03-02 -1.767405
2021-03-03 -1.183420
2021-03-04 -1.764486
2021-03-05 -2.458497
2021-03-06 -2.242366
Length: 10000, dtype: float64
And with very little additional effort, we can write a new function that also exposes the internal state:
[29]:
_data = 'data'
def crossover_(a, fast, slow, vol, instate = None):
    state = Dict(fast = {}, slow = {}, vol = {}) if instate is None else instate
    fast_ewma_ = ewma_(a, fast, instate = state.fast)
    slow_ewma_ = ewma_(a, slow, instate = state.slow)
    raw_signal = fast_ewma_.data - slow_ewma_.data
    signal_rms = ewmrms_(raw_signal, vol, instate = state.vol)
    normalized = raw_signal/v2na(signal_rms.data)
    return Dict(data = normalized, state = Dict(fast = fast_ewma_.state, slow = slow_ewma_.state, vol = signal_rms.state))
crossover_.output = ['data', 'state'] # output declares the function to have a dict output and is used by cell

def crossover(a, fast, slow, vol, state = None):
    return crossover_(a, fast, slow, vol, instate = state).data
x_ = crossover_(a, fast, slow, vol)
assert eq(x, x_.data) and eq(x, crossover(a, fast, slow, vol))
x_.data
[29]:
1993-10-20 -1.000000
1993-10-21 -1.407264
1993-10-22 -1.714259
1993-10-23 1.177760
1993-10-24 -1.220600
...
2021-03-02 -1.767405
2021-03-03 -1.183420
2021-03-04 -1.764486
2021-03-05 -2.458497
2021-03-06 -2.242366
Length: 10000, dtype: float64
The three give identical results, and we can also verify that crossover_ allows us to split the evaluation into the long history and the new data:
[45]:
history = a[:9900]
live = a[9900:].values
x_history = crossover_(history, 10, 30, 50)
x_live = crossover(live, 10, 30, 50, state = x_history.state)
x_ = crossover_(a, fast, slow, vol)
assert eq(x_live , x_.data[9900:].values)
Have we gained anything?
[46]:
pandas_old = timer(pandas_crossover, 100, time = True)(history, 10, 30, 50)
x_history = crossover_(history, 10, 30, 50)
x_history_time = timer(crossover_, 100, time = True)(history, 10, 30, 50)
x_live = timer(crossover, 100, time = True)(live, 10, 30, 50, state = x_history.state)
'pandas: ', pandas_old.microseconds//1000, 'pyg history:', x_history_time.microseconds//1000, 'pyg_live:', x_live.microseconds//1000
2021-03-06 23:55:39,746 - pyg - INFO - TIMER:'pandas_crossover' args:[["<class 'pandas.core.series.Series'>[9900]", '10', '30', '50'], []] (100 runs) took 0:00:00.373514 sec
2021-03-06 23:55:39,953 - pyg - INFO - TIMER:'crossover_' args:[["<class 'pandas.core.series.Series'>[9900]", '10', '30', '50'], []] (100 runs) took 0:00:00.202883 sec
2021-03-06 23:55:40,004 - pyg - INFO - TIMER:'crossover' args:[["<class 'numpy.ndarray'>[100]", '10', '30', '50'], ["state=<class 'pyg.base._dict.Dict'>[3]"]] (100 runs) took 0:00:00.049972 sec
[46]:
('pandas: ', 373, 'pyg history:', 202, 'pyg_live:', 49)
We see that pyg is already faster than pandas (about 2ms vs 3.7ms per run of the full history). Running just the new data as a numpy array is about 4-5 times faster still (about 0.5ms per run). Indeed, running 10k 100-day forward scenarios takes about 2 seconds at most.
[48]:
scenarios = np.random.normal(0,1,(100,10000))
x_scenarios = timer(crossover)(scenarios , 10, 30, 50, state = x_history.state)
2021-03-06 23:56:10,252 - pyg - INFO - TIMER:'crossover' args:[["<class 'numpy.ndarray'>[100]", '10', '30', '50'], ["state=<class 'pyg.base._dict.Dict'>[3]"]] (1 runs) took 0:00:01.605710 sec
Using cells, our code looks like this, with the live and historical code bases looking pretty similar:
[49]:
x_history = cell(crossover_, a = history, fast = 10, slow = 30, vol = 50)()
x_live = cell(crossover, a = live, fast = 10, slow = 30, vol = 50, state = x_history)()
x_history
[49]:
cell
a:
1993-10-20 0.463739
1993-10-21 0.429161
1993-10-22 -0.342095
1993-10-23 1.192557
1993-10-24 -0.448828
...
2020-11-22 -0.272184
2020-11-23 0.121197
2020-11-24 -0.581223
2020-11-25 -0.682961
2020-11-26 -1.084583
Length: 9900, dtype: float64
fast:
10
slow:
30
vol:
50
function:
<function crossover_ at 0x000001CF9B58BA60>
instate:
None
data:
1993-10-20 -1.000000
1993-10-21 -1.407264
1993-10-22 -1.714259
1993-10-23 1.177760
1993-10-24 -1.220600
...
2020-11-22 -2.091785
2020-11-23 -1.765958
2020-11-24 -1.796933
2020-11-25 -1.853106
2020-11-26 -2.044795
Length: 9900, dtype: float64
state:
Dict
fast:
{'t': nan, 't0': 0.9999999999999994, 't1': -0.4251894284980144}
slow:
{'t': nan, 't0': 0.9999999999999983, 't1': -0.14408421908740027}
vol:
{'t': nan, 't0': 0.9999999999999972, 't2': 0.01889897942675779}
[50]:
pd.concat([pd.Series(x_live.data, pandas_x.index[-100:]), pandas_x.iloc[-100:]], axis = 1)
[50]:
 | 0 | 1 |
---|---|---|
2020-11-27 | -2.466036 | -2.466036 |
2020-11-28 | -1.899795 | -1.899795 |
2020-11-29 | -1.573653 | -1.573653 |
2020-11-30 | -1.473624 | -1.473624 |
2020-12-01 | -1.978180 | -1.978180 |
... | ... | ... |
2021-03-02 | -1.767405 | -1.767405 |
2021-03-03 | -1.183420 | -1.183420 |
2021-03-04 | -1.764486 | -1.764486 |
2021-03-05 | -2.458497 | -2.458497 |
2021-03-06 | -2.242366 | -2.242366 |
100 rows × 2 columns