{ "cells": [ { "cell_type": "markdown", "id": "continuing-carter", "metadata": {}, "source": [ "# pyg.timeseries\n", "\n", "Given pandas, why do we need this timeseries library? \n", "pandas is amazing but there are a few features in pyg.timeseries designed to enhance it. \n", "\n", "
There are three issues with pandas that pyg.timeseries tries to address:" ] }, { "cell_type": "markdown", "id": "ahead-johnston", "metadata": {}, "source": [ "- pandas works on pandas objects (obviously) but not on numpy arrays.\n", "- pandas handles nan within timeseries inconsistently across its functions. This makes your results sensitive to reindexing/resampling. E.g.:\n", " - a.expanding() & a.ewm() **ignore** nan's for calculation and then forward-fill the result.\n", " - a.diff(), a.rolling() **include** any nans in the calculation, leading to nan propagation.\n", "- pandas is great if you have the full timeseries. However, if you now want to run the same calculations in a live environment, on recent data, you have to append the recent data to the end of the DataFrame and rerun.\n" ] }, { "cell_type": "markdown", "id": "addressed-promise", "metadata": {}, "source": [ "pyg.timeseries tries to address this:" ] }, { "cell_type": "markdown", "id": "roman-drive", "metadata": {}, "source": [ "- pyg.timeseries agrees with pandas 100% (if there are no nans in the dataframe) while being of comparable speed.\n", "- pyg.timeseries works seamlessly on pandas objects and on numpy arrays, with no code change. \n", "- pyg.timeseries handles nan consistently across all its functions, 'ignoring' all nan, making your results consistent regardless of reindexing/resampling.\n", "- pyg.timeseries exposes the state of the internal function calculation. Exposing the internal state allows us to calculate the output on additional data **without** re-running history. This speeds up two very common tasks in finance:\n", " - risk calculations, Monte Carlo scenarios: we can run a trading strategy up to today, then generate multiple what-if scenarios, without having to rerun the full history. \n", " - live versus history: pandas is designed to run a full historical simulation. However, once we reach \"today\", speed is of the essence and running a full historical simulation every time we ingest a new price is just too slow. That is why most fast trading is built around fast state-machines. Of course, making sure the research & live versions do the same thing is tricky. 
pyg gives you the ability to run two systems in parallel with almost the same code base: run full history overnight and then run today's code base instantly, instantiated with the output of the historical simulation.\n" ] }, { "cell_type": "markdown", "id": "liquid-favor", "metadata": {}, "source": [ "## Agreement between pyg.timeseries and pandas" ] }, { "cell_type": "code", "execution_count": 1, "id": "crucial-chassis", "metadata": {}, "outputs": [], "source": [ "from pyg import *; import pandas as pd; import numpy as np\n", "s = pd.Series(np.random.normal(0,1,10000), drange(-9999)); a = s.values\n", "t = pd.Series(np.random.normal(0,1,10000), drange(-9999))" ] }, { "cell_type": "code", "execution_count": 2, "id": "written-midwest", "metadata": {}, "outputs": [], "source": [ "assert abs(s.count() - ts_count(s))< 1e-10\n", "assert abs(s.mean() - ts_mean(s)) < 1e-10\n", "assert abs(s.sum() - ts_sum(s)) < 1e-10\n", "assert abs(s.std() - ts_std(s)) < 1e-10\n", "assert abs(s.skew() - ts_skew(s)) < 1e-10" ] }, { "cell_type": "code", "execution_count": 3, "id": "arctic-accommodation", "metadata": {}, "outputs": [], "source": [ "assert abs(ewma(s, 10) - s.ewm(10).mean()).max() < 1e-10\n", "assert abs(ewmstd(s, 10) - s.ewm(10).std()).max() < 1e-10\n", "assert abs(ewmvar(s, 10) - s.ewm(10).var()).max() < 1e-10\n", "assert abs(ewmcor(s, t, 10) - s.ewm(10).corr(t)).max() < 1e-10" ] }, { "cell_type": "code", "execution_count": 4, "id": "worth-telling", "metadata": {}, "outputs": [], "source": [ "assert abs(expanding_sum(s) - s.expanding().sum()).max() < 1e-10\n", "assert abs(expanding_mean(s) - s.expanding().mean()).max() < 1e-10\n", "assert abs(expanding_std(s) - s.expanding().std()).max() < 1e-10\n", "assert abs(expanding_skew(s) - s.expanding().skew()).max() < 1e-10\n", "assert abs(expanding_min(s) - s.expanding().min()).max() < 1e-10\n", "assert abs(expanding_max(s) - s.expanding().max()).max() < 1e-10\n", "assert abs(expanding_median(s) - s.expanding().median()).max() < 1e-10" ] }, { "cell_type": "code", "execution_count": 5, "id": "turkish-montana", "metadata": {}, "outputs": [], "source": [ "assert abs(rolling_sum(s,10) - s.rolling(10).sum()).max() < 1e-10\n", "assert abs(rolling_mean(s,10) - s.rolling(10).mean()).max() < 1e-10\n", "assert abs(rolling_std(s,10) - s.rolling(10).std()).max() < 1e-10\n", "assert abs(rolling_skew(s,10) - s.rolling(10).skew()).max() < 1e-10\n", "assert abs(rolling_min(s,10) - s.rolling(10).min()).max() < 1e-10\n", "assert abs(rolling_max(s,10) - s.rolling(10).max()).max() < 1e-10\n", "assert abs(rolling_median(s,10) - s.rolling(10).median()).max() < 1e-10\n", "assert abs(rolling_quantile(s,10,0.3)[0.3] - s.rolling(10).quantile(0.3)).max() < 1e-10 ## The rolling_quantile returns the quantile as the header, since it supports multiple quantiles calculations: e.g. 
rolling_quantile(s,10,[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])" ] }, { "cell_type": "markdown", "id": "naughty-congo", "metadata": {}, "source": [ "### Quick performance comparison \n", "pyg, when run on pandas dataframes rather than arrays, is of comparable speed to pandas" ] }, { "cell_type": "code", "execution_count": 6, "id": "challenging-separate", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2021-03-06 22:41:35,805 - pyg - INFO - TIMER:'rolling_sum' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.109934 sec\n", "2021-03-06 22:41:35,898 - pyg - INFO - TIMER:'rolling_mean' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.087950 sec\n", "2021-03-06 22:41:35,995 - pyg - INFO - TIMER:'rolling_std' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.090944 sec\n", "2021-03-06 22:41:36,116 - pyg - INFO - TIMER:'rolling_min' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.119930 sec\n", "2021-03-06 22:41:36,269 - pyg - INFO - TIMER:'rolling_median' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.153483 sec\n", "2021-03-06 22:41:36,351 - pyg - INFO - TIMER:'sum' args:[[], []] (100 runs) took 0:00:00.079685 sec\n", "2021-03-06 22:41:36,465 - pyg - INFO - TIMER:'mean' args:[[], []] (100 runs) took 0:00:00.112599 sec\n", "2021-03-06 22:41:36,585 - pyg - INFO - TIMER:'std' args:[[], []] (100 runs) took 0:00:00.119877 sec\n", "2021-03-06 22:41:36,687 - pyg - INFO - TIMER:'min' args:[[], []] (100 runs) took 0:00:00.100942 sec\n", "2021-03-06 22:41:37,378 - pyg - INFO - TIMER:'median' args:[[], []] (100 runs) took 0:00:00.688107 sec\n", "2021-03-06 22:41:37,467 - pyg - INFO - TIMER:'expanding_sum' args:[[\"[10000]\"], []] (100 runs) took 0:00:00.086951 sec\n", "2021-03-06 22:41:37,528 - pyg - INFO - TIMER:'expanding_mean' args:[[\"[10000]\"], []] (100 runs) took 0:00:00.059966 sec\n", "2021-03-06 22:41:37,651 - pyg - INFO - TIMER:'expanding_std' args:[[\"[10000]\"], []] (100 runs) took 0:00:00.120938 sec\n", "2021-03-06 22:41:37,695 - pyg - INFO - TIMER:'expanding_min' args:[[\"[10000]\"], []] (100 runs) took 0:00:00.040991 sec\n", "2021-03-06 22:41:37,903 - pyg - INFO - TIMER:'expanding_median' args:[[\"[10000]\"], []] (100 runs) took 0:00:00.206892 sec\n", "2021-03-06 22:41:37,939 - pyg - INFO - TIMER:'sum' args:[[], []] (100 runs) took 0:00:00.033967 sec\n", "2021-03-06 22:41:38,002 - pyg - INFO - TIMER:'mean' args:[[], []] (100 runs) took 0:00:00.059981 sec\n", "2021-03-06 22:41:38,075 - pyg - INFO - TIMER:'std' args:[[], []] (100 runs) took 0:00:00.071959 sec\n", "2021-03-06 22:41:38,245 - pyg - INFO - TIMER:'min' args:[[], []] (100 runs) took 0:00:00.168553 sec\n", "2021-03-06 22:41:39,523 - pyg - INFO - TIMER:'median' args:[[], []] (100 runs) took 0:00:01.277246 sec\n", "2021-03-06 22:41:39,620 - pyg - INFO - TIMER:'ewma' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.094945 sec\n", "2021-03-06 22:41:39,732 - pyg - INFO - TIMER:'ewmstd' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.110924 sec\n", "2021-03-06 22:41:39,827 - pyg - INFO - TIMER:'ewmvar' args:[[\"[10000]\", '10'], []] (100 runs) took 0:00:00.093965 sec\n", "2021-03-06 22:41:39,855 - pyg - INFO - TIMER:'mean' args:[[], []] (100 runs) took 0:00:00.026971 sec\n", "2021-03-06 22:41:39,953 - pyg - INFO - TIMER:'std' args:[[], []] (100 runs) took 0:00:00.096954 sec\n", "2021-03-06 22:41:39,995 - pyg - INFO - TIMER:'var' args:[[], []] (100 runs) took 0:00:00.039983 sec\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "op |pandas |pyg 
|winner\n", "rolling_sum |0:00:00.079685|0:00:00.109934|pandas\n", "rolling_mean |0:00:00.112599|0:00:00.087950|pyg \n", "rolling_std |0:00:00.119877|0:00:00.090944|pyg \n", "rolling_min |0:00:00.100942|0:00:00.119930|draw \n", "rolling_median |0:00:00.688107|0:00:00.153483|pyg \n", "expanding_sum |0:00:00.033967|0:00:00.086951|pandas\n", "expanding_mean |0:00:00.059981|0:00:00.059966|draw \n", "expanding_std |0:00:00.071959|0:00:00.120938|pandas\n", "expanding_min |0:00:00.168553|0:00:00.040991|pyg \n", "expanding_median|0:00:01.277246|0:00:00.206892|pyg \n", "ewma |0:00:00.026971|0:00:00.094945|pandas\n", "ewmstd |0:00:00.096954|0:00:00.110924|draw \n", "ewmvar |0:00:00.039983|0:00:00.093965|pandas\n" ] } ], "source": [ "compare = dictable(op = ['rolling_sum', 'rolling_mean', 'rolling_std', 'rolling_min', 'rolling_median'],\n", " pyg = [rolling_sum, rolling_mean, rolling_std, rolling_min, rolling_median], \n", " pandas = [s.rolling(10).sum, s.rolling(10).mean, s.rolling(10).std, s.rolling(10).min, s.rolling(10).median]).do(lambda v: timer(v, n = 100, time = True), 'pyg', 'pandas')(pyg = lambda pyg: pyg(s, 10))(pandas = lambda pandas: pandas())\n", "\n", "compare += dictable(op = ['expanding_sum', 'expanding_mean', 'expanding_std', 'expanding_min', 'expanding_median'],\n", " pyg = [expanding_sum, expanding_mean, expanding_std, expanding_min, expanding_median], \n", " pandas = [s.expanding().sum, s.expanding().mean, s.expanding().std, s.expanding().min, s.expanding().median]).do(lambda v: timer(v, n = 100, time = True), 'pyg', 'pandas')(pyg = lambda pyg: pyg(s))(pandas = lambda pandas: pandas())\n", "\n", "compare += dictable(op = ['ewma', 'ewmstd', 'ewmvar'],\n", " pyg = [ewma, ewmstd, ewmvar], \n", " pandas = [s.ewm(10).mean, s.ewm(10).std, s.ewm(10).var]).do(lambda v: timer(v, n = 100, time = True), 'pyg', 'pandas')(pyg = lambda pyg: pyg(s, 10))(pandas = lambda pandas: pandas())\n", "\n", "print(compare(winner = lambda pyg, pandas: 'pyg' if 1.2 * pyg < pandas else 'pandas' if 1.2 * pandas < pyg else 'draw'))" ] }, { "cell_type": "markdown", "id": "circular-album", "metadata": {}, "source": [ "## pyg and numpy arrays\n", "pyg supports numpy arrays natively. Indeed, pyg is 3-5 times faster on numpy arrays." 
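, "\n", "The cells below verify that the numpy and pandas code paths agree. To check the speed claim itself, a minimal sketch (reusing `s` and `a = s.values` from the first cell and the same `timer` utility used in the comparison above; exact ratios will vary by machine) is:\n", "\n", "```python\n", "pandas_time = timer(ewma, n = 100, time = True)(s, 10)   # pyg ewma on a pandas Series\n", "numpy_time = timer(ewma, n = 100, time = True)(a, 10)    # pyg ewma on the underlying numpy array\n", "print(pandas_time, numpy_time)  # the array version is typically a few times faster\n", "```"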
] }, { "cell_type": "code", "execution_count": 7, "id": "attempted-horror", "metadata": {}, "outputs": [], "source": [ "a = s.values\n", "assert abs(ts_count(a) - ts_count(s))< 1e-10\n", "assert abs(ts_mean(a) - ts_mean(s)) < 1e-10\n", "assert abs(ts_sum(a) - ts_sum(s)) < 1e-10\n", "assert abs(ts_std(a) - ts_std(s)) < 1e-10\n", "assert abs(ts_skew(a) - ts_skew(s)) < 1e-10" ] }, { "cell_type": "code", "execution_count": 8, "id": "tender-league", "metadata": {}, "outputs": [], "source": [ "assert abs(ewma(s, 10) - ewma(a,10)).max() < 1e-10\n", "assert abs(ewmstd(s, 10) - ewmstd(a,10)).max() < 1e-10\n", "assert abs(ewmvar(s, 10) - ewmvar(a,10)).max() < 1e-10\n", "assert abs(ewmcor(s, t, 10) - ewmcor(a, t.values, 10)).max() < 1e-10" ] }, { "cell_type": "code", "execution_count": 9, "id": "solar-specification", "metadata": {}, "outputs": [], "source": [ "assert abs(expanding_sum(s) - expanding_sum(a)).max() < 1e-10\n", "assert abs(expanding_min(s) - expanding_min(a)).max() < 1e-10\n", "assert abs(expanding_max(s) - expanding_max(a)).max() < 1e-10\n", "assert abs(expanding_mean(s) - expanding_mean(a)).max() < 1e-10\n", "assert abs(expanding_std(s) - expanding_std(a)).max() < 1e-10\n", "assert abs(expanding_skew(s) - expanding_skew(a)).max() < 1e-10\n", "assert abs(expanding_median(s) - expanding_median(a)).max() < 1e-10" ] }, { "cell_type": "code", "execution_count": 10, "id": "continuous-malta", "metadata": {}, "outputs": [], "source": [ "assert abs(rolling_sum(s,10) - rolling_sum(a,10)).max() < 1e-10\n", "assert abs(rolling_min(s,10) - rolling_min(a,10)).max() < 1e-10\n", "assert abs(rolling_max(s,10) - rolling_max(a,10)).max() < 1e-10\n", "assert abs(rolling_mean(s,10) - rolling_mean(a,10)).max() < 1e-10\n", "assert abs(rolling_std(s,10) - rolling_std(a,10)).max() < 1e-10\n", "assert abs(rolling_skew(s,10) - rolling_skew(a,10)).max() < 1e-10\n", "assert abs(rolling_median(s,10) - rolling_median(a,10)).max() < 1e-10" ] }, { "cell_type": "markdown", "id": "established-parts", "metadata": {}, "source": [ "## pandas treatment of nan\n", "\n", "Suppose we have weekly data that at some point we resample to daily... The two look the same... " ] }, { "cell_type": "code", "execution_count": 11, "id": "yellow-lawyer", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weeklydaily
2002-01-070.4231870.423187
2002-01-08NaNNaN
2002-01-09NaNNaN
2002-01-10NaNNaN
2002-01-11NaNNaN
.........
2021-02-23NaNNaN
2021-02-24NaNNaN
2021-02-25NaNNaN
2021-02-26NaNNaN
2021-03-011.4084391.408439
\n", "

4996 rows × 2 columns

\n", "
" ], "text/plain": [ " weekly daily\n", "2002-01-07 0.423187 0.423187\n", "2002-01-08 NaN NaN\n", "2002-01-09 NaN NaN\n", "2002-01-10 NaN NaN\n", "2002-01-11 NaN NaN\n", "... ... ...\n", "2021-02-23 NaN NaN\n", "2021-02-24 NaN NaN\n", "2021-02-25 NaN NaN\n", "2021-02-26 NaN NaN\n", "2021-03-01 1.408439 1.408439\n", "\n", "[4996 rows x 2 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t0 = dt_bump('20210301', '-999w')\n", "days = drange(t0,'20210301','1b')\n", "weekly = pd.Series(np.random.normal(0,1,1000), drange(t0,None,'1w')); weekly.name = 'weekly'\n", "daily = weekly.reindex(days); daily.name = 'daily'\n", "pd.concat([weekly,daily], axis = 1)" ] }, { "cell_type": "markdown", "id": "annoying-kazakhstan", "metadata": {}, "source": [ "... but any calculation using the daily will yield a different result from a calculation on the weekly which is then resampled to daily:" ] }, { "cell_type": "code", "execution_count": 12, "id": "italic-eligibility", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weeklydaily
2002-01-070.4231870.423187
2002-01-08NaN0.423187
2002-01-09NaN0.423187
2002-01-10NaN0.423187
2002-01-11NaN0.423187
.........
2021-02-23NaN0.178687
2021-02-24NaN0.178687
2021-02-25NaN0.178687
2021-02-26NaN0.178687
2021-03-010.6552221.005474
\n", "

4996 rows × 2 columns

\n", "
" ], "text/plain": [ " weekly daily\n", "2002-01-07 0.423187 0.423187\n", "2002-01-08 NaN 0.423187\n", "2002-01-09 NaN 0.423187\n", "2002-01-10 NaN 0.423187\n", "2002-01-11 NaN 0.423187\n", "... ... ...\n", "2021-02-23 NaN 0.178687\n", "2021-02-24 NaN 0.178687\n", "2021-02-25 NaN 0.178687\n", "2021-02-26 NaN 0.178687\n", "2021-03-01 0.655222 1.005474\n", "\n", "[4996 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([weekly.ewm(4).mean().reindex(days), daily.ewm(4).mean()], axis = 1) ## The result depends on what is done first..." ] }, { "cell_type": "code", "execution_count": 13, "id": "noticed-warren", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weeklydaily
2002-01-07NaNNaN
2002-01-08NaNNaN
2002-01-09NaNNaN
2002-01-10NaNNaN
2002-01-11NaNNaN
.........
2021-02-23NaNNaN
2021-02-24NaNNaN
2021-02-25NaNNaN
2021-02-26NaNNaN
2021-03-011.644159NaN
\n", "

4996 rows × 2 columns

\n", "
" ], "text/plain": [ " weekly daily\n", "2002-01-07 NaN NaN\n", "2002-01-08 NaN NaN\n", "2002-01-09 NaN NaN\n", "2002-01-10 NaN NaN\n", "2002-01-11 NaN NaN\n", "... ... ...\n", "2021-02-23 NaN NaN\n", "2021-02-24 NaN NaN\n", "2021-02-25 NaN NaN\n", "2021-02-26 NaN NaN\n", "2021-03-01 1.644159 NaN\n", "\n", "[4996 rows x 2 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([weekly.diff().reindex(days), daily.diff()], axis = 1) ## The result depends on what is done first..." ] }, { "cell_type": "markdown", "id": "frozen-cricket", "metadata": {}, "source": [ "Indeed, for diff, daily.diff() is all nan! " ] }, { "cell_type": "markdown", "id": "laden-protest", "metadata": {}, "source": [ "## pyg.timeseries treatment of nans\n", "pyg treats nan as if they are not there, so the fact that we resampled the data and introduced lots of nan's does not affect the calculations. We find this to be a more logical (and less error prone) approach. " ] }, { "cell_type": "code", "execution_count": 14, "id": "outside-agent", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
2002-01-070.4231870.423187
2002-01-14-0.105302-0.105302
2002-01-21-0.019371-0.019371
2002-01-280.3321370.332137
2002-02-040.5594190.559419
.........
2021-02-010.3699310.369931
2021-02-080.5263510.526351
2021-02-150.6425780.642578
2021-02-220.4669180.466918
2021-03-010.6552220.655222
\n", "

1000 rows × 2 columns

\n", "
" ], "text/plain": [ " 0 1\n", "2002-01-07 0.423187 0.423187\n", "2002-01-14 -0.105302 -0.105302\n", "2002-01-21 -0.019371 -0.019371\n", "2002-01-28 0.332137 0.332137\n", "2002-02-04 0.559419 0.559419\n", "... ... ...\n", "2021-02-01 0.369931 0.369931\n", "2021-02-08 0.526351 0.526351\n", "2021-02-15 0.642578 0.642578\n", "2021-02-22 0.466918 0.466918\n", "2021-03-01 0.655222 0.655222\n", "\n", "[1000 rows x 2 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nona(pd.concat([ewma(weekly, 4).reindex(days), ewma(daily,4)], axis = 1)) ## The two match exactly" ] }, { "cell_type": "code", "execution_count": 15, "id": "logical-minimum", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
2002-01-14-0.951280-0.951280
2002-01-210.6324620.632462
2002-01-280.9139110.913911
2002-02-040.0778870.077887
2002-02-11-2.180086-2.180086
.........
2021-02-01-0.678079-0.678079
2021-02-081.0490931.049093
2021-02-15-0.044543-0.044543
2021-02-22-1.343206-1.343206
2021-03-011.6441591.644159
\n", "

999 rows × 2 columns

\n", "
" ], "text/plain": [ " 0 1\n", "2002-01-14 -0.951280 -0.951280\n", "2002-01-21 0.632462 0.632462\n", "2002-01-28 0.913911 0.913911\n", "2002-02-04 0.077887 0.077887\n", "2002-02-11 -2.180086 -2.180086\n", "... ... ...\n", "2021-02-01 -0.678079 -0.678079\n", "2021-02-08 1.049093 1.049093\n", "2021-02-15 -0.044543 -0.044543\n", "2021-02-22 -1.343206 -1.343206\n", "2021-03-01 1.644159 1.644159\n", "\n", "[999 rows x 2 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nona(pd.concat([diff(weekly).reindex(days), diff(daily)], axis = 1)) ## The result depends on what is done first..." ] }, { "cell_type": "markdown", "id": "driven-session", "metadata": {}, "source": [ "## Using pyg.timeseries to manage state\n", "\n", "One of the problem in timeseries analysis is writing research code that works in analysing past data but ideally, the same code can be used in live application. \n", "One easy approach is \"stick the extra data point at the end and run it again from 1980\". This leaves us with a single code base but for many live applications (e.g. live trading), this is not viable. \n", "\n", "Further, given our positions today, we may want to run simulations of \"what happens next?\" to understand what the system is likely to do should various events occur.\n", "Risk calculations are expensive and re-running 10k Monte Carlo scenarios, each time running from 1980 is expensive.\n", "\n", "Conversely, we can run research and live systems on two separate code base. This makes live systems responsive but six months down the line, we realise research code base and live code base did not do quite the same thing.\n", "\n", "pyg approaches this problem by exposing the internal state of each of its calculation. Each function has two versions:\n" ] }, { "cell_type": "markdown", "id": "disabled-spring", "metadata": {}, "source": [ "- function(...) returns the calculation as performed by pandas\n", "- function_(...) returns a dictionary of dict(data = , state = ). The data agrees with function(...) while the state is a dict we can instantiate new calculations with." ] }, { "cell_type": "code", "execution_count": 16, "id": "bottom-acrobat", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'data': 2018-06-10 -0.511500\n", " 2018-06-11 0.445609\n", " 2018-06-12 -0.065606\n", " 2018-06-13 -0.358735\n", " 2018-06-14 -0.069188\n", " ... 
\n", " 2021-03-01 -0.144503\n", " 2021-03-02 -0.066708\n", " 2021-03-03 -0.141431\n", " 2021-03-04 -0.122797\n", " 2021-03-05 -0.051610\n", " Length: 1000, dtype: float64,\n", " 'state': {'t': nan, 't0': 0.9999999999999994, 't1': -0.05161000819451757}}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyg import *\n", "history = pd.Series(np.random.normal(0,1,1000), drange(-1000,-1))\n", "history_signal = ewma_(history, 10) \n", "history_signal # The output consists of 'data' and 'state' where data matches a normal ewma calculation" ] }, { "cell_type": "code", "execution_count": 17, "id": "ancient-sheriff", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('live: from today onwards',\n", " 2021-03-06 -0.059815\n", " 2021-03-07 -0.165151\n", " 2021-03-08 -0.104525\n", " 2021-03-09 -0.160978\n", " 2021-03-10 -0.224791\n", " 2021-03-11 -0.325723\n", " 2021-03-12 -0.207468\n", " 2021-03-13 -0.233642\n", " 2021-03-14 -0.228141\n", " 2021-03-15 -0.244483\n", " dtype: float64)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "live = pd.Series(np.random.normal(0,1,10), drange(9))\n", "live_signal = ewma(live, 10, state = history_signal.state) ## I only feed in live timeseries\n", "'live: from today onwards', live_signal" ] }, { "cell_type": "code", "execution_count": 18, "id": "voluntary-marsh", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2021-03-06 -0.059815\n", "2021-03-07 -0.165151\n", "2021-03-08 -0.104525\n", "2021-03-09 -0.160978\n", "2021-03-10 -0.224791\n", "2021-03-11 -0.325723\n", "2021-03-12 -0.207468\n", "2021-03-13 -0.233642\n", "2021-03-14 -0.228141\n", "2021-03-15 -0.244483\n", "dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joint_data = pd.concat([history, live])\n", "joint_signal = ewma(joint_data, 10)\n", "assert eq(live_signal, joint_signal[dt(0):]) # The live signal is the same, even though it only received live data for its calculation.\n", "joint_signal[dt(0):]" ] }, { "cell_type": "markdown", "id": "young-environment", "metadata": {}, "source": [ "This allows us to set up three parallel pipelines that share a virtually identical codebase:\n", "\n", "| workflow | historic data | live data | risk analysis |\n", "|---|---|---|---|\n", "| when run? | research/overnight | live | overnight |\n", "| data source? | ts = long timeseries| a = short ts/array | 1000's of sims |\n", "| speed? | slow, non-critical | instantenous | quick |\n", "| apply f to data | x_ = f_(ts) | x = f(a, **x_) | same as live |\n", "| apply g | y_ = g_(ts, x_) | y = g(a, x, **y_) | same as live |\n", "| final result h | z_ = h_(ts, x_, y_) | z = h(a, x, y, **z_)| same as live |\n" ] }, { "cell_type": "markdown", "id": "vertical-sierra", "metadata": {}, "source": [ "Note that for live trading or risk analysis, we tend to switch and run on numpy arrays rather than pandas object. \n", "This speeds up the calculations while introduces no code change.\n", "In the example below we explore how to create state-aware, functions within pyg.\n", "The paradigm is that for most functions, function_ will return not just the timeseries output but also the states.\n", "\n", "### Example: creating a function exposing its state\n", "\n", "Suppose we try to write an ewma crossover function (the difference of two ewma). 
We want to normalize it by its own volatility.\n", "Traditionally we will write:\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "destroyed-saying", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1993-10-20 NaN\n", "1993-10-21 -1.407264\n", "1993-10-22 -1.714259\n", "1993-10-23 1.177760\n", "1993-10-24 -1.220600\n", " ... \n", "2021-03-02 -1.767405\n", "2021-03-03 -1.183420\n", "2021-03-04 -1.764486\n", "2021-03-05 -2.458497\n", "2021-03-06 -2.242366\n", "Length: 10000, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def pandas_crossover(a, fast, slow, vol):\n", " fast_ewma = a.ewm(fast).mean()\n", " slow_ewma = a.ewm(slow).mean() \n", " raw_signal = fast_ewma - slow_ewma\n", " signal_rms = (raw_signal**2).ewm(vol).mean()**0.5\n", " signal_rms[signal_rms==0] = np.nan\n", " normalized = raw_signal/signal_rms\n", " return normalized\n", "\n", "a = pd.Series(np.random.normal(0,1,10000), drange(-9999)); fast = 10; slow = 30; vol = 50\n", "pandas_x = pandas_crossover(a, fast, slow, vol)\n", "pandas_x" ] }, { "cell_type": "markdown", "id": "signal-danish", "metadata": {}, "source": [ "We can quickly rewrite it using pyg:" ] }, { "cell_type": "code", "execution_count": 28, "id": "crucial-retention", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1993-10-20 -1.000000\n", "1993-10-21 -1.407264\n", "1993-10-22 -1.714259\n", "1993-10-23 1.177760\n", "1993-10-24 -1.220600\n", " ... \n", "2021-03-02 -1.767405\n", "2021-03-03 -1.183420\n", "2021-03-04 -1.764486\n", "2021-03-05 -2.458497\n", "2021-03-06 -2.242366\n", "Length: 10000, dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def crossover(a, fast, slow, vol):\n", " fast_ewma = ewma(a, fast)\n", " slow_ewma = ewma(a, slow) \n", " raw_signal = fast_ewma - slow_ewma\n", " signal_rms = ewmrms(raw_signal, vol)\n", " signal_rms = v2na(signal_rms)\n", " normalized = raw_signal/signal_rms\n", " return normalized\n", "x = crossover(a, fast, slow, vol)\n", "assert abs(x-pandas_x).max()<1e-10\n", "x" ] }, { "cell_type": "markdown", "id": "bridal-gambling", "metadata": {}, "source": [ "And with very little additional effort, we can write a new function that also exposes the internal state:\n" ] }, { "cell_type": "code", "execution_count": 29, "id": "regular-crowd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1993-10-20 -1.000000\n", "1993-10-21 -1.407264\n", "1993-10-22 -1.714259\n", "1993-10-23 1.177760\n", "1993-10-24 -1.220600\n", " ... 
\n", "2021-03-02 -1.767405\n", "2021-03-03 -1.183420\n", "2021-03-04 -1.764486\n", "2021-03-05 -2.458497\n", "2021-03-06 -2.242366\n", "Length: 10000, dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_data = 'data'\n", "def crossover_(a, fast, slow, vol, instate = None):\n", " state = Dict(fast = {}, slow = {}, vol = {}) if instate is None else instate\n", " fast_ewma_ = ewma_(a, fast, instate = state.fast)\n", " slow_ewma_ = ewma_(a, slow, instate = state.slow) \n", " raw_signal = fast_ewma_.data - slow_ewma_.data\n", " signal_rms = ewmrms_(raw_signal, vol, instate = state.vol)\n", " normalized = raw_signal/v2na(signal_rms.data)\n", " return Dict(data = normalized, state = Dict(fast = fast_ewma_.state, slow = slow_ewma_.state, vol = signal_rms.state))\n", "\n", "crossover_.output = ['data', 'state'] # output declares the function to have a dict output and is used by cell\n", "\n", "def crossover(a, fast, slow, vol, state = None):\n", " return crossover_(a, fast, slow, vol, instate = state).data\n", "\n", "x_ = crossover_(a, fast, slow, vol)\n", "assert eq(x, x_.data) and eq(x, crossover(a, fast, slow, vol))\n", "x_.data" ] }, { "cell_type": "markdown", "id": "weird-excitement", "metadata": {}, "source": [ "The three give idential results and we can also verify that crossover_ will allow us to split the evaluation to the long-history and the new data:" ] }, { "cell_type": "code", "execution_count": 45, "id": "alpine-airfare", "metadata": {}, "outputs": [], "source": [ "history = a[:9900]\n", "live = a[9900:].values \n", "x_history = crossover_(history, 10, 30, 50)\n", "x_live = crossover(live, 10, 30, 50, state = x_history.state)\n", "x_ = crossover_(a, fast, slow, vol)\n", "assert eq(x_live , x_.data[9900:].values)" ] }, { "cell_type": "markdown", "id": "certain-division", "metadata": {}, "source": [ "Have we gained anything?" ] }, { "cell_type": "code", "execution_count": 46, "id": "white-return", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2021-03-06 23:55:39,746 - pyg - INFO - TIMER:'pandas_crossover' args:[[\"[9900]\", '10', '30', '50'], []] (100 runs) took 0:00:00.373514 sec\n", "2021-03-06 23:55:39,953 - pyg - INFO - TIMER:'crossover_' args:[[\"[9900]\", '10', '30', '50'], []] (100 runs) took 0:00:00.202883 sec\n", "2021-03-06 23:55:40,004 - pyg - INFO - TIMER:'crossover' args:[[\"[100]\", '10', '30', '50'], [\"state=[3]\"]] (100 runs) took 0:00:00.049972 sec\n" ] }, { "data": { "text/plain": [ "('pandas: ', 373, 'pyg history:', 202, 'pyg_live:', 49)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pandas_old = timer(pandas_crossover, 100, time = True)(history, 10, 30, 50)\n", "x_history = crossover_(history, 10, 30, 50)\n", "x_history_time = timer(crossover_, 100, time = True)(history, 10, 30, 50)\n", "x_live = timer(crossover, 100, time = True)(live, 10, 30, 50, state = x_history.state)\n", "'pandas: ', pandas_old.microseconds//1000, 'pyg history:', x_history_time.microseconds//1000, 'pyg_live:', x_live.microseconds//1000" ] }, { "cell_type": "markdown", "id": "vertical-bandwidth", "metadata": {}, "source": [ "We see that pyg is already faster than pandas. Running just the new data using numpy arrays, is about 4-5 times faster still. \n", "Indeed, running 10k 100-day forward scenarios take about 2 seconds at most." 
] }, { "cell_type": "code", "execution_count": 48, "id": "polished-right", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2021-03-06 23:56:10,252 - pyg - INFO - TIMER:'crossover' args:[[\"[100]\", '10', '30', '50'], [\"state=[3]\"]] (1 runs) took 0:00:01.605710 sec\n" ] } ], "source": [ "scenarios = np.random.normal(0,1,(100,10000))\n", "x_scenarios = timer(crossover)(scenarios , 10, 30, 50, state = x_history.state)" ] }, { "cell_type": "markdown", "id": "alone-chuck", "metadata": {}, "source": [ "Using cells, our code looks like this, with live and historical codebase looking pretty similar" ] }, { "cell_type": "code", "execution_count": 49, "id": "honey-headset", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cell\n", "a:\n", " 1993-10-20 0.463739\n", " 1993-10-21 0.429161\n", " 1993-10-22 -0.342095\n", " 1993-10-23 1.192557\n", " 1993-10-24 -0.448828\n", " ... \n", " 2020-11-22 -0.272184\n", " 2020-11-23 0.121197\n", " 2020-11-24 -0.581223\n", " 2020-11-25 -0.682961\n", " 2020-11-26 -1.084583\n", " Length: 9900, dtype: float64\n", "fast:\n", " 10\n", "slow:\n", " 30\n", "vol:\n", " 50\n", "function:\n", " \n", "instate:\n", " None\n", "data:\n", " 1993-10-20 -1.000000\n", " 1993-10-21 -1.407264\n", " 1993-10-22 -1.714259\n", " 1993-10-23 1.177760\n", " 1993-10-24 -1.220600\n", " ... \n", " 2020-11-22 -2.091785\n", " 2020-11-23 -1.765958\n", " 2020-11-24 -1.796933\n", " 2020-11-25 -1.853106\n", " 2020-11-26 -2.044795\n", " Length: 9900, dtype: float64\n", "state:\n", " Dict\n", " fast:\n", " {'t': nan, 't0': 0.9999999999999994, 't1': -0.4251894284980144}\n", " slow:\n", " {'t': nan, 't0': 0.9999999999999983, 't1': -0.14408421908740027}\n", " vol:\n", " {'t': nan, 't0': 0.9999999999999972, 't2': 0.01889897942675779}" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_history = cell(crossover_, a = history, fast = 10, slow = 30, vol = 50)()\n", "x_live = cell(crossover, a = live, fast = 10, slow = 30, vol = 50, state = x_history)()\n", "x_history" ] }, { "cell_type": "code", "execution_count": 50, "id": "northern-quest", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
2020-11-27-2.466036-2.466036
2020-11-28-1.899795-1.899795
2020-11-29-1.573653-1.573653
2020-11-30-1.473624-1.473624
2020-12-01-1.978180-1.978180
.........
2021-03-02-1.767405-1.767405
2021-03-03-1.183420-1.183420
2021-03-04-1.764486-1.764486
2021-03-05-2.458497-2.458497
2021-03-06-2.242366-2.242366
\n", "

100 rows × 2 columns

\n", "
" ], "text/plain": [ " 0 1\n", "2020-11-27 -2.466036 -2.466036\n", "2020-11-28 -1.899795 -1.899795\n", "2020-11-29 -1.573653 -1.573653\n", "2020-11-30 -1.473624 -1.473624\n", "2020-12-01 -1.978180 -1.978180\n", "... ... ...\n", "2021-03-02 -1.767405 -1.767405\n", "2021-03-03 -1.183420 -1.183420\n", "2021-03-04 -1.764486 -1.764486\n", "2021-03-05 -2.458497 -2.458497\n", "2021-03-06 -2.242366 -2.242366\n", "\n", "[100 rows x 2 columns]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([pd.Series(x_live.data, pandas_x.index[-100:]), pandas_x.iloc[-100:]], axis = 1)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }