pyg.mongo

A few words on MongoDB, a no-SQL database, versus SQL:

  • Mongo has ‘collections’ that are the equivalent of tables

  • Mongo will refer to ‘documents’ instead of traditional records. These documents are unstructured and look like trees: dicts of dicts. They can contain arbitrary objects, not just the primitive types a SQL database is designed to support.

  • Mongo collections do not have the concept of primary keys

  • The SQL WHERE clause is replaced by a query in the form of a dict ‘presented’ to the collection object.

  • The SQL SELECT clause is replaced by a ‘projection’ on the cursor, specifying which fields are retrieved.
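To make the analogy concrete, here is a pure-Python sketch (the collection, field names and documents are all hypothetical) of the two dicts and the role each one plays:

```python
# SQL:   SELECT name, age FROM people WHERE age > 30
# Mongo: people.find(filter_, projection)
filter_ = {"age": {"$gt": 30}}        # replaces the WHERE clause
projection = {"name": 1, "age": 1}    # replaces the SELECT clause

# a pure-Python stand-in for what the server does with the two dicts:
docs = [{"name": "ann", "age": 40, "city": "NY"},
        {"name": "bob", "age": 20, "city": "LA"}]
matched = [d for d in docs if d["age"] > 30]                   # WHERE
selected = [{k: d[k] for k in projection} for d in matched]    # SELECT
# selected == [{'name': 'ann', 'age': 40}]
```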

Query generator

We start by simplifying the way we generate mongo query dictionaries.

q and Q

class pyg.mongo._q.Q(keys=None)

The MongoDB interface for query of a collection (table) is via a creation of a complicated looking dict: https://docs.mongodb.com/manual/tutorial/query-documents/

This is rather complicated for the average user, so Q simplifies it greatly. Q is based on TinyDB and users of TinyDB will recognise it: https://tinydb.readthedocs.io/en/latest/usage.html

q is the singleton of Q.

q supports both calling to generate the querying dict

>>> q(a = 1, b = 2)

or

>>> (q.a == 1) & (q.b == 2)  # {"$and": [{"a": {"$eq": 1}}, {"b": {"$eq": 2}}]}
>>> (q.a == 1) | (q.b == 2)  # {"$or": [{"a": {"$eq": 1}}, {"b": {"$eq": 2}}]}

or indeed

>>> q(q.a == 1, q.b  == 2)
Example

>>> from pyg import q
>>> import re
>>> assert dict(q.a == 1) == {"a": {"$eq": 1}}
>>> assert dict(q(a = [1,2])) == {'a': {'$in': [1, 2]}}
>>> assert dict(q(q.a == [1,2], q.b > 3)) == {'$and': [{"a": {"$in": [1, 2]}}, {"b": {"$gt": 3}}]}  # a in [1,2] and b greater than 3
>>> assert dict(q(a = re.compile('^hello'))) == {'a': {'$regex': '^hello'}}     # a regex query using a compiled regular expression
>>> assert dict(q.a.exists + q.b.not_exists)  == {"$and": [{"a": {"$exists": True}}, {"b": {"$exists": False}}]}
>>> assert dict(~(q.a==1))  == {'$not': {"a": {"$eq": 1}}}
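For intuition, here is a minimal, hypothetical re-implementation of the comparison/combination behaviour (MiniQ and q_and are our names; the real Q also supports in-lists, regex, exists and negation):

```python
class MiniQ:
    """Minimal sketch of a TinyDB-style builder of Mongo query dicts."""
    def __init__(self, key=None):
        self._key = key
    def __getattr__(self, key):
        return MiniQ(key)            # q.a grabs the key name
    def __eq__(self, value):
        return {self._key: {"$eq": value}}
    def __gt__(self, value):
        return {self._key: {"$gt": value}}

def q_and(*terms):
    """Combine query dicts with Mongo's $and."""
    return {"$and": list(terms)}

mq = MiniQ()
query = q_and(mq.a == 1, mq.b > 3)
# query == {'$and': [{'a': {'$eq': 1}}, {'b': {'$gt': 3}}]}
```

This produces the same dict shape as the asserts above.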

Tables in Mongo

mongo_cursor

mongo_cursor combines the functionality of a Mongo cursor and a Mongo collection object.

class pyg.mongo._cursor.mongo_cursor(cursor, writer=None, reader=None, query=None, **_)

mongo_cursor is a souped-up combination of mongo.Cursor and mongo.Collection with a simple API.

Parameters

cursor : MongoDB cursor or MongoDB collection

writer : True/False/string, optional

The default is None.

writer allows you to transform the data before saving it in Mongo. You can create a function yourself or use built-in options:

  • False: do nothing, save the document as is

  • True/None: use pyg.base.encode to encode objects. This will transform numpy array/dataframes into bytes that can be stored

  • ‘.csv’: save dataframes into .csv files and then save references to these files in mongo

  • ‘.parquet’: save dataframes into .parquet files and np.ndarray into .npy files.

For .csv and .parquet to work, you will need to specify WHERE the document is to be saved. This can be done in one of two ways:

  • the document has a ‘root’ key, specifying the root directory.

  • you specify the root by setting writer = ‘c:/%name/%surname.parquet’

reader : callable or None, optional

The default is None, which uses decode. Use reader = False to pass the raw document through.

query : dict, optional

This is used to specify the Mongo query, e.g. q.a==1.

**_ : ignored additional parameters.

Example

>>> from pyg import *
>>> cursor = mongo_table('test', 'test')
>>> cursor.drop()

## insert some data

>>> table = dictable(a = range(5)) * dictable(b = range(5))
>>> cursor.insert_many(table)
>>> cursor.set(c = lambda a, b: a * b)
Filtering

>>> assert len(cursor) == 25
>>> assert len(cursor.find(a = 3)) == 5
>>> assert len(cursor.exc(a = 3)) == 20
>>> assert len(cursor.find(a = [3,2]).find(q.b<3)) == 6 ## can chain queries as well as use q to create complicated expressions
Row access

>>> cursor[0]

{'_id': ObjectId('603aec85cd15e2c090c07b87'), 'a': 0, 'b': 0}

>>> cursor[::] - '_id' == dictable(cursor) - '_id'
>>> dictable[25 x 3]
>>> a|b|c 
>>> 0|0|0 
>>> 0|1|0 
>>> 0|2|0 
>>> ...25 rows...
>>> 4|2|8 
>>> 4|3|12
>>> 4|4|16
Column access

>>> cursor[['a', 'b']]  ## just columns 'a' and 'b'
>>> del cursor['c'] ## delete all c
>>> cursor.set(c = lambda a, b: a * b)
>>> assert cursor.find_one(a = 3, b = 2)[0].c == 6
Example

root specification

>>> from pyg import *
>>> t = mongo_table('test', 'test', writer = 'c:/temp/%name/%surname.parquet')
>>> t.drop()
>>> doc = dict(name = 'adam', surname = 'smith', ts = pd.Series(np.arange(10)))
>>> t.insert_one(doc)
>>> assert eq(pd_read_parquet('c:/temp/adam/smith/ts.parquet'), doc['ts'])
>>> assert eq(t[0]['ts'], doc['ts'])
>>> doc = dict(name = 'beth', surname = 'brown', a = np.arange(10))
>>> t.insert_one(doc)

Since mongo_cursor is powerful enough to modify the database, we also have a mongo_reader version which is read-only.

delete_many()

Equivalent to drop: deletes all documents the cursor currently points to.

Note

If you want to drop a subset of the data, then use c.find(criteria).delete_many()

Returns

itself

delete_one(*args, **kwargs)

drops a specific record after verifying exactly one exists.

Parameters

*args : query **kwargs : query

Returns

itself

drop()

Equivalent to delete_many: deletes all documents the cursor currently points to.

Note

If you want to drop a subset of the data, then use c.find(criteria).delete_many()

Returns

itself

insert_many(table)

inserts multiple documents into the collection

Parameters

table : sequence of documents

list of dicts or dictable

Returns

mongo_cursor

Example

simple insertion

>>> from pyg import *
>>> t = mongo_table('test', 'test')
>>> t = t.drop()
>>> values = dictable(a = [1,2,3,4,], b = [5,6,7,8])
>>> t = t.insert_many(values)
>>> t[::]        
>>> dictable[4 x 3]
>>> _id                     |a|b
>>> 602daee68c336f6429a77bdd|1|5
>>> 602daee68c336f6429a77bde|2|6
>>> 602daee68c336f6429a77bdf|3|7
>>> 602daee68c336f6429a77be0|4|8
Example

update

>>> table = t[::]
>>> modified = table(b = lambda b: b**2)
>>> t = t.insert_many(modified)

Since each of the documents we uploaded already has an _id…

>>> assert len(t) == 4
>>> t[::]
>>> dictable[4 x 3]
>>> _id                     |a|b
>>> 602daee68c336f6429a77bdd|1|25
>>> 602daee68c336f6429a77bde|2|36
>>> 602daee68c336f6429a77bdf|3|49
>>> 602daee68c336f6429a77be0|4|64
insert_one(doc)

inserts/updates a single document.

If the document ALREADY has an _id in it, it updates that document. If the document has no _id in it, it inserts it as a new document.

Parameters

doc : dict

document.

Example

>>> from pyg import *
>>> t = mongo_table('test', 'test')
>>> t = t.drop()
>>> values = dictable(a = [1,2,3,4,], b = [5,6,7,8])
>>> t = t.insert_many(values)
Example

used to update an existing document

>>> doc = t[0]
>>> doc['c'] = 8
>>> str(doc)
>>> "{'_id': ObjectId('602d36150a5cd32717323197'), 'a': 1, 'b': 5, 'c': 8}"
>>> t = t.insert_one(doc)
>>> assert len(t) == 4        
>>> assert t[0] == doc        
Example

used to insert

>>> doc = Dict(a = 1, b = 8, c = 10)
>>> t = t.insert_one(doc)
>>> assert len(t) == 5
>>> t.drop()
property raw

returns an unfiltered mongo_reader

set(**kwargs)

updates all documents in the current cursor based on the kwargs. It is similar to update_many but also supports functions

Parameters

kwargs: dict of values to be updated

Example

>>> from pyg import *
>>> t = mongo_table('test', 'test')
>>> t = t.drop()
>>> values = dictable(a = [1,2,3,4,], b = [5,6,7,8])
>>> t = t.insert_many(values)
>>> assert t[::]-'_id' == values
>>> t.set(c = lambda a, b: a+b)
>>> assert t[::]-'_id' == values(c = [6,8,10,12])
>>> t.set(d = 1)
>>> assert t[::]-'_id' == values(c = lambda a, b: a+b)(d = 1)
Returns

itself

update_many(doc, upsert=False)

updates all documents in the current cursor based on the doc. The two are equivalent:

>>> cursor.update_many(doc)
>>> collection.update_many(cursor.spec, {'$set': doc})
Parameters

doc : dict of values to be updated

Returns

itself
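As a pure-Python sketch of the effect (apply_set and the in-memory docs are hypothetical stand-ins; the real update happens server-side in Mongo):

```python
def apply_set(docs, spec, update):
    """Apply a {'$set': ...}-style update to every doc matching spec."""
    for doc in docs:
        if all(doc.get(k) == v for k, v in spec.items()):
            doc.update(update)
    return docs

docs = [{"a": 1, "b": 5}, {"a": 2, "b": 6}]
apply_set(docs, spec={"a": 1}, update={"c": 10})
# docs == [{'a': 1, 'b': 5, 'c': 10}, {'a': 2, 'b': 6}]
```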

update_one(doc, upsert=True)
  • updates a document if an _id is present in doc.

  • inserts a document if an _id is not present and upsert is True.

Parameters

doc : document

doc to be upserted.

upsert : bool, optional

insert if no document present? The default is True.

Returns

doc

document updated.
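The _id-driven branch can be sketched in pure Python (upsert_one and the in-memory table are hypothetical; the real method talks to the collection):

```python
def upsert_one(table, doc, upsert=True):
    """Update the doc with a matching _id; insert if absent and upsert is True."""
    for i, existing in enumerate(table):
        if "_id" in doc and existing.get("_id") == doc["_id"]:
            table[i] = doc          # _id present and found: update in place
            return doc
    if upsert:
        table.append(doc)           # no matching _id: insert
    return doc

table = [{"_id": 1, "a": 1}]
upsert_one(table, {"_id": 1, "a": 2})   # updates the existing document
upsert_one(table, {"_id": 2, "a": 3})   # inserts a new one
# table == [{'_id': 1, 'a': 2}, {'_id': 2, 'a': 3}]
```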

mongo_reader

mongo_reader is a read-only version of the cursor to avoid any unintentional damage to database.

class pyg.mongo._reader.mongo_reader(cursor, writer=None, reader=None, query=None, **_)

mongo_reader is a read-only version of the mongo_cursor. You can instantiate it with a mongo_reader(cursor) call where cursor can be a mongo_cursor, a pymongo.Cursor or a pymongo.Collection

property address
Returns

tuple

A unique combination of the client address, database name and collection name, identifying the collection.

clone(**params)
Returns

mongo_reader

Returns a cloned version of current mongo_reader but allows additional parameters to be set (see spec and project)

property collection
Returns

pymongo.Collection object

count()

cursor.count() and len(cursor) are the same and return the number of documents matching current specification.

distinct(key)

returns the distinct values of the key

Parameters

key : str

a key in the documents.

Returns

list of strings

distinct values

Example

>>> from pyg import *; import pymongo
>>> table = pymongo.MongoClient()['test']['test']
>>> table.insert_one(dict(name = 'alan', surname = 'abrahams', age = 39, marriage = dt(2000)))
>>> table.insert_one(dict(name = 'barbara', surname = 'brown', age = 50, marriage = dt(2020)))
>>> table.insert_one(dict(name = 'charlie', surname = 'cohen', age = 20))
>>> t = mongo_reader(table)
>>> assert t.name == t.distinct('name') == ['alan', 'barbara', 'charlie']
>>> table.drop()
docs(doc='doc', *keys)

self[::] flattens the entire document. At times we want to see the full documents indexed by keys; docs does that, returning a dictable with the keys and the full document in the ‘doc’ column

exc(**kwargs)

filters ‘negatively’, removing documents that match the criteria specified.

Returns

cursor

filtered documents.

Example

>>> from pyg import *; import pymongo
>>> table = pymongo.MongoClient()['test']['test']
>>> table.insert_one(dict(name = 'alan', surname = 'abrahams', age = 39, marriage = dt(2000)))
>>> table.insert_one(dict(name = 'barbara', surname = 'brown', age = 50, marriage = dt(2020)))
>>> table.insert_one(dict(name = 'charlie', surname = 'cohen', age = 20))
>>> t = mongo_reader(table)
>>> assert len(t.exc(name = 'alan')) == 2        
>>> assert len(t.exc(name = ['alan', 'barbara'])) == 1        
>>> table.drop()
find(*args, **kwargs)

Same as self.specify()

The ‘spec’ is the cursor’s filter on documents (can think of it as row-selection) within the collection. We use q (see pyg.mongo._q.q) to specify the filter on the cursor.

Returns

A filtered mongo_reader cursor

Example

>>> from pyg import *; import pymongo
>>> table = pymongo.MongoClient()['test']['test']
>>> table.insert_one(dict(name = 'alan', surname = 'abrahams', age = 39, marriage = dt(2000)))
>>> table.insert_one(dict(name = 'barbara', surname = 'brown', age = 50, marriage = dt(2020)))
>>> table.insert_one(dict(name = 'charlie', surname = 'cohen', age = 20))
>>> t = mongo_reader(table)
>>> assert len(t.find(name = 'alan')) == 1
>>> assert len(t.find(q.age>25)) == 2
>>> assert len(t.find(q.age>25, q.marriage<dt(2010))) == 1
>>> table.drop()
find_one(doc=None, *args, **kwargs)

searches for records based either on the doc or on the args/kwargs specified. Unlike the Mongo cursor, which finds one of many, find_one will throw an exception if more than one document is found.

Returns

A cursor pointing to a single record (document)
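The exactly-one contract can be sketched like this (find_exactly_one is a hypothetical helper over an in-memory list of docs):

```python
def find_exactly_one(docs, **query):
    """Return the single matching doc; raise if zero or several match."""
    matched = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    if len(matched) != 1:
        raise ValueError('expected exactly one document, found %i' % len(matched))
    return matched[0]

docs = [{"a": 1}, {"a": 2}, {"a": 2}]
find_exactly_one(docs, a=1)        # -> {'a': 1}
# find_exactly_one(docs, a=2)      # would raise: two documents match
```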

inc(*args, **kwargs)

Same as self.specify()

The ‘spec’ is the cursor’s filter on documents (can think of it as row-selection) within the collection. We use q (see pyg.mongo._q.q) to specify the filter on the cursor.

Returns

A filtered mongo_reader cursor

Example

>>> from pyg import *; import pymongo
>>> table = pymongo.MongoClient()['test']['test']
>>> table.insert_one(dict(name = 'alan', surname = 'abrahams', age = 39, marriage = dt(2000)))
>>> table.insert_one(dict(name = 'barbara', surname = 'brown', age = 50, marriage = dt(2020)))
>>> table.insert_one(dict(name = 'charlie', surname = 'cohen', age = 20))
>>> t = mongo_reader(table)
>>> assert len(t.inc(name = 'alan')) == 1
>>> assert len(t.inc(q.age>25)) == 2
>>> assert len(t.inc(q.age>25, q.marriage<dt(2010))) == 1
>>> table.drop()
project(projection=None)

The ‘projection’ is the cursor’s column selection on documents. If in SQL we write SELECT col1, col2 FROM …, in Mongo, the cursor.projection = ['col1', 'col2']

Parameters

projection: a list/str of keys we are interested in reading. Note that nested keys are OK: ‘level1.level2.name’ is perfectly good

Returns

A mongo_reader cursor filtered to read just these keys
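Dotted paths work because Mongo treats ‘level1.level2.name’ as a path into the nested document; a pure-Python sketch of the idea (project_keys is our name, not the pyg API):

```python
def project_keys(doc, keys):
    """Pull dotted-path keys out of a nested document."""
    out = {}
    for key in keys:
        node = doc
        for part in key.split('.'):
            node = node[part]       # walk one level per dotted component
        out[key] = node
    return out

doc = {'level1': {'level2': {'name': 'anna'}}, 'age': 46}
project_keys(doc, ['level1.level2.name', 'age'])
# -> {'level1.level2.name': 'anna', 'age': 46}
```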

property projection
Returns

The ‘projection’ is the cursor’s column selection on documents. If in SQL we write SELECT col1, col2 FROM …, in Mongo, the cursor.projection = ['col1', 'col2']

property raw

returns an unfiltered mongo_reader

read(item=0, reader=None)

reads the next document from the collection.

Parameters

item : int, optional

Please read the ith record. The default is 0.

reader : callable/list of callables, optional

When we read a document from the collection, we first transform it. The default behaviour is to use pyg.base._encode.decode, but you may pass reader = False to grab the raw data from Mongo

Returns

document

The document from Mongo

sort(*by)

sorting on server side, per key(s)

Parameters

by : str/list of strs

Returns

sorted cursor

property spec
Returns

The ‘spec’ is the cursor’s filter on documents (can think of it as row-selection) within the collection

specify(*args, **kwargs)

The ‘spec’ is the cursor’s filter on documents (can think of it as row-selection) within the collection. We use q (see pyg.mongo._q.q) to specify the filter on the cursor.

Returns

A filtered mongo_reader cursor

mongo_pk_reader

mongo_pk_reader extends the standard reader to handle tables with primary keys (pk) while being read-only.

class pyg.mongo._pk_reader.mongo_pk_reader(cursor, pk, writer=None, reader=None, query=None, **_)

we set up a system in Mongo to ensure we can mimic tables with primary keys. The way we do this is twofold:

  • At document insertion, we mark old documents sharing the same keys as deleted, by adding a _deleted key to the old doc

  • At reading, we filter for documents where q._deleted.not_exists.
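The two steps can be sketched in pure Python (pk_insert/pk_read and the in-memory table are hypothetical stand-ins for the real collection logic):

```python
import datetime

def pk_insert(table, doc, pk):
    """Mark live docs sharing the same primary keys as _deleted, then insert."""
    key = tuple(doc[k] for k in pk)
    for existing in table:
        if tuple(existing.get(k) for k in pk) == key and '_deleted' not in existing:
            existing['_deleted'] = datetime.datetime.now()   # soft delete
    table.append(doc)

def pk_read(table):
    """Reading filters for q._deleted.not_exists."""
    return [d for d in table if '_deleted' not in d]

table = []
pk_insert(table, {'name': 'anna', 'age': 46}, pk=['name'])
pk_insert(table, {'name': 'anna', 'age': 47}, pk=['name'])
# pk_read(table) == [{'name': 'anna', 'age': 47}]
```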

clone(**kwargs)
Returns

mongo_reader

Returns a cloned version of current mongo_reader but allows additional parameters to be set (see spec and project)

create_index(*keys)

creates a sorted index on the collection

Parameters

*keys : strings

if missing, use the primary keys.

dedup()

In principle, if a single process reads/writes to Mongo, we should not get duplicates. In practice, when multiple clients access the database, we occasionally get multiple records with the same primary keys, leaving us with stale mongo _ids. dedup removes these duplicates.

Returns

mongo_pk_cursor

Hopefully, a table with unique keys.
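A sketch of the deduplication idea, keeping the most recently inserted document per primary-key tuple (the keep-last rule is our assumption about the intent, not the documented behaviour):

```python
def dedup(table, pk):
    """Keep only the last document seen for each primary-key tuple."""
    latest = {}
    for doc in table:
        latest[tuple(doc[k] for k in pk)] = doc   # later docs overwrite earlier ones
    return list(latest.values())

table = [{'name': 'anna', 'age': 46},
         {'name': 'anna', 'age': 47},
         {'name': 'bob', 'age': 25}]
dedup(table, pk=['name'])
# -> [{'name': 'anna', 'age': 47}, {'name': 'bob', 'age': 25}]
```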

docs(doc='doc', *keys)

self[::] flattens the entire document. At times we want to see the full documents indexed by keys; docs does that, returning a dictable with the keys and the full document in the ‘doc’ column

mongo_pk_cursor

mongo_pk_cursor is our go-to object and it manages all our primary-keyed tables.

class pyg.mongo._pk_cursor.mongo_pk_cursor

encoding docs before saving to mongo

Before we save data to Mongo, we may need to transform it, especially if we are to save pd.DataFrames. By default, we encode them into bytes and push them to Mongo. You can choose instead to save pandas dataframes/series as .parquet files and numpy arrays as .npy files.

parquet_write

pyg.mongo._encoders.parquet_write(doc, root=None)

MongoDB is great for manipulating/searching dict keys/values. However, we may want to save the actual dataframes in each doc in a file system:

  • The DataFrames are stored as bytes in MongoDB anyway, so they are not searchable

  • Storing in files allows other non-python/non-MongoDB users easier access, allowing data to be detached from the app

  • The MongoDB free version has limitations on document size

  • A file-based system may be faster, especially if saved locally rather than over a network

  • For data licensing issues, data must not sit on servers but be stored on a local computer

Therefore, the doc encoder will cycle through the elements in the doc. Each time it sees a pd.DataFrame/pd.Series, it will:

  • determine where to write it (with the help of the doc)

  • save it to a .parquet file
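The traversal can be sketched without pandas; doc_write, is_frame and write_frame are hypothetical stand-ins for the real type check and the actual .parquet write, and the reference shape stored back in the doc is illustrative only:

```python
def doc_write(doc, root, is_frame, write_frame):
    """Walk a doc tree; replace each dataframe-like value with a file reference."""
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            out[key] = doc_write(value, root + '/' + key, is_frame, write_frame)
        elif is_frame(value):
            path = root + '/' + key + '.parquet'
            write_frame(path, value)   # the real encoder calls DataFrame.to_parquet
            out[key] = {'path': path}  # save the reference, not the data
        else:
            out[key] = value
    return out

written = {}                                   # stands in for the file system
doc = {'name': 'adam', 'ts': [1, 2, 3]}        # pretend the list is a DataFrame
encoded = doc_write(doc, 'c:/temp/adam',
                    is_frame=lambda v: isinstance(v, list),
                    write_frame=written.__setitem__)
# encoded == {'name': 'adam', 'ts': {'path': 'c:/temp/adam/ts.parquet'}}
```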

csv_write

pyg.mongo._encoders.csv_write(doc, root=None)

MongoDB is great for manipulating/searching dict keys/values. However, we may want to save the actual dataframes in each doc in a file system:

  • The DataFrames are stored as bytes in MongoDB anyway, so they are not searchable

  • Storing in files allows other non-python/non-MongoDB users easier access, allowing data to be detached from the original application

  • The MongoDB free version has limitations on document size

  • A file-based system may be faster, especially if saved locally rather than over a network

  • For data licensing issues, data must not sit on servers but be stored on a local computer

Therefore, the doc encoder will cycle through the elements in the doc. Each time it sees a pd.DataFrame/pd.Series, it will:

  • determine where to write it (with the help of the doc)

  • save it to a .csv file

cells in Mongo

Now that we have a database, we construct cells that can load/save data to collections.

db_cell

class pyg.mongo._db_cell.db_cell(function=None, output=None, db=None, **kwargs)

a db_cell is a specialized cell with a ‘db’ member pointing to a database where cell is to be stored. We use this to implement save/load for the cell.

It is important to recognize the duality in the design:

  • the job of cell.db is to be able to save/load based on the primary keys.

  • the job of the cell is to provide the primary keys to the db object.

The cell saves itself by ‘presenting’ itself to cell.db() and saying: go on, load my data based on my keys.

Example

saving & loading

>>> from pyg import *
>>> people = partial(mongo_table, db = 'test', table = 'test', pk = ['name', 'surname'])
>>> anna = db_cell(db = people, name = 'anna', surname = 'abramzon', age = 46).save()
>>> bob  = db_cell(db = people, name = 'bob', surname = 'brown', age = 25).save()
>>> james = db_cell(db = people, name = 'james', surname = 'johnson', age = 39).save()

Now we can pull the data directly from the database

>>> people()['name', 'surname', 'age'][::]
>>> dictable[3 x 4]
>>> _id                     |age|name |surname 
>>> 601e732e0ef13bec9cd8a6cb|39 |james|johnson 
>>> 601e73db0ef13bec9cd8a6d4|46 |anna |abramzon
>>> 601e73db0ef13bec9cd8a6d7|25 |bob  |brown       

db_cell can implement a function:

>>> def is_young(age):
>>>    return age < 30
>>> bob.function = is_young
>>> bob = bob.go()
>>> assert bob.data is True

When run, it saves its new data to Mongo, and we can later load that data:

>>> new_cell_with_just_db_and_keys = db_cell(db = people, name = 'bob', surname = 'brown')
>>> assert 'age' not in new_cell_with_just_db_and_keys 
>>> now_with_the_data_from_database = new_cell_with_just_db_and_keys.load()
>>> assert now_with_the_data_from_database.age == 25
>>> people()['name', 'surname', 'age', 'data'][::]
>>>  dictable[3 x 4]
>>> _id                     |age|name |surname |data
>>> 601e732e0ef13bec9cd8a6cb|39 |james|johnson |None
>>> 601e73db0ef13bec9cd8a6d4|46 |anna |abramzon|None
>>> 601e73db0ef13bec9cd8a6d7|25 |bob  |brown   |True
>>> people().raw.drop()    
load(mode=0)

loads a document from the database and updates various keys.

Persistency

Since we want to avoid hitting the database, there is a singleton GRAPH, a dict, storing the cells by their address. Every time we load/save from/to Mongo, we also update GRAPH.

We use the GRAPH often so if you want to FORCE the cell to go to the database when loading, use this:

>>> cell.load(-1) 
>>> cell.load(-1).load(0)  # clear GRAPH and load from db
>>> cell.load([0])     # same thing: clear GRAPH and then load if available
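The caching behaviour can be sketched as a plain dict keyed by address (load_cell and load_from_db are hypothetical names; the real GRAPH stores cells by their collection address and primary keys):

```python
GRAPH = {}   # singleton cache: address -> cell

def load_cell(address, load_from_db, mode=0):
    """Return the cached cell when possible; otherwise hit the database."""
    if isinstance(mode, list):        # load([0]): clear GRAPH first, then load
        GRAPH.pop(address, None)
        mode = mode[0]
    if mode == -1:                    # load(-1): clear GRAPH, do not load
        GRAPH.pop(address, None)
        return None
    if address not in GRAPH:
        GRAPH[address] = load_from_db(address)
    return GRAPH[address]
```

Calling load_cell twice with the same address hits the database only once; passing mode=[0] forces a fresh read.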
Merge of cached cell and calling cell

Once we load from memory (either MongoDB or GRAPH), we tree_update the cached cell with the new values in the current cell. This means that it is next to impossible to actually delete keys. If you want to delete keys in a cell/cells in the database, you need to:

>>> del db.inc(filters)['key.subkey']
Parameters

mode : int/datetime, optional

  • -1 : does not load, and clears the GRAPH

  • 0 : loads from the database if found; if not found, returns the original document

  • 1 : throws an exception if no document is found in the database

  • a date : returns the version alive at that date

The default is 0.

If you enclose any of these in a list, the GRAPH is cleared prior to running and the database is called.

Returns

document

save()

Saves the cell for persistency. Not implemented for simple cell. see db_cell

Returns

cell

self, saved.

periodic_cell

class pyg.mongo._periodic_cell.periodic_cell(function=None, output=None, db=None, _period='1b', updated=None, **kwargs)

periodic_cell inherits from db_cell its ability to save itself in MongoDB using its db member. Its calculation schedule depends on when it was last updated.

Example

>>> from pyg import *
>>> c = periodic_cell(lambda a: a + 1, a = 0)

We now assert it needs to be calculated and calculate it…

>>> assert c.run()
>>> c = c.go()
>>> assert c.data == 1
>>> assert not c.run()

Now let us cheat and tell it, it was last run 3 days ago…

>>> c.updated = dt(-3)
>>> assert c.run()
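The underlying schedule check can be sketched as: recalculate if never updated, or if the last update is older than the period (needs_run is our name, and a fixed number of days stands in for the ‘1b’ business-day syntax):

```python
import datetime

def needs_run(updated, period_days=1, now=None):
    """A periodic cell runs if never updated, or stale beyond its period."""
    if updated is None:
        return True
    now = now or datetime.datetime.now()
    return (now - updated) > datetime.timedelta(days=period_days)

now = datetime.datetime(2021, 3, 1)
assert needs_run(None)                                          # never run
assert needs_run(now - datetime.timedelta(days=3), now=now)     # 3 days stale
assert not needs_run(now, now=now)                              # just updated
```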
run()

checks if the cell needs calculation. This depends on the nature of the cell. By default (for cell and db_cell), if the cell is already calculated (so that cell._output exists), run() returns False; otherwise True

Returns

bool

run cell?

Example

>>> c = cell(lambda x: x+1, x = 1)
>>> assert c.run()
>>> c = c()
>>> assert c.data == 2 and not c.run()

get_cell

db_save

pyg.mongo._db_cell.db_save(value)

saves a db_cell to the database. Iterates through lists and dicts

Parameters

value: obj

db_cell (or list/dict of) to be saved

Example

>>> from pyg import *
>>> db = partial(mongo_table, table = 'test', db = 'test', pk = ['a','b'])
>>> c = db_cell(add_, a = 2, b = 3, key = 'test', db = db)
>>> c = db_save(c)    
>>> assert get_cell('test', 'test', a = 2, b = 3).key == 'test'

db_load

pyg.mongo._db_cell.db_load(value, mode=0)

loads a db_cell from the database. Iterates through lists and dicts

Parameters

value: obj

db_cell (or list/dict of) to be loaded

mode: int

loading mode: -1: don’t load; +1: load and throw an exception if not found; 0: load if found