api2db.ingest.post_process package

Submodules

api2db.ingest.post_process.column_add module

Contains the ColumnAdd class

Summary of ColumnAdd Usage:

DataFrame df

Foo  Bar
1    A
2    B
3    C

post = ColumnAdd(key="FooBar", lam=lambda: 5, dtype=int)

DataFrame df

Foo  Bar  FooBar
1    A    5
2    B    5
3    C    5

Example Usage of ColumnAdd:

>>> import pandas as pd
... from api2db.ingest.post_process.column_add import ColumnAdd
... def f():
...     return 5
... df = pd.DataFrame({"Foo": [1, 2, 3], "Bar": ["A", "B", "C"]})   # Setup
...
... post = ColumnAdd(key="FooBar", lam=f, dtype=int)   # lam takes no arguments
... post.lam_wrap(df)
pd.DataFrame({"Foo": [1, 2, 3], "Bar": ["A", "B", "C"], "FooBar": [5, 5, 5]})
class api2db.ingest.post_process.column_add.ColumnAdd(key: str, lam: Callable[[], Any], dtype: Any)

Bases: api2db.ingest.post_process.post.Post

Used to add global values to a DataFrame, primarily for timestamps/ids

__init__(key: str, lam: Callable[[], Any], dtype: Any)

Creates a ColumnAdd object

Parameters
  • key – The column name for the DataFrame

  • lam – A zero-argument function whose return value is placed in every row of the key column

  • dtype – The Python native type of the function's return value

ctype

type of the data processor

Type

str

lam_wrap(lam_arg: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Overrides super class method

Workflow:

  1. Assign the lam function return to lam_arg[self.key]

  2. Typecast lam_arg[self.key] to dtype

  3. Return lam_arg

Parameters

lam_arg – The DataFrame to add a column to

Returns

The modified DataFrame
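
The three workflow steps can be approximated in plain pandas. This is a minimal sketch, not api2db's implementation: the helper name column_add is hypothetical, and the cast is assumed to behave like Series.astype.

```python
from typing import Any, Callable

import pandas as pd


def column_add(df: pd.DataFrame, key: str, lam: Callable[[], Any], dtype: Any) -> pd.DataFrame:
    """Hypothetical stand-in for ColumnAdd.lam_wrap."""
    df[key] = lam()                  # 1. broadcast the single return value to every row
    df[key] = df[key].astype(dtype)  # 2. cast the new column (assumed to mirror astype)
    return df                        # 3. return the same, mutated DataFrame


df = pd.DataFrame({"Foo": [1, 2, 3], "Bar": ["A", "B", "C"]})
result = column_add(df, key="FooBar", lam=lambda: 5, dtype=int)
```

Because lam is called once, every row receives the same value, which is why the class suits timestamps and batch ids.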

api2db.ingest.post_process.column_apply module

Contains the ColumnApply class

Summary of ColumnApply Usage:

DataFrame df

Foo  Bar
1    A
2    B
3    C

post = ColumnApply(key="Foo", lam=lambda x: x + 1, dtype=int)

DataFrame df

Foo  Bar
2    A
3    B
4    C

Example Usage of ColumnApply:

>>> import pandas as pd
... from api2db.ingest.post_process.column_apply import ColumnApply
... df = pd.DataFrame({"Foo": [1, 2, 3], "Bar": ["A", "B", "C"]})   # Setup
...
... post = ColumnApply(key="Foo", lam=lambda x: x + 1, dtype=int)
... post.lam_wrap(df)
pd.DataFrame({"Foo": [2, 3, 4], "Bar": ["A", "B", "C"]})
class api2db.ingest.post_process.column_apply.ColumnApply(key: str, lam: Callable[[Any], Any], dtype: Any)

Bases: api2db.ingest.post_process.post.Post

Used to apply a function across the rows in a column of a DataFrame

__init__(key: str, lam: Callable[[Any], Any], dtype: Any)

Creates a ColumnApply Object

Parameters
  • key – The column to apply the function to

  • lam – The function to apply

  • dtype – The python native type of the function output

ctype

type of data processor

Type

str

lam_wrap(lam_arg: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Overrides a super class method

Workflow:

  1. Apply lam to lam_arg[self.key]

  2. Cast lam_arg[self.key] to dtype

  3. Return lam_arg

Parameters

lam_arg – The DataFrame to modify

Returns

The modified DataFrame
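
A rough pandas equivalent of this workflow is sketched below. The helper name column_apply is hypothetical, and it assumes lam is applied element by element via Series.apply, which the source does not confirm.

```python
from typing import Any, Callable

import pandas as pd


def column_apply(df: pd.DataFrame, key: str, lam: Callable[[Any], Any], dtype: Any) -> pd.DataFrame:
    """Hypothetical stand-in for ColumnApply.lam_wrap."""
    df[key] = df[key].apply(lam)     # 1. apply lam to each value in the column
    df[key] = df[key].astype(dtype)  # 2. cast the column (assumed to mirror astype)
    return df                        # 3. return the same, mutated DataFrame


df = pd.DataFrame({"Foo": [1, 2, 3], "Bar": ["A", "B", "C"]})
result = column_apply(df, key="Foo", lam=lambda x: x + 1, dtype=int)
```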

api2db.ingest.post_process.columns_calculate module

Contains the ColumnsCalculate class

Note

ColumnsCalculate can be used to

  1. Replace columns in a DataFrame with calculated values

  2. Add new columns to a DataFrame based on calculations from existing columns

Summary of ColumnsCalculate Usage:

DataFrame df

Foo  Bar
1    2
2    4
3    8

def foobar(df):
    df["Foo+Bar"] = df["Foo"] + df["Bar"]
    df["Foo*Bar"] = df["Foo"] * df["Bar"]
    return df[["Foo+Bar", "Foo*Bar"]]

post = ColumnsCalculate(keys=["Foo+Bar", "Foo*Bar"], lam=lambda x: foobar(x), dtypes=[int, int])

DataFrame df

Foo  Bar  Foo+Bar  Foo*Bar
1    2    3        2
2    4    6        8
3    8    11       24

Example Usage of ColumnsCalculate:

>>> import pandas as pd
... from api2db.ingest.post_process.columns_calculate import ColumnsCalculate
... df = pd.DataFrame({"Foo": [1, 2, 3], "Bar": [2, 4, 8]})   # Setup
...
... def foobar(d):
...     d["Foo+Bar"] = d["Foo"] + d["Bar"]
...     d["Foo*Bar"] = d["Foo"] * d["Bar"]
...     return d[["Foo+Bar", "Foo*Bar"]]
...
... post = ColumnsCalculate(keys=["Foo+Bar", "Foo*Bar"], lam=lambda x: foobar(x), dtypes=[int, int])
... post.lam_wrap(df)
pd.DataFrame({"Foo": [1, 2, 3], "Bar": [2, 4, 8], "Foo+Bar": [3, 6, 11], "Foo*Bar": [2, 8, 24]})
class api2db.ingest.post_process.columns_calculate.ColumnsCalculate(keys: List[str], lam: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame], dtypes: List[Any])

Bases: api2db.ingest.post_process.post.Post

Used to calculate new column values to add to the DataFrame

__init__(keys: List[str], lam: Callable[[pandas.core.frame.DataFrame], pandas.core.frame.DataFrame], dtypes: List[Any])

Creates a ColumnsCalculate object

Parameters
  • keys – A list of the keys to add/replace in the existing DataFrame

  • lam – A function that takes a DataFrame and returns a DataFrame whose column names match keys and whose columns have, or can be cast to, the corresponding dtypes

  • dtypes – A list of Python native types corresponding to keys

ctype

type of data processor

Type

str

lam_wrap(lam_arg: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Overrides super class method

Workflow:

  1. Create a temporary DataFrame tmp_df by applying lam to lam_arg

  2. For each key in self.keys set lam_arg[key] = tmp_df[key]

  3. For each key in self.keys cast lam_arg[key] to the appropriate pandas dtype

  4. Return lam_arg

Parameters

lam_arg – The DataFrame to modify

Returns

The modified DataFrame
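
The four steps can be sketched in plain pandas as follows. The helper name columns_calculate is hypothetical, and pairing keys with dtypes by position is an assumption rather than documented behavior.

```python
from typing import Any, Callable, List

import pandas as pd


def columns_calculate(
    df: pd.DataFrame,
    keys: List[str],
    lam: Callable[[pd.DataFrame], pd.DataFrame],
    dtypes: List[Any],
) -> pd.DataFrame:
    """Hypothetical stand-in for ColumnsCalculate.lam_wrap."""
    tmp_df = lam(df)                     # 1. compute the new columns
    for key in keys:
        df[key] = tmp_df[key]            # 2. add or replace each column
    for key, dtype in zip(keys, dtypes):
        df[key] = df[key].astype(dtype)  # 3. cast (keys and dtypes assumed positional)
    return df                            # 4. return the same, mutated DataFrame


def foobar(d: pd.DataFrame) -> pd.DataFrame:
    # Build the result without mutating the input frame.
    return pd.DataFrame({"Foo+Bar": d["Foo"] + d["Bar"], "Foo*Bar": d["Foo"] * d["Bar"]})


df = pd.DataFrame({"Foo": [1, 2, 3], "Bar": [2, 4, 8]})
result = columns_calculate(df, keys=["Foo+Bar", "Foo*Bar"], lam=foobar, dtypes=[int, int])
```

Here foobar returns a fresh DataFrame instead of writing into its argument, so the copy in step 2 is the only place the original frame changes.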

api2db.ingest.post_process.date_cast module

Contains the DateCast class

Summary of DateCast Usage:

DataFrame df

Foo                  Bar
2021-04-29 01:39:00  False
2021-04-29 01:39:00  False
Bar!                 True

DataFrame df.dtypes

Foo     Bar
string  bool

post = DateCast(key="Foo", fmt="%Y-%m-%d %H:%M:%S")

DataFrame df

Foo                  Bar
2021-04-29 01:39:00  False
2021-04-29 01:39:00  False
NaT                  True

DataFrame df.dtypes

Foo             Bar
datetime64[ns]  bool

class api2db.ingest.post_process.date_cast.DateCast(key: str, fmt: str)

Bases: api2db.ingest.post_process.post.Post

Used to cast columns containing dates in string format to pandas DateTimes

__init__(key: str, fmt: str)

Creates a DateCast object

Parameters
  • key – The name of the column containing strings that should be cast to datetimes

  • fmt – A format string that specifies the datetime format of the strings in the column named key

ctype

type of data processor

Type

str

lam_wrap(lam_arg: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Overrides super class method

Workflow:

  1. Attempt to cast lam_arg[self.key] from strings to datetimes; values that cannot be parsed become NaT

  2. Return the modified lam_arg

Parameters

lam_arg – The DataFrame to modify

Returns

The modified DataFrame
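
The behavior shown in the summary tables, where the unparseable value "Bar!" becomes NaT, matches pandas.to_datetime with errors="coerce". The sketch below assumes that is how the cast is performed; the helper name date_cast is hypothetical.

```python
import pandas as pd


def date_cast(df: pd.DataFrame, key: str, fmt: str) -> pd.DataFrame:
    """Hypothetical stand-in for DateCast.lam_wrap."""
    # errors="coerce" turns strings that do not match fmt into NaT instead of raising.
    df[key] = pd.to_datetime(df[key], format=fmt, errors="coerce")
    return df


df = pd.DataFrame(
    {
        "Foo": ["2021-04-29 01:39:00", "2021-04-29 01:39:00", "Bar!"],
        "Bar": [False, False, True],
    }
)
result = date_cast(df, key="Foo", fmt="%Y-%m-%d %H:%M:%S")
```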

api2db.ingest.post_process.drop_na module

Contains the DropNa class

Simply a shortcut class for a common operation.

Summary of DropNa Usage:

See the pandas documentation for DataFrame.dropna

class api2db.ingest.post_process.drop_na.DropNa(keys: List[str])

Bases: api2db.ingest.post_process.post.Post

Used to drop rows that have null values in the specified columns

__init__(keys: List[str])

Creates a DropNa object

Parameters

keys – The columns to check for null values; rows with a null in any of them are dropped

ctype

type of data processor

Type

str

lam_wrap(lam_arg: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Overrides super class method

Shortcut used to drop null values. Performs pd.DataFrame.dropna(subset=self.keys)

Parameters

lam_arg – The DataFrame to modify

Returns

The modified DataFrame
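
As an illustration of the underlying pandas call, the sketch below drops rows with a null in the "Foo" column only. The sample data is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"Foo": [1.0, None, 3.0], "Bar": ["A", "B", None]})

# Only "Foo" is checked, so the row with a null "Bar" is kept.
result = df.dropna(subset=["Foo"])
```

Note that dropna keeps the original index labels, so result is indexed by 0 and 2 rather than renumbered.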

api2db.ingest.post_process.merge_static module

Contains the MergeStatic class

Note

MergeStatic merges incoming data with a locally stored data set. A common use case is a data vendor whose API returns the data points “Foo”, “Bar”, and “location_id”, where “location_id” references a separate data set.

Data providers often publish this reference information in a file that is updated infrequently, i.e. is mostly static.

The typical workflow of a MergeStatic instance is as follows:

  1. Create a LocalStream with mode set to update or replace and a target like CACHE/my_local_stream.pickle

  2. Set the LocalStream to run periodically (every 6 hours, 24 hours, 10 days, or however often this data is updated)

  3. Add a MergeStatic object to the post-processors of the frequently updating data and set its path to the LocalStream storage path.

class api2db.ingest.post_process.merge_static.MergeStatic(key: str, path: str)

Bases: api2db.ingest.post_process.post.Post

Merges incoming data with a locally stored DataFrame

__init__(key: str, path: str)

Creates a MergeStatic object

Parameters
  • key – The key that the DataFrames should be merged on

  • path – The path to the locally stored file containing the pickled DataFrame to merge with

ctype

type of data processor

Type

str

lam_wrap(lam_arg: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Overrides super class method

Workflow:

  1. Load DataFrame df from file specified at self.path

  2. Left-merge lam_arg with df on self.key

  3. Return the modified DataFrame

Parameters

lam_arg – The DataFrame to modify

Returns

The modified DataFrame
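
The load-and-merge steps can be sketched with pandas alone. The helper name merge_static, the "city" column, and the sample values are hypothetical; a temporary directory stands in for the CACHE folder.

```python
import os
import tempfile

import pandas as pd


def merge_static(df: pd.DataFrame, key: str, path: str) -> pd.DataFrame:
    """Hypothetical stand-in for MergeStatic.lam_wrap."""
    static_df = pd.read_pickle(path)                # 1. load the stored DataFrame
    return df.merge(static_df, how="left", on=key)  # 2-3. left-merge and return


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "my_local_stream.pickle")
    # Mostly static reference data, as a LocalStream might have stored it.
    pd.DataFrame({"location_id": [1, 2], "city": ["Oslo", "Lima"]}).to_pickle(path)

    incoming = pd.DataFrame({"Foo": [10, 20, 30], "location_id": [2, 1, 3]})
    result = merge_static(incoming, key="location_id", path=path)
```

A left merge keeps every incoming row, so a location_id with no match in the static file (3 here) yields a null in the merged columns instead of being dropped.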

api2db.ingest.post_process.post module

Contains the Post class

class api2db.ingest.post_process.post.Post

Bases: api2db.ingest.base_lam.BaseLam

Base class for all post-processors

static typecast(dtype: Any) → str

Returns a string that can be used for typecasting to a pandas dtype.

Parameters

dtype – A python native type

Returns

A string that can be used in conjunction with a pandas DataFrame/Series for typecasting
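
To illustrate the idea, the sketch below maps Python native types to pandas dtype strings and uses the result with Series.astype. The mapping itself is an assumption made for this example; the strings that api2db actually returns are not documented here.

```python
from typing import Any

import pandas as pd

# Hypothetical mapping; the real typecast may return different strings.
_PANDAS_DTYPES = {int: "Int64", float: "float64", str: "string", bool: "bool"}


def typecast(dtype: Any) -> str:
    """Return a pandas dtype string for a Python native type (illustrative only)."""
    return _PANDAS_DTYPES.get(dtype, "object")


casted = pd.Series([1, 2, 3]).astype(typecast(int))
```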

Module contents

Original Author

Tristen Harr

Creation Date

04/29/2021

Revisions

None