Features

Collection of feature calculators

QuarterlyFeatures

class ml_investment.features.QuarterlyFeatures(data_key: str, columns: typing.List[str], quarter_counts: typing.List[int] = [2, 4, 10], max_back_quarter: int = 10, min_back_quarter: int = 0, stats: typing.Dict[str, typing.Callable] = {'max': <function amax>, 'mean': <function mean>, 'median': <function median>, 'min': <function amin>, 'std': <function std>}, calc_stats_on_diffs: bool = True, data_preprocessing: typing.Optional[typing.Callable] = None, n_jobs: int = 2, verbose: bool = False)

Bases: object

Feature calculator for quarterly-based statistics. Returns features for company quarter slices.

Parameters

data_key – key of the dataloader in the data argument passed to calculate()
columns – column names for feature calculation (e.g. revenue, debt)
quarter_counts – list of quarter counts for statistics calculation. E.g. if quarter_counts = [2] then statistics will be calculated on the current and previous quarter
max_back_quarter – upper bound of company slices in time. If max_back_quarter = 1 then features will be calculated only for the current company quarter. If max_back_quarter is larger than the total number of quarters for the company then features will be calculated for all quarters
min_back_quarter – lower bound of company slices in time. If min_back_quarter = 0 (default) then features will be calculated for all quarters. If min_back_quarter = 2 then the current and previous quarter slices will not be used for feature calculation
stats – aggregation functions for feature calculation, given as Dict[str, Callable]. Keys of this dict are used as feature name prefixes. Values should implement the foo(x: List) -> float interface
calc_stats_on_diffs – whether to calculate statistics on series diffs (np.diff(series))
data_preprocessing – function implementing the foo(x) -> x_ interface. It is applied before feature calculation
n_jobs – number of threads for calculation
verbose – whether to show progress
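How quarter_counts, stats and calc_stats_on_diffs interact can be sketched in plain Python. This is only an illustration of the parameter descriptions above, not the library's implementation: the revenue values, the window_features helper and the feature names are hypothetical, the standard-library functions stand in for the NumPy defaults, and it is an assumption that diff statistics are produced alongside the plain ones.

```python
import statistics

# Hypothetical quarterly revenue for one company, newest quarter first.
revenue = [120.0, 110.0, 100.0, 95.0, 90.0]

# Stand-ins for the default `stats` dict (the library defaults are NumPy functions).
stats = {"max": max, "mean": statistics.fmean, "min": min}


def diffs(series):
    """Differences between consecutive elements, like np.diff(series)."""
    return [b - a for a, b in zip(series, series[1:])]


def window_features(series, quarter_counts, stats, calc_stats_on_diffs=True):
    """Compute each statistic over the most recent `count` quarters."""
    feats = {}
    for count in quarter_counts:
        window = series[:count]
        for name, func in stats.items():
            # Feature names are illustrative, not the library's naming scheme.
            feats[f"{name}_last{count}"] = func(window)
            if calc_stats_on_diffs and len(window) > 1:
                feats[f"{name}_diff_last{count}"] = func(diffs(window))
    return feats


features = window_features(revenue, quarter_counts=[2, 4], stats=stats)
print(features["mean_last2"])      # mean of the current and previous quarter
print(features["max_diff_last4"])  # largest consecutive difference in the 4-quarter window
```

Any callable that maps a list to a single number can be added to the stats dict in the same way.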

calculate(data: Dict, index: List[str]) → pandas.core.frame.DataFrame

Interface to calculate features for tickers based on data

Parameters

data – dict with a field named after the value of the data_key param of __init__(). This field should contain an object implementing the load(index) -> pd.DataFrame interface
index – list of tickers to calculate features for, e.g. ['AAPL', 'TSLA']

Returns

resulting features with index ['ticker', 'date']. Each row contains the features of the ticker company for the quarter at date

Return type

pd.DataFrame
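Which quarters end up as rows is governed by max_back_quarter and min_back_quarter. The following sketch follows the parameter descriptions above, with offset 0 meaning the current quarter. The slice_starts helper and the dates are hypothetical, and the exact bounds used inside the library are an assumption.

```python
# Hypothetical quarter dates for one company, newest first.
quarter_dates = ["2021-03-31", "2020-12-31", "2020-09-30", "2020-06-30"]


def slice_starts(n_quarters, max_back_quarter=10, min_back_quarter=0):
    """Offsets (0 = current quarter) for which a feature row is produced."""
    upper = min(max_back_quarter, n_quarters)
    return list(range(min_back_quarter, upper))


# max_back_quarter=1 keeps only the current quarter.
only_current = [quarter_dates[i] for i in slice_starts(4, max_back_quarter=1)]

# min_back_quarter=2 skips the current and previous quarter slices.
older_only = [quarter_dates[i] for i in slice_starts(4, min_back_quarter=2)]

# A bound larger than the available history covers every quarter.
all_quarters = [quarter_dates[i] for i in slice_starts(4, max_back_quarter=10)]

print(only_current, older_only, all_quarters)
```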

QuarterlyDiffFeatures

class ml_investment.features.QuarterlyDiffFeatures(data_key: str, columns: List[str], compare_quarter_idxs: List[int] = [1, 4], max_back_quarter: int = 10, min_back_quarter: int = 0, norm: bool = True, data_preprocessing: Optional[Callable] = None, n_jobs: int = 2, verbose: bool = False)

Bases: object

Feature calculator that evaluates the quarter-over-quarter progress of company indicators (revenue, debt, etc.). Returns features for company quarter slices.

Parameters

data_key – key of the dataloader in the data argument passed to calculate()
columns – column names for feature calculation (e.g. revenue, debt)
compare_quarter_idxs – list of back-quarter indexes for progress calculation. E.g. if compare_quarter_idxs = [1] then the current quarter will be compared with the previous quarter. If compare_quarter_idxs = [4] then the current quarter will be compared with the same quarter of the previous year
max_back_quarter – upper bound of company slices in time. If max_back_quarter = 1 then features will be calculated only for the current company quarter. If max_back_quarter is larger than the total number of quarters for the company then features will be calculated for all quarters
min_back_quarter – lower bound of company slices in time. If min_back_quarter = 0 (default) then features will be calculated for all quarters. If min_back_quarter = 2 then the current and previous quarter slices will not be used for feature calculation
norm – whether to normalize the difference by the compared quarter
data_preprocessing – function implementing the foo(x) -> x_ interface. It is applied before feature calculation
n_jobs – number of threads for calculation
verbose – whether to show progress
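The comparison described by compare_quarter_idxs and norm can be sketched as follows. The compare_quarters helper and the data are hypothetical, and dividing by the compared quarter's value is only one reading of "normalize to compare quarter"; the library may normalize differently.

```python
# Hypothetical quarterly revenue, newest quarter first.
revenue = [120.0, 110.0, 105.0, 100.0, 96.0]


def compare_quarters(series, compare_quarter_idxs, norm=True):
    """Change of the current quarter relative to the quarter `idx` steps back."""
    feats = {}
    current = series[0]
    for idx in compare_quarter_idxs:
        if idx >= len(series):
            feats[idx] = None  # not enough history for this comparison
            continue
        previous = series[idx]
        change = current - previous
        if norm:
            # Assumed normalization: relative change versus the compared quarter.
            change = change / previous if previous != 0 else None
        feats[idx] = change
    return feats


progress = compare_quarters(revenue, compare_quarter_idxs=[1, 4])
print(progress[1])  # change versus the previous quarter
print(progress[4])  # change versus the same quarter one year earlier
```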

calculate(data: Dict, index: List[str]) → pandas.core.frame.DataFrame

Interface to calculate features for tickers based on data

Parameters

data – dict with a field named after the value of the data_key param of __init__(). This field should contain an object implementing the load(index) -> pd.DataFrame interface
index – list of tickers to calculate features for, e.g. ['AAPL', 'TSLA']

Returns

resulting features with index ['ticker', 'date']. Each row contains the features of the ticker company for the quarter at date

Return type

pd.DataFrame

BaseCompanyFeatures

class ml_investment.features.BaseCompanyFeatures(data_key: str, cat_columns: List[str], verbose: bool = False)

Bases: object

Feature calculator for basic company information (sector, industry, etc.). Encodes categorical columns via hash-based label encoding. Returns features for the current company state.

Parameters

data_key – key of the dataloader in the data argument passed to calculate()
cat_columns – column names of the categorical features to encode
verbose – whether to show progress
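The idea of hash-based label encoding can be sketched with the standard library. The hash function, the bucket count and the hash_encode helper are assumptions chosen for illustration; the documentation does not state which hash the library uses.

```python
import hashlib


def hash_encode(value, n_buckets=1000):
    """Map a category label to a stable integer code via hashing."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Hypothetical base company data.
companies = {
    "AAPL": {"sector": "Technology", "industry": "Consumer Electronics"},
    "MSFT": {"sector": "Technology", "industry": "Software"},
}
cat_columns = ["sector", "industry"]

encoded = {
    ticker: {col: hash_encode(row[col]) for col in cat_columns}
    for ticker, row in companies.items()
}
print(encoded)
```

Unlike a fitted label encoder, a hash needs no vocabulary, so unseen categories still get a code. The cost is that two distinct labels can collide in the same bucket.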

calculate(data: Dict, index: List[str]) → pandas.core.frame.DataFrame

Interface to calculate features for tickers based on data

Parameters

data – dict with a field named after the value of the data_key param of __init__(). This field should contain an object implementing the load(index) -> pd.DataFrame interface
index – list of tickers to calculate features for, e.g. ['AAPL', 'TSLA']

Returns

resulting features with index ['ticker']. Each row contains the features of the ticker company

Return type

pd.DataFrame

DailyAggQuarterFeatures

class ml_investment.features.DailyAggQuarterFeatures(daily_data_key: str, quarterly_data_key: str, columns: typing.List[str], agg_day_counts: typing.List[typing.Union[int, numpy.timedelta64]] = [100, 200], max_back_quarter: int = 10, min_back_quarter: int = 0, daily_index=None, stats: typing.Dict[str, typing.Callable] = {'max': <function amax>, 'mean': <function mean>, 'median': <function median>, 'min': <function amin>, 'std': <function std>}, norm: bool = True, n_jobs: int = 2, verbose: bool = False)

Bases: object

Feature calculator for statistics of daily data, aggregated per quarter slice. Returns features for company quarter slices.

Parameters

daily_data_key – key of the dataloader in the data argument passed to calculate(), used for loading daily data
quarterly_data_key – key of the dataloader in the data argument passed to calculate(), used for loading quarterly data
columns – column names for feature calculation (e.g. marketcap, pe)
agg_day_counts – list of day counts to calculate statistics on. E.g. if agg_day_counts = [100, 200] then statistics will be calculated on the last 100 and the last 200 days (separately)
max_back_quarter – upper bound of company slices in time. If max_back_quarter = 1 then features will be calculated only for the current company quarter. If max_back_quarter is larger than the total number of quarters for the company then features will be calculated for all quarters
min_back_quarter – lower bound of company slices in time. If min_back_quarter = 0 (default) then features will be calculated for all quarters. If min_back_quarter = 2 then the current and previous quarter slices will not be used for feature calculation
daily_index – indexes for the data[daily_data_key] dataloader. If None then the index will be the same as for data[quarterly_data_key]. E.g. if you want to use this class for calculating commodities features, daily_index may be a list of the commodity codes of interest. If you want to use it for calculating daily price features, daily_index should be None
stats – aggregation functions for feature calculation, given as Dict[str, Callable]. Keys of this dict are used as feature name prefixes. Values should implement the foo(x: List) -> float interface
norm – whether to normalize the daily stats
n_jobs – number of threads for calculation
verbose – whether to show progress
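The aggregation over agg_day_counts can be sketched as below. The daily series, the daily_window_stats helper and the feature names are hypothetical. The sketch treats each count as calendar days ending at the quarter date, which is an assumption (the library may count rows instead), and the norm option is not modelled.

```python
import statistics
from datetime import date, timedelta

# Hypothetical daily market cap: one value per calendar day in 2021.
start = date(2021, 1, 1)
daily = [(start + timedelta(days=i), 100.0 + i) for i in range(300)]


def daily_window_stats(daily, quarter_date, agg_day_counts, stats):
    """Aggregate the daily values in the `days` days up to quarter_date."""
    feats = {}
    for days in agg_day_counts:
        lower = quarter_date - timedelta(days=days)
        window = [value for day, value in daily if lower < day <= quarter_date]
        for name, func in stats.items():
            # Feature names are illustrative, not the library's naming scheme.
            feats[f"{name}_days{days}"] = func(window) if window else None
    return feats


feats = daily_window_stats(
    daily,
    quarter_date=date(2021, 9, 30),
    agg_day_counts=[100, 200],
    stats={"mean": statistics.fmean, "max": max},
)
print(feats)
```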

calculate(data: Dict, index: List[str]) → pandas.core.frame.DataFrame

Interface to calculate features for tickers based on data

Parameters

data – dict with fields named after the values of the daily_data_key and quarterly_data_key params of __init__(). These fields should contain objects implementing the load(index) -> pd.DataFrame interface
index – list of tickers to calculate features for, e.g. ['AAPL', 'TSLA']

Returns

resulting features with index ['ticker', 'date']. Each row contains the features of the ticker company for the quarter at date

Return type

pd.DataFrame

RelativeGroupFeatures

class ml_investment.features.RelativeGroupFeatures(feature_calculator, group_data_key: str, group_col: str, relation_foo=<function RelativeGroupFeatures.<lambda>>, keep_group_feats=False, verbose: bool = False)

Bases: object

Feature calculator for features relative to a group median. E.g. calculate revenue growth relative to the median in the sector/industry.

Parameters

feature_calculator – feature calculator implementing the calculate(data, index) -> pd.DataFrame interface. Its features are the ones related to the group medians
group_data_key – key of the dataloader in the data argument passed to calculate(), used for loading data that contains group_col
group_col – name of the column defining the groups in which median values will be calculated
relation_foo – function implementing the foo(x, y) -> z interface. E.g. if foo = lambda x, y: x - y, then the resulting features will be calculated as the difference between the current company features and the group median features
keep_group_feats – whether to return the group median features as well
verbose – whether to show progress
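The group-relative calculation can be sketched in plain Python. The data and the relative_to_group helper are hypothetical, and subtraction is used for relation_foo because it is the example given above, not because it is known to be the library default.

```python
import statistics

# Hypothetical revenue growth (in percent) per ticker, and each ticker's sector.
growth = {"AAPL": 10.0, "MSFT": 20.0, "XOM": -5.0, "CVX": 5.0, "BP": 15.0}
sector = {"AAPL": "Technology", "MSFT": "Technology",
          "XOM": "Energy", "CVX": "Energy", "BP": "Energy"}


def relative_to_group(features, groups, relation_foo=lambda x, y: x - y):
    """Relate each company's feature to the median of its group."""
    by_group = {}
    for ticker, value in features.items():
        by_group.setdefault(groups[ticker], []).append(value)
    medians = {name: statistics.median(values) for name, values in by_group.items()}
    return {
        ticker: relation_foo(value, medians[groups[ticker]])
        for ticker, value in features.items()
    }


relative = relative_to_group(growth, sector)
print(relative)
```

Passing a different two-argument function, for example a ratio, changes how the company value is related to the group median without touching the grouping step.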

calculate(data, index)

Interface to calculate features for tickers based on data

Parameters

data – dict with a field named after the value of group_data_key and the fields required by feature_calculator. These fields should contain objects implementing the load(index) -> pd.DataFrame interface
index – index required by feature_calculator.calculate()

Returns

resulting features, indexed as in feature_calculator.calculate().

Return type

pd.DataFrame

FeatureMerger

class ml_investment.features.FeatureMerger(fc1, fc2, on=typing.Union[str, typing.List[str]])

Bases: object

Feature calculator that combines two other feature calculators. The results are combined with a left merge.

Parameters

fc1 – first feature calculator, implementing the calculate(data: Dict, index) -> pd.DataFrame interface
fc2 – second feature calculator, implementing the calculate(data: Dict, index) -> pd.DataFrame interface
on – column name or list of column names on which to merge the results of the two calculate() calls
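What a left merge on the on columns means for the two results can be sketched with plain dictionaries standing in for DataFrame rows. The rows, the left_merge helper and the assumption that fc1 supplies the left side are hypothetical; the library itself operates on pandas DataFrames.

```python
# Hypothetical results of fc1.calculate() and fc2.calculate(), one dict per row.
quarterly = [
    {"ticker": "AAPL", "date": "2021-03-31", "revenue_mean": 1.0},
    {"ticker": "AAPL", "date": "2020-12-31", "revenue_mean": 0.9},
    {"ticker": "TSLA", "date": "2021-03-31", "revenue_mean": 0.4},
]
base = [{"ticker": "AAPL", "sector": 7}]


def left_merge(left_rows, right_rows, on):
    """Keep every left row and attach the matching right columns, if any."""
    keys = [on] if isinstance(on, str) else list(on)
    right_cols = [c for c in (right_rows[0] if right_rows else {}) if c not in keys]
    # Assumes the key is unique on the right side.
    lookup = {tuple(row[k] for k in keys): row for row in right_rows}
    merged = []
    for row in left_rows:
        match = lookup.get(tuple(row[k] for k in keys), {})
        out = dict(row)
        for col in right_cols:
            out[col] = match.get(col)  # None where the right side has no match
        merged.append(out)
    return merged


merged = left_merge(quarterly, base, on="ticker")
print(merged)
```

The practical consequence is that the row count follows the left result, and tickers missing from the right result keep their rows with empty values in the right-hand columns.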

calculate(data: Dict, index) → pandas.core.frame.DataFrame

Interface to calculate features for tickers based on data

Parameters

data – dict with the fields required by fc1 and fc2. These fields should contain objects implementing the load(index) -> pd.DataFrame interface
index – indexes for the feature calculators. E.g. if the features describe companies then index may be a list of tickers, like ['AAPL', 'TSLA']

Returns

resulting merged features

Return type

pd.DataFrame