damast.data_handling.transformers

Collection of generic Transformer implementations

Classes#

`AddTimestamp`	Add Timestamp from date Time UTC.
`AddUndefinedValue`	Replace missing and Not Available (NA) entries in a column with a given value.
`BallTreeAugmenter`	A class for distance computation using a BallTree.
`ChangeTypeColumn`	Create a new column with the new type of a given column.
`JoinDataFrameByColumn`	Add a column to an input dataframe by merging it with another dataset.
`MultiplyValue`	Multiply a column by a value.
`DropMissingOrNan`	Drop rows that have a missing value or NaN in a given column.
`FilterWithin`	Filter rows and keep those within given values.
`RemoveValueRows`	Remove rows that have a given value in a given column.

Functions#

normalize(→ numpy.typing.NDArray[numpy.float64])

Normalize data in array x with lower bound x_min and upper bound x_max

Package Contents#

class damast.data_handling.transformers.AddTimestamp#

Bases: damast.core.dataprocessing.PipelineElement

Add Timestamp from date Time UTC.

If no timestamp is supplied for a row, NaN is added instead.

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#: Add Timestamp from datetimeUTC
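The documented behaviour (a timestamp per row, NaN where none is supplied) can be sketched in plain Python. This is a stand-alone illustration, not damast's implementation: the helper `to_timestamp` and the date format string are assumptions made for the example.

```python
import math
from datetime import datetime, timezone


def to_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Hypothetical helper: convert a UTC date-time string to a POSIX timestamp."""
    if value is None:
        # No date-time supplied for this row: fall back to NaN.
        return math.nan
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


# One row with a date-time value and one row without.
timestamps = [to_timestamp(v) for v in ["1970-01-02 00:00:00", None]]
```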

class damast.data_handling.transformers.AddUndefinedValue(fill_value: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Replace missing and Not Available (NA) entries in a column with a given value.

Parameters: fill_value – The value replacing NA

_fill_value: Any#

property fill_value#

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#: Fill in values for NA and missing entries
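A minimal sketch of the fill semantics on a plain list, assuming that "missing" covers both `None` and NaN. The function `fill_undefined` is hypothetical and does not use damast; it only illustrates what replacing NA entries with `fill_value` means.

```python
import math


def fill_undefined(values, fill_value):
    """Hypothetical helper: replace None and NaN entries with fill_value."""

    def is_undefined(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [fill_value if is_undefined(v) else v for v in values]


filled = fill_undefined([1.0, None, float("nan"), 4.0], -1.0)
```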

class damast.data_handling.transformers.BallTreeAugmenter(x: numpy.typing.NDArray[numpy.float64], metric: str)#

A class for distance computation using a BallTree.

Uses sklearn.neighbors.BallTree to compute the distance for any n-dimensional feature. The BallTree is created before the object is passed in as the lambda function of a DataFrame.add_virtual_column. The object can later be unpickled from the state, and any metadata added to the class after construction can be retrieved.

Parameters:

x – The points to use in the BallTree
metric – The metric to use in the BallTree, for available metrics see: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

_tree: sklearn.neighbors.BallTree#

_metric: str#

_modified: datetime.datetime#

__name__#

update_balltree(x: numpy.typing.NDArray[numpy.float64])#

Replace the points in the BallTree

Parameters: x – The new points (npt.NDArray[np.float64])

__call__(x: numpy.typing.NDArray[numpy.float64], y: numpy.typing.NDArray[numpy.float64]) → numpy.typing.NDArray[numpy.float64]#: Compute distances between the BallTree and each entry in x

property modified: datetime.datetime#: Last time the underlying BallTree was modified
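The pattern described for this class (a callable object that holds reference points, records when they were last modified, and can carry extra metadata) can be sketched without scikit-learn. The class below is a hypothetical stand-in that uses a brute-force nearest-point search instead of a BallTree, and it assumes that `x` and `y` are coordinate columns of equal length, which the documentation above does not state.

```python
import math
from datetime import datetime, timezone


class NearestDistance:
    """Hypothetical stand-in: brute-force nearest-point distance, not a BallTree."""

    def __init__(self, points):
        self._points = [tuple(p) for p in points]
        self._modified = datetime.now(timezone.utc)

    def update_points(self, points):
        # Replacing the reference points also refreshes the modification time.
        self._points = [tuple(p) for p in points]
        self._modified = datetime.now(timezone.utc)

    @property
    def modified(self):
        return self._modified

    def __call__(self, x, y):
        # Assumption: x and y are coordinate columns; each (x_i, y_i) is one query point.
        return [
            min(math.dist((xi, yi), p) for p in self._points)
            for xi, yi in zip(x, y)
        ]


augmenter = NearestDistance([(0.0, 0.0), (10.0, 0.0)])
augmenter.source = "reference-points-v1"  # metadata added after construction
distances = augmenter([3.0, 9.0], [4.0, 0.0])
```

A tree-based index such as a BallTree answers the same query without scanning every reference point, which matters once the point set is large.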

class damast.data_handling.transformers.ChangeTypeColumn(new_type: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Create a new column with the new type of a given column.

The new column name can be defined by providing a name_mapping for a column ‘y’. If no name_mapping is provided, the column’s new name will be ‘y’.

Parameters: new_type – The new type of the column

_new_type: Any#

property new_type#

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#: Change the default type of a column

class damast.data_handling.transformers.JoinDataFrameByColumn(dataset: str | pathlib.Path | damast.core.types.XDataFrame, right_on: str, dataset_col: str, how: JoinHowType = JoinHowType.LEFT, sep: str = ';')#

Bases: damast.core.dataprocessing.PipelineElement

Add a column to an input dataframe by merging it with another dataset.

Parameters:

dataset – Path to a .csv/.hdf5 file or a polars.dataframe.LazyFrame.
right_on – Column from dataset to use for joining data
dataset_col – Name of column in dataset to add
col_name – Name of augmented column
how – Type of join to perform, one of JoinHowType (default: JoinHowType.LEFT)
sep – Separator in CSV file

Note

right_on will not be added as a new column in the transformed dataset

class JoinHowType#

Bases: str, enum.Enum

INNER = 'inner'#

LEFT = 'left'#

RIGHT = 'right'#

FULL = 'full'#

SEMI = 'semi'#

ANTI = 'anti'#

CROSS = 'cross'#
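Because JoinHowType derives from both str and enum.Enum, its members behave like ordinary strings. The snippet below re-creates the enum stand-alone (it does not import damast) to show what that mixin implies.

```python
from enum import Enum


class JoinHowType(str, Enum):
    """Stand-alone re-creation of the documented members, for illustration only."""

    INNER = "inner"
    LEFT = "left"
    RIGHT = "right"
    FULL = "full"
    SEMI = "semi"
    ANTI = "anti"
    CROSS = "cross"


# Look up a member from its string value.
how = JoinHowType("left")

# Members compare equal to their plain string value thanks to the str mixin.
is_left_string = how == "left"
all_values = [member.value for member in JoinHowType]
```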

_right_on: str#

_dataset: damast.core.types.DataFrame#

_dataset_column: str#

_join_how: JoinDataFrameByColumn.JoinHowType#

column_dtype#

classmethod load_data(filename: str | pathlib.Path, sep: str) → damast.core.types.DataFrame#

Load dataset from file

Parameters:

filename – The input file (or path)
sep – Separator in csv

Returns:

A DataFrame with the data

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#

Join datasets by column “x”. Adds column “out”.

Returns: DataFrame with the added column
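The effect of a left join that adds a single column can be sketched with plain dictionaries. This is not damast's implementation: the helper `left_join_column`, its `key` and `out` arguments, and the sample column names are hypothetical. It mirrors the note above in that the `right_on` column itself is not copied into the result.

```python
def left_join_column(rows, key, lookup_rows, right_on, dataset_col, out):
    """Hypothetical helper: add lookup_rows[dataset_col] to rows, matched on key == right_on."""
    lookup = {r[right_on]: r[dataset_col] for r in lookup_rows}
    # Left join: every input row is kept; unmatched rows get None.
    return [{**row, out: lookup.get(row[key])} for row in rows]


rows = [{"mmsi": 1, "speed": 10.0}, {"mmsi": 2, "speed": 12.0}]
lookup_rows = [{"id": 1, "vessel_type": "cargo"}]

joined = left_join_column(
    rows, key="mmsi", lookup_rows=lookup_rows,
    right_on="id", dataset_col="vessel_type", out="vessel_type",
)
```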

class damast.data_handling.transformers.MultiplyValue(mul_value: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Multiply a column by a value.

Parameters: mul_value – The value to multiply the column by

_mul_value: Any#

property mul_value#

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#: Multiply a column by a given value

class damast.data_handling.transformers.DropMissingOrNan#

Bases: damast.core.dataprocessing.PipelineElement

Drop rows that have a missing value or NaN in a given column.

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#: Drop rows with a missing value or NaN

class damast.data_handling.transformers.FilterWithin(within_values: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Filter rows and keep those within given values.

Parameters: within_values – List of values to keep

_within_values: Any#

property within_values#

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#: Filter rows and keep those within given values

class damast.data_handling.transformers.RemoveValueRows(remove_value: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Remove rows that have a given value in a given column.

Parameters: remove_value – Rows with this value are removed.

_remove_value: Any#

property remove_value#

transform(df: damast.core.AnnotatedDataFrame) → damast.core.AnnotatedDataFrame#: Delete rows whose value equals remove_value
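The three row-filtering transformers (DropMissingOrNan, FilterWithin, RemoveValueRows) differ only in the predicate applied to each row. The sketch below shows the three predicates on a plain list of dictionaries. It assumes that "missing" means `None` or NaN and does not use damast; the sample data and the helper `is_missing` are made up for illustration.

```python
import math


def is_missing(v):
    """Hypothetical helper: treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


rows = [
    {"ship": "a", "speed": 12.5},
    {"ship": "b", "speed": None},
    {"ship": "c", "speed": float("nan")},
    {"ship": "d", "speed": -1.0},
]

# DropMissingOrNan-like: keep rows whose value is defined and not NaN.
dropped = [r for r in rows if not is_missing(r["speed"])]

# FilterWithin-like: keep rows whose value is one of the given values.
within = [r for r in rows if r["ship"] in ["a", "d"]]

# RemoveValueRows-like: remove rows whose value equals the given value.
removed = [r for r in rows if r["speed"] != -1.0]
```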

damast.data_handling.transformers.normalize(x: numpy.typing.NDArray[numpy.float64], x_min: float, x_max: float, a: float, b: float) → numpy.typing.NDArray[numpy.float64]#

Normalize data in array x with lower bound x_min and upper bound x_max to be in the range [a, b]

\[x_n = (b-a)\frac{x-x_{min}}{x_{max}-x_{min}} + a\]

Parameters:

x – Input array
x_min – Minimum bound of input data
x_max – Maximum bound of input data
a – Minimum bound of output data
b – Maximum bound of output data

Returns:

Normalized data
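As a worked example of the formula, the sketch below re-implements it in plain Python on a list. It is a stand-in for illustration, not damast's function, which operates on NumPy arrays. The name `normalize_sketch` is hypothetical.

```python
def normalize_sketch(x, x_min, x_max, a, b):
    """Map each value from [x_min, x_max] to [a, b] using the formula above."""
    scale = (b - a) / (x_max - x_min)
    return [scale * (value - x_min) + a for value in x]


# Input bounds 0..10 mapped to the output range -1..1.
normalized = normalize_sketch([0.0, 5.0, 10.0], x_min=0.0, x_max=10.0, a=-1.0, b=1.0)
```

The lower bound x_min maps to a, the upper bound x_max maps to b, and values in between are scaled linearly.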