damast.data_handling.transformers#

Collection of generic Transformer implementations

Submodules#

Classes#

AddTimestamp

Add a timestamp derived from the UTC date and time.

AddUndefinedValue

Replace missing and Not Available (NA) entries in a column with a given value.

BallTreeAugmenter

A class for distance computation using a BallTree.

ChangeTypeColumn

Create a new column with the new type of a given column.

JoinDataFrameByColumn

Add a column to an input dataframe by merging it with another dataset.

MultiplyValue

Multiply a column by a value.

DropMissingOrNan

Drop rows that have a missing value or NaN in a given column.

FilterWithin

Filter rows and keep those within given values.

RemoveValueRows

Remove rows that contain a given value in a given column.

Functions#

normalize(→ numpy.typing.NDArray[numpy.float64])

Normalize data in array x with lower bound x_min and upper bound x_max

Package Contents#

class damast.data_handling.transformers.AddTimestamp#

Bases: damast.core.dataprocessing.PipelineElement

Add a timestamp derived from the UTC date and time.

If no timestamp is supplied for a row, NaN is added instead.

transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Add Timestamp from datetimeUTC
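The conversion described above can be illustrated with the standard library alone. This is a minimal sketch, not the damast implementation: the function name to_timestamp and the input format string are assumptions, since this reference does not state how the UTC date-time values are encoded.

```python
import math
from datetime import datetime, timezone
from typing import Optional

# Assumed input format; the real column format is not stated in this reference.
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_timestamp(date_time_utc: Optional[str]) -> float:
    """Return seconds since the Unix epoch, or NaN when the input is missing."""
    if not date_time_utc:
        return math.nan
    parsed = datetime.strptime(date_time_utc, ASSUMED_FORMAT)
    # Mark the naive datetime as UTC so the result does not depend
    # on the local time zone of the machine running the code.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


rows = ["2020-01-01 00:00:00", None]
timestamps = [to_timestamp(row) for row in rows]
```

The second row has no date-time value, so its timestamp is NaN, mirroring the behaviour documented for AddTimestamp.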

class damast.data_handling.transformers.AddUndefinedValue(fill_value: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Replace missing and Not Available (NA) entries in a column with a given value.

Parameters:

fill_value – The value replacing NA

_fill_value: Any#
property fill_value#
transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Fill in values for NA and missing entries

class damast.data_handling.transformers.BallTreeAugmenter(x: numpy.typing.NDArray[numpy.float64], metric: str)#

A class for distance computation using a BallTree.

Uses sklearn.neighbors.BallTree to compute the distance for any n-dimensional feature. The BallTree is created before the object is passed in as the lambda function of a DataFrame.add_virtual_column. The object can later be unpickled from the state, and any metadata added to the class after construction can be retrieved.

Parameters:
_tree: sklearn.neighbors.BallTree#
_metric: str#
_modified: datetime.datetime#
__name__#
update_balltree(x: numpy.typing.NDArray[numpy.float64])#

Replace the points in the BallTree

Parameters:

x – (npt.NDArray[np.float64]): The new points

__call__(x: numpy.typing.NDArray[numpy.float64], y: numpy.typing.NDArray[numpy.float64]) numpy.typing.NDArray[numpy.float64]#

Compute distances between the BallTree and each entry in x

property modified: datetime.datetime#

Last time the underlying BallTree was modified
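To make the role of the underlying tree concrete, the sketch below builds a sklearn.neighbors.BallTree directly, queries nearest-neighbour distances, and round-trips the tree through pickle. It does not use BallTreeAugmenter itself; the sample points and the "euclidean" metric are illustrative assumptions.

```python
import pickle

import numpy as np
from sklearn.neighbors import BallTree

# Reference points stored in the tree (illustrative data).
reference_points = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
tree = BallTree(reference_points, metric="euclidean")

# For each query point, find the distance to its nearest reference point.
queries = np.array([[0.0, 1.0], [9.0, 0.0]])
distances, indices = tree.query(queries, k=1)
nearest = distances[:, 0]  # query() returns arrays of shape (n_queries, k)

# A BallTree can be pickled and restored, which is what allows an object
# wrapping it to be stored in a state and recovered later.
restored = pickle.loads(pickle.dumps(tree))
restored_distances, _ = restored.query(queries, k=1)
```

Replacing the points, as update_balltree does, amounts to building a new tree from the new array, because a BallTree cannot be modified in place.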

class damast.data_handling.transformers.ChangeTypeColumn(new_type: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Create a new column with the new type of a given column.

The new column name can be defined by providing a name_mapping for a column ‘y’. If no name_mapping is provided, the column’s new name will be ‘y’.

Parameters:

new_type – The new type of the column

_new_type: Any#
property new_type#
transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Change the default type of a column

class damast.data_handling.transformers.JoinDataFrameByColumn(dataset: str | pathlib.Path | damast.core.types.XDataFrame, right_on: str, dataset_col: str, how: JoinHowType = JoinHowType.LEFT, sep: str = ';')#

Bases: damast.core.dataprocessing.PipelineElement

Add a column to an input dataframe by merging it with another dataset.

Parameters:
  • dataset – Path to a .csv or .hdf5 file, or a polars.dataframe.LazyFrame.

  • right_on – Column from dataset to use for joining data

  • dataset_col – Name of column in dataset to add

  • col_name – Name of augmented column

  • how – Join type, one of the JoinHowType values (default: JoinHowType.LEFT)

  • sep – Separator in CSV file

Note

right_on will not be added as a new column in the transformed dataset

class JoinHowType#

Bases: str, enum.Enum

INNER = 'inner'#
LEFT = 'left'#
RIGHT = 'right'#
FULL = 'full'#
SEMI = 'semi'#
ANTI = 'anti'#
CROSS = 'cross'#
_right_on: str#
_dataset: damast.core.types.DataFrame#
_dataset_column: str#
_join_how: JoinDataFrameByColumn.JoinHowType#
column_dtype#
classmethod load_data(filename: str | pathlib.Path, sep: str) damast.core.types.DataFrame#

Load dataset from file

Parameters:
  • filename – The input file (or path)

  • sep – Separator in csv

Returns:

A DataFrame with the data

transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Join datasets by column “x”. Adds column “out”.

Returns:

DataFrame with added column
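Because JoinHowType derives from both str and enum.Enum, its members behave like plain strings while still being restricted to a fixed set of values. The sketch below re-declares an equivalent enum under the hypothetical name JoinHow, purely to illustrate that behaviour without importing damast.

```python
from enum import Enum


class JoinHow(str, Enum):
    """Hypothetical stand-in mirroring the members listed for JoinHowType."""

    INNER = "inner"
    LEFT = "left"
    RIGHT = "right"
    FULL = "full"
    SEMI = "semi"
    ANTI = "anti"
    CROSS = "cross"


# Look a member up from its string value, e.g. one read from a config file.
selected = JoinHow("left")

# Members compare equal to their string value thanks to the str mixin.
is_plain_string_equal = selected == "left"

# An unknown value raises ValueError instead of passing through silently.
try:
    JoinHow("outer")
    rejected = False
except ValueError:
    rejected = True
```

This pattern lets a join type be written either as an enum member or as a validated string.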

class damast.data_handling.transformers.MultiplyValue(mul_value: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Multiply a column by a value.

Parameters:

mul_value – The value to multiply by

_mul_value: Any#
property mul_value#
transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Multiply a column by a given value

class damast.data_handling.transformers.DropMissingOrNan#

Bases: damast.core.dataprocessing.PipelineElement

Drop rows that have a missing value or NaN in a given column.

transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Drop rows with missing value

class damast.data_handling.transformers.FilterWithin(within_values: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Filter rows and keep those within given values.

Parameters:

within_values – list of values to keep

_within_values: Any#
property within_values#
transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Filter rows and keep those within given values

class damast.data_handling.transformers.RemoveValueRows(remove_value: Any)#

Bases: damast.core.dataprocessing.PipelineElement

Remove rows that contain a given value in a given column.

Parameters:

remove_value – remove rows with this value.

_remove_value: Any#
property remove_value#
transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame#

Delete rows whose value equals remove_value

damast.data_handling.transformers.normalize(x: numpy.typing.NDArray[numpy.float64], x_min: float, x_max: float, a: float, b: float) numpy.typing.NDArray[numpy.float64]#

Normalize data in array x with lower bound x_min and upper bound x_max to be in the range [a, b]

\[x_n = (b-a)\frac{x-x_{min}}{x_{max}-x_{min}} + a\]

Parameters:
  • x – Input array

  • x_min – Minimum bound of input data

  • x_max – Maximum bound of input data

  • a – Minimum bound of output data

  • b – Maximum bound of output data

Returns:

Normalized data
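The formula above can be written directly with NumPy. This is a minimal sketch under the name normalize_sketch (a hypothetical name chosen to avoid suggesting it is the damast source); it follows the documented signature and assumes x_max differs from x_min.

```python
import numpy as np
import numpy.typing as npt


def normalize_sketch(
    x: npt.NDArray[np.float64],
    x_min: float,
    x_max: float,
    a: float,
    b: float,
) -> npt.NDArray[np.float64]:
    """Map values from [x_min, x_max] linearly onto [a, b]."""
    # x_n = (b - a) * (x - x_min) / (x_max - x_min) + a
    return (b - a) * (x - x_min) / (x_max - x_min) + a


x = np.array([0.0, 5.0, 10.0])
scaled = normalize_sketch(x, x_min=0.0, x_max=10.0, a=-1.0, b=1.0)
```

Here x_min maps to a, x_max maps to b, and the midpoint of the input range maps to the midpoint of [a, b]. Values of x outside [x_min, x_max] fall outside [a, b], since the formula does not clip.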