damast.data_handling.transformers#
Collection of generic Transformer implementations
Submodules#
Classes#
AddTimestamp | Add a timestamp from the date and time in UTC. |
AddUndefinedValue | Replace missing and Not Available (NA) entries in a column with a given value. |
BallTreeAugmenter | A class for distance computation using a BallTree. |
ChangeTypeColumn | Create a new column with the new type of a given column. |
JoinDataFrameByColumn | Add a column to an input dataframe by merging it with another dataset. |
MultiplyValue | Multiply a column by a value. |
DropMissingOrNan | Drop rows that have a missing or NaN value in a given column. |
FilterWithin | Filter rows and keep those within given values. |
RemoveValueRows | Remove rows that have a given value in a column. |
Functions#
normalize | Normalize data in array x with lower bound x_min and upper bound x_max |
Package Contents#
- class damast.data_handling.transformers.AddTimestamp#
Bases:
damast.core.dataprocessing.PipelineElement
Add a timestamp from the date and time in UTC.
If no timestamp is supplied for a row, NaN is added.
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Add Timestamp from datetimeUTC
- class damast.data_handling.transformers.AddUndefinedValue(fill_value: Any)#
Bases:
damast.core.dataprocessing.PipelineElement
Replace missing and Not Available (NA) entries in a column with a given value.
- Parameters:
fill_value – The value replacing NA
- _fill_value: Any#
- property fill_value#
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Fill in values for NA and missing entries
- class damast.data_handling.transformers.BallTreeAugmenter(x: numpy.typing.NDArray[numpy.float64], metric: str)#
A class for distance computation using a BallTree.
Uses sklearn.neighbors.BallTree to compute distances for any n-dimensional feature. The BallTree is created before the object is passed as the lambda function to DataFrame.add_virtual_column. The object can later be unpickled from the state, and any metadata added to the class after construction can be retrieved.
- Parameters:
x – The points to use in the BallTree
metric – The metric to use in the BallTree, for available metrics see: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
- _tree: sklearn.neighbors.BallTree#
- _metric: str#
- _modified: datetime.datetime#
- __name__#
- update_balltree(x: numpy.typing.NDArray[numpy.float64])#
Replace points in the Balltree
- Parameters:
x – (npt.NDArray[np.float64]): The new points
- __call__(x: numpy.typing.NDArray[numpy.float64], y: numpy.typing.NDArray[numpy.float64]) numpy.typing.NDArray[numpy.float64] #
Compute distances between the Balltree and each entry in x
- property modified: datetime.datetime#
Last time the underlying BallTree was modified
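The idea of wrapping a BallTree in a callable object can be sketched as follows. This is an illustration under assumptions, not the BallTreeAugmenter source: the class name `NearestDistance` is hypothetical, and it assumes that `x` and `y` in `__call__` are two coordinate columns that together form the query points.

```python
from datetime import datetime, timezone

import numpy as np
from sklearn.neighbors import BallTree


class NearestDistance:
    """Hypothetical callable that returns the distance to the nearest tree point."""

    def __init__(self, points, metric="euclidean"):
        self._metric = metric
        self.update(points)

    def update(self, points):
        # Rebuild the tree and record when it was last changed.
        self._tree = BallTree(np.asarray(points, dtype=np.float64), metric=self._metric)
        self.modified = datetime.now(timezone.utc)

    def __call__(self, x, y):
        # Assumption: x and y are coordinate columns of equal length.
        queries = np.column_stack([x, y])
        dist, _ = self._tree.query(queries, k=1)
        return dist[:, 0]


augmenter = NearestDistance(np.array([[0.0, 0.0], [10.0, 0.0]]))
distances = augmenter(np.array([3.0, 10.0]), np.array([4.0, 0.0]))
print(distances)
```

Keeping the tree inside an object, rather than in a bare lambda, is what makes it possible to rebuild the tree later and to carry metadata such as the modification time alongside it.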
- class damast.data_handling.transformers.ChangeTypeColumn(new_type: Any)#
Bases:
damast.core.dataprocessing.PipelineElement
Create a new column with the new type of a given column.
The new column name can be defined by providing a name_mapping for a column ‘y’. If no name_mapping is provided the column’s new name will be ‘y’
- Parameters:
new_type – The new type of the column
- _new_type: Any#
- property new_type#
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Change the default type of a column
- class damast.data_handling.transformers.JoinDataFrameByColumn(dataset: str | pathlib.Path | damast.core.types.XDataFrame, right_on: str, dataset_col: str, how: JoinHowType = JoinHowType.LEFT, sep: str = ';')#
Bases:
damast.core.dataprocessing.PipelineElement
Add a column to an input dataframe by merging it with another dataset.
- Parameters:
dataset – Path to .csv/.hdf5-file or a polars.dataframe.LazyFrame.
right_on – Column from dataset to use for joining data
dataset_column – Name of column in dataset to add
col_name – Name of augmented column
sep – Separator in CSV file
Note
right_on will not be added as a new column in the transformed dataset.
- class JoinHowType#
Bases:
str, enum.Enum
- INNER = 'inner'#
- LEFT = 'left'#
- RIGHT = 'right'#
- FULL = 'full'#
- SEMI = 'semi'#
- ANTI = 'anti'#
- CROSS = 'cross'#
- _right_on: str#
- _dataset: damast.core.types.DataFrame#
- _dataset_column: str#
- _join_how: JoinDataFrameByColumn.JoinHowType#
- column_dtype#
- classmethod load_data(filename: str | pathlib.Path, sep: str) damast.core.types.DataFrame #
Load dataset from file
- Parameters:
filename – The input file (or path)
sep – Separator in csv
- Returns:
A DataFrame with the data
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Join datasets by column “x”. Adds column “out”.
- Returns:
DataFrame with added column
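Because JoinHowType derives from both str and enum.Enum, its members behave like plain strings. The sketch below reproduces that pattern with a hypothetical stand-in class, `JoinHow`, that copies the documented member values; it does not import damast.

```python
import enum


class JoinHow(str, enum.Enum):
    """Hypothetical stand-in mirroring the documented JoinHowType members."""

    INNER = "inner"
    LEFT = "left"
    RIGHT = "right"
    FULL = "full"
    SEMI = "semi"
    ANTI = "anti"
    CROSS = "cross"


# Look up a member from its string value, e.g. when reading a config file.
how = JoinHow("left")

# Members compare equal to their string value and are str instances.
print(how is JoinHow.LEFT, how == "left", isinstance(how, str), how.value)
```

An invalid value such as `JoinHow("outer")` raises a ValueError, which gives early validation of the join strategy.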
- class damast.data_handling.transformers.MultiplyValue(mul_value: Any)#
Bases:
damast.core.dataprocessing.PipelineElement
Multiply a column by a value.
- Parameters:
mul_value – The value to multiply by
- _mul_value: Any#
- property mul_value#
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Multiply a column by a given value
- class damast.data_handling.transformers.DropMissingOrNan#
Bases:
damast.core.dataprocessing.PipelineElement
Drop rows that have a missing or NaN value in a given column.
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Drop rows with missing value
- class damast.data_handling.transformers.FilterWithin(within_values: Any)#
Bases:
damast.core.dataprocessing.PipelineElement
Filter rows and keep those within given values.
- Parameters:
within_values – list of values to keep
- _within_values: Any#
- property within_values#
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Filter rows and keep those within given values
- class damast.data_handling.transformers.RemoveValueRows(remove_value: Any)#
Bases:
damast.core.dataprocessing.PipelineElement
Remove rows that have a given value in a column.
- Parameters:
remove_value – remove rows with this value.
- _remove_value: Any#
- property remove_value#
- transform(df: damast.core.AnnotatedDataFrame) damast.core.AnnotatedDataFrame #
Delete rows with remove_values
- damast.data_handling.transformers.normalize(x: numpy.typing.NDArray[numpy.float64], x_min: float, x_max: float, a: float, b: float) numpy.typing.NDArray[numpy.float64] #
Normalize data in array x with lower bound x_min and upper bound x_max to be in the range [a, b]
\[x_n = (b-a)\frac{x-x_{min}}{x_{max}-x_{min}} + a\]
- Parameters:
x – Input array
x_min – Minimum bound of input data
x_max – Maximum bound of input data
a – Minimum bound of output data
b – Maximum bound of output data
- Returns:
Normalized data
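The formula above translates directly into NumPy. The function below is a sketch written from the documented formula, not a copy of the library's implementation; the name `normalize_sketch` is hypothetical.

```python
import numpy as np


def normalize_sketch(x, x_min: float, x_max: float, a: float, b: float) -> np.ndarray:
    """Map values from [x_min, x_max] linearly onto [a, b]."""
    x = np.asarray(x, dtype=np.float64)
    return (b - a) * (x - x_min) / (x_max - x_min) + a


values = np.array([0.0, 5.0, 10.0])
scaled = normalize_sketch(values, x_min=0.0, x_max=10.0, a=-1.0, b=1.0)
print(scaled)
```

Here x_min maps to a, x_max maps to b, and the midpoint of the input range maps to the midpoint of the output range. The formula divides by x_max - x_min, so the two bounds must differ.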