# Usage
We start by exploring the data-processing pipeline part of `DAMAST`.
We consider a manufactured dataset of Automatic Identification System (AIS) messages.
The data is generated for 1000 boats, where the minimal length of a trajectory is 25 messages and the maximal length is 300.

In [None]:
!pip install damast

import polars
import damast.domains.maritime.ais.data_generator as generator

data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)

The data is stored in a [polars.LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), and we can inspect the first and last 5 messages in the dataset.

In [None]:
print(data.dataframe)

The dataset consists of 11 columns, which we will go through in detail.

## Data-specification
The Maritime Mobile Service Identity (MMSI) is used to identify a ship. It *should* be a 9-digit number whose first digit is between 2 and 7.
The data we have generated should contain some invalid numbers. Let us inspect these.

In [None]:
from damast.domains.maritime.data_specification import MMSI
df = data.dataframe
invalid_mmsis = df.filter((polars.col('mmsi') < MMSI.min_value) | (polars.col('mmsi') > MMSI.max_value))
invalid_mmsis
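
The rule stated above (nine digits, leading digit between 2 and 7) can also be checked in plain Python. The following is a minimal sketch: the bounds `ASSUMED_MMSI_MIN` and `ASSUMED_MMSI_MAX` and the helper `is_plausible_mmsi` are hypothetical names derived from the prose rule only, and may differ from `MMSI.min_value` and `MMSI.max_value` in `damast`.

```python
# Bounds derived from the prose rule (9 digits, leading digit 2-7).
# They are an assumption and may differ from MMSI.min_value / MMSI.max_value.
ASSUMED_MMSI_MIN = 200_000_000
ASSUMED_MMSI_MAX = 799_999_999


def is_plausible_mmsi(value) -> bool:
    """Return True if value is an integer inside the assumed MMSI range."""
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(value, int) or isinstance(value, bool):
        return False
    return ASSUMED_MMSI_MIN <= value <= ASSUMED_MMSI_MAX


# Made-up sample values: one plausible, three out of range, one missing
candidates = [227006760, 123456789, 999999999, 20000000, None]
flags = [is_plausible_mmsi(c) for c in candidates]
print(flags)  # [True, False, False, False, False]
```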

Before sending this data to a machine learning algorithm, one would have to filter out invalid data.
We can do this by creating a `damast.core.DataSpecification` describing what valid output we would like in our data-frame.

In [None]:
from damast.core import DataSpecification, MinMax
mmsi_spec = DataSpecification(name="mmsi", description="Maritime Mobile Service Identity", representation_type=int,
                              value_range=MinMax(MMSI.min_value, MMSI.max_value))

Here we have specified what the column contains, how the data is represented in Python, and its minimum and maximum values.
Next, we create a `damast.core.MetaData` object that we can apply to the dataframe.

In [None]:
from damast.core import MetaData, ValidationMode
metadata = MetaData([mmsi_spec])
metadata.apply(df.lazy(), ValidationMode.UPDATE_DATA)

Of course, we do not want to repeat this process manually for every column. Instead, we can create a `DataSpecification` per column and let the `damast.core.AnnotatedDataFrame` handle the validation of the data. There are three ways of validating the input data against the metadata:
- `ValidationMode.READONLY`: Reads in the data, checks it against the meta-data and throws an error if the data does not adhere to the data-specification.
- `ValidationMode.UPDATE_METADATA`: Update the metadata based on the input in the annotated data-frame. This might change the representation type, column name and valid ranges of the data.
- `ValidationMode.UPDATE_DATA`: Update data so that it adheres to the meta-data.
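
To make the difference between the three modes concrete, here is a plain-Python sketch on a single list of values with a min/max range. The `Mode` enum and the `validate` function are hypothetical stand-ins, not part of `damast`. The behaviour chosen for the two update modes (widening the range, and replacing offending values with `None`) is only one possible reading of the descriptions above.

```python
from enum import Enum


class Mode(Enum):  # hypothetical stand-in for damast's ValidationMode
    READONLY = "readonly"
    UPDATE_METADATA = "update_metadata"
    UPDATE_DATA = "update_data"


def validate(values, bounds, mode):
    """Check values against (low, high) bounds; return (values, bounds)."""
    low, high = bounds
    present = [v for v in values if v is not None]
    if mode is Mode.READONLY:
        # Fail on the first value outside the declared range
        for v in present:
            if not low <= v <= high:
                raise ValueError(f"{v} is outside [{low}, {high}]")
        return list(values), bounds
    if mode is Mode.UPDATE_METADATA:
        # Assumed reading: widen the declared range to cover the observed data
        return list(values), (min([low, *present]), max([high, *present]))
    # Mode.UPDATE_DATA, assumed reading: replace offending values with None
    cleaned = [v if v is not None and low <= v <= high else None for v in values]
    return cleaned, bounds


values, bounds = [5, 12, -3], (0, 10)
print(validate(values, bounds, Mode.UPDATE_METADATA))  # ([5, 12, -3], (-3, 12))
print(validate(values, bounds, Mode.UPDATE_DATA))      # ([5, None, None], (0, 10))
```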

In [None]:
from damast.core.metadata import DataCategory
from damast.core.dataframe import AnnotatedDataFrame
dataspec = {
    "annotations": {"comment": "This is an autogenerated test data set"},
    "columns": [
        {"name": "mmsi", "is_optional": False, "category": DataCategory.STATIC,
         "value_range":{"MinMax": {"min": MMSI.min_value, "max": MMSI.max_value}}},
        {"name": "lon", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "lat", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "date_time_utc", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "sog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "cog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "true_heading", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "nav_status", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "rot", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "message_nr", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "source", "is_optional": False, "category": DataCategory.DYNAMIC},
    ]
}
metadata = MetaData.from_dict(dataspec)
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
adf = AnnotatedDataFrame(data.dataframe, metadata, validation_mode=ValidationMode.UPDATE_DATA)
adf

## Data-processing
Say we want to repeat this process on any data-set we read in. Then, we should create a `damast.core.dataprocessing.DataProcessingPipeline`.
A pipeline consists of pipeline elements, each of which is a transformation applied to the dataset.
We start by creating a pipeline element that drops all rows missing an `"mmsi"` entry.

In [None]:
from damast.data_handling.transformers.filters import DropMissingOrNan
from damast.core.dataprocessing import DataProcessingPipeline
pipeline = DataProcessingPipeline(name="Remove rows with missing MMSI",
                                  base_dir="./output_dir",
                                  inplace_transformation=True)
pipeline.add(name="Remove rows with missing MMSI",
             transformer=DropMissingOrNan(),
             name_mappings={"x": "mmsi"})

transformed_adf = pipeline.transform(adf)
transformed_adf

In [None]:
transformed_adf.collect()
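
Conceptually, a pipeline is an ordered list of named transformations, each applied to the output of the previous one. The sketch below illustrates this idea in plain Python on a list of dictionaries. `ToyPipeline` and `drop_missing_or_nan` are hypothetical and do not reflect how `damast` implements `DataProcessingPipeline` or `DropMissingOrNan`. The sketch also assumes that `name_mappings` binds a transformer's generic input name (`"x"`) to a concrete column (`"mmsi"`).

```python
def drop_missing_or_nan(rows, x):
    """Keep only rows whose value under key `x` is present and not NaN."""
    kept = []
    for row in rows:
        value = row.get(x)
        # NaN is the only float value that is not equal to itself
        if value is None or (isinstance(value, float) and value != value):
            continue
        kept.append(row)
    return kept


class ToyPipeline:  # hypothetical, not damast's DataProcessingPipeline
    def __init__(self, name):
        self.name = name
        self.steps = []

    def add(self, name, transformer, name_mappings):
        # Remember the step; it is only executed in transform()
        self.steps.append((name, transformer, name_mappings))

    def transform(self, rows):
        # Apply every step in insertion order, feeding each result to the next
        for _name, transformer, name_mappings in self.steps:
            rows = transformer(rows, **name_mappings)
        return rows


# Made-up rows: one valid, one with a missing MMSI, one with a NaN MMSI
rows = [
    {"mmsi": 227006760, "lon": 2.3},
    {"mmsi": None, "lon": 4.1},
    {"mmsi": float("nan"), "lon": 5.0},
]
toy_pipeline = ToyPipeline(name="Drop rows without MMSI")
toy_pipeline.add(name="Drop missing MMSI",
                 transformer=drop_missing_or_nan,
                 name_mappings={"x": "mmsi"})
result = toy_pipeline.transform(rows)
print(result)  # [{'mmsi': 227006760, 'lon': 2.3}]
```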