Usage

We start by exploring the data-processing pipeline part of DAMAST. We consider a manufactured dataset of Automatic Identification System (AIS) messages. The data is generated for 1000 trajectories, where the minimal length of a trajectory is 25 messages and the maximal length is 300.

!pip install damast

import polars
import damast.domains.maritime.ais.data_generator as generator

data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)

The data is stored in a polars.DataFrame; printing it shows the first and last 5 messages in the dataset.

print(data.dataframe)
shape: (162_118, 11)
┌───────────┬────────────┬────────────┬───────────────┬───┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon        ┆ lat        ┆ date_time_utc ┆ … ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---        ┆ ---        ┆ ---           ┆   ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64        ┆ f64        ┆ str           ┆   ┆ i64        ┆ f64 ┆ i64        ┆ str    │
╞═══════════╪════════════╪════════════╪═══════════════╪═══╪════════════╪═════╪════════════╪════════╡
│ 665159851 ┆ 103.770273 ┆ 78.472639  ┆ 1985-03-22    ┆ … ┆ 0          ┆ 0.0 ┆ 2          ┆ s      │
│           ┆            ┆            ┆ 14:50:16      ┆   ┆            ┆     ┆            ┆        │
│ 645053718 ┆ 39.967344  ┆ 29.150584  ┆ 2001-12-02    ┆ … ┆ 0          ┆ 0.0 ┆ 1          ┆ s      │
│           ┆            ┆            ┆ 16:00:10      ┆   ┆            ┆     ┆            ┆        │
│ 829664905 ┆ 146.296785 ┆ -68.14958  ┆ 1988-06-02    ┆ … ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
│           ┆            ┆            ┆ 20:32:13      ┆   ┆            ┆     ┆            ┆        │
│ 244557659 ┆ 73.488362  ┆ 52.166825  ┆ 1991-02-14    ┆ … ┆ 0          ┆ 0.0 ┆ 3          ┆ g      │
│           ┆            ┆            ┆ 18:23:40      ┆   ┆            ┆     ┆            ┆        │
│ 757514195 ┆ 172.669604 ┆ 8.815414   ┆ 2008-04-22    ┆ … ┆ 0          ┆ 0.0 ┆ 1          ┆ s      │
│           ┆            ┆            ┆ 12:47:43      ┆   ┆            ┆     ┆            ┆        │
│ …         ┆ …          ┆ …          ┆ …             ┆ … ┆ …          ┆ …   ┆ …          ┆ …      │
│ 497570870 ┆ 144.543314 ┆ -76.135199 ┆ null          ┆ … ┆ 7          ┆ 0.0 ┆ 3          ┆ s      │
│ 551920744 ┆ 108.758311 ┆ -11.502653 ┆ null          ┆ … ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
│ 220739595 ┆ 83.560212  ┆ -2.216197  ┆ 1991-01-10    ┆ … ┆ 1          ┆ 0.0 ┆ 3          ┆ s      │
│           ┆            ┆            ┆ 17:29:36      ┆   ┆            ┆     ┆            ┆        │
│ 275532654 ┆ -53.465037 ┆ 51.27748   ┆ 1970-09-23    ┆ … ┆ 0          ┆ 0.0 ┆ 2          ┆ g      │
│           ┆            ┆            ┆ 22:49:00      ┆   ┆            ┆     ┆            ┆        │
│ 244605512 ┆ 58.250896  ┆ -76.052974 ┆ 1987-05-16    ┆ … ┆ 0          ┆ 0.0 ┆ 3          ┆ s      │
│           ┆            ┆            ┆ 17:45:44      ┆   ┆            ┆     ┆            ┆        │
└───────────┴────────────┴────────────┴───────────────┴───┴────────────┴─────┴────────────┴────────┘

The dataset consists of 11 columns, which we will go through in detail.

Data-specification

The Maritime Mobile Service Identity (MMSI) is used to identify a ship. It should be a 9-digit number whose first digit is between 2 and 7. The data we have generated should contain some invalid numbers. Let us inspect these.

from damast.domains.maritime.data_specification import MMSI
df = data.dataframe
invalid_mmsis = df.filter((polars.col('mmsi') < MMSI.min_value) | (polars.col('mmsi') > MMSI.max_value))
invalid_mmsis
shape: (12_166, 11)
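┌───────────┬─────────────┬────────────┬─────────────────────┬────────────┬───────────┬──────────────┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_utc       ┆ sog        ┆ cog       ┆ true_heading ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ ---                 ┆ ---        ┆ ---       ┆ ---          ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ str                 ┆ f64        ┆ f64       ┆ f64          ┆ i64        ┆ f64 ┆ i64        ┆ str    │
╞═══════════╪═════════════╪════════════╪═════════════════════╪════════════╪═══════════╪══════════════╪════════════╪═════╪════════════╪════════╡
│ 829664905 ┆ 146.296785  ┆ -68.14958  ┆ 1988-06-02 20:32:13 ┆ -18.581698 ┆ 2.453757  ┆ 2.454083     ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
│ 191860039 ┆ -69.473065  ┆ 47.91841   ┆ 2015-08-10 04:22:38 ┆ 7.306458   ┆ 0.783616  ┆ 0.817958     ┆ 1          ┆ 0.0 ┆ 3          ┆ g      │
│ 831760986 ┆ -121.814033 ┆ 41.198894  ┆ 2012-02-27 18:43:44 ┆ 4.656469   ┆ -4.758333 ┆ -4.70935     ┆ 7          ┆ 0.0 ┆ 1          ┆ s      │
│ 826453030 ┆ -16.070802  ┆ -81.95998  ┆ 1980-01-30 06:05:54 ┆ -6.534687  ┆ -0.207772 ┆ -0.145488    ┆ 7          ┆ 0.0 ┆ 1          ┆ g      │
│ 834449441 ┆ -123.703542 ┆ 82.106814  ┆ 1977-08-12 14:36:49 ┆ 4.205259   ┆ -5.529068 ┆ -5.492754    ┆ 0          ┆ 0.0 ┆ 2          ┆ s      │
│ 827110460 ┆ -149.085766 ┆ 31.831945  ┆ 1970-09-05 17:31:24 ┆ 1.750886   ┆ -1.067    ┆ -1.015302    ┆ 0          ┆ 0.0 ┆ 3          ┆ g      │
│ 193714578 ┆ 0.048932    ┆ -61.353975 ┆ 2008-03-27 00:50:56 ┆ -11.366827 ┆ 8.560596  ┆ 8.584886     ┆ 0          ┆ 0.0 ┆ 3          ┆ g      │
│ 803447900 ┆ 48.379803   ┆ 34.144149  ┆ 1975-12-15 03:51:35 ┆ -1.166723  ┆ -0.622385 ┆ -0.588674    ┆ 7          ┆ 0.0 ┆ 3          ┆ g      │
│ 825260785 ┆ -155.823564 ┆ 48.097149  ┆ 1994-12-18 11:24:35 ┆ 18.162827  ┆ 0.550113  ┆ 0.565691     ┆ 1          ┆ 0.0 ┆ 3          ┆ g      │
│ 834449441 ┆ -123.792928 ┆ 82.175318  ┆ 1977-08-12 15:07:46 ┆ -3.14995   ┆ -5.673962 ┆ -5.596834    ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
└───────────┴─────────────┴────────────┴─────────────────────┴────────────┴───────────┴──────────────┴────────────┴─────┴────────────┴────────┘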
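
The rule stated above (nine digits, first digit between 2 and 7) is equivalent to a simple integer range check. The following is a minimal sketch in plain Python that does not use damast; the helper name is_valid_mmsi and the two constants are hypothetical, and the bounds are chosen to match the values 200000000 and 799999999 that appear in the filter plans printed in this section.

```python
# Hypothetical helper, not part of damast: checks the MMSI rule described above.
MMSI_MIN = 200_000_000  # smallest 9-digit number whose first digit is 2
MMSI_MAX = 799_999_999  # largest 9-digit number whose first digit is 7


def is_valid_mmsi(mmsi: int) -> bool:
    """Return True if mmsi has nine digits and its first digit is in 2..7."""
    return MMSI_MIN <= mmsi <= MMSI_MAX


# MMSI values taken from the tables in this section.
samples = [829664905, 191860039, 786414538, 356986369]
for mmsi in samples:
    print(mmsi, is_valid_mmsi(mmsi))
```

A single range comparison is enough because every integer in that interval has exactly nine digits and a leading digit between 2 and 7.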

Before sending this data to a machine learning algorithm, one would have to filter out invalid data. We can do this by creating a damast.core.DataSpecification describing what valid output we would like in our data-frame.

from damast.core import DataSpecification, MinMax
mmsi_spec = DataSpecification(name="mmsi", description="Maritime Mobile Service Identity", representation_type=int,
                              value_range=MinMax(MMSI.min_value, MMSI.max_value))

We have here described what data this column is supposed to describe, how the data is represented in Python, and its minimum and maximum range. Next, we create a damast.core.MetaData object that we can apply to the dataframe.

from damast.core import MetaData, ValidationMode
metadata = MetaData([mmsi_spec])
metadata.apply(df.lazy(), ValidationMode.UPDATE_DATA)
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:568: UserWarning: DataSpecification.apply: column 'mmsi': expected representation type: <class 'int'>, but got 'Int64'
  warnings.warn(
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:589: UserWarning: Filtering out for column 'mmsi' values that are out of range.
  warnings.warn(
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("mmsi")) >= (200000000)] FROM

FILTER [(col("mmsi")) <= (799999999)] FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS

Of course, we do not want to repeat this process manually for every column. Instead, we can create a DataSpecification per column and let the damast.core.AnnotatedDataFrame handle the validation of the data. There are three ways of handling input data together with its metadata:

  • ValidationMode.READONLY: Reads in the data, checks it against the meta-data and throws an error if the data does not adhere to the data-specification.

  • ValidationMode.UPDATE_METADATA: Update the metadata based on the input in the annotated data-frame. This might change the representation type, column name and valid ranges of the data.

  • ValidationMode.UPDATE_DATA: Update data so that it adheres to the meta-data.
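
To make the difference between the three modes concrete, here is a toy re-implementation in plain Python. It is not damast's code: the names ToyValidationMode and toy_validate are hypothetical, only a (min, max) range on a list of integers is considered, and widening the range is just one plausible reading of what "updating the metadata" could mean.

```python
from enum import Enum


class ToyValidationMode(Enum):  # hypothetical stand-in for damast's ValidationMode
    READONLY = "readonly"
    UPDATE_METADATA = "update_metadata"
    UPDATE_DATA = "update_data"


def toy_validate(values, value_range, mode):
    """Check a list of ints against a (min, max) range.

    Returns the (possibly filtered) values and the (possibly widened) range.
    """
    lo, hi = value_range
    in_range = [v for v in values if lo <= v <= hi]
    if len(in_range) == len(values):
        return values, value_range  # the data already conforms
    if mode is ToyValidationMode.READONLY:
        raise ValueError("data does not adhere to the specification")
    if mode is ToyValidationMode.UPDATE_METADATA:
        # Assumption: widen the range so that it covers the observed data.
        return values, (min(lo, *values), max(hi, *values))
    # UPDATE_DATA: keep the specification and drop the offending values.
    return in_range, value_range


values = [829664905, 786414538, 191860039]
spec = (200_000_000, 799_999_999)
print(toy_validate(values, spec, ToyValidationMode.UPDATE_DATA))
print(toy_validate(values, spec, ToyValidationMode.UPDATE_METADATA))
```

With READONLY, the same call raises a ValueError, which mirrors the "throws an error" behaviour described in the list above.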

from damast.core.metadata import DataCategory
from damast.core.dataframe import AnnotatedDataFrame
dataspec = {
    "annotations": {"comment": "This is an autogenerated test data set"},
    "columns": [
        {"name": "mmsi", "is_optional": False, "category": DataCategory.STATIC,
         "value_range":{"MinMax": {"min": MMSI.min_value, "max": MMSI.max_value}}},
        {"name": "lon", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "lat", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "date_time_utc", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "sog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "cog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "true_heading", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "nav_status", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "rot", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "message_nr", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "source", "is_optional": False, "category": DataCategory.DYNAMIC},
    ]
}
metadata = MetaData.from_dict(dataspec)
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
adf = AnnotatedDataFrame(data.dataframe, metadata, validation_mode=ValidationMode.UPDATE_DATA)
adf
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:589: UserWarning: Filtering out for column 'mmsi' values that are out of range.
  warnings.warn(
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [([(col("mmsi")) >= (200000000)]) & ([(col("mmsi")) <= (799999999)])] FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS

Data-processing

Say we want to repeat this process on any dataset we read in. Then we should create a damast.core.dataprocessing.DataProcessingPipeline. A pipeline consists of pipeline elements, that is, a sequence of transformations applied to the original dataset. We start by creating a pipeline element that drops all rows missing an "mmsi" entry.

from damast.data_handling.transformers.filters import DropMissingOrNan
from damast.core.dataprocessing import DataProcessingPipeline
pipeline = DataProcessingPipeline(name="Remove rows with missing MMSI",
                                  base_dir="./output_dir",
                                  inplace_transformation=True)
pipeline.add(name="Drop rows with missing MMSI",
             transformer=DropMissingOrNan(),
             name_mappings={"x": "mmsi"})

transformed_adf = pipeline.transform(adf)
transformed_adf
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER col("mmsi").is_not_nan().cast(Boolean) FROM

FILTER col("mmsi").is_not_null().cast(Boolean) FROM

FILTER [([(col("mmsi")) >= (200000000)]) & ([(col("mmsi")) <= (799999999)])] FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS
transformed_adf.collect()
shape: (151_884, 11)
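┌───────────┬─────────────┬────────────┬─────────────────────┬────────────┬───────────┬──────────────┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_utc       ┆ sog        ┆ cog       ┆ true_heading ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ ---                 ┆ ---        ┆ ---       ┆ ---          ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ str                 ┆ f64        ┆ f64       ┆ f64          ┆ i64        ┆ f64 ┆ i64        ┆ str    │
╞═══════════╪═════════════╪════════════╪═════════════════════╪════════════╪═══════════╪══════════════╪════════════╪═════╪════════════╪════════╡
│ 786414538 ┆ 17.631989   ┆ 58.675481  ┆ 2005-04-15 00:00:35 ┆ -18.047639 ┆ -0.545029 ┆ -0.453733    ┆ 7          ┆ 0.0 ┆ 3          ┆ g      │
│ 356986369 ┆ 88.657238   ┆ -62.300499 ┆ 2018-12-14 05:43:21 ┆ -5.879265  ┆ -7.633635 ┆ -7.592419    ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
│ 709754860 ┆ 143.000563  ┆ -21.344114 ┆ 2015-04-17 01:03:32 ┆ 15.993758  ┆ -1.739446 ┆ -1.687773    ┆ 0          ┆ 0.0 ┆ 3          ┆ s      │
│ 739509160 ┆ 105.895329  ┆ 46.555108  ┆ 2016-06-28 16:38:58 ┆ -2.268003  ┆ -1.100663 ┆ -1.077908    ┆ 7          ┆ 0.0 ┆ 3          ┆ s      │
│ 410210939 ┆ -81.439397  ┆ 79.613767  ┆ 1994-03-09 07:35:41 ┆ 7.333727   ┆ -2.018947 ┆ -1.941216    ┆ 0          ┆ 0.0 ┆ 2          ┆ s      │
│ 541629734 ┆ 75.768201   ┆ -48.507674 ┆ 2005-02-25 02:05:58 ┆ -10.501874 ┆ 1.663048  ┆ 1.709874     ┆ 1          ┆ 0.0 ┆ 1          ┆ s      │
│ 333897361 ┆ -115.430138 ┆ -10.509307 ┆ null                ┆ 10.981745  ┆ 2.209766  ┆ 2.265384     ┆ 1          ┆ 0.0 ┆ 2          ┆ s      │
│ 460482459 ┆ -88.303318  ┆ 73.13233   ┆ 1993-08-09 05:40:40 ┆ -4.637709  ┆ 1.269856  ┆ 1.27826      ┆ 1          ┆ 0.0 ┆ 3          ┆ s      │
│ 363148064 ┆ -17.471132  ┆ 8.034439   ┆ 2007-10-06 23:03:57 ┆ -8.744209  ┆ 1.110241  ┆ 1.138796     ┆ 1          ┆ 0.0 ┆ 2          ┆ g      │
│ 452660538 ┆ -172.908716 ┆ 3.038617   ┆ 2009-09-05 09:21:51 ┆ -21.245215 ┆ -7.598973 ┆ -7.556807    ┆ 7          ┆ 0.0 ┆ 1          ┆ g      │
└───────────┴─────────────┴────────────┴─────────────────────┴────────────┴───────────┴──────────────┴────────────┴─────┴────────────┴────────┘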