Usage

We start by exploring the data-processing pipeline part of DAMAST. We consider a manufactured dataset of Automatic Identification System (AIS) messages. The data is generated for 1000 trajectories, where each trajectory contains at least 25 and at most 300 messages.

!pip install damast

import polars
import damast.domains.maritime.ais.data_generator as generator

data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)

The data is stored in a polars.DataFrame, and printing it shows the first and last 5 messages in the dataset.

print(data.dataframe)
shape: (163_933, 11)
┌───────────┬─────────────┬────────────┬──────────────┬───┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_ut ┆ … ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ c            ┆   ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ ---          ┆   ┆ i64        ┆ f64 ┆ i64        ┆ str    │
│           ┆             ┆            ┆ str          ┆   ┆            ┆     ┆            ┆        │
╞═══════════╪═════════════╪════════════╪══════════════╪═══╪════════════╪═════╪════════════╪════════╡
│ 523002531 ┆ 84.043223   ┆ -72.394031 ┆ 1987-03-25   ┆ … ┆ 1          ┆ 0.0 ┆ 3          ┆ g      │
│           ┆             ┆            ┆ 23:22:28     ┆   ┆            ┆     ┆            ┆        │
│ 287450013 ┆ 100.600925  ┆ 14.62509   ┆ 2022-11-19   ┆ … ┆ 7          ┆ 0.0 ┆ 2          ┆ g      │
│           ┆             ┆            ┆ 06:33:47     ┆   ┆            ┆     ┆            ┆        │
│ 276233932 ┆ -126.730308 ┆ -28.529244 ┆ null         ┆ … ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
│ 477676853 ┆ -110.513478 ┆ 78.242938  ┆ 2022-09-05   ┆ … ┆ 0          ┆ 0.0 ┆ 3          ┆ s      │
│           ┆             ┆            ┆ 13:54:03     ┆   ┆            ┆     ┆            ┆        │
│ 482930493 ┆ 4.940892    ┆ -38.987358 ┆ 1990-02-04   ┆ … ┆ 1          ┆ 0.0 ┆ 2          ┆ g      │
│           ┆             ┆            ┆ 20:50:28     ┆   ┆            ┆     ┆            ┆        │
│ …         ┆ …           ┆ …          ┆ …            ┆ … ┆ …          ┆ …   ┆ …          ┆ …      │
│ 200728655 ┆ 53.311805   ┆ 58.28434   ┆ 1987-02-20   ┆ … ┆ 7          ┆ 0.0 ┆ 3          ┆ s      │
│           ┆             ┆            ┆ 00:13:59     ┆   ┆            ┆     ┆            ┆        │
│ 669134243 ┆ 11.608388   ┆ -47.932483 ┆ 2002-05-02   ┆ … ┆ 0          ┆ 0.0 ┆ 3          ┆ g      │
│           ┆             ┆            ┆ 01:11:39     ┆   ┆            ┆     ┆            ┆        │
│ 825600050 ┆ -5.285948   ┆ 32.437435  ┆ 1978-01-31   ┆ … ┆ 7          ┆ 0.0 ┆ 3          ┆ g      │
│           ┆             ┆            ┆ 14:40:12     ┆   ┆            ┆     ┆            ┆        │
│ 498671479 ┆ -74.088281  ┆ -55.334557 ┆ 1988-08-13   ┆ … ┆ 0          ┆ 0.0 ┆ 1          ┆ s      │
│           ┆             ┆            ┆ 04:46:42     ┆   ┆            ┆     ┆            ┆        │
│ 649364375 ┆ -97.57014   ┆ 22.635676  ┆ 2012-05-09   ┆ … ┆ 7          ┆ 0.0 ┆ 1          ┆ s      │
│           ┆             ┆            ┆ 22:17:02     ┆   ┆            ┆     ┆            ┆        │
└───────────┴─────────────┴────────────┴──────────────┴───┴────────────┴─────┴────────────┴────────┘

The dataset consists of 11 columns, which we will go through in detail.

Data-specification

The Maritime Mobile Service Identity (MMSI) is used to identify a ship. It should be a 9-digit number whose first digit is between 2 and 7. The data we have generated should contain some invalid numbers. Let us inspect these.

from damast.domains.maritime.data_specification import MMSI
df = data.dataframe
invalid_mmsis = df.filter((polars.col('mmsi') < MMSI.min_value) | (polars.col('mmsi') > MMSI.max_value))
invalid_mmsis
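shape: (13_540, 11)
┌───────────┬─────────────┬────────────┬─────────────────────┬────────────┬───────────┬──────────────┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_utc       ┆ sog        ┆ cog       ┆ true_heading ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ ---                 ┆ ---        ┆ ---       ┆ ---          ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ str                 ┆ f64        ┆ f64       ┆ f64          ┆ i64        ┆ f64 ┆ i64        ┆ str    │
╞═══════════╪═════════════╪════════════╪═════════════════════╪════════════╪═══════════╪══════════════╪════════════╪═════╪════════════╪════════╡
│ 198705566 ┆ -118.771696 ┆ 81.909949  ┆ 2011-02-02 02:15:55 ┆ 2.570203   ┆ 0.646392  ┆ 0.674465     ┆ 0          ┆ 0.0 ┆ 1          ┆ s      │
│ 827273045 ┆ -70.699883  ┆ 39.136651  ┆ 1995-04-04 03:34:28 ┆ 6.988184   ┆ -2.384334 ┆ -2.357167    ┆ 0          ┆ 0.0 ┆ 1          ┆ g      │
│ 804205360 ┆ -154.201175 ┆ -24.302368 ┆ 2022-10-06 19:11:52 ┆ 0.518591   ┆ -0.217661 ┆ -0.160486    ┆ 1          ┆ 0.0 ┆ 2          ┆ s      │
│ 803969546 ┆ 38.400868   ┆ -15.679453 ┆ 2002-04-09 12:52:00 ┆ -14.112677 ┆ 1.325874  ┆ 1.395642     ┆ 0          ┆ 0.0 ┆ 3          ┆ s      │
│ 830343656 ┆ -85.784809  ┆ 71.457058  ┆ 1992-02-01 17:13:50 ┆ -33.071532 ┆ -1.56768  ┆ -1.499229    ┆ 7          ┆ 0.0 ┆ 1          ┆ s      │
│ 829993812 ┆ 112.955155  ┆ 4.012104   ┆ 2002-08-31 23:51:56 ┆ -10.915988 ┆ -1.994626 ┆ -1.938866    ┆ 0          ┆ 0.0 ┆ 3          ┆ g      │
│ 804205360 ┆ -154.123639 ┆ -23.906636 ┆ 2022-10-06 17:45:06 ┆ -1.175042  ┆ 0.12012   ┆ 0.142647     ┆ 0          ┆ 0.0 ┆ 2          ┆ g      │
│ 195036015 ┆ -146.650398 ┆ -84.236247 ┆ 1995-04-11 20:36:22 ┆ 14.781023  ┆ 1.945304  ┆ 1.993393     ┆ 0          ┆ 0.0 ┆ 2          ┆ s      │
│ 803999517 ┆ 89.686374   ┆ 49.581019  ┆ 2005-09-08 19:22:52 ┆ -9.347322  ┆ -6.926102 ┆ -6.837801    ┆ 7          ┆ 0.0 ┆ 3          ┆ s      │
│ 825600050 ┆ -5.285948   ┆ 32.437435  ┆ 1978-01-31 14:40:12 ┆ -3.111474  ┆ -2.177295 ┆ -2.129191    ┆ 7          ┆ 0.0 ┆ 3          ┆ g      │
└───────────┴─────────────┴────────────┴─────────────────────┴────────────┴───────────┴──────────────┴────────────┴─────┴────────────┴────────┘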
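
The validity rule described above (nine digits, leading digit between 2 and 7) amounts to a plain range check, which is what the filter on MMSI.min_value and MMSI.max_value expresses. The following is a minimal sketch in plain Python, assuming the bounds 200000000 and 799999999 that appear in the filter plans printed in this document. The helper name is_valid_mmsi is hypothetical and not part of damast.

```python
# Bounds assumed from the filter plans printed in this document;
# in damast they are exposed as MMSI.min_value and MMSI.max_value.
MMSI_MIN = 200_000_000
MMSI_MAX = 799_999_999


def is_valid_mmsi(mmsi: int) -> bool:
    """Return True if mmsi has nine digits and a leading digit from 2 to 7."""
    return MMSI_MIN <= mmsi <= MMSI_MAX


# Three MMSI values that appear in the tables of this document.
samples = [523002531, 198705566, 827273045]
flags = [is_valid_mmsi(m) for m in samples]
print(flags)
```

The first value lies inside the range, while the other two start with 1 and 8 respectively, so they show up in the table of invalid entries.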

Before sending this data to a machine learning algorithm, one would have to filter out invalid data. We can do this by creating a damast.core.DataSpecification describing what valid output we would like in our data-frame.

from damast.core import DataSpecification, MinMax
mmsi_spec = DataSpecification(name="mmsi", description="Maritime Mobile Service Identity", representation_type=int,
                              value_range=MinMax(MMSI.min_value, MMSI.max_value))

Here we have specified what the column contains, how the data is represented in Python, and its minimum and maximum valid values. Next, we create a damast.core.MetaData object that we can apply to the dataframe.

from damast.core import MetaData, ValidationMode
metadata = MetaData([mmsi_spec])
metadata.apply(df.lazy(), ValidationMode.UPDATE_DATA)
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:610: UserWarning: DataSpecification.apply: column 'mmsi': expected representation type: <class 'int'>, but got 'Int64'
  warnings.warn(
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:619: UserWarning: Filtering out for column 'mmsi' values that are out of range.
  warnings.warn(
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("mmsi")) <= (799999999)]

FROM

FILTER [(col("mmsi")) >= (200000000)]

FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS

Of course, we do not want to repeat this process manually for every column. Instead, we can create a DataSpecification per column and let the damast.core.AnnotatedDataFrame handle the validation of the data. There are three ways of validating the input data against the metadata:

  • ValidationMode.READONLY: Reads in the data, checks it against the meta-data and throws an error if the data does not adhere to the data-specification.

  • ValidationMode.UPDATE_METADATA: Update the metadata based on the input in the annotated data-frame. This might change the representation type, column name and valid ranges of the data.

  • ValidationMode.UPDATE_DATA: Update data so that it adheres to the meta-data.

from damast.core.metadata import DataCategory
from damast.core.dataframe import AnnotatedDataFrame
dataspec = {
    "annotations": {"comment": "This is an autogenerated test data set"},
    "columns": [
        {"name": "mmsi", "is_optional": False, "category": DataCategory.STATIC,
         "value_range":{"MinMax": {"min": MMSI.min_value, "max": MMSI.max_value}}},
        {"name": "lon", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "lat", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "date_time_utc", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "sog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "cog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "true_heading", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "nav_status", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "rot", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "message_nr", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "source", "is_optional": False, "category": DataCategory.DYNAMIC},
    ]
}
metadata = MetaData.from_dict(dataspec)
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
adf = AnnotatedDataFrame(data.dataframe, metadata, validation_mode=ValidationMode.UPDATE_DATA)
adf
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:619: UserWarning: Filtering out for column 'mmsi' values that are out of range.
  warnings.warn(
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [([(col("mmsi")) >= (200000000)]) & ([(col("mmsi")) <= (799999999)])]

FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS
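
The difference between the three validation modes listed above can be pictured with a small sketch in plain Python. It is not damast's implementation: the names Mode, Range and validate are hypothetical stand-ins, and the sketch covers only the value range, whereas UPDATE_METADATA in damast may also adjust the representation type and column names.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mode(Enum):
    """Hypothetical stand-in for damast's ValidationMode."""
    READONLY = auto()
    UPDATE_METADATA = auto()
    UPDATE_DATA = auto()


@dataclass
class Range:
    """Hypothetical stand-in for a MinMax value range."""
    min: int
    max: int

    def contains(self, value: int) -> bool:
        return self.min <= value <= self.max


def validate(values: list[int], spec: Range, mode: Mode) -> tuple[list[int], Range]:
    """Return the (possibly filtered) values and the (possibly widened) range."""
    if all(spec.contains(v) for v in values):
        return values, spec
    if mode is Mode.READONLY:
        raise ValueError("data does not adhere to the specification")
    if mode is Mode.UPDATE_METADATA:
        # Keep the data, widen the range until every value fits.
        return values, Range(min(spec.min, *values), max(spec.max, *values))
    # Mode.UPDATE_DATA: keep the range, drop the offending values.
    return [v for v in values if spec.contains(v)], spec


spec = Range(200_000_000, 799_999_999)
mmsis = [523002531, 198705566, 827273045]

kept, _ = validate(mmsis, spec, Mode.UPDATE_DATA)
_, widened = validate(mmsis, spec, Mode.UPDATE_METADATA)
print(kept)     # only the in-range MMSI remains
print(widened)  # the range now spans all three values
```

With Mode.READONLY the same call raises an error instead, which is the appropriate choice when the input is expected to be clean already.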

Data-processing

Say we want to repeat this process on any dataset we read in. Then we should create a damast.core.dataprocessing.DataProcessingPipeline. A pipeline consists of pipeline elements, each of which is a transformation applied to the dataset. We start by creating a pipeline element that drops all rows missing an "mmsi" entry.

from damast.data_handling.transformers.filters import DropMissingOrNan
from damast.core.dataprocessing import DataProcessingPipeline
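# Build a pipeline that drops every row whose "mmsi" entry is missing or NaN
pipeline = DataProcessingPipeline(name="Remove rows with missing MMSI",
                                  base_dir="./output_dir",
                                  inplace_transformation=True)
# name_mappings binds the transformer's input "x" to the "mmsi" column
pipeline.add(name="Remove rows with missing MMSI",
             transformer=DropMissingOrNan(),
             name_mappings={"x": "mmsi"})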

transformed_adf = pipeline.transform(adf)
transformed_adf
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("mmsi").is_nan().cast(Boolean)) !=v (true)]

FROM

FILTER col("mmsi").is_not_null().cast(Boolean)

FROM

FILTER [([(col("mmsi")) >= (200000000)]) & ([(col("mmsi")) <= (799999999)])]

FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS
transformed_adf.collect()
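shape: (153_858, 11)
┌───────────┬─────────────┬────────────┬─────────────────────┬────────────┬───────────┬──────────────┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_utc       ┆ sog        ┆ cog       ┆ true_heading ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ ---                 ┆ ---        ┆ ---       ┆ ---          ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ str                 ┆ f64        ┆ f64       ┆ f64          ┆ i64        ┆ f64 ┆ i64        ┆ str    │
╞═══════════╪═════════════╪════════════╪═════════════════════╪════════════╪═══════════╪══════════════╪════════════╪═════╪════════════╪════════╡
│ 232532191 ┆ -55.655376  ┆ -75.263977 ┆ null                ┆ -12.912959 ┆ 1.634495  ┆ 1.702891     ┆ 0          ┆ 0.0 ┆ 3          ┆ s      │
│ 537340040 ┆ 6.087661    ┆ -87.18102  ┆ 2001-08-06 06:16:41 ┆ 5.342883   ┆ 0.800691  ┆ 0.88829      ┆ 0          ┆ 0.0 ┆ 3          ┆ s      │
│ 509697426 ┆ -2.112515   ┆ 38.019535  ┆ 2010-06-17 06:57:07 ┆ 8.157707   ┆ 9.176341  ┆ 9.245326     ┆ 0          ┆ 0.0 ┆ 2          ┆ g      │
│ 615806863 ┆ 27.988985   ┆ -4.022445  ┆ 1983-11-02 17:45:43 ┆ 10.965288  ┆ -1.094736 ┆ -1.089786    ┆ 1          ┆ 0.0 ┆ 1          ┆ g      │
│ 682290790 ┆ 105.0741    ┆ -70.308761 ┆ 1991-03-10 06:34:15 ┆ -5.837299  ┆ -2.362193 ┆ -2.309975    ┆ 1          ┆ 0.0 ┆ 3          ┆ g      │
│ 657959055 ┆ -169.71815  ┆ -73.962004 ┆ null                ┆ -4.436311  ┆ -0.203366 ┆ -0.129942    ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
│ 239601674 ┆ 120.106768  ┆ 47.458229  ┆ 1972-02-21 04:30:09 ┆ -23.432357 ┆ 2.649652  ┆ 2.706745     ┆ 0          ┆ 0.0 ┆ 2          ┆ s      │
│ 724859151 ┆ -173.404494 ┆ -23.707157 ┆ 1997-06-13 13:25:52 ┆ -10.534615 ┆ -0.464212 ┆ -0.392389    ┆ 0          ┆ 0.0 ┆ 3          ┆ g      │
│ 437856144 ┆ -60.871687  ┆ -70.392893 ┆ 1992-06-08 14:59:34 ┆ 15.998613  ┆ -4.572923 ┆ -4.57159     ┆ 0          ┆ 0.0 ┆ 1          ┆ s      │
│ 488927194 ┆ -165.977432 ┆ 54.476965  ┆ 2007-02-03 20:36:20 ┆ -9.670403  ┆ -0.534671 ┆ -0.453232    ┆ 1          ┆ 0.0 ┆ 2          ┆ s      │
└───────────┴─────────────┴────────────┴─────────────────────┴────────────┴───────────┴──────────────┴────────────┴─────┴────────────┴────────┘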
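
The idea of a pipeline as an ordered sequence of named transformations can be sketched in plain Python. This is only an illustration of the concept under simplifying assumptions: MiniPipeline and drop_missing_or_nan are hypothetical names, rows are plain dictionaries rather than a polars frame, and the input rows are made up for the example rather than taken from the generated dataset.

```python
import math
from typing import Callable, Optional

Row = dict[str, Optional[float]]
Step = Callable[[list[Row]], list[Row]]


def drop_missing_or_nan(column: str) -> Step:
    """Build a step that drops rows whose `column` is None or NaN."""
    def step(rows: list[Row]) -> list[Row]:
        return [
            r for r in rows
            if r.get(column) is not None and not math.isnan(r[column])
        ]
    return step


class MiniPipeline:
    """Hypothetical, much simplified stand-in for a processing pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.steps: list[tuple[str, Step]] = []

    def add(self, name: str, step: Step) -> "MiniPipeline":
        self.steps.append((name, step))
        return self

    def transform(self, rows: list[Row]) -> list[Row]:
        # Apply every step in the order it was added.
        for _, step in self.steps:
            rows = step(rows)
        return rows


# Illustrative rows: one complete, one with a missing and one with a NaN mmsi.
rows = [
    {"mmsi": 232532191, "lon": -55.655376},
    {"mmsi": None, "lon": 6.087661},
    {"mmsi": float("nan"), "lon": -2.112515},
]
pipeline = MiniPipeline("Drop rows without MMSI")
pipeline.add("drop missing mmsi", drop_missing_or_nan("mmsi"))
result = pipeline.transform(rows)
print(result)
```

Because each step only maps a list of rows to a new list of rows, further steps (for example a range filter on "mmsi") can be appended without changing the ones already registered.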