Usage

We start by exploring the data-processing pipeline part of DAMAST. We consider a manufactured dataset of Automatic Identification System (AIS) messages. The data is generated for 1000 trajectories, where the minimal length of a trajectory is 25 messages and the maximal length is 300.

!pip install damast

import polars
import damast.domains.maritime.ais.data_generator as generator

data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
Requirement already satisfied: damast in /home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages (0.2.2)

The data is stored in a polars.DataFrame. Printing it shows the first and last 5 messages in the dataset.

print(data.dataframe)
shape: (165_452, 11)
┌───────────┬─────────────┬────────────┬──────────────┬───┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_ut ┆ … ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ c            ┆   ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ ---          ┆   ┆ i64        ┆ f64 ┆ i64        ┆ str    │
│           ┆             ┆            ┆ str          ┆   ┆            ┆     ┆            ┆        │
╞═══════════╪═════════════╪════════════╪══════════════╪═══╪════════════╪═════╪════════════╪════════╡
│ 486478847 ┆ 83.307808   ┆ 4.705502   ┆ 1984-09-23   ┆ … ┆ 7          ┆ 0.0 ┆ 2          ┆ s      │
│           ┆             ┆            ┆ 05:06:46     ┆   ┆            ┆     ┆            ┆        │
│ 610768320 ┆ 64.477319   ┆ 79.461065  ┆ null         ┆ … ┆ 7          ┆ 0.0 ┆ 1          ┆ g      │
│ 648202497 ┆ 46.24024    ┆ 59.350363  ┆ 2002-10-17   ┆ … ┆ 7          ┆ 0.0 ┆ 1          ┆ s      │
│           ┆             ┆            ┆ 16:31:55     ┆   ┆            ┆     ┆            ┆        │
│ 212351073 ┆ 128.709309  ┆ 49.964669  ┆ 2002-01-09   ┆ … ┆ 7          ┆ 0.0 ┆ 1          ┆ g      │
│           ┆             ┆            ┆ 17:29:55     ┆   ┆            ┆     ┆            ┆        │
│ 781140504 ┆ -88.805079  ┆ -7.356065  ┆ 2004-08-24   ┆ … ┆ 7          ┆ 0.0 ┆ 3          ┆ s      │
│           ┆             ┆            ┆ 09:34:46     ┆   ┆            ┆     ┆            ┆        │
│ …         ┆ …           ┆ …          ┆ …            ┆ … ┆ …          ┆ …   ┆ …          ┆ …      │
│ 338491727 ┆ -131.703793 ┆ 54.714471  ┆ null         ┆ … ┆ 1          ┆ 0.0 ┆ 1          ┆ g      │
│ 477733856 ┆ 145.991335  ┆ -74.410873 ┆ 2009-01-08   ┆ … ┆ 1          ┆ 0.0 ┆ 2          ┆ s      │
│           ┆             ┆            ┆ 17:54:39     ┆   ┆            ┆     ┆            ┆        │
│ 508836094 ┆ -39.184297  ┆ 12.108261  ┆ 2021-11-09   ┆ … ┆ 1          ┆ 0.0 ┆ 3          ┆ g      │
│           ┆             ┆            ┆ 12:07:59     ┆   ┆            ┆     ┆            ┆        │
│ 392885921 ┆ -105.371133 ┆ -3.488945  ┆ null         ┆ … ┆ 1          ┆ 0.0 ┆ 3          ┆ g      │
│ 604211131 ┆ 44.432725   ┆ -79.399875 ┆ 1992-07-11   ┆ … ┆ 0          ┆ 0.0 ┆ 1          ┆ s      │
│           ┆             ┆            ┆ 07:24:10     ┆   ┆            ┆     ┆            ┆        │
└───────────┴─────────────┴────────────┴──────────────┴───┴────────────┴─────┴────────────┴────────┘

The dataset consists of 11 columns, which we will go through in detail.

Data-specification

The first column is the Maritime Mobile Service Identity (MMSI), which is used to identify a ship. It should be a 9-digit number whose first digit is between 2 and 7. The data we have generated should contain some invalid numbers. Let us inspect these.

from damast.domains.maritime.data_specification import MMSI
df = data.dataframe
invalid_mmsis = df.filter((polars.col('mmsi') < MMSI.min_value) | (polars.col('mmsi') > MMSI.max_value))
invalid_mmsis
shape: (14_445, 11)
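┌───────────┬─────────────┬────────────┬───────────────────────┬────────────┬───────────┬──────────────┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_utc         ┆ sog        ┆ cog       ┆ true_heading ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ ---                   ┆ ---        ┆ ---       ┆ ---          ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ str                   ┆ f64        ┆ f64       ┆ f64          ┆ i64        ┆ f64 ┆ i64        ┆ str    │
╞═══════════╪═════════════╪════════════╪═══════════════════════╪════════════╪═══════════╪══════════════╪════════════╪═════╪════════════╪════════╡
│ 196769883 ┆ -80.242694  ┆ -69.066486 ┆ "2016-06-08 02:52:48" ┆ 6.099882   ┆ -2.246309 ┆ -2.211304    ┆ 7          ┆ 0.0 ┆ 1          ┆ "s"    │
│ 824829201 ┆ -161.198518 ┆ -38.791761 ┆ "2017-11-05 13:33:57" ┆ 0.85752    ┆ -2.568307 ┆ -2.488295    ┆ 0          ┆ 0.0 ┆ 2          ┆ "s"    │
│ 190442895 ┆ 85.001341   ┆ -20.221326 ┆ "1974-01-13 16:45:49" ┆ -22.191444 ┆ 0.473822  ┆ 0.551163     ┆ 0          ┆ 0.0 ┆ 1          ┆ "s"    │
│ 801016426 ┆ -123.278066 ┆ -55.755871 ┆ "2009-07-30 08:34:42" ┆ -5.936636  ┆ -7.997424 ┆ -7.977864    ┆ 1          ┆ 0.0 ┆ 1          ┆ "g"    │
│ 194280642 ┆ 89.724502   ┆ -5.627072  ┆ "1998-01-29 02:04:37" ┆ -3.132864  ┆ -0.596275 ┆ -0.503781    ┆ 7          ┆ 0.0 ┆ 1          ┆ "s"    │
│ 803305007 ┆ -124.909831 ┆ 41.183272  ┆ "2004-12-22 21:54:50" ┆ 1.418038   ┆ 5.762956  ┆ 5.771532     ┆ 7          ┆ 0.0 ┆ 3          ┆ "s"    │
│ 808419952 ┆ 30.854698   ┆ 27.444489  ┆ "2002-11-18 20:13:10" ┆ 1.431277   ┆ 0.562487  ┆ 0.574919     ┆ 7          ┆ 0.0 ┆ 1          ┆ "g"    │
│ 804130533 ┆ -173.868082 ┆ -12.230837 ┆ "1983-10-29 04:24:10" ┆ 7.010262   ┆ -0.732165 ┆ -0.65194     ┆ 1          ┆ 0.0 ┆ 1          ┆ "s"    │
│ 808419952 ┆ 31.174141   ┆ 27.702581  ┆ "2002-11-18 21:07:29" ┆ 9.014363   ┆ 0.958105  ┆ 1.041199     ┆ 0          ┆ 0.0 ┆ 1          ┆ "g"    │
│ 835139433 ┆ -101.6157   ┆ 0.530259   ┆ "1990-04-19 06:56:04" ┆ 27.54008   ┆ -4.823291 ┆ -4.740084    ┆ 7          ┆ 0.0 ┆ 2          ┆ "g"    │
└───────────┴─────────────┴────────────┴───────────────────────┴────────────┴───────────┴──────────────┴────────────┴─────┴────────────┴────────┘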
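
The validity rule stated above (a 9-digit number whose first digit lies between 2 and 7) is equivalent to a closed numeric interval, which is why a simple min/max comparison is enough to find invalid entries. The following plain-Python sketch illustrates the equivalence. The bounds 200000000 and 799999999 are the ones that appear in the FILTER plans printed on this page and are assumed to equal MMSI.min_value and MMSI.max_value; the helper names are hypothetical and not part of DAMAST.

```python
# Hypothetical helpers for illustration only; they are not part of DAMAST.
MMSI_MIN = 200_000_000  # assumed to match MMSI.min_value (see the FILTER plans)
MMSI_MAX = 799_999_999  # assumed to match MMSI.max_value


def is_valid_mmsi_by_digits(mmsi: int) -> bool:
    """Check the rule as stated in prose: nine digits, first digit in 2..7."""
    text = str(mmsi)
    return mmsi > 0 and len(text) == 9 and text[0] in "234567"


def is_valid_mmsi_by_range(mmsi: int) -> bool:
    """Check the same rule as a closed numeric interval."""
    return MMSI_MIN <= mmsi <= MMSI_MAX


# Values taken from the tables on this page, plus the interval boundaries.
samples = [196769883, 824829201, 257723004, 199_999_999, 200_000_000,
           799_999_999, 800_000_000, 12345]
for value in samples:
    # Both formulations must agree on every sample.
    assert is_valid_mmsi_by_digits(value) == is_valid_mmsi_by_range(value)

print([value for value in samples if is_valid_mmsi_by_range(value)])
```

The range form is the one that translates directly into a column filter, since it needs no string conversion.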

Before sending this data to a machine learning algorithm, one would have to filter out invalid data. We can do this by creating a damast.core.DataSpecification describing what valid output we would like in our data-frame.

from damast.core import DataSpecification, MinMax
mmsi_spec = DataSpecification(name="mmsi", description="Maritime Mobile Service Identity", representation_type=int,
                              value_range=MinMax(MMSI.min_value, MMSI.max_value))

Here we have specified what the column represents, how the data is represented in Python, and its minimum and maximum valid values. Next, we create a damast.core.MetaData object that we can apply to the dataframe.

from damast.core import MetaData, ValidationMode
metadata = MetaData([mmsi_spec])
metadata.apply(df.lazy(), ValidationMode.UPDATE_DATA)
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:617: UserWarning: DataSpecification.apply: column 'mmsi': expected representation type: <class 'int'>, but got 'Int64'
  warnings.warn(
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("mmsi")) <= (799999999)]

FROM

FILTER [(col("mmsi")) >= (200000000)]

FROM

WITH_COLUMNS:

[col("mmsi")]

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS

Of course, we do not want to repeat this process manually for each column. Instead, we can create a DataSpecification per column and let the damast.core.AnnotatedDataFrame handle the validation of the data. There are three ways of handling input data together with its metadata:

  • ValidationMode.READONLY: Reads in the data, checks it against the meta-data and throws an error if the data does not adhere to the data-specification.

  • ValidationMode.UPDATE_METADATA: Update the metadata based on the input in the annotated data-frame. This might change the representation type, column name and valid ranges of the data.

  • ValidationMode.UPDATE_DATA: Update data so that it adheres to the meta-data.

from damast.core.metadata import DataCategory
from damast.core.dataframe import AnnotatedDataFrame
dataspec = {
    "annotations": {"comment": "This is an autogenerated test data set"},
    "columns": [
        {"name": "mmsi", "is_optional": False, "category": DataCategory.STATIC,
         "value_range":{"MinMax": {"min": MMSI.min_value, "max": MMSI.max_value}}},
        {"name": "lon", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "lat", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "date_time_utc", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "sog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "cog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "true_heading", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "nav_status", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "rot", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "message_nr", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "source", "is_optional": False, "category": DataCategory.DYNAMIC},
    ]
}
metadata = MetaData.from_dict(dataspec)
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
adf = AnnotatedDataFrame(data.dataframe, metadata, validation_mode=ValidationMode.UPDATE_DATA)
adf
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [([(col("mmsi")) >= (200000000)]) & ([(col("mmsi")) <= (799999999)])]

FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS
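
To make the difference between the three validation modes concrete, here is a small plain-Python sketch that applies a min/max range to a list of numbers. It is not the DAMAST implementation: the `Mode` enum and the `apply_range` function are hypothetical, and the way `UPDATE_METADATA` widens the range is only an assumption about what "updating the metadata" could mean for a value range.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for the three validation modes."""
    READONLY = "readonly"
    UPDATE_METADATA = "update_metadata"
    UPDATE_DATA = "update_data"


def apply_range(values, min_value, max_value, mode):
    """Apply a min/max range to a list of numbers.

    Returns the (possibly filtered) values and the (possibly widened) range.
    """
    in_range = [v for v in values if min_value <= v <= max_value]
    if mode is Mode.READONLY:
        # Neither data nor metadata may change: fail on any violation.
        if len(in_range) != len(values):
            raise ValueError("data does not adhere to the specification")
        return list(values), (min_value, max_value)
    if mode is Mode.UPDATE_METADATA:
        # Keep all data and widen the range so that it covers every value.
        return list(values), (min([min_value, *values]), max([max_value, *values]))
    # Mode.UPDATE_DATA: keep the range and drop the values that violate it.
    return in_range, (min_value, max_value)


mmsis = [196769883, 257723004, 824829201]
print(apply_range(mmsis, 200_000_000, 799_999_999, Mode.UPDATE_DATA))
print(apply_range(mmsis, 200_000_000, 799_999_999, Mode.UPDATE_METADATA))
```

With `Mode.READONLY` the same call raises a `ValueError`, because two of the three sample values lie outside the range.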

Data-processing

Say we want to repeat this process on any dataset we read in. Then we should create a damast.core.dataprocessing.DataProcessingPipeline. A pipeline consists of pipeline elements, that is, a sequence of transformations applied to the original dataset. We start by creating a pipeline element that drops all rows missing an "mmsi" entry.

from damast.data_handling.transformers.filters import DropMissingOrNan
from damast.core.dataprocessing import DataProcessingPipeline
pipeline = DataProcessingPipeline(name="Remove rows with missing MMSI",
                                  base_dir="./output_dir",
                                  inplace_transformation=True)
pipeline.add(name="Drop rows without MMSI",
             transformer=DropMissingOrNan(),
             name_mappings={"x": "mmsi"})

transformed_adf = pipeline.transform(adf)
transformed_adf
Step :   0%|          | 0/2 [00:00<?, ?it/s]
Step : 100%|██████████| 2/2 [00:00<00:00, 380.68it/s]

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("mmsi").is_nan()) !=v (true)]

FROM

FILTER col("mmsi").is_not_null()

FROM

FILTER [([(col("mmsi")) >= (200000000)]) & ([(col("mmsi")) <= (799999999)])]

FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS
transformed_adf.collect()
shape: (150_879, 11)
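┌───────────┬─────────────┬────────────┬───────────────────────┬────────────┬───────────┬──────────────┬────────────┬─────┬────────────┬────────┐
│ mmsi      ┆ lon         ┆ lat        ┆ date_time_utc         ┆ sog        ┆ cog       ┆ true_heading ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ ---       ┆ ---         ┆ ---        ┆ ---                   ┆ ---        ┆ ---       ┆ ---          ┆ ---        ┆ --- ┆ ---        ┆ ---    │
│ i64       ┆ f64         ┆ f64        ┆ str                   ┆ f64        ┆ f64       ┆ f64          ┆ i64        ┆ f64 ┆ i64        ┆ str    │
╞═══════════╪═════════════╪════════════╪═══════════════════════╪════════════╪═══════════╪══════════════╪════════════╪═════╪════════════╪════════╡
│ 712848322 ┆ -89.520826  ┆ 14.709628  ┆ null                  ┆ 10.704932  ┆ 0.902526  ┆ 1.000122     ┆ 1          ┆ 0.0 ┆ 2          ┆ "s"    │
│ 257723004 ┆ 124.075699  ┆ -16.237677 ┆ "1975-03-21 14:00:39" ┆ 0.444664   ┆ 1.502741  ┆ 1.542041     ┆ 0          ┆ 0.0 ┆ 3          ┆ "s"    │
│ 424914353 ┆ 155.893684  ┆ 70.32585   ┆ "1970-04-30 19:30:46" ┆ -1.052104  ┆ -1.639079 ┆ -1.634371    ┆ 0          ┆ 0.0 ┆ 2          ┆ "g"    │
│ 647624067 ┆ 69.648262   ┆ -73.055493 ┆ "1981-12-03 20:20:27" ┆ -26.992568 ┆ -0.150494 ┆ -0.120977    ┆ 1          ┆ 0.0 ┆ 3          ┆ "g"    │
│ 440698163 ┆ -27.966634  ┆ -49.491982 ┆ "2019-11-11 17:03:07" ┆ 2.997805   ┆ -2.698211 ┆ -2.691688    ┆ 0          ┆ 0.0 ┆ 2          ┆ "s"    │
│ 435770073 ┆ 95.548747   ┆ 57.068609  ┆ "1978-02-06 12:30:52" ┆ -10.786823 ┆ 0.430759  ┆ 0.48313      ┆ 7          ┆ 0.0 ┆ 3          ┆ "s"    │
│ 532011153 ┆ 106.721493  ┆ -11.28358  ┆ "1981-01-29 04:16:16" ┆ -2.030162  ┆ -0.456034 ┆ -0.385691    ┆ 7          ┆ 0.0 ┆ 3          ┆ "s"    │
│ 245053937 ┆ -172.162589 ┆ 62.672262  ┆ null                  ┆ 14.855158  ┆ 4.969781  ┆ 5.052707     ┆ 7          ┆ 0.0 ┆ 2          ┆ "g"    │
│ 720706096 ┆ -125.136669 ┆ 16.271604  ┆ "1994-02-11 22:12:07" ┆ -13.48994  ┆ 4.442092  ┆ 4.517203     ┆ 0          ┆ 0.0 ┆ 3          ┆ "s"    │
│ 748793350 ┆ 10.696839   ┆ 88.475503  ┆ "1987-06-28 23:40:10" ┆ 8.619029   ┆ -4.207975 ┆ -4.156391    ┆ 0          ┆ 0.0 ┆ 2          ┆ "g"    │
└───────────┴─────────────┴────────────┴───────────────────────┴────────────┴───────────┴──────────────┴────────────┴─────┴────────────┴────────┘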
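
The idea of a pipeline as an ordered sequence of named transformations can be sketched without DAMAST or polars. The example below is only an illustration under simplifying assumptions: `MiniPipeline` and `drop_missing_or_nan` are hypothetical names, rows are plain dictionaries, and metadata handling, name mappings, and lazy evaluation are left out entirely.

```python
import math


def drop_missing_or_nan(rows, column):
    """Keep only rows whose value in `column` is neither None nor NaN."""
    kept = []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        kept.append(row)
    return kept


class MiniPipeline:
    """Hypothetical, heavily simplified stand-in for a processing pipeline."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # (step name, function, column) in execution order

    def add(self, name, function, column):
        self.steps.append((name, function, column))
        return self  # returning self allows chained .add(...) calls

    def transform(self, rows):
        # Apply every step in the order in which it was added.
        for _name, function, column in self.steps:
            rows = function(rows, column)
        return rows


rows = [{"mmsi": 257723004}, {"mmsi": None}, {"mmsi": float("nan")}, {"mmsi": 424914353}]
mini_pipeline = MiniPipeline("drop rows without MMSI").add(
    "drop_missing", drop_missing_or_nan, "mmsi")
print(mini_pipeline.transform(rows))
```

Because the steps are stored rather than executed immediately, the same pipeline object can be reused on any dataset with the same columns.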

Data Pipelines with multiple input sources

When multiple input sources exist and should be merged, a join operator (transformer) can be designed. The join can, but need not, involve two pipelines, as illustrated in the following. The other pipeline is provided through the ‘data_source’ argument, but this is optional.

import damast
from damast.core.transformations import PipelineElement

class JoinByTime(PipelineElement):
    @damast.core.describe("Join data by timestamp")
    @damast.core.input({
        "timestamp": {},
        "lon": {},
        "lat": {},
    })
    @damast.core.input({
        "timestamp": {},
        "lat": {},
        "lon": {},
    }, label='other')
    @damast.core.output({})
    def transform(self, df: AnnotatedDataFrame, other: AnnotatedDataFrame) -> AnnotatedDataFrame:
        # Look up the timestamp column name of each data source via the name mappings
        other_timestamp = self.get_name('timestamp', datasource='other')
        df_timestamp = self.get_name('timestamp')

        # Join both inputs on their (possibly differently named) timestamp columns
        df._dataframe = df.join(other._dataframe, left_on=df_timestamp, right_on=other_timestamp)
        return df

event_time = data.dataframe.drop_nulls().select(polars.col('date_time_utc')).item(0,0)
events_dataframe = polars.from_dict({'latitude': [40.0, 40.1], 'longitude': [10.0,10.2], 'timestamp': [event_time, event_time], 'event_type': ["accident", "accident"]})
events_metadata = AnnotatedDataFrame.infer_annotation(events_dataframe)
events_adf = AnnotatedDataFrame(events_dataframe, metadata=events_metadata)
Extract str and categorical column metadata:   0%|          | 0/4 [00:00<?, ?it/s]
Extract str and categorical column metadata: 100%|██████████| 4/4 [00:00<00:00, 3467.80it/s]

Extract numeric column metadata:   0%|          | 0/2 [00:00<?, ?it/s]
Extract numeric column metadata: 100%|██████████| 2/2 [00:00<00:00, 11052.18it/s]

from damast.data_handling.transformers.cycle_transformer import CycleTransformer

events_pipeline = DataProcessingPipeline(name="events",
                                         base_dir="./output_dir") \
    .add("lat_cycle_transform", CycleTransformer(n=180), name_mappings={'x': 'latitude'}) \
    .add("lon_cycle_transform", CycleTransformer(n=90), name_mappings={'x': 'longitude'})

pipeline = DataProcessingPipeline(name="ais_events_merge",
                                  base_dir="./output_dir") \
    .add("lat_cycle_transform", CycleTransformer(n=180), name_mappings={'x': 'lat'}) \
    .add("lon_cycle_transform", CycleTransformer(n=90), name_mappings={'x': 'lon'}) \
    .join("events", data_source=events_pipeline, operator=JoinByTime(),
              name_mappings = {
                  'df': {
                      "timestamp": "date_time_utc",
                  },
                  'other': {
                      "timestamp": "timestamp",
                      "lon": "longitude",
                      "lat": "latitude"
                  }
              },
    )

To run the pipeline, all required data sources (inputs) need to be provided as arguments. While the default input is ‘df’, the data source for the ‘join’ operator must be provided via the keyword of the same name, here ‘events’.

joined_adf = pipeline.transform(df=adf, events=events_adf)
Step :   0%|          | 0/7 [00:00<?, ?it/s]
Step : 100%|██████████| 7/7 [00:00<00:00, 147.78it/s]

joined_adf.head(10)
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

SLICE[offset: 0, len: 10]

INNER JOIN:

LEFT PLAN ON: [col("date_time_utc")]

WITH_COLUMNS:

[[([([(col("lon")) * (2.0)]) * (3.141593)].python_udf()) / (90.0)].alias("lon_x"), [([([(col("lon")) * (2.0)]) * (3.141593)].python_udf()) / (90.0)].alias("lon_y")]

WITH_COLUMNS:

[[([([(col("lat")) * (2.0)]) * (3.141593)].python_udf()) / (180.0)].alias("lat_x"), [([([(col("lat")) * (2.0)]) * (3.141593)].python_udf()) / (180.0)].alias("lat_y")]

FILTER [(col("mmsi").is_nan()) !=v (true)]

FROM

FILTER col("mmsi").is_not_null()

FROM

FILTER [(col("mmsi")) <= (799999999)]

FROM

FILTER [(col("mmsi")) >= (200000000)]

FROM

DF ["mmsi", "lon", "lat", "date_time_utc", ...]; PROJECT */11 COLUMNS

RIGHT PLAN ON: [col("timestamp")]

WITH_COLUMNS:

[[([([(col("longitude")) * (2.0)]) * (3.141593)].python_udf()) / (90.0)].alias("longitude_x"), [([([(col("longitude")) * (2.0)]) * (3.141593)].python_udf()) / (90.0)].alias("longitude_y")]

WITH_COLUMNS:

[[([([(col("latitude")) * (2.0)]) * (3.141593)].python_udf()) / (180.0)].alias("latitude_x"), [([([(col("latitude")) * (2.0)]) * (3.141593)].python_udf()) / (180.0)].alias("latitude_y")]

DF ["latitude", "longitude", "timestamp", "event_type"]; PROJECT */4 COLUMNS

END INNER JOIN
joined_adf.head(10).collect()
shape: (2, 22)
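┌───────────┬────────────┬────────────┬───────────────────────┬──────────┬──────────┬──────────────┬────────────┬─────┬────────────┬────────┬─────────┬─────────┬──────────┬──────────┬──────────┬───────────┬────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ mmsi      ┆ lon        ┆ lat        ┆ date_time_utc         ┆ sog      ┆ cog      ┆ true_heading ┆ nav_status ┆ rot ┆ message_nr ┆ source ┆ lat_x   ┆ lat_y   ┆ lon_x    ┆ lon_y    ┆ latitude ┆ longitude ┆ event_type ┆ latitude_x ┆ latitude_y ┆ longitude_x ┆ longitude_y │
│ ---       ┆ ---        ┆ ---        ┆ ---                   ┆ ---      ┆ ---      ┆ ---          ┆ ---        ┆ --- ┆ ---        ┆ ---    ┆ ---     ┆ ---     ┆ ---      ┆ ---      ┆ ---      ┆ ---       ┆ ---        ┆ ---        ┆ ---        ┆ ---         ┆ ---         │
│ i64       ┆ f64        ┆ f64        ┆ str                   ┆ f64      ┆ f64      ┆ f64          ┆ i64        ┆ f64 ┆ i64        ┆ str    ┆ f64     ┆ f64     ┆ f64      ┆ f64      ┆ f64      ┆ f64       ┆ str        ┆ f64        ┆ f64        ┆ f64         ┆ f64         │
╞═══════════╪════════════╪════════════╪═══════════════════════╪══════════╪══════════╪══════════════╪════════════╪═════╪════════════╪════════╪═════════╪═════════╪══════════╪══════════╪══════════╪═══════════╪════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 257723004 ┆ 124.075699 ┆ -16.237677 ┆ "1975-03-21 14:00:39" ┆ 0.444664 ┆ 1.502741 ┆ 1.542041     ┆ 0          ┆ 0.0 ┆ 3          ┆ "s"    ┆ 0.00043 ┆ 0.00043 ┆ 0.009878 ┆ 0.009878 ┆ 40.0     ┆ 10.0      ┆ "accident" ┆ 0.005556   ┆ 0.005556   ┆ 0.011111    ┆ 0.011111    │
│ 257723004 ┆ 124.075699 ┆ -16.237677 ┆ "1975-03-21 14:00:39" ┆ 0.444664 ┆ 1.502741 ┆ 1.542041     ┆ 0          ┆ 0.0 ┆ 3          ┆ "s"    ┆ 0.00043 ┆ 0.00043 ┆ 0.009878 ┆ 0.009878 ┆ 40.1     ┆ 10.2      ┆ "accident" ┆ 0.004495   ┆ 0.004495   ┆ 0.003434    ┆ 0.003434    │
└───────────┴────────────┴────────────┴───────────────────────┴──────────┴──────────┴──────────────┴────────────┴─────┴────────────┴────────┴─────────┴─────────┴──────────┴──────────┴──────────┴───────────┴────────────┴────────────┴────────────┴─────────────┴─────────────┘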
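
The query plan above shows an inner join with date_time_utc as the left key and timestamp as the right key. What such a join does can be sketched in plain Python on lists of dictionaries. The function `inner_join` below is hypothetical and is neither the DAMAST nor the polars implementation; the sample rows reuse a few values from the tables on this page.

```python
def inner_join(left_rows, right_rows, left_on, right_on):
    """Inner-join two lists of dicts on columns that may have different names."""
    # Index the right-hand rows by their join key for fast lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_on], []).append(row)

    joined = []
    for left in left_rows:
        # Left rows without a matching key produce no output (inner join).
        for right in index.get(left[left_on], []):
            merged = dict(left)
            # Copy the right-hand columns, skipping the redundant join key.
            merged.update({k: v for k, v in right.items() if k != right_on})
            joined.append(merged)
    return joined


ais = [
    {"mmsi": 257723004, "date_time_utc": "1975-03-21 14:00:39"},
    {"mmsi": 424914353, "date_time_utc": "1970-04-30 19:30:46"},
]
events = [
    {"timestamp": "1975-03-21 14:00:39", "latitude": 40.0, "event_type": "accident"},
    {"timestamp": "1975-03-21 14:00:39", "latitude": 40.1, "event_type": "accident"},
]
result = inner_join(ais, events, left_on="date_time_utc", right_on="timestamp")
print(len(result))
```

Each AIS message is paired with every event that has the same timestamp, which is why one message matched by two events yields two rows. This sketch does not handle clashing column names: a right-hand column with the same name as a left-hand column would overwrite it.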