Usage
We start by exploring the data-processing pipeline part of DAMAST.
We consider a manufactured dataset of Automatic Identification System (AIS) messages.
The data is generated for 1000 trajectories, where the minimal length of a trajectory is 25 messages and the maximal length is 300.
!pip install damast
import polars
import damast.domains.maritime.ais.data_generator as generator
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
The data is stored in a polars.LazyFrame, and we can inspect the first and last 5 messages in the dataset.
print(data.dataframe)
shape: (163_933, 11)
┌───────────┬─────────────┬────────────┬──────────────┬───┬────────────┬─────┬────────────┬────────┐
│ mmsi ┆ lon ┆ lat ┆ date_time_ut ┆ … ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ --- ┆ --- ┆ --- ┆ c ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ --- ┆ ┆ i64 ┆ f64 ┆ i64 ┆ str │
│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ ┆ │
╞═══════════╪═════════════╪════════════╪══════════════╪═══╪════════════╪═════╪════════════╪════════╡
│ 523002531 ┆ 84.043223 ┆ -72.394031 ┆ 1987-03-25 ┆ … ┆ 1 ┆ 0.0 ┆ 3 ┆ g │
│ ┆ ┆ ┆ 23:22:28 ┆ ┆ ┆ ┆ ┆ │
│ 287450013 ┆ 100.600925 ┆ 14.62509 ┆ 2022-11-19 ┆ … ┆ 7 ┆ 0.0 ┆ 2 ┆ g │
│ ┆ ┆ ┆ 06:33:47 ┆ ┆ ┆ ┆ ┆ │
│ 276233932 ┆ -126.730308 ┆ -28.529244 ┆ null ┆ … ┆ 7 ┆ 0.0 ┆ 2 ┆ s │
│ 477676853 ┆ -110.513478 ┆ 78.242938 ┆ 2022-09-05 ┆ … ┆ 0 ┆ 0.0 ┆ 3 ┆ s │
│ ┆ ┆ ┆ 13:54:03 ┆ ┆ ┆ ┆ ┆ │
│ 482930493 ┆ 4.940892 ┆ -38.987358 ┆ 1990-02-04 ┆ … ┆ 1 ┆ 0.0 ┆ 2 ┆ g │
│ ┆ ┆ ┆ 20:50:28 ┆ ┆ ┆ ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 200728655 ┆ 53.311805 ┆ 58.28434 ┆ 1987-02-20 ┆ … ┆ 7 ┆ 0.0 ┆ 3 ┆ s │
│ ┆ ┆ ┆ 00:13:59 ┆ ┆ ┆ ┆ ┆ │
│ 669134243 ┆ 11.608388 ┆ -47.932483 ┆ 2002-05-02 ┆ … ┆ 0 ┆ 0.0 ┆ 3 ┆ g │
│ ┆ ┆ ┆ 01:11:39 ┆ ┆ ┆ ┆ ┆ │
│ 825600050 ┆ -5.285948 ┆ 32.437435 ┆ 1978-01-31 ┆ … ┆ 7 ┆ 0.0 ┆ 3 ┆ g │
│ ┆ ┆ ┆ 14:40:12 ┆ ┆ ┆ ┆ ┆ │
│ 498671479 ┆ -74.088281 ┆ -55.334557 ┆ 1988-08-13 ┆ … ┆ 0 ┆ 0.0 ┆ 1 ┆ s │
│ ┆ ┆ ┆ 04:46:42 ┆ ┆ ┆ ┆ ┆ │
│ 649364375 ┆ -97.57014 ┆ 22.635676 ┆ 2012-05-09 ┆ … ┆ 7 ┆ 0.0 ┆ 1 ┆ s │
│ ┆ ┆ ┆ 22:17:02 ┆ ┆ ┆ ┆ ┆ │
└───────────┴─────────────┴────────────┴──────────────┴───┴────────────┴─────┴────────────┴────────┘
The dataset consists of 11 columns, which we will go through in detail.
Data-specification
The Maritime Mobile Service Identity (MMSI) is used to identify a ship. It should be a 9-digit number whose first digit is between 2 and 7. The data we have generated should contain some invalid numbers. Let us inspect these.
from damast.domains.maritime.data_specification import MMSI
df = data.dataframe
invalid_mmsis = df.filter((polars.col('mmsi') < MMSI.min_value) | (polars.col('mmsi') > MMSI.max_value))
invalid_mmsis
mmsi | lon | lat | date_time_utc | sog | cog | true_heading | nav_status | rot | message_nr | source |
---|---|---|---|---|---|---|---|---|---|---|
i64 | f64 | f64 | str | f64 | f64 | f64 | i64 | f64 | i64 | str |
198705566 | -118.771696 | 81.909949 | "2011-02-02 02:15:55" | 2.570203 | 0.646392 | 0.674465 | 0 | 0.0 | 1 | "s" |
827273045 | -70.699883 | 39.136651 | "1995-04-04 03:34:28" | 6.988184 | -2.384334 | -2.357167 | 0 | 0.0 | 1 | "g" |
804205360 | -154.201175 | -24.302368 | "2022-10-06 19:11:52" | 0.518591 | -0.217661 | -0.160486 | 1 | 0.0 | 2 | "s" |
803969546 | 38.400868 | -15.679453 | "2002-04-09 12:52:00" | -14.112677 | 1.325874 | 1.395642 | 0 | 0.0 | 3 | "s" |
830343656 | -85.784809 | 71.457058 | "1992-02-01 17:13:50" | -33.071532 | -1.56768 | -1.499229 | 7 | 0.0 | 1 | "s" |
… | … | … | … | … | … | … | … | … | … | … |
829993812 | 112.955155 | 4.012104 | "2002-08-31 23:51:56" | -10.915988 | -1.994626 | -1.938866 | 0 | 0.0 | 3 | "g" |
804205360 | -154.123639 | -23.906636 | "2022-10-06 17:45:06" | -1.175042 | 0.12012 | 0.142647 | 0 | 0.0 | 2 | "g" |
195036015 | -146.650398 | -84.236247 | "1995-04-11 20:36:22" | 14.781023 | 1.945304 | 1.993393 | 0 | 0.0 | 2 | "s" |
803999517 | 89.686374 | 49.581019 | "2005-09-08 19:22:52" | -9.347322 | -6.926102 | -6.837801 | 7 | 0.0 | 3 | "s" |
825600050 | -5.285948 | 32.437435 | "1978-01-31 14:40:12" | -3.111474 | -2.177295 | -2.129191 | 7 | 0.0 | 3 | "g" |
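The validity rule stated above (nine digits, leading digit between 2 and 7) can also be written as a plain range check. The following is a minimal sketch that does not use damast: the bounds 200000000 and 799999999 are derived from that rule and are only an assumption about what MMSI.min_value and MMSI.max_value contain, and is_valid_mmsi is a hypothetical helper introduced for illustration.

```python
# Assumed bounds, derived from the rule "9 digits, first digit between 2 and 7".
# They are not read from damast; MMSI.min_value / MMSI.max_value may differ.
ASSUMED_MMSI_MIN = 200_000_000
ASSUMED_MMSI_MAX = 799_999_999


def is_valid_mmsi(mmsi: int) -> bool:
    """Return True if mmsi lies in the assumed valid range (hypothetical helper)."""
    return ASSUMED_MMSI_MIN <= mmsi <= ASSUMED_MMSI_MAX


# A few MMSI values taken from the tables above
samples = [198705566, 827273045, 523002531, 287450013]
invalid = [m for m in samples if not is_valid_mmsi(m)]
print(invalid)
```

Under these assumed bounds, the first two sample values are rejected, which matches their appearance in the table of invalid MMSIs.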
Before sending this data to a machine learning algorithm, one would have to filter out invalid data.
We can do this by creating a damast.core.DataSpecification describing what valid output we would like in our data-frame.
from damast.core import DataSpecification, MinMax
mmsi_spec = DataSpecification(name="mmsi", description="Maritime Mobile Service Identity", representation_type=int,
                              value_range=MinMax(MMSI.min_value, MMSI.max_value))
Here we have specified what the column contains, how its data is represented in Python, and the minimum and maximum of its valid range.
Next, we create a damast.core.MetaData object that we can apply to the dataframe.
from damast.core import MetaData, ValidationMode
metadata = MetaData([mmsi_spec])
metadata.apply(df.lazy(), ValidationMode.UPDATE_DATA)
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:610: UserWarning: DataSpecification.apply: column 'mmsi': expected representation type: <class 'int'>, but got 'Int64'
warnings.warn(
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:619: UserWarning: Filtering out for column 'mmsi' values that are out of range.
warnings.warn(
Of course, we do not want to do this process manually for each column. Instead, we can create a DataSpecification per column and let the damast.core.AnnotatedDataFrame handle the validation of the data. There are three ways of handling the input data with metadata:
ValidationMode.READONLY: Reads in the data, checks it against the meta-data and throws an error if the data does not adhere to the data-specification.
ValidationMode.UPDATE_METADATA: Updates the metadata based on the input in the annotated data-frame. This might change the representation type, column name and valid ranges of the data.
ValidationMode.UPDATE_DATA: Updates the data so that it adheres to the meta-data.
from damast.core.metadata import DataCategory
from damast.core.dataframe import AnnotatedDataFrame
dataspec = {
    "annotations": {"comment": "This is an autogenerated test data set"},
    "columns": [
        {"name": "mmsi", "is_optional": False, "category": DataCategory.STATIC,
         "value_range": {"MinMax": {"min": MMSI.min_value, "max": MMSI.max_value}}},
        {"name": "lon", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "lat", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
        {"name": "date_time_utc", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "sog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "cog", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "true_heading", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "nav_status", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "rot", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "message_nr", "is_optional": False, "category": DataCategory.DYNAMIC},
        {"name": "source", "is_optional": False, "category": DataCategory.DYNAMIC},
    ]
}
metadata = MetaData.from_dict(dataspec)
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
adf = AnnotatedDataFrame(data.dataframe, metadata, validation_mode=ValidationMode.UPDATE_DATA)
adf
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:619: UserWarning: Filtering out for column 'mmsi' values that are out of range.
warnings.warn(
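The difference between the three validation modes described above can be pictured with a small, library-independent sketch. Everything in it is hypothetical: Mode and validate are stand-ins rather than damast's API, the range is the one assumed from the MMSI rule, and widening the range under UPDATE_METADATA is only one plausible reading of "update the metadata based on the input".

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for the three validation modes."""
    READONLY = "readonly"
    UPDATE_METADATA = "update_metadata"
    UPDATE_DATA = "update_data"


def validate(values, value_range, mode):
    """Check values against an inclusive (min, max) range.

    Returns the (possibly filtered) values and the (possibly widened) range.
    """
    lo, hi = value_range
    out_of_range = [v for v in values if not lo <= v <= hi]
    if not out_of_range:
        return list(values), (lo, hi)
    if mode is Mode.READONLY:
        # Data is left untouched; any mismatch is reported as an error
        raise ValueError(f"{len(out_of_range)} value(s) out of range")
    if mode is Mode.UPDATE_METADATA:
        # Keep all data and widen the range so that it covers every value
        return list(values), (min(lo, *values), max(hi, *values))
    # Mode.UPDATE_DATA: keep the range and drop the offending values
    return [v for v in values if lo <= v <= hi], (lo, hi)


mmsis = [198705566, 523002531, 827273045]
assumed_range = (200_000_000, 799_999_999)  # assumption based on the MMSI rule

kept, _ = validate(mmsis, assumed_range, Mode.UPDATE_DATA)
print(kept)
_, widened = validate(mmsis, assumed_range, Mode.UPDATE_METADATA)
print(widened)
```

In this sketch, UPDATE_DATA mirrors the "Filtering out ... values that are out of range" warning shown above, while READONLY would raise an error for the same input.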
Data-processing
Say we want to repeat this process on any data-set we read in. Then, we should create a damast.core.dataprocessing.DataProcessingPipeline.
A pipeline consists of pipeline elements, each of which is a transformation applied to the dataset.
We start by creating a pipeline element that drops all rows missing an "mmsi" entry.
from damast.data_handling.transformers.filters import DropMissingOrNan
from damast.core.dataprocessing import DataProcessingPipeline
pipeline = DataProcessingPipeline(name="Remove missing MMSI columns",
                                  base_dir="./output_dir",
                                  inplace_transformation=True)
pipeline.add(name="Remove MMSI column",
             transformer=DropMissingOrNan(),
             name_mappings={"x": "mmsi"})
transformed_adf = pipeline.transform(adf)
transformed_adf
transformed_adf.collect()
mmsi | lon | lat | date_time_utc | sog | cog | true_heading | nav_status | rot | message_nr | source |
---|---|---|---|---|---|---|---|---|---|---|
i64 | f64 | f64 | str | f64 | f64 | f64 | i64 | f64 | i64 | str |
232532191 | -55.655376 | -75.263977 | null | -12.912959 | 1.634495 | 1.702891 | 0 | 0.0 | 3 | "s" |
537340040 | 6.087661 | -87.18102 | "2001-08-06 06:16:41" | 5.342883 | 0.800691 | 0.88829 | 0 | 0.0 | 3 | "s" |
509697426 | -2.112515 | 38.019535 | "2010-06-17 06:57:07" | 8.157707 | 9.176341 | 9.245326 | 0 | 0.0 | 2 | "g" |
615806863 | 27.988985 | -4.022445 | "1983-11-02 17:45:43" | 10.965288 | -1.094736 | -1.089786 | 1 | 0.0 | 1 | "g" |
682290790 | 105.0741 | -70.308761 | "1991-03-10 06:34:15" | -5.837299 | -2.362193 | -2.309975 | 1 | 0.0 | 3 | "g" |
… | … | … | … | … | … | … | … | … | … | … |
657959055 | -169.71815 | -73.962004 | null | -4.436311 | -0.203366 | -0.129942 | 7 | 0.0 | 2 | "s" |
239601674 | 120.106768 | 47.458229 | "1972-02-21 04:30:09" | -23.432357 | 2.649652 | 2.706745 | 0 | 0.0 | 2 | "s" |
724859151 | -173.404494 | -23.707157 | "1997-06-13 13:25:52" | -10.534615 | -0.464212 | -0.392389 | 0 | 0.0 | 3 | "g" |
437856144 | -60.871687 | -70.392893 | "1992-06-08 14:59:34" | 15.998613 | -4.572923 | -4.57159 | 0 | 0.0 | 1 | "s" |
488927194 | -165.977432 | 54.476965 | "2007-02-03 20:36:20" | -9.670403 | -0.534671 | -0.453232 | 1 | 0.0 | 2 | "s" |
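To make the idea of a pipeline as an ordered set of transformations more concrete, here is a minimal sketch in plain Python. It is not damast's implementation: MiniPipeline and drop_missing_or_nan are hypothetical stand-ins, rows are plain dictionaries instead of a data-frame, and the interpretation of name_mappings (mapping the transformer's generic input "x" to the concrete column "mmsi") is an assumption based on the call shown above.

```python
import math


def drop_missing_or_nan(rows, column):
    """Keep only rows whose value in `column` is neither None nor NaN."""
    def is_present(value):
        if value is None:
            return False
        return not (isinstance(value, float) and math.isnan(value))
    return [row for row in rows if is_present(row.get(column))]


class MiniPipeline:
    """Hypothetical, simplified pipeline: an ordered list of named steps."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add(self, name, transformer, name_mappings):
        # name_mappings maps the transformer's generic input "x"
        # to a concrete column name in the dataset (assumed semantics)
        self.steps.append((name, transformer, name_mappings))
        return self

    def transform(self, rows):
        # Apply every step in the order in which it was added
        for _name, transformer, name_mappings in self.steps:
            rows = transformer(rows, name_mappings["x"])
        return rows


rows = [
    {"mmsi": 232532191, "lon": -55.655376},
    {"mmsi": None, "lon": 6.087661},
    {"mmsi": float("nan"), "lon": -2.112515},
]
pipeline = MiniPipeline("Drop rows without MMSI")
pipeline.add("Drop missing MMSI", drop_missing_or_nan, {"x": "mmsi"})
result = pipeline.transform(rows)
print(result)
```

Because the steps are stored rather than executed immediately, the same pipeline object can be applied to any dataset that provides the mapped column.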