Usage
We start by exploring the data-processing pipeline part of DAMAST.
We consider a synthetic dataset of Automatic Identification System (AIS) messages.
The data is generated for 1000 trajectories, each containing between 25 and 300 messages.
!pip install damast
import polars
import damast.domains.maritime.ais.data_generator as generator
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
The data is stored in a polars.DataFrame, and we can inspect the first and last 5 messages in the dataset.
print(data.dataframe)
shape: (162_118, 11)
┌───────────┬────────────┬────────────┬───────────────┬───┬────────────┬─────┬────────────┬────────┐
│ mmsi ┆ lon ┆ lat ┆ date_time_utc ┆ … ┆ nav_status ┆ rot ┆ message_nr ┆ source │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ str ┆ ┆ i64 ┆ f64 ┆ i64 ┆ str │
╞═══════════╪════════════╪════════════╪═══════════════╪═══╪════════════╪═════╪════════════╪════════╡
│ 665159851 ┆ 103.770273 ┆ 78.472639 ┆ 1985-03-22 ┆ … ┆ 0 ┆ 0.0 ┆ 2 ┆ s │
│ ┆ ┆ ┆ 14:50:16 ┆ ┆ ┆ ┆ ┆ │
│ 645053718 ┆ 39.967344 ┆ 29.150584 ┆ 2001-12-02 ┆ … ┆ 0 ┆ 0.0 ┆ 1 ┆ s │
│ ┆ ┆ ┆ 16:00:10 ┆ ┆ ┆ ┆ ┆ │
│ 829664905 ┆ 146.296785 ┆ -68.14958 ┆ 1988-06-02 ┆ … ┆ 7 ┆ 0.0 ┆ 2 ┆ s │
│ ┆ ┆ ┆ 20:32:13 ┆ ┆ ┆ ┆ ┆ │
│ 244557659 ┆ 73.488362 ┆ 52.166825 ┆ 1991-02-14 ┆ … ┆ 0 ┆ 0.0 ┆ 3 ┆ g │
│ ┆ ┆ ┆ 18:23:40 ┆ ┆ ┆ ┆ ┆ │
│ 757514195 ┆ 172.669604 ┆ 8.815414 ┆ 2008-04-22 ┆ … ┆ 0 ┆ 0.0 ┆ 1 ┆ s │
│ ┆ ┆ ┆ 12:47:43 ┆ ┆ ┆ ┆ ┆ │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ 497570870 ┆ 144.543314 ┆ -76.135199 ┆ null ┆ … ┆ 7 ┆ 0.0 ┆ 3 ┆ s │
│ 551920744 ┆ 108.758311 ┆ -11.502653 ┆ null ┆ … ┆ 7 ┆ 0.0 ┆ 2 ┆ s │
│ 220739595 ┆ 83.560212 ┆ -2.216197 ┆ 1991-01-10 ┆ … ┆ 1 ┆ 0.0 ┆ 3 ┆ s │
│ ┆ ┆ ┆ 17:29:36 ┆ ┆ ┆ ┆ ┆ │
│ 275532654 ┆ -53.465037 ┆ 51.27748 ┆ 1970-09-23 ┆ … ┆ 0 ┆ 0.0 ┆ 2 ┆ g │
│ ┆ ┆ ┆ 22:49:00 ┆ ┆ ┆ ┆ ┆ │
│ 244605512 ┆ 58.250896 ┆ -76.052974 ┆ 1987-05-16 ┆ … ┆ 0 ┆ 0.0 ┆ 3 ┆ s │
│ ┆ ┆ ┆ 17:45:44 ┆ ┆ ┆ ┆ ┆ │
└───────────┴────────────┴────────────┴───────────────┴───┴────────────┴─────┴────────────┴────────┘
The dataset consists of 11 columns, which we will go through in detail.
Data-specification
The Maritime Mobile Service Identity (MMSI) is used to identify a ship. It should be a 9-digit number whose first digit is between 2 and 7. The data we have generated should contain some invalid numbers. Let us inspect these.
from damast.domains.maritime.data_specification import MMSI
df = data.dataframe
invalid_mmsis = df.filter((polars.col('mmsi') < MMSI.min_value) | (polars.col('mmsi') > MMSI.max_value))
invalid_mmsis
mmsi | lon | lat | date_time_utc | sog | cog | true_heading | nav_status | rot | message_nr | source |
---|---|---|---|---|---|---|---|---|---|---|
i64 | f64 | f64 | str | f64 | f64 | f64 | i64 | f64 | i64 | str |
829664905 | 146.296785 | -68.14958 | "1988-06-02 20:32:13" | -18.581698 | 2.453757 | 2.454083 | 7 | 0.0 | 2 | "s" |
191860039 | -69.473065 | 47.91841 | "2015-08-10 04:22:38" | 7.306458 | 0.783616 | 0.817958 | 1 | 0.0 | 3 | "g" |
831760986 | -121.814033 | 41.198894 | "2012-02-27 18:43:44" | 4.656469 | -4.758333 | -4.70935 | 7 | 0.0 | 1 | "s" |
826453030 | -16.070802 | -81.95998 | "1980-01-30 06:05:54" | -6.534687 | -0.207772 | -0.145488 | 7 | 0.0 | 1 | "g" |
834449441 | -123.703542 | 82.106814 | "1977-08-12 14:36:49" | 4.205259 | -5.529068 | -5.492754 | 0 | 0.0 | 2 | "s" |
… | … | … | … | … | … | … | … | … | … | … |
827110460 | -149.085766 | 31.831945 | "1970-09-05 17:31:24" | 1.750886 | -1.067 | -1.015302 | 0 | 0.0 | 3 | "g" |
193714578 | 0.048932 | -61.353975 | "2008-03-27 00:50:56" | -11.366827 | 8.560596 | 8.584886 | 0 | 0.0 | 3 | "g" |
803447900 | 48.379803 | 34.144149 | "1975-12-15 03:51:35" | -1.166723 | -0.622385 | -0.588674 | 7 | 0.0 | 3 | "g" |
825260785 | -155.823564 | 48.097149 | "1994-12-18 11:24:35" | 18.162827 | 0.550113 | 0.565691 | 1 | 0.0 | 3 | "g" |
834449441 | -123.792928 | 82.175318 | "1977-08-12 15:07:46" | -3.14995 | -5.673962 | -5.596834 | 7 | 0.0 | 2 | "s" |
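The range check above follows from the rule stated for MMSIs. As a minimal sketch in plain Python, assuming the bounds are the smallest and largest 9-digit numbers whose first digit lies between 2 and 7 (the actual values of MMSI.min_value and MMSI.max_value are not shown here), the check could look as follows. The names ASSUMED_MMSI_MIN, ASSUMED_MMSI_MAX and is_valid_mmsi are illustrative and not part of damast.

```python
# Illustrative bounds derived from the rule "9 digits, first digit between 2 and 7".
# They are an assumption and may differ from MMSI.min_value / MMSI.max_value.
ASSUMED_MMSI_MIN = 200_000_000
ASSUMED_MMSI_MAX = 799_999_999


def is_valid_mmsi(mmsi: int) -> bool:
    """Return True if the value lies inside the assumed valid MMSI range."""
    return ASSUMED_MMSI_MIN <= mmsi <= ASSUMED_MMSI_MAX


# One value from the invalid rows above and one from the first rows of the dataset.
print(is_valid_mmsi(829664905))  # False: the first digit is 8
print(is_valid_mmsi(244557659))  # True: the first digit is 2
```

Under this assumption, values starting with 1 or 8, such as 191860039 and 829664905 in the table above, fall outside the range.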
Before sending this data to a machine learning algorithm, one would have to filter out invalid data.
We can do this by creating a damast.core.DataSpecification describing what valid output we would like in our data-frame.
from damast.core import DataSpecification, MinMax
mmsi_spec = DataSpecification(name="mmsi", description="Maritime Mobile Service Identity", representation_type=int,
value_range=MinMax(MMSI.min_value, MMSI.max_value))
Here we have specified what the column contains, how the data is represented in Python, and its minimum and maximum values.
Next, we create a damast.core.MetaData object that we can apply to the dataframe.
from damast.core import MetaData, ValidationMode
metadata = MetaData([mmsi_spec])
metadata.apply(df.lazy(), ValidationMode.UPDATE_DATA)
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:568: UserWarning: DataSpecification.apply: column 'mmsi': expected representation type: <class 'int'>, but got 'Int64'
warnings.warn(
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:589: UserWarning: Filtering out for column 'mmsi' values that are out of range.
warnings.warn(
Of course, we do not want to do this process manually for each column. Therefore, we can create a DataSpecification per column, and let the damast.core.AnnotatedDataFrame handle the validation of the data. There are three ways of handling input data that comes with metadata:
ValidationMode.READONLY: Reads in the data, checks it against the meta-data and throws an error if the data does not adhere to the data-specification.
ValidationMode.UPDATE_METADATA: Updates the metadata based on the input in the annotated data-frame. This might change the representation type, column name and valid ranges of the data.
ValidationMode.UPDATE_DATA: Updates the data so that it adheres to the meta-data.
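To make the difference between the three modes concrete, here is a minimal plain-Python sketch for a single column with a value range. It is not damast's implementation: Mode, RangeSpec and validate are hypothetical names that only mimic the behaviour described in the list above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mode(Enum):
    """Hypothetical stand-in for the three validation modes."""
    READONLY = auto()
    UPDATE_METADATA = auto()
    UPDATE_DATA = auto()


@dataclass
class RangeSpec:
    """Hypothetical column specification holding only a value range."""
    min_value: int
    max_value: int


def validate(values: list[int], spec: RangeSpec, mode: Mode) -> list[int]:
    """Check values against spec, reacting to violations according to mode."""
    out_of_range = [v for v in values if not spec.min_value <= v <= spec.max_value]
    if not out_of_range:
        return values
    if mode is Mode.READONLY:
        # Neither data nor metadata may change, so a violation is an error.
        raise ValueError(f"values out of range: {out_of_range}")
    if mode is Mode.UPDATE_METADATA:
        # Widen the range so that it covers all observed values.
        spec.min_value = min(spec.min_value, min(values))
        spec.max_value = max(spec.max_value, max(values))
        return values
    # Mode.UPDATE_DATA: drop the offending values, keep the specification.
    return [v for v in values if spec.min_value <= v <= spec.max_value]


values = [244557659, 829664905]
print(validate(values, RangeSpec(200_000_000, 799_999_999), Mode.UPDATE_DATA))
# [244557659]
```

In this sketch only the UPDATE_DATA branch changes the number of values, which matches the "Filtering out ... values that are out of range" warning shown earlier.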
from damast.core.metadata import DataCategory
from damast.core.dataframe import AnnotatedDataFrame
dataspec = {
"annotations": {"comment": "This is a autogenerated test data set"},
"columns": [
{"name": "mmsi", "is_optional": False, "category": DataCategory.STATIC,
"value_range":{"MinMax": {"min": MMSI.min_value, "max": MMSI.max_value}}},
{"name": "lon", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
{"name": "lat", "is_optional": False, "unit": "deg", "category": DataCategory.DYNAMIC},
{"name": "date_time_utc", "is_optional": False, "category": DataCategory.DYNAMIC},
{"name": "sog", "is_optional": False, "category": DataCategory.DYNAMIC},
{"name": "cog", "is_optional": False, "category": DataCategory.DYNAMIC},
{"name": "true_heading", "is_optional": False, "category": DataCategory.DYNAMIC},
{"name": "nav_status", "is_optional": False, "category": DataCategory.DYNAMIC},
{"name": "rot", "is_optional": False, "category": DataCategory.DYNAMIC},
{"name": "message_nr", "is_optional": False, "category": DataCategory.DYNAMIC},
{"name": "source", "is_optional": False, "category": DataCategory.DYNAMIC},
]
}
metadata = MetaData.from_dict(dataspec)
data = generator.AISTestData(number_of_trajectories=1000, min_length=25, max_length=300)
adf = AnnotatedDataFrame(data.dataframe, metadata, validation_mode=ValidationMode.UPDATE_DATA)
adf
/home/runner/work/damast/damast/.tox/build_docs/lib/python3.10/site-packages/damast/core/metadata.py:589: UserWarning: Filtering out for column 'mmsi' values that are out of range.
warnings.warn(
Data-processing
Say we want to repeat this process on any data-set we read in. Then, we should create a damast.core.dataprocessing.DataProcessingPipeline.
A pipeline consists of pipeline elements, that is, a sequence of transformations applied to the original dataset.
We start by creating a pipeline element that drops all rows missing an "mmsi" entry.
from damast.data_handling.transformers.filters import DropMissingOrNan
from damast.core.dataprocessing import DataProcessingPipeline
pipeline = DataProcessingPipeline(name="Remove rows with missing MMSI",
                                  base_dir="./output_dir",
                                  inplace_transformation=True)
pipeline.add(name="Drop rows with missing MMSI",
             transformer=DropMissingOrNan(),
             name_mappings={"x": "mmsi"})
transformed_adf = pipeline.transform(adf)
transformed_adf
transformed_adf.collect()
mmsi | lon | lat | date_time_utc | sog | cog | true_heading | nav_status | rot | message_nr | source |
---|---|---|---|---|---|---|---|---|---|---|
i64 | f64 | f64 | str | f64 | f64 | f64 | i64 | f64 | i64 | str |
786414538 | 17.631989 | 58.675481 | "2005-04-15 00:00:35" | -18.047639 | -0.545029 | -0.453733 | 7 | 0.0 | 3 | "g" |
356986369 | 88.657238 | -62.300499 | "2018-12-14 05:43:21" | -5.879265 | -7.633635 | -7.592419 | 7 | 0.0 | 2 | "s" |
709754860 | 143.000563 | -21.344114 | "2015-04-17 01:03:32" | 15.993758 | -1.739446 | -1.687773 | 0 | 0.0 | 3 | "s" |
739509160 | 105.895329 | 46.555108 | "2016-06-28 16:38:58" | -2.268003 | -1.100663 | -1.077908 | 7 | 0.0 | 3 | "s" |
410210939 | -81.439397 | 79.613767 | "1994-03-09 07:35:41" | 7.333727 | -2.018947 | -1.941216 | 0 | 0.0 | 2 | "s" |
… | … | … | … | … | … | … | … | … | … | … |
541629734 | 75.768201 | -48.507674 | "2005-02-25 02:05:58" | -10.501874 | 1.663048 | 1.709874 | 1 | 0.0 | 1 | "s" |
333897361 | -115.430138 | -10.509307 | null | 10.981745 | 2.209766 | 2.265384 | 1 | 0.0 | 2 | "s" |
460482459 | -88.303318 | 73.13233 | "1993-08-09 05:40:40" | -4.637709 | 1.269856 | 1.27826 | 1 | 0.0 | 3 | "s" |
363148064 | -17.471132 | 8.034439 | "2007-10-06 23:03:57" | -8.744209 | 1.110241 | 1.138796 | 1 | 0.0 | 2 | "g" |
452660538 | -172.90871 | 63.038617 | "2009-09-05 09:21:51" | -21.245215 | -7.598973 | -7.556807 | 7 | 0.0 | 1 | "g" |
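The pipeline idea used above, an ordered set of named transformations applied to a dataset, can be sketched in plain Python. This is only an illustration under simplifying assumptions (rows are dictionaries, a step is a function from rows to rows). SimplePipeline and drop_missing_or_nan are hypothetical names and do not reflect how DataProcessingPipeline or DropMissingOrNan are implemented.

```python
import math
from typing import Callable, Optional

Row = dict[str, Optional[float]]
Step = Callable[[list[Row]], list[Row]]


def drop_missing_or_nan(column: str) -> Step:
    """Build a step that drops rows where `column` is None or NaN."""
    def step(rows: list[Row]) -> list[Row]:
        kept = []
        for row in rows:
            value = row.get(column)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue
            kept.append(row)
        return kept
    return step


class SimplePipeline:
    """Hypothetical pipeline: named steps that run in insertion order."""

    def __init__(self) -> None:
        self.steps: list[tuple[str, Step]] = []

    def add(self, name: str, step: Step) -> "SimplePipeline":
        self.steps.append((name, step))
        return self

    def transform(self, rows: list[Row]) -> list[Row]:
        for _name, step in self.steps:
            rows = step(rows)
        return rows


rows = [{"mmsi": 244557659.0}, {"mmsi": None}, {"mmsi": float("nan")}]
pipeline = SimplePipeline().add("drop rows without mmsi", drop_missing_or_nan("mmsi"))
print(pipeline.transform(rows))  # [{'mmsi': 244557659.0}]
```

Keeping each step as a separate named element means the same sequence can be reapplied to any dataset that provides the expected column.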