Command Line Interface


The command line interface offers a number of workflow simplifications, each encapsulated in a sub-command:

inspect

check metadata and dataset properties

convert

convert from (zipped) csv, netcdf to parquet (default) or hdf5 (deprecated)

annotate

create metadata file and update dataframe with metadata

process

apply a data pipeline to a dataset

Inspect#

Inspect identifies the columns of a given dataset and their properties. The dataset can consist of one or more (zipped) files, given either as a list of filenames or as a file pattern.

$ damast inspect -f 1.zip

Subparser: DataInspectParser
Loading dataframe (1 files) of total size: 0.0 MB
Creating offset dictionary for /tmp/damast-example/datasets/1.zip ...
Creating offset dictionary for /tmp/damast-example/datasets/1.zip took 0.00s
Created mount point at: /tmp/damast-mountqigwlx74/1.zip
INFO:damast.core.dataframe:Loading parquet: files=[PosixPath('/tmp/damast-mountqigwlx74/1.zip/dataset-1.zst.parquet')]
WARNING:damast.core.dataframe:/tmp/damast-mountqigwlx74/1.zip/dataset-1.zst.parquet has no (damast) annotations
INFO:damast.core.dataframe:No metadata provided or found in files - searching now for an existing spec file
INFO:damast.core.dataframe:Found no candidate for a spec file
INFO:damast.core.dataframe:Metadata is not available and not required, so inferring annotation
Extract str and categorical column metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 1092.51it/s]
Extract numeric column metadata: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 12767.03it/s]
INFO:damast.core.dataframe:Metadata inferring completed
Annotations:
    accuracy:
        is_optional: False
        representation_type: Boolean
    call_sign:
        is_optional: False
        representation_type: String
        value_range: {'ListOfValues': [None, '', 'SIDF9', 'SABD4', 'STDL5', 'STJE3', 'SKCY7', 'XAGBE']}
    cog:
        is_optional: False
        representation_type: Float32
        value_stats: {'mean': 142.0380096435547, 'stddev': 117.50126647949219, 'total_count': 1234, 'null_count': 745}
    corrupted:
        is_optional: False
        representation_type: Boolean
    corrupted_right:
        is_optional: False
        representation_type: Boolean
    destination:
        is_optional: False
        representation_type: String
        value_range: {'ListOfValues': [None, '', 'VILA', 'ES SUR', 'ESICL', 'EBAL>EDGA', 'IT-SEP', 'PLATF ROMA', 'ITL-BREG']}
    dimension_to_bow:
        is_optional: False
        representation_type: UInt16

    ...
    sog:
        is_optional: False
        representation_type: Float32
        value_stats: {'mean': 2.0780696868896484, 'stddev': 4.677201271057129, 'total_count': 1979, 'null_count': 0}
    version:
        is_optional: False
        representation_type: Int64
        value_range: {'MinMax': {'min': 3, 'max': 3, 'allow_missing': True}}
        value_stats: {'mean': 3.0, 'stddev': 0.0, 'total_count': 1979, 'null_count': 0}


 First 10 and last 10 rows:
 shape: (10, 32)
 ┌───────────┬─────────────────────┬──────────┬───────────┬──────┬───┬────────────┬────────────────────┬──────────────────┬───────────────────────┬─────────┐
  mmsi       reception_date       lon       lat        rot     eta         message_type_right  satellite_static  reception_date_static  version 
  ---        ---                  ---       ---        ---      ---         ---                 ---               ---                    ---     
  i32        datetime[ms]         f64       f64        f32      i64         i64                 str               datetime[ms]           i64     
 ╞═══════════╪═════════════════════╪══════════╪═══════════╪══════╪═══╪════════════╪════════════════════╪══════════════════╪═══════════════════════╪═════════╡
  345080000  2020-11-18 33:00:18  0.783398  40.483513  0.0     null        null                null              null                   3       
  334015340  2020-11-18 33:00:33  0.435345  40.414097  null    1735889800  5                   SAT-AA_037        2020-11-18 33:07:35    3       
  334088470  2020-11-18 33:00:37  0.403745  40.358495  null    null        null                null              null                   3       
  334098970  2020-11-18 33:00:39  0.88999   40.389833  null    1783310700  5                   SAT-AA_038        2020-11-18 33:04:13    3       
  333019738  2020-11-18 33:01:18  0.80045   40.819483  null    null        34                  SAT-AA_038        2020-11-18 33:33:51    3       
  353003075  2020-11-18 33:01:38  0.550948  40.571973  0.0     null        5                   SAT-AA_037        2020-11-18 33:01:13    3       
  345080000  2020-11-18 33:01:37  0.759477  40.481487  0.0     null        null                null              null                   3       
  334015340  2020-11-18 33:01:33  0.435338  40.414093  null    1735889800  5                   SAT-AA_037        2020-11-18 33:07:35    3       
  334088470  2020-11-18 33:01:37  0.403743  40.358513  null    null        null                null              null                   3       
  334098970  2020-11-18 33:03:18  0.890313  40.370833  null    1783310700  5                   SAT-AA_038        2020-11-18 33:04:13    3       
 └───────────┴─────────────────────┴──────────┴───────────┴──────┴───┴────────────┴────────────────────┴──────────────────┴───────────────────────┴─────────┘
 shape: (10, 32)
 ┌───────────┬─────────────────────┬──────────┬───────────┬──────┬───┬────────────┬────────────────────┬──────────────────┬───────────────────────┬─────────┐
  mmsi       reception_date       lon       lat        rot     eta         message_type_right  satellite_static  reception_date_static  version 
  ---        ---                  ---       ---        ---      ---         ---                 ---               ---                    ---     
  i32        datetime[ms]         f64       f64        f32      i64         i64                 str               datetime[ms]           i64     
 ╞═══════════╪═════════════════════╪══════════╪═══════════╪══════╪═══╪════════════╪════════════════════╪══════════════════╪═══════════════════════╪═════════╡
  335990004  2020-11-19 01:59:00  0.849883  40.937813  null    null        34                  SAT-AA_038        2020-11-19 01:31:43    3       
  334015340  2020-11-19 03:00:13  0.435335  40.414083  null    1735889800  5                   SAT-AA_037        2020-11-19 01:19:34    3       
  334088470  2020-11-19 03:00:19  0.40377   40.358493  null    null        null                null              null                   3       
  333049539  2020-11-19 03:00:31  0.80088   40.819835  null    null        34                  SAT-AA_038        2020-11-19 01:03:01    3       
  334018830  2020-11-19 03:00:35  0.895348  40.897835  null    1735889800  5                   SAT-AA_038        2020-11-19 01:07:38    3       
  333058871  2020-11-19 03:00:31  0.800105  40.819735  null    null        34                  SAT-AA_037        2020-11-19 00:59:00    3       
  334098970  2020-11-19 03:00:37  0.891373  40.403033  null    1783310700  5                   SAT-AA_038        2020-11-19 01:04:13    3       
  345080000  2020-11-19 03:00:38  0.373085  40.10578   0.0     null        null                null              null                   3       
  333041379  2020-11-19 03:00:48  0.801778  40.818448  null    null        null                null              null                   3       
  333048134  2020-11-19 03:00:54  0.803097  40.830813  null    null        34                  SAT-AA_037        2020-11-19 01:08:30    3       
 └───────────┴─────────────────────┴──────────┴───────────┴──────┴───┴────────────┴────────────────────┴──────────────────┴───────────────────────┴─────────┘
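
The value_stats entries in the annotations above report mean, stddev, total_count and null_count for each numeric column. In the output shown, total_count and null_count of cog add up to the total_count of sog (1234 + 745 = 1979), which suggests that total_count counts non-null values only. The following plain-Python sketch follows that reading. It is an illustration and not damast's implementation; the function name value_stats and the use of the sample standard deviation are assumptions.

```python
from __future__ import annotations

import statistics


def value_stats(values: list[float | None]) -> dict[str, float | int]:
    """Summarise a numeric column that may contain missing values (None)."""
    present = [v for v in values if v is not None]
    return {
        # Mean and standard deviation are computed over non-null values only.
        "mean": statistics.fmean(present),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "stddev": statistics.stdev(present),
        # Assumption: total_count counts the non-null values.
        "total_count": len(present),
        "null_count": len(values) - len(present),
    }


stats = value_stats([0.0, None, 2.0, 4.0, None])
print(stats)
```

Note that statistics.stdev needs at least two non-null values, so a column that is almost entirely null would have to be handled separately.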

Examples#

The data can be filtered on individual columns using a Python expression that is compliant with the backend in use (here: polars).

For instance, to extract:

  • the time-series for a particular id (mmsi):

damast inspect -f 1.zip --filter 'mmsi == 335990004'

  • all data in a time interval:

damast inspect -f 1.zip --filter 'reception_date >= dt.datetime.fromisoformat("2020-11-19 00:00:00")' --filter 'reception_date <= dt.datetime.fromisoformat("2020-11-20 00:00:00")'
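
When such calls are generated from a script, the nested quoting of the filter expressions is easy to get wrong. The sketch below builds the argument list for the time-interval example above and prints a shell-safe command line. The helper names interval_filters and inspect_command are hypothetical; only the -f and --filter options shown above are used, and the command is assembled but not executed.

```python
import datetime as dt
import shlex


def interval_filters(column: str, start: dt.datetime, end: dt.datetime) -> list[str]:
    """Return two filter expressions that bound `column` to [start, end]."""
    lower = start.isoformat(sep=" ")
    upper = end.isoformat(sep=" ")
    return [
        f'{column} >= dt.datetime.fromisoformat("{lower}")',
        f'{column} <= dt.datetime.fromisoformat("{upper}")',
    ]


def inspect_command(filename: str, filters: list[str]) -> list[str]:
    """Assemble the argument list; each expression gets its own --filter."""
    command = ["damast", "inspect", "-f", filename]
    for expression in filters:
        command += ["--filter", expression]
    return command


filters = interval_filters(
    "reception_date",
    dt.datetime(2020, 11, 19),
    dt.datetime(2020, 11, 20),
)
command = inspect_command("1.zip", filters)

# shlex.join quotes every argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Passing the list form to subprocess.run avoids shell quoting altogether, since every expression then travels as a single argument.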

Convert#

damast convert --help
usage: damast convert [-h] [-w WORKDIR] [-v] [--loglevel LOGLEVEL] [--logfile LOGFILE] -f FILES [FILES ...] [-m METADATA_INPUT] [-o OUTPUT_FILE] [--output-dir OUTPUT_DIR]
                      [--output-type OUTPUT_TYPE] [--validation-mode {ignore,readonly,update_data,update_metadata}]
                      {} ...

damast convert - data conversion subcommand called

positional arguments:
  {}                    sub-command help

options:
  -h, --help            show this help message and exit
  -w WORKDIR, --workdir WORKDIR
  -v, --verbose
  --loglevel LOGLEVEL   Set loglevel to display
  --logfile LOGFILE     Set file for saving log (default prints to terminal)
  -f FILES [FILES ...], --files FILES [FILES ...]
                        Files or patterns of the (annotated) data file that should be converted
  -m METADATA_INPUT, --metadata-input METADATA_INPUT
                        The metadata input file
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The output file either: .parquet, .hdf5
  --output-dir OUTPUT_DIR
                        The output directory
  --output-type OUTPUT_TYPE
                        The output file type: .parquet (default) or .hdf5
  --validation-mode {ignore,readonly,update_data,update_metadata}
                        Define the validation mode

Examples#

  • convert one or more files to parquet (N:N)

damast convert -f 1.zip --output-dir export --output-type .parquet

  • convert one or more files to a single parquet file (N:1)

damast convert -f 1.zip --output-file data-1.parquet --output-type .parquet

Annotate#

Examples#

  • set the unit of two columns, here lat and lon, to deg and create a new file in the subfolder export

damast annotate -f input.parquet --set-unit lon:deg lat:deg --output-dir export

  • set the unit of two columns, here lat and lon, to deg in place, i.e., modify the existing file

damast annotate -f input.parquet --set-unit lon:deg lat:deg --inplace
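
The --set-unit arguments above follow a column:unit pattern. The sketch below shows how such specifications could be split into a mapping from column name to unit. It only illustrates the format: the function parse_unit_specs is hypothetical and does not reflect how damast parses the option internally.

```python
def parse_unit_specs(specs: list[str]) -> dict[str, str]:
    """Split 'column:unit' strings into a {column: unit} mapping."""
    units: dict[str, str] = {}
    for spec in specs:
        # partition splits at the first colon only.
        column, separator, unit = spec.partition(":")
        if not separator or not column or not unit:
            raise ValueError(f"expected 'column:unit', got {spec!r}")
        units[column] = unit
    return units


units = parse_unit_specs(["lon:deg", "lat:deg"])
print(units)  # {'lon': 'deg', 'lat': 'deg'}
```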

Process#

Once a DataProcessingPipeline has been exported and saved, e.g., in the following example as my-pipeline.damast.ppl, it can be reapplied to an existing dataset. The dataset needs to provide the input columns the pipeline requires and satisfy its metadata requirements, such as units, so that the pipeline can run successfully. Damast checks these requirements and raises an exception if they are not satisfied.

from pathlib import Path

import damast
from damast.core import DataProcessingPipeline

from damast.core.dataframe import AnnotatedDataFrame
from damast.domains.maritime.transformers.features import (
    DeltaDistance,
    Speed
)
from damast.data_handling.transformers import (
    DropMissingOrNan,
    AddDeltaTime,
)

class MyPipeline(DataProcessingPipeline):
    def __init__(self,
                 workdir: str | Path,
                 name: str = "my-pipeline",
                 name_mappings: dict[str, str] | None = None):
        # Use None as default to avoid sharing one mutable dict between instances
        super().__init__(name=name,
                         base_dir=workdir,
                         name_mappings=name_mappings or {})

        self.add("Delta Time",
                 AddDeltaTime(),
                 name_mappings={
                     "group": "mmsi",
                     "time_column": "reception_date"
                })

        self.add("Delta Distance",
                 DeltaDistance(x_shift=True, y_shift=True),
                 name_mappings={
                     "group": "mmsi",
                     "sort": "reception_date",
                     "x": "lat",
                     "y": "lon",
                     "out": "delta_distance",
                })

        self.add("Speed",
                 Speed(),
                 name_mappings={
                     "delta_distance": "delta_distance",
                     "delta_time": "delta_time",
                })

# Create the pipeline and save it into the directory "pipelines"
pipeline = MyPipeline(workdir=".")
pipeline.save("pipelines")

Examples#

damast process --input-data input.parquet --pipeline pipelines/my-pipeline.damast.ppl
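
The compatibility check described at the start of this section can be pictured as comparing the columns and units a pipeline expects with those a dataset provides. The following is a minimal, self-contained sketch of that idea. It does not use damast's API: ColumnRequirement, check_requirements and the choice of ValueError are hypothetical, and damast's own exception types and metadata model may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRequirement:
    """Hypothetical description of one input column a pipeline expects."""
    name: str
    unit: str | None = None  # None means that any unit is acceptable


def check_requirements(requirements: list[ColumnRequirement],
                       available_units: dict[str, str | None]) -> None:
    """Raise ValueError if a column is missing or carries the wrong unit.

    `available_units` maps each dataset column to its annotated unit
    (None if the column has no unit annotation).
    """
    problems = []
    for requirement in requirements:
        if requirement.name not in available_units:
            problems.append(f"missing column '{requirement.name}'")
            continue
        found = available_units[requirement.name]
        if requirement.unit is not None and found != requirement.unit:
            problems.append(
                f"column '{requirement.name}' has unit {found!r}, "
                f"expected {requirement.unit!r}"
            )
    if problems:
        raise ValueError("; ".join(problems))


required = [
    ColumnRequirement("lat", unit="deg"),
    ColumnRequirement("lon", unit="deg"),
    ColumnRequirement("reception_date"),
]

# A dataset that lacks the unit annotation on lon fails the check.
try:
    check_requirements(required, {"lat": "deg", "lon": None, "reception_date": None})
except ValueError as error:
    print(f"dataset rejected: {error}")
```

In this picture, running damast annotate with --set-unit beforehand is what would bring a dataset without unit annotations into a state that passes the check.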