Command Line Interface#

The command line interface offers a number of workflow simplifications, each encapsulated in a sub-command:

inspect

check metadata and dataset properties

convert

convert (zipped) csv or netcdf files to parquet (default) or hdf5 (deprecated)

annotate

create metadata file and update dataframe with metadata

process

apply a data pipeline to a dataset
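
Each sub-command documents its own options via --help, e.g.:

damast inspect --help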

Inspect#

usage: damast inspect [-h] [-w WORKDIR] [-v] [--loglevel LOGLEVEL]
                      [--logfile LOGFILE] -f FILES [FILES ...]
                      [--filter FILTER] [--head HEAD] [--tail TAIL]
                      [--column-count COLUMN_COUNT]
                      [--columns COLUMNS [COLUMNS ...]]
                      {} ...

damast inspect - data inspection subcommand called

positional arguments:
  {}                    sub-command help

options:
  -h, --help            show this help message and exit
  -w WORKDIR, --workdir WORKDIR
  -v, --verbose
  --loglevel LOGLEVEL   Set loglevel to display
  --logfile LOGFILE     Set file for saving log (default prints to terminal)
  -f FILES [FILES ...], --files FILES [FILES ...]
                        Files or patterns of the (annotated) data file that
                        should be inspected (space separated)
  --filter FILTER       Filter based on column data, e.g., mmsi==120123
  --head HEAD           First this number of rows, default is 10
  --tail TAIL           Print number of rows from the end, default is 10
  --column-count COLUMN_COUNT
                        Number of columns to show
  --columns COLUMNS [COLUMNS ...]
                        Show/Select these columns

Inspect allows you to identify the columns and the column properties of a given dataset. The dataset can consist of one or more (zipped) files, given either as a list of filenames or as a file pattern.

$ damast inspect -f 1.zip

Subparser: DataInspectParser
Loading dataframe (1 files) of total size: 0.0 MB
Creating offset dictionary for /tmp/damast-example/datasets/1.zip ...
Creating offset dictionary for /tmp/damast-example/datasets/1.zip took 0.00s
Created mount point at: /tmp/damast-mountqigwlx74/1.zip
INFO:damast.core.dataframe:Loading parquet: files=[PosixPath('/tmp/damast-mountqigwlx74/1.zip/dataset-1.zst.parquet')]
WARNING:damast.core.dataframe:/tmp/damast-mountqigwlx74/1.zip/dataset-1.zst.parquet has no (damast) annotations
INFO:damast.core.dataframe:No metadata provided or found in files - searching now for an existing spec file
INFO:damast.core.dataframe:Found no candidate for a spec file
INFO:damast.core.dataframe:Metadata is not available and not required, so inferring annotation
Extract str and categorical column metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 1092.51it/s]
Extract numeric column metadata: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 12767.03it/s]
INFO:damast.core.dataframe:Metadata inferring completed
Annotations:
    accuracy:
        is_optional: False
        representation_type: Boolean
    call_sign:
        is_optional: False
        representation_type: String
        value_range: {'ListOfValues': [None, '', 'SIDF9', 'SABD4', 'STDL5', 'STJE3', 'SKCY7', 'XAGBE']}
    cog:
        is_optional: False
        representation_type: Float32
        value_stats: {'mean': 142.0380096435547, 'stddev': 117.50126647949219, 'total_count': 1234, 'null_count': 745}
    corrupted:
        is_optional: False
        representation_type: Boolean
    corrupted_right:
        is_optional: False
        representation_type: Boolean
    destination:
        is_optional: False
        representation_type: String
        value_range: {'ListOfValues': [None, '', 'VILA', 'ES SUR', 'ESICL', 'EBAL>EDGA', 'IT-SEP', 'PLATF ROMA', 'ITL-BREG']}
    dimension_to_bow:
        is_optional: False
        representation_type: UInt16

...
    sog:
        is_optional: False
        representation_type: Float32
        value_stats: {'mean': 2.0780696868896484, 'stddev': 4.677201271057129, 'total_count': 1979, 'null_count': 0}
    version:
        is_optional: False
        representation_type: Int64
        value_range: {'MinMax': {'min': 3, 'max': 3, 'allow_missing': True}}
        value_stats: {'mean': 3.0, 'stddev': 0.0, 'total_count': 1979, 'null_count': 0}


First 10 and last 10 rows:
shape: (10, 32)
┌───────────┬─────────────────────┬──────────┬───────────┬──────┬───┬────────────┬────────────────────┬──────────────────┬───────────────────────┬─────────┐
│ mmsi      ┆ reception_date      ┆ lon      ┆ lat       ┆ rot  ┆ … ┆ eta        ┆ message_type_right ┆ satellite_static ┆ reception_date_static ┆ version │
│ ---       ┆ ---                 ┆ ---      ┆ ---       ┆ ---  ┆   ┆ ---        ┆ ---                ┆ ---              ┆ ---                   ┆ ---     │
│ i32       ┆ datetime[ms]        ┆ f64      ┆ f64       ┆ f32  ┆   ┆ i64        ┆ i64                ┆ str              ┆ datetime[ms]          ┆ i64     │
╞═══════════╪═════════════════════╪══════════╪═══════════╪══════╪═══╪════════════╪════════════════════╪══════════════════╪═══════════════════════╪═════════╡
│ 345080000 ┆ 2020-11-18 33:00:18 ┆ 0.783398 ┆ 40.483513 ┆ 0.0  ┆ … ┆ null       ┆ null               ┆ null             ┆ null                  ┆ 3       │
│ 334015340 ┆ 2020-11-18 33:00:33 ┆ 0.435345 ┆ 40.414097 ┆ null ┆ … ┆ 1735889800 ┆ 5                  ┆ SAT-AA_037       ┆ 2020-11-18 33:07:35   ┆ 3       │
│ 334088470 ┆ 2020-11-18 33:00:37 ┆ 0.403745 ┆ 40.358495 ┆ null ┆ … ┆ null       ┆ null               ┆ null             ┆ null                  ┆ 3       │
│ 334098970 ┆ 2020-11-18 33:00:39 ┆ 0.88999  ┆ 40.389833 ┆ null ┆ … ┆ 1783310700 ┆ 5                  ┆ SAT-AA_038       ┆ 2020-11-18 33:04:13   ┆ 3       │
│ 333019738 ┆ 2020-11-18 33:01:18 ┆ 0.80045  ┆ 40.819483 ┆ null ┆ … ┆ null       ┆ 34                 ┆ SAT-AA_038       ┆ 2020-11-18 33:33:51   ┆ 3       │
│ 353003075 ┆ 2020-11-18 33:01:38 ┆ 0.550948 ┆ 40.571973 ┆ 0.0  ┆ … ┆ null       ┆ 5                  ┆ SAT-AA_037       ┆ 2020-11-18 33:01:13   ┆ 3       │
│ 345080000 ┆ 2020-11-18 33:01:37 ┆ 0.759477 ┆ 40.481487 ┆ 0.0  ┆ … ┆ null       ┆ null               ┆ null             ┆ null                  ┆ 3       │
│ 334015340 ┆ 2020-11-18 33:01:33 ┆ 0.435338 ┆ 40.414093 ┆ null ┆ … ┆ 1735889800 ┆ 5                  ┆ SAT-AA_037       ┆ 2020-11-18 33:07:35   ┆ 3       │
│ 334088470 ┆ 2020-11-18 33:01:37 ┆ 0.403743 ┆ 40.358513 ┆ null ┆ … ┆ null       ┆ null               ┆ null             ┆ null                  ┆ 3       │
│ 334098970 ┆ 2020-11-18 33:03:18 ┆ 0.890313 ┆ 40.370833 ┆ null ┆ … ┆ 1783310700 ┆ 5                  ┆ SAT-AA_038       ┆ 2020-11-18 33:04:13   ┆ 3       │
└───────────┴─────────────────────┴──────────┴───────────┴──────┴───┴────────────┴────────────────────┴──────────────────┴───────────────────────┴─────────┘
shape: (10, 32)
┌───────────┬─────────────────────┬──────────┬───────────┬──────┬───┬────────────┬────────────────────┬──────────────────┬───────────────────────┬─────────┐
│ mmsi      ┆ reception_date      ┆ lon      ┆ lat       ┆ rot  ┆ … ┆ eta        ┆ message_type_right ┆ satellite_static ┆ reception_date_static ┆ version │
│ ---       ┆ ---                 ┆ ---      ┆ ---       ┆ ---  ┆   ┆ ---        ┆ ---                ┆ ---              ┆ ---                   ┆ ---     │
│ i32       ┆ datetime[ms]        ┆ f64      ┆ f64       ┆ f32  ┆   ┆ i64        ┆ i64                ┆ str              ┆ datetime[ms]          ┆ i64     │
╞═══════════╪═════════════════════╪══════════╪═══════════╪══════╪═══╪════════════╪════════════════════╪══════════════════╪═══════════════════════╪═════════╡
│ 335990004 ┆ 2020-11-19 01:59:00 ┆ 0.849883 ┆ 40.937813 ┆ null ┆ … ┆ null       ┆ 34                 ┆ SAT-AA_038       ┆ 2020-11-19 01:31:43   ┆ 3       │
│ 334015340 ┆ 2020-11-19 03:00:13 ┆ 0.435335 ┆ 40.414083 ┆ null ┆ … ┆ 1735889800 ┆ 5                  ┆ SAT-AA_037       ┆ 2020-11-19 01:19:34   ┆ 3       │
│ 334088470 ┆ 2020-11-19 03:00:19 ┆ 0.40377  ┆ 40.358493 ┆ null ┆ … ┆ null       ┆ null               ┆ null             ┆ null                  ┆ 3       │
│ 333049539 ┆ 2020-11-19 03:00:31 ┆ 0.80088  ┆ 40.819835 ┆ null ┆ … ┆ null       ┆ 34                 ┆ SAT-AA_038       ┆ 2020-11-19 01:03:01   ┆ 3       │
│ 334018830 ┆ 2020-11-19 03:00:35 ┆ 0.895348 ┆ 40.897835 ┆ null ┆ … ┆ 1735889800 ┆ 5                  ┆ SAT-AA_038       ┆ 2020-11-19 01:07:38   ┆ 3       │
│ 333058871 ┆ 2020-11-19 03:00:31 ┆ 0.800105 ┆ 40.819735 ┆ null ┆ … ┆ null       ┆ 34                 ┆ SAT-AA_037       ┆ 2020-11-19 00:59:00   ┆ 3       │
│ 334098970 ┆ 2020-11-19 03:00:37 ┆ 0.891373 ┆ 40.403033 ┆ null ┆ … ┆ 1783310700 ┆ 5                  ┆ SAT-AA_038       ┆ 2020-11-19 01:04:13   ┆ 3       │
│ 345080000 ┆ 2020-11-19 03:00:38 ┆ 0.373085 ┆ 40.10578  ┆ 0.0  ┆ … ┆ null       ┆ null               ┆ null             ┆ null                  ┆ 3       │
│ 333041379 ┆ 2020-11-19 03:00:48 ┆ 0.801778 ┆ 40.818448 ┆ null ┆ … ┆ null       ┆ null               ┆ null             ┆ null                  ┆ 3       │
│ 333048134 ┆ 2020-11-19 03:00:54 ┆ 0.803097 ┆ 40.830813 ┆ null ┆ … ┆ null       ┆ 34                 ┆ SAT-AA_037       ┆ 2020-11-19 01:08:30   ┆ 3       │
└───────────┴─────────────────────┴──────────┴───────────┴──────┴───┴────────────┴────────────────────┴──────────────────┴───────────────────────┴─────────┘

Examples#

The data can be filtered on column values using a Python expression that is compliant with the dataframe backend in use (here: polars).

For instance, to extract:

  • the time-series for a particular id (mmsi):

damast inspect -f 1.zip --filter 'mmsi == 335990004'
  • all data in a time interval:

damast inspect -f 1.zip --filter 'reception_date >= dt.datetime.fromisoformat("2020-11-19 00:00:00")' --filter 'reception_date <= dt.datetime.fromisoformat("2020-11-20 00:00:00")'
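  • a subset of columns, with a row limit, across several files; this is a sketch combining a file pattern with column selection (the pattern, column names, and speed threshold are illustrative, based on the dataset inspected above):

damast inspect -f "*.zip" --columns mmsi reception_date sog --head 5 --filter 'sog > 10'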

Convert#

usage: damast convert [-h] [-w WORKDIR] [-v] [--loglevel LOGLEVEL]
                      [--logfile LOGFILE] -f FILES [FILES ...]
                      [-m METADATA_INPUT] [-o OUTPUT_FILE]
                      [--output-dir OUTPUT_DIR] [--output-type OUTPUT_TYPE]
                      [--validation-mode {ignore,readonly,update_data,update_metadata}]
                      {} ...

damast convert - data conversion subcommand called

positional arguments:
  {}                    sub-command help

options:
  -h, --help            show this help message and exit
  -w WORKDIR, --workdir WORKDIR
  -v, --verbose
  --loglevel LOGLEVEL   Set loglevel to display
  --logfile LOGFILE     Set file for saving log (default prints to terminal)
  -f FILES [FILES ...], --files FILES [FILES ...]
                        Files or patterns of the (annotated) data file that
                        should be converted
  -m METADATA_INPUT, --metadata-input METADATA_INPUT
                        The metadata input file
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        The output file either: .parquet, .hdf5
  --output-dir OUTPUT_DIR
                        The output directory
  --output-type OUTPUT_TYPE
                        The output file type: .parquet (default) or .hdf5
  --validation-mode {ignore,readonly,update_data,update_metadata}
                        Define the validation mode

Examples#

  • convert one or more files to parquet (N:N)

damast convert -f 1.zip --output-dir export --output-type .parquet
  • convert one or more files to a single parquet file (N:1)

damast convert -f 1.zip --output-file data-1.parquet --output-type .parquet
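  • convert while supplying an external metadata file and selecting how validation is handled; the spec file name here is a placeholder, and the available --validation-mode values are listed in the help above

damast convert -f 1.zip -m dataset-spec.yaml --output-dir export --validation-mode update_metadata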

Annotate#

usage: damast annotate [-h] [-w WORKDIR] [-v] [--loglevel LOGLEVEL]
                       [--logfile LOGFILE] -f FILES [FILES ...]
                       [-o OUTPUT_DIR] [--output-spec-file OUTPUT_SPEC_FILE]
                       [--set-description COLUMN:VALUE [COLUMN:VALUE ...]]
                       [--set-abbreviation COLUMN:VALUE [COLUMN:VALUE ...]]
                       [--set-unit COLUMN:VALUE [COLUMN:VALUE ...]]
                       [--set-representation_type COLUMN:VALUE [COLUMN:VALUE ...]]
                       [--inplace] [--apply]
                       {} ...

damast annotate - extract (default) or apply annotation to dataset

positional arguments:
  {}                    sub-command help

options:
  -h, --help            show this help message and exit
  -w WORKDIR, --workdir WORKDIR
  -v, --verbose
  --loglevel LOGLEVEL   Set loglevel to display
  --logfile LOGFILE     Set file for saving log (default prints to terminal)
  -f FILES [FILES ...], --files FILES [FILES ...]
                        Files or patterns of the (annotated) data file that
                        should be annotated
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory
  --output-spec-file OUTPUT_SPEC_FILE
                        The spec file name - if provided with path, it will
                        override output-dir
  --set-description COLUMN:VALUE [COLUMN:VALUE ...]
                        Set description in spec for a column and value
  --set-abbreviation COLUMN:VALUE [COLUMN:VALUE ...]
                        Set abbreviation in spec for a column and value
  --set-unit COLUMN:VALUE [COLUMN:VALUE ...]
                        Set unit in spec for a column and value
  --set-representation_type COLUMN:VALUE [COLUMN:VALUE ...]
                        Set representation_type in spec for a column and value
  --inplace             Update the dataset inplace (only possible for a single
                        file)
  --apply               Update the annotation inference and rewrite the
                        metadata to the dataset

Examples#

  • set the unit for two columns, here lat and lon, to deg, creating a new file in the subfolder export

damast annotate -f input.parquet --set-unit lon:deg lat:deg --output-dir export
  • set the unit for two columns, here lat and lon, to deg in place, i.e., modify the existing file

damast annotate -f input.parquet --set-unit lon:deg lat:deg --inplace
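  • set a description for a column and write the resulting spec to a dedicated file; the description text and spec file name are placeholders

damast annotate -f input.parquet --set-description lon:"longitude in degrees (WGS84)" --output-spec-file export/dataset-spec.yaml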

Process#

usage: damast process [-h] [-w WORKDIR] [-v] [--loglevel LOGLEVEL]
                      [--logfile LOGFILE] --input-data INPUT_DATA
                      [INPUT_DATA ...] --pipeline PIPELINE
                      [--output-file OUTPUT_FILE]
                      {} ...

damast process - apply an existing pipeline

positional arguments:
  {}                    sub-command help

options:
  -h, --help            show this help message and exit
  -w WORKDIR, --workdir WORKDIR
  -v, --verbose
  --loglevel LOGLEVEL   Set loglevel to display
  --logfile LOGFILE     Set file for saving log (default prints to terminal)
  --input-data INPUT_DATA [INPUT_DATA ...]
                        Input file(s) to process
  --pipeline PIPELINE   Pipeline (*.damast.ppl) file to apply to the data
  --output-file OUTPUT_FILE
                        Save the result in the given (*.parquet) file

Once a DataProcessingPipeline has been exported and saved, e.g., as my-pipeline.damast.ppl in the following example, it can be reapplied to an existing dataset. The dataset has to comply with the pipeline's required input columns and metadata requirements, such as units, for the pipeline to run successfully. Damast checks these requirements and raises an exception if they are not satisfied.

from pathlib import Path

from damast.core import DataProcessingPipeline
from damast.data_handling.transformers import AddDeltaTime
from damast.domains.maritime.transformers.features import DeltaDistance, Speed


class MyPipeline(DataProcessingPipeline):
    def __init__(self,
                 workdir: str | Path,
                 name: str = "my-pipeline",
                 name_mappings: dict[str, str] | None = None):
        super().__init__(name=name,
                         base_dir=workdir,
                         name_mappings=name_mappings or {})

        # Time difference between consecutive messages of the same vessel (mmsi)
        self.add("Delta Time",
                 AddDeltaTime(),
                 name_mappings={
                     "group": "mmsi",
                     "time_column": "reception_date"
                 })

        # Distance travelled between consecutive positions of the same vessel
        self.add("Delta Distance",
                 DeltaDistance(x_shift=True, y_shift=True),
                 name_mappings={
                     "group": "mmsi",
                     "sort": "reception_date",
                     "x": "lat",
                     "y": "lon",
                     "out": "delta_distance",
                 })

        # Speed derived from the delta-distance and delta-time columns
        self.add("Speed",
                 Speed(),
                 name_mappings={
                     "delta_distance": "delta_distance",
                     "delta_time": "delta_time",
                 })


pipeline = MyPipeline(workdir=".")
pipeline.save("pipelines")
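
Saving the pipeline should produce pipelines/my-pipeline.damast.ppl (the file name being derived from the pipeline's name); this is the file referenced in the example below.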

Examples#

damast process --input-data input.parquet --pipeline pipelines/my-pipeline.damast.ppl
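
To persist the processed data, additionally pass --output-file; the output name below is a placeholder:

damast process --input-data input.parquet --pipeline pipelines/my-pipeline.damast.ppl --output-file processed.parquet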