Command Line Interface#
The command line interface offers a number of workflow simplifications that are encapsulated in sub-commands:
- inspect
check metadata and dataset properties
- convert
convert from (zipped) csv or netcdf to parquet (default) or hdf5 (deprecated)
- annotate
create a metadata file and update a dataframe with metadata
- process
apply a data-processing pipeline to a dataset
Inspect#
Inspect allows you to identify the columns, and the properties of those columns, in a given dataset. The dataset can consist of one or more (zipped) files, given either as a list of filenames or as a file pattern.
$ damast inspect -f 1.zip
Subparser: DataInspectParser
Loading dataframe (1 files) of total size: 0.0 MB
Creating offset dictionary for /tmp/damast-example/datasets/1.zip ...
Creating offset dictionary for /tmp/damast-example/datasets/1.zip took 0.00s
Created mount point at: /tmp/damast-mountqigwlx74/1.zip
INFO:damast.core.dataframe:Loading parquet: files=[PosixPath('/tmp/damast-mountqigwlx74/1.zip/dataset-1.zst.parquet')]
WARNING:damast.core.dataframe:/tmp/damast-mountqigwlx74/1.zip/dataset-1.zst.parquet has no (damast) annotations
INFO:damast.core.dataframe:No metadata provided or found in files - searching now for an existing spec file
INFO:damast.core.dataframe:Found no candidate for a spec file
INFO:damast.core.dataframe:Metadata is not available and not required, so inferring annotation
Extract str and categorical column metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 1092.51it/s]
Extract numeric column metadata: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 12767.03it/s]
INFO:damast.core.dataframe:Metadata inferring completed
Annotations:
accuracy:
is_optional: False
representation_type: Boolean
call_sign:
is_optional: False
representation_type: String
value_range: {'ListOfValues': [None, '', 'SIDF9', 'SABD4', 'STDL5', 'STJE3', 'SKCY7', 'XAGBE']}
cog:
is_optional: False
representation_type: Float32
value_stats: {'mean': 142.0380096435547, 'stddev': 117.50126647949219, 'total_count': 1234, 'null_count': 745}
corrupted:
is_optional: False
representation_type: Boolean
corrupted_right:
is_optional: False
representation_type: Boolean
destination:
is_optional: False
representation_type: String
value_range: {'ListOfValues': [None, '', 'VILA', 'ES SUR', 'ESICL', 'EBAL>EDGA', 'IT-SEP', 'PLATF ROMA', 'ITL-BREG']}
dimension_to_bow:
is_optional: False
representation_type: UInt16
...
sog:
is_optional: False
representation_type: Float32
value_stats: {'mean': 2.0780696868896484, 'stddev': 4.677201271057129, 'total_count': 1979, 'null_count': 0}
version:
is_optional: False
representation_type: Int64
value_range: {'MinMax': {'min': 3, 'max': 3, 'allow_missing': True}}
value_stats: {'mean': 3.0, 'stddev': 0.0, 'total_count': 1979, 'null_count': 0}
First 10 and last 10 rows:
shape: (10, 32)
┌───────────┬─────────────────────┬──────────┬───────────┬──────┬───┬────────────┬────────────────────┬──────────────────┬───────────────────────┬─────────┐
│ mmsi ┆ reception_date ┆ lon ┆ lat ┆ rot ┆ … ┆ eta ┆ message_type_right ┆ satellite_static ┆ reception_date_static ┆ version │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ datetime[ms] ┆ f64 ┆ f64 ┆ f32 ┆ ┆ i64 ┆ i64 ┆ str ┆ datetime[ms] ┆ i64 │
╞═══════════╪═════════════════════╪══════════╪═══════════╪══════╪═══╪════════════╪════════════════════╪══════════════════╪═══════════════════════╪═════════╡
│ 345080000 ┆ 2020-11-18 23:00:18 ┆ 0.783398 ┆ 40.483513 ┆ 0.0 ┆ … ┆ null ┆ null ┆ null ┆ null ┆ 3 │
│ 334015340 ┆ 2020-11-18 23:00:33 ┆ 0.435345 ┆ 40.414097 ┆ null ┆ … ┆ 1735889800 ┆ 5 ┆ SAT-AA_037 ┆ 2020-11-18 23:07:35 ┆ 3 │
│ 334088470 ┆ 2020-11-18 23:00:37 ┆ 0.403745 ┆ 40.358495 ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null ┆ 3 │
│ 334098970 ┆ 2020-11-18 23:00:39 ┆ 0.88999 ┆ 40.389833 ┆ null ┆ … ┆ 1783310700 ┆ 5 ┆ SAT-AA_038 ┆ 2020-11-18 23:04:13 ┆ 3 │
│ 333019738 ┆ 2020-11-18 23:01:18 ┆ 0.80045 ┆ 40.819483 ┆ null ┆ … ┆ null ┆ 34 ┆ SAT-AA_038 ┆ 2020-11-18 23:33:51 ┆ 3 │
│ 353003075 ┆ 2020-11-18 23:01:38 ┆ 0.550948 ┆ 40.571973 ┆ 0.0 ┆ … ┆ null ┆ 5 ┆ SAT-AA_037 ┆ 2020-11-18 23:01:13 ┆ 3 │
│ 345080000 ┆ 2020-11-18 23:01:37 ┆ 0.759477 ┆ 40.481487 ┆ 0.0 ┆ … ┆ null ┆ null ┆ null ┆ null ┆ 3 │
│ 334015340 ┆ 2020-11-18 23:01:33 ┆ 0.435338 ┆ 40.414093 ┆ null ┆ … ┆ 1735889800 ┆ 5 ┆ SAT-AA_037 ┆ 2020-11-18 23:07:35 ┆ 3 │
│ 334088470 ┆ 2020-11-18 23:01:37 ┆ 0.403743 ┆ 40.358513 ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null ┆ 3 │
│ 334098970 ┆ 2020-11-18 23:03:18 ┆ 0.890313 ┆ 40.370833 ┆ null ┆ … ┆ 1783310700 ┆ 5 ┆ SAT-AA_038 ┆ 2020-11-18 23:04:13 ┆ 3 │
└───────────┴─────────────────────┴──────────┴───────────┴──────┴───┴────────────┴────────────────────┴──────────────────┴───────────────────────┴─────────┘
shape: (10, 32)
┌───────────┬─────────────────────┬──────────┬───────────┬──────┬───┬────────────┬────────────────────┬──────────────────┬───────────────────────┬─────────┐
│ mmsi ┆ reception_date ┆ lon ┆ lat ┆ rot ┆ … ┆ eta ┆ message_type_right ┆ satellite_static ┆ reception_date_static ┆ version │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ datetime[ms] ┆ f64 ┆ f64 ┆ f32 ┆ ┆ i64 ┆ i64 ┆ str ┆ datetime[ms] ┆ i64 │
╞═══════════╪═════════════════════╪══════════╪═══════════╪══════╪═══╪════════════╪════════════════════╪══════════════════╪═══════════════════════╪═════════╡
│ 335990004 ┆ 2020-11-19 01:59:00 ┆ 0.849883 ┆ 40.937813 ┆ null ┆ … ┆ null ┆ 34 ┆ SAT-AA_038 ┆ 2020-11-19 01:31:43 ┆ 3 │
│ 334015340 ┆ 2020-11-19 03:00:13 ┆ 0.435335 ┆ 40.414083 ┆ null ┆ … ┆ 1735889800 ┆ 5 ┆ SAT-AA_037 ┆ 2020-11-19 01:19:34 ┆ 3 │
│ 334088470 ┆ 2020-11-19 03:00:19 ┆ 0.40377 ┆ 40.358493 ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null ┆ 3 │
│ 333049539 ┆ 2020-11-19 03:00:31 ┆ 0.80088 ┆ 40.819835 ┆ null ┆ … ┆ null ┆ 34 ┆ SAT-AA_038 ┆ 2020-11-19 01:03:01 ┆ 3 │
│ 334018830 ┆ 2020-11-19 03:00:35 ┆ 0.895348 ┆ 40.897835 ┆ null ┆ … ┆ 1735889800 ┆ 5 ┆ SAT-AA_038 ┆ 2020-11-19 01:07:38 ┆ 3 │
│ 333058871 ┆ 2020-11-19 03:00:31 ┆ 0.800105 ┆ 40.819735 ┆ null ┆ … ┆ null ┆ 34 ┆ SAT-AA_037 ┆ 2020-11-19 00:59:00 ┆ 3 │
│ 334098970 ┆ 2020-11-19 03:00:37 ┆ 0.891373 ┆ 40.403033 ┆ null ┆ … ┆ 1783310700 ┆ 5 ┆ SAT-AA_038 ┆ 2020-11-19 01:04:13 ┆ 3 │
│ 345080000 ┆ 2020-11-19 03:00:38 ┆ 0.373085 ┆ 40.10578 ┆ 0.0 ┆ … ┆ null ┆ null ┆ null ┆ null ┆ 3 │
│ 333041379 ┆ 2020-11-19 03:00:48 ┆ 0.801778 ┆ 40.818448 ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null ┆ 3 │
│ 333048134 ┆ 2020-11-19 03:00:54 ┆ 0.803097 ┆ 40.830813 ┆ null ┆ … ┆ null ┆ 34 ┆ SAT-AA_037 ┆ 2020-11-19 01:08:30 ┆ 3 │
└───────────┴─────────────────────┴──────────┴───────────┴──────┴───┴────────────┴────────────────────┴──────────────────┴───────────────────────┴─────────┘
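The value_stats entries reported above are plain descriptive statistics over each numeric column. Judging from the output (for cog, total_count 1234 plus null_count 745 equals the 1979 rows of the dataset), total_count appears to count non-null values; a small stand-alone sketch of that kind of summary, computed over a hypothetical column:

```python
# Sketch of the per-column summary that `damast inspect` reports as
# value_stats. Assumption (inferred from the output above): total_count
# counts non-null values and null_count the remaining rows.
import statistics

column = [3.0, 1.5, None, 4.5, None]        # hypothetical numeric column
non_null = [v for v in column if v is not None]

value_stats = {
    "mean": statistics.mean(non_null),
    "stddev": statistics.pstdev(non_null),  # population standard deviation
    "total_count": len(non_null),
    "null_count": len(column) - len(non_null),
}
print(value_stats)
```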
Examples#
Individual columns can be filtered using a Python expression that is compliant with the backend being used (here: polars).
For instance, to extract:
the time-series for a particular id (mmsi):
damast inspect -f 1.zip --filter 'mmsi == 335990004'
all data in a time interval:
damast inspect -f 1.zip --filter 'reception_date >= dt.datetime.fromisoformat("2020-11-19 00:00:00")' --filter 'reception_date <= dt.datetime.fromisoformat("2020-11-20 00:00:00")'
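The interval filter above uses Python's ISO-format datetime parsing, and conceptually keeps every row whose reception_date lies between the two bounds. A minimal stand-alone sketch of that predicate, with hypothetical timestamps:

```python
import datetime as dt

# Bounds exactly as written in the --filter expressions above
start = dt.datetime.fromisoformat("2020-11-19 00:00:00")
end = dt.datetime.fromisoformat("2020-11-20 00:00:00")

# Hypothetical reception_date values
reception_dates = [
    dt.datetime(2020, 11, 18, 23, 59, 59),  # before the interval
    dt.datetime(2020, 11, 19, 3, 0, 13),    # inside the interval
    dt.datetime(2020, 11, 20, 0, 0, 1),     # after the interval
]

in_interval = [t for t in reception_dates if start <= t <= end]
```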
Convert#
damast convert --help
usage: damast convert [-h] [-w WORKDIR] [-v] [--loglevel LOGLEVEL] [--logfile LOGFILE] -f FILES [FILES ...] [-m METADATA_INPUT] [-o OUTPUT_FILE] [--output-dir OUTPUT_DIR]
[--output-type OUTPUT_TYPE] [--validation-mode {ignore,readonly,update_data,update_metadata}]
{} ...
damast convert - data conversion subcommand called
positional arguments:
{} sub-command help
options:
-h, --help show this help message and exit
-w WORKDIR, --workdir WORKDIR
-v, --verbose
--loglevel LOGLEVEL Set loglevel to display
--logfile LOGFILE Set file for saving log (default prints to terminal)
-f FILES [FILES ...], --files FILES [FILES ...]
Files or patterns of the (annotated) data file that should be converted
-m METADATA_INPUT, --metadata-input METADATA_INPUT
The metadata input file
-o OUTPUT_FILE, --output-file OUTPUT_FILE
The output file either: .parquet, .hdf5
--output-dir OUTPUT_DIR
The output directory
--output-type OUTPUT_TYPE
The output file type: .parquet (default) or .hdf5
--validation-mode {ignore,readonly,update_data,update_metadata}
Define the validation mode
Examples#
convert one or more files to parquet (N:N)
damast convert -f 1.zip --output-dir export --output-type .parquet
convert one or more files to a single parquet file (N:1)
damast convert -f 1.zip --output-file data-1.parquet --output-type .parquet
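As the inspect run above shows, zipped inputs are mounted and their members read in place rather than extracted. A rough stdlib sketch of the member-discovery step (an illustration, not damast's actual implementation):

```python
# Sketch: discover dataset members inside a zip archive without extracting
# it, similar in spirit to what damast does when given 1.zip.
import io
import zipfile

# Build an in-memory stand-in for 1.zip containing one dataset member
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("dataset-1.zst.parquet", b"<parquet bytes>")

# List convertible members directly from the archive
with zipfile.ZipFile(buf) as zf:
    members = [n for n in zf.namelist() if n.endswith(".parquet")]
```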
Annotate#
Examples#
set the unit of two columns, here lat and lon, to deg, creating a new file in the subfolder export
damast annotate -f input.parquet --set-unit lon:deg lat:deg --output-dir export
set the unit of two columns, here lat and lon, to deg, in place, i.e., changing the existing file
damast annotate -f input.parquet --set-unit lon:deg lat:deg --inplace
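The --set-unit arguments follow a COLUMN:UNIT convention. A sketch of how such arguments can be split into a column-to-unit mapping (an illustration of the convention, not damast's actual parser):

```python
def parse_unit_args(args: list[str]) -> dict[str, str]:
    """Split COLUMN:UNIT pairs as passed to --set-unit."""
    units = {}
    for arg in args:
        column, unit = arg.split(":", maxsplit=1)
        units[column] = unit
    return units

# The two pairs from the examples above
units = parse_unit_args(["lon:deg", "lat:deg"])
```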
Process#
Once a DataProcessingPipeline has been exported and saved, e.g., as my-pipeline.damast.ppl in the following example, it can be reapplied to an existing dataset. The dataset needs to comply with the pipeline's required input columns and metadata requirements, such as units, so that the pipeline can run successfully. Damast checks these requirements and raises an exception if they are not satisfied.
import damast
from pathlib import Path

from damast.core import DataProcessingPipeline
from damast.core.dataframe import AnnotatedDataFrame
from damast.data_handling.transformers import (
    DropMissingOrNan,
    AddDeltaTime,
)
from damast.domains.maritime.transformers.features import (
    DeltaDistance,
    Speed,
)


class MyPipeline(DataProcessingPipeline):
    def __init__(self,
                 workdir: str | Path,
                 name: str = "my-pipeline",
                 name_mappings: dict[str, str] = {}):
        super().__init__(name=name,
                         base_dir=workdir,
                         name_mappings=name_mappings)

        self.add("Delta Time",
                 AddDeltaTime(),
                 name_mappings={
                     "group": "mmsi",
                     "time_column": "reception_date"
                 })
        self.add("Delta Distance",
                 DeltaDistance(x_shift=True, y_shift=True),
                 name_mappings={
                     "group": "mmsi",
                     "sort": "reception_date",
                     "x": "lat",
                     "y": "lon",
                     "out": "delta_distance",
                 })
        self.add("Speed",
                 Speed(),
                 name_mappings={
                     "delta_distance": "delta_distance",
                     "delta_time": "delta_time",
                 })


pipeline = MyPipeline(workdir=".")
pipeline.save("pipelines")
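The name_mappings passed to each step bind a transformer's generic parameter names (such as x, y, out) to concrete dataset columns, which is what lets the same pipeline run on differently named datasets. Conceptually (a simplification, not damast's internal resolution logic):

```python
# Generic input names that a hypothetical transformer declares
generic_inputs = ["group", "sort", "x", "y"]

# The binding from the DeltaDistance step above
name_mappings = {"group": "mmsi", "sort": "reception_date",
                 "x": "lat", "y": "lon", "out": "delta_distance"}

# Resolve generic names to actual dataset columns before running the step
resolved = [name_mappings[name] for name in generic_inputs]
```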
Examples#
damast process --input-data input.parquet --pipeline pipelines/my-pipeline.damast.ppl
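Numerically, the last two pipeline steps boil down to dividing a per-vessel distance delta by the corresponding time delta. A toy illustration with made-up values (the real transformers operate on dataframe columns, and units depend on the column annotations):

```python
# Hypothetical per-row deltas, as produced by the AddDeltaTime and
# DeltaDistance steps; the first row of a group has no predecessor.
delta_distance = [0.0, 1852.0, 3704.0]   # e.g. metres
delta_time = [None, 600.0, 600.0]        # e.g. seconds

# speed = delta_distance / delta_time, undefined where delta_time is missing
speed = [d / t if t else None
         for d, t in zip(delta_distance, delta_time)]
```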