BaseDataset — R6 infrastructure for clinical event datasets
Source:R/Dataset_BaseDataset.R
Details
The BaseDataset class mirrors rhealth's BaseDataset, providing a
fully featured, YAML-driven loader that converts multi-table electronic
health records into a single event table. It supports:
URL or local-file ingestion (with automatic .csv/.csv.gz fallback).
Per-table joins as declared in the config.
Flexible timestamp parsing (single or multi-column).
A dev mode that caps the number of patients for rapid prototyping.
Multi-threaded sample generation with progress bars.
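The file fallback and multi-column timestamp parsing listed above can be sketched in base R. This is a minimal illustration, not the package's implementation: resolve_csv_path() and parse_timestamp() are hypothetical helpers, the sketch covers local files only (not URLs), and the column names are invented for the example.

```r
# Hypothetical helper: try the given path first, then its .csv / .csv.gz counterpart.
resolve_csv_path <- function(path) {
  alt <- if (grepl("\\.csv\\.gz$", path)) {
    sub("\\.gz$", "", path)   # foo.csv.gz -> foo.csv
  } else {
    paste0(path, ".gz")       # foo.csv -> foo.csv.gz
  }
  for (candidate in c(path, alt)) {
    if (file.exists(candidate)) return(candidate)
  }
  stop("Neither ", path, " nor ", alt, " exists")
}

# Hypothetical helper: build one timestamp from one or more columns by
# pasting their values together (space-separated) before parsing.
parse_timestamp <- function(df, cols, format = "%Y-%m-%d %H:%M:%S") {
  parts <- lapply(cols, function(col) as.character(df[[col]]))
  raw <- do.call(paste, c(parts, sep = " "))
  as.POSIXct(raw, format = format, tz = "UTC")
}

# Example with invented column names: date and time stored separately.
events <- data.frame(
  chartdate = c("2020-01-01", "2020-01-02"),
  charttime = c("08:30:00", "14:05:00"),
  stringsAsFactors = FALSE
)
events$timestamp <- parse_timestamp(events, c("chartdate", "charttime"))
```

A single-column timestamp goes through the same helper by passing one column name and a matching format string.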
Downstream, it cooperates with BaseTask (task definition),
Patient (per-subject wrapper), and SampleDataset (collection of
input/output pairs).
Dependencies
Polars is used via the polars R package. Parallelism and progress
reporting require future, future.apply, and progressr.
Public fields
root: Root directory (or URL prefix) for data files.
tables: Character vector of table names to ingest.
dataset_name: Human-readable dataset label.
config: Parsed YAML configuration list.
dev: Logical flag; when TRUE, limits loading to 1000 patients.
con: A duckdb connection.
global_event_df: A duckdb lazy query with all events combined.
.collected_global_event_df: Polars dataframe storing all global events.
.unique_patient_ids: Character vector of unique patient IDs.
Methods
Method new()
Instantiate a BaseDataset.
Usage
BaseDataset$new(
  root,
  tables,
  dataset_name = NULL,
  config_path = NULL,
  dev = FALSE
)
Arguments
root: Character. Root directory / URL prefix where CSV files live.
tables: Character vector of table keys defined in the config.
dataset_name: Optional custom name; defaults to the R6 class name.
config_path: Path to YAML or schema describing each table.
dev: Logical. If TRUE, limits to 1000 patients for speed.
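As a hedged usage sketch, a call to the constructor might look like the following. The directory, table keys, dataset name, and config path are all hypothetical placeholders; the call is guarded so the snippet runs harmlessly in a session where BaseDataset has not been loaded.

```r
# All values below are illustrative assumptions; substitute your own paths
# and the table keys that your YAML config actually defines.
args <- list(
  root = "data/ehr_demo",                 # hypothetical directory of CSV files
  tables = c("patients", "admissions"),   # hypothetical table keys
  dataset_name = "ehr_demo",
  config_path = "configs/ehr_demo.yaml",  # hypothetical YAML config
  dev = TRUE                              # cap at 1000 patients while prototyping
)

# Only construct the dataset when the class is available in this session.
if (exists("BaseDataset")) {
  ds <- do.call(BaseDataset$new, args)
}
```

Starting with dev = TRUE keeps the first load fast; switching to dev = FALSE afterwards processes the full cohort.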
Method collected_global_event_df()
Materialise (collect) the lazy event dataframe. In dev mode, only the first 1000 patients are kept.
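The dev-mode cap can be illustrated with a small base R sketch. It assumes "first" means first in order of appearance and that the patient identifier lives in a column named patient_id; both are assumptions, and limit_to_first_patients() is a hypothetical helper rather than part of the class.

```r
# Hypothetical helper: keep every event belonging to the first
# `max_patients` distinct patient IDs, in order of appearance.
limit_to_first_patients <- function(events, max_patients = 1000L,
                                    id_col = "patient_id") {
  keep_ids <- head(unique(events[[id_col]]), max_patients)
  events[events[[id_col]] %in% keep_ids, , drop = FALSE]
}

# Toy event table: three patients, five events.
events <- data.frame(
  patient_id = c("p1", "p1", "p2", "p3", "p2"),
  event_type = c("admissions", "labs", "admissions", "labs", "labs"),
  stringsAsFactors = FALSE
)

# Cap at two patients: all events of p1 and p2 survive, p3 is dropped.
small <- limit_to_first_patients(events, max_patients = 2L)
```

Filtering by patient rather than by row count keeps each retained patient's event history complete.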
Method load_table()
Load one table, apply joins, lowercase columns, and standardise to the event schema.
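The three steps named here (join, lowercase, standardise) can be sketched on plain data frames. The event schema used below (patient_id, event_type, timestamp, plus attributes prefixed with the table name) is an assumption made for illustration, as are the helper standardise_table() and the example column names; the actual method may differ.

```r
# Hypothetical helper illustrating the load_table() steps on data frames.
standardise_table <- function(main, table_name, patient_col, time_col,
                              joined = NULL, join_on = NULL) {
  # 1. Lowercase column names so the config can refer to them uniformly.
  names(main) <- tolower(names(main))
  patient_col <- tolower(patient_col)
  time_col <- tolower(time_col)

  # 2. Optional left join against a second table.
  if (!is.null(joined)) {
    names(joined) <- tolower(names(joined))
    main <- merge(main, joined, by = tolower(join_on), all.x = TRUE)
  }

  # 3. Reshape into the assumed event schema.
  attr_cols <- setdiff(names(main), c(patient_col, time_col))
  out <- data.frame(
    patient_id = as.character(main[[patient_col]]),
    event_type = rep(table_name, nrow(main)),
    timestamp = main[[time_col]],
    stringsAsFactors = FALSE
  )
  for (col in attr_cols) {
    out[[paste0(table_name, "/", col)]] <- main[[col]]
  }
  out
}

# Toy tables with invented, upper-case column names.
admissions <- data.frame(
  SUBJECT_ID = c(1, 2), HADM_ID = c(10, 20),
  ADMITTIME = c("2020-01-01 08:00:00", "2020-02-01 09:00:00"),
  stringsAsFactors = FALSE
)
patients <- data.frame(SUBJECT_ID = c(1, 2), GENDER = c("F", "M"),
                       stringsAsFactors = FALSE)

ev <- standardise_table(admissions, "admissions", "SUBJECT_ID", "ADMITTIME",
                        joined = patients, join_on = "SUBJECT_ID")
```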
Method set_task()
Apply a BaseTask to build a SampleDataset.
Arguments
task: A BaseTask instance; if NULL, default_task() is used.
num_workers: Integer ≥1. Number of parallel workers.
chunk_size: Integer. Number of patients to process in each chunk.
cache_dir: Optional path to a directory for caching samples. If set, processed samples will be saved to an .rds file and reloaded on subsequent runs, skipping the generation step.
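How chunk_size and cache_dir interact can be sketched as follows. This is a sequential, base R approximation under stated assumptions: generate_samples(), toy_task(), and the cache file name samples.rds are hypothetical, and the real method may name or structure its cache differently. A parallel backend such as future.apply::future_lapply() could replace the outer lapply() to honour num_workers.

```r
# Hypothetical helper: chunked sample generation with an .rds cache.
generate_samples <- function(patient_ids, sample_fn, chunk_size = 1000L,
                             cache_dir = NULL) {
  cache_file <- if (!is.null(cache_dir)) file.path(cache_dir, "samples.rds") else NULL

  # Reload cached samples and skip generation entirely.
  if (!is.null(cache_file) && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }

  # Split patient IDs into consecutive chunks of at most `chunk_size`.
  chunk_index <- ceiling(seq_along(patient_ids) / chunk_size)
  chunks <- split(patient_ids, chunk_index)

  # Process each chunk; every patient yields one sample here.
  per_chunk <- lapply(chunks, function(ids) lapply(ids, sample_fn))
  samples <- unlist(per_chunk, recursive = FALSE, use.names = FALSE)
  if (is.null(samples)) samples <- list()

  if (!is.null(cache_file)) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    saveRDS(samples, cache_file)
  }
  samples
}

# Toy task that counts how often it is invoked.
calls <- 0
toy_task <- function(id) {
  calls <<- calls + 1
  list(patient_id = id, label = nchar(id) %% 2)
}

cache_dir <- tempfile("sample_cache_")
ids <- c("p1", "p2", "p3")
first <- generate_samples(ids, toy_task, chunk_size = 2L, cache_dir = cache_dir)
second <- generate_samples(ids, toy_task, chunk_size = 2L, cache_dir = cache_dir)
```

The second call returns the cached list without invoking toy_task() again, which is the behaviour the cache_dir argument describes.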