FormatDataForCRNSDataHub - finalise the formatter
Description
The FormatDataForCRNSDataHub formatter is not currently complete. We need to finalise it so that it converts the dataframe into the agreed-upon design.
The most important aspects are:
- DateTime Index
- Correct DataTypes in columns
- Correct column names (see data_management.data_validation_tables.py)
- If there are multiple instruments for key variables (e.g., pressure), convert these to a single column
- Missing values given as strings or as -999 converted to np.nan
- more?
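As a rough sketch of the first few points (DateTime index, numeric dtypes, missing values as np.nan), something like the following could work. The function name `clean_raw_frame`, the `timestamp` and `N1_counts` column names, and the choice of float64 for every data column are assumptions for illustration only, not the agreed design.

```python
import numpy as np
import pandas as pd


def clean_raw_frame(df: pd.DataFrame, datetime_column: str) -> pd.DataFrame:
    """Return a copy with a DateTime index, np.nan for missing values, and float columns."""
    out = df.copy()
    # Parse the timestamp column and use it as a sorted index.
    out[datetime_column] = pd.to_datetime(out[datetime_column])
    out = out.set_index(datetime_column).sort_index()
    # Coerce every remaining column to numeric; unparseable strings become NaN.
    out = out.apply(pd.to_numeric, errors="coerce")
    # Treat the -999 sentinel as missing.
    out = out.replace(-999, np.nan)
    # Assumption: all data columns should end up as float64.
    return out.astype("float64")


# Hypothetical raw data as it might come out of the parser (all strings).
raw = pd.DataFrame(
    {
        "timestamp": ["2023-01-01 01:00", "2023-01-01 00:00"],
        "P1_mb": ["1001.2", "-999"],
        "N1_counts": ["nan", "512"],
    }
)
clean = clean_raw_frame(raw, "timestamp")
```

Coercing with `errors="coerce"` silently drops anything that is not a number, so the real formatter may want to log or count the values it discards.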
Expected Behavior
Complete methods to take the parsed raw data and output a fully formatted data frame ready for the data hub
Current Behavior
Not feature complete
Proposed Solution
There are a few things to consider. If we infer column names, how do we know which one is which when converting the names? This suggests we should use the infer column names method sparingly. It would be better for the user to supply the information in some way. This info could eventually be added to a config file for automated processing. We could make a dataclass which a user completes with their own column names, for example:
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawDataInfo:
    # The first pressure and neutron columns are required.
    pressure_column_name_1: str
    neutron_column_name_1: str
    # Additional instruments are optional.
    pressure_column_name_2: Optional[str] = None
    pressure_column_name_3: Optional[str] = None
    pressure_column_name_4: Optional[str] = None
    neutron_column_name_2: Optional[str] = None
    # ....
A user then completes this:
raw_data_info = RawDataInfo(
    pressure_column_name_1="P1_mb",
    pressure_column_name_2="P3_mb",
    # ... remaining fields, including the required neutron_column_name_1
)
This could be injected into FormatDataForCRNSDataHub or built automatically from a config file.
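To illustrate the config-file route, here is a minimal sketch that builds a RawDataInfo from a plain dict (as might be loaded from a YAML or JSON file) and derives a rename mapping from it. The `build_rename_map` helper, the `N1_counts` column name, and the target names such as `pressure_1` are placeholders; the real standard names would come from data_management.data_validation_tables.py.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class RawDataInfo:
    # Trimmed version of the dataclass proposed above.
    pressure_column_name_1: str
    neutron_column_name_1: str
    pressure_column_name_2: Optional[str] = None


def build_rename_map(info: RawDataInfo) -> Dict[str, str]:
    """Map each user-supplied raw column name to a placeholder standard name."""
    rename_map = {}
    for field_name, raw_name in asdict(info).items():
        if raw_name is None:
            continue  # the user has no such instrument
        # "pressure_column_name_2" -> "pressure_2" (placeholder standard name)
        rename_map[raw_name] = field_name.replace("_column_name", "")
    return rename_map


# Hypothetical config content, e.g. as loaded from a YAML or JSON file.
config = {"pressure_column_name_1": "P1_mb", "neutron_column_name_1": "N1_counts"}
raw_data_info = RawDataInfo(**config)
rename_map = build_rename_map(raw_data_info)
```

The resulting mapping could then be passed to `DataFrame.rename(columns=rename_map)` inside the formatter, which keeps the user-facing names in one place.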
We also need to decide how to reduce multiple columns to one (average? priority order with gap filling when a value is missing?)
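The two options mentioned here could look roughly like this in pandas. Both function names are hypothetical, and neither is meant as the chosen approach; the sketch only shows how the behaviours differ when one sensor has gaps.

```python
from typing import List

import numpy as np
import pandas as pd


def reduce_by_mean(df: pd.DataFrame, columns: List[str]) -> pd.Series:
    """Row-wise mean of the available sensors; NaN only if all are missing."""
    return df[columns].mean(axis=1, skipna=True)


def reduce_by_priority(df: pd.DataFrame, columns: List[str]) -> pd.Series:
    """Use the first column, filling its gaps from later columns in order."""
    result = df[columns[0]].copy()
    for name in columns[1:]:
        result = result.fillna(df[name])
    return result


frame = pd.DataFrame(
    {
        "P1_mb": [1000.0, np.nan, np.nan],
        "P3_mb": [1002.0, 1004.0, np.nan],
    }
)
mean_pressure = reduce_by_mean(frame, ["P1_mb", "P3_mb"])
priority_pressure = reduce_by_priority(frame, ["P1_mb", "P3_mb"])
```

Averaging smooths over differences between sensors, whereas the priority approach keeps a single sensor's record wherever it exists, which may matter if the instruments are not cross-calibrated. The same functions would apply unchanged to relative humidity and temperature columns.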
Acceptance Criteria
- A method to ensure columns are correctly named
- Methods to take multiple pressure columns and give a single pressure column
- Methods to take multiple relative humidity columns and give a single relative humidity column
- Methods to take multiple temperature columns and give a single temperature column