Page Last Updated: November 28, 2025

Tabulated Data

See Age Variable Definitions for documentation on fields reporting age in tabulated instrument data.

Tabulated data are participant-level summaries of HBCD Study instruments (behavior, biology, and environment), Demographics, and select file-based data. Files are stored under rawdata/phenotype/:

hbcd/
|__ rawdata/ 
    |__ phenotype/ 
        |__ sed_basic_demographics.*        # Basic Demographics
        |__ par_visit_data.*                # Visit Information
        |__ bio_biosample_<nails|urine>.*   # Toxicology
        |__ {instrument_name}.*               # Instrument Data

Key features of tabulated data include:

  • Table Organization: tables are organized following the BIDS standard so that data from different sources can be linked together by participant ID and visit number
  • File Types: tables are available in both plain text (.tsv) and Parquet (.parquet) format, with accompanying metadata that explains the contents of each table

Table Organization

Following the BIDS standard, each table includes identifier columns for participant ID, visit number, and run number (when applicable) that allow you to link information between tables:

| Column Name | Definition | Example |
|---|---|---|
| participant_id | Unique identifier for a participant | sub-0123456789 |
| session_id | Unique identifier for session/visit number | ses-V01 |
| run_id | Unique identifier for run number; only present in tables derived from file-based data with multiple runs (e.g., MRI acquisitions) | 1 |
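
As a minimal sketch of linking two tables on these identifiers, the pandas example below joins two small in-memory tables on participant_id and session_id. The non-identifier columns (example_demo_var, example_visit_var) and all values are hypothetical placeholders, not actual HBCD variables.

```python
import pandas as pd

# Identifier columns shared across tabulated data files
KEYS = ["participant_id", "session_id"]

# Hypothetical stand-ins for two tables loaded from rawdata/phenotype/
demographics = pd.DataFrame({
    "participant_id": ["sub-0000000001", "sub-0000000002"],
    "session_id": ["ses-V01", "ses-V01"],
    "example_demo_var": ["a", "b"],
})
visits = pd.DataFrame({
    "participant_id": ["sub-0000000001", "sub-0000000002", "sub-0000000002"],
    "session_id": ["ses-V01", "ses-V01", "ses-V02"],
    "example_visit_var": [10, 20, 30],
})

# Keep only participant/visit combinations present in both tables;
# validate= raises an error if the keys are duplicated on either side
linked = demographics.merge(visits, on=KEYS, how="inner", validate="one_to_one")
print(linked)
```

Tables that include run_id can hold several rows per participant and session, so a join involving them would likely need run_id as an additional key or an aggregation step first.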

File Types

Tabulated data are available in two formats: plain text files (.tsv/.csv) and Parquet (.parquet). Each data table also comes with a shadow matrix file (<instrument_name>_shadow.<tsv|parquet>), which has the same structure as the corresponding data table but contains codes explaining why values are missing. Both are described in detail below.

Plain Text vs. Parquet Files

Tabulated data are provided in multiple formats to support a range of tools and user preferences. Plain text files (.tsv/.csv) are widely compatible and easy to open and inspect in Excel or text editors. Their metadata (including column types, variable labels, categorical coding, etc.) is stored in separate .json files accompanying each plain text file. Apache Parquet, or simply Parquet (.parquet), is a modern, compressed columnar format optimized for analysis and large-scale data. Unlike plain text files, Parquet files embed their metadata directly, ensuring correct data types and enabling efficient loading and analysis in Python or R.

Which format should I use?

| Format | When to use | Advantages | Limitations |
|---|---|---|---|
| TSV/CSV | Quick inspection, spreadsheet use | Easy to open; widely compatible format | Large files load slowly; separate metadata (see Caution below); selective column loading not supported |
| Parquet | Analysis in Python/R for large data | Optimized for large-scale data; fast loading and smaller files; metadata embedded; ensures correctly specified data types; supports selective column loading (saves memory) | Not easily viewable in Excel; not currently supported by BIDS |

Caution: Using Plain Text Files for Analysis

Plain text formats like TSV/CSV can cause problems in large-scale analyses because their metadata is stored separately (in sidecar JSON files), so Python, R, or other tools may make mistakes when importing the data. For example:

  • Tools may misinterpret data types, e.g., 0/1 used for “Yes/No” may be read as numeric instead of categorical.
  • Columns with mostly missing values may be treated as empty if the first few rows contain no data.

We therefore recommend using Parquet files for analysis, since their embedded metadata avoids these issues. If you do choose to use TSV/CSV files for analysis, be sure to manually define column types during import using the sidecar JSON metadata files, and specify n/a as the placeholder for missing values when reading in the .tsv files (HBCD uses this placeholder as recommended by the BIDS specification). We recommend using NBDCtools to automate this process; see the documentation for the function read_dsv_formatted().
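
For Python users, a manual import along these lines might look like the sketch below. It assumes the sidecar JSON lists the codes of categorical columns under a `Levels` key, as in the BIDS convention for tabular sidecars; the actual layout of the HBCD sidecar files may differ. The helper name read_tsv_with_sidecar, the column example_yes_no, and all values are hypothetical.

```python
import io
import json

import pandas as pd

def read_tsv_with_sidecar(tsv_source, sidecar):
    """Read a TSV as text, then cast columns that list `Levels` to categorical."""
    df = pd.read_csv(
        tsv_source,
        sep="\t",
        dtype=str,               # do not let pandas guess column types
        na_values=["n/a"],       # BIDS placeholder for missing values
        keep_default_na=False,   # treat only the marker listed above as missing
    )
    for col, meta in sidecar.items():
        # Assumption: categorical columns carry a "Levels" mapping in the sidecar
        if col in df.columns and "Levels" in meta:
            df[col] = pd.Categorical(df[col], categories=list(meta["Levels"]))
    return df

# Hypothetical table and sidecar contents, for illustration only
tsv_text = (
    "participant_id\tsession_id\texample_yes_no\n"
    "sub-0000000001\tses-V01\t1\n"
    "sub-0000000002\tses-V01\tn/a\n"
    "sub-0000000003\tses-V01\t0\n"
)
sidecar = json.loads('{"example_yes_no": {"Levels": {"0": "No", "1": "Yes"}}}')

df = read_tsv_with_sidecar(io.StringIO(tsv_text), sidecar)
print(df.dtypes)
```

Reading every column as text first keeps pandas from inferring types on its own. Numeric columns would then need an explicit cast (for example with pd.to_numeric) based on whatever type information the sidecar provides.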

Working with Parquet in Python and R

Loading Parquet Files

Loading parquet files in Python (polars or pandas module):


    # Using `polars` module [RECOMMENDED]:
    import polars as pl
    parquet_df = pl.read_parquet("path/to/file.parquet")

    # Using `pandas` module:
    import pandas as pd
    parquet_df = pd.read_parquet("path/to/file.parquet")
  
Loading Parquet files in R (arrow package):

    # Using `arrow` package:
    library(arrow)
    parquet_df <- read_parquet("path/to/file.parquet")
  

Shadow Matrices for Missing Data

Each TSV or Parquet file in rawdata/phenotype/ has a corresponding shadow matrix file in the same format that records the reason for each missing value (e.g., Don't know, Decline to Answer, Logic Skipped, etc.) in the phenotype data.

How They Work

In the data files, categorical codes for non-responses such as “Don’t know” (999) and “Decline to answer” (777) are deliberately converted to blank cells. Each original non-response is instead recorded as a missingness reason in the shadow matrix, which mirrors the structure and column names of the original data file (i.e., each cell corresponds to the same cell in the associated data file):

  • If a data cell contains a value: the shadow matrix cell is blank.
  • If a data cell is missing: the shadow matrix cell records the reason (e.g., “Don’t know”).
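
The relationship between the two files can be illustrated with a small pandas sketch. This is not the HBCD processing pipeline, only a toy reconstruction of the idea under stated assumptions: the helper split_into_data_and_shadow and the column example_item are hypothetical, and only the two codes mentioned above are handled.

```python
import pandas as pd

# Non-response codes taken from the description above
REASONS = {777: "Decline to answer", 999: "Don't know"}

def split_into_data_and_shadow(raw, value_cols):
    """Blank out non-response codes in the data and record the reason in a shadow table."""
    data = raw.copy()
    shadow = raw.copy()
    for col in value_cols:
        is_nonresponse = raw[col].isin(list(REASONS))
        # Data file: non-response codes become missing values
        data[col] = raw[col].mask(is_nonresponse)
        # Shadow matrix: the reason where the data is missing, blank otherwise
        shadow[col] = raw[col].map(REASONS)
    return data, shadow

# Hypothetical raw responses, for illustration only
raw = pd.DataFrame({
    "participant_id": ["sub-0000000001", "sub-0000000002",
                       "sub-0000000003", "sub-0000000004"],
    "session_id": ["ses-V01"] * 4,
    "example_item": [1, 999, 0, 777],
})

data, shadow = split_into_data_and_shadow(raw, ["example_item"])
print(data)
print(shadow)
```

Identifier columns are carried over unchanged in both tables, so every shadow cell lines up with the data cell in the same row and column.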

For example, compare the highlighted cells in the data file (left) vs. the corresponding shadow matrix (right) below:

Why Shadow Matrices Are Useful

Shadow matrices make analyses cleaner and more reliable by:

  • Preventing analytical errors, e.g., misinterpreting placeholder codes (like 777 or 999) as valid numbers.
  • Maintaining consistent data types across entries (e.g., avoids mixing text notes into numeric fields).
  • Preserving non-response information without cluttering the main dataset.

Working with Shadow Matrices in Python and R

While the approach of storing missingness reasons in a shadow matrix file supports cleaner analyses, there are situations where non-responses are themselves meaningful. For example, a researcher might be interested in how often participants do not understand a given question and how this relates to other variables. To understand patterns of missing data, users can re-integrate the non-responses from the shadow matrix back into the data using the following helper functions (click to expand):

Python

    import os

    import pandas as pd

    # Identifier columns shared by the data file and its shadow matrix
    ID_COLUMNS = {"participant_id", "session_id", "run_id"}

    def load_data_with_shadow(data_path, shadow_path):
        """
        Loads a data file (CSV or TSV) and its corresponding shadow matrix
        (CSV or TSV) and adds '_missing_reason' columns for missing values.
        Assumes both files contain the same rows in the same order.
        """

        # Detect delimiter from file extension
        def get_delimiter(path):
            ext = os.path.splitext(path)[1].lower()
            return "\t" if ext == ".tsv" else ","

        # Load both files, treating the BIDS placeholder n/a as missing
        data = pd.read_csv(data_path, sep=get_delimiter(data_path), na_values=["n/a"])
        shadow = pd.read_csv(shadow_path, sep=get_delimiter(shadow_path), na_values=["n/a"])

        # Annotate data with the non-empty missingness reason columns from the
        # shadow matrix, skipping the identifier columns
        for col in list(data.columns):
            if col in ID_COLUMNS or col not in shadow.columns:
                continue
            if shadow[col].notna().any():
                data[f"{col}_missing_reason"] = shadow[col]

        return data

    # Example usage:
    df = load_data_with_shadow("data.tsv", "shadow_matrix.tsv")

    # Example: view reasons for missing data for a given column/variable in the data file
    df.loc[df["<COLUMN NAME>"].isna(), ["<COLUMN NAME>_missing_reason"]]

R (using NBDCtools)

    library(dplyr)
    library(NBDCtools)

    # read in data and shadow matrix
    data <- arrow::read_parquet("path/to/data/<table_name>.parquet")
    shadow <- arrow::read_parquet("path/to/data/<table_name>_shadow.parquet")

    # bind shadow columns to data
    data_shadow <- shadow_bind_data(data, shadow)

    # show the reasons for missing values for a given variable
    # (assumes the bound shadow column is named <column_name>_shadow;
    # check names(data_shadow) for the actual column name)
    data_shadow |>
      filter(is.na(<column_name>)) |>
      count(<column_name>_shadow)
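
A comparable summary can be sketched in Python with pandas for readers who are not using NBDCtools. The helper count_missing_reasons and the column example_item are hypothetical. The sketch assumes the data table and shadow matrix share participant_id and session_id and contain one row per participant and session.

```python
import pandas as pd

ID_COLS = ["participant_id", "session_id"]

def count_missing_reasons(data, shadow, column):
    """Count the shadow-matrix reasons for rows where `column` is missing in the data."""
    merged = data[ID_COLS + [column]].merge(
        shadow[ID_COLS + [column]],
        on=ID_COLS,
        suffixes=("", "_missing_reason"),
        validate="one_to_one",  # fail loudly if identifiers are duplicated
    )
    missing = merged[merged[column].isna()]
    return missing[f"{column}_missing_reason"].value_counts()

# Hypothetical data table and shadow matrix, for illustration only
ids = {
    "participant_id": ["sub-0000000001", "sub-0000000002",
                       "sub-0000000003", "sub-0000000004"],
    "session_id": ["ses-V01"] * 4,
}
data = pd.DataFrame({**ids, "example_item": [1.0, None, None, None]})
shadow = pd.DataFrame({
    **ids,
    "example_item": [None, "Don't know", "Decline to answer", "Don't know"],
})

counts = count_missing_reasons(data, shadow, "example_item")
print(counts)
```

Joining on the identifier columns, rather than relying on row order, keeps the reasons attached to the correct participant and visit even if the two tables are sorted differently.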