Page Last Updated: May 17, 2026

Tabulated Data

See Age Variable Definitions for documentation on fields reporting age in tabulated instrument data.

Tabulated data are participant-level summaries of HBCD Study instruments (behavior, biology, and environment), Demographics, and select file-based data. Files are stored under rawdata/phenotype/:

hbcd/
└── rawdata/ 
    └── phenotype/ 
        ├── sed_basic_demographics.*        # Basic Demographics
        ├── par_visit_data.*                # Visit Level Data
        ├── bio_biosample_{nails|urine}.*   # Toxicology
        └── {instrument_name}.*             # Instrument Data

Key features of tabulated data include:

  • Table Organization: tables are organized following the BIDS standard so that data from different sources can be linked together by participant ID and visit number
  • File Types: tables are available in both plain text (.tsv) and Parquet (.parquet) format, with accompanying metadata that explains the contents of each table

Table Organization

Following the BIDS standard, each table includes unique identifier columns for the following items that allow you to link information between tables:

  • Participant ID (participant_id)
  • Session/visit number (session_id)
  • Run number (run_id) - only as applicable, e.g., for MRI where multiple runs are acquired
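
As a minimal sketch of linking two tables on these identifiers, the pandas example below merges two small in-memory tables on participant_id and session_id. The identifier values, the columns age_months and score, and all numbers are hypothetical placeholders, not actual HBCD variables.

```python
import pandas as pd

# Hypothetical stand-ins for two tabulated files; real tables would be
# loaded from rawdata/phenotype/ instead of being built by hand.
visits = pd.DataFrame({
    "participant_id": ["sub-001", "sub-001", "sub-002"],
    "session_id": ["ses-V01", "ses-V02", "ses-V01"],
    "age_months": [1.0, 4.5, 1.2],   # placeholder column
})
instrument = pd.DataFrame({
    "participant_id": ["sub-001", "sub-002"],
    "session_id": ["ses-V01", "ses-V01"],
    "score": [12, 9],                # placeholder column
})

# Link rows that share the same participant and visit. A left join keeps
# every visit row and leaves `score` missing where no instrument row exists.
keys = ["participant_id", "session_id"]
linked = visits.merge(instrument, on=keys, how="left", validate="one_to_one")
print(linked)
```

The validate="one_to_one" argument makes pandas raise an error if a participant/visit pair appears more than once in either table, which is a cheap guard against accidental row duplication. For tables that include run_id, that column would need to be part of the key as well.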

Study Design Logic: Child-Centric Data Structure

The HBCD Study organizes data around the Child ID as the primary key, meaning each caregiver and child share the same participant ID, with all caregiver-reported data nested under the corresponding Child ID. This structure supports longitudinal analyses by enabling straightforward tracking of each child’s data over time without needing to remap caregiver information. It also simplifies multi-birth cases: when a caregiver reports on multiple children, each child is assigned a unique record, so each child's data remains distinct (avoiding complex joins or disambiguation).

File Types

Tabulated data are available in two formats: plain text files (.tsv/.csv) and Parquet (.parquet). Each data table also comes with a shadow matrix file (<instrument_name>_shadow.<tsv|parquet>), which has the same structure as the corresponding data table but contains codes explaining why values are missing. Both are described in detail below.

Plain Text vs. Parquet Files

Tabulated data are provided in multiple formats to support a range of tools and user preferences. Plain text files (.tsv/.csv) are widely compatible and easy to open and inspect in Excel or text editors; their metadata (including column types, variable labels, categorical coding, etc.) is stored in accompanying .json files. Apache Parquet, or simply Parquet (.parquet), is a modern, compressed columnar format optimized for analysis and large-scale data. Unlike plain text files, Parquet files embed metadata directly, ensuring correct data types and enabling efficient loading and analysis in Python or R.

Which format should I use?

TSV/CSV (quick inspection/spreadsheets)
  Advantages:
  • Easy to open
  • Widely compatible format
  Limitations:
  • Large files load slowly
  • Separate metadata (see Caution below)
  • Selective column loading not supported

Parquet (large data analysis in Python/R)
  Advantages:
  • Optimized for large-scale data
  • Fast loading and smaller files
  • Metadata embedded
  • Ensures correctly specified data types
  • Supports selective column loading (saves memory)
  Limitations:
  • Not easily viewable in Excel
  • Not currently supported by BIDS

CAUTION: Using Plain Text Files for Analysis

For large data, plain text formats (TSV/CSV) can cause import issues (in Python, R, etc.) because their metadata is stored separately. We therefore recommend using Parquet files for analysis whenever possible: Parquet files embed metadata directly, ensuring correct data types and handling of missing values. Common issues with plain text files include:

  • Misinterpretation of data types, e.g., 0/1 used for “Yes/No” may be read as numeric instead of categorical
  • Mishandling missing values (columns with mostly missing values may be treated as empty)

If you do use CSV/TSV files for analysis, be sure to (1) manually define column types during import using the sidecar JSON metadata files and (2) replace blank values with n/a (missing values are blank in HBCD data following BIDS specification). We recommend using NBDCtools to automate these processes (e.g., read_dsv_formatted()).
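
To illustrate step (1) for a manual import, the sketch below reads a small inline TSV with pandas, once with default type inference and once with explicit column types. The column smoker, the identifier values, and the column_types mapping are all hypothetical; in practice the mapping would be derived from the sidecar JSON, whose exact layout is not shown here.

```python
import io
import pandas as pd

# Hypothetical TSV content: `smoker` uses 0/1 codes for No/Yes
# and has one blank (missing) cell.
tsv_text = (
    "participant_id\tsession_id\tsmoker\n"
    "sub-001\tses-V01\t0\n"
    "sub-002\tses-V01\t1\n"
    "sub-003\tses-V01\t\n"
)

# Default import: pandas infers a numeric type for the 0/1 codes.
inferred = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Explicit import: declare the coded column as categorical.
# This mapping is an assumption standing in for types read from the sidecar JSON.
column_types = {
    "participant_id": "string",
    "session_id": "string",
    "smoker": "category",
}
typed = pd.read_csv(io.StringIO(tsv_text), sep="\t", dtype=column_types)

print(inferred["smoker"].dtype)  # a float type, because of the blank cell
print(typed["smoker"].dtype)     # category
```

In both cases the blank cell is read as a missing value; only the explicit import keeps the 0/1 codes from being treated as numbers.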

Working with Parquet in Python and R

  Loading Parquet Files
Loading parquet files in Python (polars or pandas module):
  # Using `polars` module [RECOMMENDED]:
  import polars as pl
  parquet_df = pl.read_parquet("path/to/file.parquet")

  # Using `pandas` module:
  import pandas as pd
  parquet_df = pd.read_parquet("path/to/file.parquet")
Loading Parquet files in R (arrow package):
  # Using `arrow` package:
  library(arrow)
  parquet_df <- read_parquet("path/to/file.parquet")

Shadow Matrices for Missing Data

Each TSV or Parquet file in /rawdata/phenotype/ has a corresponding shadow matrix file in the same format that records the reason for each missing value (e.g., Don't Know, Decline to Answer, Logic Skipped) in the phenotype data.

Shadow Matrix Values for Missingness

Possible Values Across Instruments
The following are standard possible values for missingness reason found in the shadow matrices across instruments.

  • Decline to Answer (e.g., participant declined to answer a question)
  • Don't Know (e.g., participant did not know the answer)
  • Missed Visit (e.g., participant did not attend a visit)
  • Missed Instrument (e.g., participant did not complete assessment)
  • Logic Skipped (e.g., question skipped due to branching logic)
  • Unknown Missing (e.g., reason for missing value unknown)

Note that when an instrument was not administered, blank entries are marked 'Unknown Missing' in the shadow matrix (and fields skipped due to branching logic are marked 'Logic Skipped'). Every instrument also has an 'Administration' field that indicates whether the instrument was administered for a given participant/visit.

Special Cases
The following domains/instruments have additional unique shadow matrix values used where applicable:

Table(s) Unique Shadow Matrix Values [+Variable Name If Specific]
BioSpecimens (All)
  • "Please refer to corresponding categorical field for more details"
Basic Demographics
  • "Child's DOB not reported or available for participant" [{gestational|mother}_age_delivery]
  • "Missing Information From Ripple" [ACS-derived fields]
Visit Level Data
  • "Data not available for participants at this timepoint"
  • "No candidate age for V01" [candidate_age]
  • "Gestational Age at Administration is only at V01 and not calculated for V02 onwards" [gestational_age]

How They Work

In the data files, categorical codes for non-responses such as “Don’t know” (999) and “Decline to answer” (777) are deliberately converted to blank cells. Each original response is instead recorded as a missingness reason in the shadow matrix, which mirrors the structure and column names of the original data file (i.e., each cell corresponds to the same cell in the associated data file):

  • If a data cell contains a value: the shadow matrix cell is blank.
  • If a data cell is missing: the shadow matrix cell records the reason (e.g., “Don’t know”).

For example, compare the highlighted cells in the data file (left) vs. the corresponding shadow matrix (right) below:

Why Shadow Matrices Are Useful

Shadow matrices make analyses cleaner and more reliable by:

  • Preventing analytical errors, e.g., misinterpreting placeholder codes (like 777 or 999) as valid numbers.
  • Maintaining consistent data types across entries (e.g., avoids mixing text notes into numeric fields).
  • Preserving non-response information without cluttering the main dataset.
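
The first point can be illustrated with a small sketch. The values below are made up: one series keeps 999 as a placeholder code, while the other leaves the non-response blank, as in the HBCD data files.

```python
import numpy as np
import pandas as pd

# Made-up responses on a 1-5 scale; the third participant answered "Don't know".
with_codes = pd.Series([2, 4, 999, 3])      # placeholder code kept in the data
with_blanks = pd.Series([2, 4, np.nan, 3])  # non-response stored as a blank cell

# The placeholder code is silently treated as a real value and inflates the mean.
print(with_codes.mean())   # 252.0

# Blank cells are skipped by default, so the mean reflects actual responses only.
print(with_blanks.mean())  # 3.0
```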

Working with Shadow Matrices in Python and R

Storing missingness reasons in a separate shadow matrix file supports cleaner analyses, but there are situations where non-responses are themselves meaningful. For example, a researcher might be interested in how often participants do not understand a given question and how this relates to other variables. To examine patterns of missing data, users can re-integrate the non-responses from the shadow matrix back into the data using the following helper functions:

Python
# Example 1: Load a CSV/TSV file and its shadow matrix, then add
# '<column>_missing_reason' columns for variables that have missing values.
import os
import pandas as pd

# Identifier columns shared across tables (run_id only where applicable)
ID_COLUMNS = {"participant_id", "session_id", "run_id"}

def get_delimiter(path):
    # Detect the delimiter from the file extension
    ext = os.path.splitext(path)[1].lower()
    return "\t" if ext == ".tsv" else ","

def load_data_with_shadow(data_path, shadow_path):
    data = pd.read_csv(data_path, sep=get_delimiter(data_path))
    shadow = pd.read_csv(shadow_path, sep=get_delimiter(shadow_path))

    # The shadow matrix mirrors the data file cell by cell, so rows must line up
    if len(data) != len(shadow):
        raise ValueError("Data file and shadow matrix have different numbers of rows")

    # Annotate each data column that has at least one recorded missingness reason
    # (blank shadow cells are read as NaN; identifier columns are skipped)
    for col in list(data.columns):
        if col in ID_COLUMNS or col not in shadow.columns:
            continue
        if shadow[col].notna().any():
            data[f"{col}_missing_reason"] = shadow[col]
    return data

# Example usage:
df = load_data_with_shadow("data.tsv", "shadow_matrix.tsv")
# View the reasons for missing data for a given column/variable in the data file
df[df["<COLUMN NAME>"].isna()][["<COLUMN NAME>_missing_reason"]]

# Example 2: Using the NBDCtools Python package
# An R backend with `NBDCtools` installed is required to run this code
from NBDCtools import create_dataset
create_dataset(
    dir_data="path/to/data",
    study="hbcd",
    vars=["var1", "var2", "var3"],
    tables=["table1", "table2"],
    bind_shadow=True
)
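
As a follow-up sketch, missingness reasons can also be tabulated per variable directly from the shadow matrix, for instance to see how often a question drew a "Don't Know" response. The example below uses small in-memory tables with a hypothetical column item_1 and made-up identifiers; it assumes the shadow matrix rows are in the same order as the data rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data table and its shadow matrix (same shape and column names).
data = pd.DataFrame({
    "participant_id": ["sub-001", "sub-002", "sub-003", "sub-004"],
    "session_id": ["ses-V01"] * 4,
    "item_1": [3.0, np.nan, np.nan, np.nan],
})
shadow = pd.DataFrame({
    "participant_id": data["participant_id"],
    "session_id": data["session_id"],
    "item_1": [np.nan, "Don't Know", "Decline to Answer", "Don't Know"],
})

# Each cell should be filled in exactly one of the two tables:
# a value in the data file, or a reason in the shadow matrix.
assert (data["item_1"].isna() == shadow["item_1"].notna()).all()

# Count how often each missingness reason occurs for this variable.
reason_counts = shadow["item_1"].value_counts()
print(reason_counts)
```

The consistency check before counting is optional, but it catches misaligned or mismatched files early, before any reasons are attributed to the wrong cells.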
R (using NBDCtools)
library(NBDCtools)
create_dataset(
  dir_data = "path/to/data",
  study = "hbcd",
  vars = c("var1", "var2", "var3"),
  tables = c("table1", "table2"),
  bind_shadow = TRUE
)