Page Last Updated: May 17, 2026
Tabulated Data
Tabulated data are participant-level summaries of HBCD Study instruments (behavior, biology, and environment), Demographics, and select file-based data. Files are stored under rawdata/phenotype/:
hbcd/
└── rawdata/
    └── phenotype/
        ├── sed_basic_demographics.*        # Basic Demographics
        ├── par_visit_data.*                # Visit Level Data
        ├── bio_biosample_{nails|urine}.*   # Toxicology
        └── {instrument_name}.*             # Instrument Data
Key features of tabulated data include:
- Table Organization: tables are organized following the BIDS standard so that data from different sources can be linked together by participant ID and visit number
- File Types: tables are available in both plain text (.tsv) and Parquet (.parquet) format, with accompanying metadata that explains the contents of each table
Table Organization
Following the BIDS standard, each table includes unique identifier columns for the following items that allow you to link information between tables:
- Participant ID (participant_id)
- Session/visit number (session_id)
- Run number (run_id) - only as applicable, e.g., for MRI where multiple runs are acquired
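As a minimal sketch of how these identifier columns can be used, the pandas example below joins two small made-up tables on `participant_id` and `session_id`. The table contents, the variable names `child_age` and `visit_complete`, and the ID values are hypothetical placeholders, not actual HBCD data.

```python
import pandas as pd

# Hypothetical stand-ins for two tabulated files (all values are made up)
demographics = pd.DataFrame({
    "participant_id": ["sub-0001", "sub-0002"],
    "session_id": ["ses-V01", "ses-V01"],
    "child_age": [1.0, 2.0],
})
visits = pd.DataFrame({
    "participant_id": ["sub-0001", "sub-0001", "sub-0002"],
    "session_id": ["ses-V01", "ses-V02", "ses-V01"],
    "visit_complete": [1, 1, 0],
})

# Link the two tables on the shared identifier columns
id_cols = ["participant_id", "session_id"]
linked = visits.merge(demographics, on=id_cols, how="left")
print(linked)
```

A left join keeps every row of the first table. Visits with no matching row in the second table get missing values in the joined columns.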
Study Design Logic: Child-Centric Data Structure
The HBCD Study organizes data around the Child ID as the primary key, meaning each caregiver and child share the same participant ID, with all caregiver-reported data nested under the corresponding Child ID. This structure supports longitudinal analyses by enabling straightforward tracking of each child's data over time without needing to remap caregiver information. It also simplifies multi-birth cases: when a caregiver reports on multiple children, each child is assigned a unique record, so each child's data remains distinct (avoiding complex joins or disambiguation).
File Types
Tabulated data are available in two formats, plain text files (.tsv/.csv) and Parquet (.parquet) - see details below. Each data table also comes with a shadow matrix file (<instrument_name>_shadow.<tsv|parquet>), which has the same structure as the corresponding data table, but contains codes explaining why values are missing - see details below.
Plain Text vs. Parquet Files
Tabulated data are provided in multiple formats to support a range of tools and user preferences. Plain text files (.tsv/.csv) are widely compatible and easy to open/inspect in Excel or text editors. Their metadata (including column types, variable labels, categorical coding, etc.) is stored in accompanying .json files. Apache Parquet, or simply Parquet (.parquet), is a modern, compressed columnar format optimized for analysis and large-scale data. Unlike plain text files, Parquet files embed metadata directly, ensuring correct data types and enabling efficient loading and analysis in Python or R.
| Format | Advantages | Limitations |
|---|---|---|
| TSV/CSV (quick inspection/spreadsheets) | Easy to open; widely compatible format | Large files load slowly; separate metadata (see Caution below); selective column loading not supported |
| Parquet (large data analysis in Python/R) | Optimized for large-scale data; fast loading and smaller files; metadata embedded; ensures correctly specified data types; supports selective column loading (saves memory) | Not easily viewable in Excel; not currently supported by BIDS |
CAUTION: Using Plain Text Files for Analysis
For large data, plain text formats (TSV/CSV) can cause import issues (in Python, R, etc.) because the metadata is stored separately. We therefore recommend using Parquet files for analysis whenever possible to avoid these issues. Parquet files embed metadata directly, ensuring correct data types and handling of missing values. Common issues with plain text files include:
- Misinterpretation of data types, e.g., 0/1 used for "Yes/No" may be read as numeric instead of categorical
- Mishandling missing values (columns with mostly missing values may be treated as empty)
If you do use CSV/TSV files for analysis: be sure to (1) manually define column types during import using the sidecar JSON metadata files and (2) replace blank values with n/a (missing values are blank in HBCD data following BIDS specification). We recommend using NBDCtools to automate these processes (e.g. read_dsv_formatted()).
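For readers who want to see what manually defining column types involves, here is a minimal pandas sketch. It assumes a sidecar whose entries list categorical codes under a `Levels` key (the BIDS convention). The actual layout of the HBCD sidecar files may differ, and the column names `answer` and `score` are hypothetical. In practice the sidecar would be loaded from the accompanying .json file rather than defined inline.

```python
import io
import pandas as pd

# Hypothetical TSV content with blank cells for missing values (made up)
tsv_text = (
    "participant_id\tsession_id\tanswer\tscore\n"
    "sub-0001\tses-V01\t1\t3.5\n"
    "sub-0002\tses-V01\t\t\n"
    "sub-0003\tses-V01\t0\t4.0\n"
)
# Hypothetical sidecar metadata; assumes categorical columns carry "Levels"
sidecar = {
    "answer": {"Levels": {"0": "No", "1": "Yes"}},
    "score": {"Description": "A numeric score"},
}

# Read every column as text so no type is guessed; blank cells become missing
raw = pd.read_csv(io.StringIO(tsv_text), sep="\t", dtype=str, na_values=["", "n/a"])

for col, meta in sidecar.items():
    if col not in raw.columns:
        continue
    if "Levels" in meta:
        # Keep the codes as categories instead of letting them become numbers
        raw[col] = pd.Categorical(raw[col], categories=list(meta["Levels"]))
    else:
        # In this toy sidecar, every other listed column is numeric
        raw[col] = pd.to_numeric(raw[col])

print(raw.dtypes)
```

Reading everything as text first and converting column by column avoids the silent type guessing described above. Columns not listed in the sidecar (here, the identifiers) are left as text.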
Working with Parquet in Python and R
Loading Parquet files in Python (polars or pandas package):
# Using `polars` module [RECOMMENDED]:
import polars as pl
parquet_df = pl.read_parquet("path/to/file.parquet")
# Using `pandas` module:
import pandas as pd
parquet_df = pd.read_parquet("path/to/file.parquet")
Loading Parquet files in R (arrow package):
# Using `arrow` package:
library(arrow)
parquet_df <- read_parquet("path/to/file.parquet")
Shadow Matrices for Missing Data
Each TSV or Parquet file in /rawdata/phenotype/ has a corresponding shadow matrix file in the same format that records the reason for missing values (e.g., Don't Know, Decline to Answer, Logic Skipped, etc.) in the phenotype data.
Possible Values Across Instruments
The following are standard possible values for missingness reason found in the shadow matrices across instruments.
- Decline to Answer (e.g., participant declined to answer a question)
- Don't Know (e.g., participant did not know the answer)
- Missed Visit (e.g., participant did not attend a visit)
- Missed Instrument (e.g., participant did not complete assessment)
- Logic Skipped (e.g., question skipped due to branching logic)
- Unknown Missing (e.g., reason for missing value unknown)
Note that when an instrument was not administered, blank entries are indicated in the shadow matrix as 'Unknown Missing' (and fields skipped due to branching logic as 'Logic Skipped'). There is also an 'Administration' field for all instruments that indicates whether an instrument was administered for a given participant/visit.
Special Cases
The following domains/instruments have additional unique shadow matrix values used where applicable:
| Table(s) | Unique Shadow Matrix Values [+Variable Name If Specific] |
|---|---|
| BioSpecimens (All) |
|
| Basic Demographics |
|
| Visit Level Data |
|
How They Work
In the data files, categorical codes for non-responses such as "Don't know" (999) and "Decline to answer" (777) are deliberately converted to blank cells. The original responses are converted to a missingness reason stored in the shadow matrix, which mirrors the structure and column names of the original data file (i.e., each cell corresponds to the same cell in the associated data file):
- If a data cell contains a value: the shadow matrix cell is blank.
- If a data cell is missing: the shadow matrix cell records the reason (e.g., "Don't know").
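The two rules above can be sketched with a pair of small made-up tables. The column name `item_1` and all values are hypothetical. The sketch also assumes the shadow matrix repeats the identifier columns of the data file.

```python
import pandas as pd

# Hypothetical data table: non-responses are already blank (missing)
data = pd.DataFrame({
    "participant_id": ["sub-0001", "sub-0002", "sub-0003"],
    "session_id": ["ses-V01", "ses-V01", "ses-V01"],
    "item_1": [2.0, None, None],
})
# Matching shadow matrix: same shape and column names,
# with a reason only where the data cell is missing
shadow = pd.DataFrame({
    "participant_id": ["sub-0001", "sub-0002", "sub-0003"],
    "session_id": ["ses-V01", "ses-V01", "ses-V01"],
    "item_1": [None, "Don't Know", "Decline to Answer"],
})

# For each cell, exactly one of the two files should carry information
has_value = data["item_1"].notna()
has_reason = shadow["item_1"].notna()
cell_is_consistent = has_value ^ has_reason
print(cell_is_consistent.all())
```

The same check can be run on any data column and its shadow column, as long as both files have the same row order.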
For example, compare the highlighted cells in the data file (left) vs. the corresponding shadow matrix (right) below:

Why Shadow Matrices Are Useful
Shadow matrices make analyses cleaner and more reliable by:
- Preventing analytical errors, e.g., misinterpreting placeholder codes (like 777 or 999) as valid numbers.
- Maintaining consistent data types across entries (e.g., avoids mixing text notes into numeric fields).
- Preserving non-response information without cluttering the main dataset.
Working with Shadow Matrices in Python and R
While storing missingness reasons in a separate shadow matrix file supports cleaner analyses, there are situations where non-responses are themselves meaningful. For example, a researcher might be interested in how often participants do not understand a given question and how this relates to other variables. To understand patterns of missing data, users can re-integrate the non-responses from the shadow matrix back into the data using the following helper functions:
# Example 1: Load a CSV/TSV file and its shadow matrix, adding '<column>_missing_reason' columns for missing values.
import os
import pandas as pd

def load_data_with_shadow(data_path, shadow_path):
    # Detect the delimiter from the file extension
    def get_delimiter(path):
        ext = os.path.splitext(path)[1].lower()
        return "\t" if ext == ".tsv" else ","

    data = pd.read_csv(data_path, delimiter=get_delimiter(data_path))
    shadow = pd.read_csv(shadow_path, delimiter=get_delimiter(shadow_path))

    # Add a missingness reason column for each data column whose shadow column
    # is not entirely empty. Identifier columns are skipped by name rather
    # than by position, so the result does not depend on column order.
    id_cols = {"participant_id", "session_id", "run_id"}
    for col in list(data.columns):
        if col in id_cols or col not in shadow.columns:
            continue
        if not shadow[col].isna().all() and not (shadow[col] == "").all():
            data[f"{col}_missing_reason"] = shadow[col]
    return data

# Example usage:
df = load_data_with_shadow("data.tsv", "shadow_matrix.tsv")
# Example: View reasons for missing data for a given column/variable in the data file
df[df["<COLUMN NAME>"].isna()][["<COLUMN NAME>_missing_reason"]]
# Example 2: Using the NBDCtools Python package
# An R backend with `NBDCtools` installed is required to run this code
from NBDCtools import create_dataset

create_dataset(
    dir_data="path/to/data",
    study="hbcd",
    vars=["var1", "var2", "var3"],
    tables=["table1", "table2"],
    bind_shadow=True
)
# Using the `NBDCtools` package in R:
library(NBDCtools)
create_dataset(
  dir_data = "path/to/data",
  study = "hbcd",
  vars = c("var1", "var2", "var3"),
  tables = c("table1", "table2"),
  bind_shadow = TRUE
)
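To count how often each type of non-response occurs, it may be enough to tabulate the shadow matrix directly, without merging it back into the data. The following is a minimal pandas sketch on a made-up shadow matrix. The column names `item_1` and `item_2` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical shadow matrix (blank cells are missing values)
shadow = pd.DataFrame({
    "participant_id": ["sub-0001", "sub-0002", "sub-0003"],
    "session_id": ["ses-V01", "ses-V01", "ses-V01"],
    "item_1": [None, "Don't Know", "Don't Know"],
    "item_2": ["Logic Skipped", None, "Decline to Answer"],
})

id_cols = ["participant_id", "session_id"]
# Reshape to one row per (participant, session, variable) and drop blank cells
long_form = shadow.melt(
    id_vars=id_cols, var_name="variable", value_name="reason"
).dropna(subset=["reason"])

# Count how often each missingness reason occurs for each variable
reason_counts = (
    long_form.groupby(["variable", "reason"]).size().unstack(fill_value=0)
)
print(reason_counts)
```

The resulting table has one row per variable and one column per missingness reason, which makes it easy to spot items with unusually many "Don't Know" responses.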