I began wondering about the rate of increase of OxyContin overdoses in the United States after watching a documentary on the Sackler family. I found wonder.cdc.gov difficult and unintuitive to use, so I decided I needed to organize the data and put it in a database in order to interact with it more easily. Text file exports of the data were available to download and came with some user guides. It should be easy enough to ETL these text files into a relational database.

Deciphering the text file format

After downloading the Mortality Multiple Cause zip file for 2017 1 and extracting the contents, looking at the first ten lines gives us an idea of what the data looks like.

$ head VS17MORT.DUSMCPUB
                   1                                          4101  M1071 402009  1M2                2017U7CN                                    G20 188 051   37 0111G20                                                                                                                                                                           01 G20                                                                                                  01  11                                 100 601
                   1                                          1101  M1074 402009  1D4                2017U7CN                                    M726315 111   37 0211A419 21M726                                                                                                                                                                   02 M726 A419                                                                                            01  11                                 100 601
                   1                                          4101  M1059 371708  4M3                2017U7CN                                    C159077 021   15 0211C159 61F179                                                                                                                                                                   02 C159 F179                                                                                            01  11                                 100 601
                   1                                          1101  M1090 442411  4M3                2017U7BN                                    C61 113 033   11 0111C61                                                                                                                                                                           01 C61                                                                                                  03  23                                 100 803
                   1                                          5101  M1066 391909  4M3                2017U7CN                                    J449267 086   28 0311J449 61C900 62F179                                                                                                                                                            03 J449 C900 F179                                                                                       01  11                                 100 601
                   1                                          3101  F1090 442411  4W2                2017U7CN                                    F011175 111   37 0211F019 21I694                                                                                                                                                                   03 F011 F019 I694                                                                                       01  11                                 100 601
                   1                                          6101  M1045 351507  4M2                2017U7CN                                  99X599423 123   39 0711R092 12J969 21J988 22J189 31I64  41S069 61X599                                                                                                                                07 X599 I64  J189 J969 J988 R092 S069                                                                   01  11                                 100 601
                   1                                          9101  F1084 422210  4M5                2017U7BN                                    I509231 067   22 0211I509 61L899                                                                                                                                                                   02 I509 L899                                                                                            03  23                                 100 803
                   1                                          6101  F1082 422210  4M3                2017U7DN                                    G309189 052   17 0111G309                                                                                                                                                                          01 G309                                                                                                 01  11                                 100 601
                   1                                          5101  F1052 361607  4D5                2017U7CN                                    I251215 063   21 0811I469 21I500 31I429 41I251 42I258 61E149 62I272 63I059                                                                                                                         08 I251 E149 I059 I258 I272 I429 I469 I500                                                              01  11                                 100 601

Looking at the format, it seems to be organized into rows and columns of some sort, with plain text codes. This is a good sign: there is already some structure I can load the data from, even if it's somewhat cryptic. Let's take a look at the User Guide for this file. Page 2 contains the following table:

List of File Data Elements and Tape Locations

Data Items                                Locations
1. General
  a. Resident status                             20
  b. Place of death and decedent’s status        83
  c. Day of week of death                        85
  d. Data year                              102-105
2. The Decedent
  a. Month of Death                           65-66
  b. Sex                                         69
  c. Race                                   445-450
  d. Age                                      70-82
  e. Marital status                              84
  f. Hispanic origin                        484-488
  h. Education                                61-64
3. Underlying Cause
  a. ICD-10 code                            146-149
  b. 358 cause recode                       150-152
  c. 113 cause recode                       154-156
  d. 130 cause recode                       157-159
  e. 39 cause recode                        160-161
4. Multiple Conditions
  a. Entity-Axis conditions
    1. Number of                            163-164
    2. The conditions                       165-304
  b. Record-Axis conditions
    1. Number of                            341-342
    2. The conditions                       344-443
5. Other medical items
  a. Place of injury                            145
  b. Injury at work                             106
  c. Method of disposition                      108
  d. Autopsy                                    109
  e. Activity code                              144

It appears the tape locations are character offsets from the beginning of each row. The codes are still unknown to me, but for now that's okay; it will be easier to decipher the codes and sanitize the data once it's in a format that can be properly loaded into a database. Searching Google, I found a GitHub Gist that parses the CDC data files to JSON. That looks like a good start to me and worth a shot to run.
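
Before diving into the Gist, a quick sanity check of my own confirms how the tape locations work: they are 1-based and inclusive, so a location like 102-105 corresponds to the 0-based, half-open Python slice [101:105].

# Not part of the Gist below; just verifying the tape locations by hand.
with open('VS17MORT.DUSMCPUB') as f:
    row = f.readline()

print(row[68])       # sex, location 69 -> 'M' on the first row above
print(row[101:105])  # data year, locations 102-105 -> '2017'
print(row[145:149])  # underlying cause ICD-10 code, locations 146-149 -> 'G20 '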

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
For each year, parse the pdf manual, then use that information to
unpack the fixed-width data file.

Source data files can be found here:
https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Mortality_Multiple

Passes basic tests for 2005-2015. Untested on earlier years.
"""

import json
import os
import pandas as pd
import pdb
import re
import tabula  # https://github.com/chezou/tabula-py, requires JDK8 install

from collections import defaultdict, OrderedDict
from slugify import slugify
from string import digits, whitespace


YEARS_TO_PROCESS = [i for i in range(2005, 2016)[::-1]]
INPUT_DATA_DIR = "raw_data"
EXPORT_DATA_DIR = "cleaned_data"
FIELDS_TO_DROP = {'detail_age'}
FIELD_MAP_PATTERNS_TO_DROP = [r'(?i)(_|\s)condition']
MANUALS_DIR = "file_definitions"
GIBBERISH_SEPARATOR = '>!^@^!<'  # should not match anything
UNCODED_FIELDS_TO_KEEP = {r'(?i)ICD_Code.*'}

LAST_PII_YEAR = 2004
ICD_VERSION = {year: 10 for year in range(1999, 2016)}
ICD_VERSION.update({year: 9 for year in range(1979, 1999)})
YEARS_MISSING_ICD_GROUPING = {2003}

DATA_PATHS = {
    2005: 'Mort05uspb.dat',
    2006: 'MORT06.DUSMCPUB',
    2007: 'VS07MORT.DUSMCPUB',
    2008: 'Mort2008us.dat',
    2009: 'VS09MORT.DUSMCPUB',
    2010: 'VS10MORT.DUSMCPUB',
    2011: 'VS11MORT.DUSMCPUB',
    2012: 'VS12MORT.DUSMCPUB',
    2013: 'VS13MORT.DUSMCPUB',
    2014: 'VS14MORT.DUSMCPUB',
    2015: 'VS15MORT.DUSMCPUB',
             }

PDF_PATHS = {
    2005: 'Record_Layout_2005.pdf',
    2006: 'Record_Layout_2006.pdf',
    2007: 'Record_Layout_2007.pdf',
    2008: 'Record_Layout_2008.pdf',
    2009: 'Record_Layout_2009.pdf',
    2010: 'Record_Layout_2010.pdf',
    2011: 'Record_Layout_2011.pdf',
    2012: 'Record_Layout_2012.pdf',
    2013: 'Record_Layout_2013.pdf',
    2014: 'Record_Layout_2014.pdf',
    2015: 'multiple_cause_record_layout_2015.pdf',
                 }

# Detail_Age_Type, Detail_Age are malformed, require special handling
HARDCODED_CODES = {x:
    {'detail_age_type': {1: 'Years', 2: 'Months', 4: 'Days', 5: 'Hours', 6: 'Minutes', 9: pd.np.nan}}
    for x in range(2005, 2016)}

HARDCODED_LOCATIONS = {x: {'detail_age_type': (69, 70), 'detail_age': (70, 73)}
    for x in range(2005, 2016)}


def extract_pdf_section(path, sep=',', pages=None):
    # May log several warnings that can be ignored like:
    # `org.apache.pdfbox.util.PDFStreamEngine processOperator INFO: unsupported/disabled operation`.
    full_path = os.path.join(MANUALS_DIR, path)
    try:
        return tabula.read_pdf(full_path, guess=False, pages=pages,
            pandas_options={'dtype': object, 'header': None, 'sep': sep, 'engine': 'python'})
    except pd.errors.ParserError:
        pdb.set_trace()


def extract_separate_pdf_pages(path):
    # distinct function as passing any sep at all causes an error with multiple_tables
    full_path = os.path.join(MANUALS_DIR, path)
    try:
        return tabula.read_pdf(full_path, multiple_tables=True, guess=False, pages='all',
            pandas_options={'dtype': object, 'header': None})
    except pd.errors.ParserError:
        pdb.set_trace()


def is_column_definition_page(page_data, year):
    # can't just pop the first column definition page; earlier years
    # do actually have different format that won't generate false positive
    clean_page_data = page_data.dropna(how='all').copy()
    cleaned_first_line = str(clean_page_data.iloc[0, -1]).strip()
    starts_with_year = cleaned_first_line.endswith(str(year))
    cleaned_second_line = str(clean_page_data.iloc[1, -1]).strip()
    # spacing is one of the main things tabula fails on
    # have to assume zero or multiple spaces where one normally exists
    follows_with_title = bool(safe_re_search(r'(?i)Mortality\s*Multiple\s*Cause-of-Death\s*Public\s*Use\s*Record$',
        cleaned_second_line))
    cleaned_third_line = str(clean_page_data.iloc[2, -1]).strip()
    is_summary_page = bool(safe_re_search(r'(?i)List\s*of\s*File\s*Data\s*Elements\s*and\s*Tape\s*Locations',
        cleaned_third_line))
    return starts_with_year and follows_with_title and not is_summary_page


def is_icd_ten_group_page(page_data):
    # matches text like 'Tenth Revision 130 Selected Causes of Infant Death Adapted for use by DVS'
    cleaned_first_line = str(page_data.iloc[0, 0]).strip()
    return bool(safe_re_search(r'(?i)Tenth\s*Revision\s*\d{2,3}\s*Selected\s*Causes\s*of\s*(\w+\s*)?Death', cleaned_first_line))


def identify_section_pages(path, year):
    # get page numbers for each section.
    # need 2 passes as multiple_tables does not support the 'sep' option
    column_definition_pages = []
    icd_group_pages = []
    pages = extract_separate_pdf_pages(path)
    have_seen_icd_pages = False
    for page_num, page_data in enumerate(pages):
        if is_column_definition_page(page_data, year):
            column_definition_pages.append(page_num + 1)
        elif is_icd_ten_group_page(page_data):
            icd_group_pages.append(page_num + 1)
            have_seen_icd_pages = True
        elif have_seen_icd_pages:
            # abort after end of icd to avoid pulling territorial column_definition_pages
            break
    return column_definition_pages, icd_group_pages


def extract_pdf(year):
    pdf_path = PDF_PATHS[year]
    column_definition_pages, icd_group_pages = identify_section_pages(pdf_path, year)
    results = [extract_pdf_section(pdf_path, pages=column_definition_pages)]
    # gibberish separator ensures a section is read in as 1 column per page
    results.append(extract_pdf_section(pdf_path, pages=icd_group_pages, sep=GIBBERISH_SEPARATOR))
    return results


def safe_re_search(pattern, text):
    if pd.isnull(text):
        return None
    return re.search(pattern, text)


def is_column_location_row(location_text):
    return bool(safe_re_search(r'^\d{1,5}(-\d{1,5})?', location_text))


def is_encoding_row(code_text):
    return bool(safe_re_search(r'\.{3}.+', code_text))


def is_continuation(field_text):
    return bool(safe_re_search(r'(?i)\s*-\s*Con\.$', field_text))


def is_reserved_position(field_text):
    return bool(safe_re_search(r'(?i)Reserved(\s*Position)?(s)?$', field_text))


def is_condition_entry(field_text):
    return bool(safe_re_search(r'(?i)^\d{1,2}.+\s*Condition$', field_text))


def is_condition_header(field_text):
    return bool(safe_re_search(r'(?i)^\w+\s*-\s*Axis\s*Conditions?$', field_text))


def is_condition_size(field_text):
    return bool(safe_re_search(r'(?i)^Number\s*of\s*\w+\s*-\s*Axis\s*Conditions?$', field_text))


def check_for_duplicate_rows(df):
    duplication_filter = (df.field_or_code.duplicated() & df.is_column_loc)
    if len(df[duplication_filter]) == 0:
        return None
    duplicated_fields = list(df[duplication_filter].field_or_code.unique())
    raise ValueError(f"Duplicate field names: {duplicated_fields}")


def add_basic_column_data(df):
    try:
        df.columns = ['location', 'size', 'field_or_code', 'code_value']
    except ValueError:
        # some pdfs get parsed with only 3 columns. Known issue with 2009 & earlier, exact num unclear
        df.columns = ['location', 'size', 'field_or_code']
        df['code_value'] = df['field_or_code'].apply(lambda x:
            str(x)[str(x).find('...'):].strip() if str(x).find('...') >= 0 else pd.np.nan)
        df['field_or_code'] = df['field_or_code'].apply(lambda x:
            str(x)[:str(x).find('...')].strip() if str(x).find('...') >= 0 else x)

    # tabula errors can introduce quotes. None exist in pdfs checked by hand.
    for column in df.columns:
        df[column] = df[column].apply(lambda x: str(x).strip(whitespace + '"') if not pd.isnull(x) else pd.np.nan)
    df['is_column_loc'] = (df['location'].apply(is_column_location_row) &
        ~df['size'].apply(pd.isnull))
    df['is_encoding'] = (df['code_value'].apply(is_encoding_row) &
        ~df['field_or_code'].apply(pd.isnull) & df['size'].apply(pd.isnull))
    return df


def update_condition_names(df):
    # some condition field names can end up in the code_values column
    # these are migrated over
    malformed_field_filter = (df.is_column_loc & df.field_or_code.isnull())
    df.loc[malformed_field_filter, 'field_or_code'] = df[malformed_field_filter].code_value
    df['condition_entry'] = (df['field_or_code'].apply(is_condition_entry) & df['is_column_loc'])
    df['condition_type'] = (df['field_or_code'].apply(is_condition_header) & df['is_column_loc'])
    df['condition_type'] = df.apply(
        lambda row: row['field_or_code'].split('-')[0].strip().title()
        if row['condition_type'] else None, axis=1)
    df['condition_type'].fillna(method='ffill', inplace=True)
    df = df[~df['field_or_code'].apply(is_condition_header)].copy()
    df['condition_number'] = df.apply(
        lambda row: ''.join([i for i in row['field_or_code'] if i.isnumeric()])
        if row['condition_entry'] else None, axis=1
                                     )
    df.loc[df['condition_entry'], 'field_or_code'] = df[df['condition_entry']].apply(
            lambda row: f'{row["condition_type"]}_condition_{row["condition_number"]}', axis=1)
    df.loc[df['condition_entry'], 'code_value'] = pd.np.nan
    df.loc[df['condition_entry'], 'is_column_loc'] = False
    return df


def drop_useless_rows(df):
    df = df[df.is_column_loc | df.is_encoding].copy()
    df = df[~(df['field_or_code'].apply(is_continuation) & df['is_column_loc'])].copy()
    df = df[~df['field_or_code'].apply(lambda x: str(x).startswith('Reserved'))].copy()
    return df


def handle_missing_values(df):
    df['location'].fillna(method='ffill', inplace=True)
    df['size'].fillna(method='ffill', inplace=True)
    df.fillna(value='', inplace=True)
    return df


def tidy_data(df):
    df['code_value'] = df['code_value'].apply(lambda x:
        str(x)[str(x).find('...') + len('...'):].strip() if str(x).find('...') >= 0 else pd.np.nan)
    # documentation uses 1 based indexing, we need 0 based
    df['location_start'] = df['location'].apply(lambda x: int(x.split('-')[0]) - 1)
    # first page of results has non-unique code keys for US vs territorial datasets
    # retaining first keeps US keys
    df.drop_duplicates(subset={'location', 'size', 'field_or_code'}, inplace=True)
    df['size'] = df['size'].apply(int)
    df['location_end'] = df.apply(
            lambda row: row['location_start'] + row['size'], axis=1)
    df['field'] = df.apply(lambda row: row['field_or_code']
        if any([row['is_column_loc'], row['condition_entry']]) else pd.np.nan, axis=1)
    df['field'].fillna(method='ffill', inplace=True)
    df['field'] = df['field'].apply(lambda x: slugify(x).replace('-', '_'))
    df.rename(columns={'field_or_code': 'code'}, inplace=True)
    return df


def populate_field_data(df, year):
    df['field_to_keep'] = df['field'].apply(lambda x:
        any([bool(safe_re_search(pattern, x)) for pattern in UNCODED_FIELDS_TO_KEEP]))
    df = df[~df['is_column_loc'] | df['field_to_keep']].copy()
    df = df[~df['field'].isin(FIELDS_TO_DROP)].copy()
    # enforcing title case to remove source of inconsistencies across years
    df['code'] = df['code'].apply(str.title)
    field_codes = defaultdict(dict)
    df.apply(lambda row:
        field_codes[row['field']].update({row['code']: row['code_value']}), axis=1)
    field_codes.update(HARDCODED_CODES[year])
    field_codes = {k: v for k, v in field_codes.items()
        if not any([re.search(pattern, k) for pattern in FIELD_MAP_PATTERNS_TO_DROP])}
    df.drop_duplicates(inplace=True)
    df.sort_values(by=['location_start'], inplace=True)
    df['location'] = df.apply(
        lambda row: (row['location_start'], row['location_end']), axis=1)
    df.drop_duplicates(subset=['location'], inplace=True)
    field_locations = OrderedDict()
    df.apply(lambda row: field_locations.update({row['field']: row['location']}), axis=1)
    field_locations.update(HARDCODED_LOCATIONS[year])
    field_locations = OrderedDict(sorted(field_locations.items(), key=lambda x: x[1][0]))
    return field_locations, dict(field_codes), df


def parse_pdf_data(df, year):
    df = add_basic_column_data(df)
    df = drop_useless_rows(df)
    df = update_condition_names(df)
    check_for_duplicate_rows(df)
    df = handle_missing_values(df)
    df = tidy_data(df)
    return populate_field_data(df, year)


def read_dataset(year, field_locations):
    full_path = os.path.join(INPUT_DATA_DIR, DATA_PATHS[year])
    return pd.read_fwf(full_path, colspecs=list(field_locations.values()),
        names=list(field_locations.keys()), header=None, dtype=object)


def validate_data_types(df, year):
    # since we don't know the complete list of columns to expect, we only
    # validate a small sample that definitely ought to exist with consistent formats
    if not all([
        set(df['autopsy'].apply(str.upper).unique()) == {'Y', 'N', 'U'},
        set(df['sex'].unique()) == {'M', 'F'},
        all(df['detail_age_type'].apply(str.isnumeric)),
        df['current_data_year'].unique().tolist() == [str(year)],
        len([x for x in df.columns if 'record' in x]) >= 20,
        len([x for x in df.columns if 'entity' in x]) >= 20,
               ]):
        pdb.set_trace()
        raise ValueError(f"Invalid datatype found in {year} data")


def safe_first_re_match(pattern, text):
    match = safe_re_search(pattern, text)
    if match:
        return match[0]


def extract_cause_title(text):
    return safe_first_re_match(r'(?i)\b\d{1,4}\s*selected\s*causes\b', text)


def is_idc_data_header(text):
    return bool(safe_re_search(r'(?i)^recodetsexagecause', re.sub('\s', '', text)))


def extract_recode(text):
    return safe_first_re_match(r'^\d{3}', text)


def remove_filler_rows(df):
    df['is_header'] = df['text'].apply(is_idc_data_header)
    cause_title_indexes = list(df[df.is_cause_title].index)
    header_indexes = list(df[df.is_header].index)
    filler_indexes = [x for x in zip(cause_title_indexes, header_indexes)]
    for min_idx, max_idx in filler_indexes:
        df = df[(df.index < min_idx) | (df.index > max_idx)].copy()
    return df


def validate_ICD_codes(icd_codes):
    # codes should form a complete sequence within each group
    for cause_title, code_group in icd_codes.items():
        int_icd_codes = sorted([int(x) for x in code_group.keys()])
        expected_code_range = [i for i in range(min(int_icd_codes), max(int_icd_codes) + 1)]
        if int_icd_codes != expected_code_range:
            pdb.set_trace()
            raise ValueError(f'ICD codes for {cause_title} do not form complete sequence')


def read_ICD_codes(df):
    df.columns = ['text']
    df.dropna(inplace=True)
    # some lines end up improperly quote wrapped due to tabula error
    df['text'] = df[~df['text'].isnull()].copy()['text'].apply(
        lambda x: x.strip('"') if not pd.isnull(x) else pd.np.nan)
    df['cause_title'] = df['text'].apply(extract_cause_title)
    df['is_cause_title'] = ~df['cause_title'].isnull()
    df['cause_title'].fillna(method='ffill', inplace=True)
    df = remove_filler_rows(df)
    df['recode'] = df['text'].apply(extract_recode)
    df['recode'].fillna(method='ffill', inplace=True)
    # very first row can still be blank; can be safely dropped
    df = df.dropna(subset=['text', 'recode'], how='any').copy()
    df['text'] = (df[['text', 'recode', 'cause_title']].groupby(['cause_title', 'recode'])
        .transform(lambda x: ' '.join(x)))
    df.drop_duplicates(inplace=True)
    # tabula fails to retain spacing; we cannot recover subtotal/gender/age codes
    df['text'] = df['text'].apply(lambda x: x.lstrip(digits + whitespace).strip())
    icd_codes = {title: dict() for title in df['cause_title'].unique()}
    df.apply(lambda row: icd_codes[row['cause_title']].update({row['recode']: row['text']}), axis=1)
    validate_ICD_codes(icd_codes)
    return icd_codes


def update_field_map(field_map, icd_codes):
    for code_group, icd_values in icd_codes.items():
        group_number = safe_first_re_match('\d{1,4}', code_group)
        matching_pattern = f'(?i){group_number}.+Cause_Recode$'
        field_map_matching_key = [x for x in field_map.keys() if safe_re_search(matching_pattern, x)][0]
        field_map[field_map_matching_key] = icd_values
    return field_map


def export_code_maps(field_map, year):
    with open(os.path.join(EXPORT_DATA_DIR, f'{year}_codes.json'), 'w+') as f_open:
        json.dump(field_map, f_open)


def process_year(year):
    print(f"Unpacking pdf for {year}")
    documentation_data = extract_pdf(year)
    field_locations, field_map, df = parse_pdf_data(documentation_data[0], year)
    print(f"Parsing fixed with file for {year}")
    df = read_dataset(year, field_locations)
    validate_data_types(df, year)
    df.to_csv(os.path.join(EXPORT_DATA_DIR, f'{year}_data.csv'), index=False)
    print(f"Exporting column code mappings")
    icd_codes = read_ICD_codes(documentation_data[1])
    field_map = update_field_map(field_map, icd_codes)
    export_code_maps(field_map, year)
    print(f"Finished {year}")


def process_all_years():
    for year in YEARS_TO_PROCESS:
        process_year(year)


if __name__ == '__main__':
    process_all_years()

The only external dependencies for the script are pandas, tabula-py, and python-slugify.
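
If you want to run it yourself, installing those three packages (plus a JDK 8 for tabula-py, per the comment in the script's imports) should be all the setup needed; something like:

$ pip install pandas tabula-py python-slugify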

Note: if you plan on running this program yourself, you should know that numpy tries to allocate 4093 GB of memory in order to parse the data. That's a lot more memory than I had on my machine, to say the least! According to a Stack Overflow post, this is probably due to your system's overcommit handling mode (mode 0, the default, uses a heuristic that refuses huge allocations; mode 1 lets the kernel overcommit freely). I urge you to check out the post if you are curious, but you can work around this problem on Linux fairly simply with

 echo 1 | sudo tee /proc/sys/vm/overcommit_memory

and, if you want, set it back to the default when you are finished running the program with

 echo 0 | sudo tee /proc/sys/vm/overcommit_memory

Running the script on my desktop 2 took about 26 minutes and left me with much more normalized data. For each year it outputs two files:

  1. a CSV file with the first row as labels for our column data
$ head cleaned_data/2015_data.csv
resident_status,education_1989_revision,education_2003_revision,education_reporting_flag,month_of_death,sex,detail_age_type,detail_age,age_substitution_flag,age_recode_52,age_recode_27,age_recode_12,infant_age_recode_22,place_of_death_and_decedents_status,marital_status,day_of_week_of_death,current_data_year,injury_at_work,manner_of_death,method_of_disposition,autopsy,activity_code,place_of_injury_for_causes_w00_y34_except_y06_and_y07,icd_code_10th_revision,358_cause_recode,113_cause_recode,130_infant_cause_recode,39_cause_recode,number_of_entity_axis_conditions,entity_condition_1,entity_condition_2,entity_condition_3,entity_condition_4,entity_condition_5,entity_condition_6,entity_condition_7,entity_condition_8,entity_condition_9,entity_condition_10,entity_condition_11,entity_condition_12,entity_condition_13,entity_condition_14,entity_condition_15,entity_condition_16,entity_condition_17,entity_condition_18,entity_condition_19,entity_condition_20,number_of_record_axis_conditions,record_condition_1,record_condition_2,record_condition_3,record_condition_4,record_condition_5,record_condition_6,record_condition_7,record_condition_8,record_condition_9,record_condition_10,record_condition_11,record_condition_12,record_condition_13,record_condition_14,record_condition_15,record_condition_16,record_condition_17,record_condition_18,record_condition_19,record_condition_20,race,bridged_race_flag,race_imputation_flag,race_recode_3,race_recode_5,hispanic_origin,hispanic_origin_race_recode
1,,3,1,01,M,1,084,,42,22,10,,1,M,1,2015,U,7,B,N,,,I500,230,067,,22,02,11I500,61L031,,,,,,,,,,,,,,,,,,,02,I500,L031,,,,,,,,,,,,,,,,,,,01,,,1,1,100,6
1,,6,1,01,M,1,070,,40,20,09,,1,M,2,2015,U,7,C,N,,,C349,093,027,,08,08,11I469,21R042,31C349,41J189,42J90,61J449,62I251,63F179,,,,,,,,,,,,,07,C349,F179,I251,I469,J440,J90,R042,,,,,,,,,,,,,,01,,,1,1,100,6
1,,3,1,01,F,1,091,,44,24,11,,4,W,6,2015,U,7,C,N,,,G309,189,052,,17,01,11G309,,,,,,,,,,,,,,,,,,,,01,G309,,,,,,,,,,,,,,,,,,,,01,,,1,1,100,6
1,,3,1,01,F,1,040,,34,14,06,,4,S,5,2015,N,3,B,Y,9,9,X91,434,129,,41,06,11T71,12X91,21T71,61T149,62S223,63S299,,,,,,,,,,,,,,,05,X91,S223,S299,T149,T71,,,,,,,,,,,,,,,,03,,,2,3,100,8
1,,5,1,01,F,1,089,,43,23,11,,6,W,5,2015,U,7,C,N,,9,I250,214,062,,21,03,11I250,21S720,61X590,,,,,,,,,,,,,,,,,,03,I250,S720,X590,,,,,,,,,,,,,,,,,,01,,,1,1,100,6
1,,6,1,01,M,1,078,,41,21,10,,4,W,6,2015,U,7,C,N,,,I516,233,068,,22,04,11I499,21I516,61E780,62F179,,,,,,,,,,,,,,,,,04,I516,E780,F179,I499,,,,,,,,,,,,,,,,,03,,,2,3,100,8
1,,4,1,01,M,1,092,,44,24,11,,4,W,1,2015,U,7,R,N,,,E274,161,111,,37,01,11E274,,,,,,,,,,,,,,,,,,,,01,E274,,,,,,,,,,,,,,,,,,,,01,,,1,1,100,6
1,,3,1,01,M,1,082,,42,22,10,,4,M,1,2015,U,7,C,N,,,I500,230,067,,22,01,11I500,,,,,,,,,,,,,,,,,,,,01,I500,,,,,,,,,,,,,,,,,,,,01,,,1,1,100,6
1,,3,1,01,F,1,089,,43,23,11,,4,W,1,2015,U,7,B,N,,,I350,225,068,,22,02,11I500,21I350,,,,,,,,,,,,,,,,,,,02,I350,I500,,,,,,,,,,,,,,,,,,,01,,,1,1,100,6
  2. a JSON file containing the key codes for each table value 3
$ cat cleaned_data/2015_codes.json
{
    "resident_status": {"1": "RESIDENTS", "2": "INTRASTATE NONRESIDENTS", "3": "INTERSTATE NONRESIDENTS", "4": "FOREIGN RESIDENTS"}
...
}
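
With the two outputs together, decoding a column is just a dictionary lookup. A quick sketch of my own, using the 2015 files above:

import json
import pandas as pd

# Load the coded CSV and the exported key map, then translate a coded
# column into its human-readable values.
df = pd.read_csv('cleaned_data/2015_data.csv', dtype=str)
with open('cleaned_data/2015_codes.json') as f:
    codes = json.load(f)

# e.g. resident_status '1' -> 'RESIDENTS'
df['resident_status'] = df['resident_status'].map(codes['resident_status'])
print(df['resident_status'].value_counts().head())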

Let's see how many rows of data we have to work with here.

$ wc -l cleaned_data/*.csv | grep total
  27720684 total

A little under 28 million; not bad. And let's get a rough idea of the size of this data on disk.

$ du -h cleaned_data/*.csv
365M	cleaned_data/2005_data.csv
362M	cleaned_data/2006_data.csv
362M	cleaned_data/2007_data.csv
369M	cleaned_data/2008_data.csv
364M	cleaned_data/2009_data.csv
369M	cleaned_data/2010_data.csv
377M	cleaned_data/2011_data.csv
381M	cleaned_data/2012_data.csv
389M	cleaned_data/2013_data.csv
394M	cleaned_data/2014_data.csv
408M	cleaned_data/2015_data.csv

Since the file sizes are all in megabytes, we can get the sum pretty simply:

$ du -h cleaned_data/*.csv | awk '{print $1}' | sed s/M// | awk '{total+=$1} END{print total/1024 " gigabytes"}'
4.04297 gigabytes

Choosing a database

We have some structured data now, but it needs to be transformed a bit further, and the database we use to store it will affect that transformation significantly. For example, if we use a relational database, we'd have to make sure we transform our semi-structured data to match the type definitions of the tables. Document (NoSQL) databases are also an option and may be a good fit, since we would not have to design a schema and could just upload our data row by row. Document DBs lack the transactional guarantees of relational databases, but this project won't involve many transactional operations. Leaving the data as files and loading it into memory is also an option, especially because the amount of data is relatively small and it will be read-only for the most part. That would be the most performant option; however, any time we needed to update the data, say to add additional years, we would have to rewrite our storage system.

I think a happy medium between these methods is Apache CouchDB. The nature of our data is not transactional, and CouchDB comes with a built-in HTTP API, so people can start accessing the data quickly and easily right out of the gate. Using CouchDB we can store the data in a structured form that is easily accessible for an MVP, and iterate on that solution if we need to make changes. We can also add a caching layer later if we want the performance benefits of in-memory storage.
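
As a rough sketch of what that could look like (my own illustration, assuming a local CouchDB instance at localhost:5984 with permission to create databases), a year of rows could be pushed through that HTTP API like so:

import pandas as pd
import requests

# Create a database, then bulk-insert a batch of rows as JSON documents.
requests.put('http://localhost:5984/mortality')
df = pd.read_csv('cleaned_data/2015_data.csv', dtype=str).fillna('')
docs = df.head(1000).to_dict(orient='records')  # small batch for illustration
resp = requests.post('http://localhost:5984/mortality/_bulk_docs', json={'docs': docs})
resp.raise_for_status()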

However, the end result I would like to achieve is a read-only GraphQL endpoint for this data. GraphQL is an important part of exposing this data: the tables will contain 76 columns, many of them not particularly useful, and letting clients request only the fields they need (say, just sex, age, and the ICD-10 code) greatly reduces the size of each transfer.

PostGraphile seems like it would be a perfect solution to expose this data, as long as the data is in a PostgreSQL database.

Given our options, for long-term scalability and availability the data seems like it would live best in a PostgreSQL database exposed over GraphQL. Let's create some tables!

Footnotes

1. Most recent at the time of writing.

2. neofetch output for my desktop:


rob@asus
--------
OS: Arch Linux x86_64
Host: M32CD_A_F_K20CD_K31CD
Kernel: 5.1.16-arch1-1-ARCH
Uptime: 12 days, 10 hours, 17 mins
Packages: 1346 (pacman)
Shell: zsh 5.7.1
Resolution: 1440x2560, 1440x2560, 1920x1080
WM: i3
Theme: Adwaita [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: alacritty
CPU: Intel i5-6400 (4) @ 3.300GHz
GPU: NVIDIA GeForce GTX 1650
GPU: Intel HD Graphics 530
Memory: 7763MiB / 15926MiB

3. The JSON in the file is on one line; the example given has been altered for readability.