In this scenario, I'm a data analyst for the New York Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. My supervisor has asked me to evaluate the quality and reliability of available census data to inform this decision.
# Packages used throughout: dplyr for the pipeline verbs, knitr for kable()
library(dplyr)
library(knitr)
# MOE percentage and reliability categories using mutate()
county_reliability <- county_data_clean %>%
  mutate(
    moe_percentage = round(100 * median_incomeM / median_incomeE, 2),
    reliability = case_when(
      moe_percentage < 5 ~ "High Confidence",
      moe_percentage <= 10 ~ "Moderate Confidence",
      moe_percentage > 10 ~ "Low Confidence",
      TRUE ~ NA_character_
    ),
    unreliable = moe_percentage > 10 # <- required flag
  )
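To check where the cutoffs fall, the same rules can be run on a few made-up values. This is a minimal base-R sketch: `classify_reliability()` and the toy numbers are hypothetical and are not part of the county analysis.

```r
# Hypothetical helper mirroring the case_when() rules above (base R only).
classify_reliability <- function(estimate, moe) {
  moe_pct <- round(100 * moe / estimate, 2)
  ifelse(
    moe_pct < 5, "High Confidence",
    ifelse(moe_pct <= 10, "Moderate Confidence", "Low Confidence")
  )
}

# Made-up estimates and MOEs chosen to land on and around the cutoffs.
toy_estimate <- c(60000, 60000, 60000, 60000)
toy_moe      <- c(1200, 3000, 6000, 6600)   # 2%, 5%, 10%, 11% of the estimate

classify_reliability(toy_estimate, toy_moe)
```

Under these rules a value of exactly 5% is Moderate Confidence rather than High, and exactly 10% is still Moderate Confidence rather than Low.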
# Summary showing count of counties in each reliability category
reliability_summary <- county_reliability %>%
  group_by(reliability) %>%
  summarize(
    counties = n(),
    avg_income = round(mean(median_incomeE, na.rm = TRUE), 0)
  ) %>%
  mutate(percent = round(counties / sum(counties) * 100, 1))
# Five counties with the highest MOE percentage
high_uncertainty <- county_reliability %>%
  arrange(desc(moe_percentage)) %>%
  slice_head(n = 5) %>%
  select(county_name, total_popE, median_incomeE, median_incomeM, moe_percentage, reliability)
kable(
  high_uncertainty,
  col.names = c("County", "Total Population", "Median Income (E)", "Median Income MOE", "MOE %", "Reliability"),
  caption = "Counties with Highest Income Data Uncertainty",
  format.args = list(big.mark = ",")
)
This table shows that counties with small populations are poorly served by algorithms that rely on this income data. Smaller populations mean smaller samples in the American Community Survey (ACS), which widens the margin of error and makes estimates less certain: the true value could be noticeably higher or lower than reported. For example, Hamilton County has only about 5,000 people, and its median income estimate carries a high MOE of 11.39%.
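To make "higher or lower than reported" concrete, a margin of error can be turned into an interval around the estimate. The sketch below relies on the fact that ACS margins of error are published at the 90% confidence level, so the standard error is the MOE divided by 1.645. The function `acs_interval()` and the dollar figures are hypothetical; only the 11.39% ratio comes from the analysis.

```r
# ACS margins of error are published at the 90% confidence level (z = 1.645).
acs_interval <- function(estimate, moe, z_target = 1.645) {
  se <- moe / 1.645                      # standard error implied by the 90% MOE
  data.frame(
    lower = estimate - z_target * se,
    upper = estimate + z_target * se
  )
}

# Hypothetical county: $60,000 median income with an MOE of $6,834 (11.39%).
acs_interval(60000, 6834)                   # 90% interval
acs_interval(60000, 6834, z_target = 1.96)  # wider 95% interval
```

With these illustrative numbers the 90% interval runs from roughly $53,166 to $66,834, a spread of more than $13,000 around a single reported value.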
# Tract data: MOE as a percentage of each group's estimate
tract_data <- tract_data %>%
  mutate(
    moe_white = 100 * (whiteM / whiteE),
    moe_black = 100 * (blackM / blackE),
    moe_hispanic = 100 * (hispanicM / hispanicE),
    # Flag tracts where any group exceeds the 15% cutoff
    high_moe_flag = ifelse(
      moe_white > 15 | moe_black > 15 | moe_hispanic > 15,
      "High MOE", "Other"
    ),
    # Same flag with a looser 30% cutoff
    high_moe_flag_30 = ifelse(
      moe_white > 30 | moe_black > 30 | moe_hispanic > 30,
      "High MOE", "Lower MOE"
    )
  )
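One edge case worth checking is a tract where a group's estimate is zero. Dividing by zero gives `Inf` or `NaN` in R, which can push a tract into the High MOE group or leave its flag as `NA`. Whether this occurs in the tract data here is not established by the analysis, so the sketch below is only an illustration: `safe_moe_pct()` is a hypothetical helper and the values are made up.

```r
# Hypothetical helper: MOE as a percentage of the estimate, NA when the
# estimate is zero or missing (so the ratio is undefined).
safe_moe_pct <- function(moe, estimate) {
  ifelse(!is.na(estimate) & estimate > 0, 100 * moe / estimate, NA_real_)
}

# Made-up values: a normal tract, a zero estimate with a nonzero MOE, and 0/0.
toy_moe      <- c(10, 12, 0)
toy_estimate <- c(100, 0, 0)

100 * (toy_moe / toy_estimate)        # unguarded: 10, Inf, NaN
safe_moe_pct(toy_moe, toy_estimate)   # guarded:   10, NA, NA
```

Returning `NA` keeps undefined ratios out of the threshold comparison, so they can be counted and reported separately instead of being silently grouped with genuinely high MOEs.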
# Pattern analysis: compare tract characteristics across the 30% MOE groups
MOE_analysis_30 <- tract_data %>%
  group_by(high_moe_flag_30) %>%
  summarize(
    n_tracts = n(),
    avg_pop = mean(total_popE, na.rm = TRUE),
    avg_pct_white = mean(pct_white, na.rm = TRUE),
    avg_pct_black = mean(pct_black, na.rm = TRUE),
    avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE),
    .groups = "drop"
  )
# Create table
kable(
  MOE_analysis_30,
  digits = 1,
  col.names = c("MOE Group (30%)", "Number of Tracts", "Avg Population",
                "Avg % White", "Avg % Black", "Avg % Hispanic"),
  caption = "Tract Summary 30% MOE Cutoff"
)
When a cut-off of 30% was used, only a small number of tracts fell into the ‘Lower MOE’ category, making this the more reliable subset. Interestingly, these Lower MOE tracts tend to be more diverse than the High MOE tracts. This pattern suggests that more diverse areas often produce more reliable estimates, mainly because they also tend to have larger populations. Diversity itself is not the cause of lower MOEs; it shows up indirectly because diverse tracts are often larger and therefore sampled more thoroughly in the ACS.
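One way to probe the population explanation is to look at how MOE moves with tract size directly. The sketch below uses a rank correlation on made-up tracts; `toy_tracts`, `max_moe`, and every number in it are hypothetical and do not come from the ACS or from this analysis.

```r
# Made-up tracts for illustration only; none of these numbers come from the ACS.
toy_tracts <- data.frame(
  total_popE = c(1200, 2500, 4000, 6000),
  max_moe    = c(45, 30, 22, 12)   # largest of the three group MOE percentages
)

# Rank correlation between tract population and MOE; a value near -1 means
# larger tracts tend to have smaller relative margins of error.
pop_moe_rho <- cor(toy_tracts$total_popE, toy_tracts$max_moe,
                   method = "spearman", use = "complete.obs")
pop_moe_rho
```

On the real tract data, the same call could be run on `total_popE` and the largest of `moe_white`, `moe_black`, and `moe_hispanic` per tract. A strongly negative value would be consistent with population size, not diversity, driving reliability.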
Policy Recommendation
My analysis found a systematic pattern: data quality varies significantly across counties and census tracts, and it tracks population size. At the county level, the most populous counties had the most reliable data. For example, New York County had a margin of error of just 1.78%, compared with 5.24% for Seneca County. The same pattern held at the census tract level, where tracts with smaller or less diverse populations had less reliable estimates than larger, more diverse tracts.
The communities at greatest risk of being misclassified by an algorithm are typically smaller, rural, and less diverse. Income estimates for these areas carry higher margins of error, so the true income could be much higher or lower than the estimate. An area such as Seneca County could therefore be misclassified in either direction: flagged as high-need when it is not, or overlooked when it is, leading to misdirected or insufficient resources.
The root cause of these reliability issues is that the American Community Survey (ACS) samples fewer households in small or rural communities. Smaller samples produce higher margins of error, while large urban areas benefit from more precise estimates, so the reliability of the survey is not evenly distributed across community types. The Department should therefore implement case-by-case reviews for communities with high margins of error. These areas require closer analysis and should not be evaluated under the same assumptions used for larger, more reliably measured communities.
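A case-by-case review could start by asking whether two income estimates differ by more than sampling noise before an algorithm ranks one community above another. The sketch below follows the standard approach for comparing ACS estimates: convert each 90% MOE to a standard error (MOE / 1.645), combine them, and compare the resulting z-score with 1.645. The function `acs_differs()` and all figures are hypothetical.

```r
# Test whether two ACS estimates differ at the 90% confidence level, using the
# published 90% margins of error (SE = MOE / 1.645).
acs_differs <- function(est1, moe1, est2, moe2, z_crit = 1.645) {
  se_diff <- sqrt((moe1 / 1.645)^2 + (moe2 / 1.645)^2)
  abs(est1 - est2) / se_diff > z_crit
}

# Hypothetical median incomes (estimate, MOE); not actual county figures.
acs_differs(58000, 3000, 61000, 3200)   # FALSE: the gap is within sampling noise
acs_differs(50000, 1000, 61000, 3200)   # TRUE: the gap exceeds sampling noise
```

In the first comparison a $3,000 gap looks meaningful on paper but cannot be distinguished from sampling error, which is the situation where a manual review adds the most value.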
# Map each reliability category to a recommended level of algorithmic use
recommendations_data <- county_reliability %>%
  select(county_name, median_incomeE, moe_percentage, reliability) %>%
  mutate(
    recommendation = case_when(
      reliability == "High Confidence" ~ "Safe for algorithmic decisions",
      reliability == "Moderate Confidence" ~ "Use with caution - monitor outcomes",
      reliability == "Low Confidence" ~ "Requires manual review or additional data",
      TRUE ~ "Unclassified"
    )
  )
kable(
  recommendations_data,
  digits = 1,
  col.names = c("County", "Median Income", "MOE %", "Reliability", "Recommendation"),
  caption = "Algorithmic Decision Framework by County",
  format.args = list(big.mark = ",")
)
Snapshot of results
All but six of New York State's counties are classified as High Confidence. The counties with the highest confidence are Queens, Erie, Suffolk, Kings, Monroe, Nassau, Westchester, Onondaga, New York, Bronx, and Orange, all of which have a margin of error below 2%. These counties are considered suitable for immediate algorithmic implementation. Only five counties are classified as Moderate Confidence, with a margin of error between 5% and 10%: Essex, Yates, Greene, Schuyler, and Seneca. The income estimates in these counties should be monitored closely and reviewed manually to ensure that resources are allocated fairly. Hamilton County (11.39% MOE) is the only county in the Low Confidence category. Its income estimate is highly uncertain because of sampling error and should not be used for algorithmic decisions without additional checks. Community-level surveys should be conducted and reviewed alongside the ACS data to guide policy decisions.
Data Sources:
U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates - Retrieved via tidycensus R package on 09/27/2025