5  From School To Juvenile Incarceration

Author

Isley Jean-Pierre

5.1 From School To Juvenile Incarceration

This is a project that I hope to present at the NYC OpenData conference next spring. I must admit that there are some datasets that I intended to use for this final assignment, but because of time and roadblocks, I will focus on the three key datasets for now. My goal is to keep working on this project by finding a meaningful way to include the other datasets. I believe that this project has the potential to become something special. Understanding the relationship between probation supervision levels, juvenile rearrest rates, and school discharge may help policymakers evaluate whether current probation resources are sufficient to reduce recidivism among youth.

5.1.1 Loading Libraries

library(tidyverse)
library(Hmisc)
library(corrplot)
library(lubridate)
library(leaflet)
library(sf)
library(viridis)
library(stringr)
library(tidyr)
library(tigris)
library(htmltools)

5.2 Preparing Rearrest Rate Data

I will use three datasets from NYC OpenData. Let’s start with the “rearrest rate” and “DOP Juvenile” datasets.

The rearrest rate dataset contains the number of juvenile probationers rearrested divided by the number supervised during the reporting period.

The DOP Juvenile dataset contains the number of active juvenile probation supervision cases on the last day of the reporting period, broken out by program: General Supervision, Pathways to Excellence Achievement and Knowledge (PEAK), Every Child Has An Opportunity To Excel And Succeed (ECHOES), Juvenile Justice Initiative (JJI), Advocate Intervene Mentor (AIM), and Enhanced Supervision Program (ESP).

The “school discharge” dataset provides annual reporting on New York City Department of Education student discharges and transfers, as required by Local Law 42. It includes the number of students discharged or transferred by grade or cohort, disaggregated by demographics (race/ethnicity, gender, age, English language learner status, and special education status) and summarized at multiple geographic levels (citywide, borough, district, and school). Discharge and transfer codes indicate the reason students exited or moved within the system.

5.2.1 Loading Rearrest Data

# Load dataset
rea <- nycOpenData::nyc_dop_juvenile_rearrest_rate(limit = 10000)

# Preview first few rows
rea %>%
  head() %>%
  knitr::kable(
    caption = "Preview of juvenile rearrest rate data from NYC Open Data, showing key variables related to rearrest trends over time."
  )
Table 5.1: Preview of juvenile rearrest rate data from NYC Open Data, showing key variables related to rearrest trends over time.
| borough | month | year | rate |
|---|---|---|---|
| Citywide | January | 2026 | 4.6 |
| Citywide | September | 2025 | 4.5 |
| Citywide | December | 2025 | 4.4 |
| Citywide | July | 2025 | 5.3 |
| Citywide | October | 2025 | 4.4 |
| Citywide | August | 2025 | 4.6 |

5.2.2 Cleaning Rearrest Data

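# Keep 2023-2025 and build a Date column from the month name and year
rea_clean <- rea %>%
  filter(year >= 2023 & year <= 2025) %>%
  mutate(
    month_year = paste(month, year, sep = " "),
    month_year = my(month_year)   # e.g. "January 2025" becomes 2025-01-01
  ) %>%
  # Drop the original columns now that month_year replaces them
  select(-month, -year)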
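
The date step above relies on lubridate's my() to turn a month name and a year into a Date pinned to the first day of that month. Here is a minimal sketch on a made-up data frame (the `toy` object and its values are assumptions for illustration, not rows from the real dataset):

```r
library(lubridate)

# Toy rows shaped like the rearrest data (illustrative values only)
toy <- data.frame(
  month = c("January", "September", "December"),
  year  = c("2026", "2025", "2025"),
  stringsAsFactors = FALSE
)

# Paste month and year together, then parse the result as month-year
toy$month_year <- my(paste(toy$month, toy$year))

# Parsed dates sort chronologically, unlike the month names themselves
toy[order(toy$month_year), ]
```

Having a real Date column is what later makes the join by month and the time series plot possible.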

5.3 Preparing Juvenile Cases Dataset

This dataset contains the number of active juvenile probation supervision cases on the last day of each reporting period, broken out by supervision program: General Supervision, Pathways to Excellence Achievement and Knowledge (PEAK), Every Child Has An Opportunity To Excel And Succeed (ECHOES), Juvenile Justice Initiative (JJI), Advocate Intervene Mentor (AIM), and Enhanced Supervision Program (ESP).

# Load dataset
juv <- nycOpenData::nyc_dop_juvenile_cases(limit = 10000)

# Preview first few rows
juv %>%
  head() %>%
  knitr::kable(
    caption = "Sample of the NYC Department of Probation juvenile cases dataset, illustrating the structure and key variables used in the analysis."
  )
Table 5.2: Sample of the NYC Department of Probation juvenile cases dataset, illustrating the structure and key variables used in the analysis.
| borough | supervision_caseload_type | month | year | supervision_caseload_count |
|---|---|---|---|---|
| Citywide | Enhanced Supervision Program | January | 2026 | 279 |
| Citywide | Juvenile Justice Initiative | January | 2026 | 135 |
| Citywide | IMPACT | January | 2026 | 0 |
| Citywide | Advocate Intervene Mentor | January | 2026 | 36 |
| Citywide | Every Child Has An Opportunity To Excel And Succeed | January | 2026 | 0 |
| Citywide | General Supervision | January | 2026 | 640 |

5.3.1 Cleaning Juvenile Cases Data

# Keep 2023-2025, build a Date column, and convert counts to numeric
juv_clean <- juv %>%
  filter(year >= 2023 & year <= 2025) %>%
  mutate(
    month_year = paste(month, year, sep = " "),
    month_year = my(month_year),
    supervision_caseload_count = as.numeric(supervision_caseload_count)
  ) %>%
  # Drop the original columns now that month_year replaces them
  select(-month, -year)

5.4 Combining Datasets

To investigate whether probation caseload size relates to rearrest rates, the datasets were aggregated by month and merged using the month_year variable. This allows direct comparison between monthly rearrest rates and the total number of youth supervised.

# Total caseload per month, summed across supervision programs
juv_month <- juv_clean %>%
  group_by(month_year) %>%
  summarise(
    total_cases = sum(supervision_caseload_count, na.rm = TRUE)
  )

# Attach monthly caseloads to the rearrest rates and make rate numeric
combined_data <- rea_clean %>%
  left_join(juv_month, by = "month_year") %>%
  mutate(rate = as.numeric(rate))

combined_data %>%
  head() %>%
  knitr::kable(
    caption = "Preview of the merged dataset combining juvenile rearrest rates with monthly juvenile probation caseloads. This table confirms successful alignment by month-year and prepares the data for subsequent analysis."
  )
Table 5.3: Preview of the merged dataset combining juvenile rearrest rates with monthly juvenile probation caseloads. This table confirms successful alignment by month-year and prepares the data for subsequent analysis.
| borough | rate | month_year | total_cases |
|---|---|---|---|
| Citywide | 4.5 | 2025-09-01 | 1114 |
| Citywide | 4.4 | 2025-12-01 | 1078 |
| Citywide | 5.3 | 2025-07-01 | 1092 |
| Citywide | 4.4 | 2025-10-01 | 1109 |
| Citywide | 4.6 | 2025-08-01 | 1106 |
| Citywide | 4.5 | 2025-11-01 | 1095 |
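
The aggregate-then-join pattern used in this section can be sketched on toy data. The values below are made up, and the `cases`, `rates`, and `unmatched` objects are hypothetical names used only for this illustration. The sketch also checks for months that fail to find a match, since left_join() keeps them silently with NA:

```r
library(dplyr)

# Toy inputs (made-up values): caseload counts by program, and monthly rates
cases <- tibble(
  month_year = as.Date(c("2025-01-01", "2025-01-01", "2025-02-01")),
  supervision_caseload_count = c(600, 300, 650)
)
rates <- tibble(
  month_year = as.Date(c("2025-01-01", "2025-02-01", "2025-03-01")),
  rate = c(4.5, 4.8, 5.0)
)

# Collapse program-level counts to one total per month
cases_month <- cases %>%
  group_by(month_year) %>%
  summarise(total_cases = sum(supervision_caseload_count, na.rm = TRUE))

# left_join() keeps every rate row; months without caseload data get NA
joined <- rates %>%
  left_join(cases_month, by = "month_year")

# List the months that did not find a match
unmatched <- joined %>%
  filter(is.na(total_cases))
```

In the real data, the regression output in this chapter reports 2 observations dropped for missingness, which is the kind of gap this check would surface before modeling.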

5.5 Analysis

5.5.1 Correlation Analysis

cor(combined_data$rate, combined_data$total_cases, use = "complete.obs")
[1] 0.4935201

The correlation coefficient of about 0.49 indicates a moderate positive association: months with larger supervision caseloads tend to have higher rearrest rates. While correlation does not imply causation, the relationship provides preliminary insight into how supervision capacity may relate to youth outcomes.
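
To judge how strong this correlation is, it can be converted into a share of variance and a significance test. The sketch below uses only the reported coefficient and assumes n = 34 complete monthly observations, a figure inferred from the 32 residual degrees of freedom in the regression output of this chapter:

```r
# Values reported in this chapter
r <- 0.4935201   # Pearson correlation between rate and total_cases
n <- 34          # assumed complete observations (32 residual df + 2)

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

# t statistic and two-sided p-value for the null hypothesis of zero correlation
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(r_squared = r_squared, t = t_stat, p = p_value), 4)
```

On the full data, `cor.test(combined_data$rate, combined_data$total_cases)` should give the same test together with a confidence interval for the correlation.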

5.5.2 Regression Analysis

A simple linear regression was used to test whether monthly probation caseload levels predict rearrest rates.

model <- lm(rate ~ total_cases, data = combined_data)
summary(model)

Call:
lm(formula = rate ~ total_cases, data = combined_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7650 -0.6115  0.2154  0.4185  1.1668 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.303612   0.477335   4.826 3.29e-05 ***
total_cases 0.001675   0.000522   3.210  0.00302 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6221 on 32 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2436,    Adjusted R-squared:  0.2199 
F-statistic:  10.3 on 1 and 32 DF,  p-value: 0.003017
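
To make the coefficients easier to read, the sketch below copies them from the output above and works out what they imply. The caseload levels of 1000 and 1100 are illustrative choices, and the confidence interval is the usual t-based approximation, not something the original output reports:

```r
# Coefficients copied from the summary(model) output above
intercept <- 2.303612
slope     <- 0.001675
slope_se  <- 0.000522
resid_df  <- 32

# Expected change in the rearrest rate for 100 additional cases
change_per_100 <- slope * 100

# Fitted rearrest rates at two illustrative caseload levels
caseloads   <- c(1000, 1100)
fitted_rate <- intercept + slope * caseloads

# Approximate 95% confidence interval for the slope
t_crit   <- qt(0.975, df = resid_df)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

round(c(change_per_100 = change_per_100, ci_low = slope_ci[1], ci_high = slope_ci[2]), 4)
```

Under these assumptions, 100 additional supervised cases correspond to a rearrest rate roughly 0.17 points higher. The interval excludes zero, which agrees with the reported p-value of 0.003, while the R-squared of 0.24 shows that most month-to-month variation is left unexplained.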

5.5.3 Time Series Comparison

combined_scaled <- combined_data %>%
  mutate(
    # scale() returns a one-column matrix; as.numeric() keeps plain vectors
    rate_scaled = as.numeric(scale(rate)),
    cases_scaled = as.numeric(scale(total_cases))
  )

ggplot(combined_scaled, aes(x = month_year)) +
  geom_line(aes(y = rate_scaled, color = "Rearrest Rate"), linewidth = 1) +
  geom_line(aes(y = cases_scaled, color = "Caseload"), linewidth = 1) +
  labs(title = "Rearrest Rates and Juvenile Caseloads Over Time",
       x = "Date", y = "Standardized Value", color = "Variable") +
  theme_minimal()
Figure 5.1: This figure shows Rearrest Rates and Juvenile Caseloads Over Time.

The standardized trend lines show that rearrest rates and caseloads generally move together, suggesting months with higher supervision counts tend to correspond with higher rearrest rates.
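
Standardizing is what lets a rate near 5 and a caseload near 1,100 share one axis. The sketch below applies it to the six rows shown in the merged preview table only, so the resulting z-scores are illustrative and differ from those computed on the full series:

```r
# Six rows from the merged preview table (Table 5.3)
rate        <- c(4.5, 4.4, 5.3, 4.4, 4.6, 4.5)
total_cases <- c(1114, 1078, 1092, 1109, 1106, 1095)

# Manual z-score: subtract the mean, divide by the standard deviation
z_manual <- (rate - mean(rate)) / sd(rate)

# scale() does the same but returns a one-column matrix,
# so as.numeric() flattens it back to a plain vector
z_rate  <- as.numeric(scale(rate))
z_cases <- as.numeric(scale(total_cases))

# Both approaches agree
all.equal(z_manual, z_rate)
```

After this transformation each series has mean 0 and standard deviation 1, so the plot compares the shape of the two trends and not their raw magnitudes.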

5.5.4 Visualizing Caseload and Rearrest Rates

ggplot(combined_data, aes(x = total_cases, y = rate)) +
  geom_point(color = "darkblue", alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Relationship Between Juvenile Caseloads and Rearrest Rates",
    x = "Total Juvenile Probation Caseload",
    y = "Rearrest Rate"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Figure 5.2: This figure shows Relationship Between Juvenile Caseloads and Rearrest Rates.

This scatterplot shows the relationship between total juvenile probation caseloads and rearrest rates each month. The red trend line suggests that months with higher caseloads tend to have slightly higher rearrest rates, though there is variation and other factors likely play a role. While this doesn’t prove causation, it highlights how supervision levels may be related to youth outcomes and can inform resource planning.

5.6 Investigating Middle School vs High School Discharge

# Load and clean dataset
sdis <- nycOpenData::nyc_school_discharge(limit = 10000) %>%
  select(-code, -discharge_description)

# Preview cleaned dataset
sdis %>%
  head() %>%
  knitr::kable(
    caption = "Preview of the NYC school discharge dataset after removing unnecessary variables. The data include discharge counts by school level, district, and discharge type."
  )
Table 5.4: Preview of the NYC school discharge dataset after removing unnecessary variables. The data include discharge counts by school level, district, and discharge type.
| year | report_category | school_level | geographic_unit | school_name | student_category | discharge_category | count_of_students | total_enrolled_students |
|---|---|---|---|---|---|---|---|---|
| 2022-2023 | School | Middle School | 32K562 | EVERGREEN MS FOR URBAN EXPLORATION | Male | Drop Out | s | 180 |
| 2022-2023 | School | Middle School | 75K036 | PS 36 | Female | Discharge out of NYC School | s | 22 |
| 2022-2023 | School | Middle School | 75K036 | PS 36 | Male | Discharge out of NYC School | s | 83 |
| 2022-2023 | School | Middle School | 75K140 | PS K140 | Female | Discharge out of NYC School | s | 23 |
| 2022-2023 | School | Middle School | 75K140 | PS K140 | Male | Discharge out of NYC School | s | 122 |
| 2022-2023 | School | Middle School | 75K141 | PS K141 | Male | Discharge out of NYC School | s | 61 |

5.6.1 Preparing Discharge Data

# Keep male middle/high school rows and drop suppressed ("s") cells,
# which cannot be converted to numbers
sdis_male <- sdis %>%
  filter(
    student_category == "Male",
    school_level %in% c("Middle School", "High School"),
    count_of_students != "s",
    total_enrolled_students != "s"
  ) %>%
  mutate(
    count_of_students = as.numeric(count_of_students),
    total_enrolled_students = as.numeric(total_enrolled_students),
    # Share of enrolled students discharged in each row
    discharge_rate = count_of_students / total_enrolled_students
  )
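
Because suppressed cells are dropped before any rate is computed, it is worth knowing how many rows that removes. The sketch below uses made-up counts (the vectors are hypothetical, not taken from the dataset) to count suppressed cells and to show an alternative that keeps the rows but marks the values as missing:

```r
# Toy counts (made-up), with "s" marking a suppressed cell
count_of_students <- c("12", "s", "7", "s")
total_enrolled    <- c("180", "22", "83", "61")

# How many rows would the filter drop?
n_suppressed <- sum(count_of_students == "s")

# Alternative: keep the rows but mark suppressed cells as missing
count_num <- as.numeric(replace(count_of_students, count_of_students == "s", NA))
rate      <- count_num / as.numeric(total_enrolled)

# Missing rates are skipped when averaging
mean(rate, na.rm = TRUE)
```

Either route gives the same mean rate. The difference is that the second keeps the suppressed rows visible, which helps when judging how much of the data the analysis actually covers.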

5.6.2 Summarizing Discharge Data

sum_r <- sdis_male %>%
  group_by(school_level) %>%
  summarise(
    mean_rate = mean(discharge_rate, na.rm = TRUE),
    sd_rate = sd(discharge_rate, na.rm = TRUE),
    n_schools = n()
  )

sum_r %>%
  knitr::kable(
    caption = "Summary statistics of discharge rates for male students by school level, including the mean discharge rate, standard deviation, and number of schools."
  )
Table 5.5: Summary statistics of discharge rates for male students by school level, including the mean discharge rate, standard deviation, and number of schools.
| school_level | mean_rate | sd_rate | n_schools |
|---|---|---|---|
| High School | 0.0697668 | 0.0778978 | 100 |
| Middle School | 0.0573623 | 0.0607013 | 66 |

5.6.3 Discharge Analyses

5.6.3.1 Independent T-Test

t.test(discharge_rate ~ school_level, data = sdis_male)

    Welch Two Sample t-test

data:  discharge_rate by school_level
t = 1.1492, df = 159.43, p-value = 0.2522
alternative hypothesis: true difference in means between group High School and group Middle School is not equal to 0
95 percent confidence interval:
 -0.008913023  0.033721970
sample estimates:
  mean in group High School mean in group Middle School 
                 0.06976680                  0.05736233 

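The Welch t-test shows no statistically significant difference in mean discharge rates for male students between high school (mean ≈ 0.070) and middle school (mean ≈ 0.057): t = 1.15, df = 159.43, p = 0.25. The 95% confidence interval for the difference (−0.009 to 0.034) includes zero.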
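
The Welch statistic can be reproduced from the group summaries alone, which helps show where the fractional degrees of freedom come from. The sketch below uses the means, standard deviations, and group sizes from Table 5.5; the variable names are just for this illustration:

```r
# Summary statistics from Table 5.5
m_hs <- 0.0697668; s_hs <- 0.0778978; n_hs <- 100   # High School
m_ms <- 0.0573623; s_ms <- 0.0607013; n_ms <- 66    # Middle School

# Squared standard error contributed by each group
v_hs <- s_hs^2 / n_hs
v_ms <- s_ms^2 / n_ms

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m_hs - m_ms) / sqrt(v_hs + v_ms)
df     <- (v_hs + v_ms)^2 / (v_hs^2 / (n_hs - 1) + v_ms^2 / (n_ms - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, df = df, p = p_value), 4)
```

Unlike the pooled-variance t-test, this version does not assume the two groups share a common variance, which suits groups whose standard deviations differ as they do here.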

5.6.3.2 Discharge Types by School Level

table_male <- sdis_male %>%
  group_by(school_level, discharge_category) %>%
  summarise(total = sum(count_of_students), .groups = "drop")

c_tab <- xtabs(total ~ school_level + discharge_category, data = table_male)

knitr::kable(
  c_tab,
  caption = "Contingency table showing total counts of discharge types by school level for male students."
)
Table 5.6: Contingency table showing total counts of discharge types by school level for male students.
| school_level | Discharge out of NYC School | Drop Out |
|---|---|---|
| High School | 957 | 930 |
| Middle School | 1363 | 0 |

5.6.3.3 Chi-Square Analysis

chi_res <- chisq.test(c_tab)

chi_table <- tibble::tibble(
  Statistic = chi_res$statistic,
  Degrees_of_Freedom = chi_res$parameter,
  P_Value = chi_res$p.value
)

chi_table %>%
  knitr::kable(
    caption = "Chi-square test of independence examining the relationship between school level and discharge category for male students."
  )
Table 5.7: Chi-square test of independence examining the relationship between school level and discharge category for male students.
| Statistic | Degrees_of_Freedom | P_Value |
|---|---|---|
| 938.6163 | 1 | 0 |
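
The P_Value of 0 in the table is a display artifact of rounding, not a literal zero. The sketch below rebuilds the test from the counts in Table 5.6 and formats the p-value as a bound. The `p_label` name and the 0.001 threshold are choices made for this illustration:

```r
# Counts from Table 5.6 (male students)
c_tab <- matrix(
  c(957, 930,
    1363, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    school_level = c("High School", "Middle School"),
    discharge_category = c("Discharge out of NYC School", "Drop Out")
  )
)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
chi_res <- chisq.test(c_tab)

# Counts expected if school level and discharge type were independent
round(chi_res$expected, 1)

# Report very small p-values as a bound instead of a rounded 0
p_label <- format.pval(chi_res$p.value, eps = 0.001)
```

The test is driven almost entirely by the empty Drop Out cell for middle schools. That zero may reflect how discharge codes are assigned at each school level rather than a behavioral difference, so the result is best read with that caveat.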

5.6.4 Visualizing Discharge Patterns

5.6.4.1 Discharge Rates by School Level

ggplot(sdis_male, aes(x = school_level, y = discharge_rate)) +
  geom_boxplot() +
  labs(
    title = "Male Student Discharge Rates by School Level",
    x = "School Level",
    y = "Discharge Rate"
  ) +
  theme_minimal()
Figure 5.3: This figure shows Male Student Discharge Rates by School Level.

5.6.4.2 Discharge Type Composition by School Level

ggplot(table_male, aes(x = school_level, y = total, fill = discharge_category)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(
    title = "Distribution of Discharge Types Among Male Students",
    y = "Proportion",
    x = "School Level"
  ) +
  theme_minimal()
Figure 5.4: This figure shows The Distribution of Discharge Types Among Male Students.

5.7 Mapping District-Level Differences

5.7.1 Preparing District-Level Data

# Rebuild the male subset, this time adding a two-digit district code
sdis_male <- sdis %>%
  filter(
    student_category == "Male",
    school_level %in% c("Middle School", "High School"),
    count_of_students != "s",
    total_enrolled_students != "s"
  ) %>%
  mutate(
    count_of_students = as.numeric(count_of_students),
    total_enrolled_students = as.numeric(total_enrolled_students),
    discharge_rate = count_of_students / total_enrolled_students,
    # Leading digits of the unit code identify the district ("32K562" gives "32")
    district = str_extract(geographic_unit, "\\d+"),
    # Pad to two characters; str_pad() keeps missing values as NA,
    # so the filter below can drop units without a district number
    district = str_pad(district, width = 2, pad = "0")
  ) %>%
  filter(!is.na(district))

# Mean discharge rate per district and school level, then the high-minus-middle gap
district_summary <- sdis_male %>%
  group_by(district, school_level) %>%
  summarise(
    mean_rate = mean(discharge_rate, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = school_level,
    values_from = mean_rate
  ) %>%
  mutate(rate_diff = `High School` - `Middle School`)
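
The district key is derived from the leading digits of each geographic unit code. The sketch below shows that step in isolation. The first two codes appear in Table 5.4, while "2" and "Citywide" are made-up inputs added to show padding and the no-digits case:

```r
library(stringr)

# Example identifiers: two from Table 5.4, two made up for illustration
geographic_unit <- c("32K562", "75K036", "2", "Citywide")

# Pull the leading run of digits; entries without digits become NA
district_raw <- str_extract(geographic_unit, "\\d+")

# Left-pad to two characters so "2" matches keys such as "02"
district <- str_pad(district_raw, width = 2, pad = "0")

district
```

Codes beginning with 75 belong to District 75, the citywide special education district, so those rows will not match any of the 32 geographic districts used for the maps later on.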

5.7.2 Districts with the Largest Gaps

district_summary %>%
  arrange(desc(rate_diff)) %>%
  slice_head(n = 5) %>%
  knitr::kable(
    caption = "Top five districts with the largest positive difference between high school and middle school male discharge rates."
  )
Table 5.8: Top five districts with the largest positive difference between high school and middle school male discharge rates.
| district | High School | Middle School | rate_diff |
|---|---|---|---|
| 02 | 0.1417620 | 0.0776684 | 0.0640936 |
| 03 | 0.1694915 | 0.1200000 | 0.0494915 |
| 08 | 0.0825433 | 0.0394737 | 0.0430696 |
| 21 | 0.0538933 | 0.0297701 | 0.0241231 |
| 10 | 0.0634714 | 0.0425624 | 0.0209091 |

5.7.3 Building the Spatial Dataset

nyc_districts <- school_districts(
  state = "NY",
  year = 2022,
  class = "sf",
  progress_bar = FALSE
) %>%
  mutate(
    district = substr(GEOID, nchar(GEOID) - 1, nchar(GEOID))
  ) %>%
  filter(district %in% sprintf("%02d", 1:32))

map_data <- nyc_districts %>%
  left_join(district_summary, by = "district")
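
Since the join key is built from the last two characters of GEOID, a quick check of key coverage on both sides can catch mismatches before mapping. The sketch below uses hypothetical key vectors in place of the real `nyc_districts$district` and `district_summary$district` columns:

```r
# Hypothetical key vectors standing in for the two tables being joined
boundary_keys <- sprintf("%02d", 1:32)       # districts in the boundary layer
summary_keys  <- c("02", "03", "08", "75")   # districts in the summary table

# Districts drawn on the map that have no summary row (shaded as NA)
missing_from_summary <- setdiff(boundary_keys, summary_keys)

# Summary rows with no polygon to attach to (dropped by the left join)
missing_from_map <- setdiff(summary_keys, boundary_keys)

# Position of the first repeated key, or 0 if every key is unique
dup_index <- anyDuplicated(boundary_keys)
```

If `dup_index` were nonzero on the real boundary layer, more than one polygon would share a two-digit key and each would receive the same district values, which would be worth investigating before interpreting the map.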

5.7.4 Static District Map

ggplot(map_data) +
  geom_sf(aes(fill = rate_diff), color = "lightblue", linewidth = 0.2) +
  scale_fill_viridis_c(
    name = "High − Middle\nDischarge Rate",
    na.value = "grey90"
  ) +
  labs(
    title = "Difference in Male Discharge Rates by School Level",
    subtitle = "NYC School Districts",
    caption = "Positive values indicate higher rates in high school"
  ) +
  theme_minimal()
Static choropleth map of New York City school districts shaded by the difference between high school and middle school male discharge rates.
Figure 5.5: District-level difference in male discharge rates between high schools and middle schools across New York City.

5.7.5 Interactive District Map

# Join district summary data to district boundaries
district_map <- nyc_districts %>%
  left_join(district_summary, by = "district") %>%
  sf::st_transform(4326)

# Create color palette
pal <- colorNumeric(
  palette = viridis::viridis(256),
  domain = district_map$rate_diff,
  na.color = "grey90"
)

# Interactive leaflet map, drawn from the layer transformed to EPSG:4326 above
leaflet(district_map) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addPolygons(
    fillColor = ~pal(rate_diff),
    fillOpacity = 0.8,
    color = "white",
    weight = 1,
    label = ~paste(
      "District:", district,
      "<br>High − Middle Rate:",
      round(rate_diff, 3)
    ) %>% lapply(HTML)
  ) %>%
  addLegend(
    pal = pal,
    values = ~rate_diff,
    title = "High − Middle<br>Discharge Rate",
    opacity = 1
  )
Figure 5.6: Interactive map of NYC school districts showing the difference between high school and middle school male discharge rates. With the viridis palette, lighter (yellow) shades indicate districts where high school discharge rates exceed middle school rates by a larger margin.
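
To read the map's shading, it helps to check which end of the viridis palette is dark. The sketch below samples five colors from the same `viridis::viridis()` function that feeds `colorNumeric()` above and compares their brightness. The brightness measure (a simple sum of the red, green, and blue channels) is a rough proxy chosen for this illustration:

```r
library(viridis)

# Five evenly spaced colors from the viridis palette, first to last
cols <- viridis(5)

# Rough brightness of each color: sum of its red, green, and blue channels
brightness <- colSums(col2rgb(cols))

# Brightness rises from the first color to the last
data.frame(color = cols, brightness = brightness)
```

Since the palette is handed to `colorNumeric()` in its default order, the lowest `rate_diff` values should receive the dark purple end and the highest values the bright yellow end.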

5.7.6 Focusing On Five Boroughs

leaflet(district_map) %>%
  addProviderTiles("CartoDB.Positron") %>%
  fitBounds(
    lng1 = -74.30, lat1 = 40.45,   # southwest corner
    lng2 = -73.65, lat2 = 40.95    # northeast corner
  ) %>%
  addPolygons(
    fillColor = ~pal(rate_diff),
    fillOpacity = 0.8,
    color = "white",
    weight = 1,
    label = ~paste(
      "District:", district,
      "<br>High − Middle Rate:",
      round(rate_diff, 3)
    ) %>% lapply(htmltools::HTML)
  ) %>%
  addLegend(
    pal = pal,
    values = ~rate_diff,
    title = "High − Middle<br>Discharge Rate",
    opacity = 1
  )
Figure 5.7: Interactive map focused on the five boroughs of New York City, showing district-level differences between high school and middle school male discharge rates. With the viridis palette, lighter (yellow) shading indicates larger positive differences.

5.8 Notes

Overall, combining probation and school discharge data helps us develop a better understanding of systemic pressures on youth. Trends in caseloads, rearrests, and school discharges together highlight potential intervention points. Future work could include demographic data or other datasets to explore more intersectional patterns and gain deeper insight into the key factors that lead to recidivism among youth.