Predicting Autism Screening Outcomes with Machine Learning

Maiko Hata

Research problem

The big question: How can I design and apply Machine Learning (ML) models without reinforcing existing biases?

Research question

  • Goal: Predict Autism Spectrum Quotient (AQ-10) screener scores from demographic information.

  • Potential benefit: Understanding factors related to scores can improve model clarity and later inform real-world, focused outreach.

  • Ethical & equity considerations: As an Autistic researcher and Early Intervention/Early Childhood Special Education (EI/ECSE) specialist, I want to understand how predictive models are created, what their limitations are, and the ethical considerations.

Data

From Kaggle

  • Numerical (age, screener result)
  • Categorical (gender f/m; jaundice yes/no; family history of Autism yes/no; used app before yes/no; screener result YES (score of 6+) / NO (5 or below); ethnicity; country of residence)

AQ-10 screener

Allison et al. (2012) evaluated shortened 10-item versions of the Autism Spectrum Quotient (AQ-50) as quick screening tools for Autism. Using data from over 1,000 Autistic individuals and 3,000 controls, the short forms showed high sensitivity (accurately identifying most people with Autism) and high specificity (correctly excluding those without).

[Image: AQ-10 screener]

Cleaning up missing data

  • Using ff_glimpse() from the finalfit package to check for missing data:
ff_glimpse(autism[, 1:20])
$Continuous
              label var_type   n missing_n missing_percent mean  sd min
A1_Score   A1_Score    <int> 704         0             0.0  0.7 0.4 0.0
A2_Score   A2_Score    <int> 704         0             0.0  0.5 0.5 0.0
A3_Score   A3_Score    <int> 704         0             0.0  0.5 0.5 0.0
A4_Score   A4_Score    <int> 704         0             0.0  0.5 0.5 0.0
A5_Score   A5_Score    <int> 704         0             0.0  0.5 0.5 0.0
A6_Score   A6_Score    <int> 704         0             0.0  0.3 0.5 0.0
A7_Score   A7_Score    <int> 704         0             0.0  0.4 0.5 0.0
A8_Score   A8_Score    <int> 704         0             0.0  0.6 0.5 0.0
A9_Score   A9_Score    <int> 704         0             0.0  0.3 0.5 0.0
A10_Score A10_Score    <int> 704         0             0.0  0.6 0.5 0.0
result       result    <int> 704         0             0.0  4.9 2.5 0.0
          quartile_25 median quartile_75  max
A1_Score          0.0    1.0         1.0  1.0
A2_Score          0.0    0.0         1.0  1.0
A3_Score          0.0    0.0         1.0  1.0
A4_Score          0.0    0.0         1.0  1.0
A5_Score          0.0    0.0         1.0  1.0
A6_Score          0.0    0.0         1.0  1.0
A7_Score          0.0    0.0         1.0  1.0
A8_Score          0.0    1.0         1.0  1.0
A9_Score          0.0    0.0         1.0  1.0
A10_Score         0.0    1.0         1.0  1.0
result            3.0    4.0         7.0 10.0

$Categorical
                          label var_type   n missing_n missing_percent levels_n
age                         age    <chr> 704         0             0.0       47
gender                   gender    <chr> 704         0             0.0        2
ethnicity             ethnicity    <chr> 704         0             0.0       12
jundice                 jundice    <chr> 704         0             0.0        2
austim                   austim    <chr> 704         0             0.0        2
contry_of_res     contry_of_res    <chr> 704         0             0.0       67
used_app_before used_app_before    <chr> 704         0             0.0        2
age_desc               age_desc    <chr> 704         0             0.0        1
relation               relation    <chr> 704         0             0.0        6
                levels levels_count levels_percent
age                  -            -              -
gender               -            -              -
ethnicity            -            -              -
jundice              -            -              -
austim               -            -              -
contry_of_res        -            -              -
used_app_before      -            -              -
age_desc             -            -              -
relation             -            -              -

Cleaning up missing data

  • Since the dataset used “?” for missing values, I counted how often “?” appeared in each column to check for missing-data patterns; for example, in ethnicity:
sum(autism$ethnicity == "?", na.rm = TRUE)
[1] 95
nrow(autism)
[1] 704
mean(autism$ethnicity == "?", na.rm = TRUE) * 100
[1] 13.49432
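The per-column check above can be run over every column at once; a minimal base-R sketch (the helper name `count_missing_marks` is mine, not from the slides):

```r
# Count occurrences of the "?" missing-value placeholder in each column
count_missing_marks <- function(df, mark = "?") {
  sapply(df, function(col) sum(col == mark, na.rm = TRUE))
}
```

Calling `count_missing_marks(autism)` returns a named vector of “?” counts per column, making it easy to see which variables carry the missingness.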

Descriptive statistics

Autism Screener user results

ggplot(autism, aes(x = Class.ASD, fill = Class.ASD)) +
  geom_bar() +
  scale_fill_manual(values = c("NO" = "#69b3a2", "YES" = "#404080")) +
  labs(title = "Autism Prevalence by Screener Result",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

Descriptive statistics

  • Someone was 383 years old 👵🏼
autism$age <- as.numeric(autism$age)  # age was read in as character

ggplot(autism, aes(x = age)) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(limits = c(0, max(autism$age, na.rm = TRUE) + 10)) +
  labs(title = "Age Distribution",
       x = "Age",
       y = "Count") +
  theme_minimal()

Descriptive statistics

  • Autism Screener user age with filter(age <= 100)
autism <- autism %>% filter(age <= 100)

ggplot(autism, aes(x = age)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue") +
  labs(title = "Age Distribution",
       x = "Age",
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Descriptive statistics

  • Autism Screener user race/ethnicity
ggplot(autism, aes(x = ethnicity)) +
  geom_bar(fill = "steelblue") +
  geom_text(aes(label = after_stat(count)),
            stat = "count",
            vjust = -0.5, size = 3.5) +
  labs(title = "Race/Ethnicity",
       x = "Race/Ethnicity",
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))

Modeling assumptions

Before building the models, it’s important to review the Ordinary Least Squares (OLS) assumptions for linear regression (Ch. 4 & 6, Boehmke, & Greenwell, 2019):

  1. Linearity: There should be a roughly linear relationship between predictors and outcome variables.

  2. Sample size: The number of observations (n) should be larger than the number of predictors (p).

  3. Multicollinearity: The independent variables cannot be highly correlated to each other (p. 269, Vogt & Johnson, 2016).

Modeling assumptions

  • To reduce overfitting, the models used 10-fold cross-validation, providing a more reliable measure of generalizability.
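The idea behind k-fold cross-validation can be sketched in base R. This toy version uses plain lm() rather than the regularized glmnet models from the slides, and all names are illustrative:

```r
# Manual k-fold cross-validation returning the average held-out RMSE
cv_rmse <- function(data, formula, k = 10, seed = 42) {
  set.seed(seed)
  # Randomly assign each row to one of k folds
  folds <- sample(rep(1:k, length.out = nrow(data)))
  rmse <- numeric(k)
  for (i in 1:k) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- lm(formula, data = train)       # fit on k-1 folds
    pred  <- predict(fit, newdata = test)    # predict the held-out fold
    actual <- test[[all.vars(formula)[1]]]
    rmse[i] <- sqrt(mean((actual - pred)^2))
  }
  mean(rmse)  # average error across all k held-out folds
}
```

Because every observation is held out exactly once, the averaged error is a less optimistic estimate than in-sample fit.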

Model 1: Ridge Regularization model

plot(ridge)
ridge$bestTune
   alpha lambda
49     0   0.49

Model 1: Ridge Regularization model

Held-out test metrics

Model                               R²     MAE    RMSE
Baseline (mean-only)                NA     2.263  2.601
Ridge (10-CV, keep_screen=FALSE)    0.309  1.886  2.233

Model 2: Lasso Regularization model

lasso$bestTune
  alpha lambda
6     1  0.005

Model 2: Lasso Regularization model

Held-out test metrics: Baseline vs Ridge vs Lasso

Model                               R²     MAE    RMSE
Baseline (mean-only)                NA     2.263  2.601
Ridge (10-CV, keep_screen=FALSE)    0.309  1.886  2.233
Lasso (10-CV, keep_screen=FALSE)    0.281  1.880  2.241

Model 3: Elastic Net Regression model

plot(elastic)
elastic$bestTune        
   alpha     lambda
26     0 0.09249147
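The three penalties sit on one continuum. In glmnet's parameterization, Elastic Net minimizes

```latex
\min_{\beta} \; \lVert y - X\beta \rVert_2^2
  + \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2
  + \alpha \lVert \beta \rVert_1 \right]
```

so $\alpha = 0$ recovers Ridge and $\alpha = 1$ recovers Lasso; the tuner selecting alpha = 0 here means pure Ridge-style shrinkage fit best.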

Evaluating model performance

The Elastic Net model chose alpha = 0, meaning it behaved like Ridge. With many correlated demographic features, this gentle ridge-style shrinkage worked better than LASSO’s stronger variable-dropping penalty. Ridge and Elastic Net performed the best (R² = 0.309), and while LASSO was slightly lower, all three outperformed the mean-only baseline.

Held-out test metrics: Baseline vs Ridge vs Lasso vs Elastic Net (sorted by R²)

Model                      R²      MAE    RMSE
Ridge (10-fold CV)         0.309   1.886  2.233
Elastic Net (10-fold CV)   0.309   1.886  2.234
Lasso (10-fold CV)         0.281   1.880  2.241
Baseline (mean-only)       -0.004  2.263  2.601
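The metrics in the table can be reproduced with three one-line helpers (a sketch; the function names are mine). Note that when the baseline predicts the training-set mean, held-out R² can dip slightly below zero, which is why the baseline row shows -0.004:

```r
# Held-out evaluation metrics for numeric predictions
r_squared <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}
mae  <- function(actual, pred) mean(abs(actual - pred))
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))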

Model fit - Final model

  • Model efficiency & effectiveness
ggplot(comp_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_col(position = "dodge") +
  labs(title = "Model Performance by Regularization",
       x = "Metric", y = "Value") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

Model fit - Final model

  • Winner (Ridge) 🥇

Model fit - Cut off point & other considerations

  • The AQ-10 uses a cut-off score of 6, but I chose not to create a binary outcome. Modeling the continuous score offered more nuance and reduced the risk of oversimplifying an already limited dataset.

  • The dataset is highly skewed. Because I do not know who was able to or motivated to access this online screener, the sample is not representative of a broader population at all.

  • Given these limitations, linear regression was more informative for exploring patterns.

What I learned: Machine Learning

  • As with any method, the quality of the data determines the quality of the outcome. This is especially critical in ML, where it is easy to run models with only a limited understanding of the dataset and its variables.

  • As Gould et al. (2023) noted, “To… effectively eliminate health disparities requires recognition of the subjectivity of data and of the power of data to dictate and reinforce narratives, accompanied by intentional reform of data practices” (p. 12).

What I learned: Variables & Findings

  • Participants from the U.S., Canada, and Brazil showed stronger positive correlations with screener scores, while those from the UAE, India, South Asia, and New Zealand showed weaker ones. Because reported Autism diagnosis rates are much higher in the U.S. and Canada than in Brazil (World Population Review, 2025), Brazil’s similarity to these countries was unexpected.

  • This likely reflects who chose to participate rather than true population patterns, since the sample is based on self-selection. Without knowing participants’ motivations or access factors, these results are difficult to interpret.

  • Family history of Autism was a meaningful predictor, but findings like these must never be misread as suggesting that certain racial/ethnic or national groups are “more likely” to be Autistic.

Other considerations

  • The AQ was developed using adults in the UK who spoke English, of unknown race, with men overrepresented (Baron-Cohen et al., 2001). This limits how well its norms generalize beyond that sample.
  • I struggled with the ethical implications of applying ML. It is unclear whether participants consented to secondary use of their data. Building these models requires not only methodological care but also ethical reflection about data ownership, economic/environmental impact, and the risks of misinterpretation.

Conclusion

“…all model-building efforts are constrained by the existing data” (Kuhn & Johnson, 2016, p. 61), and equally shaped by the assumptions and interpretive choices researchers make throughout the analytic process.

References

Thank you

Please don’t hesitate to reach out via LinkedIn or email; I would love to hear from you! I also invite you to visit my website, which I built with Quarto, and take a look at my projects.

Maiko Hata, University of Oregon