Enhancing ASD detection accuracy: a combined approach of machine learning and deep learning models with natural language processing

Data extraction and pre-processing

The initial step involved manually identifying users for the experiments by searching for specific keywords, including ’Autism’, ’ASD’, ’Asperger’, ’Aspie’, ’Autistic’, and ’ActuallyAutistic’, within their biographies. This selection process was carried out diligently, with each profile being manually reviewed. Consequently, several users were excluded as they did not correspond to individuals claiming to have ASD. The following user categories were discarded from the ASD group:

Profiles belonging to organizations or societies.

Users who identified themselves as ASD advocates rather than patients.

Family members of individuals with ASD, such as those who mentioned being the ’Father of an ASD kid’, ’Mother of an ASD kid’, or part of an ’ASD family’.

Data was obtained from public Twitter users using a Python script programmed to interact with Twitter’s developer API. This facilitated the extraction of user publications, which were then exported to a CSV file. The tweets were collected from January 1st, 2017, to January 31st, 2022, covering a period of approximately five years. The dataset was designed to consist of two groups: a representative number of ASD patients and a group of individuals without ASD.
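The export step can be sketched as follows. This is a minimal illustration, not the authors' script: the field names and the `export_tweets_csv` helper are assumptions, and in practice the tweet dicts would be supplied by a Twitter API client rather than built by hand.

```python
import csv

def export_tweets_csv(tweets, path):
    # Write one row per tweet; each tweet is a dict with the fields
    # kept for the dataset (field names are illustrative assumptions).
    fieldnames = ["user_id", "created_at", "text"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(tweets)
```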

Subsequently, several pre-processing steps were applied to the data, which are outlined as follows:

Removal of duplicate tweets or those with identical content.

Elimination of retweets posted by the users.

Exclusion of tweets that were not extracted correctly or in their entirety.

Removal of tweets automatically published by users through sharing options from other platforms like YouTube and Facebook.
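The four cleaning steps above can be sketched as a single filtering pass. The matching rules here (the `RT @` prefix, the domain list, treating empty text as a broken extraction) are illustrative heuristics assumed for this sketch, not the authors' exact criteria.

```python
def clean_tweets(tweets):
    """Filter a list of tweet dicts per the four pre-processing steps."""
    auto_share_domains = ("youtube.com", "youtu.be", "facebook.com")
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = tweet.get("text", "").strip()
        if not text:                                    # not extracted correctly
            continue
        if text.startswith("RT @"):                     # retweet
            continue
        if any(d in text for d in auto_share_domains):  # auto-shared post
            continue
        if text in seen:                                # duplicate content
            continue
        seen.add(text)
        cleaned.append(tweet)
    return cleaned
```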

For the experiment, a total of 252 users were considered, with 221 classified as ASD users and 31 classified as non-ASD users. Prior to the pre-processing procedure, the dataset consisted of 1,014,723 classified tweets. After undergoing the aforementioned steps, the dataset was reduced and cleaned, resulting in 404,627 tweets. From the complete dataset, a subset of 90,000 tweets was selected with an equal distribution of 45,000 from ASD and non-ASD users respectively.
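Selecting the balanced 45,000/45,000 subset could be done as below. The paper only states the final split, so the uniform random sampling (and the fixed seed) is an assumption of this sketch.

```python
import random

def balanced_subset(asd_tweets, non_asd_tweets, n_per_class=45000, seed=0):
    # Draw the same number of tweets from each group so the subset
    # is evenly distributed between the two classes.
    rng = random.Random(seed)
    n = min(n_per_class, len(asd_tweets), len(non_asd_tweets))
    return rng.sample(asd_tweets, n), rng.sample(non_asd_tweets, n)
```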

Implementation of machine learning and deep learning models

The dataset was randomly divided into training and testing sets, with 75% allocated for training the models and 25% for testing. The primary objective was to identify the best-performing model and compare the results to determine the most accurate model in this specific context. To achieve optimal results, an investigation of the best hyperparameters, which contribute to improving model performance, was conducted. This process, known as hyperparameter search, was facilitated using the GridSearchCV utility from the scikit-learn Python library.
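The split-and-search procedure can be sketched with scikit-learn as below. The toy corpus, the TF-IDF vectorizer, and the small KNN grid are illustrative assumptions; the paper does not specify its feature extraction or parameter grids.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the tweet dataset (1 = ASD group).
texts = ["sunny happy day", "happy fun day", "great sunny fun", "happy great day",
         "sad rainy night", "rainy gloomy night", "sad gloomy rain",
         "gloomy sad night"] * 2
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 2

# 75%/25% train-test split, as in the experiment.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Exhaustive search over a small illustrative hyperparameter grid.
pipeline = make_pipeline(TfidfVectorizer(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X_train, y_train)
```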

The hyperparameters for each ML model are outlined below, in Table 2:

Table 2 ML models’ hyperparameters

The RNN model is made up of the following layers:

Embedding layer.

Simple RNN layer with 64 units.

Two fully connected layers with dropout between them.

The final output layer has a single neuron, as it is responsible for classifying the sample.
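The layer stack above can be sketched as follows, here in PyTorch as an illustrative framework. Only the 64 RNN units come from the text; the vocabulary size, embedding dimension, dense-layer widths, and dropout rate are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, 64, batch_first=True)  # simple RNN, 64 units
        self.fc1 = nn.Linear(64, 32)     # first fully connected layer
        self.dropout = nn.Dropout(0.5)   # dropout between the two dense layers
        self.fc2 = nn.Linear(32, 16)     # second fully connected layer
        self.out = nn.Linear(16, 1)      # single output neuron

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, hidden = self.rnn(embedded)            # final hidden state: (1, B, 64)
        h = torch.relu(self.fc1(hidden.squeeze(0)))
        h = torch.relu(self.fc2(self.dropout(h)))
        return torch.sigmoid(self.out(h))         # probability of the ASD class
```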

In Fig. 3, the scheme of the RNN architecture is shown.

Fig. 3

Representation of the RNN model architecture

The LSTM model is made up of the following layers:

The input passes through a text-vectorization process.

Embedding layer.

LSTM layer with 64 units.

One fully connected layer.

The final output layer has a single neuron, as it is in charge of classifying the sample.

The only difference between the LSTM and Bi-LSTM architectures is the LSTM or Bi-LSTM layer. In Fig. 4, the schemes of the LSTM and Bi-LSTM architectures are shown.
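Because the two architectures differ only in that one layer, both can share a single skeleton with a bidirectionality switch, sketched below in PyTorch as an illustrative framework. The 64 units per direction come from the text; the remaining sizes are assumptions, and text vectorization (tokens to integer ids) is assumed to happen upstream.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, bidirectional=False):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The only structural difference between LSTM and Bi-LSTM:
        self.lstm = nn.LSTM(embed_dim, 64, batch_first=True,
                            bidirectional=bidirectional)  # 64 units per direction
        width = 64 * (2 if bidirectional else 1)
        self.fc = nn.Linear(width, 16)   # the single fully connected layer
        self.out = nn.Linear(16, 1)      # single output neuron

    def forward(self, token_ids):
        outputs, _ = self.lstm(self.embedding(token_ids))
        h = torch.relu(self.fc(outputs[:, -1, :]))  # last time step
        return torch.sigmoid(self.out(h))           # probability of the ASD class
```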

Fig. 4

Representation of the LSTM and Bi-LSTM model architectures

Results

Three ML models, namely decision trees, XGB, and KNN, were trained, alongside five DL models, namely RNN, LSTM, Bi-LSTM, BERT, and BERTweet. The results, displayed in Table 3, support the hypothesis that some basic DL models achieve higher accuracy than the ML models tuned with hyperparameters.

Table 3 Results of ML and DL models

Figure 5 displays the confusion matrices for eight different classification models utilized in a binary classification task aimed at identifying individuals with Autism Spectrum Disorder (ASD).

Fig. 5figure 5

Confusion matrices of the 8 trained models (Decision Trees, XGB, KNN, RNN, LSTM, Bi-LSTM, BERT & BERTweet)
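Each matrix in Fig. 5 is computed from the true and predicted labels on the test set. The scikit-learn call below shows the layout used to read off true/false positives and negatives; the label vectors here are toy values, not results from the paper.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for the binary task (1 = ASD, 0 = non-ASD).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are true labels, columns are predictions:
# cm[0, 0] = TN, cm[0, 1] = FP, cm[1, 0] = FN, cm[1, 1] = TP
```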

The BERTweet model stands out as the top-performing model, exhibiting a high number of true positives and true negatives, indicating its strong ability to accurately identify individuals with and without ASD. As a deep learning model, BERTweet leverages neural networks to discern intricate patterns within the input data. This highlights the potential of deep learning models in extracting relevant patterns, thus enhancing the precision of classification.

Although hyperparameter optimization was performed for the machine learning models, the BERTweet model still outperformed them. The KNN model achieved the lowest accuracy at 60.8%, followed by the decision tree with 61.2%, LSTM with 69.5%, RNN with 69.9%, and Bi-LSTM and XGB with accuracies of 70.3% and 71.6%, respectively. Notably, the BERT-based models achieved the best accuracies: 84.3% for BERT and 87.7% for BERTweet, making BERTweet the most accurate model overall.

In summary, the analysis of the confusion matrices emphasizes the importance of selecting the appropriate model for detecting ASD and evaluating its performance using metrics such as confusion matrices. The exceptional accuracy of the BERTweet model and its ability to learn complex patterns in the data suggest that deep learning models have the potential to significantly enhance the accuracy of classification tasks involving individuals with and without ASD.
