In this section, we briefly describe the machine learning methods employed to identify the presence of glial cells in synaptic transmission. The models selected for this study—Feedforward Neural Networks (FNNs), Decision Trees, Bagging, Random Forests, and Gradient Boosting—were chosen due to the strong multicollinearity observed among the variables, as these methods do not rely heavily on assumptions of feature independence. We also provide a detailed explanation of the choices and initialization procedures for the parameters and hyperparameters of each classifier.
A.1 Classifiers
A.1.1 Decision trees
A Decision Tree (DTree) (Breiman et al. 1984a) is a supervised machine learning algorithm used for both classification and regression tasks, which splits data into subsets based on the value of input features. It constructs a tree-like structure where each internal node represents a decision (based on a feature) and each leaf node represents an outcome (either a class label or a regression value). In the case of classification, the algorithm selects the feature that best separates the data at each node using criteria such as Gini Impurity (Gini 1912) or Information Gain (Shannon 1948). The tree grows recursively by splitting the data until a stopping condition is met (Fig. 4).
Pseudocode
1. Input: Training data D, stopping criteria (max depth, min samples, etc.)
2. Procedure:
a. Check stopping criteria. If met, stop and return a leaf node.
b. For each feature, calculate the splitting criterion (e.g., Gini).
c. Select the feature and threshold that give the best value of the criterion (e.g., the largest reduction in Gini impurity).
d. Split the data D into two subsets:
D_left = data where feature \(\texttt{<=}\) threshold
D_right = data where feature \(\texttt{>}\) threshold
e. Create a decision node with the selected feature and threshold.
f. Recursively repeat steps (a)-(e) for D_left and D_right to build tree.
3. Output: Decision tree.
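Steps (b) to (d) of the pseudocode can be illustrated with a minimal Python sketch. It assumes numeric features, binary splits of the form feature <= threshold, and Gini impurity as the criterion. The function names `gini` and `best_split` and the toy data are illustrative and are not taken from the study's code.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature index, threshold, weighted child impurity) of the best split.

    X is a list of rows (lists of numbers); y is the list of labels.
    Returns (None, None, parent impurity) if no split improves on the parent.
    """
    n = len(y)
    best = (None, None, gini(y))
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= threshold]
            right = [y[i] for i in range(n) if X[i][j] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, threshold, score)
    return best

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 6.0]]
y = [0, 0, 1, 1]
feature, threshold, impurity = best_split(X, y)
```

A full tree would apply `best_split` recursively to the two resulting subsets until a stopping criterion such as the maximum depth is reached.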
Fig. 4 Decision tree structure illustration. This figure shows the structure of a decision tree. The root node at the top represents the initial condition, leading to branches based on alternative outcomes. Each decision node (orange) represents a condition that splits the data based on specific criteria, leading to further branches. The branches represent possible alternatives, labeled here as “YES” and “NO.” The endpoints of each branch are the leaf nodes (blue), which provide the final decisions or outcomes. The decision tree model uses these nodes and branches to classify or predict outcomes based on input conditions.
A.1.2 Bagging
Bagging (Bootstrap Aggregating), first introduced by Breiman (1996), is an ensemble learning technique that improves the performance of models like Decision Trees by reducing variance. It works by generating multiple subsets of the training data through random sampling with replacement (bootstrap sampling). A Decision Tree model is trained on each subset, and the final prediction is made by aggregating the results from all models, typically using majority voting for classification or averaging for regression. By combining the predictions from multiple trees, Bagging increases stability and accuracy, particularly for models prone to overfitting (Fig. 5).
Pseudocode
1. Input: Training data D, number of models T, base learner (e.g., Decision Tree)
2. Procedure:
a. For each model i from 1 to T:
i. Generate a bootstrap sample D_i by randomly sampling with replacement from D.
ii. Train the base learner (e.g., a Decision Tree) on the bootstrap sample D_i.
b. For classification: Aggregate predictions from each model by majority voting.
For regression: Aggregate predictions from each model by averaging.
3. Output: Final prediction based on aggregated results.
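The procedure above can be sketched in a few lines of Python. To keep the example self-contained, the base learner is a hypothetical one-feature threshold rule (`fit_stump`) rather than a full decision tree, and the data are a made-up one-dimensional toy set; only the bootstrap sampling and the majority vote correspond to the pseudocode.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Hypothetical base learner: the threshold rule with the fewest training errors."""
    best = None
    for t in sorted(set(X)):
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum((left_label if x <= t else right_label) != label
                         for x, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def bagging_fit(X, y, n_models, seed=0):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

X = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(X, y, n_models=25)
predictions = [bagging_predict(models, x) for x in (0.0, 1.0)]
```

For regression, `bagging_predict` would return the mean of the individual predictions instead of the most common label.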
Fig. 5 Bagging algorithm illustration. This figure shows the Bagging (Bootstrap Aggregating) process, where multiple decision trees are trained independently on different bootstrap samples of the training data. Bootstrap sampling is used to create unique datasets for each tree, allowing each tree to learn from a slightly different perspective. The orange nodes represent decision splits based on selected features, while the blue nodes are leaf nodes with the final predictions for each tree. Each tree produces an output (Output 1, Output 2, etc.), and these outputs are aggregated to generate the overall prediction.
A.1.3 Random forest
Random Forest (RF) (Breiman 2001) is an ensemble learning method used for classification and regression. It builds multiple decision trees during training and aggregates their predictions. Each tree is trained on a bootstrap sample of the data, and each split within a tree is chosen from a random subset of the features. This randomness decorrelates the trees, which reduces variance and overfitting and makes the model more robust (Fig. 6).
Pseudocode
1. Input: Training data D, number of trees T, feature set F.
2. Procedure:
a. For each tree t in range(T):
i. Sample a subset of data D_t from D (with replacement).
ii. At each candidate split, select a random subset of features F_t from F.
iii. Build a decision tree on D_t, choosing every split among the features in F_t.
b. Aggregate predictions from all T trees (e.g., majority vote for classification).
3. Output: Random Forest model.
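The feature-sampling step (ii) is what distinguishes a Random Forest from plain Bagging. The sketch below shows one way to draw such a subset, assuming the subset size is given by one of the `max_features` options considered later in this appendix (`sqrt`, `log2`, or a fraction). The helper names are illustrative. In Breiman's formulation and in common implementations, the draw is repeated at every split.

```python
import math
import random

def n_features_per_draw(n_features, max_features):
    """Translate a max_features setting into a number of features."""
    if max_features == "sqrt":
        k = int(math.sqrt(n_features))
    elif max_features == "log2":
        k = int(math.log2(n_features))
    else:  # a fraction of the features, e.g. 0.33
        k = int(max_features * n_features)
    return max(1, k)  # always consider at least one feature

def sample_features(n_features, max_features, rng):
    """Draw one random feature subset (indices, without replacement)."""
    k = n_features_per_draw(n_features, max_features)
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
sizes = {option: n_features_per_draw(16, option) for option in ("sqrt", "log2", 0.33)}
subset = sample_features(16, "sqrt", rng)
```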
Fig. 6 Random forest algorithm illustration. This figure shows a Random Forest model, where multiple decision trees are trained in parallel on subsets of the training data. Each tree is built using a unique combination of observation and feature subset sampling. The orange nodes represent splits in the trees based on selected features, while the blue nodes represent leaf nodes, which contain the final outputs for each tree. The predictions from each tree (Output 1, Output 2, etc.) are aggregated to produce the overall model prediction.
A.1.4 Gradient boosting
Gradient Boosting (GBoost) (Friedman 2001) is another ensemble technique where models are built sequentially. Each new model tries to correct the errors of the previous one by optimizing a loss function. Unlike Random Forest, where trees are independent, Gradient Boosting builds trees that are dependent, with each one learning from the residual errors of its predecessors (Fig. 7).
Pseudocode
1. Input: Training data D, number of trees T, learning rate (h).
2. Procedure:
a. Initialize the model f(x) with a simple prediction (e.g., mean).
b. For each tree t in range(T):
i. Calculate residual errors: r_t = y - f(x) (for each instance; for a general loss, use the negative gradient of the loss with respect to f(x)).
ii. Fit a decision tree tree(x) to the residuals r_t.
iii. Update the model: f(x) = f(x) + h * tree(x).
c. Repeat until T trees are built or stopping criteria are met.
3. Output: Gradient Boosting model.
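The loop above can be made concrete with a small Python sketch. It assumes a squared-error loss on a one-dimensional regression toy problem, so that the residuals are exactly y - f(x), and it uses regression stumps (one split, two constant leaves) as the weak learners. The classifiers in this study optimize a classification loss instead, so this is an illustration of the mechanism only; all names and data are hypothetical.

```python
def fit_stump(x, r):
    """Fit a regression stump to the residuals r by minimizing squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # the largest value would leave the right leaf empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_trees, learning_rate):
    base = sum(y) / len(y)                  # step (a): constant initial prediction
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):                # step (b)
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(tree(xi) for tree in trees)

def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model_1 = gradient_boost(x, y, n_trees=1, learning_rate=0.5)
model_50 = gradient_boost(x, y, n_trees=50, learning_rate=0.5)
```

Comparing `mse(model_1, x, y)` with `mse(model_50, x, y)` shows how the training error shrinks as more trees are added, which is also why the number of trees and the learning rate need tuning to avoid overfitting.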
Fig. 7 Gradient boosting algorithm illustration. The figure shows a sequence of decision trees being trained iteratively. Each tree is trained on data with previous predictions, with orange points representing incorrect predictions and blue points representing correct predictions. The output from each tree is combined (aggregated predictions) to form the final prediction. Each subsequent tree aims to correct errors from the previous trees, enhancing model accuracy through boosting.
A.1.5 Feedforward neural networks
Feedforward Neural Networks (FNNs) (Rosenblatt 1958) are a class of Artificial Neural Networks (McCulloch and Pitts 1943) where connections between the nodes do not form cycles. These networks consist of multiple layers: an input layer, one or more hidden layers, and an output layer. Each node (or neuron) in a layer is connected to every node in the next layer via weights. The network learns by adjusting these weights based on the error between predicted and actual outcomes.
We consider a network with an input vector \((X_1, \ldots , X_p)\), where each \(X_i\) is a predictor, \(i = 1, 2, \ldots , p\). These inputs include information on the presence or absence of astrocytes. The nodes of the first hidden layer are represented by the vector \(\left( X_1^{(1)}, \ldots , X_M^{(1)}\right)\), where each variable \(X_m^{(1)}\) is a function of a linear combination of the input vector elements. Specifically,
$$\begin{aligned} X_m^{(1)} = f\left( \beta _{m0}^{(1)} + \sum _{i=1}^{p} \beta _{mi}^{(1)} X_i\right) , \quad m = 1, 2, \ldots , M, \end{aligned}$$
where \(f\) is a non-linear activation function, \(\beta _{m0}^{(1)}\) is the bias term, and \(\beta _{mi}^{(1)}\) are weights connecting input nodes to the first hidden layer.
Once the values of \((X_1^{(1)}, \ldots , X_M^{(1)})\) are computed, propagation continues sequentially through the network. For a multi-layer neural network with \(H\) hidden layers, each layer \(h\) has \(d_h\) nodes. The recursive formula for the hidden layer nodes follows the same pattern, involving a linear combination of the previous layer’s outputs followed by the activation function.
The output layer, which predicts \(K\) classes, uses similar equations. For each class \(k = 1, 2, \ldots , K\), the output node is defined as a linear combination of the final hidden layer outputs. In our case, where \(K=2\), the output is converted into a probability for binary classification.
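For concreteness, the recursion and the binary output described above can be written as follows. The index conventions (layer superscripts in parentheses, \(\beta _{m0}^{(h)}\) for biases, and \(X_j^{(0)} = X_j\) with \(d_0 = p\) for the inputs) are assumptions made for this sketch, and the sigmoid \(\sigma\) is the output activation named in the hyperparameter description of the FNN:

$$\begin{aligned} X_m^{(h)}&= f\left( \beta _{m0}^{(h)} + \sum _{j=1}^{d_{h-1}} \beta _{mj}^{(h)} X_j^{(h-1)}\right) , \quad m = 1, \ldots , d_h, \quad h = 1, \ldots , H, \\ P(Y = 1 \mid X)&= \sigma \left( \beta _{0}^{(H+1)} + \sum _{j=1}^{d_H} \beta _{j}^{(H+1)} X_j^{(H)}\right) , \quad \sigma (z) = \frac{1}{1 + e^{-z}}. \end{aligned}$$

With \(h = 1\) and \(d_1 = M\), the first line reduces to the expression for the first hidden layer.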
To estimate the parameters of the FNN model, we employed different optimization algorithms. These techniques adjust the weights and biases to minimize a chosen loss function, which in our case was binary cross-entropy, a standard loss function for binary classification tasks (Goodfellow et al. 2016) (Fig. 8).
Pseudocode
1. Input: Training data D, weights W, biases b, learning rate h and number of epochs n.
2. Procedure:
a. Initialize weights W and biases b randomly.
b. For each epoch (iteration):
i. Forward pass: For each layer l, compute:
z_l = W_l * a_(l-1) + b_l (weighted sum)
a_l = activation(z_l) (non-linear transformation)
ii. Compute the error E (e.g., cross-entropy for classification).
iii. Backward pass: Propagate the error backward to compute gradients:
dE/dW and dE/db using the chain rule.
iv. Update weights and biases:
W = W - h * dE/dW
b = b - h * dE/db
3. Output: Trained neural network.
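The training loop above can be sketched in plain Python for a network with one hidden layer. The sketch assumes a tanh hidden layer, a sigmoid output with binary cross-entropy, per-sample gradient updates, biases initialized to zero, and a made-up toy dataset (the logical OR of two inputs). None of these choices is taken from the study's configuration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: tanh hidden layer, then a sigmoid output probability."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    prob = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, prob

def train(data, n_hidden=3, lr=0.5, epochs=500, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            hidden, prob = forward(x, W1, b1, W2, b2)
            p = min(max(prob, 1e-12), 1.0 - 1e-12)  # guard against log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            # Backward pass: for sigmoid + cross-entropy, dE/dz_out = prob - y.
            delta_out = prob - y
            # Chain rule through tanh, whose derivative is 1 - tanh(z)^2.
            delta_hidden = [delta_out * W2[m] * (1.0 - hidden[m] ** 2)
                            for m in range(n_hidden)]
            for m in range(n_hidden):
                W2[m] -= lr * delta_out * hidden[m]
                b1[m] -= lr * delta_hidden[m]
                for i in range(n_in):
                    W1[m][i] -= lr * delta_hidden[m] * x[i]
            b2 -= lr * delta_out
        losses.append(total / len(data))  # mean cross-entropy over the epoch
    return (W1, b1, W2, b2), losses

data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
params, losses = train(data)
predictions = [round(forward(x, *params)[1]) for x, _ in data]
```

In practice the gradients are computed by a deep learning framework, and optimizers such as Adam or RMSprop replace the plain update rule shown here.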
Fig. 8 Feedforward neural network architecture. The figure shows an input layer as black circles, hidden layers as blue circles, and output layer as an orange circle. The circles are the artificial neurons and the arrows represent the connections between the neurons.
A.2 Performance measures
To quantify the model’s performance, we use validation data to compute the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These values form the confusion matrix, which is structured as follows.

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | TP (True positive) | FN (False negative) |
| Actual negative | FP (False positive) | TN (True negative) |
From the confusion matrix, we can calculate several standard machine learning metrics that serve to characterize the model’s performance. These include:
Sensitivity (Recall):
Sensitivity (SN), also known as Recall, measures the model’s ability to correctly identify positive cases among all truly positive cases. The formula for Sensitivity is given by
$$\begin{aligned} \text{SN} = \frac{\text{TP}}{\text{TP}+\text{FN}}. \end{aligned}$$
(A1)
Specificity:
Specificity (SP) measures the proportion of negative cases that are correctly identified by the model relative to the total number of truly negative cases. The formula for Specificity is given by
$$\begin{aligned} \text{SP} = \frac{\text{TN}}{\text{TN}+\text{FP}}. \end{aligned}$$
(A2)
Accuracy:
Accuracy (ACC) measures the proportion of correct predictions made by the model relative to the total number of samples. The formula for Accuracy is given by:
$$\begin{aligned} \text{ACC} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}. \end{aligned}$$
(A3)
Positive Predictive Value (Precision):
Positive Predictive Value (PPV), also known as Precision, measures the proportion of correctly predicted positive cases relative to the total number of cases predicted as positive by the model. The formula for PPV is given by
$$\begin{aligned} \text{PPV} = \frac{\text{TP}}{\text{TP}+\text{FP}}. \end{aligned}$$
(A4)
Negative Predictive Value (NPV):
Negative Predictive Value (NPV) measures the proportion of correctly predicted negative cases relative to the total number of cases predicted as negative by the model. The formula for NPV is given by
$$\begin{aligned} \text{NPV} = \frac{\text{TN}}{\text{TN}+\text{FN}}. \end{aligned}$$
(A5)
F1-Score:
The F1-Score is the harmonic mean of Precision and Sensitivity (Recall). The formula for F1-Score is given by
$$\begin{aligned} \text{F1} = 2 \times \frac{\text{PPV} \times \text{SN}}{\text{PPV} + \text{SN}}. \end{aligned}$$
(A6)
The F1-Score penalizes both false positives and false negatives: it is high only when Precision and Sensitivity are both high, so a higher F1-Score indicates better performance.
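As a worked illustration of Eqs. (A1) to (A6), the following Python sketch computes all six measures from a pair of label lists. It assumes binary labels coded as 1 (positive) and 0 (negative) and non-zero denominators. The function names and the example labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, PPV, NPV and F1 from the four counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sn / (ppv + sn)
    return {"SN": sn, "SP": sp, "ACC": acc, "PPV": ppv, "NPV": npv, "F1": f1}

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)   # TP = 4, TN = 3, FP = 2, FN = 1
scores = classification_metrics(*counts)
```

For these labels, SN = 4/5, SP = 3/5, ACC = 7/10, PPV = 4/6, NPV = 3/4, and F1 = 8/11.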
A.3 Parameters and hyperparameters choice
To evaluate the performance of each hyperparameter configuration, we used 5-fold cross-validation. This involves dividing the data into five folds, training the model on four folds, and validating it on the remaining fold. The process is repeated five times, so that each fold serves once as the validation set, and the mean accuracy across the folds is used to assess model performance. Cross-validation helps mitigate overfitting during model selection, because every configuration is scored on data that was not used to fit it, which gives an estimate of how well the model generalizes to unseen data (Kohavi 1995). Given that our dataset is balanced, accuracy was chosen as the primary metric for model selection.
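The 5-fold procedure can be sketched as follows. The sketch assumes a shuffled, unstratified split and a hypothetical `fit` function that returns a prediction function. The threshold classifier and the toy data only serve to make the example runnable.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train indices, validation indices) pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f in range(k) if f != i for j in folds[f]]
        splits.append((train, folds[i]))
    return splits

def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Mean validation accuracy over the k folds."""
    scores = []
    for train, val in k_fold_indices(len(X), k, seed):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in val)
        scores.append(correct / len(val))
    return sum(scores) / k

def fit_threshold(X_train, y_train):
    """Hypothetical classifier: threshold halfway between the two class means."""
    mean0 = sum(x for x, label in zip(X_train, y_train) if label == 0) / y_train.count(0)
    mean1 = sum(x for x, label in zip(X_train, y_train) if label == 1) / y_train.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: int(x > cut)

X = [i / 20 for i in range(20)]
y = [int(i >= 10) for i in range(20)]
splits = k_fold_indices(len(X))
accuracy = cross_val_accuracy(fit_threshold, X, y)
```

In a tuning loop, `cross_val_accuracy` would be evaluated once per hyperparameter configuration and its value returned to the optimizer as the score to maximize.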
For hyperparameter tuning, we applied Bayesian Optimization using the Optuna framework (Akiba et al. 2019). Bayesian optimization, first introduced by Mockus (1978), is particularly efficient for navigating large and complex hyperparameter spaces, as it balances exploration and exploitation to identify optimal configurations. This made it a suitable choice for our study.
Mathematically, Bayesian optimization builds a surrogate model \(p(y \mid \mathbf{x})\), where \(\mathbf{x}\) represents a hyperparameter configuration and \(y\) is the corresponding performance score (e.g., accuracy). In Optuna, we employed a Tree-structured Parzen Estimator (TPE) as the surrogate model (Bergstra et al. 2011), which models the likelihood ratio as follows:
$$\begin{aligned} l(\mathbf{x}) = \frac{p(\mathbf{x} \mid y > y^*)}{p(\mathbf{x} \mid y \le y^*)}, \end{aligned}$$
where \(y^*\) is a threshold separating good from poor evaluations of the objective function, typically chosen based on a quantile of completed trials. Note that Optuna’s default TPE sampler does not split at the median: unless a different gamma function is supplied, it treats roughly the best 10% of completed trials (at most 25 trials) as good.
The TPE models the two distributions \(p(\mathbf{x} \mid y > y^*)\) and \(p(\mathbf{x} \mid y \le y^*)\) using kernel density estimation (KDE) (Rosenblatt 1956; Parzen 1962) or similar techniques. In Optuna, the TPE implementation uses KDE to model the hyperparameter distributions, constructing Gaussian mixture models (Bishop 2006a) for both good and poor trials. This allows the algorithm to estimate the likelihood ratio \(l(\mathbf{x})\) and sample new hyperparameters by maximizing this ratio, effectively focusing the search on promising regions of the hyperparameter space.
As trials progress, \(y^*\) is dynamically updated, allowing the model to refine its search and improve efficiency. By updating the surrogate model based on each trial’s results, Bayesian optimization efficiently navigates the hyperparameter space.
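The idea behind the likelihood ratio can be illustrated with a deliberately simplified, one-dimensional Python sketch. It is not Optuna's implementation: it assumes a single continuous hyperparameter, a fixed Gaussian kernel bandwidth, a fixed list of candidate values instead of samples drawn from the density of good trials, and an illustrative `good_fraction` of 0.25. All names and numbers are hypothetical.

```python
import math

def kde(points, bandwidth):
    """Gaussian kernel density estimate built from a list of 1-D points."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm
    return density

def tpe_suggest(trials, candidates, good_fraction=0.25, bandwidth=0.1):
    """Pick the candidate with the largest ratio of good to poor density.

    trials is a list of (hyperparameter value, score) pairs; higher scores are better.
    """
    ranked = sorted(trials, key=lambda trial: trial[1], reverse=True)
    n_good = max(1, math.ceil(good_fraction * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]   # trials with y > y*
    poor = [x for x, _ in ranked[n_good:]]   # trials with y <= y*
    density_good, density_poor = kde(good, bandwidth), kde(poor, bandwidth)
    return max(candidates, key=lambda x: density_good(x) / (density_poor(x) + 1e-12))

# Toy objective whose score peaks at a hyperparameter value of 0.7.
trials = [(i / 10, 1.0 - (i / 10 - 0.7) ** 2) for i in range(1, 10)]
candidates = [i * 0.05 for i in range(21)]
suggestion = tpe_suggest(trials, candidates)
```

On this toy problem the suggested value lies at the peak of the scores, where good trials are dense and poor trials are sparse.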
We employed Optuna’s Median Pruner to reduce computational cost by halting underperforming trials early, using the following settings:
$$\begin{aligned} \text{n}\_\text{startup}\_\text{trials} = 5, \quad \text{n}\_\text{warmup}\_\text{steps} = 30. \end{aligned}$$
The optimization was conducted over 100 trials, with each trial representing a different configuration of hyperparameters. The combination of Bayesian optimization and cross-validation helped improve model performance while minimizing overfitting. This framework, incorporating Bayesian optimization, early pruning, and cross-validation, provided an efficient and reliable method for identifying a well-performing model for our task.
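The effect of the two pruner settings can be illustrated with a simplified median rule in Python. This is a sketch of the idea and not Optuna's code: it assumes that higher intermediate values are better, compares the current value (rather than the best value seen so far in the trial) with the median of earlier trials at the same step, and uses made-up trial histories.

```python
from statistics import median

def should_prune(step, value, completed_histories,
                 n_startup_trials=5, n_warmup_steps=30):
    """Return True if the running trial should be stopped at this step.

    completed_histories holds one list of intermediate values per earlier trial.
    """
    if len(completed_histories) < n_startup_trials:
        return False  # too few finished trials to form a reliable median
    if step < n_warmup_steps:
        return False  # never prune during the warm-up steps
    reported = [h[step] for h in completed_histories if len(h) > step]
    if not reported:
        return False
    return value < median(reported)

# Five earlier trials with constant validation accuracy over 40 steps.
histories = [[v] * 40 for v in (0.5, 0.6, 0.7, 0.8, 0.9)]
decisions = [
    should_prune(35, 0.6, histories),      # below the median of 0.7
    should_prune(35, 0.8, histories),      # above the median
    should_prune(10, 0.1, histories),      # still in the warm-up phase
    should_prune(35, 0.1, histories[:3]),  # fewer than five finished trials
]
```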
A.4 Decision trees
In our optimization of the Decision Tree Classifier, we selected a range of hyperparameters that are important for controlling the complexity and performance of the model. The criterion hyperparameter, set to either gini or entropy, determines the function used to measure the quality of a split (Breiman et al. 1984b; Quinlan 1986). The gini criterion assesses node impurity, while entropy is based on information gain.
The min_samples_split parameter, ranging from 2 to 50, controls the minimum number of samples required to split an internal node, helping to prevent the creation of nodes that could lead to overfitting. Setting this parameter within a broader range allows the model to consider both simpler and more complex trees. The ccp_alpha parameter, which varies from 0 to 0.2, helps control tree pruning through cost complexity pruning, reducing overfitting by pruning branches that contribute little to the model (Breiman 1984). The min_samples_leaf parameter, set between 1 and 50, specifies the minimum number of samples that must be present in a leaf node, ensuring that leaves contain enough samples to generalize effectively. Lastly, we defined the max_depth parameter to range from 2 to 30, limiting the maximum depth of the tree and helping control the model’s complexity. This hyperparameter space was selected to allow for a balanced search for configurations that improve both model performance and generalization across different conditions.
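The search space described above can be written as a function of an Optuna-style trial object. The sketch below uses only the `suggest_categorical`, `suggest_int`, and `suggest_float` methods. To stay runnable without Optuna installed, it includes a hypothetical stand-in class, `RandomTrial`, that samples uniformly at random. The function name `decision_tree_space` is likewise illustrative.

```python
import random

def decision_tree_space(trial):
    """Map a trial to keyword arguments for a decision tree classifier."""
    return {
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 50),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 50),
        "max_depth": trial.suggest_int("max_depth", 2, 30),
        "ccp_alpha": trial.suggest_float("ccp_alpha", 0.0, 0.2),
    }

class RandomTrial:
    """Hypothetical stand-in mimicking three methods of an Optuna trial."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def suggest_categorical(self, name, choices):
        return self.rng.choice(choices)

    def suggest_int(self, name, low, high):
        return self.rng.randint(low, high)  # both bounds inclusive

    def suggest_float(self, name, low, high):
        return self.rng.uniform(low, high)

params = decision_tree_space(RandomTrial(seed=0))
```

With Optuna and scikit-learn installed, the returned dictionary could be passed as `DecisionTreeClassifier(**params)` inside an objective function that returns the mean 5-fold accuracy, with the objective handed to `study.optimize(objective, n_trials=100)`.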
A.5 Bagging
In our optimization of the Bagging Classifier, we focused on the n_estimators parameter, which controls the number of base models in the ensemble. We set its range between 10 and 1000 to explore both smaller and larger ensembles. While a higher number of estimators can improve performance by reducing variance, it also increases computational cost. Our goal was to balance performance and efficiency by optimizing this parameter.
A.6 Random forest
In our optimization of the Random Forest Classifier, we selected a comprehensive set of hyperparameters crucial for controlling the model’s complexity and enhancing its performance. The n_estimators parameter, ranging from 100 to 1000, represents the number of trees in the forest. A larger number of estimators generally improves accuracy and stability, because averaging over more trees reduces the variance of the ensemble, at the price of a higher computational cost. The max_depth parameter, set between 2 and 30, limits the maximum depth of each tree, helping to prevent overfitting by restricting how deep the trees can grow. The min_samples_split hyperparameter, ranging from 2 to 50, determines the minimum number of samples required to split an internal node, thus avoiding the creation of nodes that may be too specific to the training data. The choice of the criterion, which can be either gini or entropy, influences the function used to measure the quality of splits, allowing the model to explore different methods of creating decision boundaries. Additionally, the ccp_alpha parameter, varying from 0 to 0.1, is critical for controlling cost complexity pruning, enabling the model to prune unimportant branches and further combat overfitting. The min_samples_leaf parameter, set between 1 and 50, ensures that each leaf node contains a sufficient number of samples, enhancing generalization. Finally, the max_features parameter allows us to specify the number of features to consider when looking for the best split, with options including sqrt, log2, and a fraction (0.33), promoting diversity among the trees in the forest. The hyperparameter space was designed to enable comprehensive exploration of configurations, helping to identify settings that balance performance and computational efficiency, which is important for building effective machine learning models (Breiman 2001).
A.7 Gradient boosting
In our optimization of the Gradient Boosting Classifier, we selected key hyperparameters to fine-tune model performance and manage complexity. The n_estimators parameter, ranging from 100 to 1000, indicates the number of boosting stages (or trees) in the ensemble. While more estimators can reduce bias and improve accuracy, they also increase the risk of overfitting, requiring careful tuning. The max_depth parameter, set between 2 and 30, limits the depth of each tree, helping to control overfitting by constraining model complexity. We included the loss hyperparameter, set to either log-loss or exponential, to specify the loss function used in optimization, allowing the model to adapt its learning strategy to the data characteristics.
The min_samples_split hyperparameter, ranging from 2 to 50, determines the minimum number of samples required to split an internal node, which helps regulate complexity and prevent overfitting to minor variations in the dataset. The ccp_alpha parameter, varying from 0 to 0.1, controls cost complexity pruning, eliminating less informative splits to enhance generalization. The min_samples_leaf parameter, set between 1 and 50, ensures that each leaf node contains enough samples, further aiding generalization.
Lastly, the learning_rate hyperparameter, ranging from \(1 \times 10^\) to \(2 \times 10^\), controls the contribution of each tree to the final model, adjusting how the model updates with each new tree. This hyperparameter space was designed to explore configurations that improve the Gradient Boosting Classifier’s performance while balancing accuracy and computational efficiency.
A.8 Feedforward neural networks
In our optimization of the Feedforward Neural Networks (FNNs), we focused on a variety of hyperparameters important for improving the model’s performance and ensuring effective training. The optimizer parameter plays a key role as it determines the algorithm used to update the weights during training. We included options such as adam, rmsprop, and sgd (Robbins and Monro 1951; Hinton 2012; Kingma and Ba 2014) to allow the model to explore different optimization strategies, each with its strengths in handling various datasets. The activation function was chosen from relu (Nair and Hinton 2010) and tanh (LeCun et al. 1998). For simplicity, a single activation function was used for all hidden layers, which keeps the architecture straightforward while still introducing non-linearity. We set the num_layers parameter to range from 1 to 5, allowing for flexibility in the model’s depth. This range keeps the computational cost manageable while still allowing deeper networks to be explored, which can capture more complex patterns in the data. For binary classification tasks, we use the sigmoid function in the output layer, which converts the linear combination into a probability value between \(0\) and \(1\) (Bishop 2006b).
The dropout_rate, ranging from 0.0 to 0.5, helps prevent overfitting by randomly dropping a fraction of neurons during training, encouraging the model to learn more robust features. The learning_rate parameter, set between \(1 \times 10^\) and \(1 \times 10^\), controls the step size during weight updates, which is crucial for converging to the optimal solution effectively.
For weight initialization, we employed the He uniform initializer for the hidden layers using the ReLU activation function, which is particularly suited for maintaining a balanced variance of activations across layers, mitigating the vanishing gradient problem (He et al. 2015). Conversely, we utilized the Glorot uniform initializer for layers with the tanh activation function. This initializer is effective in preserving the variance of both positive and negative activations, thus enhancing the network’s ability to learn effectively without encountering issues associated with saturation in the activation function (Glorot and Bengio 2010).
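Both initializers draw weights from a uniform distribution on \([-L, L]\) and differ only in the limit \(L\): \(\sqrt{6 / \text{fan\_in}}\) for He uniform and \(\sqrt{6 / (\text{fan\_in} + \text{fan\_out})}\) for Glorot uniform, where fan_in and fan_out are the numbers of incoming and outgoing connections of the layer. A minimal Python sketch follows, with illustrative function names and layer sizes:

```python
import math
import random

def he_uniform(fan_in, fan_out, rng):
    """He uniform: limit sqrt(6 / fan_in), intended for ReLU layers."""
    limit = math.sqrt(6.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot uniform: limit sqrt(6 / (fan_in + fan_out)), intended for tanh layers."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
W_he = he_uniform(fan_in=8, fan_out=4, rng=rng)          # one row of weights per output node
W_glorot = glorot_uniform(fan_in=8, fan_out=4, rng=rng)
```

For the same layer, the Glorot limit is the smaller of the two, since it also accounts for the number of outgoing connections.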
The defined hyperparameter space allows for extensive exploration and fine-tuning of the FNN architecture, optimizing generalization to unseen data while balancing computational efficiency.