Session: 15-01-01: ASME International Undergraduate Research and Design Exposition
Paper Number: 144594
144594 - A Novel Active Learning Framework for Data-Driven Design Datasets
Many existing design datasets are generated by randomly or pseudorandomly sampling design parameters. While these methods effectively guarantee coverage of the design space, they can lead to overrepresentation of undesired performance values, increased failure rates, and decreased accuracy of surrogate regressors. This work aims to mitigate these issues by proposing a novel constraint-aware Active Learning algorithm. We also aim to encourage further adoption of ‘smart’ sampling techniques that can increase the value of generated design datasets.
Our algorithm can handle unique constraints on both performance values and design parameters, and it is compatible with design and performance spaces of any dimensionality.
We separate the algorithm into two main steps: the querying strategy and the design-dropping (teaching) strategy. Below is a preliminary outline of some of the algorithm components.
Error Estimation:
We calculate the residuals of the predictions on the testing data to estimate the error of a performance value regressor.
We then normalize the residuals and train a K-Nearest-Neighbors (KNN) model to predict the error at any point in the design space. We use a KNN because its predictions are bounded by the range of the normalized residuals, from 0 to 1.
Note that this error estimation method could be substituted with Bayesian methods. We do not test these alternative estimation configurations, so they may yield different performance.
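Below is a minimal sketch of this residual-based error estimator, assuming a scikit-learn-style regressor and a held-out test split; the function and variable names are illustrative only.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def fit_error_estimator(regressor, X_test, y_test, k=5):
    # Absolute residuals of the trained performance regressor on the test split.
    residuals = np.abs(y_test - regressor.predict(X_test))
    # Normalize to [0, 1] so that any predicted error also stays in [0, 1].
    max_residual = residuals.max()
    normalized = residuals / max_residual if max_residual > 0 else residuals
    # A KNN regressor only interpolates among its training targets, so its
    # output is bounded by the range of the normalized residuals.
    error_model = KNeighborsRegressor(n_neighbors=min(k, len(X_test)))
    error_model.fit(X_test, normalized)
    return error_model  # error_model.predict(X) yields estimated errors in [0, 1]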
Querying Strategy:
1. If there is no initial training data, uniformly select and return points in the design space as the query batch. Otherwise, for every point in the pool, calculate the harmonic mean of the estimated errors across all performance regressors, and calculate the distance matrix from the pool points to the labeled (trained) points.
2. Using the testing data (described in the Teaching Strategy), set a proximity weight: it is set higher when regressor accuracy is lower and lower when accuracy is higher. A higher proximity weight maximizes exploration, whereas a lower proximity weight maximizes exploitation.
3. Use the following experimentally derived formula to "score" each point based on the predicted error in step 1:
scores = (proximity_weight + (1 - proximity_weight) * error)^proximity_weight
4. Normalize the scores to a probability distribution.
5. Randomly select a point from the pool with the weighted probability distribution.
6. Create a predicted error interval centered at the predicted error of the selected point. The width of this interval is a hyperparameter, which we arbitrarily set to 0.2.
7. Recalculate the distance to the nearest labeled neighbor for all points in the pool, and choose the point farthest from any labeled point whose predicted error lies within the interval.
8. Add the chosen point to the batch, remove the point from the pool, and recompute the distance matrix, treating the chosen point as a labeled one.
9. Repeat steps 3-8 until the batch is full.
10. Return the batch. A code sketch of this querying loop is given below.
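The following is a minimal sketch of the querying loop (steps 3-8), assuming the per-point estimated errors, the labeled design points, and a Euclidean distance metric are available as NumPy arrays; the names are illustrative rather than the exact implementation.

import numpy as np

def query_batch(X_pool, X_labeled, errors, proximity_weight, batch_size,
                interval_width=0.2, rng=None):
    rng = rng or np.random.default_rng()
    pool_idx = np.arange(len(X_pool))
    labeled = np.asarray(X_labeled, dtype=float)
    batch = []
    for _ in range(batch_size):
        err = errors[pool_idx]
        # Step 3: experimentally derived scoring formula.
        scores = (proximity_weight + (1 - proximity_weight) * err) ** proximity_weight
        # Steps 4-5: normalize to a probability distribution and sample a seed point.
        probs = scores / scores.sum()
        seed = rng.choice(len(pool_idx), p=probs)
        # Step 6: predicted error interval centered at the seed point's error.
        low, high = err[seed] - interval_width / 2, err[seed] + interval_width / 2
        in_range = (err >= low) & (err <= high)
        # Step 7: distance from each remaining pool point to its nearest labeled point.
        dists = np.min(np.linalg.norm(
            X_pool[pool_idx][:, None, :] - labeled[None, :, :], axis=-1), axis=1)
        chosen = pool_idx[in_range][np.argmax(dists[in_range])]
        # Step 8: add to the batch, remove from the pool, and treat as labeled.
        batch.append(int(chosen))
        labeled = np.vstack([labeled, X_pool[chosen]])
        pool_idx = pool_idx[pool_idx != chosen]
    return batch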
Teaching Strategy:
1. Uniformly select 20% of the training data as a held-out testing set; this set is used for error estimation.
2. Retrain the invalidity classifiers and performance regressors with the training data (not including the testing data).
3. Use the following experimentally derived formula to compute a validity score for each point:
score = ∏_{i=1}^{n} P_i^{C_i}
where n is the number of performance values, P_i is the validity probability predicted by the i-th performance value validity classifier, and C_i is a confidence value computed as a function of the distance to the nearest labeled point. Note that terms with lower confidence are biased towards 1.
4. Drop points whose validity score is lower than a certain threshold, as sketched below.
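A minimal sketch of this filtering step follows. The exponential confidence function and the classifier interface (scikit-learn-style predict_proba with column 1 as the 'valid' class) are assumptions made for illustration, not choices prescribed above.

import numpy as np

def drop_unpromising(X_pool, validity_classifiers, nearest_labeled_dist, threshold,
                     length_scale=1.0):
    # Confidence decays with distance to the nearest labeled point (illustrative choice).
    confidence = np.exp(-nearest_labeled_dist / length_scale)
    score = np.ones(len(X_pool))
    for clf in validity_classifiers:
        p_valid = clf.predict_proba(X_pool)[:, 1]  # predicted probability the design is valid
        # Low confidence pushes the exponent toward 0, biasing the term toward 1.
        score *= p_valid ** confidence
    keep = score >= threshold  # drop points whose validity score falls below the threshold
    return X_pool[keep], score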
Our performance metric is the harmonic mean of the Mean Absolute Percentage Error (MAPE) values across all performance regressors trained on the queried data. In some preliminary case studies, our algorithm has outperformed uniform sampling. Through more rigorous testing and optimization, we will refine our algorithm to establish concrete performance results.
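For concreteness, the metric reduces to a single harmonic mean over the per-regressor MAPE values; the short sketch below assumes those values have already been computed.

import numpy as np

def overall_mape(mape_values):
    # Harmonic mean of the MAPE of each performance regressor (lower is better).
    mape = np.asarray(mape_values, dtype=float)
    return len(mape) / np.sum(1.0 / mape)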
Presenting Author: Advaith Narayanan, Leigh High School
Presenting Author Biography: Advaith Narayanan is a senior at Leigh High School, San Jose, CA. He is also concurrently taking college-level courses at a community college. His research interests lie at the interface of digital computation and engineering (e.g., CAD and simulation). He is a 2024 USA Physics Olympiad Semifinalist and a recipient of the IEEE Technical Excellence Award and a Recognition Award from the Office of Naval Research (US Navy and Marine Corps) at the Synopsys Silicon Valley Science and Technology Championship. He is a passionate hackathon enthusiast.
Authors:
Advaith Narayanan, Leigh High School
Paper Type
Undergraduate Expo