Widely Used Machine Learning Models Reproduce Dataset Bias: Study
upstart writes:
Widely used machine learning models reproduce dataset bias: Study:
Rice University computer science researchers have found bias in machine learning tools widely used in immunotherapy research.
[...] HLA (human leukocyte antigen) is a family of genes, present in all humans, that encode proteins central to our immune response. Those proteins bind to protein fragments called peptides inside our cells and display them on the cell surface, marking infected cells so the immune system can respond and, ideally, eliminate the threat.
Different people carry slightly different variants of these genes, called alleles. Current immunotherapy research is exploring ways to identify peptides that bind more effectively to a patient's particular HLA alleles.

The eventual result could be customized, highly effective immunotherapies. That is why one of the most critical steps is accurately predicting which peptides will bind to which alleles: the greater the accuracy, the better the potential efficacy of the therapy.
But calculating how effectively a peptide will bind to a given HLA allele is laborious, which is why machine learning models are used to predict binding. This is where Rice's team found a problem: the data used to train those models appears to skew geographically toward higher-income communities.
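To make the pipeline concrete, here is a minimal, hypothetical sketch of how a machine learning binding predictor works: peptide sequences are encoded as numeric features, and a model learns to score binders against non-binders. Everything here, the peptides, the labels, and the classifier choice, is an invented placeholder, not the Rice team's method or any production tool such as the NetMHCpan family of predictors.

```python
# Minimal sketch of peptide-HLA binding prediction (invented data, not
# any real tool's method). Real predictors use far richer encodings and
# training sets orders of magnitude larger.
from sklearn.ensemble import RandomForestClassifier
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_peptide(peptide: str) -> np.ndarray:
    """One-hot encode a 9-mer peptide into a flat 9x20 feature vector."""
    vec = np.zeros((9, len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

# Hypothetical training examples: (peptide, binder label) pairs
train_peptides = ["SIINFEKLV", "GILGFVFTL", "KLGGALQAK", "AAAWYLWEV"]
train_labels = [1, 1, 0, 0]  # 1 = binds the allele, 0 = does not (invented)

X = np.array([encode_peptide(p) for p in train_peptides])
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, train_labels)

# Score a new candidate peptide instead of measuring binding in the lab
candidate = "SLYNTVATL"
prob = model.predict_proba([encode_peptide(candidate)])[0, 1]
print(f"Predicted binding probability for {candidate}: {prob:.2f}")
```

The key point for the bias finding is the training data: a model like this can only learn binding patterns for the alleles represented in its examples.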
Why is this an issue? Without genetic data from lower-income communities in the training sets, future immunotherapies developed for those communities may be less effective.
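The kind of audit that surfaces such a skew can be sketched in a few lines: compare which HLA alleles the training set covers against alleles common in each population. The allele lists, population codes, and counts below are invented for illustration, not real frequencies.

```python
# Illustrative dataset audit for population skew (all values invented).
from collections import Counter

# Hypothetical training records: (HLA allele, source population)
training_data = [
    ("HLA-A*02:01", "EUR"), ("HLA-A*02:01", "EUR"), ("HLA-A*01:01", "EUR"),
    ("HLA-A*02:01", "EAS"), ("HLA-B*07:02", "EUR"), ("HLA-A*24:02", "EAS"),
]

# Hypothetical alleles common in each population
common_alleles = {
    "EUR": {"HLA-A*02:01", "HLA-A*01:01", "HLA-B*07:02"},
    "AFR": {"HLA-A*23:01", "HLA-B*53:01", "HLA-A*30:01"},
    "EAS": {"HLA-A*24:02", "HLA-A*02:01", "HLA-B*40:01"},
}

trained_alleles = {allele for allele, _ in training_data}
samples_per_pop = Counter(pop for _, pop in training_data)

for pop, alleles in common_alleles.items():
    covered = alleles & trained_alleles
    print(f"{pop}: {len(covered)}/{len(alleles)} common alleles covered, "
          f"{samples_per_pop.get(pop, 0)} training samples")
# A population with few samples and low allele coverage (here AFR) is the
# kind of gap that makes predictions less reliable for that group.
```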
"Each and every one of us has different HLAs that they express, and those HLAs vary between different populations," Fasoulis said. "Given that machine learning is used to identify potential peptide candidates for immunotherapies, if you basically have biased machine models, then those therapeutics won't work equally for everyone in every population."
Read more of this story at SoylentNews.