dghlsakjg 4 hours ago
This was 2021 (so pre-LLM), but I used to work for a company that gathered data for training voice commands (Alexa, Toyota, and Sonos were some of our clients). Basically, we paid people to read digital assistant scripts at scale. Your assumptions about training data do not match the demographics of the data I collected. Most of our work revolved around getting diversity into the training data. We specifically recruited kids, older folks, women, people with accented or dialectal English, and just about every variety of speech that we could get our hands on. The companies we worked with were insanely methodical about ensuring that different people were included.
gmueckl 3 hours ago
You are reporting on a deliberately curated effort vs. what I understand is effectively voluntary data donation without incentives. It's not surprising to me that the latter dataset ends up biased due to the differences in sourcing.