|
| ▲ | bluGill 7 minutes ago | parent | next [-] |
| I remember when those systems first started collecting data, they were worried kids' voices wouldn't be handled well - but they didn't know how to handle the privacy issues around recording kids, so they discouraged it. Women being missed is not a surprise - but it wasn't anticipated. |
|
| ▲ | dghlsakjg 2 hours ago | parent | prev | next [-] |
| This was 2021 (so pre-llm), but I used to work for a company that gathered data for training voice commands (Alexa, Toyota, Sonos, were some clients). Basically, we paid people to read digital assistant scripts at scale. Your assumptions about training data do not match the demographics of data I collected. The majority of what our work revolved around was getting diversity into the training data. We specifically recruited kids, older folks, women, people with accented/dialected English and just about every variety of speech that we could get our hands on. The companies we worked with were insanely methodical about ensuring that different people were included. |
| |
| ▲ | gmueckl 2 hours ago | parent [-] | | You are reporting on a deliberately curated effort vs. what I understand is effectively voluntary data donation without incentives. It's not surprising to me that the latter dataset ends up biased due to the differences in sourcing. |
|
|
| ▲ | ethagnawl 3 hours ago | parent | prev [-] |
| Oh, I'm sure you're right. I've had people in my personal life (non-technical "AI enthusiasts") laugh at me over concerns about training bias, but this is likely a real-world example of it. |
| |
| ▲ | stavros 3 hours ago | parent [-] | | I think you can train your own wake word with microWakeWord but I've never done it. |
|