For example, the Switchboard corpus (300 h of transcribed 8 kHz audio) is about 16 GB.
That is a common size for LVCSR (large-vocabulary continuous speech recognition), and you need something in that range to get good performance (maybe 100 h at minimum). In their academic papers, Google researchers usually use their own private training data sets, with e.g. 1900 h. (E.g.: http://arxiv.org/pdf/1402.1128.pdf)
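As a rough sanity check on the 16 GB figure: if you assume uncompressed 16-bit mono PCM (my assumption; the corpus may well be distributed in a different encoding), the size follows directly from duration and sample rate. A minimal Python sketch, where `pcm_size_bytes` is just a hypothetical helper name:

```python
def pcm_size_bytes(hours, sample_rate_hz, bytes_per_sample=2, channels=1):
    """Size of uncompressed PCM audio, ignoring file headers and transcripts."""
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)

GIB = 2 ** 30

# Assumed format: 8 kHz, 16-bit (2 bytes per sample), mono.
switchboard_like = pcm_size_bytes(300, 8000)
print(f"300 h:  {switchboard_like / GIB:.1f} GiB")

# The same assumed format applied to a 1900 h training set.
large = pcm_size_bytes(1900, 8000)
print(f"1900 h: {large / GIB:.1f} GiB")
```

Under these assumptions, 300 h comes out at roughly 16 GiB, which is consistent with the figure above. A different bit depth or channel count scales the result linearly.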
Some crowdsourced effort to collect transcribed audio under a CC licence would be great!
Maybe this?
http://www.voxforge.org/home - "VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac)."
(Caveat: I have not recorded anything there from any of my machines; apparently I don't have the right plugin.)
Maybe also:
https://librivox.org - has audiobooks read by volunteers, plus the book text.