Keyword recognition database

by mabdollahi — last modified Dec 19, 2009 11:04 AM

Since I am planning to create the database for the keyword recognition experiment we agreed to begin with (recognition of the utterance 'STOP'), I am posting an outline of the database I'm going to record so that we can discuss it and others can comment on it.

I should also note that, since we intend to use the silicon cochlea as the first stage, we thought it best to record all the varieties of data that we may need throughout the whole experiment (including the training and test phases) together, and convert them to spikes in one go.

The coarse categorization of the data is as follows:

  1. The word 'STOP' uttered in three different environmental conditions:

    • a silent room
    • a non-silent room (or cocktail party situation)
    • outside in the city (average level of SNR)
  2. The word 'STOP' in different intonations and emotions, e.g. questioning, surprised, happy, sad, ...

  3. The word 'STOP' with different time-warps and pronunciation speeds.

  4. The word 'STOP' at different intensity levels (in case of direct access to the silicon cochlea through the microphone).

  5. The word 'STOP' in several different sentences (with and without stress on the keyword), covering sentences with the keyword at the beginning, in the middle, and at the end.

  6. The word 'STOP' spoken by five different speakers (3 male, 2 female).

  7. Whispered 'STOP'.

  8. Some other utterances close to (e.g. 'stab', 'shop', 'strap', 'star', 'stand', 'top', ...) and distant from (e.g. 'blow', 'post', ...) the word 'STOP'.
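
To make the size of the plan easier to discuss, the categories above could be written down as a small manifest. The following is only a sketch: the `Category` class, the short labels, and the `samples_per_variant` parameter are hypothetical and not part of the plan itself. Only variants that are named explicitly above are listed, so the open-ended lists (marked with '...') are incomplete, and items 3 and 4 are left out because no specific speeds or intensity levels are given.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Category:
    item: int        # item number in the list above
    name: str        # hypothetical short label, not from the plan
    variants: tuple  # only the variants named explicitly in the plan


# Item 5: keyword position in the sentence x stress on the keyword.
SENTENCE_VARIANTS = tuple(
    f"{position}, {stress}"
    for position, stress in product(
        ("beginning", "middle", "end"), ("stressed", "unstressed")
    )
)

PLAN = (
    Category(1, "environment",
             ("silent room", "non-silent room", "outside in the city")),
    # Open-ended in the plan; more intonations may be added.
    Category(2, "intonation", ("questioning", "surprised", "happy", "sad")),
    Category(5, "sentence", SENTENCE_VARIANTS),
    Category(6, "speaker",
             ("male 1", "male 2", "male 3", "female 1", "female 2")),
    Category(7, "whispered", ("whispered",)),
    # Open-ended in the plan; close and distant utterances together.
    Category(8, "other utterance",
             ("stab", "shop", "strap", "star", "stand", "top", "blow", "post")),
)


def count_variants(plan):
    """Number of distinct variants listed across all categories."""
    return sum(len(category.variants) for category in plan)


def count_recordings(plan, samples_per_variant):
    """Total recordings if every variant is recorded the same number of times."""
    return count_variants(plan) * samples_per_variant


if __name__ == "__main__":
    for category in PLAN:
        print(f"item {category.item} ({category.name}): "
              f"{len(category.variants)} variants")
    print("variants listed so far:", count_variants(PLAN))
```

Calling `count_recordings(PLAN, n)` with a candidate number of samples per variant gives a rough lower bound on the recording effort, since it treats the categories as independent rather than crossing them with each other.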

Please add comments below.


Comments (1)

Giacomo Indiveri Dec 18, 2009 11:58 AM
Hi Mohammad.
If you plan to collect 50 samples per experiment, recording all the proposed sets of data might be too much.
Here are some constraints that will hopefully reduce the number of recordings you plan to make:

 2. limit the collection to 50 samples of 'STOP' as a question (assuming the original data is uttered as a statement);

 3. limit the collection to 50 samples of 'STOP' at slower speed (assuming the original data is uttered at standard speed);

 4. drop this altogether, as this can be done artificially when replaying the audio data to the cochlea chip (and/or to the SW simulations);

 5. limit this to a collection of 50 samples of a single sentence (Hynek: any specific sentence in mind?)

 7. this does not seem necessary to me, but Hynek and/or Shih-Chii might have different opinions

 8. choose 3 other utterances only (and 50 samples each).
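
As an illustration of point 4, intensity levels could be produced from a single recording by scaling it before playback. This is a minimal sketch, assuming the audio is available as floating-point samples in the range [-1, 1]; the function names and the example gain values are hypothetical, and the actual playback chain to the cochlea chip may call for a different approach.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def apply_gain_db(samples, gain_db):
    """Scale samples by a gain in dB, clipping the result to [-1, 1]."""
    factor = 10.0 ** (gain_db / 20.0)  # amplitude ratio for a dB gain
    return [max(-1.0, min(1.0, s * factor)) for s in samples]


if __name__ == "__main__":
    # Stand-in for one recorded utterance: a 440 Hz tone sampled at 16 kHz.
    recording = [0.5 * math.sin(2 * math.pi * 440 * n / 16000)
                 for n in range(1600)]
    # Hypothetical intensity levels, relative to the original recording.
    for gain_db in (-12.0, -6.0, 0.0):
        scaled = apply_gain_db(recording, gain_db)
        print(f"{gain_db:+.0f} dB -> RMS {rms(scaled):.4f}")
```

Attenuating rather than amplifying avoids clipping, so it is probably safest to record at the highest intended level and derive the quieter versions from it.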