Writing a Discord Bot With Speech Recognition

Bots add a lot of versatility to Discord, and you can use them to automate certain tasks. For example, imagine a server where voice channels correspond to games. Our bot will join a user's voice channel when the user starts playing a game and ask whether they want to be transferred to the right channel. We will be using Node.js version 8.10.0, Google speech recognition, ffmpeg and discord.js. You can install Node.js 8.10.0 using nvm. I also recommend using the latest Ubuntu release.

First things first: we need a Discord API token. There are tutorials on the Internet on how to get one. You will also need to add your Discord bot user to your Discord server. Then create a file named config.json and paste your API token there:

{
  "discordApiToken": "your-token-here"
}

Then create a package.json file and add the dependencies:

{
  "name": "myawesomebot",
  "version": "1.0.0",
  "description": "Discord bot with speech recognition",
  "main": "index.js",
  "scripts": {
    "start": "node index.js"
  },
  "dependencies": {
    "@google-cloud/speech": "^2.1.1",
    "discord.js": "https://github.com/discordjs/discord.js.git#123713305ad5a6aa1e5205a53713494009740aef",
    "node-opus": "^0.3.1"
  },
  "license": "ISC"
}

Create an index.js file:

const Discord = require('discord.js')
const config = require('./config')

const discordClient = new Discord.Client()

discordClient.on('ready', () => {
  console.log(`Logged in as ${discordClient.user.tag}!`)
})

discordClient.login(config.discordApiToken)

Now we want to test whether our bot is set up correctly. Run npm install and npm start and look at your console. There should be a message starting with "Logged in as", and the bot should come online in your Discord server. Congratulations! You wrote your first Discord bot. Right now this bot does nothing, so let's add some functionality. To do this we will use async functions, which help us reduce nesting and make our code prettier:

discordClient.on('presenceUpdate', async (oldPresence, newPresence) => {
  console.log('New Presence:', newPresence)

  const member = newPresence.member
  const presence = newPresence
  const memberVoiceChannel = member.voice.channel

  if (!presence || !presence.activity || !presence.activity.name || !memberVoiceChannel) {
    return
  }

  const connection = await memberVoiceChannel.join()

  connection.on('speaking', (user, speaking) => {
    if (speaking) {
      console.log(`I'm listening to ${user.username}`)
    } else {
      console.log(`I stopped listening to ${user.username}`)
    }
  })
})

Now try joining a voice channel and then starting a game. The bot should join you and log your presence which includes the name of the game. Also it should detect when you are speaking.

So far so good. Time for some real speech recognition. To do this we will need Google speech API credentials (read the first item in the "before you begin" section). Save your credentials in a google-credentials.json file in your project folder. After that you can either use dotenv or start your app with

GOOGLE_APPLICATION_CREDENTIALS="[PATH]" npm start

where [PATH] is a full path to your google-credentials.json file.

If you use dotenv, install the package and create a .env file in your project folder:

GOOGLE_APPLICATION_CREDENTIALS="google-credentials.json"

Then add

require('dotenv').config()

at the start of your index.js.
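A missing or mistyped credentials path usually shows up only later, as an authentication error from the Google client. Here is a minimal sketch of an early check you could run right after loading the environment. The helper name checkCredentialsPath is my own and is not part of dotenv or the Google library:

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper: returns the absolute credentials path, or throws
// a readable error if the variable is unset or the file does not exist.
function checkCredentialsPath(env = process.env) {
  const value = env.GOOGLE_APPLICATION_CREDENTIALS

  if (!value) {
    throw new Error('GOOGLE_APPLICATION_CREDENTIALS is not set')
  }

  // Relative paths are resolved against the current working directory.
  const fullPath = path.resolve(value)

  if (!fs.existsSync(fullPath)) {
    throw new Error(`Credentials file not found: ${fullPath}`)
  }

  return fullPath
}
```

Calling checkCredentialsPath() before creating the speech client makes the bot fail at startup with a clear message instead of failing on the first recognition request.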

Nice. Now let's try recognizing your beautiful voice. To do this we need one more thing. The discord.js call receiver.createStream(user, { mode: 'pcm' }) creates a 16-bit signed PCM, stereo 48 kHz stream, but Google speech recognition takes mono input, i.e. 1 channel. So we have to convert the 2-channel stream into a 1-channel stream. We do this by creating a transform stream:

const { Transform } = require('stream')

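// Input is interleaved 16-bit stereo PCM: 4 bytes per frame (left sample, then right sample).
// We keep only the left sample of each frame, so the output is half the size.
function convertBufferTo1Channel(buffer) {
  const convertedBuffer = Buffer.alloc(buffer.length / 2)

  for (let i = 0; i < convertedBuffer.length / 2; i++) {
    // copy the 2-byte left sample of frame i; the raw bytes are copied unchanged
    const uint16 = buffer.readUInt16LE(i * 4)
    convertedBuffer.writeUInt16LE(uint16, i * 2)
  }

  return convertedBuffer
}

class ConvertTo1ChannelStream extends Transform {
  constructor(options) {
    super(options)
  }

  _transform(data, encoding, next) {
    next(null, convertBufferTo1Channel(data))
  }
}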
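To see what the conversion does to the samples, here is a small self-contained sketch on a hand-made buffer. It repeats the keep-the-left-channel approach from above and, as an alternative that this tutorial does not use, averages both channels. The function names takeLeftChannel and downmixToMono are mine:

```javascript
// Interleaved 16-bit stereo PCM: [L0, R0, L1, R1, ...], 2 bytes per sample.
function takeLeftChannel(stereo) {
  const frames = Math.floor(stereo.length / 4)
  const mono = Buffer.alloc(frames * 2)

  for (let i = 0; i < frames; i++) {
    mono.writeInt16LE(stereo.readInt16LE(i * 4), i * 2)
  }

  return mono
}

// Alternative (an assumption, not the tutorial's method): average left and right.
function downmixToMono(stereo) {
  const frames = Math.floor(stereo.length / 4)
  const mono = Buffer.alloc(frames * 2)

  for (let i = 0; i < frames; i++) {
    const left = stereo.readInt16LE(i * 4)
    const right = stereo.readInt16LE(i * 4 + 2)
    // The mean of two int16 values always fits into int16.
    mono.writeInt16LE(Math.round((left + right) / 2), i * 2)
  }

  return mono
}

// Two stereo frames: (100, 300) and (-200, -400)
const stereo = Buffer.alloc(8)
stereo.writeInt16LE(100, 0)
stereo.writeInt16LE(300, 2)
stereo.writeInt16LE(-200, 4)
stereo.writeInt16LE(-400, 6)

const leftOnly = takeLeftChannel(stereo) // samples: 100, -200
const mixed = downmixToMono(stereo) // samples: 200, -300
```

Keeping one channel is cheaper and is enough when both channels carry the same voice. Averaging only matters if the two channels actually differ.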

We are ready to implement voice recognition:

const googleSpeech = require('@google-cloud/speech')

const googleSpeechClient = new googleSpeech.SpeechClient()

discordClient.on('presenceUpdate', async (oldPresence, newPresence) => {
  console.log('New Presence:', newPresence)

  const member = newPresence.member
  const presence = newPresence
  const memberVoiceChannel = member.voice.channel

  if (!presence || !presence.activity || !presence.activity.name || !memberVoiceChannel) {
    return
  }

  const connection = await memberVoiceChannel.join()
  const receiver = connection.receiver

  connection.on('speaking', (user, speaking) => {
    if (!speaking) {
      return
    }

    console.log(`I'm listening to ${user.username}`)

    // this creates a 16-bit signed PCM, stereo 48KHz stream
    const audioStream = receiver.createStream(user, { mode: 'pcm' })
    const requestConfig = {
      encoding: 'LINEAR16',
      sampleRateHertz: 48000,
      languageCode: 'en-US'
    }
    const request = {
      config: requestConfig
    }
    const recognizeStream = googleSpeechClient
      .streamingRecognize(request)
      .on('error', console.error)
      .on('data', response => {
        const transcription = response.results
          .map(result => result.alternatives[0].transcript)
          .join('\n')
          .toLowerCase()
        console.log(`Transcription: ${transcription}`)
      })

    const convertTo1ChannelStream = new ConvertTo1ChannelStream()

    audioStream.pipe(convertTo1ChannelStream).pipe(recognizeStream)

    audioStream.on('end', async () => {
      console.log('audioStream end')
    })
  })
})

Start the bot, join a voice channel, start any game and then say something in English. You should see a transcription of your words in the console. For now we only need the words "yes" and "no" to command our bot.
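Since only "yes" and "no" matter, it can help to reduce the raw transcription to one of these two answers. The sketch below assumes the recognizer may return the word with punctuation, surrounding whitespace or extra words; I have not verified which variations the API actually produces. The helper name parseAnswer is hypothetical:

```javascript
// Hypothetical helper: maps a raw transcription to 'yes', 'no', or null.
function parseAnswer(transcription) {
  const words = transcription
    .toLowerCase()
    .replace(/[^a-z\s]/g, ' ') // drop punctuation and digits
    .split(/\s+/)
    .filter(Boolean)

  const saidYes = words.includes('yes')
  const saidNo = words.includes('no')

  // Neither word, or both at once, is treated as "no command".
  if (saidYes === saidNo) {
    return null
  }

  return saidYes ? 'yes' : 'no'
}
```

Matching whole words means that "yesterday" does not count as "yes", and an ambiguous "yes no" is ignored rather than guessed.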

Note: at this point it may seem that the bot doesn't hear you and doesn't recognize your words. According to this GitHub issue, Discord had (or has) a bug that prevents a bot from receiving audio until the bot itself has played some sound. So just proceed with the tutorial; we will implement playing a sound before recognition.

The bot is silent right now, so we need it to ask the user a question:

async function playFile(connection, filePath) {
  return new Promise((resolve, reject) => {
    const dispatcher = connection.play(filePath)
    dispatcher.setVolume(1)
    dispatcher.on('start', () => {
      console.log('Playing')
    })
    dispatcher.on('finish', () => {
      resolve()
    })
    dispatcher.on('error', (error) => {
      console.error(error)
      reject(error)
    })
  })
}
...
  const connection = await memberVoiceChannel.join()
  const receiver = connection.receiver

  await playFile(connection, 'wrongChannelEn.mp3')

  connection.on('speaking', (user, speaking) => {
...

You can download wrongChannelEn.mp3 or record your own voice line. If the audio does not play, the cause is probably missing ffmpeg binaries. You can download them here for your system. This link is for Ubuntu x64. Decompress the archive and put the ffmpeg file somewhere on your PATH, e.g. in the /usr/bin folder.
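To find out whether ffmpeg is reachable from Node.js before debugging the playback code, you can try to start it. This is a minimal sketch that assumes the binary answers the -version flag with exit code 0. The helper name isCommandAvailable is my own:

```javascript
const { spawnSync } = require('child_process')

// Hypothetical helper: true if `<command> -version` starts and exits with code 0.
function isCommandAvailable(command) {
  const result = spawnSync(command, ['-version'], { stdio: 'ignore' })

  // result.error is set (e.g. ENOENT) when the binary cannot be started at all.
  return !result.error && result.status === 0
}

const hasFfmpeg = isCommandAvailable('ffmpeg')

if (!hasFfmpeg) {
  console.warn('ffmpeg was not found on PATH, so audio playback will probably fail')
}
```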

The last step is to create a mapping between games and channels and to tell our bot to transfer people to the right channel on "yes". To copy the voice channel ID, use the earlier guide. Since I'm using Ubuntu, I will use "Mines" as the game for this demonstration.

const GamesAndChannels = {
  Mines: '[VoiceChannelID]'
}

discordClient.on('presenceUpdate', async (oldPresence, newPresence) => {
  const member = newPresence.member
  const presence = newPresence
  const memberVoiceChannel = member.voice.channel

  if (!presence || !presence.activity || !presence.activity.name || !memberVoiceChannel) {
    return
  }

  // look up the target channel by the name of the game being played
  const channelId = GamesAndChannels[presence.activity.name]

  if (!channelId) {
    return
  }

  const connection = await memberVoiceChannel.join()
  const receiver = connection.receiver

  await playFile(connection, 'wrongChannelEn.mp3')

  // give up and leave if nobody answers within 30 seconds
  setTimeout(() => {
    memberVoiceChannel.leave()
  }, 30000)

  connection.on('speaking', (user, speaking) => {
    if (!speaking) {
      return
    }

    console.log(`I'm listening to ${user.username}`)

    // this creates a 16-bit signed PCM, stereo 48KHz stream
    const audioStream = receiver.createStream(user, { mode: 'pcm' })
    const requestConfig = {
      encoding: 'LINEAR16',
      sampleRateHertz: 48000,
      languageCode: 'en-US'
    }
    const request = {
      config: requestConfig
    }
    const recognizeStream = googleSpeechClient
      .streamingRecognize(request)
      .on('error', console.error)
      .on('data', response => {
        const transcription = response.results
          .map(result => result.alternatives[0].transcript)
          .join('\n')
          .toLowerCase()
        console.log(`Transcription: ${transcription}`)

        if (transcription === 'yes') {
          connection.channel.members.array().forEach(member => {
            if (member.user.id !== discordClient.user.id) {
              console.log(`Moving member ${member.displayName} to channel ${channelId}`)
              member.edit({ channel: channelId }).catch(console.error)
            }
          })
          // leave once, after every member has been asked to move
          memberVoiceChannel.leave()
        } else if (transcription === 'no') {
          memberVoiceChannel.leave()
        }
      })

    const convertTo1ChannelStream = new ConvertTo1ChannelStream()

    audioStream.pipe(convertTo1ChannelStream).pipe(recognizeStream)

    audioStream.on('end', async () => {
      console.log('audioStream end')
    })
  })
})

Done. Now join a wrong voice channel (not the one with the ID you specified earlier), fire up "Mines" and tell the bot "yes". You (and your buddies in the voice channel) should be transferred to the voice channel with the specified ID. If you tell the bot "no", it just leaves the voice channel with sadness on its metal face. The final version of the code is available via this gist.
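One possible refinement is to skip the question entirely when the member is already sitting in the channel mapped to their game. The sketch below isolates the lookup in a plain function. The helper name findTargetChannelId and the "already there" check are my assumptions and are not part of the code above:

```javascript
const GamesAndChannels = {
  Mines: '[VoiceChannelID]'
}

// Hypothetical helper: returns the channel ID the member should be moved to,
// or null when the game is not mapped or the member is already in that channel.
function findTargetChannelId(gameName, currentChannelId, mapping = GamesAndChannels) {
  // hasOwnProperty guards against inherited keys such as "constructor"
  if (!Object.prototype.hasOwnProperty.call(mapping, gameName)) {
    return null
  }

  const targetId = mapping[gameName]

  return targetId === currentChannelId ? null : targetId
}
```

In the presenceUpdate handler this could replace the direct GamesAndChannels lookup, with the game name and the ID of the member's current voice channel passed in as arguments.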

In this tutorial I demonstrated a simplified version of EzBot, a bot we developed specifically for our Discord server. It is open source under the ISC license (use it however you want). You can find it on my GitHub. It requires MongoDB, has useful commands and internationalization (only EN/RU currently), sends new users welcome messages and has a feature for assigning roles to people through an emote dashboard. This project was created in collaboration with my friend MrPhoenix.

UPDATE 26.11.2019: Because of a breaking change in the Discord API, the code in this article stopped working. There is a pull request that solves the problem, but it requires updating to the latest discord.js (master branch). I will rewrite the article when this pull request is merged. For now, I have updated the gist mentioned above.

UPDATE 07.01.2020: The pull request was merged. You can now use the main branch of discord.js to make it work. I rewrote the article to comply with the new API introduced in the main branch of discord.js.

UPDATE 04.03.2022: Discord will shut off API versions 6 and 7, so RIP discord.js v11 & v12. I will update this article or write a new one to show the new ways of discord.js.