ffmpeg tutorial

An ffmpeg and SDL Tutorial

Tutorial 03: Playing Sound

Audio

So now we want to play sound. SDL also gives us methods for outputting sound. The SDL_OpenAudio() function is used to open the audio device itself. It takes as arguments an SDL_AudioSpec struct, which contains all the information about the audio we are going to output.

Before we show how you set this up, let's explain first about how audio is handled by computers. Digital audio consists of a long stream of samples. Each sample represents a value of the audio waveform. Sounds are recorded at a certain sample rate, which simply says how fast to play each sample, and is measured in number of samples per second. Example sample rates are 22,050 and 44,100 samples per second, which are the rates used for radio and CD respectively. In addition, most audio can have more than one channel for stereo or surround, so for example, if the sample is in stereo, the samples will come 2 at a time. When we get data from a movie file, we don't know how many samples we will get, but ffmpeg will not give us partial samples - that also means that it will not split a stereo sample up, either.

SDL's method for playing audio is this: you set up your audio options: the sample rate (called "freq" for frequency in the SDL struct), number of channels, and so forth, and we also set a callback function and userdata. When we begin playing audio, SDL will continually call this callback function and ask it to fill the audio buffer with a certain number of bytes. After we put this information in the SDL_AudioSpec struct, we call SDL_OpenAudio(), which will open the audio device and give us back another AudioSpec struct. These are the specs we will actually be using — we are not guaranteed to get what we asked for!

Setting Up the Audio

Keep that all in your head for the moment, because we don't actually have any information yet about the audio streams yet! Let's go back to the place in our code where we found the video stream and find which stream is the audio stream.

// Find the first video stream
videoStream=-1;
audioStream=-1;
for(i=0; i < pFormatCtx->nb_streams; i++) {
  if(pFormatCtx->streams[i]->codec->codec_type==AVMEDIA_TYPE_VIDEO
     &&
       videoStream < 0) {
    videoStream=i;
  }
  if(pFormatCtx->streams[i]->codec->codec_type==AVMEDIA_TYPE_AUDIO &&
     audioStream < 0) {
    audioStream=i;
  }
}
if(videoStream==-1)
  return -1; // Didn't find a video stream
if(audioStream==-1)
  return -1;

From here we can get all the info we want from the AVCodecContext from the stream, just like we did with the video stream:

AVCodecContext *aCodecCtxOrig;
AVCodecContext *aCodecCtx;

aCodecCtxOrig=pFormatCtx->streams[audioStream]->codec;

If you remember from the previous tutorials, we still need to open the audio codec itself. This is straightforward:

AVCodec         *aCodec;

aCodec = avcodec_find_decoder(aCodecCtx->codec_id);
if(!aCodec) {
  fprintf(stderr, "Unsupported codec!\n");
  return -1;
}
// Copy context
aCodecCtx = avcodec_alloc_context3(aCodec);
if(avcodec_copy_context(aCodecCtx, aCodecCtxOrig) != 0) {
  fprintf(stderr, "Couldn't copy codec context");
  return -1; // Error copying codec context
}
/* set up SDL Audio here */

avcodec_open2(aCodecCtx, aCodec, NULL);

Contained within the codec context is all the information we need to set up our audio:

wanted_spec.freq = aCodecCtx->sample_rate;
wanted_spec.format = AUDIO_S16SYS;
wanted_spec.channels = aCodecCtx->channels;
wanted_spec.silence = 0;
wanted_spec.samples = SDL_AUDIO_BUFFER_SIZE;
wanted_spec.callback = audio_callback;
wanted_spec.userdata = aCodecCtx;

if(SDL_OpenAudio(&wanted_spec, &spec) < 0) {
  fprintf(stderr, "SDL_OpenAudio: %s\n", SDL_GetError());
  return -1;
}

Let's go through these:

freq: The sample rate, as explained earlier.
format: This tells SDL what format we will be giving it. The "S" in "S16SYS" stands for "signed", the 16 says that each sample is 16 bits long, and "SYS" means that the endian-order will depend on the system you are on. This is the format that avcodec_decode_audio2 will give us the audio in.
channels: Number of audio channels.
silence: This is the value that indicated silence. Since the audio is signed, 0 is of course the usual value.
samples: This is the size of the audio buffer that we would like SDL to give us when it asks for more audio. A good value here is between 512 and 8192; ffplay uses 1024.
callback: Here's where we pass the actual callback function. We'll talk more about the callback function later.
userdata: SDL will give our callback a void pointer to any user data that we want our callback function to have. We want to let it know about our codec context; you'll see why.

Finally, we open the audio with SDL_OpenAudio.

Queues

There! Now we're ready to start pulling audio information from the stream. But what do we do with that information? We are going to be continuously getting packets from the movie file, but at the same time SDL is going to call the callback function! The solution is going to be to create some kind of global structure that we can stuff audio packets in so our audio_callback has something to get audio data from! So what we're going to do is to create a queue of packets. ffmpeg even comes with a structure to help us with this: AVPacketList, which is just a linked list for packets. Here's our queue structure:

typedef struct PacketQueue {
  AVPacketList *first_pkt, *last_pkt;
  int nb_packets;
  int size;
  SDL_mutex *mutex;
  SDL_cond *cond;
} PacketQueue;

First, we should point out that nb_packets is not the same as size — size refers to a byte size that we get from packet->size. You'll notice that we have a mutex and a condtion variable in there. This is because SDL is running the audio process as a separate thread. If we don't lock the queue properly, we could really mess up our data. We'll see how in the implementation of the queue. Every programmer should know how to make a queue, but we're including this so you can learn the SDL functions.

First we make a function to initialize the queue:

void packet_queue_init(PacketQueue *q) {
  memset(q, 0, sizeof(PacketQueue));
  q->mutex = SDL_CreateMutex();
  q->cond = SDL_CreateCond();
}

Then we will make a function to put stuff in our queue:

int packet_queue_put(PacketQueue *q, AVPacket *pkt) {

  AVPacketList *pkt1;
  if(av_dup_packet(pkt) < 0) {
    return -1;
  }
  pkt1 = av_malloc(sizeof(AVPacketList));
  if (!pkt1)
    return -1;
  pkt1->pkt = *pkt;
  pkt1->next = NULL;
  
  
  SDL_LockMutex(q->mutex);
  
  if (!q->last_pkt)
    q->first_pkt = pkt1;
  else
    q->last_pkt->next = pkt1;
  q->last_pkt = pkt1;
  q->nb_packets++;
  q->size += pkt1->pkt.size;
  SDL_CondSignal(q->cond);
  
  SDL_UnlockMutex(q->mutex);
  return 0;
}

SDL_LockMutex() locks the mutex in the queue so we can add something to it, and then SDL_CondSignal() sends a signal to our get function (if it is waiting) through our condition variable to tell it that there is data and it can proceed, then unlocks the mutex to let it go.

Here's the corresponding get function. Notice how SDL_CondWait() makes the function block (i.e. pause until we get data) if we tell it to.

int quit = 0;

static int packet_queue_get(PacketQueue *q, AVPacket *pkt, int block) {
  AVPacketList *pkt1;
  int ret;
  
  SDL_LockMutex(q->mutex);
  
  for(;;) {
    
    if(quit) {
      ret = -1;
      break;
    }

    pkt1 = q->first_pkt;
    if (pkt1) {
      q->first_pkt = pkt1->next;
      if (!q->first_pkt)
	q->last_pkt = NULL;
      q->nb_packets--;
      q->size -= pkt1->pkt.size;
      *pkt = pkt1->pkt;
      av_free(pkt1);
      ret = 1;
      break;
    } else if (!block) {
      ret = 0;
      break;
    } else {
      SDL_CondWait(q->cond, q->mutex);
    }
  }
  SDL_UnlockMutex(q->mutex);
  return ret;
}

As you can see, we've wrapped the function in a forever loop so we will be sure to get some data if we want to block. We avoid looping forever by making use of SDL's SDL_CondWait() function. Basically, all CondWait does is wait for a signal from SDL_CondSignal() (or SDL_CondBroadcast()) and then continue. However, it looks as though we've trapped it within our mutex — if we hold the lock, our put function can't put anything in the queue! However, what SDL_CondWait() also does for us is to unlock the mutex we give it and then attempt to lock it again once we get the signal.

In Case of Fire

You'll also notice that we have a global quit variable that we check to make sure that we haven't set the program a quit signal (SDL automatically handles TERM signals and the like). Otherwise, the thread will continue forever and we'll have to kill -9 the program.

  SDL_PollEvent(&event);
  switch(event.type) {
  case SDL_QUIT:
    quit = 1;

We make sure to set the quit flag to 1.

Feeding Packets

The only thing left is to set up our queue:

PacketQueue audioq;
main() {
...
  avcodec_open2(aCodecCtx, aCodec, NULL);

  packet_queue_init(&audioq);
  SDL_PauseAudio(0);

SDL_PauseAudio() finally starts the audio device. It plays silence if it doesn't get data; which it won't right away.

So, we've got our queue set up, now we're ready to start feeding it packets. We go to our packet-reading loop:

while(av_read_frame(pFormatCtx, &packet)>=0) {
  // Is this a packet from the video stream?
  if(packet.stream_index==videoStream) {
    // Decode video frame
    ....
    }
  } else if(packet.stream_index==audioStream) {
    packet_queue_put(&audioq, &packet);
  } else {
    av_free_packet(&packet);
  }

Note that we don't free the packet after we put it in the queue. We'll free it later when we decode it.

Fetching Packets

Now let's finally make our audio_callback function to fetch the packets on the queue. The callback has to be of the form void callback(void *userdata, Uint8 *stream, int len), where userdata of course is the pointer we gave to SDL, stream is the buffer we will be writing audio data to, and len is the size of that buffer. Here's the code:

void audio_callback(void *userdata, Uint8 *stream, int len) {

  AVCodecContext *aCodecCtx = (AVCodecContext *)userdata;
  int len1, audio_size;

  static uint8_t audio_buf[(AVCODEC_MAX_AUDIO_FRAME_SIZE * 3) / 2];
  static unsigned int audio_buf_size = 0;
  static unsigned int audio_buf_index = 0;

  while(len > 0) {
    if(audio_buf_index >= audio_buf_size) {
      /* We have already sent all our data; get more */
      audio_size = audio_decode_frame(aCodecCtx, audio_buf,
                                      sizeof(audio_buf));
      if(audio_size < 0) {
	/* If error, output silence */
	audio_buf_size = 1024;
	memset(audio_buf, 0, audio_buf_size);
      } else {
	audio_buf_size = audio_size;
      }
      audio_buf_index = 0;
    }
    len1 = audio_buf_size - audio_buf_index;
    if(len1 > len)
      len1 = len;
    memcpy(stream, (uint8_t *)audio_buf + audio_buf_index, len1);
    len -= len1;
    stream += len1;
    audio_buf_index += len1;
  }
}

This is basically a simple loop that will pull in data from another function we will write, audio_decode_frame(), store the result in an intermediary buffer, attempt to write len bytes to stream, and get more data if we don't have enough yet, or save it for later if we have some left over. The size of audio_buf is 1.5 times the size of the largest audio frame that ffmpeg will give us, which gives us a nice cushion.

Finally Decoding the Audio

Let's get to the real meat of the decoder, audio_decode_frame:

int audio_decode_frame(AVCodecContext *aCodecCtx, uint8_t *audio_buf,
                       int buf_size) {

  static AVPacket pkt;
  static uint8_t *audio_pkt_data = NULL;
  static int audio_pkt_size = 0;
  static AVFrame frame;

  int len1, data_size = 0;

  for(;;) {
    while(audio_pkt_size > 0) {
      int got_frame = 0;
      len1 = avcodec_decode_audio4(aCodecCtx, &frame, &got_frame, &pkt);
      if(len1 < 0) {
	/* if error, skip frame */
	audio_pkt_size = 0;
	break;
      }
      audio_pkt_data += len1;
      audio_pkt_size -= len1;
      data_size = 0;
      if(got_frame) {
	data_size = av_samples_get_buffer_size(NULL, 
					       aCodecCtx->channels,
					       frame.nb_samples,
					       aCodecCtx->sample_fmt,
					       1);
	assert(data_size <= buf_size);
	memcpy(audio_buf, frame.data[0], data_size);
      }
      if(data_size <= 0) {
	/* No data yet, get more frames */
	continue;
      }
      /* We have data, return it and come back for more later */
      return data_size;
    }
    if(pkt.data)
      av_free_packet(&pkt);

    if(quit) {
      return -1;
    }

    if(packet_queue_get(&audioq, &pkt, 1) < 0) {
      return -1;
    }
    audio_pkt_data = pkt.data;
    audio_pkt_size = pkt.size;
  }
}

This whole process actually starts towards the end of the function, where we call packet_queue_get(). We pick the packet up off the queue, and save its information. Then, once we have a packet to work with, we call avcodec_decode_audio4(), which acts a lot like its sister function, avcodec_decode_video(), except in this case, a packet might have more than one frame, so you may have to call it several times to get all the data out of the packet. Once we have the frame, we simply copy it to our audio buffer, making sure the data_size is smaller than our audio buffer. Also, remember the cast to audio_buf, because SDL gives an 8 bit int buffer, and ffmpeg gives us data in a 16 bit int buffer. You should also notice the difference between len1 and data_size. len1 is how much of the packet we've used, and data_size is the amount of raw data returned.

When we've got some data, we immediately return to see if we still need to get more data from the queue, or if we are done. If we still had more of the packet to process, we save it for later. If we finish up a packet, we finally get to free that packet.

So that's it! We've got audio being carried from the main read loop to the queue, which is then read by the audio_callback function, which hands that data to SDL, which SDL beams to your sound card. Go ahead and compile:

gcc -o tutorial03 tutorial03.c -lavutil -lavformat -lavcodec -lswscale -lz -lm \
`sdl-config --cflags --libs`

Hooray! The video is still going as fast as possible, but the audio is playing in time. Why is this? That's because the audio information has a sample rate — we're pumping out audio information as fast as we can, but the audio simply plays from that stream at its leisure according to the sample rate.

We're almost ready to start syncing video and audio ourselves, but first we need to do a little program reorganization. The method of queueing up audio and playing it using a separate thread worked very well: it made the code more managable and more modular. Before we start syncing the video to the audio, we need to make our code easier to deal with. Next time: Spawning Threads!

>> Spawning Threads

Function Reference
Data Reference

email:	dranger at gmail dot com

This work is licensed under the Creative Commons Attribution-Share Alike 2.5 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/2.5/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Code examples are based off of FFplay, Copyright (c) 2003 Fabrice Bellard, and a tutorial by Martin Bohme.