Basic Audio Processing Operations in WebRTC
In RTC, i.e. real-time audio and video communication, the audio-related problems to solve mainly include the following:
- Audio data capture and playback.
- Audio data processing, mainly of the captured and recorded audio, i.e. the so-called 3A processing: AEC (Acoustic Echo Cancellation), ANS (Automatic Noise Suppression) and AGC (Automatic Gain Control).
- Audio effects, such as voice changing, reverb and equalization.
- Audio encoding and decoding, including codecs such as AAC and Opus, plus handling of poor networks, such as NetEQ.
- Network transport. Encoded audio data is usually transported with RTP/RTCP.
- Building the whole audio processing pipeline.
The overall WebRTC audio processing pipeline is roughly as follows.

Apart from audio effects, the WebRTC audio pipeline contains all of the other parts listed above: audio capture and playback, audio processing, audio encoding and decoding, and network transport.

In WebRTC, audio capture and playback are handled by the AudioDeviceModule. Different operating systems talk to audio devices in different ways, so each platform implements its own platform-specific AudioDeviceModule. Some platforms even have several audio APIs: Linux has PulseAudio and ALSA, Android has the Java APIs provided by the framework as well as OpenSL ES and AAudio, and Windows likewise has multiple options.

The WebRTC audio pipeline only processes audio in 10 ms frames. Some platforms provide interfaces that capture and play audio in 10 ms chunks, such as Linux; others, such as Android and iOS, do not. The audio that the AudioDeviceModule plays or captures always flows in and out through the AudioDeviceBuffer in 10 ms frames. On platforms that cannot capture or play exactly 10 ms at a time, a FineAudioBuffer is inserted between the platform's AudioDeviceModule and the AudioDeviceBuffer to convert the platform's audio format into the 10 ms frames WebRTC can process.
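As a rough illustration of the buffering such an adaptation layer has to do, here is a minimal, self-contained sketch; the class and its names are made up for illustration and are not WebRTC's actual FineAudioBuffer:

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch only: platform callbacks deliver arbitrary numbers of
// samples, but the pipeline only consumes fixed-size 10 ms frames.
class TenMsChunker {
 public:
  TenMsChunker(int sample_rate_hz,
               std::function<void(const int16_t*, size_t)> on_10ms)
      : samples_per_10ms_(sample_rate_hz / 100), on_10ms_(std::move(on_10ms)) {}

  // Called from the platform capture callback with any number of samples.
  void OnPlatformAudio(const int16_t* data, size_t num_samples) {
    buffer_.insert(buffer_.end(), data, data + num_samples);
    // Hand out complete 10 ms frames; keep the remainder for the next call.
    while (buffer_.size() >= samples_per_10ms_) {
      on_10ms_(buffer_.data(), samples_per_10ms_);
      buffer_.erase(buffer_.begin(), buffer_.begin() + samples_per_10ms_);
    }
  }

 private:
  const size_t samples_per_10ms_;
  std::function<void(const int16_t*, size_t)> on_10ms_;
  std::vector<int16_t> buffer_;
};
```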
The AudioDeviceModule is connected to a module called AudioTransport. On the capture-and-send path, AudioTransport performs the audio processing, which is mainly the 3A processing. On the playback path there is a mixer that mixes the multiple received audio streams. Since echo cancellation removes the played-out signal from the recorded signal, the audio data pulled from AudioTransport for playback is also fed into the APM as the reference signal.

AudioTransport connects to AudioSendStream and AudioReceiveStream, which implement audio encoding and sending, receiving and decoding, and the network transport.
Basic Audio Operations in WebRTC
In the WebRTC audio pipeline, no matter how many audio streams the remote side sends, and regardless of each stream's sample rate and channel count, they all have to go through resampling, channel-count conversion and mixing, and end up as a single audio stream with the sample rate and channel count the output device accepts. Concretely, each stream is first resampled and channel-converted to a common sample rate and channel count and then mixed; after mixing, the result is resampled and channel-converted once more into the format the device accepts. (Throughout the WebRTC audio pipeline, every node represents samples as 16-bit integer values.)

WebRTC provides a number of utility classes and functions to carry out these operations.
How is mixing done?
WebRTC provides the AudioMixer interface as the abstraction for a mixer. The interface is defined (in webrtc/src/api/audio/audio_mixer.h) as follows:
```cpp
namespace webrtc {

// WORK IN PROGRESS
// This class is under development and is not yet intended for for use outside
// of WebRtc/Libjingle.
class AudioMixer : public rtc::RefCountInterface {
 public:
  // A callback class that all mixer participants must inherit from/implement.
  class Source {
   public:
    enum class AudioFrameInfo {
      kNormal,  // The samples in audio_frame are valid and should be used.
      kMuted,   // The samples in audio_frame should not be used, but
                // should be implicitly interpreted as zero. Other
                // fields in audio_frame may be read and should
                // contain meaningful values.
      kError,   // The audio_frame will not be used.
    };

    // Overwrites |audio_frame|. The data_ field is overwritten with
    // 10 ms of new audio (either 1 or 2 interleaved channels) at
    // |sample_rate_hz|. All fields in |audio_frame| must be updated.
    virtual AudioFrameInfo GetAudioFrameWithInfo(int sample_rate_hz,
                                                 AudioFrame* audio_frame) = 0;

    // A way for a mixer implementation to distinguish participants.
    virtual int Ssrc() const = 0;

    // A way for this source to say that GetAudioFrameWithInfo called
    // with this sample rate or higher will not cause quality loss.
    virtual int PreferredSampleRate() const = 0;

    virtual ~Source() {}
  };

  // Returns true if adding was successful. A source is never added
  // twice. Addition and removal can happen on different threads.
  virtual bool AddSource(Source* audio_source) = 0;

  // Removal is never attempted if a source has not been successfully
  // added to the mixer.
  virtual void RemoveSource(Source* audio_source) = 0;

  // Performs mixing by asking registered audio sources for audio. The
  // mixed result is placed in the provided AudioFrame. This method
  // will only be called from a single thread. The channels argument
  // specifies the number of channels of the mix result. The mixer
  // should mix at a rate that doesn't cause quality loss of the
  // sources' audio. The mixing rate is one of the rates listed in
  // AudioProcessing::NativeRate. All fields in
  // |audio_frame_for_mixing| must be updated.
  virtual void Mix(size_t number_of_channels,
                   AudioFrame* audio_frame_for_mixing) = 0;

 protected:
  // Since the mixer is reference counted, the destructor may be
  // called from any thread.
  ~AudioMixer() override {}
};
}  // namespace webrtc
```

WebRTC's AudioMixer mixes zero, one or more mixer sources into a single audio frame with the requested number of channels. The sample rate of the output frame is decided by the concrete AudioMixer implementation according to its own rules.
A mixer source provides the AudioMixer with mono or stereo audio frames at a specific sample rate; it is responsible for resampling whatever audio it has into the sample rate the AudioMixer asks for. It can also report its preferred output sample rate, to help the AudioMixer compute an appropriate output rate. A mixer source identifies itself through Ssrc().
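To illustrate what a mixer source has to provide, here is a minimal, hypothetical Source implementation that produces silence. Only the overridden AudioMixer::Source interface above comes from WebRTC; the class itself and its internals are made up for illustration, and it assumes the WebRTC headers defining AudioMixer (api/audio/audio_mixer.h) and AudioFrame are included:

```cpp
#include <cstring>

// Hypothetical mixer source producing 10 ms of silence at whatever sample
// rate the mixer asks for.
class SilentSource : public webrtc::AudioMixer::Source {
 public:
  explicit SilentSource(int ssrc) : ssrc_(ssrc) {}

  AudioFrameInfo GetAudioFrameWithInfo(
      int sample_rate_hz, webrtc::AudioFrame* audio_frame) override {
    // A real source would resample its own data to |sample_rate_hz| here.
    audio_frame->sample_rate_hz_ = sample_rate_hz;
    audio_frame->samples_per_channel_ = sample_rate_hz / 100;  // 10 ms
    audio_frame->num_channels_ = 1;
    std::memset(audio_frame->mutable_data(), 0,
                audio_frame->samples_per_channel_ * sizeof(int16_t));
    return AudioFrameInfo::kNormal;
  }

  int Ssrc() const override { return ssrc_; }
  int PreferredSampleRate() const override { return 48000; }

 private:
  const int ssrc_;
};
```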
WebRTC provides an implementation of AudioMixer, the AudioMixerImpl class, located in webrtc/src/modules/audio_mixer/. The class is defined (in webrtc/src/modules/audio_mixer/audio_mixer_impl.h) as follows:
```cpp
namespace webrtc {

typedef std::vector<AudioFrame*> AudioFrameList;

class AudioMixerImpl : public AudioMixer {
 public:
  struct SourceStatus {
    SourceStatus(Source* audio_source, bool is_mixed, float gain)
        : audio_source(audio_source), is_mixed(is_mixed), gain(gain) {}
    Source* audio_source = nullptr;
    bool is_mixed = false;
    float gain = 0.0f;

    // A frame that will be passed to audio_source->GetAudioFrameWithInfo.
    AudioFrame audio_frame;
  };

  using SourceStatusList = std::vector<std::unique_ptr<SourceStatus>>;

  // AudioProcessing only accepts 10 ms frames.
  static const int kFrameDurationInMs = 10;
  static const int kMaximumAmountOfMixedAudioSources = 3;

  static rtc::scoped_refptr<AudioMixerImpl> Create();

  static rtc::scoped_refptr<AudioMixerImpl> Create(
      std::unique_ptr<OutputRateCalculator> output_rate_calculator,
      bool use_limiter);

  ~AudioMixerImpl() override;

  // AudioMixer functions
  bool AddSource(Source* audio_source) override;
  void RemoveSource(Source* audio_source) override;

  void Mix(size_t number_of_channels,
           AudioFrame* audio_frame_for_mixing) override
      RTC_LOCKS_EXCLUDED(crit_);

  // Returns true if the source was mixed last round. Returns
  // false and logs an error if the source was never added to the
  // mixer.
  bool GetAudioSourceMixabilityStatusForTest(Source* audio_source) const;

 protected:
  AudioMixerImpl(std::unique_ptr<OutputRateCalculator> output_rate_calculator,
                 bool use_limiter);

 private:
  // Set mixing frequency through OutputFrequencyCalculator.
  void CalculateOutputFrequency();
  // Get mixing frequency.
  int OutputFrequency() const;

  // Compute what audio sources to mix from audio_source_list_. Ramp
  // in and out. Update mixed status. Mixes up to
  // kMaximumAmountOfMixedAudioSources audio sources.
  AudioFrameList GetAudioFromSources() RTC_EXCLUSIVE_LOCKS_REQUIRED(crit_);

  // The critical section lock guards audio source insertion and
  // removal, which can be done from any thread. The race checker
  // checks that mixing is done sequentially.
  rtc::CriticalSection crit_;
  rtc::RaceChecker race_checker_;

  std::unique_ptr<OutputRateCalculator> output_rate_calculator_;
  // The current sample frequency and sample size when mixing.
  int output_frequency_ RTC_GUARDED_BY(race_checker_);
  size_t sample_size_ RTC_GUARDED_BY(race_checker_);

  // List of all audio sources. Note all lists are disjunct
  SourceStatusList audio_source_list_ RTC_GUARDED_BY(crit_);  // May be mixed.

  // Component that handles actual adding of audio frames.
  FrameCombiner frame_combiner_ RTC_GUARDED_BY(race_checker_);

  RTC_DISALLOW_COPY_AND_ASSIGN(AudioMixerImpl);
};
}  // namespace webrtc
```

The implementation of AudioMixerImpl (in webrtc/src/modules/audio_mixer/audio_mixer_impl.cc) is as follows:
```cpp
namespace webrtc {
namespace {

struct SourceFrame {
  SourceFrame(AudioMixerImpl::SourceStatus* source_status,
              AudioFrame* audio_frame,
              bool muted)
      : source_status(source_status), audio_frame(audio_frame), muted(muted) {
    RTC_DCHECK(source_status);
    RTC_DCHECK(audio_frame);
    if (!muted) {
      energy = AudioMixerCalculateEnergy(*audio_frame);
    }
  }

  SourceFrame(AudioMixerImpl::SourceStatus* source_status,
              AudioFrame* audio_frame,
              bool muted,
              uint32_t energy)
      : source_status(source_status),
        audio_frame(audio_frame),
        muted(muted),
        energy(energy) {
    RTC_DCHECK(source_status);
    RTC_DCHECK(audio_frame);
  }

  AudioMixerImpl::SourceStatus* source_status = nullptr;
  AudioFrame* audio_frame = nullptr;
  bool muted = true;
  uint32_t energy = 0;
};

// ShouldMixBefore(a, b) is used to select mixer sources.
bool ShouldMixBefore(const SourceFrame& a, const SourceFrame& b) {
  if (a.muted != b.muted) {
    return b.muted;
  }

  const auto a_activity = a.audio_frame->vad_activity_;
  const auto b_activity = b.audio_frame->vad_activity_;

  if (a_activity != b_activity) {
    return a_activity == AudioFrame::kVadActive;
  }

  return a.energy > b.energy;
}

void RampAndUpdateGain(
    const std::vector<SourceFrame>& mixed_sources_and_frames) {
  for (const auto& source_frame : mixed_sources_and_frames) {
    float target_gain = source_frame.source_status->is_mixed ? 1.0f : 0.0f;
    Ramp(source_frame.source_status->gain, target_gain,
         source_frame.audio_frame);
    source_frame.source_status->gain = target_gain;
  }
}

AudioMixerImpl::SourceStatusList::const_iterator FindSourceInList(
    AudioMixerImpl::Source const* audio_source,
    AudioMixerImpl::SourceStatusList const* audio_source_list) {
  return std::find_if(
      audio_source_list->begin(), audio_source_list->end(),
      [audio_source](const std::unique_ptr<AudioMixerImpl::SourceStatus>& p) {
        return p->audio_source == audio_source;
      });
}
}  // namespace

AudioMixerImpl::AudioMixerImpl(
    std::unique_ptr<OutputRateCalculator> output_rate_calculator,
    bool use_limiter)
    : output_rate_calculator_(std::move(output_rate_calculator)),
      output_frequency_(0),
      sample_size_(0),
      audio_source_list_(),
      frame_combiner_(use_limiter) {}

AudioMixerImpl::~AudioMixerImpl() {}

rtc::scoped_refptr<AudioMixerImpl> AudioMixerImpl::Create() {
  return Create(std::unique_ptr<DefaultOutputRateCalculator>(
                    new DefaultOutputRateCalculator()),
                true);
}

rtc::scoped_refptr<AudioMixerImpl> AudioMixerImpl::Create(
    std::unique_ptr<OutputRateCalculator> output_rate_calculator,
    bool use_limiter) {
  return rtc::scoped_refptr<AudioMixerImpl>(
      new rtc::RefCountedObject<AudioMixerImpl>(
          std::move(output_rate_calculator), use_limiter));
}

void AudioMixerImpl::Mix(size_t number_of_channels,
                         AudioFrame* audio_frame_for_mixing) {
  RTC_DCHECK(number_of_channels >= 1);
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);

  CalculateOutputFrequency();

  {
    rtc::CritScope lock(&crit_);
    const size_t number_of_streams = audio_source_list_.size();
    frame_combiner_.Combine(GetAudioFromSources(), number_of_channels,
                            OutputFrequency(), number_of_streams,
                            audio_frame_for_mixing);
  }

  return;
}

void AudioMixerImpl::CalculateOutputFrequency() {
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);
  rtc::CritScope lock(&crit_);

  std::vector<int> preferred_rates;
  std::transform(audio_source_list_.begin(), audio_source_list_.end(),
                 std::back_inserter(preferred_rates),
                 [&](std::unique_ptr<SourceStatus>& a) {
                   return a->audio_source->PreferredSampleRate();
                 });

  output_frequency_ =
      output_rate_calculator_->CalculateOutputRate(preferred_rates);
  sample_size_ = (output_frequency_ * kFrameDurationInMs) / 1000;
}

int AudioMixerImpl::OutputFrequency() const {
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);
  return output_frequency_;
}

bool AudioMixerImpl::AddSource(Source* audio_source) {
  RTC_DCHECK(audio_source);
  rtc::CritScope lock(&crit_);
  RTC_DCHECK(FindSourceInList(audio_source, &audio_source_list_) ==
             audio_source_list_.end())
      << "Source already added to mixer";
  audio_source_list_.emplace_back(new SourceStatus(audio_source, false, 0));
  return true;
}

void AudioMixerImpl::RemoveSource(Source* audio_source) {
  RTC_DCHECK(audio_source);
  rtc::CritScope lock(&crit_);
  const auto iter = FindSourceInList(audio_source, &audio_source_list_);
  RTC_DCHECK(iter != audio_source_list_.end())
      << "Source not present in mixer";
  audio_source_list_.erase(iter);
}

AudioFrameList AudioMixerImpl::GetAudioFromSources() {
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);
  AudioFrameList result;
  std::vector<SourceFrame> audio_source_mixing_data_list;
  std::vector<SourceFrame> ramp_list;

  // Get audio from the audio sources and put it in the SourceFrame vector.
  for (auto& source_and_status : audio_source_list_) {
    const auto audio_frame_info =
        source_and_status->audio_source->GetAudioFrameWithInfo(
            OutputFrequency(), &source_and_status->audio_frame);

    if (audio_frame_info == Source::AudioFrameInfo::kError) {
      RTC_LOG_F(LS_WARNING) << "failed to GetAudioFrameWithInfo() from source";
      continue;
    }
    audio_source_mixing_data_list.emplace_back(
        source_and_status.get(), &source_and_status->audio_frame,
        audio_frame_info == Source::AudioFrameInfo::kMuted);
  }

  // Sort frames by sorting function.
  std::sort(audio_source_mixing_data_list.begin(),
            audio_source_mixing_data_list.end(), ShouldMixBefore);

  int max_audio_frame_counter = kMaximumAmountOfMixedAudioSources;

  // Go through list in order and put unmuted frames in result list.
  for (const auto& p : audio_source_mixing_data_list) {
    // Filter muted.
    if (p.muted) {
      p.source_status->is_mixed = false;
      continue;
    }

    // Add frame to result vector for mixing.
    bool is_mixed = false;
    if (max_audio_frame_counter > 0) {
      --max_audio_frame_counter;
      result.push_back(p.audio_frame);
      ramp_list.emplace_back(p.source_status, p.audio_frame, false, -1);
      is_mixed = true;
    }
    p.source_status->is_mixed = is_mixed;
  }

  RampAndUpdateGain(ramp_list);
  return result;
}

bool AudioMixerImpl::GetAudioSourceMixabilityStatusForTest(
    AudioMixerImpl::Source* audio_source) const {
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);
  rtc::CritScope lock(&crit_);

  const auto iter = FindSourceInList(audio_source, &audio_source_list_);
  if (iter != audio_source_list_.end()) {
    return (*iter)->is_mixed;
  }

  RTC_LOG(LS_ERROR) << "Audio source unknown";
  return false;
}
}  // namespace webrtc
```

It is easy to see that AudioMixerImpl's AddSource(Source* audio_source) and RemoveSource(Source* audio_source) are just ordinary container operations, except that adding a mixer source twice, or removing one that was never added, is not allowed. The heart of the class is clearly Mix(size_t number_of_channels, AudioFrame* audio_frame_for_mixing).
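Before walking through Mix(), a minimal usage sketch shows how the pieces fit together. It is hypothetical and only relies on the interfaces quoted above, reusing the SilentSource sketch from earlier:

```cpp
// Hypothetical usage of AudioMixerImpl, based only on the interfaces above.
rtc::scoped_refptr<webrtc::AudioMixerImpl> mixer =
    webrtc::AudioMixerImpl::Create();

SilentSource source_a(/*ssrc=*/111);
SilentSource source_b(/*ssrc=*/222);
mixer->AddSource(&source_a);
mixer->AddSource(&source_b);

// Typically called once per 10 ms on the playout side.
webrtc::AudioFrame mixed;
mixer->Mix(/*number_of_channels=*/2, &mixed);

mixer->RemoveSource(&source_b);
```

With that context, back to Mix() itself: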
```cpp
void AudioMixerImpl::Mix(size_t number_of_channels,
                         AudioFrame* audio_frame_for_mixing) {
  RTC_DCHECK(number_of_channels >= 1);
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);

  CalculateOutputFrequency();

  {
    rtc::CritScope lock(&crit_);
    const size_t number_of_streams = audio_source_list_.size();
    frame_combiner_.Combine(GetAudioFromSources(), number_of_channels,
                            OutputFrequency(), number_of_streams,
                            audio_frame_for_mixing);
  }

  return;
}
```

The mixing process in AudioMixerImpl::Mix() runs roughly as follows.
Calculating the output sample rate
The output sample rate is calculated as follows:
```cpp
void AudioMixerImpl::CalculateOutputFrequency() {
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);
  rtc::CritScope lock(&crit_);

  std::vector<int> preferred_rates;
  std::transform(audio_source_list_.begin(), audio_source_list_.end(),
                 std::back_inserter(preferred_rates),
                 [&](std::unique_ptr<SourceStatus>& a) {
                   return a->audio_source->PreferredSampleRate();
                 });

  output_frequency_ =
      output_rate_calculator_->CalculateOutputRate(preferred_rates);
  sample_size_ = (output_frequency_ * kFrameDurationInMs) / 1000;
}
```

AudioMixerImpl first collects each mixer source's preferred sample rate into a list, then computes the output sample rate through the OutputRateCalculator interface (in webrtc/modules/audio_mixer/output_rate_calculator.h):
```cpp
class OutputRateCalculator {
 public:
  virtual int CalculateOutputRate(
      const std::vector<int>& preferred_sample_rates) = 0;
  virtual ~OutputRateCalculator() {}
};
```

WebRTC provides a default implementation of the OutputRateCalculator interface, DefaultOutputRateCalculator, whose class definition (webrtc/src/modules/audio_mixer/default_output_rate_calculator.h) is:
```cpp
namespace webrtc {

class DefaultOutputRateCalculator : public OutputRateCalculator {
 public:
  static const int kDefaultFrequency = 48000;

  // Produces the least native rate greater or equal to the preferred
  // sample rates. A native rate is one in
  // AudioProcessing::NativeRate. If |preferred_sample_rates| is
  // empty, returns |kDefaultFrequency|.
  int CalculateOutputRate(
      const std::vector<int>& preferred_sample_rates) override;
  ~DefaultOutputRateCalculator() override {}
};

}  // namespace webrtc
```

The class definition is simple. The default output-sample-rate calculation works as follows:
```cpp
namespace webrtc {

int DefaultOutputRateCalculator::CalculateOutputRate(
    const std::vector<int>& preferred_sample_rates) {
  if (preferred_sample_rates.empty()) {
    return DefaultOutputRateCalculator::kDefaultFrequency;
  }
  using NativeRate = AudioProcessing::NativeRate;
  const int maximal_frequency = *std::max_element(
      preferred_sample_rates.begin(), preferred_sample_rates.end());

  RTC_DCHECK_LE(NativeRate::kSampleRate8kHz, maximal_frequency);
  RTC_DCHECK_GE(NativeRate::kSampleRate48kHz, maximal_frequency);

  static constexpr NativeRate native_rates[] = {
      NativeRate::kSampleRate8kHz, NativeRate::kSampleRate16kHz,
      NativeRate::kSampleRate32kHz, NativeRate::kSampleRate48kHz};
  const auto* rounded_up_index = std::lower_bound(
      std::begin(native_rates), std::end(native_rates), maximal_frequency);
  RTC_DCHECK(rounded_up_index != std::end(native_rates));
  return *rounded_up_index;
}
}  // namespace webrtc
```

Internally, WebRTC supports a set of native sample rates for audio: 8 kHz, 16 kHz, 32 kHz and 48 kHz. DefaultOutputRateCalculator takes the maximum of the preferred sample rates passed in and returns the smallest native rate that is greater than or equal to that maximum. In other words, if any mixer source of AudioMixerImpl prefers a rate above 48 kHz, the calculation fails (the DCHECK fires).
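For example, a small usage sketch based on the code above: with preferred rates of 16 kHz and 44.1 kHz, the maximum is 44100, and the smallest native rate not below it is 48000.

```cpp
// Usage sketch for DefaultOutputRateCalculator, based on the code above.
webrtc::DefaultOutputRateCalculator calculator;
std::vector<int> preferred = {16000, 44100};
int rate = calculator.CalculateOutputRate(preferred);  // -> 48000
int empty_rate = calculator.CalculateOutputRate({});   // -> kDefaultFrequency (48000)
```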
Getting the list of audio frames
AudioMixerImpl::GetAudioFromSources() obtains the list of audio frames:
```cpp
AudioFrameList AudioMixerImpl::GetAudioFromSources() {
  RTC_DCHECK_RUNS_SERIALIZED(&race_checker_);
  AudioFrameList result;
  std::vector<SourceFrame> audio_source_mixing_data_list;
  std::vector<SourceFrame> ramp_list;

  // Get audio from the audio sources and put it in the SourceFrame vector.
  for (auto& source_and_status : audio_source_list_) {
    const auto audio_frame_info =
        source_and_status->audio_source->GetAudioFrameWithInfo(
            OutputFrequency(), &source_and_status->audio_frame);

    if (audio_frame_info == Source::AudioFrameInfo::kError) {
      RTC_LOG_F(LS_WARNING) << "failed to GetAudioFrameWithInfo() from source";
      continue;
    }
    audio_source_mixing_data_list.emplace_back(
        source_and_status.get(), &source_and_status->audio_frame,
        audio_frame_info == Source::AudioFrameInfo::kMuted);
  }

  // Sort frames by sorting function.
  std::sort(audio_source_mixing_data_list.begin(),
            audio_source_mixing_data_list.end(), ShouldMixBefore);

  int max_audio_frame_counter = kMaximumAmountOfMixedAudioSources;

  // Go through list in order and put unmuted frames in result list.
  for (const auto& p : audio_source_mixing_data_list) {
    // Filter muted.
    if (p.muted) {
      p.source_status->is_mixed = false;
      continue;
    }

    // Add frame to result vector for mixing.
    bool is_mixed = false;
    if (max_audio_frame_counter > 0) {
      --max_audio_frame_counter;
      result.push_back(p.audio_frame);
      ramp_list.emplace_back(p.source_status, p.audio_frame, false, -1);
      is_mixed = true;
    }
    p.source_status->is_mixed = is_mixed;
  }

  RampAndUpdateGain(ramp_list);
  return result;
}
```

GetAudioFromSources() works roughly like this:
- Ask every mixer source for a 10 ms frame at the output sample rate via GetAudioFrameWithInfo(). For each unmuted frame, its energy is computed as the sum of the squares of all its sample values (AudioMixerCalculateEnergy()).
- Sort the frames with ShouldMixBefore(): unmuted frames before muted ones, VAD-active frames before inactive ones, and higher-energy frames first.
- From the sorted frame list, select at most three of the strongest audio frames to be returned.
- Ramp the signals of the selected audio frames and update their gains.

Ramp() (in webrtc/src/modules/audio_mixer/audio_frame_manipulator.cc) executes as follows:
```cpp
void Ramp(float start_gain, float target_gain, AudioFrame* audio_frame) {
  RTC_DCHECK(audio_frame);
  RTC_DCHECK_GE(start_gain, 0.0f);
  RTC_DCHECK_GE(target_gain, 0.0f);
  if (start_gain == target_gain || audio_frame->muted()) {
    return;
  }

  size_t samples = audio_frame->samples_per_channel_;
  RTC_DCHECK_LT(0, samples);
  float increment = (target_gain - start_gain) / samples;
  float gain = start_gain;
  int16_t* frame_data = audio_frame->mutable_data();
  for (size_t i = 0; i < samples; ++i) {
    // If the audio is interleaved of several channels, we want to
    // apply the same gain change to the ith sample of every channel.
    for (size_t ch = 0; ch < audio_frame->num_channels_; ++ch) {
      frame_data[audio_frame->num_channels_ * i + ch] *= gain;
    }
    gain += increment;
  }
}
```

This step is needed because, at any given mixing instant, frames of the same audio stream may be included in or excluded from the mix depending on their relative signal strength; ramping makes the level changes of any particular stream sound smoother.
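A small usage sketch, relying only on the Ramp() signature above (the call site is hypothetical; inside AudioMixerImpl this is driven by RampAndUpdateGain() shown earlier):

```cpp
// Fade a 10 ms frame in from silence to full level.
webrtc::AudioFrame frame;  // assume it already holds 10 ms of audio
webrtc::Ramp(/*start_gain=*/0.0f, /*target_gain=*/1.0f, &frame);
```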
FrameCombiner
FrameCombiner does the actual mixing:
```cpp
void FrameCombiner::Combine(const std::vector<AudioFrame*>& mix_list,
                            size_t number_of_channels,
                            int sample_rate,
                            size_t number_of_streams,
                            AudioFrame* audio_frame_for_mixing) {
  RTC_DCHECK(audio_frame_for_mixing);
  LogMixingStats(mix_list, sample_rate, number_of_streams);

  SetAudioFrameFields(mix_list, number_of_channels, sample_rate,
                      number_of_streams, audio_frame_for_mixing);

  const size_t samples_per_channel = static_cast<size_t>(
      (sample_rate * webrtc::AudioMixerImpl::kFrameDurationInMs) / 1000);

  for (const auto* frame : mix_list) {
    RTC_DCHECK_EQ(samples_per_channel, frame->samples_per_channel_);
    RTC_DCHECK_EQ(sample_rate, frame->sample_rate_hz_);
  }

  // The 'num_channels_' field of frames in 'mix_list' could be
  // different from 'number_of_channels'.
  for (auto* frame : mix_list) {
    RemixFrame(number_of_channels, frame);
  }

  if (number_of_streams <= 1) {
    MixFewFramesWithNoLimiter(mix_list, audio_frame_for_mixing);
    return;
  }

  std::array<OneChannelBuffer, kMaximumAmountOfChannels> mixing_buffer =
      MixToFloatFrame(mix_list, samples_per_channel, number_of_channels);

  // Put float data in an AudioFrameView.
  std::array<float*, kMaximumAmountOfChannels> channel_pointers{};
  for (size_t i = 0; i < number_of_channels; ++i) {
    channel_pointers[i] = &mixing_buffer[i][0];
  }
  AudioFrameView<float> mixing_buffer_view(
      &channel_pointers[0], number_of_channels, samples_per_channel);

  if (use_limiter_) {
    RunLimiter(mixing_buffer_view, &limiter_);
  }

  InterleaveToAudioFrame(mixing_buffer_view, audio_frame_for_mixing);
}
```

As the code shows, the actual mixing (MixToFloatFrame()) simply adds up the sample data of the different streams' audio frames.
The limiter step (RunLimiter()) then processes the mixed signal using the limiter from the AGC machinery.

The processing so far produces floating-point sample data; the final step (InterleaveToAudioFrame()) converts the float data into the required 16-bit integer samples.

At this point the mixing is complete.

Conclusion: mixing is just adding up the sample data of the individual audio streams.
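To make "adding up the sample data" concrete, here is a small self-contained sketch. It is not WebRTC code (WebRTC mixes in float and relies on the limiter rather than simple clamping), but it captures the core operation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mix several 16-bit mono frames of equal length by adding samples,
// saturating to the int16_t range instead of letting the sum wrap around.
std::vector<int16_t> MixFrames(const std::vector<std::vector<int16_t>>& frames) {
  if (frames.empty()) return {};
  std::vector<int16_t> mixed(frames[0].size(), 0);
  for (size_t i = 0; i < mixed.size(); ++i) {
    int32_t sum = 0;
    for (const auto& frame : frames) {
      sum += frame[i];  // accumulate in a wider type to avoid overflow
    }
    mixed[i] = static_cast<int16_t>(
        std::min<int32_t>(32767, std::max<int32_t>(-32768, sum)));
  }
  return mixed;
}
```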
How is channel conversion done?
WebRTC provides utility functions for converting audio frames between mono, stereo and quad (four channels), located in webrtc/audio/utility/audio_frame_operations.cc. Their implementations show exactly what channel-count conversion means.
Mono to stereo:

```cpp
void AudioFrameOperations::MonoToStereo(const int16_t* src_audio,
                                        size_t samples_per_channel,
                                        int16_t* dst_audio) {
  for (size_t i = 0; i < samples_per_channel; i++) {
    dst_audio[2 * i] = src_audio[i];
    dst_audio[2 * i + 1] = src_audio[i];
  }
}

int AudioFrameOperations::MonoToStereo(AudioFrame* frame) {
  if (frame->num_channels_ != 1) {
    return -1;
  }
  if ((frame->samples_per_channel_ * 2) >= AudioFrame::kMaxDataSizeSamples) {
    // Not enough memory to expand from mono to stereo.
    return -1;
  }

  if (!frame->muted()) {
    // TODO(yujo): this operation can be done in place.
    int16_t data_copy[AudioFrame::kMaxDataSizeSamples];
    memcpy(data_copy, frame->data(),
           sizeof(int16_t) * frame->samples_per_channel_);
    MonoToStereo(data_copy, frame->samples_per_channel_,
                 frame->mutable_data());
  }
  frame->num_channels_ = 2;
  return 0;
}
```

Mono to stereo simply copies the single channel's data so that both channels play the same audio.
Stereo to mono:

```cpp
void AudioFrameOperations::StereoToMono(const int16_t* src_audio,
                                        size_t samples_per_channel,
                                        int16_t* dst_audio) {
  for (size_t i = 0; i < samples_per_channel; i++) {
    dst_audio[i] =
        (static_cast<int32_t>(src_audio[2 * i]) + src_audio[2 * i + 1]) >> 1;
  }
}

int AudioFrameOperations::StereoToMono(AudioFrame* frame) {
  if (frame->num_channels_ != 2) {
    return -1;
  }

  RTC_DCHECK_LE(frame->samples_per_channel_ * 2,
                AudioFrame::kMaxDataSizeSamples);

  if (!frame->muted()) {
    StereoToMono(frame->data(), frame->samples_per_channel_,
                 frame->mutable_data());
  }
  frame->num_channels_ = 1;
  return 0;
}
```

Stereo to mono adds the samples of the two channels and divides by two to produce a single channel of audio data.
Quad to stereo:

```cpp
void AudioFrameOperations::QuadToStereo(const int16_t* src_audio,
                                        size_t samples_per_channel,
                                        int16_t* dst_audio) {
  for (size_t i = 0; i < samples_per_channel; i++) {
    dst_audio[i * 2] =
        (static_cast<int32_t>(src_audio[4 * i]) + src_audio[4 * i + 1]) >> 1;
    dst_audio[i * 2 + 1] =
        (static_cast<int32_t>(src_audio[4 * i + 2]) + src_audio[4 * i + 3]) >>
        1;
  }
}

int AudioFrameOperations::QuadToStereo(AudioFrame* frame) {
  if (frame->num_channels_ != 4) {
    return -1;
  }

  RTC_DCHECK_LE(frame->samples_per_channel_ * 4,
                AudioFrame::kMaxDataSizeSamples);

  if (!frame->muted()) {
    QuadToStereo(frame->data(), frame->samples_per_channel_,
                 frame->mutable_data());
  }
  frame->num_channels_ = 2;
  return 0;
}
```

Quad to stereo averages channels 1 and 2 into one output channel, and channels 3 and 4 into the other.
Quad to mono:

```cpp
void AudioFrameOperations::QuadToMono(const int16_t* src_audio,
                                      size_t samples_per_channel,
                                      int16_t* dst_audio) {
  for (size_t i = 0; i < samples_per_channel; i++) {
    dst_audio[i] =
        (static_cast<int32_t>(src_audio[4 * i]) + src_audio[4 * i + 1] +
         src_audio[4 * i + 2] + src_audio[4 * i + 3]) >> 2;
  }
}

int AudioFrameOperations::QuadToMono(AudioFrame* frame) {
  if (frame->num_channels_ != 4) {
    return -1;
  }

  RTC_DCHECK_LE(frame->samples_per_channel_ * 4,
                AudioFrame::kMaxDataSizeSamples);

  if (!frame->muted()) {
    QuadToMono(frame->data(), frame->samples_per_channel_,
               frame->mutable_data());
  }
  frame->num_channels_ = 1;
  return 0;
}
```

Quad to mono adds the samples of all four channels and divides by four to produce a single channel of audio data.
For the other audio-data operations WebRTC provides, refer to the WebRTC header files.
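A short usage sketch of the pointer-based overloads shown above (the buffer sizes are illustrative; they correspond to 10 ms at 48 kHz):

```cpp
// Upmix 10 ms of mono audio to stereo, then downmix it back to mono,
// using the AudioFrameOperations overloads quoted above.
const size_t samples_per_channel = 480;  // 48000 Hz / 100
int16_t mono[480] = {0};                 // assume this holds real audio
int16_t stereo[960];
webrtc::AudioFrameOperations::MonoToStereo(mono, samples_per_channel, stereo);
webrtc::AudioFrameOperations::StereoToMono(stereo, samples_per_channel, mono);
```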
Resampling
Resampling converts audio data from one sample rate to another. In WebRTC, resampling is mainly performed by the PushResampler, PushSincResampler and SincResampler components. For example, the Resample() helper in webrtc/src/audio/audio_transport_impl.cc:
```cpp
// Resample audio in |frame| to given sample rate preserving the
// channel count and place the result in |destination|.
int Resample(const AudioFrame& frame,
             const int destination_sample_rate,
             PushResampler<int16_t>* resampler,
             int16_t* destination) {
  const int number_of_channels = static_cast<int>(frame.num_channels_);
  const int target_number_of_samples_per_channel =
      destination_sample_rate / 100;
  resampler->InitializeIfNeeded(frame.sample_rate_hz_, destination_sample_rate,
                                number_of_channels);

  // TODO(yujo): make resampler take an AudioFrame, and add special case
  // handling of muted frames.
  return resampler->Resample(
      frame.data(), frame.samples_per_channel_ * number_of_channels,
      destination, number_of_channels * target_number_of_samples_per_channel);
}
```

PushResampler is a template class with a fairly small interface. Its definition (in webrtc/src/common_audio/resampler/include/push_resampler.h) is as follows:
```cpp
namespace webrtc {

class PushSincResampler;

// Wraps PushSincResampler to provide stereo support.
// TODO(ajm): add support for an arbitrary number of channels.
template <typename T>
class PushResampler {
 public:
  PushResampler();
  virtual ~PushResampler();

  // Must be called whenever the parameters change. Free to be called at any
  // time as it is a no-op if parameters have not changed since the last call.
  int InitializeIfNeeded(int src_sample_rate_hz,
                         int dst_sample_rate_hz,
                         size_t num_channels);

  // Returns the total number of samples provided in destination (e.g. 32 kHz,
  // 2 channel audio gives 640 samples).
  int Resample(const T* src, size_t src_length, T* dst, size_t dst_capacity);

 private:
  std::unique_ptr<PushSincResampler> sinc_resampler_;
  std::unique_ptr<PushSincResampler> sinc_resampler_right_;
  int src_sample_rate_hz_;
  int dst_sample_rate_hz_;
  size_t num_channels_;
  std::unique_ptr<T[]> src_left_;
  std::unique_ptr<T[]> src_right_;
  std::unique_ptr<T[]> dst_left_;
  std::unique_ptr<T[]> dst_right_;
};

}  // namespace webrtc
```

The implementation of this class (in webrtc/src/common_audio/resampler/push_resampler.cc) is as follows:
```cpp
template <typename T>
PushResampler<T>::PushResampler()
    : src_sample_rate_hz_(0), dst_sample_rate_hz_(0), num_channels_(0) {}

template <typename T>
PushResampler<T>::~PushResampler() {}

template <typename T>
int PushResampler<T>::InitializeIfNeeded(int src_sample_rate_hz,
                                         int dst_sample_rate_hz,
                                         size_t num_channels) {
  CheckValidInitParams(src_sample_rate_hz, dst_sample_rate_hz, num_channels);

  if (src_sample_rate_hz == src_sample_rate_hz_ &&
      dst_sample_rate_hz == dst_sample_rate_hz_ &&
      num_channels == num_channels_) {
    // No-op if settings haven't changed.
    return 0;
  }

  if (src_sample_rate_hz <= 0 || dst_sample_rate_hz <= 0 ||
      num_channels <= 0 || num_channels > 2) {
    return -1;
  }

  src_sample_rate_hz_ = src_sample_rate_hz;
  dst_sample_rate_hz_ = dst_sample_rate_hz;
  num_channels_ = num_channels;

  const size_t src_size_10ms_mono =
      static_cast<size_t>(src_sample_rate_hz / 100);
  const size_t dst_size_10ms_mono =
      static_cast<size_t>(dst_sample_rate_hz / 100);
  sinc_resampler_.reset(
      new PushSincResampler(src_size_10ms_mono, dst_size_10ms_mono));
  if (num_channels_ == 2) {
    src_left_.reset(new T[src_size_10ms_mono]);
    src_right_.reset(new T[src_size_10ms_mono]);
    dst_left_.reset(new T[dst_size_10ms_mono]);
    dst_right_.reset(new T[dst_size_10ms_mono]);
    sinc_resampler_right_.reset(
        new PushSincResampler(src_size_10ms_mono, dst_size_10ms_mono));
  }

  return 0;
}

template <typename T>
int PushResampler<T>::Resample(const T* src,
                               size_t src_length,
                               T* dst,
                               size_t dst_capacity) {
  CheckExpectedBufferSizes(src_length, dst_capacity, num_channels_,
                           src_sample_rate_hz_, dst_sample_rate_hz_);

  if (src_sample_rate_hz_ == dst_sample_rate_hz_) {
    // The old resampler provides this memcpy facility in the case of matching
    // sample rates, so reproduce it here for the sinc resampler.
    memcpy(dst, src, src_length * sizeof(T));
    return static_cast<int>(src_length);
  }

  if (num_channels_ == 2) {
    const size_t src_length_mono = src_length / num_channels_;
    const size_t dst_capacity_mono = dst_capacity / num_channels_;
    T* deinterleaved[] = {src_left_.get(), src_right_.get()};
    Deinterleave(src, src_length_mono, num_channels_, deinterleaved);

    size_t dst_length_mono = sinc_resampler_->Resample(
        src_left_.get(), src_length_mono, dst_left_.get(), dst_capacity_mono);
    sinc_resampler_right_->Resample(src_right_.get(), src_length_mono,
                                    dst_right_.get(), dst_capacity_mono);

    deinterleaved[0] = dst_left_.get();
    deinterleaved[1] = dst_right_.get();
    Interleave(deinterleaved, dst_length_mono, num_channels_, dst);
    return static_cast<int>(dst_length_mono * num_channels_);
  } else {
    return static_cast<int>(
        sinc_resampler_->Resample(src, src_length, dst, dst_capacity));
  }
}

// Explictly generate required instantiations.
template class PushResampler<int16_t>;
template class PushResampler<float>;
```

PushResampler<T>::InitializeIfNeeded() sets up buffers and the necessary PushSincResampler instances according to the source and destination sample rates.
In PushResampler<T>::Resample(), the actual resampling is done by PushSincResampler, which resamples a single channel of audio data. For stereo data, PushResampler<T>::Resample() first splits the frame into two mono channels, resamples each separately, and finally interleaves them back together.
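A minimal usage sketch of PushResampler, based on the Resample() helper above (mono, one 10 ms frame from 48 kHz to 16 kHz; the buffers and their sizes are illustrative):

```cpp
// Resample one 10 ms mono frame from 48 kHz to 16 kHz.
webrtc::PushResampler<int16_t> resampler;
resampler.InitializeIfNeeded(/*src_sample_rate_hz=*/48000,
                             /*dst_sample_rate_hz=*/16000,
                             /*num_channels=*/1);
int16_t src[480] = {0};  // 48000 / 100 samples; assume real audio here
int16_t dst[160];        // 16000 / 100 samples
int written = resampler.Resample(src, 480, dst, 160);  // returns 160
```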
The helpers in webrtc/src/common_audio/include/audio_util.h that split interleaved stereo data into two mono buffers, and merge two mono buffers back into an interleaved stereo frame, are implemented as follows:
```cpp
// Deinterleave audio from |interleaved| to the channel buffers pointed to
// by |deinterleaved|. There must be sufficient space allocated in the
// |deinterleaved| buffers (|num_channel| buffers with |samples_per_channel|
// per buffer).
template <typename T>
void Deinterleave(const T* interleaved,
                  size_t samples_per_channel,
                  size_t num_channels,
                  T* const* deinterleaved) {
  for (size_t i = 0; i < num_channels; ++i) {
    T* channel = deinterleaved[i];
    size_t interleaved_idx = i;
    for (size_t j = 0; j < samples_per_channel; ++j) {
      channel[j] = interleaved[interleaved_idx];
      interleaved_idx += num_channels;
    }
  }
}

// Interleave audio from the channel buffers pointed to by |deinterleaved| to
// |interleaved|. There must be sufficient space allocated in |interleaved|
// (|samples_per_channel| * |num_channels|).
template <typename T>
void Interleave(const T* const* deinterleaved,
                size_t samples_per_channel,
                size_t num_channels,
                T* interleaved) {
  for (size_t i = 0; i < num_channels; ++i) {
    const T* channel = deinterleaved[i];
    size_t interleaved_idx = i;
    for (size_t j = 0; j < samples_per_channel; ++j) {
      interleaved[interleaved_idx] = channel[j];
      interleaved_idx += num_channels;
    }
  }
}
```

This covers the basic audio-data operations: mixing, channel conversion and resampling.
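A small usage sketch of these templates (it mirrors how PushResampler<T>::Resample() uses them; the buffer sizes are illustrative, 10 ms at 48 kHz):

```cpp
// Split a 10 ms interleaved stereo buffer into two mono buffers and
// merge them back, using the templates above.
const size_t samples_per_channel = 480;  // 10 ms at 48 kHz
int16_t interleaved[960] = {0};          // L R L R ...
int16_t left[480];
int16_t right[480];
int16_t* planes[] = {left, right};
webrtc::Deinterleave(interleaved, samples_per_channel, 2, planes);
webrtc::Interleave(planes, samples_per_channel, 2, interleaved);
```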
Summary
The basic audio-data operations in WebRTC covered here are mixing, channel-count conversion, and resampling: mixing adds up the sample values of the selected streams (with ramping and a limiter applied), channel conversion duplicates or averages channels, and resampling is performed per channel by PushSincResampler, wrapped by PushResampler.