Title: | R Interface to 'pytorch''s 'torchaudio' |
---|---|
Description: | Provides access to datasets, models and processing facilities for deep learning in audio. |
Authors: | Sigrid Keydana [aut, cre], Athos Damiani [aut], Daniel Falbel [aut] |
Maintainer: | Sigrid Keydana <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.1 |
Built: | 2024-11-23 03:47:16 UTC |
Source: | https://github.com/cran/torchaudio |
Create a Dataset for CMU_ARCTIC.
cmuarctic_dataset( root, url = "aew", folder_in_archive = "ARCTIC", download = FALSE )
root |
(str): Path to the directory where the dataset is found or downloaded. |
url |
(str, optional): The URL to download the dataset from or the type of the dataset to download. (default: "aew") |
folder_in_archive |
(str, optional): The top-level directory of the dataset. (default: "ARCTIC") |
download |
(bool, optional): Whether to download the dataset if it is not found at root path. (default: FALSE) |
a torch::dataset()
Extract Archive
extract_archive(from_path, to_path = NULL, overwrite = FALSE)
from_path |
(str): the path of the archive. |
to_path |
(str, optional): the root path of the extracted files (directory of from_path) (Default: NULL) |
overwrite |
(bool, optional): overwrite existing files (Default: FALSE) |
list: List of paths to extracted files even if not overwritten.
if (torch::torch_is_installed()) {
  url <- "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz"
  d <- fs::dir_create(tempdir(), "torchaudio")
  from_path <- fs::path(d, basename(url))
  utils::download.file(url = url, destfile = from_path)
  torchaudio::extract_archive(from_path, d)
}
Take value from first if bigger than a multiplicative factor of the second, elementwise.
functional__combine_max(a, b, thresh = 0.99)
a |
(list(tensor, tensor)) |
b |
(list(tensor, tensor)) |
thresh |
(float) Default: 0.99 |
list(tensor, tensor): a list with a values tensor and an indices tensor.
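The elementwise rule described for functional__combine_max can be sketched in base R on plain numeric vectors. This is an illustration only, not the package's code: the helper name `combine_max_sketch`, the list field names `values` and `indices`, and the use of a strict `>` comparison are assumptions.

```r
# Hypothetical base-R stand-in for the elementwise rule; not the package's code.
combine_max_sketch <- function(a, b, thresh = 0.99) {
  # a and b are lists of (values, indices), mirroring list(tensor, tensor).
  # Assumption: "bigger than" means a strict comparison.
  take_a <- a$values > thresh * b$values
  list(
    values  = ifelse(take_a, a$values, b$values),
    indices = ifelse(take_a, a$indices, b$indices)
  )
}

a <- list(values = c(0.50, 0.90, 0.20), indices = c(1, 2, 3))
b <- list(values = c(0.60, 0.90, 0.10), indices = c(4, 5, 6))
res <- combine_max_sketch(a, b)
print(res$values)   # 0.6 0.9 0.2
print(res$indices)  # 4 2 3
```

Only the first element comes from `b`, because 0.50 is not larger than 0.99 * 0.60.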
Compute Normalized Cross-Correlation Function (NCCF).
functional__compute_nccf(waveform, sample_rate, frame_time, freq_low)
waveform |
(Tensor): Tensor of audio of dimension (..., time) |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
frame_time |
(float) |
freq_low |
(float) |
tensor of nccf
For each frame, take the highest value of NCCF, apply centered median smoothing, and convert to frequency.
functional__find_max_per_frame(nccf, sample_rate, freq_high)
nccf |
(tensor): Usually a tensor returned by functional__compute_nccf |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
freq_high |
(int): Highest frequency that can be detected (Hz) Note: If the max among all the lags is very close to the first half of lags, then the latter is taken. |
tensor with indices
A helper function for phaser. Generates a table with given parameters
functional__generate_wave_table( wave_type, data_type, table_size, min, max, phase, device )
wave_type |
(str): 'SINE' or 'TRIANGULAR' |
data_type |
(str): desired data_type ( |
table_size |
(int): desired table size |
min |
(float): desired min value |
max |
(float): desired max value |
phase |
(float): desired phase |
device |
(torch_device): Torch device on which table must be generated |
tensor: A 1D tensor with wave table values
Apply median smoothing to the 1D tensor over the given window.
functional__median_smoothing(indices, win_length)
indices |
(Tensor) |
win_length |
(int) |
tensor
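As a rough illustration of median smoothing over a sliding window, here is a base-R sketch on a numeric vector. The helper name `median_smoothing_sketch` is hypothetical, and the edge handling (repeating the first and last values so the output keeps the input length) is an assumption; the package may treat edges and output length differently.

```r
# Hypothetical sketch of windowed median smoothing; not the package's code.
median_smoothing_sketch <- function(x, win_length) {
  stopifnot(win_length %% 2 == 1)  # odd windows only, so the median is unambiguous
  half <- (win_length - 1) %/% 2
  # Assumption: edges are padded by repeating the first and last values.
  padded <- c(rep(x[1], half), x, rep(x[length(x)], half))
  vapply(
    seq_along(x),
    function(i) median(padded[i:(i + win_length - 1)]),
    numeric(1)
  )
}

smoothed <- median_smoothing_sketch(c(1, 1, 9, 1, 1, 5, 5, 5), win_length = 3)
print(smoothed)  # 1 1 1 1 1 5 5 5
```

The isolated spike (9) is removed while the sustained step from 1 to 5 is kept, which is the property that makes median smoothing useful for pitch tracks.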
Noise shaping is calculated by error:
error[n] = dithered[n] - original[n]
noise_shaped_waveform[n] = dithered[n] + error[n-1]
functional_add_noise_shaping(dithered_waveform, waveform)
dithered_waveform |
(Tensor) dithered |
waveform |
(Tensor) original |
tensor of the noise shaped waveform
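The two noise-shaping equations can be checked by hand with a small base-R sketch. The helper name `add_noise_shaping_sketch` is hypothetical, and treating the error before the first sample as zero is an assumption.

```r
# Hypothetical base-R illustration of the noise-shaping equations.
add_noise_shaping_sketch <- function(dithered, original) {
  error <- dithered - original
  # Assumption: no error is carried into the first sample.
  previous_error <- c(0, error[-length(error)])
  dithered + previous_error
}

original <- c(0.10, 0.20, 0.30, 0.40)
dithered <- c(0.12, 0.19, 0.33, 0.40)
shaped <- add_noise_shaping_sketch(dithered, original)
print(shaped)  # 0.12 0.21 0.32 0.43
```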
Design two-pole all-pass filter. Similar to SoX implementation.
functional_allpass_biquad(waveform, sample_rate, central_freq, Q = 0.707)
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
central_freq |
(float): central frequency (in Hz) |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
tensor: Waveform of dimension of (..., time)
Turn a tensor from the power/amplitude scale to the decibel scale.
functional_amplitude_to_db(x, multiplier, amin, db_multiplier, top_db = NULL)
x |
(Tensor): Input tensor before being converted to decibel scale |
multiplier |
(float): Use 10.0 for power and 20.0 for amplitude (Default: |
amin |
(float): Number to clamp |
db_multiplier |
(float): log10(max(ref_value, amin)) |
top_db |
(float or NULL, optional): Minimum negative cut-off in decibels. A reasonable number is 80. (Default: NULL) |
This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.
tensor: Output tensor in decibel scale
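How the arguments interact can be sketched in base R. The formula below is an assumption based on the usual decibel conversion (clamp at amin, scale by multiplier, subtract the reference term, then apply the top_db floor relative to the maximum); the helper name `amplitude_to_db_sketch` is hypothetical and the package may differ in detail.

```r
# Hypothetical base-R sketch of the decibel conversion; not the package's code.
amplitude_to_db_sketch <- function(x, multiplier, amin, db_multiplier, top_db = NULL) {
  x_db <- multiplier * log10(pmax(x, amin)) - multiplier * db_multiplier
  if (!is.null(top_db)) {
    # Clamp everything that lies more than top_db below the loudest value.
    x_db <- pmax(x_db, max(x_db) - top_db)
  }
  x_db
}

power <- c(1, 0.1, 1e-12)
ref_value <- 1
amin <- 1e-10
db <- amplitude_to_db_sketch(power, 10, amin, log10(max(ref_value, amin)), top_db = 80)
print(db)  # 0 -10 -80
```

Because the floor is taken relative to `max(x_db)`, the same quiet sample can map to a different value when the clip is processed in pieces, which matches the note about snippets versus a full clip.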
Compute the angle of complex tensor input.
functional_angle(complex_tensor)
complex_tensor |
(Tensor): Tensor shape of |
tensor: Angle of a complex tensor. Shape of (..., )
Apply a probability distribution function on a waveform.
functional_apply_probability_distribution(waveform, density_function = "TPDF")
waveform |
(Tensor): Tensor of audio of dimension (..., time) |
density_function |
(str, optional): The density function of a continuous random variable (Default: "TPDF") |
Triangular probability density function (TPDF) dither noise has a triangular distribution; values in the center of the range have a higher probability of occurring.
Rectangular probability density function (RPDF) dither noise has a uniform distribution; any value in the specified range has the same probability of occurring.
Gaussian probability density function (GPDF) has a normal distribution. The relationship of probabilities of results follows a bell-shaped, or Gaussian curve, typical of dither generated by analog sources.
tensor: waveform dithered with TPDF
Design two-pole band filter. Similar to SoX implementation.
functional_band_biquad( waveform, sample_rate, central_freq, Q = 0.707, noise = FALSE )
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
central_freq |
(float): central frequency (in Hz) |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
noise |
(bool, optional) : If |
tensor: Waveform of dimension of (..., time)
Design two-pole band-pass filter. Similar to SoX implementation.
functional_bandpass_biquad( waveform, sample_rate, central_freq, Q = 0.707, const_skirt_gain = FALSE )
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
central_freq |
(float): central frequency (in Hz) |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
const_skirt_gain |
(bool, optional) : If |
Tensor: Waveform of dimension of (..., time)
Design two-pole band-reject filter. Similar to SoX implementation.
functional_bandreject_biquad(waveform, sample_rate, central_freq, Q = 0.707)
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
central_freq |
(float): central frequency (in Hz) |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
tensor: Waveform of dimension of (..., time)
Design a bass tone-control effect. Similar to SoX implementation.
functional_bass_biquad( waveform, sample_rate, gain, central_freq = 100, Q = 0.707 )
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
gain |
(float): desired gain at the boost (or attenuation) in dB. |
central_freq |
(float, optional): central frequency (in Hz). (Default: 100) |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
tensor: Waveform of dimension of (..., time)
Perform a biquad filter of input tensor. Initial conditions set to 0. https://en.wikipedia.org/wiki/Digital_biquad_filter
functional_biquad(waveform, b0, b1, b2, a0, a1, a2)
waveform |
(Tensor): audio waveform of dimension of |
b0 |
(float): numerator coefficient of current input, x[n] |
b1 |
(float): numerator coefficient of input one time step ago x[n-1] |
b2 |
(float): numerator coefficient of input two time steps ago x[n-2] |
a0 |
(float): denominator coefficient of current output y[n], typically 1 |
a1 |
(float): denominator coefficient of output one time step ago y[n-1] |
a2 |
(float): denominator coefficient of output two time steps ago y[n-2] |
tensor: Waveform with dimension of (..., time)
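The role of the six coefficients is easiest to see in the standard direct-form difference equation for a biquad, sketched here in base R with zero initial conditions. The helper name `biquad_sketch` is hypothetical; this sketch does no output clamping and works on a single numeric vector rather than a tensor, so it illustrates the recurrence rather than reproducing the package's behaviour.

```r
# Hypothetical base-R sketch of the biquad difference equation:
# a0*y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
biquad_sketch <- function(x, b0, b1, b2, a0, a1, a2) {
  y <- numeric(length(x))
  for (n in seq_along(x)) {
    # Samples before the start of the signal are taken as 0 (initial conditions).
    x1 <- if (n > 1) x[n - 1] else 0
    x2 <- if (n > 2) x[n - 2] else 0
    y1 <- if (n > 1) y[n - 1] else 0
    y2 <- if (n > 2) y[n - 2] else 0
    y[n] <- (b0 * x[n] + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
  }
  y
}

impulse <- c(1, 0, 0, 0)
y <- biquad_sketch(impulse, b0 = 0.5, b1 = 0.5, b2 = 0, a0 = 1, a1 = -0.5, a2 = 0)
print(y)  # 0.5 0.75 0.375 0.1875
```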
Compute the norm of complex tensor input.
functional_complex_norm(complex_tensor, power = 1)
complex_tensor |
(tensor): Tensor shape of |
power |
(numeric): Power of the norm. (Default: 1) |
tensor: Power of the normed input tensor. Shape of (..., )
Compute delta coefficients of a tensor, usually a spectrogram.
functional_compute_deltas(specgram, win_length = 5, mode = "replicate")
specgram |
(Tensor): Tensor of audio of dimension (..., freq, time) |
win_length |
(int, optional): The window length used for computing delta (Default: 5) |
mode |
(str, optional): Mode parameter passed to padding (Default: "replicate") |
$$d_t = \frac{\sum_{n=1}^{N} n \, (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$$
where $d_t$ is the deltas at time $t$, $c_t$ is the spectrogram coefficients at time $t$, and $N$ is (win_length - 1) %/% 2.
tensor: Tensor of deltas of dimension (..., freq, time)
if (torch::torch_is_installed()) {
  library(torch)
  library(torchaudio)
  specgram <- torch_randn(1, 40, 1000)
  delta <- functional_compute_deltas(specgram)
  delta2 <- functional_compute_deltas(delta)
}
Apply contrast effect. Similar to SoX implementation. Comparable with compression, this effect modifies an audio signal to make it sound louder
functional_contrast(waveform, enhancement_amount = 75)
waveform |
(Tensor): audio waveform of dimension of |
enhancement_amount |
(float): controls the amount of the enhancement. Allowed range of values for enhancement_amount: 0-100. Note that enhancement_amount = 0 still gives a significant contrast enhancement. |
tensor: Waveform of dimension of (..., time)
Create a DCT transformation matrix with shape (n_mels, n_mfcc), normalized depending on norm. https://en.wikipedia.org/wiki/Discrete_cosine_transform
functional_create_dct(n_mfcc, n_mels, norm = NULL)
n_mfcc |
(int): Number of mfc coefficients to retain |
n_mels |
(int): Number of mel filterbanks |
norm |
(chr or NULL): Norm to use (either 'ortho' or NULL) |
tensor: The transformation matrix, to be right-multiplied to row-wise data of size (n_mels, n_mfcc).
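For intuition about the shape and the 'ortho' option, a DCT-II matrix can be built in base R as below. The helper name `create_dct_sketch` is hypothetical, and the scaling used when norm is NULL (a factor of 2, following a common DCT-II convention) is an assumption; only the 'ortho' case is checked here.

```r
# Hypothetical base-R sketch of a DCT-II matrix of shape (n_mels, n_mfcc).
create_dct_sketch <- function(n_mfcc, n_mels, norm = NULL) {
  n <- seq_len(n_mels) - 1   # filterbank index 0..n_mels-1
  k <- seq_len(n_mfcc) - 1   # coefficient index 0..n_mfcc-1
  dct <- cos(pi / n_mels * outer(n + 0.5, k))
  if (is.null(norm)) {
    dct <- dct * 2           # assumption: unnormalized DCT-II scaling
  } else {
    stopifnot(norm == "ortho")
    dct[, 1] <- dct[, 1] / sqrt(2)
    dct <- dct * sqrt(2 / n_mels)
  }
  dct
}

dct <- create_dct_sketch(n_mfcc = 4, n_mels = 4, norm = "ortho")
print(dim(dct))                  # 4 4
print(round(crossprod(dct), 6))  # identity matrix: the columns are orthonormal
```

With norm = "ortho" and n_mfcc equal to n_mels, the columns form an orthonormal basis, so multiplying row-wise mel data by the matrix preserves its energy.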
Create a frequency bin conversion matrix.
functional_create_fb_matrix( n_freqs, f_min, f_max, n_mels, sample_rate, norm = NULL )
n_freqs |
(int): Number of frequencies to highlight/apply |
f_min |
(float): Minimum frequency (Hz) |
f_max |
(float or NULL): Maximum frequency (Hz). If NULL defaults to sample_rate %/% 2 |
n_mels |
(int): Number of mel filterbanks |
sample_rate |
(int): Sample rate of the audio waveform |
norm |
(chr) (Optional): If 'slaney', divide the triangular mel weights by the width of the mel band (area normalization). (Default: NULL) |
tensor: Triangular filter banks (fb matrix) of size (n_freqs, n_mels), meaning number of frequencies to highlight/apply to x the number of filterbanks. Each column is a filterbank so that assuming there is a matrix A of size (..., n_freqs), the applied result would be A * functional_create_fb_matrix(A.size(-1), ...).
Turn a tensor from the decibel scale to the power/amplitude scale.
functional_db_to_amplitude(x, ref, power)
x |
(Tensor): Input tensor before being converted to power/amplitude scale. |
ref |
(float): Reference which the output will be scaled by. (Default: |
power |
(float): If power equals 1, will compute DB to power. If 0.5, will compute
DB to amplitude. (Default: |
tensor: Output tensor in power/amplitude scale.
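The effect of the power argument can be illustrated with a one-line base-R sketch. The formula `ref * (10^(0.1 * x))^power` is an assumption consistent with the argument descriptions (power = 1 gives the power scale, power = 0.5 the amplitude scale); the helper name `db_to_amplitude_sketch` is hypothetical.

```r
# Hypothetical base-R sketch of the inverse decibel conversion.
db_to_amplitude_sketch <- function(x, ref, power) {
  ref * (10^(0.1 * x))^power
}

as_power <- db_to_amplitude_sketch(c(0, -10, -20), ref = 1, power = 1)
as_amplitude <- db_to_amplitude_sketch(c(0, -20), ref = 1, power = 0.5)
print(as_power)      # 1 0.1 0.01
print(as_amplitude)  # 1 0.1
```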
Apply a DC shift to the audio. Similar to SoX implementation. This can be useful to remove a DC offset (caused perhaps by a hardware problem in the recording chain) from the audio
functional_dcshift(waveform, shift, limiter_gain = NULL)
waveform |
(Tensor): audio waveform of dimension of |
shift |
(float): indicates the amount to shift the audio. Allowed range of values for shift: -2.0 to +2.0 |
limiter_gain |
(float): It is used only on peaks to prevent clipping. It should have a value much less than 1 (e.g. 0.05 or 0.02) |
tensor: Waveform of dimension of (..., time)
Apply ISO 908 CD de-emphasis (shelving) IIR filter. Similar to SoX implementation.
functional_deemph_biquad(waveform, sample_rate)
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, Allowed sample rate |
Tensor: Waveform of dimension of (..., time)
Detect pitch frequency. It is implemented using the normalized cross-correlation function and median smoothing.
functional_detect_pitch_frequency( waveform, sample_rate, frame_time = 10^(-2), win_length = 30, freq_low = 85, freq_high = 3400 )
waveform |
(Tensor): Tensor of audio of dimension (..., freq, time) |
sample_rate |
(int): The sample rate of the waveform (Hz) |
frame_time |
(float, optional): Duration of a frame (Default: 10^(-2)) |
win_length |
(int, optional): The window length for median smoothing (in number of frames) (Default: 30) |
freq_low |
(int, optional): Lowest frequency that can be detected (Hz) (Default: 85) |
freq_high |
(int, optional): Highest frequency that can be detected (Hz) (Default: 3400) |
Tensor: Tensor of freq of dimension (..., frame)
Dither increases the perceived dynamic range of audio stored at a particular bit-depth by eliminating nonlinear truncation distortion (i.e. adding minimally perceived noise to mask distortion caused by quantization).
functional_dither(waveform, density_function = "TPDF", noise_shaping = FALSE)
waveform |
(Tensor): Tensor of audio of dimension (..., time) |
density_function |
(str, optional): The density function of a continuous random variable (Default: "TPDF") |
noise_shaping |
(bool, optional): a filtering process that shapes the spectral energy of quantisation error (Default: FALSE) |
tensor: waveform dithered
Design biquad peaking equalizer filter and perform filtering. Similar to SoX implementation.
functional_equalizer_biquad( waveform, sample_rate, center_freq, gain, Q = 0.707 )
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
center_freq |
(float): filter's central frequency |
gain |
(float): desired gain at the boost (or attenuation) in dB |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
Tensor: Waveform of dimension of (..., time)
Apply a flanger effect to the audio. Similar to SoX implementation.
functional_flanger( waveform, sample_rate, delay = 0, depth = 2, regen = 0, width = 71, speed = 0.5, phase = 25, modulation = "sinusoidal", interpolation = "linear" )
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
delay |
(float): desired delay in milliseconds (ms). Allowed range of values is 0 to 30 |
depth |
(float): desired delay depth in milliseconds (ms). Allowed range of values is 0 to 10 |
regen |
(float): desired regen (feedback gain) in dB. Allowed range of values is -95 to 95 |
width |
(float): desired width (delay gain) in dB. Allowed range of values is 0 to 100 |
speed |
(float): modulation speed in Hz. Allowed range of values is 0.1 to 10 |
phase |
(float): percentage phase-shift for multi-channel. Allowed range of values is 0 to 100 |
modulation |
(str): Use either "sinusoidal" or "triangular" modulation. (Default: "sinusoidal") |
interpolation |
(str): Use either "linear" or "quadratic" for delay-line interpolation. (Default: "linear") |
tensor: Waveform of dimension of (..., channel, time)
Scott Lehman, Effects Explained, https://web.archive.org/web/20051125072557/http://www.harmony-central.com/Effects/effects-explained.html
Apply amplification or attenuation to the whole waveform.
functional_gain(waveform, gain_db = 1)
waveform |
(Tensor): Tensor of audio of dimension (..., time). |
gain_db |
(float, optional): Gain adjustment in decibels (dB) (Default: 1) |
tensor: the whole waveform amplified by gain_db.
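A decibel gain translates to a linear scale factor. The base-R sketch below assumes the amplitude convention (a ratio of 10^(gain_db / 20)); the helper name `gain_sketch` is hypothetical and is not the package's code.

```r
# Hypothetical base-R sketch: apply a gain given in decibels.
gain_sketch <- function(waveform, gain_db = 1) {
  # Assumption: gain_db refers to amplitude, so 20 dB is a factor of 10.
  ratio <- 10^(gain_db / 20)
  waveform * ratio
}

louder <- gain_sketch(c(0.1, -0.2), gain_db = 20)
print(louder)  # 1 -2
```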
Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation. Implementation ported from librosa.
functional_griffinlim( specgram, window, n_fft, hop_length, win_length, power, normalized, n_iter, momentum, length, rand_init )
specgram |
(Tensor): A magnitude-only STFT spectrogram of dimension (..., freq, frames)
where freq is |
window |
(Tensor): Window tensor that is applied/multiplied to each frame/window |
n_fft |
(int): Size of FFT, creates |
hop_length |
(int): Length of hop between STFT windows. |
win_length |
(int): Window size. |
power |
(float): Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. |
normalized |
(bool): Whether to normalize by magnitude after stft. |
n_iter |
(int): Number of iterations for phase recovery process. |
momentum |
(float): The momentum parameter for fast Griffin-Lim. Setting this to 0 recovers the original Griffin-Lim method. Values near 1 can lead to faster convergence, but above 1 may not converge. |
length |
(int or NULL): Array length of the expected output. |
rand_init |
(bool): Initializes phase randomly if TRUE, to zero otherwise. |
tensor: waveform of (..., time), where time equals the length parameter if given.
Design biquad highpass filter and perform filtering. Similar to SoX implementation.
functional_highpass_biquad(waveform, sample_rate, cutoff_freq, Q = 0.707)
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
cutoff_freq |
(float): filter cutoff frequency |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
tensor: Waveform of dimension of (..., time)
Perform an IIR filter by evaluating difference equation.
functional_lfilter(waveform, a_coeffs, b_coeffs, clamp = TRUE)
waveform |
(Tensor): audio waveform of dimension of |
a_coeffs |
(Tensor): denominator coefficients of difference equation of dimension of |
b_coeffs |
(Tensor): numerator coefficients of difference equation of dimension of |
clamp |
(bool, optional): If |
tensor: Waveform with dimension of (..., time).
Design biquad lowpass filter and perform filtering. Similar to SoX implementation.
functional_lowpass_biquad(waveform, sample_rate, cutoff_freq, Q = 0.707)
waveform |
(torch.Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
cutoff_freq |
(float): filter cutoff frequency |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
tensor: Waveform of dimension of (..., time)
Separate a complex-valued spectrogram with shape (..., 2) into its magnitude and phase.
functional_magphase(complex_tensor, power = 1)
complex_tensor |
(Tensor): Tensor shape of |
power |
(float): Power of the norm. (Default: 1) |
list(tensor, tensor): The magnitude and phase of the complex tensor
Apply a mask along axis. Mask will be applied from indices [v_0, v_0 + v), where v is sampled from uniform(0, mask_param), and v_0 from uniform(0, max_v - v). All examples will have the same mask interval.
functional_mask_along_axis(specgram, mask_param, mask_value, axis)
specgram |
(Tensor): Real spectrogram (channel, freq, time) |
mask_param |
(int): Number of columns to be masked will be uniformly sampled from |
mask_value |
(float): Value to assign to the masked columns |
axis |
(int): Axis to apply masking on (2 -> frequency, 3 -> time) |
Tensor: Masked spectrogram of dimensions (channel, freq, time)
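The sampling of the mask interval can be sketched in base R on a plain array. The helper name `mask_along_axis_sketch` is hypothetical, and truncating v and v_0 to whole indices is an assumption about how the continuous samples become array positions.

```r
# Hypothetical base-R sketch of masking along the frequency or time axis.
mask_along_axis_sketch <- function(specgram, mask_param, mask_value, axis) {
  # specgram: array of shape (channel, freq, time); axis 2 = frequency, 3 = time.
  stopifnot(axis %in% c(2, 3))
  max_v <- dim(specgram)[axis]
  v <- runif(1, 0, mask_param)       # mask width
  v_0 <- runif(1, 0, max_v - v)      # mask start
  # Assumption: both samples are truncated to whole, zero-based indices.
  first <- floor(v_0)
  last <- first + floor(v)
  idx <- seq_len(max_v) - 1
  masked <- idx >= first & idx < last
  if (axis == 2) {
    specgram[, masked, ] <- mask_value
  } else {
    specgram[, , masked] <- mask_value
  }
  specgram
}

set.seed(1)
spec <- array(1, dim = c(1, 4, 10))
out <- mask_along_axis_sketch(spec, mask_param = 3, mask_value = 0, axis = 3)
masked_cols <- apply(out == 0, 3, all)
print(which(masked_cols))
```

Every channel and frequency row shares the same masked time columns, which corresponds to the statement that all examples have the same mask interval.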
Apply a mask along axis. Mask will be applied from indices [v_0, v_0 + v), where v is sampled from uniform(0, mask_param), and v_0 from uniform(0, max_v - v).
functional_mask_along_axis_iid(specgrams, mask_param, mask_value, axis)
specgrams |
(Tensor): Real spectrograms (batch, channel, freq, time) |
mask_param |
(int): Number of columns to be masked will be uniformly sampled from |
mask_value |
(float): Value to assign to the masked columns |
axis |
(int): Axis to apply masking on (3 -> frequency, 4 -> time) |
tensor: Masked spectrograms of dimensions (batch, channel, freq, time)
Turn a normal STFT into a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.
functional_mel_scale( specgram, n_mels = 128, sample_rate = 16000, f_min = 0, f_max = NULL, n_stft = NULL )
specgram |
(Tensor): A spectrogram STFT of dimension (..., freq, time). |
n_mels |
(int, optional): Number of mel filterbanks. (Default: 128) |
sample_rate |
(int, optional): Sample rate of audio signal. (Default: 16000) |
f_min |
(float, optional): Minimum frequency. (Default: 0) |
f_max |
(float or NULL, optional): Maximum frequency. (Default: NULL) |
n_stft |
(int, optional): Number of bins in STFT. Calculated from first input if NULL is given. See |
tensor: Mel frequency spectrogram of size (..., n_mels, time).
Decode mu-law encoded signal. For more info see the Wikipedia Entry
functional_mu_law_decoding(x_mu, quantization_channels)
x_mu |
(Tensor): Input tensor |
quantization_channels |
(int): Number of channels |
This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.
tensor: Input after mu-law decoding
Encode signal based on mu-law companding. For more info see the Wikipedia Entry
functional_mu_law_encoding(x, quantization_channels)
x |
(Tensor): Input tensor |
quantization_channels |
(int): Number of channels |
This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.
tensor: Input after mu-law encoding
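The encoding described here and its inverse (functional_mu_law_decoding) follow the standard mu-law companding formulas, which can be sketched in base R as a round trip. The helper names are hypothetical, and the rounding step (adding 0.5 before flooring) is an assumption about how the companded value is quantized.

```r
# Hypothetical base-R sketch of mu-law companding; not the package's code.
mu_law_encoding_sketch <- function(x, quantization_channels) {
  mu <- quantization_channels - 1
  compressed <- sign(x) * log1p(mu * abs(x)) / log1p(mu)
  # Map [-1, 1] to integer codes 0..mu (assumed rounding to nearest).
  floor((compressed + 1) / 2 * mu + 0.5)
}

mu_law_decoding_sketch <- function(x_mu, quantization_channels) {
  mu <- quantization_channels - 1
  x <- x_mu / mu * 2 - 1
  sign(x) * (exp(abs(x) * log1p(mu)) - 1) / mu
}

x <- c(-1, -0.5, 0, 0.5, 1)
codes <- mu_law_encoding_sketch(x, 256)
decoded <- mu_law_decoding_sketch(codes, 256)
print(codes)
print(round(decoded, 3))
```

The codes span 0 to quantization_channels - 1, and decoding recovers the input up to quantization error, which is smallest near zero where mu-law spends most of its resolution.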
Apply an overdrive effect to the audio. Similar to SoX implementation. This effect applies a nonlinear distortion to the audio signal.
functional_overdrive(waveform, gain = 20, colour = 20)
waveform |
(Tensor): audio waveform of dimension of |
gain |
(float): desired gain at the boost (or attenuation) in dB. Allowed range of values is 0 to 100 |
colour |
(float): controls the amount of even harmonic content in the over-driven output. Allowed range of values is 0 to 100 |
Tensor: Waveform of dimension of (..., time)
Given a STFT tensor, speed up in time without modifying pitch by a factor of rate.
functional_phase_vocoder(complex_specgrams, rate, phase_advance)
complex_specgrams |
(Tensor): Dimension of |
rate |
(float): Speed-up factor |
phase_advance |
(Tensor): Expected phase advance in each bin. Dimension of (freq, 1) |
tensor: Complex Specgrams Stretch with dimension of (..., freq, ceiling(time/rate), complex=2)
if (torch::torch_is_installed()) {
  library(torch)
  library(torchaudio)

  freq <- 1025
  hop_length <- 512

  # (channel, freq, time, complex=2)
  complex_specgrams <- torch_randn(2, freq, 300, 2)
  rate <- 1.3 # Speed up by 30%
  phase_advance <- torch_linspace(0, pi * hop_length, freq)[.., NULL]
  x <- functional_phase_vocoder(complex_specgrams, rate, phase_advance)
  x$shape # with 231 == ceiling(300 / 1.3)
  # torch.Size([2, 1025, 231, 2])
}
Apply a phasing effect to the audio. Similar to SoX implementation.
functional_phaser( waveform, sample_rate, gain_in = 0.4, gain_out = 0.74, delay_ms = 3, decay = 0.4, mod_speed = 0.5, sinusoidal = TRUE )
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
gain_in |
(float): desired input gain at the boost (or attenuation) in dB. Allowed range of values is 0 to 1 |
gain_out |
(float): desired output gain at the boost (or attenuation) in dB. Allowed range of values is 0 to 1e9 |
delay_ms |
(float): desired delay in milliseconds. Allowed range of values is 0 to 5.0 |
decay |
(float): desired decay relative to gain-in. Allowed range of values is 0 to 0.99 |
mod_speed |
(float): modulation speed in Hz. Allowed range of values is 0.1 to 2 |
sinusoidal |
(bool): If |
tensor: Waveform of dimension of (..., time)
Apply RIAA vinyl playback equalisation. Similar to SoX implementation.
functional_riaa_biquad(waveform, sample_rate)
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz).
Allowed sample rates in Hz : |
tensor: Waveform of dimension of (..., time)
Apply sliding-window cepstral mean (and optionally variance) normalization per utterance.
functional_sliding_window_cmn( waveform, cmn_window = 600, min_cmn_window = 100, center = FALSE, norm_vars = FALSE )
waveform |
(Tensor): Tensor of audio of dimension (..., freq, time) |
cmn_window |
(int, optional): Window in frames for running average CMN computation (int, default = 600) |
min_cmn_window |
(int, optional): Minimum CMN window used at start of decoding (adds latency only at start).
Only applicable if center == |
center |
(bool, optional): If |
norm_vars |
(bool, optional): If |
tensor: Tensor of freq of dimension (..., frame)
Create a spectrogram or a batch of spectrograms from a raw audio signal. The spectrogram can be either magnitude-only or complex.
functional_spectrogram( waveform, pad, window, n_fft, hop_length, win_length, power, normalized )
waveform |
(tensor): Tensor of audio of dimension (..., time) |
pad |
(integer): Two sided padding of signal |
window |
(tensor or function): Window tensor that is applied/multiplied to each frame/window or a function that generates the window tensor. |
n_fft |
(integer): Size of FFT |
hop_length |
(integer): Length of hop between STFT windows |
win_length |
(integer): Window size |
power |
(numeric): Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. If NULL, then the complex spectrum is returned instead. |
normalized |
(logical): Whether to normalize by magnitude after stft |
tensor: Dimension (..., freq, time), freq is n_fft %/% 2 + 1 and n_fft is the number of Fourier bins, and time is the number of window hops (n_frame).
Design a treble tone-control effect. Similar to SoX implementation.
functional_treble_biquad( waveform, sample_rate, gain, central_freq = 3000, Q = 0.707 )
waveform |
(Tensor): audio waveform of dimension of |
sample_rate |
(int): sampling rate of the waveform, e.g. 44100 (Hz) |
gain |
(float): desired gain at the boost (or attenuation) in dB. |
central_freq |
(float, optional): central frequency (in Hz). (Default: 3000) |
Q |
(float, optional): https://en.wikipedia.org/wiki/Q_factor (Default: 0.707) |
tensor: Waveform of dimension of (..., time)
Voice Activity Detector. Similar to SoX implementation. Attempts to trim silence and quiet background sounds from the ends of recordings of speech. The algorithm currently uses a simple cepstral power measurement to detect voice, so may be fooled by other things, especially music.
functional_vad( waveform, sample_rate, trigger_level = 7, trigger_time = 0.25, search_time = 1, allowed_gap = 0.25, pre_trigger_time = 0, boot_time = 0.35, noise_up_time = 0.1, noise_down_time = 0.01, noise_reduction_amount = 1.35, measure_freq = 20, measure_duration = NULL, measure_smooth_time = 0.4, hp_filter_freq = 50, lp_filter_freq = 6000, hp_lifter_freq = 150, lp_lifter_freq = 2000 )
waveform |
(Tensor): Tensor of audio of dimension |
sample_rate |
(int): Sample rate of audio signal. |
trigger_level |
(float, optional): The measurement level used to trigger activity detection. This may need to be cahnged depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0) |
trigger_time |
(float, optional): The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25) |
search_time |
(float, optional): The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0) |
allowed_gap |
(float, optional): The allowed gap (in seconds) between quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25) |
pre_trigger_time |
(float, optional): The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0) |
boot_time |
(float, optional) The algorithm (internally) uses adaptive noise estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35) |
noise_up_time |
(float, optional) Time constant used by the adaptive noise estimator for when the noise level is increasing. (Default: 0.1) |
noise_down_time |
(float, optional) Time constant used by the adaptive noise estimator for when the noise level is decreasing. (Default: 0.01) |
noise_reduction_amount |
(float, optional) Amount of noise reduction to use in the detection algorithm (e.g. 0, 0.5, ...). (Default: 1.35) |
measure_freq |
(float, optional) Frequency of the algorithm’s processing/measurements. (Default: 20.0) |
measure_duration |
(float, optional) Measurement duration. (Default: Twice the measurement period; i.e. with overlap.) |
measure_smooth_time |
(float, optional) Time constant used to smooth spectral measurements. (Default: 0.4) |
hp_filter_freq |
(float, optional) "Brick-wall" frequency of high-pass filter applied at the input to the detector algorithm. (Default: 50.0) |
lp_filter_freq |
(float, optional) "Brick-wall" frequency of low-pass filter applied at the input to the detector algorithm. (Default: 6000.0) |
hp_lifter_freq |
(float, optional) "Brick-wall" frequency of high-pass lifter used in the detector algorithm. (Default: 150.0) |
lp_lifter_freq |
(float, optional) "Brick-wall" frequency of low-pass lifter used in the detector algorithm. (Default: 2000.0) |
The effect can trim only from the front of the audio, so in order to trim from the back, the reverse effect must also be used.
Tensor: Tensor of audio of dimension (..., time).
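Since the effect trims only from the front, trailing silence can be removed by reversing the signal, trimming, and reversing back. The sketch below shows that pattern on a plain numeric vector in base R. `trim_front()` and `trim_both_ends()` are hypothetical helpers introduced for illustration; `trim_front()` merely drops leading low-amplitude samples and is not the cepstral detector used by functional_vad.

```r
# Hypothetical stand-in for a front-only trimmer: drops leading samples whose
# absolute value is below `threshold`. This is NOT the VAD algorithm.
trim_front <- function(x, threshold = 0.1) {
  loud <- which(abs(x) >= threshold)
  if (length(loud) == 0) return(x[0])
  x[loud[1]:length(x)]
}

# Trim both ends with a front-only trimmer: trim, reverse, trim again,
# then reverse back to the original orientation.
trim_both_ends <- function(x, trimmer = trim_front, ...) {
  front_trimmed <- trimmer(x, ...)
  rev(trimmer(rev(front_trimmed), ...))
}

signal <- c(0, 0.01, 0.5, -0.7, 0.02, 0.3, 0, 0)
trimmed <- trim_both_ends(signal)
print(trimmed)
```

With a tensor input, the same idea would presumably use a flip along the time dimension (for example torch::torch_flip) around two calls to functional_vad.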
Based on LinearResample::SetIndexesAndWeights where it retrieves the weights for resampling as well as the indices in which they are valid. LinearResample (LR) means that the output signal is at linearly spaced intervals (i.e. the output signal has a frequency of new_freq).
kaldi__get_lr_indices_and_weights( orig_freq, new_freq, output_samples_in_unit, window_width, lowpass_cutoff, lowpass_filter_width, device, dtype )
orig_freq |
(float): The original frequency of the signal |
new_freq |
(float): The desired frequency |
output_samples_in_unit |
(int): The number of output samples in the smallest repeating unit: num_samp_out = new_freq / Gcd (orig_freq, new_freq) |
window_width |
(float): The width of the window which is nonzero |
lowpass_cutoff |
(float): The filter cutoff in Hz. The filter cutoff needs to be less than samp_rate_in_hz/2 and less than samp_rate_out_hz/2. |
lowpass_filter_width |
(int): Controls the sharpness of the filter, more == sharper but less efficient. We suggest around 4 to 10 for normal use. |
device |
(torch_device): Torch device on which output must be generated. |
dtype |
(torch::torch_<dtype>): Torch dtype such as torch::torch_float |
It uses sinc/bandlimited interpolation to upsample/downsample the signal.
The same filter is not used for multiple convolutions because the sinc function could be sampled at different points in time. For example, suppose a signal is sampled at the timestamps (seconds)
0 16 32
and we want it to be sampled at the timestamps (seconds)
0 5 10 15 20 25 30 35
At the timestamp of 16, the delta timestamps are
16 11 6 1 4 9 14 19
At the timestamp of 32, the delta timestamps are
32 27 22 17 12 7 2 3
As we can see from the deltas, the sinc function is sampled at different points in time, assuming the center of the sinc function is at 0, 16, and 32 (the deltas [..., 6, 1, 4, ...] for 16 vs. [..., 2, 3, ...] for 32).
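The delta timestamps in this example can be recomputed in a few lines of base R. The sketch only reproduces the arithmetic (absolute distance between each input sample time and each output sample time); the variable names are illustrative and nothing here calls the package.

```r
# Input sample times and desired output sample times (seconds), taken from
# the example in the text.
input_times  <- c(0, 16, 32)
output_times <- seq(0, 35, by = 5)

# Absolute distance from each input sample (rows) to each output time (columns).
deltas <- abs(outer(input_times, output_times, "-"))
dimnames(deltas) <- list(input = as.character(input_times),
                         output = as.character(output_times))
print(deltas)
```

Each row shows where the sinc function centred on that input sample would be evaluated, which is why the rows for 16 and 32 differ.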
For example, when orig_freq and new_freq are multiples of each other, only one filter is needed.
A windowed filter function (i.e. Hanning * sinc) is used because the ideal sinc function has infinite support (non-zero for all values); it is therefore truncated and multiplied by a window function, which gives it less-than-perfect rolloff [1].
[1] Chapter 16: Windowed-Sinc Filters, https://www.dspguide.com/ch16/1.htm
(Tensor, Tensor): A tuple of min_input_index (the minimum indices where the window is valid, size (output_samples_in_unit)) and weights (the weights which correspond with min_input_index, size (output_samples_in_unit, max_weight_width)).
Based on LinearResample::GetNumOutputSamples. LinearResample (LR) means that the output signal is at linearly spaced intervals (i.e. the output signal has a frequency of new_freq). It uses sinc/bandlimited interpolation to upsample/downsample the signal.
kaldi__get_num_lr_output_samples(input_num_samp, samp_rate_in, samp_rate_out)
input_num_samp |
(int): The number of samples in the input |
samp_rate_in |
(float): The original frequency of the signal |
samp_rate_out |
(float): The desired frequency |
int: The number of output samples
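As a rough illustration of how such a count can be derived, the base R sketch below follows the tick-based approach of Kaldi's LinearResample::GetNumOutputSamples as commonly described: both rates are expressed on a common tick grid, and output samples are counted inside the half-open input interval. The helper names `gcd()` and `num_lr_output_samples()` are hypothetical, integer-valued sample rates are assumed, and whether the package function behaves identically in every edge case is an assumption.

```r
# Greatest common divisor via the Euclidean algorithm.
gcd <- function(a, b) {
  while (b != 0) {
    remainder <- a %% b
    a <- b
    b <- remainder
  }
  a
}

# Sketch of the output-sample count, assuming integer-valued sample rates.
num_lr_output_samples <- function(input_num_samp, samp_rate_in, samp_rate_out) {
  base_freq <- gcd(samp_rate_in, samp_rate_out)
  tick_freq <- samp_rate_in * samp_rate_out / base_freq  # least common multiple
  ticks_per_input_period <- tick_freq / samp_rate_in
  interval_length_in_ticks <- input_num_samp * ticks_per_input_period
  if (interval_length_in_ticks <= 0) return(0)
  ticks_per_output_period <- tick_freq / samp_rate_out
  # Index of the last output sample that lies strictly inside the interval.
  last_output_samp <- interval_length_in_ticks %/% ticks_per_output_period
  if (last_output_samp * ticks_per_output_period == interval_length_in_ticks) {
    last_output_samp <- last_output_samp - 1
  }
  last_output_samp + 1
}

n_out <- num_lr_output_samples(16000, 16000, 8000)
print(n_out)
```

Halving the rate of one second of 16 kHz audio gives half as many samples, as expected.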
Resamples the waveform at the new frequency.
kaldi_resample_waveform( waveform, orig_freq, new_freq, lowpass_filter_width = 6 )
waveform |
(Tensor): The input signal of size (c, n) |
orig_freq |
(float): The original frequency of the signal |
new_freq |
(float): The desired frequency |
lowpass_filter_width |
(int, optional): Controls the sharpness of the filter, more == sharper
but less efficient. We suggest around 4 to 10 for normal use. (Default: |
This matches Kaldi's OfflineFeatureTpl ResampleWaveform, which uses a LinearResample (resample a signal at linearly spaced intervals to upsample/downsample a signal). LinearResample (LR) means that the output signal is at linearly spaced intervals (i.e. the output signal has a frequency of new_freq). It uses sinc/bandlimited interpolation to upsample/downsample the signal.
Tensor: The waveform at the new frequency
https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/resample.h#L56
Converts frequencies from the linear scale to mel scale.
linear_to_mel_frequency( frequency_in_hertz, mel_break_frequency_hertz = 2595, mel_high_frequency_q = 700 )
frequency_in_hertz |
(numeric) tensor of frequencies in hertz to be converted to mel scale. |
mel_break_frequency_hertz |
(numeric) scalar. (Default to 2595.0) |
mel_high_frequency_q |
(numeric) scalar. (Default to 700.0) |
tensor
List available audio backends
list_audio_backends()
character vector with the list of available backends.
Converts frequencies from the mel scale to linear scale.
mel_to_linear_frequency( frequency_in_mel, mel_break_frequency_hertz = 2595, mel_high_frequency_q = 700 )
frequency_in_mel |
(numeric) tensor of frequencies in mel to be converted to linear scale. |
mel_break_frequency_hertz |
(numeric) scalar. (Default to 2595.0) |
mel_high_frequency_q |
(numeric) scalar. (Default to 700.0) |
tensor
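For orientation, the two conversions can be sketched in base R using the widely used formula mel = 2595 * log10(1 + f / 700), whose constants coincide with the defaults listed above. `hz_to_mel()` and `mel_to_hz()` are hypothetical helpers working on plain numeric vectors; that the package maps its two arguments onto these roles in exactly this way is an assumption.

```r
# Hertz to mel, assuming mel = break_freq * log10(1 + hz / high_q).
hz_to_mel <- function(hz, break_freq = 2595, high_q = 700) {
  break_freq * log10(1 + hz / high_q)
}

# Mel to hertz: the algebraic inverse of the formula above.
mel_to_hz <- function(mel, break_freq = 2595, high_q = 700) {
  high_q * (10^(mel / break_freq) - 1)
}

freqs <- c(0, 700, 1000, 8000)
mels <- hz_to_mel(freqs)
round_trip <- mel_to_hz(mels)
print(data.frame(hz = freqs, mel = mels, back_to_hz = round_trip))
```

The round trip recovers the original frequencies up to floating-point error, which is a quick sanity check when custom constants are supplied.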
MelResNet layer uses a stack of ResBlocks on spectrogram. Pass the input through the MelResNet layer.
model_melresnet( n_res_block = 10, n_freq = 128, n_hidden = 128, n_output = 128, kernel_size = 5 )
n_res_block |
the number of ResBlock in stack. (Default: |
n_freq |
the number of bins in a spectrogram. (Default: |
n_hidden |
the number of hidden dimensions of resblock. (Default: |
n_output |
the number of output dimensions of melresnet. (Default: |
kernel_size |
the kernel size of the first Conv1d layer. (Default: |
forward param: specgram (Tensor): the input sequence to the MelResNet layer (n_batch, n_freq, n_time).
Tensor shape: (n_batch, n_output, n_time - kernel_size + 1)
if(torch::torch_is_installed()) {
  melresnet = model_melresnet()
  input = torch::torch_rand(10, 128, 512) # a random spectrogram
  output = melresnet(input) # shape: (10, 128, 508)
}
ResNet block based on "Deep Residual Learning for Image Recognition". Pass the input through the ResBlock layer. The paper link is https://arxiv.org/pdf/1512.03385.pdf.
model_resblock(n_freq = 128)
n_freq |
the number of bins in a spectrogram. (Default: |
forward param: specgram (Tensor): the input sequence to the ResBlock layer (n_batch, n_freq, n_time).
Tensor shape: (n_batch, n_freq, n_time)
if(torch::torch_is_installed()) {
  resblock = model_resblock()
  input = torch::torch_rand(10, 128, 512) # a random spectrogram
  output = resblock(input) # shape: (10, 128, 512)
}
Upscale the frequency and time dimensions of a spectrogram. Pass the input through the Stretch2d layer.
model_stretch2d(time_scale, freq_scale)
time_scale |
the scale factor in time dimension |
freq_scale |
the scale factor in frequency dimension |
forward param: specgram (Tensor): the input sequence to the Stretch2d layer (..., n_freq, n_time).
Tensor shape: (..., n_freq * freq_scale, n_time * time_scale)
if(torch::torch_is_installed()) {
  stretch2d = model_stretch2d(time_scale=10, freq_scale=5)
  input = torch::torch_rand(10, 100, 512) # a random spectrogram
  output = stretch2d(input) # shape: (10, 500, 5120)
}
Upscale the dimensions of a spectrogram. Pass the input through the UpsampleNetwork layer.
model_upsample_network( upsample_scales, n_res_block = 10, n_freq = 128, n_hidden = 128, n_output = 128, kernel_size = 5 )
upsample_scales |
the list of upsample scales. |
n_res_block |
the number of ResBlock in stack. (Default: |
n_freq |
the number of bins in a spectrogram. (Default: |
n_hidden |
the number of hidden dimensions of resblock. (Default: |
n_output |
the number of output dimensions of melresnet. (Default: |
kernel_size |
the kernel size of the first Conv1d layer. (Default: |
forward param: specgram (Tensor): the input sequence to the UpsampleNetwork layer (n_batch, n_freq, n_time)
Tensor shape: (n_batch, n_freq, (n_time - kernel_size + 1) * total_scale), (n_batch, n_output, (n_time - kernel_size + 1) * total_scale) where total_scale is the product of all elements in upsample_scales.
if(torch::torch_is_installed()) {
  upsamplenetwork = model_upsample_network(upsample_scales=c(4, 4, 16))
  input = torch::torch_rand(10, 128, 10) # a random spectrogram
  output = upsamplenetwork(input) # shape: (10, 128, 1536), (10, 128, 1536)
}
WaveRNN model based on the implementation from fatchord. The original implementation was introduced in "Efficient Neural Audio Synthesis". Pass the input through the WaveRNN model.
model_wavernn( upsample_scales, n_classes, hop_length, n_res_block = 10, n_rnn = 512, n_fc = 512, kernel_size = 5, n_freq = 128, n_hidden = 128, n_output = 128 )
upsample_scales |
the list of upsample scales. |
n_classes |
the number of output classes. |
hop_length |
the number of samples between the starts of consecutive frames. |
n_res_block |
the number of ResBlock in stack. (Default: |
n_rnn |
the dimension of RNN layer. (Default: |
n_fc |
the dimension of fully connected layer. (Default: |
kernel_size |
the kernel size of the first Conv1d layer. (Default: |
n_freq |
the number of bins in a spectrogram. (Default: |
n_hidden |
the number of hidden dimensions of resblock. (Default: |
n_output |
the number of output dimensions of melresnet. (Default: |
forward param:
waveform the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length)
specgram the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time)
The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.
Tensor shape: (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)
if(torch::torch_is_installed()) {
  wavernn <- model_wavernn(upsample_scales=c(2,2,3), n_classes=5, hop_length=12)
  waveform <- torch::torch_rand(3,1,(10 - 5 + 1)*12)
  spectrogram <- torch::torch_rand(3,1,128,10)
  # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
  output <- wavernn(waveform, spectrogram)
}
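The shape rules stated for WaveRNN can be checked with plain arithmetic before any model is built. The base R sketch below only does shape bookkeeping from the formulas in this section; `wavernn_shapes()` is a hypothetical helper, not a package function.

```r
# Shape bookkeeping only; no model is constructed here.
wavernn_shapes <- function(n_batch, n_time, upsample_scales, hop_length,
                           n_classes, kernel_size = 5, n_freq = 128) {
  # The documentation requires the product of upsample_scales to equal hop_length.
  stopifnot(prod(upsample_scales) == hop_length)
  n_samples <- (n_time - kernel_size + 1) * hop_length
  list(
    waveform = c(n_batch, 1, n_samples),
    specgram = c(n_batch, 1, n_freq, n_time),
    output   = c(n_batch, 1, n_samples, n_classes)
  )
}

shapes <- wavernn_shapes(n_batch = 3, n_time = 10,
                         upsample_scales = c(2, 2, 3), hop_length = 12,
                         n_classes = 5)
str(shapes)
```

Using the values from the example (10 time frames, kernel size 5, hop length 12) gives a waveform length of (10 - 5 + 1) * 12 = 72 samples.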
Speech Commands Dataset
speechcommand_dataset( root, url = "speech_commands_v0.02", folder_in_archive = "SpeechCommands", download = FALSE, normalization = NULL )
root |
(str): Path to the directory where the dataset is found or downloaded. |
url |
(str, optional): The URL to download the dataset from,
or the type of the dataset to download.
Allowed type values are |
folder_in_archive |
(str, optional): The top-level directory of the dataset. (default: |
download |
(bool, optional): Whether to download the dataset if it is not found at root path. (default: |
normalization |
(NULL, bool, int or function): Optional normalization. If boolean TRUE, then output is divided by 2^31. Assuming the input is signed 32-bit audio, this normalizes to [-1, 1]. If numeric, then output is divided by that number. If function, then the output is passed as a parameter to the given function, then the output is divided by the result. (Default: NULL) |
a torch::dataset()
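The normalization rules described for the normalization argument can be summarised in a small base R sketch. `normalize_audio()` is a hypothetical helper written from the description above and operating on a plain numeric vector; treating FALSE like NULL (no scaling) is an assumption, and the package's internal code may differ.

```r
# Sketch of the documented normalization options on a numeric vector.
normalize_audio <- function(x, normalization = NULL) {
  if (is.null(normalization)) return(x)            # no normalization
  if (is.logical(normalization)) {
    # TRUE: divide by 2^31 (signed 32-bit full scale). FALSE: assumed no-op.
    if (isTRUE(normalization)) return(x / 2^31)
    return(x)
  }
  if (is.numeric(normalization)) return(x / normalization)
  if (is.function(normalization)) return(x / normalization(x))
  stop("unsupported normalization type")
}

raw <- c(2^31, -2^31, 2^30)
peak <- function(v) max(abs(v))   # example function: peak normalization

print(normalize_audio(raw, TRUE))
print(normalize_audio(raw, 2^30))
print(normalize_audio(raw, peak))
```

Passing a function makes the scale depend on the data itself, as in the peak-normalization example.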
Retrieve audio metadata.
torchaudio_info(filepath)
filepath |
(str) path to the audio file. |
AudioMetaData: an R6 class with fields sample_rate, channels, samples.
path <- system.file("waves_yesno/1_1_0_1_1_0_1_1.wav", package = "torchaudio")
torchaudio_info(path)
Loads an audio file from disk using the default loader (getOption("torchaudio.loader")).
torchaudio_load( filepath, offset = 0L, duration = -1L, unit = c("samples", "time") )
filepath |
(str): Path to audio file |
offset |
(int): Number of frames (or seconds) from the start of the file to begin data loading. (Default: |
duration |
(int): Number of frames (or seconds) to load. |
unit |
(str): "samples" or "time". If "samples", duration and offset will be interpreted as frames, and as seconds otherwise. |
Apply masking to a spectrogram.
transform__axismasking(mask_param, axis, iid_masks)
mask_param |
(int): Maximum possible length of the mask. |
axis |
(int): What dimension the mask is applied on. |
iid_masks |
(bool): Applies iid masks to each of the examples in the batch dimension. This option is applicable only when the input tensor is 4D. |
forward param: specgram (Tensor): Tensor of dimension (..., freq, time).
mask_value (float): Value to assign to the masked columns.
Tensor: Masked spectrogram of dimensions (..., freq, time).
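To make the masking idea concrete, here is a base R sketch on an ordinary matrix laid out as (freq, time). A mask length is drawn below mask_param, a start position is drawn so the mask fits, and that band is overwritten with mask_value. `mask_along_axis()` is a hypothetical helper; the exact sampling scheme used by the package is an assumption, and R's 1-based axis numbering is used (1 = freq, 2 = time).

```r
# Mask one randomly placed band of rows (axis = 1) or columns (axis = 2).
mask_along_axis <- function(specgram, mask_param, axis, mask_value = 0) {
  stopifnot(is.matrix(specgram), axis %in% c(1, 2))
  axis_size <- dim(specgram)[axis]
  stopifnot(mask_param <= axis_size)
  width <- floor(runif(1) * mask_param)           # mask length, below mask_param
  start <- floor(runif(1) * (axis_size - width))  # zero-based start offset
  if (width > 0) {
    idx <- (start + 1):(start + width)            # convert to 1-based indices
    if (axis == 1) {
      specgram[idx, ] <- mask_value
    } else {
      specgram[, idx] <- mask_value
    }
  }
  specgram
}

set.seed(1)
spec <- matrix(1, nrow = 8, ncol = 20)            # toy (freq, time) spectrogram
masked <- mask_along_axis(spec, mask_param = 5, axis = 2)
masked_cols <- which(colSums(masked) == 0)
print(masked_cols)
```

Masking along axis 2 here corresponds to time masking, and axis 1 to frequency masking.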
Turn a tensor from the power/amplitude scale to the decibel scale.
transform_amplitude_to_db(stype = "power", top_db = NULL)
stype |
(str, optional): scale of input tensor ('power' or 'magnitude'). The
power being the elementwise square of the magnitude. (Default: |
top_db |
(float or NULL, optional): Minimum negative cut-off in decibels. A reasonable number
is 80. (Default: |
This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.
forward param: x (Tensor): Input tensor before being converted to decibel scale
tensor: Output tensor in decibel scale
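The conversion and the effect of top_db can be sketched in base R under the usual convention of 10 * log10(x) for power and 20 * log10(x) for magnitude. `amplitude_to_db()` below is a hypothetical helper on numeric vectors; the `amin` floor is an assumed guard against log10(0) and the package may use a different reference level.

```r
# Decibel conversion with an optional cut-off relative to the maximum.
amplitude_to_db <- function(x, stype = c("power", "magnitude"), top_db = NULL,
                            amin = 1e-10) {
  stype <- match.arg(stype)
  multiplier <- if (stype == "power") 10 else 20
  x_db <- multiplier * log10(pmax(x, amin))   # amin avoids log10(0)
  if (!is.null(top_db)) {
    # Nothing may fall more than top_db below the loudest value.
    x_db <- pmax(x_db, max(x_db) - top_db)
  }
  x_db
}

power_vals <- c(1, 10, 100)
db_full    <- amplitude_to_db(power_vals)
db_clipped <- amplitude_to_db(power_vals, top_db = 15)
print(db_full)
print(db_clipped)
```

Because the cut-off is measured from the maximum of whatever is passed in, converting a clip in separate snippets can clamp values differently than converting it whole.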
Compute the norm of complex tensor input.
transform_complex_norm(power = 1)
power |
(float, optional): Power of the norm. (Default: to |
forward param:
complex_tensor (Tensor): Tensor shape of (..., complex=2).
Tensor: norm of the input tensor, shape of (..., ).
Compute delta coefficients of a tensor, usually a spectrogram.
transform_compute_deltas(win_length = 5, mode = "replicate")
win_length |
(int): The window length used for computing delta. (Default: |
mode |
(str): Mode parameter passed to padding. (Default: |
forward param: specgram (Tensor): Tensor of audio of dimension (..., freq, time).
See functional_compute_deltas for more details.
Tensor: Tensor of deltas of dimension (..., freq, time).
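As an illustration, the base R sketch below computes deltas for a single coefficient track with the standard regression formula d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2), where N = (win_length - 1) / 2, and with "replicate" padding at the edges. `compute_deltas()` is a hypothetical helper on a numeric vector; that the package uses exactly this formula is an assumption.

```r
# Delta coefficients of one track (a numeric vector over time).
compute_deltas <- function(x, win_length = 5) {
  stopifnot(win_length >= 3, win_length %% 2 == 1)
  n <- (win_length - 1) %/% 2
  len <- length(x)
  # "replicate" padding: repeat the first and last values n times.
  padded <- c(rep(x[1], n), x, rep(x[len], n))
  denom <- 2 * sum((1:n)^2)
  vapply(seq_len(len), function(t) {
    center <- t + n  # position of x[t] inside the padded vector
    sum((1:n) * (padded[center + 1:n] - padded[center - 1:n])) / denom
  }, numeric(1))
}

ramp <- as.numeric(1:10)
deltas <- compute_deltas(ramp)
print(deltas)
```

For a linear ramp the interior deltas equal the slope, while the padded edges give smaller values.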
Add a fade in and/or fade out to a waveform.
transform_fade(fade_in_len = 0, fade_out_len = 0, fade_shape = "linear")
fade_in_len |
(int, optional): Length of fade-in (time frames). (Default: |
fade_out_len |
(int, optional): Length of fade-out (time frames). (Default: |
fade_shape |
(str, optional): Shape of fade. Must be one of: "quarter_sine",
"half_sine", "linear", "logarithmic", "exponential". (Default: |
forward param: waveform (Tensor): Tensor of audio of dimension (..., time).
Tensor: Tensor of audio of dimension (..., time).
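A linear fade amounts to multiplying the start and end of the signal by ramps. The base R sketch below shows this for the "linear" shape only, on a numeric vector; `apply_linear_fade()` is a hypothetical helper, and the assumption that the ramps run from 0 to 1 inclusive may not match the package exactly.

```r
# Apply a linear fade-in and fade-out (lengths in samples) to a vector.
apply_linear_fade <- function(waveform, fade_in_len = 0, fade_out_len = 0) {
  n <- length(waveform)
  stopifnot(fade_in_len <= n, fade_out_len <= n)
  # Gain ramps padded with ones so that they cover the whole signal.
  fade_in  <- c(seq(0, 1, length.out = fade_in_len), rep(1, n - fade_in_len))
  fade_out <- c(rep(1, n - fade_out_len), seq(1, 0, length.out = fade_out_len))
  waveform * fade_in * fade_out
}

flat <- rep(1, 5)
faded_in  <- apply_linear_fade(flat, fade_in_len = 3)
faded_out <- apply_linear_fade(flat, fade_out_len = 3)
print(faded_in)
print(faded_out)
```

The other shapes listed above would replace the linear ramp with a different curve over the same number of samples.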
Apply masking to a spectrogram in the frequency domain.
transform_frequencymasking(freq_mask_param, iid_masks)
freq_mask_param |
(int): maximum possible length of the mask. Indices uniformly sampled from [0, freq_mask_param). |
iid_masks |
(bool, optional): whether to apply different masks to each
example/channel in the batch. (Default: |
not implemented yet.
Solve for a normal STFT from a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.
transform_inverse_mel_scale( n_stft, n_mels = 128, sample_rate = 16000, f_min = 0, f_max = NULL, max_iter = 1e+05, tolerance_loss = 1e-05, tolerance_change = 1e-08, ... )
n_stft |
(int): Number of bins in STFT. See |
n_mels |
(int, optional): Number of mel filterbanks. (Default: |
sample_rate |
(int, optional): Sample rate of audio signal. (Default: |
f_min |
(float, optional): Minimum frequency. (Default: |
f_max |
(float or NULL, optional): Maximum frequency. (Default: |
max_iter |
(int, optional): Maximum number of optimization iterations. (Default: |
tolerance_loss |
(float, optional): Value of loss to stop optimization at. (Default: |
tolerance_change |
(float, optional): Difference in losses to stop optimization at. (Default: |
... |
(optional): Arguments passed to the SGD optimizer. Argument lr will default to 0.1 if not specified. (Default: |
forward param:
melspec (Tensor): A Mel frequency spectrogram of dimension (..., n_mels, time)
It minimizes the Euclidean norm between the input mel-spectrogram and the product between the estimated spectrogram and the filter banks using SGD.
Tensor: Linear scale spectrogram of size (..., freq, time)
Turn a normal STFT into a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.
transform_mel_scale( n_mels = 128, sample_rate = 16000, f_min = 0, f_max = NULL, n_stft = NULL )
n_mels |
(int, optional): Number of mel filterbanks. (Default: |
sample_rate |
(int, optional): Sample rate of audio signal. (Default: |
f_min |
(float, optional): Minimum frequency. (Default: |
f_max |
(float or NULL, optional): Maximum frequency. (Default: |
n_stft |
(int, optional): Number of bins in STFT. Calculated from first input
if NULL is given. See |
forward param: specgram (Tensor): Tensor of audio of dimension (..., freq, time).
tensor: Mel frequency spectrogram of size (..., n_mels, time).
Create MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.
transform_mel_spectrogram( sample_rate = 16000, n_fft = 400, win_length = NULL, hop_length = NULL, f_min = 0, f_max = NULL, pad = 0, n_mels = 128, window_fn = torch::torch_hann_window, power = 2, normalized = FALSE, ... )
sample_rate |
(int, optional): Sample rate of audio signal. (Default: |
n_fft |
(int, optional): Size of FFT, creates |
win_length |
(int or NULL, optional): Window size. (Default: |
hop_length |
(int or NULL, optional): Length of hop between STFT windows. (Default: |
f_min |
(float, optional): Minimum frequency. (Default: |
f_max |
(float or NULL, optional): Maximum frequency. (Default: |
pad |
(int, optional): Two sided padding of signal. (Default: |
n_mels |
(int, optional): Number of mel filterbanks. (Default: |
window_fn |
(function, optional): A function to create a window tensor
that is applied/multiplied to each frame/window. (Default: |
power |
(float, optional): Power of the norm. (Default: to |
normalized |
(logical): Whether to normalize by magnitude after stft (Default: |
... |
(optional): Arguments for window function. |
forward param: waveform (Tensor): Tensor of audio of dimension (..., time).
tensor: Mel frequency spectrogram of size (..., n_mels, time).
https://timsainb.github.io/spectrograms-mfccs-and-inversion-in-python.html
https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
## Not run:
if(torch::torch_is_installed()) {
  mp3_path <- system.file("sample_audio_1.mp3", package = "torchaudio")
  sample_mp3 <- transform_to_tensor(tuneR_loader(mp3_path))
  # (channel, n_mels, time)
  mel_specgram <- transform_mel_spectrogram(sample_rate = sample_mp3[[2]])(sample_mp3[[1]])
}
## End(Not run)
Create the Mel-frequency cepstrum coefficients from an audio signal.
transform_mfcc( sample_rate = 16000, n_mfcc = 40, dct_type = 2, norm = "ortho", log_mels = FALSE, ... )
sample_rate |
(int, optional): Sample rate of audio signal. (Default: |
n_mfcc |
(int, optional): Number of mfc coefficients to retain. (Default: |
dct_type |
(int, optional): type of DCT (discrete cosine transform) to use. (Default: |
norm |
(str, optional): norm to use. (Default: |
log_mels |
(bool, optional): whether to use log-mel spectrograms instead of db-scaled. (Default: |
... |
(optional): arguments for transform_mel_spectrogram. |
forward param: waveform (tensor): Tensor of audio of dimension (..., time)
By default, this calculates the MFCC on the DB-scaled Mel spectrogram. This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.
tensor: specgram_mel_db of size (..., n_mfcc, time).
Decode mu-law encoded signal. For more info see the Wikipedia Entry
transform_mu_law_decoding(quantization_channels = 256)
quantization_channels |
(int, optional): Number of channels. (Default: |
This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.
forward param: x_mu (Tensor): A mu-law encoded signal which needs to be decoded.
Tensor: The signal decoded.
Encode signal based on mu-law companding. For more info see the Wikipedia Entry
transform_mu_law_encoding(quantization_channels = 256)
quantization_channels |
(int, optional): Number of channels. (Default: |
forward param: x (Tensor): A signal to be encoded.
This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.
x_mu (Tensor): An encoded signal.
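The encode and decode steps can be sketched in base R with the standard mu-law companding formula, taking mu = quantization_channels - 1. `mu_law_encode()` and `mu_law_decode()` are hypothetical helpers on numeric vectors; the rounding convention (adding 0.5 before flooring) is an assumption and the package's tensor implementation may differ in detail.

```r
# Encode a signal in [-1, 1] to integer codes in 0..(quantization_channels - 1).
mu_law_encode <- function(x, quantization_channels = 256) {
  mu <- quantization_channels - 1
  compressed <- sign(x) * log1p(mu * abs(x)) / log1p(mu)  # still in [-1, 1]
  floor((compressed + 1) / 2 * mu + 0.5)                  # map to 0..mu
}

# Decode integer codes back to a signal in [-1, 1].
mu_law_decode <- function(x_mu, quantization_channels = 256) {
  mu <- quantization_channels - 1
  y <- x_mu / mu * 2 - 1                                  # back to [-1, 1]
  sign(y) * (exp(abs(y) * log1p(mu)) - 1) / mu            # expand
}

signal  <- c(-1, -0.5, 0, 0.5, 1)
codes   <- mu_law_encode(signal)
decoded <- mu_law_decode(codes)
print(codes)
print(round(decoded, 4))
```

Decoding recovers the signal only up to quantization error, which is smallest near zero, where mu-law allocates the most levels.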
Resample a signal from one frequency to another. A resampling method can be given.
transform_resample( orig_freq = 16000, new_freq = 16000, resampling_method = "sinc_interpolation" )
orig_freq |
(float, optional): The original frequency of the signal. (Default: |
new_freq |
(float, optional): The desired frequency. (Default: |
resampling_method |
(str, optional): The resampling method. (Default: |
forward param: waveform (Tensor): Tensor of audio of dimension (..., time).
Tensor: Output signal of dimension (..., time).
Apply sliding-window cepstral mean (and optionally variance) normalization per utterance.
transform_sliding_window_cmn( cmn_window = 600, min_cmn_window = 100, center = FALSE, norm_vars = FALSE )
cmn_window |
(int, optional): Window in frames for running average CMN computation (int, default = 600) |
min_cmn_window |
(int, optional): Minimum CMN window used at start of decoding (adds latency only at start).
Only applicable if center == |
center |
(bool, optional): If |
norm_vars |
(bool, optional): If |
forward param: waveform (Tensor): Tensor of audio of dimension (..., time).
Tensor: Tensor of audio of dimension (..., time).
Create a spectrogram or a batch of spectrograms from a raw audio signal. The spectrogram can be either magnitude-only or complex.
transform_spectrogram( n_fft = 400, win_length = NULL, hop_length = NULL, pad = 0L, window_fn = torch::torch_hann_window, power = 2, normalized = FALSE, ... )
n_fft |
(integer): Size of FFT |
win_length |
(integer): Window size |
hop_length |
(integer): Length of hop between STFT windows |
pad |
(integer): Two sided padding of signal |
window_fn |
(tensor or function): Window tensor that is applied/multiplied to each frame/window or a function that generates the window tensor. |
power |
(numeric): Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. If NULL, then the complex spectrum is returned instead. |
normalized |
(logical): Whether to normalize by magnitude after stft |
... |
(optional) Arguments for window function. |
forward param: waveform (tensor): Tensor of audio of dimension (..., time)
tensor: Dimension (..., freq, time), freq is n_fft %/% 2 + 1 and n_fft is the number of Fourier bins, and time is the number of window hops (n_frame).
Stretch stft in time without modifying pitch for a given rate.
transform_time_stretch(hop_length = NULL, n_freq = 201, fixed_rate = NULL)
hop_length |
(int or NULL, optional): Length of hop between STFT windows. (Default: |
n_freq |
(int, optional): number of filter banks from stft. (Default: |
fixed_rate |
(float or NULL, optional): rate to speed up or slow down by.
If NULL is provided, rate must be passed to the forward method. (Default: |
forward param: complex_specgrams (Tensor): complex spectrogram (..., freq, time, complex=2).
overriding_rate (float or NULL, optional): speed up to apply to this batch.
If no rate is passed, use self$fixed_rate. (Default: NULL)
Tensor: Stretched complex spectrogram of dimension (..., freq, ceil(time/rate), complex=2).
Apply masking to a spectrogram in the time domain.
transform_timemasking(time_mask_param, iid_masks)
time_mask_param |
(int): maximum possible length of the mask. Indices uniformly sampled from [0, time_mask_param). |
iid_masks |
(bool, optional): whether to apply different masks to each
example/channel in the batch. (Default: |
not implemented yet.
Converts a numeric vector, as delivered by the backend, into a torch_tensor
of shape (channels x samples).
If provided by the backend, attributes "channels" and "sample_rate" will be used.
transform_to_tensor( audio, out = NULL, normalization = TRUE, channels_first = TRUE )
audio |
(numeric): A numeric vector, as delivered by the backend. |
out |
(Tensor): An optional output tensor to use instead of creating one. (Default: |
normalization |
(bool, float or function): Optional normalization.
If boolean |
channels_first |
(bool): Set channels first or length first in result. (Default: |
list(Tensor, int), containing - the audio content, encoded as `[C x L]` or `[L x C]` where L is the number of audio frames and C is the number of channels - the sample rate of the audio (as listed in the metadata of the file)
Voice Activity Detector. Similar to SoX implementation.
transform_vad( sample_rate, trigger_level = 7, trigger_time = 0.25, search_time = 1, allowed_gap = 0.25, pre_trigger_time = 0, boot_time = 0.35, noise_up_time = 0.1, noise_down_time = 0.01, noise_reduction_amount = 1.35, measure_freq = 20, measure_duration = NULL, measure_smooth_time = 0.4, hp_filter_freq = 50, lp_filter_freq = 6000, hp_lifter_freq = 150, lp_lifter_freq = 2000 )
sample_rate |
(int): Sample rate of audio signal. |
trigger_level |
(float, optional): The measurement level used to trigger activity detection. This may need to be changed depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0) |
trigger_time |
(float, optional): The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25) |
search_time |
(float, optional): The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0) |
allowed_gap |
(float, optional): The allowed gap (in seconds) between quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25) |
pre_trigger_time |
(float, optional): The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0) |
boot_time |
(float, optional): The algorithm (internally) uses adaptive noise estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35) |
noise_up_time |
(float, optional): Time constant used by the adaptive noise estimator for when the noise level is increasing. (Default: 0.1) |
noise_down_time |
(float, optional): Time constant used by the adaptive noise estimator for when the noise level is decreasing. (Default: 0.01) |
noise_reduction_amount |
(float, optional): Amount of noise reduction to use in the detection algorithm (e.g. 0, 0.5, ...). (Default: 1.35) |
measure_freq |
(float, optional): Frequency of the algorithm’s processing/measurements. (Default: 20.0) |
measure_duration |
(float, optional): Measurement duration. (Default: Twice the measurement period; i.e. with overlap.) |
measure_smooth_time |
(float, optional): Time constant used to smooth spectral measurements. (Default: 0.4) |
hp_filter_freq |
(float, optional): "Brick-wall" frequency of high-pass filter applied at the input to the detector algorithm. (Default: 50.0) |
lp_filter_freq |
(float, optional): "Brick-wall" frequency of low-pass filter applied at the input to the detector algorithm. (Default: 6000.0) |
hp_lifter_freq |
(float, optional): "Brick-wall" frequency of high-pass lifter used in the detector algorithm. (Default: 150.0) |
lp_lifter_freq |
(float, optional): "Brick-wall" frequency of low-pass lifter used in the detector algorithm. (Default: 2000.0) |
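The default for `measure_duration` can be worked out by hand. The sketch below assumes that "measurement period" means `1 / measure_freq` (an interpretation, not stated explicitly above) and uses a hypothetical sample rate of 16000 Hz to express the result in samples.

```r
measure_freq <- 20                       # default from the usage above
measure_period <- 1 / measure_freq       # assumed meaning: seconds between measurements

# "Twice the measurement period": consecutive measurement windows overlap
default_measure_duration <- 2 * measure_period

sample_rate <- 16000                     # hypothetical value, for illustration only
samples_per_measurement <- default_measure_duration * sample_rate

default_measure_duration
samples_per_measurement
```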
Attempts to trim silence and quiet background sounds from the ends of recordings of speech. The algorithm currently uses a simple cepstral power measurement to detect voice, so it may be fooled by other things, especially music.
The effect can trim only from the front of the audio, so in order to trim from the back, the reverse effect must also be used.
forward param:
waveform (Tensor): Tensor of audio of dimension (..., time)
torch::nn_module()
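Because the effect trims only from the front, trimming both ends follows a reverse, trim, reverse pattern. The sketch below shows that pattern in base R on a plain numeric vector. `trim_front_toy()` is a hypothetical stand-in based on a simple amplitude threshold; it is not the VAD algorithm, and `trim_both_ends()` is an illustrative helper, not part of the package.

```r
# Hypothetical stand-in for a front-only trimmer (NOT the VAD algorithm):
# drops everything before the first sample whose magnitude exceeds a threshold
trim_front_toy <- function(x, threshold = 0.1) {
  first_loud <- which(abs(x) > threshold)[1]
  if (is.na(first_loud)) {
    return(x[0])  # nothing above the threshold: return an empty vector
  }
  x[first_loud:length(x)]
}

# Illustrative helper: trim the front, then reverse, trim again, reverse back
trim_both_ends <- function(x, trim_front, reverse = rev) {
  front_trimmed <- trim_front(x)
  reverse(trim_front(reverse(front_trimmed)))
}

signal <- c(0, 0, 0.01, 0.8, -0.6, 0.02, 0.7, 0, 0)
trimmed <- trim_both_ends(signal, trim_front_toy)
trimmed
```

With tensors, the same pattern would presumably pass the module returned by `transform_vad()` as `trim_front` and a flip along the time dimension (for example via `torch::torch_flip()`) as `reverse`.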
Add a volume to a waveform.
transform_vol(gain, gain_type = "amplitude")
gain |
(float): Interpreted according to the given gain_type:
If |
gain_type |
(str, optional): Type of gain. One of: |
forward param: waveform (Tensor): Tensor of audio of dimension (..., time).
Tensor: Tensor of audio of dimension (..., time).
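A minimal usage sketch, following the guarded style of the other examples in this manual. The random one-second, single-channel waveform and the gain value of 0.5 are made up for illustration. Only the default `gain_type = "amplitude"` is used, and no particular output values are assumed.

```r
if (torch::torch_is_installed()) {
  # Made-up waveform: 1 channel x 16000 samples, values in [-1, 1)
  waveform <- torch::torch_rand(1, 16000) * 2 - 1

  # Build the module once, then call it on a waveform of dimension (..., time)
  vol <- torchaudio::transform_vol(gain = 0.5, gain_type = "amplitude")
  scaled <- vol(waveform)

  # The result keeps the dimension of the input
  dim(scaled)
}
```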
Create a Dataset for YesNo
yesno_dataset( root, url = "http://www.openslr.org/resources/1/waves_yesno.tar.gz", folder_in_archive = "waves_yesno", download = FALSE, transform = NULL, target_transform = NULL )
root |
(str): Path to the directory where the dataset is found or downloaded. |
url |
(str, optional): The URL to download the dataset from.
(default: |
folder_in_archive |
(str, optional): The top-level directory of the dataset. (default: |
download |
(bool, optional): Whether to download the dataset if it is not found at root path. (default: |
transform |
(callable, optional): Optional transform applied on waveform. (default: |
target_transform |
(callable, optional): Optional transform applied on utterance. (default: |
tuple: (waveform, sample_rate, labels)
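A minimal usage sketch under stated assumptions: it needs network access to the default URL, and the `yesno` subdirectory of `tempdir()` is an arbitrary choice made here for illustration.

```r
if (torch::torch_is_installed()) {
  # Arbitrary download location chosen for this sketch
  root <- file.path(tempdir(), "yesno")
  dir.create(root, showWarnings = FALSE, recursive = TRUE)

  # Download (if not already present under root) and build the dataset
  yesno <- torchaudio::yesno_dataset(root, download = TRUE)

  # Number of recordings found
  length(yesno)
}
```

Indexing the dataset, for example `yesno[1]`, should then yield one (waveform, sample_rate, labels) item as described above.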