Table 3 Needs of traditional transcription, opportunities via Whisper and additional opportunities via Vink

From: From voice to ink (Vink): development and assessment of an automated, free-of-charge transcription tool

| Transcription concerns and needs | Characteristics of Whisper | Characteristics of Vink and additional needs |
| --- | --- | --- |
| Resources, infrastructure, and costs | | |
| Transcription services are expensive. | Whisper is offered by OpenAI free of charge. | Vink is a free-of-charge transcription tool that uses Whisper’s open-source algorithm. |
| Transcription software often requires high computing power to operate. | Whisper offers multiple model sizes that require 1–10 GB of RAM, so it can run on average computers, depending on the model size. | Vink retains this feature, allowing users to select a model size suited to their needs and their computer’s characteristics. |
| Safety and privacy | | |
| Uploading data for transcription or outsourcing transcripts to a third party raises confidentiality and data protection issues. | Whisper runs locally, which eliminates the need to share or upload data. | Vink is designed to operate locally without uploading data. |
| Quality of transcription | | |
| Transcription software is often unavailable in non-Western or less dominant languages. | Whisper uses the same speech models for all languages, which technically makes it usable for everyone, yet differences in performance persist. Audio files with mixed languages can be transcribed. | Accuracy of transcription varies across languages (Table 1). |
| Conventional transcription software often requires training on a user’s voice or on exemplary audio data. | The Whisper algorithm has already been trained on large amounts of data and is ready for use. | The ‘ready to use’ feature limits the possibility of adapting the algorithm to individual requirements. |
| Conventional transcription software often struggles with accents, mixed use of languages and background noise. | Whisper provides improved robustness to accents, background noise and technical language. | The improved speech recognition comes at the expense of expressions (e.g., laughter), which are excluded from the final transcript. |
| Identifying speakers (e.g., interviewer, respondent, multiple participants) is an essential but sometimes challenging feature of transcription. | Whisper does not offer speaker recognition. | Vink currently does not include speaker recognition. Depending on the transcription approach, the user may need to add speaker labels manually. |
| Other open-source transcription software (Silero, Vosk) outputs only raw lower-case text. Punctuation models can be applied later in the process, but these are not available for all languages. | Whisper generates transcripts with punctuation and capitalization already integrated, regardless of the language. | |
| Ease of use | | |
| Transcription software should be accessible to researchers without knowledge of software programming. | Whisper requires a programming language (e.g., Python, R), an interpreter, and the installation of specific packages within the programming environment to operate. | Vink is a downloadable standalone application that includes the necessary packages and tokenizers, reducing the installation requirements and steps. |
| | Whisper does not have a user interface, which limits its use to people with knowledge of programming (e.g., Python). | Vink includes an intuitive user interface. |
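
The table notes that Whisper's model sizes need roughly 1–10 GB of RAM and that the model size can be chosen to suit the user's computer. As a minimal sketch of that idea (not Vink's actual implementation), the Python below picks the largest model that fits a given memory budget and can then transcribe a file locally. The per-model memory figures are approximate values assumed from the Whisper project README, and the names `pick_model_size` and `transcribe_locally` are hypothetical. The calls `whisper.load_model` and `model.transcribe` come from the `openai-whisper` package, which must be installed separately.

```python
# Approximate memory needs per Whisper model size, in GB.
# Assumption: rounded figures from the Whisper README; real usage varies
# with hardware, audio length, and package version.
APPROX_MEMORY_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
}


def pick_model_size(available_gb: float) -> str:
    """Return the largest model whose approximate memory need fits the budget."""
    # Dicts preserve insertion order, so the list runs from smallest to largest.
    fitting = [name for name, need in APPROX_MEMORY_GB.items() if need <= available_gb]
    if not fitting:
        raise ValueError(f"No Whisper model fits in {available_gb} GB of memory")
    return fitting[-1]


def transcribe_locally(audio_path: str, available_gb: float) -> str:
    """Transcribe an audio file on this machine; no data is uploaded."""
    # Imported lazily so the size-selection helper works without the package.
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(pick_model_size(available_gb))
    result = model.transcribe(audio_path)
    return result["text"]
```

Under these assumed figures, `pick_model_size(4)` selects `small`, while a machine with 16 GB available would get `large`. Larger models generally trade speed for accuracy, so a user may still prefer a smaller model than the largest one that fits.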