
Revathi Joshi for AWS Community Builders


How AWS Service - Amazon Transcribe acts on PII

In my previous article, I showed you how to use Amazon Transcribe, an automatic speech recognition service, to create a text transcript of a pre-recorded speech file in English.

In this article, I am going to show you how to use Amazon Transcribe to add privacy to your transcriptions by redacting personally identifiable information (PII) from the transcript of an audio file that you upload to an S3 bucket, all through the AWS Management Console. Each piece of PII is classified by entity type, and its content is masked with that entity type in the transcript output. For example, the Social Security Number 123-45-6789 is masked as [SSN].

Amazon Transcribe

  • It is an automatic speech recognition service.

  • You can use it to transcribe media files stored in an Amazon S3 bucket (batch transcription) or audio in real time (streaming transcription).

  • The following types of PII are recognized for batch transcriptions:

    • SSN
    • CREDIT_DEBIT_NUMBER
    • CREDIT_DEBIT_EXPIRY
    • CREDIT_DEBIT_CVV
    • BANK_ACCOUNT_NUMBER
    • BANK_ROUTING
    • PIN
    • NAME
    • EMAIL
    • PHONE (10 digits)
    • ADDRESS
  • PII redaction for batch transcription is available with US English (en-US).

  • The output includes a word-for-word portion of the transcription.

  • A typical use case is an organization that needs to control which transcription data is exposed to which team members.

  • In such situations, personally identifiable information (PII) may need to be removed to protect privacy and comply with local laws and regulations.

  • With Amazon Transcribe, you can get accurate, redacted transcripts automatically, avoiding the errors and time cost of manual redaction.
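For readers who prefer the API to the console, the entity types listed above correspond to the `PiiEntityTypes` field of the `ContentRedaction` setting in a batch transcription request. The sketch below is only an illustration: the helper name `content_redaction_settings` is hypothetical, and it does nothing more than build and validate the settings dictionary.

```python
# PII entity types listed in this article for batch transcription.
PII_ENTITY_TYPES = {
    "SSN", "CREDIT_DEBIT_NUMBER", "CREDIT_DEBIT_EXPIRY", "CREDIT_DEBIT_CVV",
    "BANK_ACCOUNT_NUMBER", "BANK_ROUTING", "PIN", "NAME", "EMAIL",
    "PHONE", "ADDRESS",
}


def content_redaction_settings(entity_types=None):
    """Build a ContentRedaction dictionary (hypothetical helper).

    With no argument, every entity type is redacted ("ALL"),
    which matches the console default chosen in this article.
    """
    if entity_types is None:
        selected = ["ALL"]
    else:
        unknown = set(entity_types) - PII_ENTITY_TYPES
        if unknown:
            raise ValueError(f"Unknown PII entity types: {sorted(unknown)}")
        # Sort and de-duplicate so the request is deterministic.
        selected = sorted(set(entity_types))
    return {
        "RedactionType": "PII",
        "RedactionOutput": "redacted",
        "PiiEntityTypes": selected,
    }
```

Passing a subset such as `["SSN", "EMAIL"]` would, under these assumptions, ask the service to redact only those two types and leave the rest of the transcript untouched.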

Let’s get started!

Please visit my GitHub repository for S3 articles on various topics, which is updated regularly.

Objectives:

1. Create an S3 bucket

2. Upload an audio file containing PII to the S3 bucket

3. Create a transcription job

4. Review transcription results

Pre-requisites:

  • An AWS user account with admin access (not the root account).

  • An IAM role with the AmazonS3FullAccess policy attached.

Resources Used:

Amazon Transcribe

IAM Access Policy

S3 Bucket

Steps to implement this project:

1. Create an S3 bucket

On Amazon S3 console / Create bucket / Under General configuration /

Bucket name: - pii-bucket12

AWS Region: - US East (N. Virginia) us-east-1

  • Take all defaults and Create bucket
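If you would rather script this step, the same bucket can be created with the AWS SDK for Python (boto3). The sketch below is illustrative and not required for the console walkthrough. `create_bucket_kwargs` is a hypothetical helper that only prepares the arguments: buckets in us-east-1 are created without a `LocationConstraint`, while other regions need one.

```python
def create_bucket_kwargs(bucket_name, region="us-east-1"):
    """Build keyword arguments for the S3 create_bucket call (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location; any other region must be
    # named explicitly in a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   s3.create_bucket(**create_bucket_kwargs("pii-bucket12"))
```

Bucket names are globally unique, so you may need a different name than pii-bucket12 in your own account.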


2. Upload an audio file containing PII to the S3 bucket

  • Amazon Transcribe supports MP3, MP4, WAV, FLAC, AMR, OGG, and WebM formats.

  • Click on your bucket’s name to navigate to the bucket / On the Buckets Home page / Select Upload / Add files / Upload the PII-file.mp3 file

Upload


  • Select PII-file.mp3 file / Under Properties / For Object overview / Copy the S3 URL / Save it for future use

s3://pii-bucket12/PII-file.mp3
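The upload and the S3 URL can also be handled in code. The sketch below is a minimal illustration with hypothetical helper names (`media_format`, `s3_uri`): it checks a file name against the formats listed above and builds the S3 URL that the transcription job will need. The mapping from file extension to format string is an assumption made for this example.

```python
from pathlib import PurePosixPath

# Formats named in this article, keyed by file extension (assumed mapping).
SUPPORTED_FORMATS = {
    ".mp3": "mp3", ".mp4": "mp4", ".wav": "wav", ".flac": "flac",
    ".amr": "amr", ".ogg": "ogg", ".webm": "webm",
}


def media_format(key):
    """Return the format string for an object key, or raise if unsupported."""
    suffix = PurePosixPath(key).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported media format: {suffix or '(none)'}")
    return SUPPORTED_FORMATS[suffix]


def s3_uri(bucket, key):
    """Build the s3:// URL for an object."""
    return f"s3://{bucket}/{key}"


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client("s3")
#   media_format("PII-file.mp3")  # fail early on unsupported files
#   s3.upload_file("PII-file.mp3", "pii-bucket12", "PII-file.mp3")
#   uri = s3_uri("pii-bucket12", "PII-file.mp3")
```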


3. Create a transcription job

  • From the top menu bar, select Services then begin typing Transcribe in the search bar and select Amazon Transcribe to open the service console.

  • On the Amazon Transcribe Console / Transcription jobs page, click Create job / Under Specify job details / Job settings /

Name: - PII-transcribe-job

Language: - English, US (en-US)

  • Input data / Input file location on S3: s3://pii-bucket12/PII-file.mp3

Output data location type: take the default - Service-managed S3 bucket.

Next

  • On the Configure page / Under Content removal / Check PII redaction / Take the default Select ALL


Create job
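The console choices above have a rough equivalent in the `StartTranscriptionJob` API. The sketch below shows what such a request might look like with boto3. The helper `transcription_job_request` is hypothetical, and leaving out an output bucket is assumed to correspond to the default service-managed S3 bucket selected in the console.

```python
def transcription_job_request(job_name, media_uri, language_code="en-US"):
    """Build arguments for start_transcription_job (hypothetical helper)."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
        # No OutputBucketName: the output goes to a service-managed bucket.
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
            "PiiEntityTypes": ["ALL"],  # same as "Select ALL" in the console
        },
    }


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   transcribe = boto3.client("transcribe", region_name="us-east-1")
#   request = transcription_job_request(
#       "PII-transcribe-job", "s3://pii-bucket12/PII-file.mp3")
#   transcribe.start_transcription_job(**request)
```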

  • Wait for the status of your job to change from In progress to Complete
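When the job is started from code rather than the console, the waiting can be automated by polling `get_transcription_job`. The sketch below is one possible approach, not the only one. The function name `wait_for_job`, the polling interval, and the retry limit are all assumptions. Note that the API reports the status as COMPLETED, while the console displays Complete.

```python
import time


def wait_for_job(transcribe_client, job_name, poll_seconds=10,
                 max_polls=60, sleep=time.sleep):
    """Poll a transcription job until it completes or fails (hypothetical helper)."""
    for _ in range(max_polls):
        response = transcribe_client.get_transcription_job(
            TranscriptionJobName=job_name)
        job = response["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription job failed"))
        # Still QUEUED or IN_PROGRESS: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   transcribe = boto3.client("transcribe", region_name="us-east-1")
#   job = wait_for_job(transcribe, "PII-transcribe-job")
```

The `sleep` parameter exists only so the function can be exercised without real delays, for example in a unit test with a stub client.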


4. Review transcription results

  • Click on PII-transcribe-job / Under Transcription preview / Text

  • You can see that all personally identifiable information (PII) in the transcript is masked with the PII entity-type.
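If you download the transcript JSON instead of reading the preview, the masked entities can be counted with a few lines of Python. The sketch below assumes the transcript text sits under `results.transcripts[0].transcript` and that redactions appear as bracketed tags such as [SSN], as described above. The helper name `redaction_summary` and the sample sentence in the usage comment are made up for illustration.

```python
import json
import re
from collections import Counter

# Bracketed, upper-case tags such as [SSN] or [CREDIT_DEBIT_NUMBER].
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def redaction_summary(transcript_json):
    """Return the transcript text and a count of each redaction tag in it."""
    document = json.loads(transcript_json)
    text = document["results"]["transcripts"][0]["transcript"]
    return text, Counter(TAG_PATTERN.findall(text))


# Usage with a made-up transcript (not real service output):
#   sample = json.dumps({"results": {"transcripts": [
#       {"transcript": "My name is [NAME] and my number is [SSN]."}]}})
#   text, counts = redaction_summary(sample)
#   counts["NAME"], counts["SSN"]
```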


Cleanup

  • Delete the audio file - PII-file.mp3

  • Delete the S3 bucket - pii-bucket12

  • Delete the Transcription job - PII-transcribe-job
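The same cleanup can be scripted. The sketch below is a minimal illustration that takes already-created boto3 clients as arguments. The function name `cleanup` is hypothetical, and it assumes the bucket holds only the one audio file, since a bucket must be empty before it can be deleted.

```python
def cleanup(s3_client, transcribe_client, bucket, key, job_name):
    """Delete the audio file, the bucket, and the transcription job."""
    # Remove the object first so the bucket is empty when it is deleted.
    s3_client.delete_object(Bucket=bucket, Key=key)
    s3_client.delete_bucket(Bucket=bucket)
    transcribe_client.delete_transcription_job(TranscriptionJobName=job_name)


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   cleanup(boto3.client("s3"), boto3.client("transcribe"),
#           "pii-bucket12", "PII-file.mp3", "PII-transcribe-job")
```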

What we have done so far

Using Amazon Transcribe, an automatic speech recognition service, we have successfully redacted personally identifiable information (PII) from a transcript.
