Scrapfly for Scrapfly

Posted on Nov 25, 2024 • Originally published at scrapfly.io on Nov 25, 2024

How to Use cURL to Download Files?

#curl #tools

Curl, short for "Client URL," is a versatile command-line tool used for transferring data with URLs. It's widely favored by developers and system administrators for its ability to interact with a multitude of protocols such as HTTP, HTTPS, FTP, and more.

Using curl to download files simplifies the process by enabling direct command-line interaction with web resources. Curl is not only efficient and lightweight — operating without the need for a graphical interface — but also cross-platform, working seamlessly on Linux, macOS, and Windows systems.

In this article, we'll explore how to use curl to download a file from the web, covering various use cases and demonstrating the tool's versatility.

Why Use Curl to Download Files?

Curl stands out as an exceptional file downloading tool, offering a robust set of features that make it indispensable for developers. Here's what makes curl particularly powerful for downloading files:

Multi-Protocol Support

Handles various protocols like HTTP, HTTPS, FTP, and SFTP.
Eliminates the need for multiple tools when working with different protocols.

Resume Interrupted Downloads

Use the -C - option to continue downloads from where they left off.
Saves time and bandwidth by avoiding the need to restart downloads.

Bandwidth Management

Limit download speeds using --limit-rate to manage bandwidth usage.
Prevents downloads from consuming all available network resources.

Proxy Support

Easily configure proxies using options like -x or --proxy.
Supports various proxy types, including HTTP, HTTPS, SOCKS4, and SOCKS5.

Authentication Handling

Supports a range of authentication methods, including Basic, Digest, NTLM, and OAuth.
Access protected resources seamlessly.

Secure Transfers

Supports SSL/TLS protocols for secure file transfers.
Verify SSL certificates and use secure authentication methods.

Cross-Platform Compatibility

Available on Linux, macOS, Windows, and more.
Consistent functionality across different operating systems.

Automation and Scripting

Easily integrates into scripts for automated tasks.
Ideal for scheduled downloads using cron jobs or Windows Task Scheduler.

Curl's robust feature set makes it an excellent choice for downloading files, whether you're handling simple tasks or complex download operations. Its flexibility and efficiency empower users to manage downloads effectively in various environments.

You can learn more about curl and its options in our article about using curl for web scraping.

Now let's explore the basic usage of curl for downloading files and then dive deeper into more complex and unconventional scenarios.

Curl Basic File Download Options

By default, when curl is run on a file URL without any extra options, the file content is written to the terminal.

curl https://web-scraping.dev/assets/pdf/tos.pdf

However, you can have curl save the file under its original name using the -O option (uppercase "O", the short form of --remote-name):

curl -O https://web-scraping.dev/assets/pdf/tos.pdf

This command saves the file as tos.pdf, retaining the original filename.
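Download links often redirect to a mirror or CDN before serving the file, and curl does not follow redirects unless asked to. The sketch below wraps the command above in a hypothetical helper, download_file (the name is an assumption for illustration, not part of curl), adding -L to follow redirects and -f so that an HTTP error page is not saved in place of the file.

```shell
# Hypothetical helper (not part of curl): download a URL under its remote name.
download_file() {
    # -f: return a non-zero exit code on HTTP errors (4xx/5xx)
    #     instead of saving the error page
    # -L: follow HTTP redirects to the final location
    # -O: save under the filename taken from the URL
    curl -f -L -O "$1"
}
```

It could be called as download_file https://web-scraping.dev/assets/pdf/tos.pdf. Note that with -O the local name comes from the URL you pass in, not from the redirect target.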

Custom File Name on Download

To save the downloaded file with a custom name, use the -o (lowercase "o") option followed by the desired filename:

curl -o [filename] [URL]

Example:

curl -o web-scraping-tos.pdf https://web-scraping.dev/assets/pdf/tos.pdf

This command downloads tos.pdf and saves it as web-scraping-tos.pdf on your local machine.
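The -o option also accepts a path, not just a filename. As a minimal sketch, the hypothetical helper below (download_to is an assumed name) pairs -o with --create-dirs so that missing parent directories are created before the file is written.

```shell
# Hypothetical helper: save a URL to an explicit path.
download_to() {
    dest=$1   # e.g. downloads/legal/web-scraping-tos.pdf (illustrative path)
    url=$2
    # --create-dirs: create any missing directories in the -o path
    curl --create-dirs -o "$dest" "$url"
}
```

For example, download_to downloads/legal/web-scraping-tos.pdf https://web-scraping.dev/assets/pdf/tos.pdf would create downloads/legal/ if it does not exist yet.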

Show Progress Bar / Download Silently

Curl shows a progress meter by default. You can replace it with a simpler progress bar, or suppress it entirely.

Show Progress Bar

Replace the default progress meter with a simple progress bar using --progress-bar:

curl -O --progress-bar https://web-scraping.dev/assets/pdf/tos.pdf

Download Silently

To suppress all output, including progress and error messages, use the -s or --silent option:

curl -O -s https://web-scraping.dev/assets/pdf/tos.pdf

Silent Mode with Error Messages

If you want to hide the progress meter but still see error messages, combine -s with -S:

curl -O -s -S https://web-scraping.dev/assets/pdf/tos.pdf
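In scripts, the quiet options are usually combined with a check of curl's exit code. The following is a sketch under a few assumptions: fetch_quiet is a hypothetical function name, and -f is added so that HTTP error responses count as failures (curl then exits with code 22).

```shell
# Hypothetical helper: download quietly and report failures in a script-friendly way.
fetch_quiet() {
    url=$1
    # -f:  treat HTTP error responses (400 and above) as failures
    # -sS: hide the progress meter but keep error messages on stderr
    if curl -f -sS -O "$url"; then
        echo "downloaded: $url"
    else
        status=$?   # exit code of the curl command above
        echo "download failed (curl exit code $status): $url" >&2
        return "$status"
    fi
}
```

Because the function returns curl's own exit code, a caller can branch on it, for example fetch_quiet "$url" || exit 1.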

Retry for Unstable Connections

For unreliable network connections, you can configure curl to retry downloads automatically:

Set Number of Retries

Use the --retry option followed by the number of retry attempts:

curl -O --retry [number] [URL]

Example:

curl -O --retry 5 https://web-scraping.dev/assets/pdf/tos.pdf

This command retries the download up to 5 times upon failure.

Specify Retry Delay

To add a delay between retries, use --retry-delay:

curl -O --retry 5 --retry-delay [seconds] [URL]

Example:

curl -O --retry 5 --retry-delay 10 https://web-scraping.dev/assets/pdf/tos.pdf

This adds a 10-second pause between each retry attempt.

Retry on All Errors

By default, --retry only applies to errors curl considers transient, such as timeouts and HTTP 408, 429, 500, 502, 503, or 504 responses. To make it retry on all errors, use --retry-all-errors:

curl -O --retry 5 --retry-all-errors [URL]

Example:

curl -O --retry 5 --retry-all-errors https://web-scraping.dev/assets/pdf/tos.pdf
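The retry options can be combined in one command. The sketch below is one possible combination, not a recommended default: download_with_retry is a hypothetical name, and the numbers (5 retries, 10 seconds, 120 seconds) are illustrative. It also uses --retry-max-time, which caps the total time spent retrying.

```shell
# Hypothetical helper combining the retry options above.
download_with_retry() {
    url=$1
    # --retry 5:            retry up to 5 times after a failed attempt
    # --retry-delay 10:     wait 10 seconds between attempts
    # --retry-max-time 120: give up retrying once 120 seconds have passed
    curl -O --retry 5 --retry-delay 10 --retry-max-time 120 "$url"
}
```

Without --retry-delay, curl picks its own increasing wait between attempts, so setting a fixed delay is a choice rather than a requirement.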

Handling Large File Downloads

Downloading large files can pose challenges such as network congestion or impacting other users on the same network. Curl offers options to manage these issues effectively.

To prevent a large download from consuming all your available bandwidth, you can limit the download speed using the --limit-rate option:

curl -O --limit-rate [speed] [URL]

Example:

curl -O --limit-rate 500k https://web-scraping.dev/assets/pdf/tos.pdf

This command limits the download speed to 500 kilobytes per second. You can specify the speed using suffixes:

k or K for kilobytes (e.g., 500k)
m or M for megabytes (e.g., 2M)

Benefits:

Bandwidth Management : Ensures other network activities aren't slowed down.
Network Stability : Reduces the risk of connection drops due to high bandwidth usage.
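To pick a sensible limit, it helps to estimate how long a download will take at a given rate. The arithmetic below is a rough sketch: estimate_seconds is a hypothetical helper, the 1 GiB file size is an assumed example, and the calculation ignores protocol overhead. It relies on curl's suffixes being 1024-based, so 500k means 500 * 1024 bytes per second.

```shell
# Estimate how long a rate-limited download takes, in whole seconds.
# curl's k/M suffixes are 1024-based: 500k = 500 * 1024 bytes per second.
estimate_seconds() {
    size_bytes=$1
    rate_kib=$2
    echo $(( size_bytes / (rate_kib * 1024) ))
}

# A 1 GiB file (1073741824 bytes) at --limit-rate 500k
estimate_seconds 1073741824 500   # prints 2097 (about 35 minutes)
```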

Insecure Downloading

In some cases, you might need to use cURL to download a file from a server with an invalid or self-signed SSL certificate. Curl verifies SSL certificates by default, which can block these downloads.

Disable SSL Certificate Verification

Warning: Disabling SSL verification can expose you to security risks like man-in-the-middle attacks. Use this option only when you're certain about the server's trustworthiness.

To bypass SSL certificate checks, use the -k or --insecure option:

curl -O -k https://web-scraping.dev/assets/pdf/tos.pdf

This command tells curl to ignore SSL certificate validation and proceed with the download.
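When the server's certificate is self-signed but you have a copy of it (or of the CA that issued it), a safer alternative is to keep verification on and tell curl to trust that certificate. A minimal sketch, assuming the certificate is available as a PEM file; download_with_ca and the file path are hypothetical.

```shell
# Hypothetical helper: trust one specific certificate instead of disabling checks.
download_with_ca() {
    ca_file=$1   # PEM file obtained from the server's operator (assumed path)
    url=$2
    # --cacert: verify the server against this certificate file
    #           instead of the default CA store
    curl -O --cacert "$ca_file" "$url"
}
```

For example, download_with_ca ./internal-ca.pem https://example.com/file.zip keeps the protection against man-in-the-middle attacks that -k gives up.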

Verifying File Integrity

Ensuring that a downloaded file hasn't been tampered with is crucial, especially for important or large files. You can verify file integrity using checksum tools like sha256sum.

Using sha256sum to Verify Downloads

Steps:

Download the File and Its Checksum

curl -O https://example.com/file.zip
curl -O https://example.com/file.zip.sha256

Verify the Checksum

sha256sum -c file.zip.sha256

The -c option tells sha256sum to check the file against the provided checksum.

Manual Verification:

If the checksum isn't provided in a file:

Get the Expected Checksum

Obtain the checksum value from the website or provider.

Calculate the Downloaded File's Checksum

sha256sum file.zip

This command outputs a checksum that you can compare with the expected value.

Example Output:

e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 file.zip

Benefits:

Security : Confirms the file hasn't been altered maliciously.
Data Integrity : Ensures the file isn't corrupted due to network issues.
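The manual comparison can be scripted so that a mismatch stops an automated job. The sketch below assumes GNU sha256sum is available and that the expected digest is given in lowercase hex; verify_sha256 is a hypothetical helper name.

```shell
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <filename>"; keep only the digest
    actual=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}
```

A script could then run verify_sha256 file.zip "$expected_digest" || exit 1 right after the download, where expected_digest holds the value published by the provider.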

Handling Authentication

When downloading files from protected resources, authentication is often required. Curl supports various authentication methods to access these resources.

Authorization Header

To include an authorization token or API key in your request, use the -H option to add a custom header:

curl -O -H "Authorization: Bearer your_token_here" https://api.example.com/securefile.zip

This example uses bearer token authentication, but you can use any other authentication method supported by curl.
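Hardcoding a token in a command or script makes it easy to leak through shell history or version control. One common approach, sketched here, is to read it from an environment variable. API_TOKEN and download_with_token are assumed names for illustration only.

```shell
# Hypothetical helper: send a bearer token taken from the environment.
download_with_token() {
    url=$1
    # API_TOKEN is an assumed variable name; stop early if it is unset or empty
    if [ -z "${API_TOKEN:-}" ]; then
        echo "API_TOKEN is not set" >&2
        return 1
    fi
    # -f: fail on HTTP errors such as 401/403 instead of saving the error body
    curl -f -O -H "Authorization: Bearer ${API_TOKEN}" "$url"
}
```

The token can be exported once per session (export API_TOKEN=your_token_here) and the function called with only the URL.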

Cookie Session

If authentication relies on session cookies, you can manage cookies using curl:

When logging in, save the session cookies to a file using the -c option:

curl -c cookies.txt -d "username=user&password=pass" https://example.com/login

The -d option sends POST data for login credentials.
Cookies received during login are saved to cookies.txt.

Use Saved Cookies

Use the saved cookies for subsequent requests with the -b option:

curl -O -b cookies.txt https://example.com/securefile.zip

Benefits:

Session Management : Maintains login sessions across multiple requests.
Automated Workflows : Scripts can handle login and file download processes seamlessly.
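The two cookie steps can be chained in a script. This is a sketch under several assumptions: login_and_download is a hypothetical helper, LOGIN_USER and LOGIN_PASS are assumed environment variables, and the form field names (username, password) depend on the site. It stores cookies in a temporary file and removes it afterwards.

```shell
# Hypothetical helper: log in, download with the session cookies, then discard them.
login_and_download() {
    login_url=$1
    file_url=$2
    jar=$(mktemp)   # temporary cookie jar
    # --data-urlencode: POST the field and URL-encode its value
    # -c: write cookies received during login to the jar
    curl -f -sS -o /dev/null -c "$jar" \
        --data-urlencode "username=${LOGIN_USER}" \
        --data-urlencode "password=${LOGIN_PASS}" \
        "$login_url" &&
    # -b: send the saved cookies with the download request
    curl -f -sS -O -b "$jar" "$file_url"
    status=$?
    rm -f "$jar"    # do not leave session cookies on disk
    return "$status"
}
```

The download only runs if the login request succeeds, and the function returns the exit code of whichever step failed.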

Utilizing these options enhances the reliability of your file downloads, ensuring efficiency, security, and smoother operations even with unstable internet connections.

Curl Command Builder

To simplify the process of creating cURL commands for file downloads, we've created a curl command builder tool. This interactive form allows you to select various options and generate the corresponding curl command instantly:

Check it out here

Automating Curl Downloads with Crontab

Automating file downloads ensures you always have the latest data without manual effort. By integrating curl with crontab, you can schedule downloads to run at specified times, enhancing efficiency and productivity.

What Is Crontab?

Crontab is a time-based job scheduler in Unix-like operating systems. It allows users to schedule scripts or commands to run automatically at predefined times or intervals.

Steps to Automate Downloads Using Crontab

1. Create a Download Script (Optional)

Write the Script

Create a shell script (e.g., download.sh) that contains your curl command:

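#!/bin/bash
# Navigate to the desired directory; stop if it does not exist
cd /path/to/download/directory || exit 1

# Download the file using curl
# -f makes curl fail on HTTP errors instead of saving an error page
curl -f -O https://example.com/file.zip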

Make the Script Executable

chmod +x /path/to/download.sh

2. Edit the Crontab File

Open Crontab Editor

crontab -e

Add a New Cron Job

Insert a line following the cron syntax:

* * * * * /path/to/command

Example: Schedule the Script to Run Daily at 2 AM

0 2 * * * /path/to/download.sh

Fields Explained:

Minute: 0
Hour: 2 (2 AM)
Day of Month: * (Every day)
Month: * (Every month)
Day of Week: * (Every day of the week)

3. Save and Exit

After adding your cron job, save the file. The cron service will automatically pick up the new schedule.
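Cron jobs run without a terminal, so it is common to append the job's output to a log file. As an illustration of the field order described above, the hypothetical helper below (daily_cron_line is an assumed name, and the log path is a placeholder) prints a crontab line for a daily job with logging; the printed line is what you would paste into crontab -e.

```shell
# Hypothetical helper: print a crontab line for a daily job with logging.
# Field order: minute hour day-of-month month day-of-week command
daily_cron_line() {
    hour=$1
    minute=$2
    job=$3
    logfile=$4
    # ">> file 2>&1" appends both stdout and stderr to the log
    printf '%s %s * * * %s >> %s 2>&1\n' "$minute" "$hour" "$job" "$logfile"
}

daily_cron_line 2 0 /path/to/download.sh /path/to/download.log
# prints: 0 2 * * * /path/to/download.sh >> /path/to/download.log 2>&1
```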

Automating curl downloads with crontab streamlines your workflow, ensuring timely and consistent data retrieval. Whether you're updating datasets, synchronizing files, or performing regular backups, this combination offers a robust solution for scheduled tasks.

Bypassing Download Blocks

When attempting to use curl to download files, you might encounter situations where the download is blocked or fails. This can be due to various reasons such as network restrictions, server configurations, or security measures that prevent automated requests.

The most common reason for download blocks is that the server is blocking automated requests. To bypass this, you can add a custom browser user-agent string to your request headers to mimic a real browser request.

curl -O -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36" https://example.com/file.zip

This example uses the -A option to set a custom user-agent string. You can replace the string with any other user-agent string that mimics a real browser request.
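Browsers send more than a User-Agent, so some servers also look at other request headers. The sketch below adds a few of them; the header values are illustrative assumptions rather than an exact copy of any browser, and download_as_browser is a hypothetical name. Whether this is enough depends entirely on the server.

```shell
# Hypothetical helper: send a browser-like User-Agent plus a few common headers.
download_as_browser() {
    url=$1
    ua="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    # -L:           follow redirects, as a browser would
    # --compressed: request a compressed response and decompress it on arrival
    curl -O -L -A "$ua" \
        -H "Accept: */*" \
        -H "Accept-Language: en-US,en;q=0.9" \
        --compressed \
        "$url"
}
```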

Changing the user-agent string is the most basic method to bypass download blocks. However, some servers are sophisticated enough to still block requests with custom user-agent strings. In these cases, you may need a more advanced tool like curl-impersonate.

Curl-impersonate is a modified version of cURL that simulates the TLS fingerprint of major web browsers, like Chrome, Firefox, Edge and Safari, by mimicking their TLS and HTTP/2 configuration. It also replaces the default cURL headers, such as the User-Agent, with the values a regular browser sends. This makes curl-impersonate requests look like those sent from a browser, making it harder for firewalls to detect that an HTTP client is being used.

You can learn more about curl-impersonate in our dedicated guide on using curl-impersonate for web scraping.

Power Up File Downloads with Scrapfly

Downloading files programmatically can quickly become cumbersome, especially when the files are guarded by sophisticated bot protection systems that cannot be bypassed with tools like curl-impersonate.

Scrapfly has millions of proxies and connection fingerprints that can be used to bypass protection against automated traffic and significantly simplify your file download process.

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

Anti-bot protection bypass - scrape web pages without blocking!
Rotating residential proxies - prevent IP address and geographic blocks.
JavaScript rendering - scrape dynamic web pages through cloud browsers.
Full browser automation - control browsers to scroll, input and click on objects.
Format conversion - scrape as HTML, JSON, Text, or Markdown.
Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.

For example, here is how to use Scrapfly's web scraping API to download a file. We will use Scrapfly's Python SDK to call the API:

from scrapfly import ScrapflyClient, ScrapeConfig
import base64

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")

FILE_URL = "https://web-scraping.dev/assets/pdf/tos.pdf"

response = scrapfly.scrape(
    ScrapeConfig(
        url=FILE_URL,
        asp=True,
    )
)

# decode base64 file data
file_data = base64.b64decode(response.result.content)

with open("tos.pdf", "wb") as f:
    f.write(file_data)

Scrapfly's API automatically detects that the requested URL is a file and returns the file's binary content encoded as base64, which is why we decode the content returned by the API before saving it to a file called tos.pdf.

FAQ

Wrapping up, here are some common questions concerning downloading files with curl:

Can I resume an interrupted download with `curl`?

Yes. Use the -C - option, which tells curl to continue from where the download stopped. This is especially useful for large files or unstable connections.
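As a minimal sketch, resuming looks like the command inside this hypothetical helper (resume_download is an assumed name). The trailing dash after -C asks curl to work out the offset from the partial file already on disk.

```shell
# Hypothetical helper: resume a partially downloaded file.
resume_download() {
    # -C -: inspect the existing local file and continue from its current size
    # -O:   write to the same remote-derived filename as the first attempt
    curl -C - -O "$1"
}
```

Run it from the directory that holds the partial file. Resuming only works when the server supports range requests.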

Is wget a better alternative to curl for downloading files?

wget is another command-line tool specifically designed for downloading files. While curl is versatile and supports various protocols and features, wget is often preferred for its simplicity in handling recursive downloads and its ability to download entire websites. You can learn more about the differences between curl and wget in our dedicated curl vs wget article.

How do I download multiple files at once using `curl`?

You can download multiple files simultaneously by specifying multiple URLs in a single command or by using scripting methods to loop through a list of URLs, allowing for efficient batch downloads.
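For a handful of files, repeating -O once per URL is enough, for example curl -O URL1 -O URL2. For longer lists, a loop over a text file is a common pattern. The sketch below assumes a file with one URL per line; download_list and the file name urls.txt are hypothetical.

```shell
# Hypothetical helper: download every URL listed in a text file, one per line.
download_list() {
    list_file=$1
    while IFS= read -r url; do
        # skip blank lines and lines starting with '#'
        case $url in
            ''|'#'*) continue ;;
        esac
        # report a failed URL but keep going with the rest of the list
        curl -f -sS -O "$url" || echo "failed: $url" >&2
    done < "$list_file"
}
```

Calling download_list urls.txt fetches the files one after another. Recent curl releases also offer a --parallel (-Z) option for concurrent transfers within a single command.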

Summary

Curl is a versatile tool when it comes to downloading files, offering:

Multi-Protocol Support : Works with HTTP, HTTPS, FTP, and more.
Resume Capability : Restarts interrupted downloads with ease.
Proxy and Bandwidth Management : Supports proxies and limits download speed.
Authentication Support : Handles cookies, tokens, and secured resources.
Automation : Integrates with scripts and scheduling tools like crontab.

For advanced needs, tools like curl-impersonate or services like Scrapfly can bypass sophisticated bot protections, offering:

Enhanced Bypass Capabilities : Overcomes anti-bot systems.
API Flexibility : Simplifies complex file downloads with robust solutions.

Curl’s feature set makes it essential for managing simple to complex downloads efficiently.
