Instagram is a goldmine of data—profiles, posts, videos, follower counts, and more. But scraping this data isn’t as straightforward as just hitting a few buttons. Thanks to robust anti-bot measures and login requirements, accessing Instagram's treasure trove requires a bit of finesse.
But don't worry. I’m here to guide you through scraping Instagram user data using Python. With the right tools and methods, you can easily collect data from public profiles—no need for complex login processes or jumping through hoops.
Let's dive in.
Step 1: Configuring Your Environment
Before we dive into the code, let’s make sure you're equipped with the right libraries. You’ll need:
- requests: To make HTTP requests and interact with Instagram's backend API.
- python-box: This simplifies data handling, turning JSON responses into objects that you can interact with using dot notation.

Install both like this:

pip install requests python-box
With these libraries installed, we can start making requests to Instagram's backend to pull user profile data.
Step 2: Initiating the API Request
Instagram’s front-end is locked down tight, but the backend offers a sweet spot for accessing public data. To fetch profile data, you’ll use an endpoint like this:
https://i.instagram.com/api/v1/users/web_profile_info/?username={username}
Here’s the secret sauce: Instagram analyzes the request headers to filter out bots. Mimicking a real browser request is crucial. Specifically, you’ll use:
- x-ig-app-id: Instagram's app ID to make the request seem legitimate.
- User-Agent: The browser you're pretending to use.
Here’s the code for sending the request:
import requests

headers = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "*/*",
}

username = 'testtest'
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers)
response_json = response.json()
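One wrinkle worth guarding against before calling response.json(): in my experience the endpoint can answer with "user": null for unknown usernames, and with a non-JSON block page when it throttles you. The fetch_profile helper below is my own defensive sketch, not part of the original script, and the failure cases it guards are assumptions about how the endpoint misbehaves rather than documented behavior:

```python
import requests

def fetch_profile(username, headers, timeout=10):
    """Return parsed profile JSON, or None when the request fails in any way."""
    url = f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}'
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
    except requests.RequestException:
        return None  # network error, timeout, bad proxy, etc.
    if response.status_code != 200:
        return None  # rate-limited or blocked
    try:
        payload = response.json()
    except ValueError:
        return None  # body was not JSON, e.g. an HTML challenge page
    if not payload.get('data', {}).get('user'):
        return None  # unknown username: the API may send "user": null
    return payload
```

Returning None for every failure keeps the calling code to a single `if` check instead of a stack of try/excepts.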
Step 3: Handling Proxies to Avoid Detection
Instagram doesn’t like it when you hit their API repeatedly from the same IP address. To avoid rate-limiting or getting blocked, you’ll need to use proxies. Proxies allow you to send requests from different IP addresses, making your activity appear more like that of multiple users.
Here’s how you can use proxies:
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
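If your provider gives you more than one proxy endpoint, rotating through them spreads requests across IPs. A minimal sketch; the proxy_pool addresses below are placeholders, and pick_proxies is a helper name of my own:

```python
import random

# Hypothetical pool of proxy endpoints; swap in your provider's real list.
proxy_pool = [
    'http://user:pass@192.0.2.10:8080',
    'http://user:pass@192.0.2.11:8080',
    'http://user:pass@192.0.2.12:8080',
]

def pick_proxies(pool):
    """Build a requests-style proxies dict from one randomly chosen endpoint."""
    proxy = random.choice(pool)
    return {'http': proxy, 'https': proxy}

# Each request then goes out through a (potentially) different IP:
# response = requests.get(url, headers=headers, proxies=pick_proxies(proxy_pool))
```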
Step 4: Parsing JSON Efficiently with Box
Instagram’s API responses come in JSON format, which can be deeply nested and hard to navigate. The solution? python-box.
Instead of accessing data with dictionary-style keys (e.g., response_json['data']['user']['full_name']), Box lets you use dot notation, much like interacting with objects in JavaScript.
Here’s how you do it:
from box import Box

response_json = Box(response.json())

user_data = {
    'full name': response_json.data.user.full_name,
    'followers': response_json.data.user.edge_followed_by.count,
    'profile pic url': response_json.data.user.profile_pic_url_hd,
}
Now, parsing Instagram’s deep JSON structure is a breeze.
Step 5: Extracting Video and Post Data
But why stop at basic profile info? Instagram’s API also lets you grab media data—posts and videos—along with important metrics like likes, comments, and views.
Here’s how to extract video data:
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'id': element.node.id,
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
        'comment count': element.node.edge_media_to_comment.count,
        'like count': element.node.edge_liked_by.count,
    }
    profile_video_data.append(video_data)
The same method works for posts—whether photos or videos:
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'media url': element.node.display_url,
        'like count': element.node.edge_liked_by.count,
        'comment count': element.node.edge_media_to_comment.count,
    }
    profile_timeline_media_data.append(media_data)
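Keep in mind that this endpoint appears to return only a profile's most recent posts, not the full history. Still, once the entries are plain dicts, quick engagement stats come almost for free. A small post-processing sketch; average_likes is my own helper, and the sample list stands in for the profile_timeline_media_data built above:

```python
def average_likes(media_items):
    """Mean like count across scraped timeline entries; 0 for an empty list."""
    if not media_items:
        return 0
    return sum(item['like count'] for item in media_items) / len(media_items)

# Placeholder entries shaped like the dicts built in the loop above:
sample = [
    {'media url': 'https://example.com/a.jpg', 'like count': 120},
    {'media url': 'https://example.com/b.jpg', 'like count': 80},
]
print(average_likes(sample))  # 100.0
```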
Step 6: Storing the Data
After scraping all this data, you’ll want to save it for analysis or further use. Thankfully, Python's built-in json module makes it simple.
Here’s how to save everything:
import json

# Save profile data
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)

# Save video data
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)

# Save media data
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
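JSON preserves nesting, but the video and media records are flat dicts, so they also drop neatly into CSV for spreadsheet work. A sketch using the standard library's csv.DictWriter; save_as_csv is my own helper, and the sample rows mirror the dicts built in Step 5:

```python
import csv

def save_as_csv(rows, path):
    """Write a list of flat dicts to CSV, one column per key."""
    if not rows:
        return  # nothing to write
    with open(path, 'w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# Placeholder rows shaped like the video records from Step 5:
sample_rows = [
    {'id': '1', 'view count': 500},
    {'id': '2', 'view count': 900},
]
save_as_csv(sample_rows, 'video_data.csv')
```

With real data you would pass profile_video_data and f'{username}_video_data.csv' instead.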
Putting It All Together
Now, here’s the full script to scrape Instagram profile, video, and post data, handle headers and proxies, and save everything in JSON files:
import requests
from box import Box
import json

# Set headers and proxies (optional)
headers = {
    "x-ig-app-id": "936619743392459",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
}
proxies = {
    'http': 'http://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
    'https': 'https://<proxy_username>:<proxy_password>@<proxy_ip>:<proxy_port>',
}

username = 'testtest'

# Send the request
response = requests.get(f'https://i.instagram.com/api/v1/users/web_profile_info/?username={username}', headers=headers, proxies=proxies)
response_json = Box(response.json())

# Extract profile data
user_data = {
    'full name': response_json.data.user.full_name,
    'followers': response_json.data.user.edge_followed_by.count,
}

# Extract video data
profile_video_data = []
for element in response_json.data.user.edge_felix_video_timeline.edges:
    video_data = {
        'video url': element.node.video_url,
        'view count': element.node.video_view_count,
    }
    profile_video_data.append(video_data)

# Extract timeline media data
profile_timeline_media_data = []
for element in response_json.data.user.edge_owner_to_timeline_media.edges:
    media_data = {
        'media url': element.node.display_url,
        'like count': element.node.edge_liked_by.count,
    }
    profile_timeline_media_data.append(media_data)

# Save data
with open(f'{username}_profile_data.json', 'w') as file:
    json.dump(user_data, file, indent=4)
with open(f'{username}_video_data.json', 'w') as file:
    json.dump(profile_video_data, file, indent=4)
with open(f'{username}_timeline_media_data.json', 'w') as file:
    json.dump(profile_timeline_media_data, file, indent=4)
Final Thoughts
Scraping Instagram data with Python is more than just copying data—it's about smartly handling headers, proxies, and parsing complex JSON responses. By following these steps, you can efficiently scrape Instagram data and pull the information you need. Just remember, always adhere to Instagram’s terms of service and avoid scraping at too large a scale to stay on the right side of the platform.