Ethical Data, Explained
Intricacies of web scraping in 2023 with Pierluigi Vinciguerra, founder of The Web Scraping Club
In this episode of Ethical Data, Explained, Henry Ng is joined by Pierluigi Vinciguerra, founder of The Web Scraping Club and co-founder and CTO of Re Analytics - Databoutique.com. Pier is a web scraping professional with more than 15 years of experience in data sourcing. We discussed the past, present, and future of web scraping: how the technology is evolving and what to expect in the coming years, which trends are emerging and driving the market, and whether the future of web scraping for business lies with in-house or outsourced teams. We also talked about what determines the success of a web scraping project and how to choose a proxy provider for your project.
This episode is a great opportunity to learn more about the man behind the Web Scraping Club project and get his perspective on the industry and its future.
Quotes
1. “If we are talking about the success of a small web scraping project, the most important thing is the quality of the output. If you're selling this project you need to create trust between you as a provider and a user and you need to put all the effort you can to provide quality data. To do so you need to set up a process of data quality with the most common techniques like human count regression, trends forecasting, etc. For large-scale projects, this applies as well but you also need to think about your scraping architecture. If you're building something that you're going to scale you need to standardize your processes.”
2. “Web scraping is becoming harder and more expensive. 10 years ago there was no need to have any proxy unless you needed to bypass a geo-fence of a website. Now you need many more tools - proxies, headless browsers...”
3. “Many in the industry try to sell their APIs for automatic extraction from websites. This is a trend I've seen start four or five years ago and I think it's a good trend for the data sourcing industry because it resolves quite a number of issues.”
4. “There is more attention to the sourcing of the IP from many proxy providers. The narrative of the proxy providers about the proxy industry has moved to the ethical sourcing of the IP. It's good for this industry because web scraping has always been seen as shady. But it's totally legit if you do it in a proper way.”
3 questions we ask all guests:
1. Who in the world of Tech/Data would Pier take out for lunch?
Ben Rogojan, the Seattle Data Guy
2. What piece of software couldn't Pier imagine life without?
Scrapy - an open-source and collaborative framework for extracting data from websites.
3. What real-life problem did Pier solve using data?
He wrote a scraper to help him buy a TV, which eventually saved him 300-400 EUR.
Episode Resources
If you enjoyed this episode then please either:
Subscribe, rate, and review the "Ethical Data, Explained" podcast on Apple Podcasts.
Follow the "Ethical Data, Explained" podcast on Spotify.
Follow the "Ethical Data, Explained" podcast on Google Podcasts.
Watch full episodes of the "Ethical Data, Explained" podcast on YouTube.
To learn more about SOAX, visit the website.
Ethical Data, Explained is handcrafted by our friends over at: fame.so