Revice Denim
Web Scraping & Crawling Project

Python: Scrapy, BeautifulSoup, urllib

This project was my first foray into web scraping. Using the Python package Scrapy, and taking advantage of a JavaScript oversight in the Shopify framework, I created spiders that crawl through Revice Denim’s online retail site and scrape each retail item’s ID, name, sizes, colors, and quantities available on every page.
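
Below is a rough sketch of what one of those spiders could look like. The start URL, the CSS selectors, and the data-inventory-quantity attribute are illustrative assumptions, not Revice Denim’s actual markup:

    import scrapy


    class ReviceSpider(scrapy.Spider):
        """Crawls collection pages and yields one record per product variant."""

        name = "revice"
        start_urls = ["https://www.revicedenim.com/collections/all"]  # assumed entry point

        def parse(self, response):
            # Follow each product link on the collection page (selector is illustrative).
            for href in response.css("a.product-card__link::attr(href)").getall():
                yield response.follow(href, callback=self.parse_product)
            # Keep paginating through the collection (selector is illustrative).
            next_page = response.css("a.pagination__next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

        def parse_product(self, response):
            # One record per size variant; field selectors and attributes are placeholders.
            name = response.css("h1.product-title::text").get(default="").strip()
            color = response.css(".product-color::text").get(default="").strip()
            for variant in response.css("select[name='id'] option"):
                yield {
                    "item_id": variant.attrib.get("value"),
                    "name": name,
                    "size": variant.css("::text").get(default="").strip(),
                    "color": color,
                    "quantity": variant.attrib.get("data-inventory-quantity"),
                }

Running scrapy crawl revice -o inventory.csv from the project directory would then dump one row per variant.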

 

Inspired by the question “Can I predict when clothing prices will drop due to lack of demand?”, I set out to find a relatively small retail site with an easy, uniform JavaScript tagging scheme that could serve as my HTML source for a quick proof of concept.

 

My original strategy for this endeavor was to simply scrape an item’s unique ID, descriptive tags (e.g. “pants”, “flares”, “70s”, “high-waist”, “light-wash”), and boolean availability (in stock/sold out).

This was a brute-force tactic that would require day-by-day snapshots of these items, so that I could determine the number of days each item was available for purchase before selling out (what I call its “lifespan”) and eventually predict other items’ lifespans. Not a very elegant plan, but a plan nonetheless.
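
For illustration, here is roughly how that lifespan could be computed from daily snapshots; the column names and values below are hypothetical:

    import pandas as pd

    # Hypothetical daily snapshots of the scraped catalog: one row per item per
    # crawl day, with a boolean "available" flag.
    history = pd.DataFrame(
        {
            "date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02", "2020-03-03"],
            "item_id": ["123", "124", "123", "124", "123"],
            "available": [True, True, True, False, True],
        }
    )
    history["date"] = pd.to_datetime(history["date"])

    # "Lifespan" = number of crawl days an item was still in stock before selling out.
    lifespans = (
        history[history["available"]]
        .groupby("item_id")["date"]
        .nunique()
        .rename("lifespan_days")
    )
    print(lifespans)  # item 123 was available on 3 crawl days, item 124 on 1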


Luckily, I began this project by rooting through the source HTML of Revice Denim, a retail site built on Shopify. As I did so, I noticed that Shopify publishes the quantity of each item in stock inside a declarative HTML tag (possibly left over from the infrastructure built for the retailer’s own admin platform), so I could scrape the retailer’s current stock directly.
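
Here is a sketch of how that stock count could be pulled out of a product page with urllib and BeautifulSoup; the product URL, the script-tag id, and the inventory_quantity field are assumptions about how the theme embeds its product JSON:

    import json
    import urllib.request

    from bs4 import BeautifulSoup

    url = "https://www.revicedenim.com/products/example-jeans"  # hypothetical product page
    with urllib.request.urlopen(url) as resp:
        soup = BeautifulSoup(resp.read(), "html.parser")

    # Many Shopify themes serialize the product object into an inline <script> block.
    script = soup.find("script", id="ProductJson-product-template")
    if script and script.string:
        product = json.loads(script.string)
        for variant in product.get("variants", []):
            print(
                variant.get("id"),
                variant.get("title"),
                variant.get("inventory_quantity"),  # the exposed stock count, if present
            )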


The result of this discovery and my Python script is an Excel sheet containing a full record of Revice Denim’s inventory.
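
Writing the scraped records out to a spreadsheet is straightforward with pandas (the rows below are hypothetical; openpyxl handles the Excel output):

    import pandas as pd

    # Hypothetical records, as yielded by the spider above.
    records = [
        {"item_id": "123", "name": "Flares", "size": "26", "color": "light-wash", "quantity": 4},
        {"item_id": "124", "name": "Flares", "size": "27", "color": "light-wash", "quantity": 0},
    ]

    # One row per variant, one sheet per crawl date.
    pd.DataFrame(records).to_excel("revice_inventory.xlsx", sheet_name="2020-03-01", index=False)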
