Goals
Im applying my graffiti interests toward my programming ability. Today I put together a brief python script to scrape a reputable graffiti writers forum. The scraper was designed to build a database of user names and images uploaded. My goal is two fold: learn about computer vision through analyzing graffiti pictures, develop insight into people’s usage of the internet for distributing graffiti images. Beyond analysis, I would love to create a graffiti recomendation engine for aspiring graffiti artists. This engine would look at a person’s ‘style’ and give suggestions of other artists they may benefit from studying.
Scraping process
I focused my first group of images scraped on New York city. I found the New York city forum thread in 12ozprophet.com. From here, I catered the scraper to tag posts and the images contained in the post. I created a unique ID for the images and the posts. Now that this has been completed, I may need to create a unique ID for the user names. As of now, this is not necessary.
The scraper pulled from over 440,000 posts within the New York graffiti thread. From those posts, over 275,000 images were gathered. In a brief overview, I can tell that not all images are graffiti. Similarly, not all images are of paintings. There are tags, throw ups, pieces, and sketches. The images vary in quality and size. The dataset pulled from forum posts between 2009 to 2014. I’m surprised to see how much camera quality and digital photography has changed in those few years.
Considering analysis
In the past few weeks, I have spent time understanding the application of statistical analysis on large data sets. Because I was working with small sets, I was able to use Microsoft Excel’s functions. Beyond 100 data points, excel becomes noticeably sluggish in processing simple PivotTable tasks. I will use R and python libraries to process future analysis.
I am reading various books I purchased in the past from Oreilly. The books topics are mainly around R, data analysis in Python, OpenCV, and machine learning. None of these topics are failiar to my background, or me so I have also picked up a basic book on linear algebra. I would like to finish the preliminary results from the first dataset by next week and move ahead with scraping the rest of 12ozprophet.
Code
The code used to scrape is pasted below. The dataset will be made available in the near future. Note, while scraping 12ozprophet, the images are hosted on a large variety of websites. Different users primarily uploaded the images, so the image analysis should not create a large load on the forums. I do not believe I need to download the images for analysis, but if I do, I will likely use some a virtual machine cloud instance.
Also, I am unfamiliar with the Python-way to write programs. I am primarily working with JavaScript and PHP. I have dove into Ruby on Rails, but am still dependent on the Gems and generators. As a result, much of my Python code is written with a JavaScript mentality.
#!/usr/bin/env python
import scraperwiki
import requests
import lxml.html
# Constants
numberofposts = 7844
post_iteration = 0
user_iteration = 0
image_iteration = 0
# One-time use
global savedusername, date_published, post_image
savedusername = 'null'
dateremovetitle = """N E W Y O R K C I T Y - """
dateremovere = """Re:"""
ignoredimages = ['images/12oz/statusicon/post_old.gif','images/12oz/buttons/collapse_thead.gif','images/12oz/statusicon/post_new.gif','images/12oz/reputation/reputation_pos.gif','images/12oz/reputation/reputation_highpos.gif','images/icons/icon1.gif','images/12oz/buttons/quote.gif','clear.gif', 'images/12oz/attach/jpg.gif','images/12oz/reputation/reputation_neg.gif','images/12oz/reputation/reputation_highneg.gif','images/12oz/statusicon/post_new.gif']
for i in range(1, numberofposts):
html = requests.get("http://www.12ozprophet.com/forum/showthread.php?t=128783&page="+ str(i)).content
dom = lxml.html.fromstring(html)
print 'Page: ' + str(i)
for posts in dom.cssselect('#posts'):
for table in posts.cssselect('table'):
try:
username = table.cssselect('a.bigusername')[0].text_content()
if username != savedusername:
if username != 'null':
savedusername = username
except IndexError:
username = 'null'
try:
post_iteration = post_iteration + 1 #my unique post id
postdate = table.cssselect('td.alt1 div.smallfont')[0].text_content()
postdate = postdate.replace(dateremovetitle, '')
postdate = postdate.replace(dateremovere, '')
postdate = postdate.strip()
date_published = postdate
# print '---'
# print savedusername +' '+ postdate + ', ID: ' + str(iteration)
except IndexError:
postdate = 'null'
for img in table.cssselect('img'):
imagesrc = img.get('src')
imagematch = 'false'
for image in ignoredimages:
if image == imagesrc:
imagematch = 'true'
if imagematch != 'true':
image_iteration = image_iteration + 1
post_image = imagesrc
post = {
'image_id': image_iteration,
'image_url': post_image,
'post_id': post_iteration,
'user': savedusername,
'date_published': date_published,
}
print post
scraperwiki.sql.save(['image_id'], post)