IMPORTANT NOTE

The indentation of the Python code may have been damaged when this post was migrated from WordPress. Please check out my 4chan repository at https://bitbucket.org/inedit00/ to see the newest source code.


Are you bored? Me too. And what do you normally do when you are bored? You visit 4chan.org/b, the random board, and keep pressing F5, waiting for something naked, er, interesting to show up.

But this is only fun for the first 20 minutes. Then you get bored of pressing the refresh button too. So you write a Python script that automatically downloads the photos posted on 4chan and stores them in a folder. Here it is:

import os
import re
import time

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2, which the script was written for

DST_PATH = './imgs'
if not os.path.isdir(DST_PATH):
    os.makedirs(DST_PATH)


def get_main_page():
    # Downloads the HTML code of the main page and returns it as a list of lines
    w = urlopen('http://boards.4chan.org/b/')
    data = w.read().decode('utf-8', 'replace').split('\n')
    w.close()
    return data


def extract_src_tags(lines):
    # Collects every protocol-relative link (href="//...") found in the page
    urls = []
    for line in lines:
        urls.extend(re.findall('href="//([a-z:/.A-Z0-9]*)"', line))
    return urls


def filter_invalid_urls(urls):
    # Keeps only links to jpg or png files; the set removes duplicates
    return set(url for url in urls if url.endswith(('jpg', 'png')))


def download_img(url):
    def file_exist(filename):
        # Just check if the file already exists. This prevents downloading
        # the same file many times
        return os.path.exists('%s/%s' % (DST_PATH, filename))

    # Extract the filename of the photo. Sometimes the regexp does not
    # match, and then the URL is skipped
    match = re.search(r'/([0-9]*\..*)', url)
    if match is None:
        return False
    filename = match.group(1)

    if file_exist(filename):
        return False

    url = "http://" + url
    img = urlopen(url).read()  # Download the image
    with open('%s/%s' % (DST_PATH, filename), 'wb') as f:  # Write the image to disk
        f.write(img)

    return True


while True:
    lines = get_main_page()
    for url in filter_invalid_urls(extract_src_tags(lines)):
        if download_img(url):
            print("Downloaded image %s" % url)
            time.sleep(1)  # Short pause after each download

    time.sleep(20)  # Wait before fetching the main page again
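
To see what the link extraction and filtering steps do without touching the network, here is a minimal sketch that runs the same regular expression on a made-up HTML fragment. The sample markup, the `SAMPLE_HTML` and `HREF_RE` names and the `extract_image_urls` helper are assumptions for illustration only; the real page layout may well differ. The sketch is written for Python 3.

```python
import re

# Hypothetical HTML fragment, only loosely modelled on a board page
SAMPLE_HTML = '''<a href="//i.example.org/b/1400000000001.jpg">one</a>
<a href="//i.example.org/b/1400000000002.png">two</a>
<a href="//boards.example.org/b/thread/123">thread</a>
<a href="//i.example.org/b/1400000000001.jpg">one again</a>'''

# Same pattern as in the script: it only matches protocol-relative links
HREF_RE = re.compile('href="//([a-z:/.A-Z0-9]*)"')


def extract_image_urls(html):
    """Return the sorted, de-duplicated jpg/png links found in html."""
    urls = HREF_RE.findall(html)
    return sorted(set(u for u in urls if u.endswith(('jpg', 'png'))))


for url in extract_image_urls(SAMPLE_HTML):
    print(url)
```

The thread link is dropped by the extension check, and the repeated jpg link collapses into one entry because of the set, so each image is requested at most once per pass.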

Save the script as 4chan.py and run it with:

$ python 4chan.py

The images are stored in a folder called "imgs", created in the directory you run the script from.
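
The name of each file on disk comes from the numeric file name at the end of its link, which is also how already downloaded images are recognised and skipped. A small sketch of that mapping, using a hypothetical `local_path` helper and made-up URLs (neither is part of the original script), could look like this in Python 3:

```python
import os
import re

DST_PATH = './imgs'  # same destination folder as in the script


def local_path(url):
    """Return where url would be stored, or None if no file name is found.

    Hypothetical helper for illustration, not part of the original script.
    """
    # Same pattern as the script: a slash, optional digits, a dot, the rest
    match = re.search(r'/([0-9]*\..*)', url)
    if match is None:
        return None
    return os.path.join(DST_PATH, match.group(1))


print(local_path('i.example.org/b/1400000000001.jpg'))  # a path inside imgs
print(local_path('boards.example.org'))                 # no file name: None
```

Checking `os.path.exists` on the returned path before downloading is what keeps a restarted script from fetching the same image again.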