This post was migrated from WordPress, which mangled the indentation of the Python code, so the listing below may not match the original exactly. Please check out my 4chan repository at https://bitbucket.org/inedit00/ for the newest source code.
Are you bored? Me too. And what do you normally do when you are bored? You visit 4chan.org/b, the random section, and start pressing F5, waiting for something naked interesting.
But this is only fun for the first 20 minutes. Then you get bored of pressing the refresh button too. And then you write a Python script that automatically downloads the photos posted on 4chan and stores them in a folder. Here it is:
import os
import re
import time
import urllib

DST_PATH = './imgs'
os.system("mkdir -p %s" % DST_PATH)


def get_main_page():
    # Downloads the HTML code of the main page
    w = urllib.urlopen('http://boards.4chan.org/b/')
    data = w.read().split('\n')
    w.close()
    return data


def extract_src_tags(lines):
    # Collect every protocol-relative link found in the page. The empty list
    # as initial value keeps reduce() from failing when there are no lines
    return reduce(lambda a, b: a + b,
                  map(lambda line: re.findall('href="//([a-z:/.A-Z0-9]*)"', line), lines),
                  [])


def filter_invalid_urls(urls):
    # Keep only links to jpg/png files; the set drops duplicates
    return set([url for url in urls if url.endswith('jpg') or url.endswith('png')])


def download_img(url):
    def file_exist(filename):
        # Just check if the file already exists. This prevents downloading
        # the same file many times
        return os.path.exists('%s/%s' % (DST_PATH, filename))

    try:
        # Inside try-except because sometimes the regexp does not match
        filename = re.search(r'/([0-9]*\..*)', url).group(1)  # Extract the filename of the photo
    except AttributeError:
        return False
    if file_exist(filename):
        return False
    url = "http://" + url
    img = urllib.urlopen(url).read()  # Download the img
    with open('%s/%s' % (DST_PATH, filename), 'wb') as f:  # Write the image to disk
        f.write(img)
    return True


while True:
    lines = get_main_page()
    for url in filter_invalid_urls(extract_src_tags(lines)):
        if download_img(url):
            print "Downloaded image %s" % url
            time.sleep(1)
    time.sleep(20)
Save it as 4chan.py and run it (the script is written for Python 2):
$ python 4chan.py
The images are stored in a folder called "imgs", created in the directory you run the script from.
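If you want to see what the two regular expressions do before letting the script loose on the real site, here is a small offline sketch. The HTML lines and the i.example.org host are made up, and the helper names extract_urls, keep_images and local_path are only for illustration; they are not the functions from the script. It assumes the links look like the protocol-relative ones the script expects.

```python
import re

# Made-up HTML lines, only for illustration; host and file names are hypothetical.
SAMPLE_LINES = [
    '<a href="//i.example.org/b/1357924680123.jpg">pic</a>',
    '<a href="//i.example.org/b/1357924680123.jpg">same pic again</a>',
    '<a href="//boards.example.org/b/res/12345">thread</a>',
    '<a href="//i.example.org/b/1357924680456.png">pic</a>',
]

# Same patterns as in the script
HREF_RE = re.compile(r'href="//([a-z:/.A-Z0-9]*)"')
NAME_RE = re.compile(r'/([0-9]*\..*)')


def extract_urls(lines):
    # Gather every protocol-relative link, line by line
    urls = []
    for line in lines:
        urls.extend(HREF_RE.findall(line))
    return urls


def keep_images(urls):
    # Keep only jpg/png links; the set removes repeated links
    return set(u for u in urls if u.endswith(('jpg', 'png')))


def local_path(url, dst_path='./imgs'):
    # Map an image URL to the file it would be written to, or None if
    # the URL has no "digits.extension" part
    match = NAME_RE.search(url)
    if match is None:
        return None
    return '%s/%s' % (dst_path, match.group(1))


if __name__ == '__main__':
    for url in sorted(keep_images(extract_urls(SAMPLE_LINES))):
        print(local_path(url))
```

The thread link is dropped by the extension filter, and the repeated jpg link collapses into a single entry, so only two files would be written.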