|
Post by ch00beh on Apr 26, 2016 11:59:18 GMT -5
So I was bored last night and angry at the internet (photobucket specifically), so I decided to try and download our little wobsite in case things blow up again. I found a wob crawling framework called Scrapy, which has been annoying me to no end due to it having one p, but I ended up with a little script that can be pointed at a topic and will spit out a super simplified version to local disk. Here is the sample output of Dis
I've discovered quite a few things about proboards which lead me to believe that if it ever goes down, wayback won't have the pages properly indexed. So that's fun.
Anyway, the script is still a work in progress. I've found the secret incantation to actually find pages, but now I'm running into good ol' threading issues: order is not guaranteed in page reads, and the last page just goes ahead and overwrites everything prior. But there are some provisions for this in Scrapy, so I think I can work around that, and soon I'll be able to just point it at an entire board and have it dump everything.
I'm considering spinning up an AWS instance or something so I can have a robot automatically back everything up every day, but that's like a whole thing for maybe a week or two later.
If you are interested in looking at the shitty code or even contributing, I've backed up this backing up script on the githubs: github.com/jmaliksi/aesimplifier (current incantations for crawling pages can be found in the branch jm_crawls)
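In case anyone wants a rough picture of what the script does, here is a minimal sketch of a Scrapy spider that walks a topic's pages and dumps simplified HTML to disk. The spider name, URLs, CSS selectors, and file naming are all placeholders made up for illustration, not the actual exy.py code:

```python
# Minimal sketch: crawl a ProBoards topic page by page and dump simplified HTML.
# Spider name, URLs, and CSS selectors are made-up placeholders, not the real exy.py.
import scrapy


class TopicSpider(scrapy.Spider):
    name = 'topic_backup'
    start_urls = ['http://yourforum.proboards.com/thread/1234/some-topic']

    def parse(self, response):
        # Grab just the post bodies out of the page (selector is a guess at the markup).
        posts = response.css('article.post').extract()

        # Name the output file after the page so later pages don't overwrite earlier ones.
        page_slug = response.url.rstrip('/').split('/')[-1]
        with open('dist/%s.html' % page_slug, 'w') as f:
            f.write('\n'.join(posts))

        # Follow the "next page" link if there is one.
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Writing one file per page like this is also one way around the last-page-overwrites-everything problem mentioned above.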
|
|
|
Post by ch00beh on Apr 27, 2016 0:39:14 GMT -5
|
|
|
Post by ch00beh on Apr 27, 2016 0:56:43 GMT -5
things left to do when i am next bored:
- Make the top level crawler actually crawl instead of explicitly listing out every page of every board
- Fuzzier topic name matching
- Generate links and index page for backed up things automatically instead of doing gnarly regexes
- Cron job and auto upload
- Command line mode
- Cold storage in S3?
- Proper HTML/CSS gen
- Filter out spoiler tags or make them functional
- Self serve wob form?
- Two phase distribution
- Topic organization
|
|
|
Post by ch00beh on Apr 28, 2016 3:47:13 GMT -5
ok so now i have templates and an index and shit. And having templates means fonts. All the above links are dead, go here instead: www.josephmaliksi.com/stuff/
I also set all spoiler tags to hidden, so now there might be some missing context, but oh well.
|
|
|
Post by ch00beh on Apr 28, 2016 10:48:53 GMT -5
jk about spoiler tags i just made them work
|
|
|
Post by Beelzebibble on Apr 28, 2016 11:03:15 GMT -5
Man this is awesome work. I didn't think images and stuff would be incorporated.
|
|
|
Post by ch00beh on Apr 28, 2016 11:04:27 GMT -5
it's literally copy and pasting the stuff that proboards generates. it would be harder for me to remove them than to keep them there.
|
|
|
Post by Beelzebibble on Apr 28, 2016 11:07:44 GMT -5
Then whatever, i don't think it's impressive at all, fine
|
|
|
Post by ch00beh on Apr 28, 2016 11:14:29 GMT -5
just be impressed that i got any css to work
|
|
|
Post by ch00beh on Apr 28, 2016 13:01:52 GMT -5
so in case anyone wants to run this thing, I can provide instructions for OSX. If you are using linux, you should already know this. If you are using windows, you're on your own.
For OSX, open up Terminal.app and make sure you have python installed by typing `python --version`. That should spit out something that isn't an error. I think it comes with OSX by default, but yeah, just make sure. Next, install pip and git. After that, type the following:
- mkdir -p ~/workspace
- cd ~/workspace
- git clone github.com/jmaliksi/aesimplifier.git
- cd aesimplifier
- pip install virtualenv
- virtualenv env
- source env/bin/activate
- pip install -r requirements.txt
- mkdir aesimplifier/dist
A lot of things happened, but don't worry, I probably did not hack you. After that, you'll want to open up the file aesimplifier/aesimplifier/spiders/exy.py with textedit or something. You'll see two giant lists: "start_urls" and "self.topics". Delete everything in the square brackets. In "start_urls" add the URL to the board your topic of interest resides in (make sure you surround it with quotes), and in "self.topics", add the exact name of the topic you want to back up. Now run
- cd aesimplifier
- scrapy crawl exy
I messed up some of the folder structure, but I didn't feel like fixing it, so that's why there are all these redundancies. But anyway, the crawler is now running, spitting out the post content of your topic. Give it a few since it's throttled to one page every two seconds to avoid being autobanned. Once it's finished, open up the dist folder and you should have some shiny new html files in there. Grats. I'll probably work on making topic definition slightly easier next, because it is annoying me as well.
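For reference, the edited bits of exy.py end up looking roughly like this; the board URL and topic title below are made-up examples, so swap in your own:

```python
# inside aesimplifier/aesimplifier/spiders/exy.py -- values here are made-up examples
start_urls = [
    'http://yourforum.proboards.com/board/2/general',  # the board your topic lives in
]

self.topics = [
    'Exact Topic Title As It Appears On The Board',  # must match the topic name exactly
]
```

The two-second throttle is presumably just Scrapy's standard DOWNLOAD_DELAY setting (or something equivalent), so a long topic genuinely will take a couple of minutes.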
|
|
|
Post by ch00beh on Apr 30, 2016 13:17:01 GMT -5
code has been updated to support workflows and to get rid of the dumb directory nesting. this is mostly for me so I can iterate on webpage generation faster, but if you are trying to run it, just run `fab generate` at the top level directory instead of running scrapy in the subdir
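For the curious, `fab generate` is just a Fabric task. A stripped-down fabfile.py along these lines would behave the same way; the task body below is a guess based on the old instructions, not the actual file:

```python
# fabfile.py -- minimal sketch using the Fabric 1.x API; the task body is a guess
from fabric.api import lcd, local, task


@task
def generate():
    """Run the crawler from the top-level directory so output lands in dist/."""
    with lcd('aesimplifier'):
        local('scrapy crawl exy')
```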
|
|
|
Post by ch00beh on Apr 30, 2016 15:44:09 GMT -5
just 4 u pohats, i made the braggadocio font work.
this means that the ish pages are the proud owners of 5 different fonts
|
|
|
Post by ch00beh on May 1, 2016 15:17:02 GMT -5
so i'm just about done building out the base feature set and probably won't do anything with AWS for a while.
if anyone has any features they would like, or wants to add topics to the backup list, just post here.
|
|
|
Post by Tout-Perd on May 1, 2016 16:10:00 GMT -5
Please enable jiggle physics for Kevin's posts. kthnxbye
|
|
|
Post by Loogs on May 5, 2016 0:16:52 GMT -5
|
|
|
Post by ch00beh on May 14, 2016 14:27:15 GMT -5
kingsmen and luncheon added to the site
|
|
|
Post by ch00beh on Jun 1, 2016 23:19:24 GMT -5
|
|
|
Post by ch00beh on Oct 1, 2016 7:34:25 GMT -5
Added another Luncheon, Sailing, Sextant, Give/Get, and some solo fics to the list. I should probably implement folders one day, but that sounds hard.
|
|
|
Post by ch00beh on Jul 21, 2017 9:28:53 GMT -5
so i realize now that pohatu was feigning amazement at the images being present in the tags because he may have thought the script downloaded them and rehosted them on my site. it did, in fact, do no such thing. it just copy/pasted the link, which may well be from photobucket, and stuck it in standard HTML img tags. sorros, friends.
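If anyone ever wants the backup to be properly self-contained, Scrapy does ship a built-in images pipeline that downloads whatever image URLs you hand it and stores them locally. The settings below are real Scrapy ones, but none of this is wired into the current script; it is just a sketch of what rehosting would take:

```python
# settings.py -- enable Scrapy's built-in image downloading (needs Pillow installed)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'dist/images'

# The spider would then have to yield items with an 'image_urls' field holding the
# img src values it finds; the pipeline downloads those into IMAGES_STORE, and the
# generated HTML would still need rewriting to point at the local copies.
```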
|
|