|
Post by ch00beh on Apr 26, 2016 11:59:18 GMT -5
So I was bored last night and angry at the internet (photobucket specifically), so I decided to try and download our little wobsite in case things blow up again. I found a wob crawling framework called Scrapy, which has been annoying me to no end due to it having one p, but I ended up with a little script that can be pointed at a topic and will spit out a super simplified version to local disk. Here is the sample output of Dis
I've discovered quite a few things about proboards which lead me to believe that if it ever goes down, wayback won't have the pages properly indexed. So that's fun.
Anyway, the script is still a work in progress. I've found the secret incantation to actually find pages, but now I'm running into good ol' threading issues: order is not guaranteed in page reads, and the last page just goes ahead and overwrites everything prior. But there are some provisions for this in Scrapy, so I think I can work around that, and soon I'll be able to just point it at an entire board and have it dump everything.
I'm considering spinning up an AWS instance or something so I can have a robot automatically back everything up every day, but that's like a whole thing for maybe a week or two later.
If you are interested in looking at the shitty code or even contributing, I've backed up this backing up script on the githubs: github.com/jmaliksi/aesimplifier (current incantations for crawling pages can be found in the branch jm_crawls)
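In case anyone wants a rough picture of what the script does, here is a minimal sketch of a Scrapy spider that walks a topic's pages and dumps simplified HTML to disk. The spider name, URLs, CSS selectors, and file naming are all placeholders made up for illustration, not the actual exy.py code:

```python
# Minimal sketch: crawl a ProBoards topic page by page and dump simplified HTML.
# Spider name, URLs, and CSS selectors are made-up placeholders, not the real exy.py.
import scrapy


class TopicSpider(scrapy.Spider):
    name = 'topic_backup'
    start_urls = ['http://yourforum.proboards.com/thread/1234/some-topic']

    def parse(self, response):
        # Grab just the post bodies out of the page (selector is a guess at the markup).
        posts = response.css('article.post').extract()

        # Name the output file after the page so later pages don't overwrite earlier ones.
        page_slug = response.url.rstrip('/').split('/')[-1]
        with open('dist/%s.html' % page_slug, 'w') as f:
            f.write('\n'.join(posts))

        # Follow the "next page" link if there is one.
        next_page = response.css('a.next::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Writing one file per page like this is also one way around the last-page-overwrites-everything problem mentioned above.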
|
|
|
Post by ch00beh on Apr 27, 2016 0:39:14 GMT -5
|
|
|
Post by ch00beh on Apr 27, 2016 0:56:43 GMT -5
things left to do when i am next bored:
- Make the top level crawler actually crawl instead of explicitly listing out every page of every board
- Fuzzier topic name matching
- Generate links and index page for backed up things automatically instead of doing gnarly regexes
- Cron job and auto upload
- Command line mode
- Cold storage in S3?
- Proper HTML/CSS gen
- Filter out spoiler tags or make them functional
- Self serve wob form?
- Two phase distribution
- Topic organization
|
|
|
Post by ch00beh on Apr 28, 2016 3:47:13 GMT -5
ok so now i have templates and an index and shit. And having templates means fonts. All the above links are dead, go here instead: www.josephmaliksi.com/stuff/
I also set all spoiler tags to hidden, so now there might be some missing context, but oh well.
|
|
|
Post by ch00beh on Apr 28, 2016 10:48:53 GMT -5
jk about spoiler tags i just made them work
|
|
|
Post by Beelzebibble on Apr 28, 2016 11:03:15 GMT -5
Man this is awesome work. I didn't think images and stuff would be incorporated.
|
|
|
Post by ch00beh on Apr 28, 2016 11:04:27 GMT -5
it's literally copy and pasting the stuff that proboards generates. it would be harder for me to remove them than to keep them there.
|
|
|
Post by Beelzebibble on Apr 28, 2016 11:07:44 GMT -5
Then whatever, i don't think it's impressive at all, fine
|
|
|
Post by ch00beh on Apr 28, 2016 11:14:29 GMT -5
just be impressed that i got any css to work
|
|
|
Post by ch00beh on Apr 28, 2016 13:01:52 GMT -5
so in case anyone wants to run this thing, I can provide instructions for OSX. If you are using linux, you should already know this. If you are using windows, you're on your own.
For OSX, open up Terminal.app and make sure you have python installed by typing `python --version`. That should spit out something that isn't an error. I think it comes with OSX by default, but yeah, just make sure. Next, install pip and git. After that, type the following:
- mkdir -p ~/workspace
- cd ~/workspace
- git clone github.com/jmaliksi/aesimplifier.git
- cd aesimplifier
- pip install virtualenv
- virtualenv env
- source env/bin/activate
- pip install -r requirements.txt
- mkdir aesimplifier/dist
A lot of things happened, but don't worry, I probably did not hack you. After that, you'll want to open up the file aesimplifier/aesimplifier/spiders/exy.py with textedit or something. You'll see two giant lists: "start_urls" and "self.topics". Delete everything in the square brackets. In "start_urls" add the URL to the board your topic of interest resides in (make sure you surround it with quotes), and in "self.topics", add the exact name of the topic you want to back up. Now run
- cd aesimplifier
- scrapy crawl exy
I messed up some of the folder structure, but I didn't feel like fixing it, so that's why there are all these redundancies. But anyway, the crawler is now running, spitting out the post content of your topic. Give it a few since it's throttled to one page every two seconds to avoid being autobanned. Once it's finished, open up the dist folder and you should have some shiny new html files in there. Grats. I'll probably work on making topic definition slightly easier next, because it is annoying me as well.
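For reference, the edited bits of exy.py end up looking roughly like this; the board URL and topic title below are made-up examples, so swap in your own:

```python
# inside aesimplifier/aesimplifier/spiders/exy.py -- values here are made-up examples
start_urls = [
    'http://yourforum.proboards.com/board/2/general',  # the board your topic lives in
]

self.topics = [
    'Exact Topic Title As It Appears On The Board',  # must match the topic name exactly
]
```

The two-second throttle is presumably just Scrapy's standard DOWNLOAD_DELAY setting (or something equivalent), so a long topic genuinely will take a couple of minutes.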
|
|
|
Post by ch00beh on Apr 30, 2016 13:17:01 GMT -5
code has been updated to support workflows and to get rid of the dumb directory nesting. this is mostly for me so I can iterate on webpage generation faster, but if you are trying to run it, just run `fab generate` at the top level directory instead of running scrapy in the subdir
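For the curious, `fab generate` is just a Fabric task. A stripped-down fabfile.py along these lines would behave the same way; the task body below is a guess based on the old instructions, not the actual file:

```python
# fabfile.py -- minimal sketch using the Fabric 1.x API; the task body is a guess
from fabric.api import lcd, local, task


@task
def generate():
    """Run the crawler from the top-level directory so output lands in dist/."""
    with lcd('aesimplifier'):
        local('scrapy crawl exy')
```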
|
|
|
Post by ch00beh on Apr 30, 2016 15:44:09 GMT -5
just 4 u pohats, i made the braggadocio font work.
this means that the ish pages are the proud owners of 5 different fonts
|
|
|
Post by ch00beh on May 1, 2016 15:17:02 GMT -5
so i'm just about done building out the base feature set and probably won't do anything with AWS for a while.
if anyone has any features they would like, or wants to add topics to the backup list, just post here.
|
|
|
Post by Tout-Perd on May 1, 2016 16:10:00 GMT -5
Please enable jiggle physics for Kevin's posts. kthnxbye
|
|
|
Post by Loogs on May 5, 2016 0:16:52 GMT -5
|
|
|
Post by ch00beh on May 14, 2016 14:27:15 GMT -5
kingsmen and luncheon added to the site
|
|
|
Post by ch00beh on Jun 1, 2016 23:19:24 GMT -5
|
|
|
Post by ch00beh on Oct 1, 2016 7:34:25 GMT -5
Added another Luncheon, Sailing, Sextant, Give/Get, and some solo fics to the list. I should probably implement folders one day, but that sounds hard.
|
|
|
Post by ch00beh on Jul 21, 2017 9:28:53 GMT -5
so i realize now that pohatu was feigning amazement at the images being present in the tags because he may have thought the script downloaded them and rehosted them on my site. it did, in fact, do no such thing. it just copy/pasted the link, which may well be from photobucket, and stuck it in standard HTML img tags. sorros, friends.
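If anyone ever wants the backup to be properly self-contained, Scrapy does ship a built-in images pipeline that downloads whatever image URLs you hand it and stores them locally. The settings below are real Scrapy ones, but none of this is wired into the current script; it is just a sketch of what rehosting would take:

```python
# settings.py -- enable Scrapy's built-in image downloading (needs Pillow installed)
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'dist/images'

# The spider would then have to yield items with an 'image_urls' field holding the
# img src values it finds; the pipeline downloads those into IMAGES_STORE, and the
# generated HTML would still need rewriting to point at the local copies.
```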
|
|