The main program is pdfCrawl.py, written in Python 2.7.
Libraries used:
- sys
- BeautifulSoup
- urllib2
- urlparse (moved to urllib.parse in Python 3.x)
- httplib (used for connection refusal errors)
To install the dependencies, run:
$ pip install -r requirements.txt
Assignment #1 Due: 11:59pm Sept 26
- Correctly POST data to a form.
- Show that the HTML response that is returned is "correct"; that is, the server should take the arguments you POSTed and build a response accordingly.
- Save the HTML response to a file, view that file in a browser, and take a screenshot.
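A minimal Python sketch of the POST-and-save steps above. The form URI, the field names, and the helper names build_post and post_and_save are hypothetical; none of them come from this README. The try/except import is there so the same code loads on Python 2.7 and 3.x.

```python
try:  # Python 3.x
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7
    from urllib import urlencode
    from urllib2 import Request, urlopen


def build_post(form_url, fields):
    """Build a form-encoded POST request without sending it."""
    body = urlencode(fields).encode("ascii")
    # Supplying a request body is what makes urllib issue a POST.
    return Request(
        form_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def post_and_save(form_url, fields, out_path):
    """Send the POST and write the HTML response to out_path."""
    response = urlopen(build_post(form_url, fields))
    try:
        html = response.read()
    finally:
        response.close()
    with open(out_path, "wb") as handle:
        handle.write(html)
    return html
```

Calling post_and_save with a real form URI and its field names writes the response to disk, where it can be opened in a browser to check that the server echoed the POSTed arguments.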
The program:
- Takes a web page URI as a command-line argument.
- Extracts all the links from the page.
- Lists all the links that resolve to PDF files and prints the size in bytes of each. (Note: be sure to follow all the redirects until the link terminates with a "200 OK".)
Show that the program works on 3 different URIs, one of which needs to be: http://www.cs.odu.edu/~mln/teaching/cs532-s17/test/pdfs.html
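The steps above can be sketched as follows. This is not pdfCrawl.py itself: to stay self-contained it uses the standard-library HTMLParser in place of BeautifulSoup, and the names LinkCollector, extract_links, is_pdf, and pdf_report are hypothetical. It assumes a link counts as a PDF when the final response carries a Content-Type of application/pdf, and it relies on urlopen following redirects by default.

```python
try:  # Python 3.x
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen
except ImportError:  # Python 2.7
    from HTMLParser import HTMLParser
    from urlparse import urljoin
    from urllib2 import urlopen


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href=...> on a page."""

    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URI.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in html, as absolute URIs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def is_pdf(content_type):
    """True when a Content-Type header value names a PDF."""
    return content_type.split(";")[0].strip().lower() == "application/pdf"


def pdf_report(page_url):
    """Yield (final_uri, size_in_bytes) for each link that ends at a PDF."""
    page = urlopen(page_url)
    try:
        html = page.read().decode("utf-8", "replace")
        base = page.geturl()
    finally:
        page.close()
    for link in extract_links(html, base):
        try:
            response = urlopen(link)  # redirects are followed automatically
        except Exception:  # unreachable or unsupported link: skip it
            continue
        try:
            headers = response.info()
            if is_pdf(headers.get("Content-Type", "")):
                size = headers.get("Content-Length")
                if size is None:
                    # No Content-Length header: count the body instead.
                    size = len(response.read())
                yield response.geturl(), int(size)
        finally:
            response.close()
```

A command-line wrapper would pass sys.argv[1] to pdf_report and print each URI and size it yields.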
Consider the "bow-tie" structure of the web graph described in: http://www9.org/w9cdrom/160/160.html
Now consider the following graph:
A --> B
B --> C
C --> D
C --> A
C --> G
E --> F
G --> C
G --> H
I --> H
I --> K
L --> D
M --> A
M --> N
N --> D
O --> A
P --> G
For the above graph, give the values for:
IN:
SCC:
OUT:
Tendrils:
Tubes:
Disconnected:
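One way to check the six categories is to compute them from the edge list. The sketch below makes these assumptions, which are not stated in this README: SCC is the largest strongly connected component; IN and OUT are the nodes that can reach it or be reached from it; Tubes are remaining nodes that are both reachable from IN and able to reach OUT; Tendrils are every other node weakly connected to the SCC; Disconnected is everything else. The names reach and bow_tie are hypothetical.

```python
from collections import defaultdict

EDGES = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("C", "A"), ("C", "G"), ("E", "F"),
    ("G", "C"), ("G", "H"), ("I", "H"), ("I", "K"), ("L", "D"), ("M", "A"),
    ("M", "N"), ("N", "D"), ("O", "A"), ("P", "G"),
]


def reach(starts, adj):
    """Return every node reachable from starts (inclusive) through adj."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(edges):
    """Split a directed graph into bow-tie regions (see assumptions above)."""
    fwd, bwd, und = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
        und[src].add(dst)
        und[dst].add(src)
        nodes.update((src, dst))
    # A node's SCC is the set it can reach that can also reach it back.
    scc = max((reach([n], fwd) & reach([n], bwd) for n in sorted(nodes)),
              key=len)
    out_part = reach(scc, fwd) - scc
    in_part = reach(scc, bwd) - scc
    weak = reach(scc, und)  # weakly connected component around the SCC
    rest = weak - scc - in_part - out_part
    tubes = rest & reach(in_part, fwd) & reach(out_part, bwd)
    return {
        "IN": in_part,
        "SCC": scc,
        "OUT": out_part,
        "Tendrils": rest - tubes,
        "Tubes": tubes,
        "Disconnected": nodes - weak,
    }


if __name__ == "__main__":
    for name, members in bow_tie(EDGES).items():
        print("%s: %s" % (name, " ".join(sorted(members))))
```

The per-node reachability search is quadratic in the worst case, which is fine for a graph this small; a larger graph would call for Tarjan's or Kosaraju's algorithm to find the SCC.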