… that you have started working.
It’s when you realize it’s a wednesday night 9 o clock and you din’t look at the XKCD post yet.
… that you have started working.
It’s when you realize it’s a wednesday night 9 o clock and you din’t look at the XKCD post yet.
This post is the first part of a series of n posts of various toy projects in Python. Python is a very high language and this makes all your work pretty much easy, so that even Tom, Dick and Balaji can write a search engine.
The source code of indexer is as below. Have a bigger picture. The explanation for each line of source follows.
#! /usr/sfw/bin/python
import urllib
import re
import pickle
worddict={}
page = "http://www.xkcd.com" #startpage
stack = []
stack.append(page)
visit = {} # keeps track of pages that we visited, to avoid loops
stopvar = 5000
while(stopvar >= 0):
stopvar-=1
print stopvar
cpage = stack.pop(0)
try:
f = urllib.urlopen(cpage)
html = f.read()
except:
print cpage,"Unexpected Error"
continue
sp = "a href=\""
htmlvisibletext = re.sub('<.*?>','',html)
htmlvisibletext = re.sub('[^a-zA-z ]','',htmlvisibletext)
words=htmlvisibletext.split()
for word in words:
word=word.lower()
if(word not in worddict):
dict={cpage:1}
worddict[word]=dict
else:
dict=worddict[word];
for i in range(len(html)):
if(sp == html[i:i+len(sp)]):
url = ""
i+=len(sp)
while(html[i] != "\""):
url+=html[i]
i+=1
if(url[0:4] == "http"):
if(visit.has_key(url) == False):
visit[url]=1
stack.append(url)
for key in worddict:
items = [(v, k) for k, v in worddict[key].items()]
items.sort()
items.reverse()
worddict[key] = [(k, v) for v, k in items]
dumpfile = open('dictdump','wb')
pickle.dump(worddict,dumpfile)
Here We go..
#! /usr/sfw/bin/python
This is the first line similar to most other scripting languages, where you specify the interpreter which will interpret the following script.
import urllib
import re
import pickle
Python provides several modules which we can use. We use here three of those modules
1. urllib – high level interface for fetching data across world wide web
2. re – python’s regular expression module. If you are familiar with Perl’s regex module then this is not much different.
3. pickle – module which provides methods to serialize and store python objects which help in data persistence
worddict={}
A dictionary data structure in python is used to store pairs of key and an arbitrary object corresponding to the key. worddict is a dictionary we will used to store words and the dictionaries corresponding to the words. Each dictionary corresponding to the word will contain urls in which the word was present and the frequency.
page = "http://www.xkcd.com" #startpage
This is the page where we start our graph traversal.
stack = []
The stack we are using is the python’s list data structure. We initialise to an empty list. This stack will maintain the URLs that we come across in each page.
stack.append(page)
The first page is added to the empty stack.
visit = {} # keeps track of pages that we visited, to avoid loops
stopvar = 5000
This is a limit on the number of pages we want to index. Lets start with a small number 5000. Of course 5000 is a damn small number in this world wide web.
stopvar-=1 #decrement the value
print stopvar
cpage = stack.pop(0)
stack.pop() can be used to pop values. stack.pop() pops the last element appended. Stack.pop(0) pops(??) the 1st element in the list. We are taking the first element in the list. So this is Basically a DFS.
try:
f = urllib.urlopen(cpage)
html = f.read()
The try-except construct is used to handle exceptions. Any exceptions in try block are thrown to the except block and are handled there.
except: print cpage,"Unexpected Error"
continue
sp = "a href=\""
“a href = “” is the pattern we will search in the html page to capture urls.
htmlvisibletext = re.sub('<.*?>','',html)htmlvisibletext = re.sub('[^a-zA-z ]','',htmlvisibletext)words=htmlvisibletext.split()for word in words: word=word.lower()We convert all words to lower case to provide for case-insensitivity.
if(word not in worddict):dict={cpage:1}worddict[word]=dictIf the word is not already present in the worddict dictionary, we create a dictionary dict with contents current url and frequency 1. The word and the dictionary dict are then inserted into the worddict.else:dict=worddict[word];In case the word is already there we obtain the dictionary by looking up for the word.if(cpage not in dict):dict[cpage]=1else:dict[cpage]=dict[cpage]+1

A simple program to computing Pi using the Monte Carlo method:
#include “stdlib.h”
#include “stdio.h”
#include “time.h”
int isinside(int , int);
int main()
{
srand(time(NULL));
double pi;
int x,y;
int i=0;
int inside=0;
int n=10000;
while(i<n){
x=rand()%11-5;
y=rand()%11-5;
if(isinside(x,y)==1)
inside++;
i++;
}
printf(“%d\n”,inside);
printf(“pi = %lf\n”,inside*4.0/n);
}
int isinside(int x,int y)
{
if(x<0)
x*=-1;
if(y<0)
y*=-1;
if((x==5 && y==0) || (x==4 && y<4) || (x==3 && y<5) || (x==2 && y<5) || (x==1 && y<5) || x==0 )
return 1;
else
return 0;
}
bramani@chassis01:/home/bramani/toyprojs> cc pi.c
bramani@chassis01:/home/bramani/toyprojs> a.out
6662
pi = 2.664800 [oops]
bramani@chassis01:/home/bramani/toyprojs> a.out
6812
pi = 2.724800 [aiyo]
These pi values are for r=5.
As r increases pi tends to 3.14.
There was Schrodinger. There was no Schrodinger.
Was dabbling with python this lazy saturday afternoon.
Surprise. I could write and run useful programs in no time at all.
I am Impressed. This is very true.
PYTHON ROCKS !
Hi all,
Wish you all a very happy new year 2009.
Let all you dreams come true this year.
The movie is like selling beer in a woodwards gripewater bottle.
The grownups don’t like the cover.
The children don’t like the taste.
None buys.
Some quick reference for Regular Expressions:
/ / delimiters of the pattern
+ one or more of the preceding character
* zero or more of the preceding character
? zero or one occurence of the preceding character
[] match one of a group of alternatives
\ escape a special character
^ match at the beginning of the string
$ match at the end of the string
/^ $/ match entire string
\b match at word boundary
\B not at word boundary – opposite to \b
[^ ] excluding a set of alternatives
\d any digit
\D anything other than digit
\w any word character [_0-9a-zA-Z]
\W anything other than word character
\s whitespace
\S anything other than whitespace
. anything except newline
{n} n number of occurences of the preceding character
| OR operation
\1 \2 … saved patterns
PATTERN MATCHING OPTIONS:
/g match all possible patterns
/i ignore case
/m treat string as multiple lines
/o only evaluate once
/s treat string as single line
/x ignore white space in pattern
PATTERN SUBSTITUTION AND OPTIONS:
s/ substitution operator
/g change all occurences of the pattern
/i ignore case in pattern
/e evaluate replacement string as expression
/m treat string to be matched as multiple lines
/o evaluate only once
/s treat string to be matched as single line
/x ignore white space in pattern
TRANSLATION AND OPTIONS:
tr/ translation
/c complement
/d delete all specified characters
/s squeeze-replace multiple identical output characters with a single character
EXTENDED PATTERN MATCHING:
(?: ) do not save the pattern in memory
EMBEDDED OPTIONS:
/(?i) = /i
/(?m) = /m
/(?s) = /s
/(?x) = /x
LOOK AHEAD:
(?= ) positive look ahead
(?! ) negative look ahead
(?# ) comments within pattern
~There’s more than one way to do it~TMTOWTDI