Archive for the ‘Uncategorized’ Category

When do you know…

February 25, 2009

… that you have started working.

It’s when you realize it’s a wednesday night 9 o clock and you din’t look at the XKCD post yet.

Advertisements

Python toys – I

February 23, 2009

This post is the first part of a series of n posts of various toy projects in Python. Python is a very high language and this makes all your work pretty much easy, so that even Tom, Dick and Balaji can write a search engine.

The source code of indexer is as below.  Have a bigger picture.  The explanation for each line of source follows.

#! /usr/sfw/bin/python
import urllib
import re
import pickle
worddict={}
page = "http://www.xkcd.com" #startpage
stack = []
stack.append(page)
visit = {} # keeps track of pages that we visited, to avoid loops
stopvar = 5000

while(stopvar >= 0):
  stopvar-=1
  print stopvar
  cpage = stack.pop(0)
  try:
    f = urllib.urlopen(cpage)
    html = f.read()
  except:
    print cpage,"Unexpected Error"
    continue

  sp = "a href=\""
  htmlvisibletext = re.sub('<.*?>','',html)
  htmlvisibletext = re.sub('[^a-zA-z ]','',htmlvisibletext)
  words=htmlvisibletext.split()
  for word in words:
    word=word.lower()
    if(word not in worddict):
      dict={cpage:1}
      worddict[word]=dict
    else:
      dict=worddict[word];
  for i in range(len(html)):

 

    if(sp == html[i:i+len(sp)]):
      url = ""
      i+=len(sp)
      while(html[i] != "\""):
        url+=html[i]
        i+=1
      if(url[0:4] == "http"):
        if(visit.has_key(url) == False):
          visit[url]=1
          stack.append(url)
for key in worddict:
  items = [(v, k) for k, v in worddict[key].items()]
  items.sort()
  items.reverse()
  worddict[key] = [(k, v) for v, k in items]
dumpfile = open('dictdump','wb')
pickle.dump(worddict,dumpfile)

 

Here We go..

#! /usr/sfw/bin/python

This is the first line similar to most other scripting languages, where you specify the interpreter which will interpret the following script.

import urllib
import re
import pickle

Python provides several modules which we can use. We use here three of those modules

1. urllib – high level interface for fetching data across world wide web

2. re – python’s regular expression module. If you are familiar with Perl’s regex module then this is not much different.

3. pickle – module which provides methods to serialize and store python objects which help in data persistence

 worddict={}

A dictionary data structure in python is used to store pairs of key and an arbitrary object corresponding to the key. worddict is a dictionary we will used to store words and the dictionaries corresponding to the words. Each dictionary corresponding to the word will contain urls in which the word was present and the frequency.

page = "http://www.xkcd.com" #startpage

This is the page where we start our graph traversal.

stack = []

The stack we are using is the python’s list data structure. We initialise to an empty list. This stack will maintain the URLs that we come across in each page.

stack.append(page)

The first page is added to the empty stack.

visit = {} # keeps track of pages that we visited, to avoid loops
stopvar = 5000

This is a limit on the number of pages we want to index. Lets start with a small number 5000. Of course 5000 is a damn small number in this world wide web.

stopvar-=1 #decrement the value
print stopvar
cpage = stack.pop(0)

stack.pop() can be used to pop values. stack.pop() pops the last element appended. Stack.pop(0) pops(??) the 1st element in the list. We are taking the first element in the list. So this is Basically a DFS.

 try:
    f = urllib.urlopen(cpage)
    html = f.read()

The try-except construct is used to handle exceptions. Any exceptions in try block are thrown to the except block and are handled there.

except:
print cpage,"Unexpected Error"
continue
The exception is handled with the just the continue command.
sp = "a href=\""

“a href = “”  is the pattern we will search in the html page to capture urls. 

htmlvisibletext = re.sub('<.*?>','',html)
htmlvisibletext = re.sub('[^a-zA-z ]','',htmlvisibletext)
words=htmlvisibletext.split()
for word in words:
word=word.lower()

We convert all words to lower case to provide for case-insensitivity.

 if(word not in worddict):
      dict={cpage:1}
      worddict[word]=dict
If the word is not already present in the worddict dictionary, we create a dictionary dict with contents current url and frequency 1. The word and the dictionary dict are then inserted into the worddict.
else:
  
   dict=worddict[word];
In case the word is already there we obtain the dictionary by looking up for the word.
      
     if(cpage not in dict):
        dict[cpage]=1
      else:
        dict[cpage]=dict[cpage]+1
Now we check if the url is present in the dictionary for that particular word. Suppose the word occurs for the first time in a url, cpage will not be present in dict. so url and corresponding frequency 1 are added to the dictionary.
If the word was already present in the url and is repeated, then the frequency is incremented by 1.
 for i in range(len(html)):
    
    if(sp == html[i:i+len(sp)]):
      url = “”
      i+=len(sp)
the html is a string which contains the whole html file corresponding to the url. Now we scan it to find <a href=. If we come across <a href = “, we store the consequent string in url until we come across “.
      while(html[i] != “\””):
        url+=html[i]
        i+=1
      if(url[0:4] == “http”):
        if(visit.has_key(url) == False):
          visit[url]=1
          stack.append(url)
if the url starts with http, we check if the visit dictionary, which holds all the urls that we ve visited. If we have not visited, we add the url to our dictionary and append it to our stack.
for key in worddict:
  
  items = [(v, k) for k, v in worddict[key].items()]
  
  items.sort()
  
  items.reverse() # so that highest value is first
  
  worddict[key] = [(k, v) for v, k in items]
The above is the general mechanism used to sort a dictionary by key. We sort the dictionary corresponding to each word so that the url with the highest frequency comes to the top.
  dumpfile = open(‘dictdump’,’wb’)
I open a file named ‘dumpfile’ in the current directory for writing. We will dump our index into this file.
  pickle.dump(worddict,dumpfile)
Here we use the dump function provided by the pickle module to dump the object into the file.
Now that we have dumped it, Should we do wat we started of promising to. Lets now write the script to search the word from the dump. 
#! /usr/sfw/bin/python
import pickle
dumpfile=open(‘dictdump’,’rb’)
worddict=pickle.load(dumpfile)
print ‘Press CTRL + C to kill’
while 1:
  word=raw_input(‘Enter word to search:’)
  word=word.lower()
  if(word not in worddict):
    print ‘Word not found’
  else:
    print worddict[word]
Now we ll look into it closely.
import pickle
We again import the pickle module to load the dumped dictionary
dumpfile=open(‘dictdump’,’rb’)
worddict=pickle.load(dumpfile)
Open the dumpfile in readmode and load the dictionary.
word=word.lower()
Convert the query to lower case. 
if(word not in worddict):
    print ‘Word not found’
  else:
    print worddict[word]
Print the dictionary corresponding to the word.

 

Pi

January 21, 2009

circle1

 

A simple program to computing Pi using the Monte Carlo method:

 

#include “stdlib.h”

#include “stdio.h”

#include “time.h”


int isinside(int , int);


int main()

{

        srand(time(NULL));

        double pi;

        int x,y;

        int i=0;

        int inside=0;

        int n=10000;


        while(i<n){

                x=rand()%11-5;

                y=rand()%11-5;

                if(isinside(x,y)==1)

                        inside++;

                i++;

        }

        printf(“%d\n”,inside);

        printf(“pi = %lf\n”,inside*4.0/n);

}


int isinside(int x,int y)

{

        if(x<0)

                x*=-1;

        if(y<0)

                y*=-1;

        if((x==5 && y==0) || (x==4 && y<4) || (x==3 && y<5) || (x==2 && y<5) || (x==1 && y<5) || x==0 )

                return 1;

        else

                return 0;

}


bramani@chassis01:/home/bramani/toyprojs> cc pi.c

bramani@chassis01:/home/bramani/toyprojs> a.out

6662

pi = 2.664800 [oops]

bramani@chassis01:/home/bramani/toyprojs> a.out

6812

pi = 2.724800 [aiyo]

 

These pi values are for r=5.

As r increases pi tends to 3.14. 

 

 

I loved this experiment and thought of sharing.
Inspired by Fooled by randomness ~ Nassim Nicholas Taleb.

I am. No, I am Not.

January 7, 2009

There was Schrodinger. There was no Schrodinger.

Pythoned

January 3, 2009

Was dabbling with python this lazy saturday afternoon.

Surprise.  I could write and run useful programs in no time at all.

I am Impressed.  This is very true.

 

PYTHON ROCKS !

A to Z

January 2, 2009

A to Z.

Happy New Year 2009

December 31, 2008

Hi all,

 

Wish you all a very happy new year 2009.

Let all you dreams come true this year.

Jumbo

December 30, 2008

The movie is like selling beer in a woodwards gripewater bottle.

The grownups don’t like the cover.

The children don’t like the taste.

None buys.

RegEx Lookup

December 14, 2008

Some quick reference for Regular Expressions:

 

/ / delimiters of the pattern

+ one or more of the preceding character

* zero or more of the preceding character

? zero or one occurence of the preceding character

[] match one of a group of alternatives

\ escape a special character

^ match at the beginning of the string

$ match at the end of the string

/^ $/ match entire string

\b match at word boundary

\B not at word boundary – opposite to \b

[^ ] excluding a set of alternatives

\d any digit

\D anything other than digit

\w any word character [_0-9a-zA-Z]

\W anything other than word character

\s whitespace

\S anything other than whitespace

. anything except newline

{n} n number of occurences of the preceding character

| OR operation

\1 \2 … saved patterns

 

PATTERN MATCHING OPTIONS:

 

/g match all possible patterns

/i ignore case

/m treat string as multiple lines

/o only evaluate once

/s treat string as single line

/x ignore white space in pattern

 

PATTERN SUBSTITUTION AND OPTIONS:

 

s/ substitution operator

 

/g change all occurences of the pattern

/i ignore case in pattern

/e evaluate replacement string as expression

/m treat string to be matched as multiple lines

/o evaluate only once

/s treat string to be matched as single line

/x ignore white space in pattern

 

TRANSLATION AND OPTIONS:

 

tr/ translation

 

/c complement

/d delete all specified characters

/s squeeze-replace multiple identical output characters with a single character

 

EXTENDED PATTERN MATCHING:

 

(?: ) do not save the pattern in memory

 

EMBEDDED OPTIONS:

 

/(?i) = /i

/(?m) = /m

/(?s) = /s

/(?x) = /x

 

LOOK AHEAD:

 

(?= ) positive look ahead

(?! ) negative look ahead

(?# ) comments within pattern

 

~There’s more than one way to do it~TMTOWTDI

. Connecting the Dots .

December 9, 2008

This may or may not be true.

Even if its a fiction, its a good one.

~Do No Evil