Well, I'm not an expert; I'm just an ordinary person, a user of Wikipedia content, who has some hints to share.
Think about millions of "computers" around the world working together. It already exists; people just need to work together. How simple it is. This makes me think about the importance of the Open Source philosophy.
I didn't find many tools, and I didn't find any reason for such tools not to exist. Wikipedia gets its support from donations, so if people use the content without literally visiting the web pages in a browser, there won't be any problem, because Wikipedia is not based on advertisement.
So, where are those "nice" automated tools? hehe
Here is a link that points you to a page showing some tools for editing the content.
As you can see, all of the tools (at the moment I looked) were browser-based.
I guess our first research task is to find tools that are not browser-based for editing Wikipedia content (cross-references across the whole database, regex, etc.).
Yeah, they don't exist, I guess (2 minutes of typing into search engines ;), so the first thing, or my first thing to say, is: go learn some programming language and do the stuff yourself.
Ohh, sorry, I can't help much right now; I'm not so good with programming ;)
I think you can get much better than me if you dig into that world.
Wikipedia HTML dump (HTML is easier at first)
Wikipedia XML dump (XML for someone more skilled)
More tools to help convert or extract information from the Wikipedia dumps.
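To give an idea, grabbing a dump looks something like this (a minimal sketch; the exact file names and paths change over time, so check https://dumps.wikimedia.org/ first):
$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2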
I did something really ugly, but it works; I did it with bash and one other programming language ;)
1. Before anything, extract the contents...
$ man "what i want see"
keywords: cpio, tar, bzip2, gzip, unzip, lzma, 7z, ...
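For example (a minimal sketch; the archive names here are hypothetical, use whatever your dump is actually called):
# for a .7z archive:
$ 7z x wikipedia-en-html.7z
# for a .tar.bz2 archive:
$ tar -xjf wikipedia-en-pages.tar.bz2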
2. Filter the data; the Wikipedia dumps come with some things that we don't need, like the user discussions ... -.-
For example, if you extract an HTML dump (that's what I did; I'm too much of a noob for anything XML'ed ;)
$ find wikipedia-lang -type f -name '*Discussion*~*' -exec rm -f '{}' ';'
Now be patient; this takes some time hehe
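By the way, if your find is GNU find, -delete does the same job without spawning one rm process per file, which should be a bit faster (same result, just a sketch):
$ find wikipedia-lang -type f -name '*Discussion*~*' -delete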
3. Ohh, is it possible to put all the files in one directory?
$ find wikipedia-lang -type f | sed -r 's,^.+/,,g' | sort | uniq -d -c
Type this; did anything show up on your screen?
If yes, sorry: if you put all those files in the same directory there will be a conflict, because the file names are not unique. The number preceding each file name is how many times that name is repeated.
Here, with my dump, it didn't happen, so I could copy everything to one directory.
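If it does happen to you, one hypothetical workaround is to prefix each file with its parent path, so every name becomes unique (a rough sketch; it assumes no newlines or leading spaces in the file names):
$ mkdir -p mydirectory
$ find wikipedia-lang -type f | while read -r f; do
>     mv "$f" "mydirectory/$(dirname "$f" | tr '/' '_')__$(basename "$f")"
> done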
4. Copy everything to one directory
$ mkdir mydirectory && find wikipedia-lang -type f -exec mv '{}' mydirectory ';'
Now be patient one more time ;]
5. Make a list of those files and store it in one file
$ ls -1 mydirectory > articles.list
The -1 makes ls print one bare file name per line, nothing else; remember, the less there is to parse, the faster everything gets.
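A quick sanity check on the result, just to see how many articles you have:
$ wc -l articles.list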
6. Make a script in any language; even the bash scripting language can do that job.
Here I'll just say:
$ man bash
Literally type the command as shown.
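To give you an idea, here is a minimal sketch of such a script; the name search.sh, the output file matches.list, and the whole layout are just the ones from the steps above, so adapt to taste:
#!/bin/bash
# search.sh: print the name of every article whose file matches a regex.
# Reading articles.list line by line avoids hitting ARG_MAX, which a
# single grep over millions of files would do.
pattern="$1"    # the regex to look for, passed as the first argument
while read -r article; do
    if grep -q -E -- "$pattern" "mydirectory/$article"; then
        echo "$article"
    fi
done < articles.list > matches.list
Run it like this, and afterwards matches.list holds the articles that mention the pattern:
$ bash search.sh 'GNU/Linux'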
7. You've got past the essentials; now you can apply more filters, each time getting closer to what you really want. With that, you enhance the search process a lot.
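For instance, a hypothetical second pass that narrows matches.list down to the articles that also mention some other term:
$ while read -r a; do grep -q 'second term' "mydirectory/$a" && echo "$a"; done < matches.list > narrowed.list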
8. And you keep following your path ...