urllib2 - The Missing Manual

Introduction

urllib2 is a Python module for fetching URLs. It offers a very simple interface, in the form of the urlopen function. This module can fetch URLs using a variety of protocols. It also offers a slightly more complex interface for handling common situations such as authentication, cookies, and proxies. These are provided by objects called handlers and openers.

For simple situations, urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases, you will need some understanding of the HTTP protocol. The best documentation for HTTP is RFC 2616. It is a technical document and is not intended to be easy to read. The aim of this tutorial is to document urllib2, with enough detail about HTTP to help you along. It is not intended to replace the urllib2 docs [1], but to complement them.
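As a minimal sketch of the two interfaces described above, the snippet below assumes Python 3, where the urllib2 functionality lives in urllib.request. The data: URL and the User-Agent string are placeholders chosen so the example runs without network access; with a real site you would pass an http:// address instead.

```python
# Assumption: Python 3, where urllib2 was folded into urllib.request.
import urllib.request

# Placeholder URL: a data: URL needs no network, unlike a real http:// address.
URL = "data:text/plain;charset=utf-8,hello%20urllib"

# The simple interface: urlopen returns a file-like response object.
with urllib.request.urlopen(URL) as response:
    body = response.read().decode("utf-8")

# The more flexible interface: an opener built from handlers.
# build_opener() with no arguments installs the default handlers.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "example-client/0.1")]  # hypothetical agent name
with opener.open(URL) as response:
    opener_body = response.read().decode("utf-8")

print(body)
print(opener_body)
```

Extra handlers (for cookies, proxies, or authentication) can be passed to build_opener() as arguments; the opener then applies them to every request it opens.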

Mechanize

mechanize

Stateful programmatic web browsing in Python, after Andy Lester’s Perl module WWW::Mechanize.

- mechanize.Browser and mechanize.UserAgentBase implement the interface of urllib2.OpenerDirector, so:
  - any URL can be opened, not just http:
  - mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
- Easy HTML form filling.
- Convenient link parsing and following.
- Browser history (.back() and .reload() methods).
- The Referer HTTP header is added properly (optional).
- Automatic observance of robots.txt.
- Automatic handling of HTTP-Equiv and Refresh.

Spidering hacks - Google Books

Spidering hacks

By Kevin Hemenway, Tara Calishain

Data scraping - Wikipedia, the free encyclopedia

Data scraping

From Wikipedia, the free encyclopedia

Screen scraping

That is, I give it a value and it returns a list of the hits. Each hit is a 3-tuple of the weight (as a float), the simple name for normal ASCII and the name for HTML.
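A rough sketch of a function with that return shape is shown here. The table markup, the regular expressions and the sample values are all hypothetical, since the real page layout is not reproduced in this text; only the 3-tuple of (weight, plain name, HTML name) comes from the description above.

```python
import re
from html import unescape

# Hypothetical markup: one table row per hit, with the weight in the first
# cell and the HTML-formatted name in the second cell.
ROW_RE = re.compile(r"<tr><td>([\d.]+)</td><td>(.*?)</td></tr>", re.S)
TAG_RE = re.compile(r"<[^>]+>")

def parse_hits(html):
    """Return a list of (weight, plain_name, html_name) tuples."""
    hits = []
    for weight, html_name in ROW_RE.findall(html):
        # The plain ASCII name is the HTML name with its tags stripped.
        plain_name = unescape(TAG_RE.sub("", html_name))
        hits.append((float(weight), plain_name, html_name))
    return hits

# Made-up sample page used only to exercise the parser.
sample = (
    "<table>"
    "<tr><td>30.07</td><td>C<sub>2</sub>H<sub>6</sub></td></tr>"
    "<tr><td>30.03</td><td>CH<sub>2</sub>O</td></tr>"
    "</table>"
)
hits = parse_hits(sample)
no_hits = parse_hits("<p>No results</p>")
```

A page with no matching rows yields an empty list, which is a convenient signal for "no results".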

I split it over two lines to make it shorter for the screen.

The data I want starts further on down. I just need to make my function create the right URL query string. That's a simple string substitution. I can run this and see the raw HTML printed to the screen.
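The string substitution might look something like the sketch below. The base URL and the parameter name "value" are placeholders, not the real service's address or field name, and quote_plus is added here as a precaution so that unusual characters are escaped.

```python
from urllib.parse import quote_plus

# Placeholder endpoint; the real search URL is not given in this text.
BASE_URL = "http://example.com/search"

def make_query_url(value):
    """Build the search URL for one query value by string substitution."""
    # "value" is a hypothetical parameter name.
    return "%s?value=%s" % (BASE_URL, quote_plus(str(value)))

url = make_query_url(30.05)
```

The resulting URL can then be handed to urlopen to fetch the raw HTML.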

Not a very elegant parser but it works. Here's the output

With some experimentation and testing here's the BeautifulSoup version of the parser

and using the function above I get an empty list. That's what I wanted.

followed by some additional information. It looks like when there's only one compound then the server shows more data and in a different format. The relevant HTML for the parsing is

I can write a parser for this case, I just need to know when to use which one. After looking at the HTML for a bit, if the field has an in it then it's the detailed information for a single compound. Otherwise it's a list of results or an error message saying there were no results in that range. Not the most satisfying of solutions but that's typical when screen scraping.

GOCR

open-source character recognition

Testbed for Information Extraction from Deep Web

1. Introduction
Search results generated by searchable databases are served dynamically and are far larger than the static documents on the Web. These result pages have been referred to as the Deep Web. We propose a testbed for information extraction from search results. We chose 100 databases randomly from 114540 pages with forms, so these databases have a good variety. We selected 51 databases which include URLs in their results pages and manually identified the target information to be extracted. We also suggest evaluation measures for comparing extraction methods and methods for extending the target data.

2. Download
3.
4. Evaluation measure
Mail: daisen(at)matu.cc.kyushu-u.ac.jp

Hakozaki 6-10-1, Higasi-ku, Fukuoka 812-8581, Japan
Tel: +81-92-642-2296
Fax: +81-92-642-2294

Data Extraction, Web Screen Scraping Tool, Mozenda Scraper

Get data and images
from any web page.

- data extraction
- screen scraping
- web harvesting
- web crawling

No matter what you call it, Mozenda's simple point-and-click system gives you the freedom to gather data from the web like never before.

ClearForest Solutions

CLEARFOREST SOLUTIONS

Our OneCalais solutions use Natural Language Processing (NLP), text analytics and data mining technologies to derive meaning from unstructured information, including news articles, blog posts, research reports and more.

Web Information Retrieval / Natural Language Processing Group (WING) - Projects

Friday, 25 June 2010 14:07

If you're looking for a brief introduction to research in the WING group, check out these presentation slides, prepared by Min in April 2004. There is also a more recent set of slides from the presentation on 13 May 2005.

If you are a student looking for a research project for your graduate studies (GP/MSc), Honors Year (HYP) or undergraduate research opportunity program (UROP) or considering a summer internship, we advise you to look at the slides above, read the following notes and visit our group's open project listings.

Web Query Analysis

Project Duration: Continuing.

Web queries are often dense and short, but they often have distinct purposes. In our work, we examine how to automatically classify web queries using only the simple, lightweight data of query logs and search results. In comparison, most existing automatic methods integrate rich data sources, such as user sessions and click-through data. We believe there is more untapped potential for analyzing and typing queries based on deeper analysis of these simple sources.

Project Staff

Min-Yen Kan, Project Lead
Viet Bang Nguyen, Macro and Microscopic Query Analysis for Web Queries (Spring 2006)
Hoang Minh Trinh, Implementing Query Classification (Fall 2007)

LyricAlly: Lyric Alignment

Project Duration: Continuing.
Joint work with Wang Ye and Haizhou Li.

Popular music is often characterized by sung lyrics and regular, repetitive structure. We examine how to capitalize on these characteristics along with constraints from music knowledge to find a suitable alignment of the text lyrics with the acoustic musical signal. Our previous work showed a proof of concept of aligning lyrics to popular music using a hierarchical, musically-informed approach, without the use of noisy results from speech recognition. Later work tried to improve alignment to the per-syllable level using an adapted speech recognition model, initially trained on newswire broadcasts.

However, these approaches are slow, require offline computation, and cannot be run in real time. In recent work, we have been examining whether we can do away with intense computation by using self-similarity to align the lyrics and music directly without explicit multimodal processing.

Project Deliverables (excluding publications)

Minh Thang Luong's RepLyal lyric alignment demo / home page

Project Staff

Min-Yen Kan, Project Co-Lead
Denny Iskandar, alumnus
Minh Thang Luong, Using Self-Similarity in Lyric Alignment for Popular Music (Spring 2007)

Automatic Text Summarization

Project Duration: Continuing.
Joint work with Wee Sun Lee and Hwee Tou Ng.

We examine graph-based methods for text summarization, with respect to the graph construction and representation of (multidocument) texts and graphical decomposition methods leading to summaries. Unlike previous approaches to graph-based summarization, we devise a graph-based approach that creates the graph with a simple model of how an author produces text and a reader consumes it. We are currently applying this work to blog summarization.

Project Staff

Min-Yen Kan, Project Lead
Ziheng Lin, Automatic Text Summarization using a Lead Classifier (Spring 2005 and Spring 2006)
Xuan Wang, Blog Summarization (Fall 2007)

Advanced OPACs

Project Duration: 4 years, ending July 2007. Continuing work without formal funding. Joint work with Danny C. C. Poo

When library patrons use an online public access catalog (OPAC), they tend to type very few query words. Although it has been observed that patrons often have specific information needs, current search engine usability encourages users to underspecify their queries. With an OPAC's fast response times and the difficulty of using more advanced query operators, users are pushed towards a probe-and-browse mode of information seeking. Additionally, patrons have adapted (or been forced to adapt) to OPACs and give keywords as their queries, rather than more precise queries. As a consequence, the search results often only approximate the patron's information need, missing crucial resources that may have been phrased differently or offering search results that may be phrased exactly as wanted but which only address the patron's information need tangentially.

One solution is to teach the library patron to use advanced query syntax and to formulate more precise queries. However, this solution is both labor-intensive for library staff and time-intensive for patrons. Furthermore, different OPACs support different levels of advanced capabilities and often represent these operators with different syntax. An alternative solution that we propose is the use of advanced query analysis and query expansion. Rather than change the behavior of the patron, a system can analyze their keyword queries to infer more precise queries that use advanced operators when appropriate. To make these inferences, the system will rely not only on logic but will also dynamically access both a) historical query logs and b) the library catalog to assess the feasibility of its suggestions.

The proposed research will target three different types of query rewriting/expansion: 1) correction of misspellings, 2) inferring the relationships between a query's nouns and noun phrases, and 3) inferring intended advanced query syntax. The realization of these techniques will allow patrons to continue using OPACs by issuing simple keyword searches while benefiting from more precise querying and alternative search suggestions that would originate from the implemented system.
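To illustrate the first type of rewriting (correction of misspellings), here is a minimal sketch that checks each query word against a vocabulary drawn from the catalog. It uses Python's difflib string similarity as a stand-in technique; it is not the project's actual method, and the vocabulary and cutoff value are made-up assumptions.

```python
import difflib

# Hypothetical vocabulary; a real system would draw it from catalog titles
# and historical query logs.
CATALOG_VOCABULARY = ["python", "programming", "library", "catalog"]

def correct_query(query, vocabulary=CATALOG_VOCABULARY, cutoff=0.75):
    """Replace each query word with its closest vocabulary word, if any."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # get_close_matches returns candidates ordered best match first.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

suggestion = correct_query("pyhton programing")
print(suggestion)
```

Words with no sufficiently close match are left unchanged, so the suggestion can be offered alongside the original query rather than replacing it.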

In our current work we have examined how to re-engineer and design the User Interface to better support the actual information seeking methods used by library patrons. Jesse has re-engineered the work and has incorporated our own NUS OPAC as well as Colorado State University's OPAC results into his tabbed, overview+details framework. If you have a library catalog with MARC21 results that can be exported, we can wrap our UI around your database. Contact us for more information.

Project Deliverables (excluding publications)

Jesse Prabawa's OPAC interface for [ NUS ] [ Colorado State University ]
Long Qiu's report on the Namekeeper (author spelling correction) system.
List of misspelled titles in the LINC OPAC catalog system (over 1.2K misspellings), reported to Libraries
Prototype spelling correction and morphology system (linc2.cgi, and its past incarnation mirror.cgi)
Notes/Slides from past meetings with NUS Library staff:
[ 16 June 2004 ] [ 11 May 2005] [ 26 Aug 2005]

Project Staff

Min-Yen Kan, Project lead
Jin Zhao, Query Analysis and typing (Spring 2005)
Malcolm Lee, OPAC UI (Spring 2005)
Tranh Son Ngo, alumnus, Systems programmer
Kalpana Kumar, Ranked spelling correction and Advanced Query Analysis (Spring 2004 and Spring 2006)
Jesse Gozali Prabawa, An AJAX interface for the LINC system (Spring 2006)
Siru Tan, alumnus, Morphology (Spring 2004)
Meichan Ng, alumnus, Phrase structure (Spring 2004)
Roopak Selvanathan, alumnus, programmer
Long Qiu, alumnus, author spelling correction

Scenario Template Generation

Project Duration: Continuing.

A Scenario Template is a data structure that reflects the salient aspects shared by a set of events, which are similar enough to be considered as belonging to the same scenario. The salient aspects are typically the scenario's characteristic actions, the entities involved in these actions and the related attributes. Such a scenario template, once populated with respect to a particular event, serves as a concise overview of the event. It also provides valuable information for applications such as information extraction (IE), text summarization, etc.

Manually defining a scenario template is expensive. In this project, we aim to automate the template generation process. Sentences from different event reports are broken down into predicate-argument tuples, which are clustered semantically. Salient aspects are then generalized from each of the big clusters. For this purpose, the features we investigate include word similarity, context similarity, etc. The resulting scenario template is not only a structured collection of salient aspects, as a manual template is, but also an information source that other NLP systems can refer to for how these salient aspects are realized in news reports.

Stay tuned for a corpus release of newswire articles that Long has compiled for use in the Scenario Template tasks.

Project Staff

Min-Yen Kan, Project lead
Long Qiu (Spring 2003)

Web-based disambiguation of digital library metadata

Project Duration: Continuing.
Joint work with A/P Dongwon Lee at Pennsylvania State University.

As digital libraries grow in size, the quality of the digital library metadata records becomes an issue. Data entry mistakes, string representation differences, ambiguous names, missing field data, repeated entries and other factors contribute to errors and inconsistencies in the metadata records. Noisy metadata records make searching difficult, may result in certain information not being found at all, cause under-counts or over-counts that distort aggregate statistics, and decrease the utility of digital libraries in general.

In this project, we concentrate on using the Web to aid the disambiguation of the metadata records. This is because the metadata records themselves sometimes contain insufficient information, or the required knowledge is very difficult to mine. However, the richness of the Web, which represents the collective knowledge of the human population, often provides the answer instantly when suitable queries are presented to a search engine. As search engine calls and web page downloads are time-consuming, any Web-based disambiguation algorithm must be able to scale up to a large number of records.

Project Staff

Min-Yen Kan, Project lead
Yee Fan Tan (Fall 2005)

Phrase Based Statistical Machine Translation

Project Duration: Continuing.
Joint work with Haizhou Li.

Due to the nature of the problem, machine translation provides an interesting playground for the implementation of statistical approaches. The problems in machine translation stem from ambiguity at several levels, from the surface level to the semantic level, each of which poses a great challenge even in isolation. In this project, our pursuit is to advance the performance of the reordering model. The reordering model attempts to restructure the lexically translated sentence into the correct ordering for the target language. In particular, we examine reordering centered around function words. This is motivated by the observation that phrases around function words are often incorrectly reordered. By modelling reordering patterns around function words, we hope to capture prominent reordering patterns in both the source and target languages.

Project Staff

Min-Yen Kan, project lead
Hendra Setiawan

Focused crawling

Project Duration: Continuing.

Web crawling algorithms have now been devised for topic-specific resources, an approach known as focused crawling. We examine the specialized crawling of structurally-similar resources that are used as input to other projects. We examine how to devise trainable crawling algorithms such that they "sip" the minimal amount of bandwidth and web pages from a site by using context graphs, negative information, web page layout, and URL structure as evidence.

To motivate the crawling algorithm design, we concentrate on the collection of four real-world problems: topical page collection, the collection of song lyrics, scientific publications and geographical map images.
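As a toy illustration of using URL structure and negative information as evidence, the sketch below scores candidate URLs by the tokens in their path and keeps only the best few within a fixed download budget. It is not the group's algorithm: the token weights are hand-picked assumptions for a lyrics crawl, whereas a trainable crawler would learn them from labeled pages.

```python
import heapq
import re
from urllib.parse import urlparse

# Hypothetical, hand-picked weights; a trained crawler would learn these.
POSITIVE_TOKENS = {"lyrics": 2.0, "song": 1.0, "artist": 1.0}
NEGATIVE_TOKENS = {"login": -2.0, "forum": -1.0, "ads": -1.0}

def url_score(url):
    """Sum the evidence weights of the tokens in the URL path and query."""
    parsed = urlparse(url)
    text = (parsed.path + "?" + parsed.query).lower()
    tokens = [t for t in re.split(r"[/._\-?=&]+", text) if t]
    return sum(POSITIVE_TOKENS.get(t, 0.0) + NEGATIVE_TOKENS.get(t, 0.0)
               for t in tokens)

def prioritize(urls, budget):
    """Return at most `budget` URLs with a positive score, best first."""
    promising = [u for u in urls if url_score(u) > 0]
    return heapq.nlargest(budget, promising, key=url_score)

candidates = [
    "http://example.com/forum/login",
    "http://example.com/song/123",
    "http://example.com/about",
    "http://example.com/lyrics/artist/song.html",
]
to_fetch = prioritize(candidates, budget=2)
```

Limiting the fetch list to a small budget is what keeps the crawl to a "sip" of the site rather than a full download.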

Project Deliverables (excluding publications)

Maptlas: A collection of map images culled from the web.

Project Staff

Min-Yen Kan, Project lead
Abhishek Arora: Map Spidering and Browsing User Interface (Summer 2005)
Hoang Oanh Nguyen Thi, alumnus, Publication Spiderer (Spring 2004)
Litan Wang, alumnus, music lyrics spider (Fall 2004)
Vasesht Rao, alumnus, Map tiling and spidering (Fall 2004)
Fei Wang, Non-photograph image categorization (Fall 2004 and Spring 2005)
Xuan Wang, Augmenting Focused Crawling using Search Engine Queries (Spring 2006)

Domain-Specific Research Digital Libraries

Project Duration: Continuing.

In this project, we construct software that will impose a generic, shareable, publishable and searchable framework for organizing scientific publications, similar to Cora and CiteSeer. Our work attempts to enable researchers to share annotations, search by fields and add new fields and organization as appropriate, as well as publish annotations. In our projects we are examining DLs for mathematics as well as for coordinating multimedia.

For mathematics, despite the enormous success of common search engines for general search, their performance on domain-specific search is often compromised by a lack of knowledge of (and hence support for) the entities and the users in the domain. In our project, we choose to tackle this problem in the domain of mathematics. Our ultimate goal is to build a system which is able to 1) automatically index and categorize math materials from multiple online resources, and 2) understand the intents and needs of the users and present the results in an accurate and organized manner.

Project Deliverables (excluding publications)

dAnth: digital anthologies mailing list - a clearinghouse for researchers dealing with text conversion, citation processing and other scaling issues in digital libraries.
ForeCite: Web 2.0 based citation manager a la Citeulike, Citeseer, Rexa.
SlashDoc: A Ruby on Rails new media knowledge base for research groups.
SlideSeer: A digital library of aligned slides and papers.
ParsCit: citation parsing using maximum entropy and global repairs.

Project Staff

Min-Yen Kan, Project lead
Dang Dinh Trung: TiddlWiki for scholarly digital libraries (Spring 2007)
Ezekiel Eugene Ephraim: Presentation summarization, alignment and generation (Spring 2005)
Guo Min Liew, Visual Slide Analysis (Spring 2007)
Jesse Gozali Prabawa, ForeCite integration lead (summer project, Summer 2007)
Hoang Oanh Nguyen Thi, alumnus, Publication Spiderer (Spring 2004)
Yong Kiat Ng, Maximum Entropy Citation Parsing with Repairs (Spring 2004)
Emma Thuy Dung Nguyen, Automatic keyword generation for academic publications (Spring 2006)
Thien An Vo, Support for annotation of scientific papers (Spring 2004)
Tinh Ky Vu, Public-domain research corpora gatherer (Fall 2004) and Academic Research Repository (Spring 2005)
Yue Wang, Presentation summarization, alignment and generation (Spring 2006)
Jin Zhao, Math IR and SlashDoc integrator (Fall 2007)

Lightweight NLP

Project Duration: Continuing.
Joint work with Dr. Samarjit Chakraborty.

For embedded systems with constrained power and CPU resources, how should NLP and other machine learning tasks be done? We investigate how different combinations of features and learners can affect machine-learned NLP tasks on embedded devices with respect to time, power and accuracy.

Project Staff

Min-Yen Kan, Project lead
Ziheng Lin, Summer Project (Summer 2007)

PARCELS: Web page division and classification

Project Duration: December 2003 - July 2005. Completed.

Web documents that look similar often use different HTML tags to achieve their layout effect. These tags often make it difficult for a machine to find text or images of interest.

Parcels is a backend system [Java] designed to distinguish different components of a web site and parse it into a logical structure. This logical structure is independent of the design/style of any website. The system is implemented using a co-training framework between two independent views: a lexical module and a stylistic module.

Each component in the structure will be given a tag relevant to the domain it is classified under.

Project Deliverables (excluding publications)

PARCELS toolkit, hosted on sourceforge.net.
Similar document similarity (integrated within the PARCELS toolkit).

Project Staff

Min-Yen Kan, Project lead
Chee How Lee, alumnus, programmer
Aik Miang Lau, alumnus, Advancing PARCELS (Fall 2004)
Sandra Lai, alumnus, PARCELS Web logical structure parser (Fall 2003)

Metadata-based webpage summarization

Project Duration: June 2003 - December 2004.

Search engines currently report the top n documents that seem most relevant to a user's query. We investigate how to restructure this ranked list into a more meaningful natural language summary. Rather than just focus on the content of the actual webpages, we examine how metadata can be used to create useful summaries for researchers.

Reorganized Roman keypads

scrubber / scrubyt_examples

Web scraping
