Language identification of written text is a well-studied research area for Latin-script languages. However, new challenges arise when it is applied to languages written in non-Latin scripts, particularly web pages in Asian languages. Web page classification raises further research challenges because of the noisy nature of the pages. English has undoubtedly been the predominant language of the World Wide Web since its inception, and so the Web's usefulness is confined to the community of people with a good grasp of English. Because of this linguistic barrier, the serviceability of the Internet has mainly benefited the highly educated. The solution to this problem is to provide web pages in regional languages. Our aim is to provide web pages in pairs: a Devanagari web page together with its English counterpart, where one exists. To provide parallel web pages in the native languages Hindi and Marathi on the fly, we require classification of web pages as Devanagari or English. We experimented on 500 English and Devanagari web pages and could label them correctly.
Keywords: Classification of Devanagari web pages, UTF-8 encoding.
1. Introduction
With the explosion of multilingual data on the Internet, the need and demand for an effective automated language identifier for web pages has further increased. Web search in Indian languages is constantly gaining importance. With the fast growth of Indian-language content on the web, many
The first version of the WWW (what most people call "the Web") provided a means for people around the world to exchange information, work together, communicate, and share documentation more efficiently. Tim Berners-Lee wrote the first browser (called the WWW browser) and web server in March 1991, allowing hypertext documents to be stored, fetched, and viewed. The Web can be seen as a tremendous document store from which these documents (web pages) can be fetched by typing their address into a web browser. To make this possible, two important techniques were developed. First, a language called Hypertext Markup Language (HTML) tells computers how to display documents containing text, photos, sound, video, animation, interactive
• Users can restrict their searches to content in 35 non-English languages, including Chinese, Greek, Icelandic, Hebrew, Hungarian, and Estonian.
English is today the most commonly used global language of commerce and the main language of international diplomacy. And, perhaps most importantly, it is the most common language on the Internet. Also, the language that
Due to the rapid advancement of information technology, the World Wide Web (WWW) has become a multifunctional tool. People can get many things done through the Internet: chatting with friends through MSN, shopping on Amazon.com, settling credit card bills, making new friends on Facebook, reading the newspaper on appledaily.com, and so on. Besides, when we want to search for information, we can simply "Google" it, and we get what we want. There is no doubt that the Internet has greatly sped up the flow of information.
(King-Lup Liu, 2001) Given the countless search engines on the Internet, it is difficult for a person to determine which search engines could serve his or her information needs. A common solution is to build a metasearch engine on top of the search engines. After receiving a user query, the metasearch engine sends it to those underlying search engines that are likely to return the desired documents for the query. The selection algorithm used by a metasearch engine to decide whether a search engine should be sent the query typically makes the decision based on the search engine representative, which contains characteristic information about the database of a search engine. However, an underlying search engine may not be willing to provide the required information to the metasearch engine. This paper shows that the required information can be estimated from an uncooperative search engine with good accuracy. Two pieces of information that permit accurate search engine selection are the number of documents indexed by the search engine and the maximum weight of each term. In this paper, we present techniques for the estimation of these two pieces of information.
The main focus of this research is how bilingualism affects the school success of deaf students. The research focuses on three different signed languages: one natural signed language (ASL) and two constructed signed languages (PSE and SEE). It is expected that each student's second language is English. The questions I will address are as follows:
This growth in users of Arabic web pages calls for a statistical overview of the Arabic websites that exist on the live web: how many exist, how much of that content is archived, and how well it is archived. To answer those
Web mining is the application of data mining techniques to unstructured or semi-structured data; it automatically discovers and extracts potentially useful and previously unknown information or knowledge from the web. Significant web mining applications include website design, web search, search engines, information retrieval, network management, e-commerce, business and artificial intelligence, web marketplaces, and web communities. Online business breaks the barriers of time and space compared with physical office business. Big companies around the world are realizing that e-commerce is not just buying and selling over the Internet; rather, it improves the efficiency needed to compete with other giants in the market. This application includes temporal issues for the users.
Hypermedia systems and hypertext have given universal access to a large number of documents over the Internet, and the World Wide Web has been the most successful and popular system at linking the largest number of hypertext documents from all over the world, despite the existence of other sophisticated hypermedia systems at the time. Those systems offered a richer navigation experience and competed with the W3. This report focuses on the aspects that enhanced and empowered the success and achievements of the World Wide Web.
Computer (PC) Internet users know that it is possible to find many different kinds of text on the Internet:
The Internet organisation might not operate efficiently in India, for either cultural or technological reasons. On the cultural side, religion would be the first cause of a strongly negative effect on expanding Internet service in India. Since Hinduism is a rather closed religion (for example, it does not encourage people to travel away from India), its ethic might not encourage people to adopt new technology such as the Internet. In addition, education is still a problem in India: literacy in India is 74.04 per cent, lower than the world average according to census research (2011). However, there are still some positive aspects for the organisation. Although the number of native languages might cause serious problems for the expansion, English, a global language and the most common language of the Internet, can be used by many Indians, because India was a colony of the UK. Moreover, given that colonial history and its assimilatory education, the company might be accepted easily by local people (Science Encyclopedia 2014). Consequently, although there are advantages for the organisation in expanding into the Indian market, the disadvantages might have a bigger effect, because religion can influence people's minds and daily lives where the cultural aspect is concerned.
In this paper we survey the state of the art in multilingual text retrieval for accessing parallel web pages. A multilingual search engine typically consists of a crawler that traverses the web and retrieves the required web pages in the desired languages. It provides a front-end user interface that can be used for selecting the language of query submission. The way the query is issued leads to two types of search: cross-language information retrieval and multi-language information retrieval. In cross-language retrieval the user query is machine translated into queries in multiple languages automatically, as per the user's selection, and then issued. In multi-language retrieval the user has to provide queries in multiple languages to fetch web documents in the different languages. In some search engines the retrieved web page is also machine translated before being forwarded to the user. NLP is still in an expansion phase and has to make further advances; research is going on all over the world. We experimented with top-rated multilingual search engines to access parallel pages but could not find them. When billions of parallel web documents are present on the web in different languages, why not exploit them? An alternative is searching through a parallel pair finder.
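The classification of web pages as Devanagari or English discussed above can be approximated by a simple script heuristic: Devanagari characters occupy the Unicode block U+0900 to U+097F, so counting letters in that range gives a crude page-level label. The following is a minimal sketch of this idea; the function name and the 30 per cent threshold are our own illustration, not the method described in the paper.

```python
def script_label(text, threshold=0.3):
    """Label text as 'devanagari' or 'english' by Unicode-range counting.

    Devanagari letters occupy U+0900..U+097F. If the share of letters
    falling in that block reaches `threshold` (an illustrative value),
    the text is labelled Devanagari; otherwise English.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    dev = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return "devanagari" if dev / len(letters) >= threshold else "english"
```

In practice, a page would first be stripped of HTML markup, since tags and attributes are themselves ASCII and would otherwise dilute the Devanagari ratio.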
Abstract—Character recognition by machines is an innovative way to reduce dependence on manpower. Character recognition provides a reliable alternative for converting manual text into digitized format. Nowadays, as technology becomes an integral part of human life, many applications have incorporated English OCR for real-time input. The advantages of the English alphabet are its simplicity, offered by its small number of letters (26), and its easier classification due to the concept of lowercase and uppercase. If we consider the Devanagari script in this scenario, we come across myriad hurdles, because this script lacks the simplicity of English. The concepts of fused letters, modifiers, the shirorekha, and striking similarities between some letters make recognition difficult. Also, character recognition for handwritten text is far more complex than for machine-printed characters, because of the versatility of the writing techniques adopted by different people. The direction of strokes, the pressure applied to the writing instrument, the quality of the writing instrument, and the mindset of the writer all strongly affect the written text. When these problems are combined with the intricate details of the Devanagari script, the complications in constructing an HCR system for this script increase. The proposed system addresses these issues by adopting the Hough transform for detecting features from lines and curves; SVM is then used for classification. These two methods
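The Hough transform mentioned above detects lines by letting every foreground pixel vote for all lines that could pass through it in a (rho, theta) accumulator. The sketch below is a minimal NumPy implementation of that voting step only; the tiny synthetic image and all parameter choices are our own illustration, and a real Devanagari HCR pipeline would add preprocessing, feature extraction, and an SVM classifier on top.

```python
import numpy as np

def hough_lines(binary_img, n_theta=180):
    """Minimal Hough line transform: vote in (rho, theta) space.

    Each foreground pixel (y, x) votes for every line
    rho = x*cos(theta) + y*sin(theta) passing through it.
    Returns the accumulator, the theta grid, and the rho offset.
    """
    h, w = binary_img.shape
    diag = int(np.ceil(np.hypot(h, w)))          # max possible |rho|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(binary_img)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    return acc, thetas, diag

# A vertical stroke (like a shirorekha rotated 90 degrees) at x = 4:
img = np.zeros((10, 10), dtype=np.uint8)
img[:, 4] = 1
acc, thetas, diag = hough_lines(img)
peak = acc[4 + diag, 0]  # votes for the line rho = 4, theta = 0
```

All ten stroke pixels vote for the same accumulator cell (rho = 4, theta = 0), so the peak count equals the stroke length, which is how line features stand out from noise.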
Opinions are very important in the lives of human beings. Whenever a decision is to be taken, the opinions of others are always considered. As the impact of the web increases day by day, web documents can be seen as a new source of opinion for human beings. A large amount of information is available on the web, so it is necessary to develop methods that automatically analyze and classify this information. This domain is called Sentiment Analysis and Opinion Mining. Opinion Mining, or Sentiment Analysis, is the automatic mining of attitudes, opinions, and emotions from text, speech, and database sources through Natural Language Processing (NLP). Over the last few years there has been an enormous increase in Hindi-language web content. Research in opinion mining has mostly been carried out in English, but it is very important to perform opinion mining in Hindi as well, since a large amount of information in Hindi is also available on the web. This survey paper gives an overview of the work that has been performed in this area for the Hindi language.
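A common baseline for the Hindi opinion mining described above is lexicon lookup: each token is scored against a polarity lexicon and the signed sum labels the text. The sketch below illustrates only the mechanism; the four-word lexicon and its scores are a toy example of our own, not drawn from any published Hindi sentiment resource.

```python
# Toy Hindi polarity lexicon; words and scores are purely illustrative.
LEXICON = {
    "अच्छा": 1,    # good
    "बढ़िया": 1,   # great
    "खराब": -1,   # bad
    "बुरा": -1,    # bad
}

def polarity(sentence):
    """Sum lexicon scores over whitespace tokens; the sign gives the label."""
    score = sum(LEXICON.get(tok, 0) for tok in sentence.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Real systems extend this with stemming, negation handling, and machine-learned classifiers, since bare token lookup misses inflected forms such as "अच्छी".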