

Problem set 1, Intro to NLP, 2017

This is due on September 22nd at 11PM. Most of these questions require writing Python code and computing results, and the rest of them have textual answers. Please see detailed submission instructions below.

Shoebox and Toolbox Lexicons

A Toolbox file, previously known as a Shoebox file, is one of the most popular tools used by linguists.

1.6. List comprehension with multiple for clauses.

    (u'explorer',)
    (u'explorers',)
    (u'explores',)
    (u'exploring',)
    (u'explosion',)
    (u'explosions',)
    (u'explosive',)
    (u'explosively',)

How frequently do letters appear in different languages? Use the swadesh corpus.

New concepts introduced in this exercise:

Write a function find_language that takes a word and returns a list of languages that this word may be in. Narrow down your search scope to files in Latin1 encoding.

Word Segmentation (especially in Chinese)

Accessing Text Corpora and Lexical Resources

A text corpus is a structured collection of texts. It's a collection of files inside a directory.

Corpora can be categorized by genre or topic. Sometimes the categories may overlap, that is, one text may belong to more than one genre or topic.

Some built-in corpora are distributed along with the nltk library. All corpora live inside the rpus subpackage. Corpora may need to be downloaded. One of the corpora is named gutenberg.

Execute the following code:

    from collections import defaultdict
    from random import choice

    words = (word. …)
    trigrams = ….trigrams(words)

    # Map each pair of adjacent words to the words that follow it.
    mapper = defaultdict(list)
    for a, b, c in trigrams:
        mapper[a, b].append(c)

    def generate(word1, word2, N):
        # Start from the two seed words and append N randomly chosen followers.
        sentence = [word1, word2]
        for i in range(N):
            new_word = choice(mapper[word1, word2])
            sentence.append(new_word)
            word1, word2 = word2, new_word
        return " ".join(sentence)

    # word1, word2 = choice(list(mapper))  # use for random results
    word1, word2 = 'i', 'can'
    generate(word1, word2, 30)

    u'i can make me say not ye that speak evil against you and i fear this glorious thing is against them since by many a good thousand a year two thousand pounds'
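The trigram-based generator in this section needs a downloaded corpus before it can run. As a corpus-free sketch of the same idea, the example below builds the pair-to-followers map from a short made-up sentence and generates text from it. The sample text, the helper name `build_trigram_map`, and the `rng` parameter are assumptions made for illustration. The triples come from plain `zip` rather than from a library helper.

```python
from collections import defaultdict
import random


def build_trigram_map(words):
    """Map each pair of adjacent words to the words seen right after it."""
    mapper = defaultdict(list)
    # Zipping three staggered views of the list yields consecutive triples.
    for a, b, c in zip(words, words[1:], words[2:]):
        mapper[a, b].append(c)
    return mapper


def generate(mapper, word1, word2, n, rng=random):
    """Extend (word1, word2) by up to n words drawn from the trigram map."""
    sentence = [word1, word2]
    for _ in range(n):
        followers = mapper.get((word1, word2))
        if not followers:  # this pair was never continued in the sample text
            break
        new_word = rng.choice(followers)
        sentence.append(new_word)
        word1, word2 = word2, new_word
    return " ".join(sentence)


# Hypothetical sample text standing in for a real corpus.
sample = "i can see the sea and i can see the sky and i can hear the sea".split()
mapper = build_trigram_map(sample)

# A seeded generator keeps the result repeatable between runs.
text = generate(mapper, "i", "can", 10, rng=random.Random(0))
print(text)
```

Looking pairs up with `mapper.get` instead of `mapper[...]` avoids inserting empty entries into the `defaultdict`, and it lets generation stop cleanly when a pair has no recorded follower.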

Searching, Replacing, Splitting, Joining

