University of Macau

CISC7021 - Applied Natural Language Processing

Assignment 1, 2023/2024

(Due date: 26 September 2023)

Introduction

In this assignment, we will build n-gram language models and evaluate their perplexity on a test set. We will learn how to create a language model with the SRILM language modeling toolkit¹ (Stolcke, 2002). The toolkit can be downloaded at http://www.speech.sri.com/projects/srilm/download.html. Basic instructions for using SRILM can also be found on that website.
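To make the evaluation measure concrete, here is a minimal Python sketch of how perplexity can be computed for a bigram model. It is only an illustration and not how SRILM works internally: it assumes add-one smoothing (SRILM offers other discounting methods), whitespace tokenization, and the sentence boundary symbols <s> and </s>; the function names are made up for this example, and unseen words are not given a dedicated unknown-word token.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(sentences, unigrams, bigrams):
    """Perplexity = exp(-(1/N) * sum(log P(w_i | w_{i-1}))).

    Probabilities use add-one smoothing (an assumption of this sketch).
    N counts every predicted token, including the end symbol </s>.
    """
    vocab_size = len(unigrams)
    log_prob, n_predicted = 0.0, 0
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log(p)
            n_predicted += 1
    return math.exp(-log_prob / n_predicted)


train = ["the cat sat", "the dog sat"]
unigrams, bigrams = train_bigram(train)
ppl_train = perplexity(train, unigrams, bigrams)
ppl_unseen = perplexity(["dog the cat"], unigrams, bigrams)
print(ppl_train, ppl_unseen)
```

A lower perplexity means the model assigns higher probability to the text, which is why text resembling the training data usually scores lower than text with unseen word sequences.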

Train and Test Data

The training and testing data for this assignment come from the News Commentary corpus, which was created for training English language models. The training set consists of 300 thousand lines of text, and the test set consists of around 90 thousand lines. The corpora are taken from the official website of the shared task on Machine Translation of News.² Both the training and testing data can be downloaded from UMMoodle.

Tasks

1. Build word-based language models (1-gram, 2-gram, and 3-gram) for English text from the training data, and measure the perplexity on the training and test sets.

2. Build character-based language models, from 1-gram to 6-gram, using the training data, and measure the perplexity on the training and test sets.

3. Collect more monolingual data from the First Conference on Machine Translation (WMT16) and add it to the training data. Build language models and measure the perplexity.
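The tasks above could be scripted roughly as follows. This is a hedged sketch, not a prescribed solution. Because SRILM treats whitespace-separated tokens as words, one way to obtain a character-based model is to rewrite each line so that every character becomes its own token. The <sp> placeholder for the original spaces, the file names, and the helper function names are all assumptions made for this example. The command lists use the ngram-count options -order, -text, and -lm, and the ngram options -order, -lm, and -ppl; check the SRILM manual pages for smoothing and other settings.

```python
def to_char_tokens(line, space_symbol="<sp>"):
    """Rewrite a line so that each character is one whitespace-separated token.

    Original whitespace is replaced by space_symbol (a hypothetical
    placeholder) so that word boundaries are not lost.
    """
    chars = [space_symbol if ch.isspace() else ch for ch in line.strip()]
    return " ".join(chars)


def srilm_train_cmd(order, train_path, lm_path):
    """Argument list for training an n-gram model of the given order."""
    return ["ngram-count", "-order", str(order),
            "-text", train_path, "-lm", lm_path]


def srilm_ppl_cmd(order, lm_path, eval_path):
    """Argument list for measuring the perplexity of eval_path under a model."""
    return ["ngram", "-order", str(order), "-lm", lm_path, "-ppl", eval_path]


print(to_char_tokens("the cat\n"))
for n in (1, 2, 3):
    # Hypothetical file names; replace with the files from UMMoodle.
    print(" ".join(srilm_train_cmd(n, "train.txt", f"word.{n}gram.lm")))
    print(" ".join(srilm_ppl_cmd(n, f"word.{n}gram.lm", "test.txt")))
```

Once the SRILM binaries are on your PATH, each list can be passed to subprocess.run. The same two commands apply to the character-based models after both the training and test files have been converted with to_char_tokens.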

Environment Setup

All development tools required for the course assignments and projects are Linux/Unix programs, so you need a Linux platform for running experiments and implementing systems. We strongly recommend hosting a Linux system (e.g. Ubuntu - http://www.ubuntu.com/) in a virtual machine (e.g. VirtualBox - https://www.virtualbox.org/). You will also use different toolkits for various (pre)processing tasks in the coursework. For example, you need a g++ compiler to compile the SRILM toolkit in this assignment.

1 http://www.speech.sri.com/projects/srilm/download.html

2 http://www.statmt.org/wmt16/translation-task.html

In any case, the toolkit comes with documentation. If you are new to processing text on the Linux platform, Church (1994)³ gives a very good introduction to using Unix commands for basic text processing.

Report

You need to submit a report on your work (2-3 pages). It should clearly present what you did in your experiments, how you carried them out, and how you solved the problems you encountered. You should include tables (or graphs) of the data (e.g. corpus statistics), the evaluated perplexities, etc. of your models. I am particularly interested in the conclusions you draw about the models you built and the data you collected, as well as your analysis of the results. The report should follow the two-column format of the ACL proceedings.
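For the corpus statistics table, a small script along the following lines may help. It is a sketch under simple assumptions: tokens are split on whitespace, and the function names and the out-of-vocabulary (OOV) rate are this example's own choices, not requirements of the assignment.

```python
from collections import Counter


def corpus_stats(lines):
    """Return the number of lines, tokens, and distinct types in a corpus."""
    n_lines = 0
    counts = Counter()
    for line in lines:
        n_lines += 1
        counts.update(line.split())
    return {"lines": n_lines,
            "tokens": sum(counts.values()),
            "types": len(counts)}


def oov_rate(train_lines, test_lines):
    """Fraction of test tokens that never occur in the training data."""
    vocab = {tok for line in train_lines for tok in line.split()}
    test_tokens = [tok for line in test_lines for tok in line.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


print(corpus_stats(["a b a", "c"]))
print(oov_rate(["a b"], ["a c c d"]))
```

Since an open file object yields lines, corpus_stats can be called directly on open(path, encoding="utf-8") for the training and test files.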


References

1. Kenneth Ward Church. 1994. Unix™ for Poets. Notes of a course from the European Summer School on Language and Speech Communication, Corpus Based Methods.

2. Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In 7th International Conference on Spoken Language Processing, ICSLP2002 - INTERSPEECH 2002, Denver, Colorado, USA, September 16-20, 2002.

3 http://www.cs.upc.edu/~padro/Unixforpoets.pdf
