cubic feet of natural gas and 28 million barrels of oil but also its pipeline, gas-gathering and contract-drilling operations

Introduction to Corpus Linguistics, p. 6
Eberhard Karls Universität Tübingen, Seminar für Sprachwissenschaft [CL SfS]

Data for Lexicographers

How many senses does the word line have?
14 (according to Webster's New Encyclopedic Dictionary, 1994):
1. a comparatively strong slender cord
2. a cord, wire, or tape used in measuring and leveling
3. piping for conveying a fluid
4. a row of words, letters, numbers or symbols that are written, printed, or displayed
5. something that is distinct, elongated, and narrow
6. a state of agreement (bring ideas into line)
7. a course of conduct, action, or thought (a political line)
8. limit, restraint (overstep the line of good taste)
...

Instruction for Language Learning

How do you say in English: think about or think on?
If in doubt, ask Google:
- 36,300,000 hits for think about
- 738,000 hits for think on

Types of Corpora

- mono-lingual versus multi-lingual corpora
- special-purpose, domain-specific corpora versus general-purpose, large-scale corpora
- spoken language corpora versus collections of written text
- ad-hoc corpus collections versus balanced, representative corpora
- raw text versus marked-up documents
- unannotated versus annotated corpora
- WWW as a corpus

Bad Start for Corpus Linguistics

Noam Chomsky (1957), Syntactic Structures, p. 15:
"... it is obvious that the set of grammatical sentences cannot be identified with any particular corpus of utterances ...
... a grammar mirrors the behavior of the speaker, who, on the basis of a finite and accidental experience with language, can produce or understand an indefinite number of new sentences."

Bad Start for Corpus Linguistics (2)

Noam Chomsky (1957), Syntactic Structures, p. 16/17:
"... one's ability to produce and recognize grammatical utterances is not based on notions of statistical approximations or the like.
... If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list; there appears to be no particular relation between the order of approximations and grammaticalness."
English Corpora

- Brown Corpus: 1 million words of written American English texts from various genres, dating from 1961
- Lancaster-Oslo-Bergen (LOB) Corpus: 1 million words of written British English texts, dating from 1961. Genres are parallel to the Brown Corpus.
- British National Corpus: 100 million words of written and spoken language, balanced corpus of current British English
- International Corpus of English (ICE): national or regional varieties of English; one-million-word collections of contemporary spoken and written English (Great Britain, USA, Australia, South Africa, Canada, Hong Kong, India, etc.)

Non-English Corpora

- IPI PAN Polish Corpus: 300 million words
- Czech National Corpus: 100 million words
- Hungarian National Corpus: 80 million words
- Croatian National Corpus: 30 million words
- Hellenic National Corpus: 20 million words
- METU Turkish Corpus: 10 million words
- ...

Parallel Corpora

- MULTEXT-East: for Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Resian, Romanian, Russian, Slovene, and Serbian. For most languages: Orwell's 1984.
- Hansard Corpus: from the official records (Hansards) of the 36th Canadian Parliament [1997-2000], 3 million words
- Europarl: extracted from the proceedings of the European Parliament; includes versions in 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. Ca. 20 million words.

The Need for Corpus Mark-Up

Annotation guidelines are needed in order to facilitate the accessibility and reusability of corpus resources.

Minimal information:
- authorship of the source document
- authorship of the annotated document
- language of the document
- character set and character encoding used in the corpus

Leech's Seven Maxims of Annotation

1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus.
2. It should be possible to extract the annotations by themselves from the text. This is the flip side of maxim 1. Taking points 1 and 2 together, the annotated corpus should allow the maximum flexibility for manipulation by the user.
3. The annotation scheme should be based on guidelines which are available to the end user.
4. It should be made clear how and by whom the annotation was carried out.

Leech's Seven Maxims (2)

5. The end user should be made aware that the corpus annotation is not infallible, but simply a potentially useful tool.
6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles.
7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
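
Maxims 1 and 2 can be illustrated with a small sketch. The `word/TAG` layout, the sample sentence, and the function names below are assumptions made for illustration; they do not come from the slides or from any particular corpus format. Note that the sketch recovers the tokenized text, not the original whitespace.

```python
# Hypothetical annotated sample in a simple word/TAG layout (an assumption).
ANNOTATED = "The/DET cat/NOUN sleeps/VERB ./PUNCT"


def split_item(item):
    """Split 'word/TAG' at the last slash, so words containing '/' survive."""
    word, _, tag = item.rpartition("/")
    return word, tag


def strip_annotation(annotated):
    """Maxim 1: remove the annotation and return the raw token sequence."""
    return " ".join(split_item(item)[0] for item in annotated.split())


def extract_annotation(annotated):
    """Maxim 2: return the annotations by themselves, one tag per token."""
    return [split_item(item)[1] for item in annotated.split()]


print(strip_annotation(ANNOTATED))    # The cat sleeps .
print(extract_annotation(ANNOTATED))  # ['DET', 'NOUN', 'VERB', 'PUNCT']
```

Keeping both directions as separate functions mirrors the point of the two maxims: the text and its annotation should each be recoverable without the other.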
Text Encoding Initiative (TEI)

- project sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing, and the Association for Computers in the Humanities
- encoding guidelines, link: http://www.tei-c.org
- the guidelines define how documents should be marked up with the mark-up language SGML (or more recently XML)

XML

- XML: Extensible Markup Language
- similar to HTML
- has no fixed "semantics": the user defines what tags mean
- standardized as a W3C Recommendation; its parent SGML is an international ISO standard (ISO 8879)
- formally verifiable via document type definitions (DTD)
- tools available for editing, displaying, querying

XML: Example

    <CATALOG>
      <CD>
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>10.90</PRICE>
        <YEAR>1985</YEAR>
      </CD>
      <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
      </CD>
    </CATALOG>

TEI Guidelines

Each text that conforms to the TEI guidelines consists of two parts: a header and the text itself.

The header contains information such as:
- author, title, and date
- the edition or publisher used in creating the machine-readable text
- information about the encoding practices adopted

XML Annotated Text

    <text>
    <body>
    <div type="BODY">
    <div type="Q">
    <head>Subject: The staffing in the Commission of the European Communities</head>
    <p>Can the Commission say:</p>
    <p>1. how many temporary officials are working at the Commission?</p>
    <p>2. who they are and what criteria were used in selecting them?</p>
    </div>
    <div type="R">
    <head>Answer given by <name type="PERSON"><abbr rend="TAIL-SUPER">Mr</abbr> Cardoso e Cunha</name> on behalf of the Commission <date>(22 September 1992)</date></head>
    <p>1 and 2. The Commission will send tables showing the number of temporary staff working for the Commission directly to the Honourable Member and to Parliament's Secretariat.</p>
    </div></div></body></text>

Levels of Linguistic Annotation

- morphological annotation (e.g. inflection, derivation, compounding)
- morpho-syntactic annotation: part-of-speech (POS) tagging
- syntactic annotation (e.g. named entities, phrasal chunking, full syntactic analysis)
- semantic annotation (e.g. word-sense disambiguation, anaphora and coreference resolution, information structure)
- discourse annotation (e.g. dialog turns, speech acts)

Why Do We Need Annotation?

- for training NLP tools
- for finding examples
  - what is the plural form of fish?
  - which nouns can occur as bare nouns, without a determiner?
  - are there subjectless sentences in German? Yes, e.g. Mir ist kalt. (literally "To me is cold", i.e. "I am cold")
  - is it possible in English to have something between a noun and its modifying relative clause?

Step by Step Annotation

- tokenization
- lemmatization / morphological analysis
- part-of-speech tagging
- named-entity recognition
- partial parsing
- full syntactic parsing
- semantic and discourse processing

Preprocessing the Text: Tokenization

Tokenization refers to the annotation step of dividing the input text into units called tokens.
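
The definition above can be sketched with a single regular expression. This is only a minimal illustration: the pattern and the name `tokenize` are assumptions, not a tool named in the slides. It recognizes just three token classes (numbers, words, and single punctuation marks or special characters) and does not handle abbreviations, ordinals, or multiword units.

```python
import re

# One alternative per token class; numbers are tried before words because
# \w also matches digits.
TOKEN_PATTERN = re.compile(
    r"\d+(?:[.,]\d+)*"    # a number, optionally with internal '.' or ','
    r"|\w+(?:['-]\w+)*"   # a word, keeping internal apostrophes and hyphens
    r"|[^\w\s]"           # any single punctuation mark or special character
)


def tokenize(text):
    """Return the list of tokens found in text, in order of occurrence."""
    return TOKEN_PATTERN.findall(text)


print(tokenize('Milton wrote "Paradise Lost." Then his wife dies.'))
# ['Milton', 'wrote', '"', 'Paradise', 'Lost', '.', '"', 'Then', 'his', 'wife', 'dies', '.']
```

Because the pattern treats every period outside a number as its own token, an abbreviation such as "etc." would be split into two tokens, so a sketch like this is a starting point at best.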
Each token consists of either:
- a morpho-syntactic word
- a punctuation mark or a special character (e.g. &, @, %)
- a number

Tokenization: Example

before tokenization:
Milton wrote "Paradise Lost." Then his wife dies and he wrote "Paradise Regained."

after tokenization:
Milton wrote " Paradise Lost . " Then his wife dies and he wrote " Paradise Regained . "

Why is Tokenization Non-Trivial?

- disambiguation of punctuation: e.g. a period can occur inside cardinal numbers, after ordinals, after abbreviations, and at the end of sentences
- recognition of complex words
  - compounds, e.g. bank transfer fee, US-company
  - mergers, e.g. don't, England's, French: t'aime
  - multiwords, e.g. complex prepositions: provided that, in spite of

Lemmatization

- refers to the process of relating individual word forms to their citation form (lemma) by means of morphological analysis, e.g. stopped → stop
- provides a means to distinguish between the total number of word tokens and distinct lemmata that occur in a corpus, e.g. helps to find all occurrences of buy
- is indispensable for highly inflectional languages which have a large number of distinct word forms for a given lemma

Lemmatization: German Example

    wie       wie+Adv+Wh+#lex+COWIE
    wie       wie+Conj+Coord+#lex+COWIE
    wie       wie+Conj+Subord+#lex+COWIE
    sie       sie+Pron+Pers+3P+Pl+Fem+Nom+#lex+PERSPRO
    sie       sie+Pron+Pers+3P+Sg+Fem+Nom+#lex+PERSPRO
    offenbar  offenbaren+Verb+Imp+2P+Sg+#lex+VVFIN
    offenbar  offenbar+Adj+Pos+Pred+#lex+ADJD
    gedacht   gedenken+Verb+PPast+#lex+VVPP
    gedacht   dachen+Verb+PPast+#lex+VVPP
    gedacht   denken+Verb+PPast+#lex+VVPP
    hat       haben+Verb+Indc+3P+Sg+Pres+#lex+VAFIN

Tools for Lemmatization

- XEROX Morphological Analyzer: comprehensive morphological analyzers for many languages including English, French, Dutch, German, Hungarian, Italian, Portuguese, Czech, Danish, Finnish, Norwegian, Polish, Russian, Turkish.
  link: http://www.xrce.xerox.com/competencies/content-analysis/toolhome.en.html
- Lingsoft: morphological analyzers for English, Danish, German, Swedish, and Finnish
  link: http://www.lingsoft.fi/demos.html

Morphological Analysis

Xerox:

    half  half+Adj
    half  half+Adv
    half  half+Noun+Sg

Lingsoft:

    "<half>"
        "half" <Quant> DET PRE SG/PL @QN>
        "half" <NonMod> <Quant> PRON SG/PL
        "half" N NOM SG
        "half" ADV
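
Analyses in the Xerox-style layout shown above (surface form, then lemma and tags joined by `+`) are easy to post-process. The following sketch groups candidate lemmas by surface form. The sample lines are copied from the slides, but the whitespace separator between surface form and analysis and the function names are assumptions; this is not the analyzer's own API.

```python
from collections import defaultdict

# Sample analyses from the slides; the whitespace separator is an assumption.
ANALYSES = """\
half half+Adj
half half+Adv
half half+Noun+Sg
gedacht gedenken+Verb+PPast+#lex+VVPP
gedacht denken+Verb+PPast+#lex+VVPP
hat haben+Verb+Indc+3P+Sg+Pres+#lex+VAFIN
"""


def parse_analysis(line):
    """Split 'surface lemma+Tag+...' into (surface, lemma, [tags])."""
    surface, analysis = line.split(maxsplit=1)
    lemma, *tags = analysis.split("+")
    return surface, lemma, tags


def lemmas_by_form(text):
    """Map each surface form to the set of its candidate lemmas."""
    table = defaultdict(set)
    for line in text.splitlines():
        if line.strip():
            surface, lemma, _ = parse_analysis(line)
            table[surface].add(lemma)
    return dict(table)


table = lemmas_by_form(ANALYSES)
print(sorted(table["gedacht"]))             # ['denken', 'gedenken']
print(parse_analysis("half half+Noun+Sg"))  # ('half', 'half', ['Noun', 'Sg'])
```

A form such as gedacht maps to more than one lemma, which is why morphological analysis alone does not complete lemmatization: a later step, such as part-of-speech tagging in context, has to choose among the candidates.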