Outline - TUM


Exploring the Effect of Data Augmentation on the Quality of Text Classification
Shabnam Sadegharmaki, May 2019
Chair of Software Engineering for Business Information Systems (sebis)
Faculty of Informatics, Technische Universität München
wwwmatthes.in.tum.de

Outline
Problem Statement
  Text Classification
  Expensive training data
  Domain specific
Use cases
  News classification project at Allianz
  Legal Norm classification at Sebis chair
  Hateful speech classification of tweets
Solutions
  Semi Supervised Learning
    Self-training
    Graph based SSL
  Text Data Augmentation
    Horizontal
    Vertical
    Hypernyms
Conclusion
References

190513 Shabnam Sadegharmaki guided research final presentation sebis

Text Classification
[Figure: Training and Classification; labels: ML, Classifier]

Curse of Supervised Methods
Labeled Data: The More, The Better

However: Expensive and Scarce
On the other hand: a vast amount of unlabeled data
[Figure: Labeled vs. Unlabeled data]

Counter attacks
  Semi-supervised learning
    Self-training
    Label propagation
  Thesaurus-based data augmentation
  Multi-instance learning
  Transfer Learning
The focus is limited to text classification
(Image source: clipart-library)

But
The solutions do not work in all domains
Therefore
We chose three diverse datasets
  1. to investigate their effects on different domains
  2. to estimate their generalization power


Dataset 1: Legal Norms (LN)
601 sentences from the German BGB concerning tenancy law
Manually labeled with 9 semantic types
Properties:
  Legal domain
  Technical
  Grammatically correct
  No typos
  Formal speech
  Small domain

Semantic type   Occurrences   Rel. occurr. (%)
Duty            117           19
Indemnity       8             1
Permission      148           25
Prohibition     18            3
Objection       98            16
Continuation    21            3
Consequence     117           19
Definition      18            3
Reference       56            9

Dataset 2: GermEval18, hate speech tweets
A publicly available dataset of 5009 tweets in German
Binary classification of whether a tweet contains offensive content
Properties:
  Social network domain
  Informal speech
  Error and typo prone
  Short texts
  Emoji

Label       Occurrences   Rel. occurr. (%)
Offensive   1688          34
Other       3321          66

Dataset 3: Economic News Allianz
A private dataset provided by one of the Allianz entities
Contains 2278 news items about the German economy and industry
Binary classification of whether a news item is critical for Allianz business
Properties:
  Economy domain
  Formal speech
  Long text
  Grammatically correct
  No typos

Label          Occurrences   Rel. occurr. (%)
Critical       282           12
Non-critical   1996          88


SSL: Self-training

Self-training: Results
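As a rough illustration of self-training with a confidence threshold, the sketch below pseudo-labels unlabeled points only when the current model is confident enough. Everything in it is an assumption made for illustration: documents are taken to be already mapped to 2-D feature vectors, the nearest-centroid classifier with a softmax confidence is a stand-in (not the model used in this work), and the names fit_centroids, predict_proba and self_train are hypothetical.

```python
import math

def fit_centroids(points, labels):
    """Tiny stand-in classifier: the mean vector of each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}

def predict_proba(centroids, point):
    """Confidence per class: softmax over negative squared distances."""
    scores = {c: -((point[0] - cx) ** 2 + (point[1] - cy) ** 2)
              for c, (cx, cy) in centroids.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def self_train(labeled, labels, unlabeled, threshold=0.9, max_rounds=10):
    """Repeatedly pseudo-label unlabeled points whose confidence reaches the threshold."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled, labels)
        remaining, added = [], 0
        for point in pool:
            proba = predict_proba(centroids, point)
            best = max(proba, key=proba.get)
            if proba[best] >= threshold:
                labeled.append(point)
                labels.append(best)
                added += 1
            else:
                remaining.append(point)
        pool = remaining
        if added == 0:  # nothing confident enough: stop early
            break
    return labels, pool

# Toy data (assumed): two labeled points and three unlabeled ones.
labeled = [(0.0, 0.0), (4.0, 4.0)]
labels = ["neg", "pos"]
unlabeled = [(0.5, 0.2), (3.8, 4.1), (2.0, 2.0)]

strict_labels, strict_pool = self_train(labeled, labels, unlabeled, threshold=0.9)
loose_labels, loose_pool = self_train(labeled, labels, unlabeled, threshold=0.0)
```

With the threshold at 0.9 the ambiguous midpoint (2.0, 2.0) stays in the unlabeled pool, while a threshold of 0.0 pseudo-labels it anyway, which is one way noisy labels can enter the training set.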

SSL: Label Propagation
Graph:
  Nodes are both labeled and unlabeled examples
  Edges reflect the similarity of examples
Classification: Label Propagation
Parameters to be tuned: Gamma, Alpha
(Image source: ai.googleblog)
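A minimal sketch of label propagation on a similarity graph may help here. It assumes an RBF similarity exp(-gamma * squared distance) as the edge weight, a fully connected graph, and hard clamping of the labeled nodes; these choices, the toy 1-D points, and the function names are assumptions for illustration, not the configuration used in this work.

```python
import math

def rbf_weight(a, b, gamma):
    """Edge weight between two nodes: exp(-gamma * squared distance)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def label_propagation(points, labels, classes, gamma=1.0, iterations=100):
    """labels[i] is a class name, or None for an unlabeled node."""
    n, k = len(points), len(classes)
    # Fully connected similarity graph, row-normalised into transition weights.
    weights = [[rbf_weight(points[i], points[j], gamma) for j in range(n)]
               for i in range(n)]
    trans = [[w / sum(row) for w in row] for row in weights]

    def initial(label):
        # One-hot distribution for labeled nodes, uniform for unlabeled ones.
        if label is None:
            return [1.0 / k] * k
        return [1.0 if c == label else 0.0 for c in classes]

    dist = [initial(label) for label in labels]
    for _ in range(iterations):
        # Each node takes the weighted average of its neighbours' distributions.
        spread = [[sum(trans[i][j] * dist[j][c] for j in range(n)) for c in range(k)]
                  for i in range(n)]
        # Clamp: labeled nodes keep their known label.
        dist = [spread[i] if labels[i] is None else initial(labels[i])
                for i in range(n)]
    return [classes[max(range(k), key=lambda c: row[c])] for row in dist]

# Toy graph (assumed): two clusters on a line, one labeled node in each.
points = [(0.0,), (0.4,), (0.8,), (3.0,), (3.4,), (3.8,)]
labels = ["neg", None, None, None, None, "pos"]
predicted = label_propagation(points, labels, classes=["neg", "pos"], gamma=1.0)
```

In this sketch a larger gamma makes the similarity decay faster with distance, so labels spread mostly within tight neighbourhoods; a smaller gamma lets distant nodes influence each other more.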

Label Propagation: Results

Data Augmentation
(Image source: medium.com/nanonets)

Data augmentation in Text
Trivial solution: paraphrasing
  Original: Her life spanned years of incredible change for women as they gained more rights than ever before.
  Paraphrase: She lived through the exciting era of women's liberation.
However:
  Even harder than labeling for a human expert
  Meaning is very subjective
Solution: Thesaurus-based augmentation

Thesaurus-based data augmentation
Thesaurus of Synonyms, Hypernyms, Hyponyms
  GermaNet: German
  WordNet: other languages
Example: Mietvertrag (rental contract)
  Synonyms: Pacht, Pachtvertrag, Mietvereinbarung
  Hypernyms: Vertrag (contract)
  Hyponyms: Charter
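To make the thesaurus relations concrete, here is a small sketch that stores the Mietvertrag example in a plain dictionary and looks relations up by name. The dictionary TOY_THESAURUS and the lookup function are hypothetical stand-ins; a real setup would query GermaNet or WordNet through their own interfaces, which are not shown here.

```python
# Hypothetical in-memory thesaurus holding only the example from the slide.
TOY_THESAURUS = {
    "Mietvertrag": {
        "synonyms": ["Pacht", "Pachtvertrag", "Mietvereinbarung"],
        "hypernyms": ["Vertrag"],
        "hyponyms": ["Charter"],
    },
}

def lookup(word, relation):
    """Return the related words, or an empty list if the word is not covered."""
    return TOY_THESAURUS.get(word, {}).get(relation, [])

synonyms = lookup("Mietvertrag", "synonyms")
hypernyms = lookup("Mietvertrag", "hypernyms")
uncovered = lookup("Haus", "synonyms")  # word missing from the toy thesaurus
```

Returning an empty list for uncovered words reflects a practical limitation of any thesaurus-based approach: tokens without an entry simply cannot be augmented.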

Idea behind it
  + The weather is nice
  - The weather is awful
  ? The weather is decent

Data Augmentation
This is not only about increasing the training size, but also about expanding the semantics.
Alternatives:
  Horizontal augmentation by synonyms
  Vertical augmentation by synonyms
  Augmentation by hypernyms

Horizontal augmentation by synonyms
  + The weather is nice decent
  - The weather is awful bad
  ? The weather is decent nice

But

  + The weather is nice decent skillful dainty overnice prissy squeamish courte
  - The weather is awful dreadful painful terrible nasty awed
  ? The weather is decent clean fitting satisfactory acceptable
[Figure: Original vs. With Horizontal Augmentation]

Horizontal DA: Notes
Possible solutions:
  Only include the most frequent synsets
  Choose synsets randomly
Consequences:
  The training size does not change (no mess with tf-idf and model assumptions)
  Increase of dimensionality
  Possibility of introducing noise
  Can capture similar contexts

Vertical DA by Synonyms
  + The weather is nice
  + The weather is decent   (Bingo!!!)

  + The weather is skillful
  + The weather is dainty
  + The weather is gracious
  - The weather is awful
  - The weather is dreadful
  - The weather is terrible
  ? The weather is decent
[Figure: Original vs. Vertical DA, binary vectorizer]

But
In total, 4*2*2*2 = 32 variations exist for this one sentence alone. Enumerating every combination for all sentences is computationally infeasible.
Solution: choose r random combinations out of all possible ones.
Consequence: there is no longer any guarantee of finding a match.
Notes:
  Both dimensionality and data size are increased
  The augmented set should not appear in the test set (customized cross-validation required)

More important consequences
  TF-IDF
  Model Assumptions

Dangerous consequences
  TF-IDF
  Model Assumptions
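The difference between horizontal and vertical synonym augmentation can be sketched as follows. The synsets in TOY_SYNSETS are hypothetical: the source does not list the actual synsets, so the option counts (2, 2, 2, 4) are chosen only so that their product reproduces the 32 variants quoted for one sentence. The function names are likewise assumptions made for illustration.

```python
import itertools
import math
import random

# Hypothetical toy synsets; each list starts with the original word.
TOY_SYNSETS = {
    "the": ["the", "this"],
    "weather": ["weather", "climate"],
    "is": ["is", "seems"],
    "awful": ["awful", "dreadful", "terrible", "nasty"],
}

def options(token):
    """All interchangeable words for a token (just the token if uncovered)."""
    return TOY_SYNSETS.get(token, [token])

def horizontal_augment(sentence):
    """Append synonyms to the same sample: sample count stays, vocabulary grows."""
    tokens = sentence.lower().split()
    extra = [syn for tok in tokens for syn in options(tok)[1:]]
    return " ".join(tokens + extra)

def vertical_variants(sentence):
    """Enumerate every sentence obtained by picking one option per word."""
    tokens = sentence.lower().split()
    combos = itertools.product(*(options(t) for t in tokens))
    return [" ".join(combo) for combo in combos]

def vertical_augment(sentence, r, seed=0):
    """Draw r distinct random variants without enumerating the full product."""
    choices = [options(t) for t in sentence.lower().split()]
    total = math.prod(len(c) for c in choices)
    rng = random.Random(seed)
    picked = set()
    while len(picked) < min(r, total):
        picked.add(" ".join(rng.choice(c) for c in choices))
    return sorted(picked)

sentence = "The weather is awful"
horizontal = horizontal_augment(sentence)
all_variants = vertical_variants(sentence)
sampled = vertical_augment(sentence, r=5)
```

Horizontal augmentation returns one longer sample per input sentence, whereas vertical augmentation returns several new samples. When evaluating, all variants derived from one source sentence would need to stay on the same side of the train/test split, in line with the note about customized cross-validation.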

DA with Hypernyms
  Dimensionality decreases
  Less chance of introducing noise
  Does not interfere with model assumptions
But:
  Hypernym sets are limited
  Only nouns are supported
  And (next slide)

But
Semantics are more complex when it comes to informal speech (e.g. sarcasm, jokes)
Hateful speech:

And finally, the results

Conclusion
Label propagation

In all three data sets, vanilla LP did not reach the best performance. We hypothesize that it is very sensitive to parameters and noise compared to classical linear models.
Self-training
With a threshold, self-training can increase performance and enables the model to take advantage of a vast amount of unlabeled data. Without a threshold, it undoubtedly has a negative impact.
Text data augmentation
  Horizontal data augmentation introduces more irrelevant data, which has a negative impact on classification.
  Vertical data augmentation affects the vectorizer technique and model assumptions.
  In formal contexts, vertical DA and hypernyms are shown to be effective.
  However, in informal contexts, thesaurus-based methods do not perform well due to the complexity of the semantics.

Thank You
Questions?
