Natural Language Processing: Crash Course Computer Science #36


Hi, I’m Carrie Anne, and welcome to Crash
Course Computer Science! Last episode we talked about computer vision
– giving computers the ability to see and understand visual information. Today we’re going to talk about how to give
computers the ability to understand language. You might argue they’ve always had this
capability. Back in Episodes 9 and 12, we talked about
machine language instructions, as well as higher-level programming languages. While these certainly meet the definition
of a language, they also tend to have small vocabularies and follow highly structured
conventions. Code will only compile and run if it’s 100
percent free of spelling and syntactic errors. Of course, this is quite different from human
languages – what are called natural languages – containing large, diverse vocabularies,
words with several different meanings, speakers with different accents, and all sorts of interesting
word play. People also make linguistic faux pas when
writing and speaking, like slurring words together, leaving out key details so things
are ambiguous, and mispronouncing things. But, for the most part, humans can roll right
through these challenges. The skillful use of language is a major part
of what makes us human. And for this reason, the desire for computers
to understand and speak our language has been around since they were first conceived. This led to the creation of Natural Language
Processing, or NLP, an interdisciplinary field combining computer science and linguistics. INTRO There’s an essentially infinite number of
ways to arrange words in a sentence. We can’t give computers a dictionary of
all possible sentences to help them understand what humans are blabbing on about. So an early and fundamental NLP problem was
deconstructing sentences into bite-sized pieces, which could be more easily processed. In school, you learned about nine fundamental
types of English words: nouns, pronouns, articles, verbs, adjectives, adverbs, prepositions,
conjunctions, and interjections. These are called parts of speech. There are all sorts of subcategories too,
like singular vs. plural nouns and superlative vs. comparative adverbs, but we’re not going
to get into that. Knowing a word’s type is definitely useful,
but unfortunately, there are a lot words that have multiple meanings – like “rose”
and “leaves”, which can be used as nouns or verbs. A digital dictionary alone isn’t enough
to resolve this ambiguity, so computers also need to know some grammar. For this, phrase structure rules were developed,
which encapsulate the grammar of a language. For example, in English there’s a rule that
says a sentence can be comprised of a noun phrase followed by a verb phrase. Noun phrases can be an article, like “the”,
followed by a noun or they can be an adjective followed by a noun. And you can make rules like this for an entire
language. Then, using these rules, it’s fairly easy
to construct what’s called a parse tree, which not only tags every word with a likely
part of speech, but also reveals how the sentence is constructed. We now know, for example, that the noun focus
of this sentence is “the mongols”, and we know it’s about them doing the action
of “rising” from something, in this case, “leaves”. These smaller chunks of data allow computers
to more easily access, process and respond to information. Equivalent processes are happening every time
you do a voice search, like: “where’s the nearest pizza”. The computer can recognize that this is a
“where” question, knows you want the noun “pizza”, and the dimension you care about
is “nearest”. The same process applies to “what is the
biggest giraffe?” or “who sang thriller?” By treating language almost like lego, computers
can be quite adept at natural language tasks. They can answer questions and also process
commands, like “set an alarm for 2:20” or “play T-Swizzle on spotify”. But, as you’ve probably experienced, they
fail when you start getting too fancy, and they can no longer parse the sentence correctly,
or capture your intent. Hey Siri… methinks the mongols doth roam
too much, what think ye on this most gentle mid-summer’s day? Siri: I’m not sure I got that. I should also note that phrase structure rules,
and similar methods that codify language, can be used by computers to generate natural
language text. This works particularly well when data is
stored in a web of semantic information, where entities are linked to one another in meaningful
relationships, providing all the ingredients you need to craft informational sentences. Siri: Thriller was released in 1983 and sung
by Michael Jackson Google’s version of this is called Knowledge
Graph. At the end of 2016, it contained roughly seventy
billion facts about, and relationships between, different entities. These two processes, parsing and generating
text, are fundamental components of natural language chatbots – computer programs that
chat with you. Early chatbots were primarily rule-based,
where experts would encode hundreds of rules mapping what a user might say, to how a program
should reply. Obviously this was unwieldy to maintain and
limited the possible sophistication. A famous early example was ELIZA, created
in the mid-1960s at MIT. This was a chatbot that took on the role of
a therapist, and used basic syntactic rules to identify content in written exchanges,
which it would turn around and ask the user about. Sometimes, it felt very much like human-human
communication, but other times it would make simple and even comical mistakes. Chatbots, and more advanced dialog systems,
have come a long way in the last fifty years, and can be quite convincing today! Modern approaches are based on machine learning,
where gigabytes of real human-to-human chats are used to train chatbots. Today, the technology is finding use in customer
service applications, where there’s already heaps of example conversations to learn from. People have also been getting chatbots to
talk with one another, and in a Facebook experiment, chatbots even started to evolve their own
language. This experiment got a bunch of scary-sounding
press, but it was just the computers crafting a simplified protocol to negotiate with one
another. It wasn’t evil, it’s was efficient. But what about if something is spoken – how
does a computer get words from the sound? That’s the domain of speech recognition,
which has been the focus of research for many decades. Bell Labs debuted the first speech recognition
system in 1952, nicknamed Audrey – the automatic digit recognizer. It could recognize all ten numerical digits,
if you said them slowly enough. 5… 9… 7? The project didn’t go anywhere because it
was much faster to enter telephone numbers with a finger. Ten years later, at the 1962 World’s Fair,
IBM demonstrated a shoebox-sized machine capable of recognizing sixteen words. To boost research in the area, DARPA kicked
off an ambitious five-year funding initiative in 1971, which led to the development of Harpy
at Carnegie Mellon University. Harpy was the first system to recognize over
a thousand words. But, on computers of the era, transcription
was often ten or more times slower than the rate of natural speech. Fortunately, thanks to huge advances in computing
performance in the 1980s and 90s, continuous, real-time speech recognition became practical. There was simultaneous innovation in the algorithms
for processing natural language, moving from hand-crafted rules, to machine learning techniques
that could learn automatically from existing datasets of human language. Today, the speech recognition systems with
the best accuracy are using deep neural networks, which we touched on in Episode 34. To get a sense of how these techniques work,
let’s look at some speech, specifically, the acoustic signal. Let’s start by looking at vowel sounds,
like aaaaa…and Eeeeeee. These are the waveforms of those two sounds,
as captured by a computer’s microphone. As we discussed in Episode 21 – on Files
and File Formats – this signal is the magnitude of displacement, of a diaphragm inside of
a microphone, as sound waves cause it to oscillate. In this view of sound data, the horizontal
axis is time, and the vertical axis is the magnitude of displacement, or amplitude. Although we can see there are differences
between the waveforms, it’s not super obvious what you would point at to say, “ah ha!
this is definitely an eeee sound”. To really make this pop out, we need to view
the data in a totally different way: a spectrogram. In this view of the data, we still have time
along the horizontal axis, but now instead of amplitude on the vertical axis, we plot
the magnitude of the different frequencies that make up each sound. The brighter the color, the louder that frequency
component. This conversion from waveform to frequencies
is done with a very cool algorithm called a Fast Fourier Transform. If you’ve ever stared at a stereo system’s
EQ visualizer, it’s pretty much the same thing. A spectrogram is plotting that information
over time. You might have noticed that the signals have
a sort of ribbed pattern to them – that’s all the resonances of my vocal tract. To make different sounds, I squeeze my vocal
chords, mouth and tongue into different shapes, which amplifies or dampens different resonances. We can see this in the signal, with areas
that are brighter, and areas that are darker. If we work our way up from the bottom, labeling
where we see peaks in the spectrum – what are called formants – we can see the two
sounds have quite different arrangements. And this is true for all vowel sounds. It’s exactly this type of information that
lets computers recognize spoken vowels, and indeed, whole words. Let’s see a more complicated example, like
when I say: “she.. was.. happy” We can see our “eee” sound here, and “aaa”
sound here. We can also see a bunch of other distinctive
sounds, like the “shh” sound in “she”, the “wah” and “sss” in “was”,
and so on. These sound pieces, that make up words, are
called phonemes. Speech recognition software knows what all
these phonemes look like. In English, there are roughly forty-four,
so it mostly boils down to fancy pattern matching. Then you have to separate words from one another,
figure out when sentences begin and end… and ultimately, you end up with speech converted
into text, allowing for techniques like we discussed at the beginning of the episode. Because people say words in slightly different
ways, due to things like accents and mispronunciations, transcription accuracy is greatly improved
when combined with a language model, which contains statistics about sequences of words. For example “she was” is most likely to
be followed by an adjective, like “happy”. It’s uncommon for “she was” to be followed
immediately by a noun. So if the speech recognizer was unsure between,
“happy” and “harpy”, it’d pick “happy”, since the language model would report that
as a more likely choice. Finally, we need to talk about Speech Synthesis,
that is, giving computers the ability to output speech. This is very much like speech recognition,
but in reverse. We can take a sentence of text, and break
it down into its phonetic components, and then play those sounds back to back, out of
a computer speaker. You can hear this chaining of phonemes very
clearly with older speech synthesis technologies, like this 1937, hand-operated machine from
Bell Labs. Say, “she saw me” with no expression. She saw me. Now say it in answer to these questions. Who saw you? She saw me. Who did she see? She saw me. Did she see you or hear you? She saw me. By the 1980s, this had improved a lot, but
that discontinuous and awkward blending of phonemes still created that signature, robotic
sound. Thriller was released in 1983 and sung by Michael Jackson. Today, synthesized computer voices, like Siri,
Cortana and Alexa, have gotten much better, but they’re still not quite human. But we’re soo soo close, and it’s likely
to be a solved problem pretty soon. Especially because we’re now seeing an explosion
of voice user interfaces on our phones, in our cars and homes, and maybe soon, plugged
right into our ears. This ubiquity is creating a positive feedback
loop, where people are using voice interaction more often, which in turn, is giving companies
like Google, Amazon and Microsoft more data to train their systems on… Which is enabling better accuracy, which is
leading to people using voice more, which is enabling even better accuracy… and the
loop continues! Many predict that speech technologies will
become as common a form of interaction as screens, keyboards, trackpads and other physical
input-output devices that we use today. That’s particularly good news for robots,
who don’t want to have to walk around with keyboards in order to communicate with humans. But, we’ll talk more about them next week. See you then.

100 thoughts on “Natural Language Processing: Crash Course Computer Science #36

  1. Little correction: Phonems are from the realm of Phonology, not Phonetics. The corresponding unit from Phonetics is a Phon. The distinction is important, because Phons are a physics based unit while Phonems are a mind based unit. One Phonem can contain many multiple Phons, which are then referred to as Allophones.

  2. HTML is allowed to contain errors and will usually run just fine.

    edit:
    fyi: html is interpretted, not compiled. The video is still accurate.

  3. This series is so fantastic. I get super excited to hear real-world examples and helps me envision the possibilities in the future.

  4. NLP is "interdisciplinary", except companies exclusively hire programmers who think like computers, and leave out the linguists who can bring in the human aspect and funnel Language into organized data to be programmed, and solve challenges others didn't even think of. what people want computers to do, and how to reach that goal, is basically what linguists study. but everyone thinks they're a linguist with knowledge of basic grammar and/or brief high school study of a foreign language shrug-. keep holding yourself back, science; because only STEM is qualified, apparently. none of that mushy human knowledge, or humanities – literally human studies — to make computers/machines/robots better equipped to interact with :humans:. Linguistics is already an incredibly broad term, and is interdisciplinary in itself (Historical Linguistics deals with changes in Language, and specific languages, over time; as well as pronunciation changes, marking what changes were made, and providing a pattern. Syntax is basically mathematics & logic. Psycholinguistics uses neurobiology. and those are only a few of the branches within the study of Linguistics. Linguistics, in the broad, all-encompassing sense is generally deemed a "humanities" study, except maybe by MIT).

  5. And yet I've still never seen a web-based version of Eliza or any other Turing test contender that works using HTML speech recognition and generation. Anyone know of any?

  6. thank you for the episode! i knew a lot of it, but this video managed to structure everything so neatly that i feel like i understand the topic so much better now :>

  7. In the video where u talk about black holes u said that time would speed up for someone falling in to a black holes but would that mean u could see the black hole that u are falling in to died like black hole supernova and if i was at the a black hole and i was one meter away from the evethorisen could i see the future and get back to Earth i would be dead but could i??

  8. Do a video on Nepal ( talk about its history: Prithvi Narayan Shah, Manjushree), Nature: Mt Everest, The highest situated lake Tilicho, geographic regions, etc, Religion: 80% hindu 10% buddhism ( Buddha was born in Lumbini, Nepal) , 2015 Earthquake and Nepal Blockade and also the royal massacre

  9. thanks for flagging up the knowledge graph. Is there a transcript of this? The chat is VERY fast… almost like one continuous sentence!

  10. One thing to keep in mind with regard to natural language is that natural language is often highly imprecise, and very dependant upon the level of eloquence of the transmitter and the knowledge and understanding of the receiver, and the importance of shared points of reference for both of them.

    This is why for instance when it comes to the sciences there is such a heavy emphasis on exact word use. When writing a scientific article or instruction manual, you want to be as precise as possible with your use of words, terms and grammer and to minimize the chance of missunderanding to be as small as possible.

    That is why, even if computers could understand natural language, the degrees of complexity of the tasks which they could pull off would inevitably be affected by the accuracy of the instructions(macros and preprogrammed sequences and programs are of course exempted).

  11. …and, finally, speaking of accuracy, linguists tell us that linguistics, the scientific study of language-in-general, is not an exact science, and yet, linguistics is what we use to convey the understanding of exact science—(is that a paradox)—and furthermore will correct the usage of natural language only when we stop monkeying-around for example, NOT-ELSE.

  12. There is even Vocaloid, a speech synthesizer software by Yamaha aimed to singing. There is a whole culture aroind it because the companies that made the voice banks have created anime-style characters for them.

  13. Try turning on automatic captions. For clearly spoken stuff like what is on this channel, they are super accurate now.

  14. It's pretty much solved if you were Japanese. Honestly surprised you used Siri or Cortana as an example at all given how backwater they sound compared to Vocaloid.

  15. import pyttsx
    engine = pyttsx.init()
    engine.say('Binary solo
    Zero zero zero zero zero zero one
    Zero zero zero zero zero zero one one
    Zero zero zero zero zero zero one one one
    Zero zero zero zero zero one one one one
    Oh, oh,
    Oh, one
    Come on sucker,
    Lick my battery .')
    engine.runAndWait()

  16. I'm 55. It has been "pretty soon" that voice recognition would work properly for most of my life. It is now in common use – and it still does not work. Perhaps if I live to a 110, but I will take no bets on that.

  17. WHAT THINK YE ON THIS GENTLE MIDSUMMER'S DAY? >intense concentration< maybe my favorite part in the entire series so far

  18. LOL. Majority of this is hype. Chatbots haven't come that far since ELIZA from a conversational perspective. Sure there are advances in recommendation engines and other algorithms, however, these services can be used in a traditional app, as for the intelligent conversations with an AI, the bots today are on average the same as Eliza if not worst, and the better ones are only slightly better. In addition the Facebook story of bot creating their own language has been debunked.

  19. Oh, I just watched 3Blue1Brown's video on the Fourrier Transform a couple of days ago!
    If you don't know him yet: Definitely check him out, he does the best videos visualizing and explaining all sorts of mathematical … things 🙂

  20. apart from the vocabularies concept can you give short description on Entity extraction, aspect of entity extraction , like how can these interrelate to Natural language processing.

  21. LMAOOOOO I had my phone with me and siri was activated on my phone while this video was playing hahahaha I was like WTF hahaha

  22. pro tip, turn the 12 minute video into 15. Give the viewer consistent and natural pauses between sentences and concepts, and let them absorb the information. If people want to speed it up they can, but at least the pauses are natural.

  23. Omg, your videos are good, but you speak really fast. Could you improve that?

  24. Absolutely fantastic video. A great overview of the topic making it easy for me to learn more about the parts that interest me. I will be watching more of these this weekend

  25. A video on
    1) firmware, drivers, microcontroller, daq and PLD
    2) database
    Would be very useful and complete…

  26. The best part of this video is when she spoke in Shakespearean to Siri. Who here can also fluently speak this language? I know I can.

  27. What bothers me about those "speech recognizers" is. Most of the time we humans don't say things like "I <pause> am <pause> going <pause> to <pause> …". But talk more like this "Am goingto" (yes, I exaggerated a bit). So, we combine two words, because you talk faster. Or we even skip words. But the computer is programmed to recognize each word individually. Which isn't the way humans naturally speak.

  28. Do you, as AI makers, identify as humans because I can see it benefiting humanity trickling down from the AI-creator elites getting max power but once you've fully quantized a human, reproduced one, and surpassed one.. why would you care about a biological one, or is 'human' obsolete and to survive we MUST augment to AI then full upload to the AI-net?

  29. Probably the best channel/playlist that I have encountered on youtube till date apart from 3blue1brown.
    Carrie Ann, Carry on!

  30. If you still have a Windows XP computer, or if you have access to Microsoft Sam (the voice of Narrator in Windows XP), replace the sample text with "soy" or "soi", and you'll hear a strange sound.

  31. I wonder if there are some academic papers that introducing these concepts? I hope wish I can find good citation for my dissertation, thank you!

Leave a Reply

Your email address will not be published. Required fields are marked *