
Talk:Natural language processing

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Ideogram (talk | contribs) at 23:14, 27 February 2012 (reassess). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

(Computational Linguistics) Merge

PRO I think that this article should probably be merged with Computational linguistics, but I'm fairly new to the Wikipedia, so I'm not sure.

Lambda 22:55, 22 Feb 2004 (UTC)

CON While they're related, they're not really the same thing. Computational linguistics tries to use computer techniques to better understand linguistics as a discipline, while NLP tries to build ways for a computer to understand language. Obviously many things overlap, but they have much different focus: NLP doesn't explicitly care if it's making new contributions to linguistics, and computational linguistics doesn't explicitly care if it's making it easier for computers to understand natural languages. --Delirium 22:58, Feb 22, 2004 (UTC)

Unclear My take on this (I'm a grad student studying NLP/CL) is that CL and NLP are the endpoints on a continuum, and so a lot of work in the middle is hard to classify as one or the other. They don't have separate conferences - the Association for Computational Linguistics (annual) and Computational Linguistics (biannual) are the main conferences for both NLP and CL research. 24.59.194.44 13:26, 23 June 2006 (UTC)[reply]

PRO I agree -- we should merge. Whether you call it NLP or CL is mostly a question of what aspect you stress. In addition, my impression is that the NLP tendency is currently stronger than the CL tendency in the field. Articles in the Computational Linguistics journal, and at the Coling and ACL conferences, are judged on whether they are useful rather than on whether they give any insight on how humans process language.Kallerdis (talk) 19:35, 29 February 2008 (UTC)[reply]

CON There's a fine distinction between NLP and Computational Linguistics that has to do primarily with the distinction between computing and linguistics. Historically, NLP is associated with computing and CL with linguistics. I would be opposed to the merge for that reason. Investigations into the nature of language are misplaced in applied computing and practical aspects of parsing for say commercial applications are misplaced in Linguistics. 74.78.162.229 (talk) 21:30, 10 July 2008 (UTC)[reply]

PRO/Rebuttal Both NLP and CL have the same objectives, and this "fine distinction" is irrelevant when both CL and NLP involve computing and linguistics (who cares about the mixture proportions?). ----Dustin
CON I agree. -- AKA MBG (talk) 09:58, 11 July 2008 (UTC)[reply]

PRO CL and NLP should be merged. There are other fields -- what I'd call "Natural Language Understanding" or "Machine Reading" -- that have more ambitious goals: getting a computer to "understand" some natural language. NLP and CL have made more progress, but are application driven -- the technology behind them is often just Perl scripts computing statistics over NL corpora. In any case, certainly NLP should merge with NLU or CL, but definitely not both. ----Dustin

PRO My understanding has always been that CL is the term used by people with linguistics backgrounds, while NLP is more often used by computer scientists. At worst, I'd call CL a core subfield of NLP. —/Mendaliv//Δ's/ 23:31, 17 October 2008 (UTC)[reply]
CON CL is not a subfield of NLP, unless you're an NLP researcher ;) I've studied at both NLP-oriented departments and more CL-oriented ones. There's a lot of overlap, sure, but the approaches differ a lot (do we build corpora to study eg. the limits of case alignment in natural language, or to get testbeds for parsers?). Also, there is a difference in methods used, eg. in both fields there are those who swear to statistical methods, but NLP (eg. for parsing, MT) puts a lot of credit in Bayesian methods, while CL (eg. corpus linguists) uses more standard hypothesis tests. Kiwibird (talk) 11:57, 16 November 2008 (UTC)[reply]

CON -- see my suggestions under #Clean-up/Major edit. --Thüringer ☼ (talk) 08:47, 15 January 2009 (UTC)[reply]

PRO I have worked in CL/NLP for two decades, and as far as I am aware, there is no clear distinction in practice between CL and NLP, both have the same conferences, the same publications, the same research communities. In my opinion, it would be better to have one merged article, with mention of the different subfields within CL/NLP. Gor (talk) 06:45, 27 March 2009 (UTC)[reply]

PRO I work as a researcher in CL/NLP/Text Analytics/AI/Machine Learning/etc. I think CL and NLP should be merged; in the grand scheme of things, there is not much difference (if any). Either way, as I said under the CL article: It seems to me that the state of things is that the boundary between NLP and CL is unclear. I think the goal of any related Wikipedia articles should be to represent the state of things as accurately as possible, NOT to solve the clarity problem. Thus, both articles should clearly :) state the various opinions about these fields. Indquimal (talk) 23:15, 20 June 2009 (UTC)[reply]

I would like to mention my company, Creative Virtual, because we have over 10 years experience working with virtual assistant natural language web applications, and link to the automated online assistant page. — Preceding unsigned comment added by 75.99.227.213 (talk) 20:46, 9 November 2011 (UTC)[reply]

I append the content from that page, in case anyone wants to merge it in here.

Charles Matthews 09:35, 6 May 2004 (UTC)[reply]

Natural language processing

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It deals with the problems inherent in the processing and manipulation of natural language.

Some examples of the major tasks in Natural Language Processing are:

  • Text-to-speech
  • Speech recognition
  • Natural language generation
  • Machine translation
  • Question answering
  • Information retrieval
  • Information extraction
  • Text proofing

Some of the difficult problems in NLP are:

Word boundary detection

In spoken language there are no gaps between words; where to place word boundaries often depends on which choice makes the most sense grammatically and in context.
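A common baseline for the segmentation problem described above is greedy longest-match ("MaxMatch"). A minimal Python sketch (the function name, vocabulary, and example string are all invented for illustration):

```python
def max_match(text, vocab):
    """Greedy longest-match segmentation of unspaced text."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):         # try the longest candidate first
            if text[i:j] in vocab or j == i + 1:  # fall back to a single character
                words.append(text[i:j])
                i = j
                break
    return words

vocab = {"the", "table", "down", "there"}
print(max_match("thetabledownthere", vocab))  # ['the', 'table', 'down', 'there']
```

Greedy matching fails on genuinely ambiguous strings, which is exactly the point above: real segmenters need grammatical and contextual evidence, not just a dictionary.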


Word sense disambiguation

Many words have more than one meaning, so we have to select the meaning that makes the most sense in context.
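A classic heuristic for this task is the Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the surrounding context. A toy sketch (the glosses and example are invented; real systems use full sense inventories such as WordNet):

```python
def lesk(word, context, glosses):
    """Pick the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in glosses[word].items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

glosses = {
    "bank": {
        "finance": "an institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water such as a river",
    }
}
print(lesk("bank", "he sat on the bank of the river and watched the water", glosses))  # river
```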



Syntactic ambiguity

Grammars for natural languages are ambiguous: a sentence can often be parsed in more than one way, and selecting the most appropriate parse requires semantic and contextual information.
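The scale of this ambiguity is easy to demonstrate: a CKY-style chart can count how many distinct parse trees a grammar assigns to a sentence. A sketch using the classic "I saw the man with the telescope" example (the toy grammar in Chomsky normal form is invented for illustration):

```python
from collections import defaultdict

# Binary rules A -> B C and lexical assignments (the grammar is in CNF).
binary = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # attach the PP to the verb phrase...
    ("NP", "NP", "PP"),   # ...or to the noun phrase: the source of ambiguity
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
lexical = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, start="S"):
    """CKY chart that counts the number of distinct parse trees."""
    n = len(words)
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for a in lexical[w]:
            chart[i][i + 1][a] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c in binary:
                    if chart[i][k][b] and chart[k][j][c]:
                        chart[i][j][a] += chart[i][k][b] * chart[k][j][c]
    return chart[0][n][start]

print(count_parses("I saw the man with the telescope".split()))  # 2
```

The two parses correspond to attaching "with the telescope" to the seeing or to the man; choosing between them is exactly the semantic and contextual problem described above.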


Speech acts and plans

Sometimes what we write doesn't mean literally what is written; for instance, a good answer to "Can you give me the pencil?" is to hand over the pencil; in most contexts "Yes" is not the best answer, and when the literal answer is "No", it is often better to say something like "I'm afraid I can't see it".


Question edited into the article by User:129.27.236.115:

The Morphix-NLP link is not valid anymore. Does anybody know where to get Morphix-NLP?

Cadr

It is now. Yaron 22:40, May 17, 2004 (UTC)

Removed a spam link (several times) to a website called ivrdictionary. This is a thinly veiled attempt to put advertising on Wikipedia. Links were added by several anonymous users within a tight IP range. Website purports to list ivr terminology, but in reality it prominently displays an advertisement to Angel dot com, which is a commercial company that sells IVR related products. The same links were added to other articles that are related to IVR technology. Calltech 16:59, 17 November 2006 (UTC)[reply]


Incorporate stemming?

I suggest adding a link to stemming in the see also or subtasks or challenges. I am not sure who is responsible for editing this article though, and I don't want to edit it myself without asking. Is stemming too detailed, or a subtask of another subtask only like IR? Not sure. I thought it was a pretty popular problem. Josh Froelich 19:46, 13 December 2006 (UTC)[reply]
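For readers who haven't met the term: stemming conflates inflected forms by stripping affixes. A deliberately naive sketch (this is a toy suffix-stripper, not Porter's algorithm, and the suffix list is invented):

```python
def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in ("ational", "ization", "ement", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("parsing"))  # pars
print(stem("cats"))     # cat
```

Even this toy shows why stemming is usually treated as a preprocessing subtask for IR and similar applications rather than a goal in itself.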

"I am not sure who is responsible for editing this article though" You are, feel free to edit any wikipedia page. Yes it feels very wrong the first few time, but your fine to do so. Someone will fix it if your wrong anyhow. Scott A Herbert (talk) 13:56, 24 February 2011 (UTC)[reply]

External links

I think everyone would agree the external links section is a complete mess, full of spam, vanity links, and other links that don't add anything to the article. I count 47 external links. I'm sure there is someone out there who supports each one, but I think we can all agree that 47 is too many and that there is certainly some redundancy.

I know it can be hard to part with large chunks of an article, but I propose the following: we assume that we are going to delete all of them and anyone who wants a link kept should nominate it here on the talk page. We can then discuss whether it actually adds something unique. Please keep in mind WP:EL, also.

--Selket 22:50, 1 February 2007 (UTC)[reply]

The Implementations links seem all right. However, the R & D group links are way too many. Unfortunately, each group would want their own link up there. Also, there were a few links to blogs. Am I right in believing that those links should be deleted?

Ummonk 22:06, 4 February 2007 (UTC)[reply]

I think the Implementations links should be removed per WP:EL, WP:SPAM, and WP:NOT#LINK --Ronz 03:06, 25 October 2007 (UTC)[reply]

I cleaned the section up quickly because it had become quite the linkfarm once again. --Ronz (talk) 15:11, 18 September 2011 (UTC)[reply]

Maximum entropy methods

My vague understanding is that maximum entropy methods represent the state of the art in NLP these days; yet this article seems to fail to mention them. Could an expert clarify/elucidate? linas 13:17, 13 June 2007 (UTC)[reply]

If an article is lacking a notable subject, it's usually the case that nobody got around to adding it. Please be bold and add a review of maxent NLP stuff to the article as you see fit, remembering to cite your sources. –jonsafari 20:47, 14 June 2007 (UTC)[reply]
In most subareas of current NLP, machine learning is at the core of most implementations. It's true that Maxent (or logistic regression, as it's also known) and its generalizations (e.g. Conditional random fields) usually perform well for these tasks, but they are not the only method. I'd say that margin-based methods such as Support Vector Machines are at least as popular. Anyway, it's more important to expand the section about machine learning/statistical modeling rather than just adding a section about Maxent. Kallerdis (talk) 19:43, 29 February 2008 (UTC)[reply]
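To make the terminology above concrete, a maxent/logistic-regression classifier can be sketched in a few lines of pure Python (the task, features, and data are invented; real systems add regularization and far richer feature sets):

```python
import math

def train_logreg(data, epochs=200, lr=0.5):
    """Binary logistic regression (maxent) trained by stochastic gradient ascent.
    data: list of (feature_set, label) pairs with label in {0, 1}."""
    w = {f: 0.0 for x, _ in data for f in x}
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            z = b + sum(w[f] for f in x)
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(label = 1)
            g = y - p                        # gradient of the log-likelihood
            b += lr * g
            for f in x:
                w[f] += lr * g
    return w, b

def predict(w, b, x):
    return 1 if b + sum(w.get(f, 0.0) for f in x) > 0 else 0

# Toy task: classify a bag of words as sports-related (1) or not (0).
train = [
    ({"goal", "match"}, 1), ({"score", "team"}, 1),
    ({"rain", "cloud"}, 0), ({"sunny", "cloud"}, 0),
]
w, b = train_logreg(train)
print(predict(w, b, {"goal", "team"}))   # 1
print(predict(w, b, {"rain", "sunny"}))  # 0
```

Swapping the training objective for a margin-based one would give a linear SVM over the same features, which is why the two families are so often interchangeable in practice.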

Does anyone feel it necessary to distinguish between NLP and HLT? If so, please visit that article—it desperately needs work. On the other hand, perhaps it should simply redirect here to the NLP article. —johndburger 02:47, 22 June 2007 (UTC)[reply]

Papers

The following were added to the External links section. Perhaps one or more might be used as a reference someday?

  • Goutam Kumar Saha, English to Bangla Translator: The BANGANUBAD, International Journal -CPOL, Vol.18(4), pp.281-290, December 2005, WSPC, USA.
  • Goutam Kumar Saha, Parsing Bengali Text - an Intelligent Approach, ACM Ubiquity, Vol. 7 Issue 13, April, 2006. ACM Press, USA.
  • Goutam Kumar Saha, The EB-ANUBAD Translator: A Hybrid Scheme, International Journal ZUS, Vol. 6A(10), ZUS Press, 2005.
  • Goutam Kumar Saha, A Novel 3-Tier XML Schematic Approach for Web Page Translation, ACM Ubiquity, Vol. 6(43), ACM Press, 2005, USA.

--Ronz 17:36, 14 November 2007 (UTC)[reply]

Add confusion about accenting words?

I was going to add this in, but I thought it might not be a good idea. If you can incorporate it well and fit it in, please do: (I was going to put it after the 'I never said she stole my money' part.) Accenting words can be very helpful in giving meaning to a sentence that contains negatives, because the speaker is saying that a specific fact is not true, implying that some unstated alternative is. Sometimes accenting words in a sentence can still lead to confusion, as in "Go over there": "over" is being used to describe the relative position of the destination, but taken by itself, "over" means on top of something. The accent in this case implies a literal meaning of the word...

24.250.97.223 (talk) 04:56, 14 December 2007 (UTC)[reply]

(NLU) Merge

PRO As stated on my talk page. Not much there but don't see anything here either so maybe better to do a little something here. Perhaps a § (NLU, Semantics, Discourse, Top Level Protocols, etc.) to which the NLU article can redirect. 74.78.162.229 (talk) 21:38, 10 July 2008 (UTC)[reply]

Set these to values that seemed reasonable to me and manually created the Comments page. 74.78.162.229 (talk) 22:01, 10 July 2008 (UTC)[reply]

Clean-up/Major edit

As noted in the article header, this article needs major rewriting, restructuring and clean-up. Would anyone like to team up with me to get it done? I'm a wiki-novice but know a fair amount about NLP (and have plenty of references that I can consult). Sunfishy (talk) 17:39, 5 November 2008 (UTC)sunfishy[reply]

Yes, I can see the necessity, and I am willing to help. Let's perhaps start with a non-controversial, easy restructuring: The section Major tasks in NLP is quite a random list of NLP-related articles at the moment. I think it would be wise to differentiate between (1) NLP tasks in the sense of NLP modules a comprehensive NLP system can have (speech recognition, morphological analysis, NLU, word sense disambiguation, semantic role labeling, semantic interpretation, perhaps NLG), and (2) NLP applications such as those currently listed under this heading.
More generally, the relationship to Computational linguistics should finally be clarified. It is not a rare thing that industry uses terms different from academia, and in this case, I can see that it makes sense to have two articles. They should simply be linked to each other in a reasonable way, and then the big warning signs will no longer be necessary. This article could describe the applied side while Computational linguistics could focus on the theoretical underpinnings (which is already the case, by and large). --Thüringer ☼ (talk) 08:43, 15 January 2009 (UTC)[reply]

Subproblems

A significant subproblem not mentioned (directly) is that the great majority of people use words and grammar incorrectly. For example, one of the most frequently seen errors in written text is using "loose" for "lose", as in "Did anyone loose this book?". A typical grammatical error is a golf analyst talking about something being "between he and the hole" instead of "between the hole and him". In fact, if you listen to sportscasters on TV, hardly five minutes will go by without some kind of gross grammatical error or misuse of words. Tens of millions of people are often subjected to this for hours at a time, week after week, possibly having a negative effect on the way they speak.

Ironically, even the article is guilty of speech misuse under the "Subproblems: Speech acts and plans" heading where it says: "Can you pass the salt?" is requesting a physical action to be performed. Actually, the verb "can" means "able to" and as such, DOES request a yes or no answer rather than requesting a physical action. The correct, unambiguous wording is: "Please pass the salt." or at the very least: "Would you pass the salt, please." The question mark is intentionally not used because we are not really asking a question. Also notice that adding "please", like your mother surely told you, instantly clarifies that a physical action is being requested.

Speech is only half of communication; the other half is the cooperation of the listener in trying to understand what the speaker means regardless of errors in speech. So any computerized natural language processor must be programmed not only with proper grammar and word meanings, but also with the ability to recognize and correct for IMPROPER speech. Any NLP program which requires perfect word usage, spelling, and grammar is not going to work very well. 71.154.253.96 (talk) 14:02, 8 October 2009 (UTC)[reply]
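The point above about tolerating improper input underlies, for example, Norvig-style spelling correction: generate every string one edit away from an unknown word and keep those in the vocabulary. A toy sketch (the vocabulary is invented; a real corrector would rank candidates by frequency and context, which is also what is needed to catch usage errors like "loose" for "lose", where both forms are valid words):

```python
def edits1(word):
    """All strings one edit away: delete, transpose, replace, or insert a letter."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab):
    """Return the word itself if known, else an in-vocabulary word one edit away."""
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word

vocab = {"grammar", "lose", "loose"}
print(correct("gramar", vocab))  # grammar
```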

Forgotten Merge? Better forgotten

I do not see a discussion of the July 2008 merge suggestion. Natural language understanding is a field unto itself, and I am going to rewrite that article 99.99999% and put a "main link" so there is really no need for a merge. This article is not in good shape either, but it is a much larger field and will need much more attention. It does have several good points in it, but overall a new computer science student would be well advised not to read it until it has been cleaned up. Unless there are objections I will remove the merge flag later. Cheers. History2007 (talk) 21:12, 18 February 2010 (UTC)[reply]

Section 'Concrete problems'

The second bullet point in the section 'Concrete problems' is copied verbatim from its source, http://www.kurzweilai.net/articles/art0311.html?printable=1. Is there permission? —Preceding unsigned comment added by Jann.poppinga (talkcontribs) 14:17, 3 May 2010 (UTC)[reply]

Good observation; thank you for pointing this out. I am working on this section and the problem should naturally drop out as the restructure progresses. TehMorp (talk) 14:55, 23 June 2010 (UTC)[reply]

Sections 'Concrete Problems' and 'Major tasks'

When I began, the 'Concrete problems' section was essentially a list of largely unelucidated examples; it seems better to work the examples in with some level of explanation (or work some level of explanation in with the examples). I began to do that, and now I'm wondering whether ultimately it wouldn't be better to combine this section with the Major tasks section. That would entail including examples along with the appropriate tasks to illustrate why a particular task isn't yet solved, or what's difficult about it. There's one fairly rich example, the "time flies like an arrow" example, subparts of which could be used under several different problems, so perhaps this example could be set up at the beginning of the list and then different aspects of it referred to as appropriate.

Alternately, it could be interesting to use the examples before the task list as sort of a teaser, a "this is what we have to deal with", followed by a sort of "because of that, these are tasks that must be handled" type thematic progression.

Opinions? TehMorp (talk) 15:04, 23 June 2010 (UTC)[reply]

I think that the 'Concrete Problems' section should be dropped. The "problems" all boil down to the same issue: not being able to determine the intended meanings of words outside of their context.

The letter "A" can have many different meanings: the first letter of the English alphabet, a musical note, a grade, etc., just as the phrase "pretty little girls' school" (or any of the other phrases given) can have any of the meanings shown in the section. In each case, the meaning should be determinable by the surrounding context. It is ridiculous to say that understanding such phrases is a problem any more than is understanding which meaning of "A" is intended when no context is given for either.

Determining the intended meanings of words based on their context is not a "problem" so much as it is the essential goal of NLP. This is not to say that there cannot be ambiguities resulting from poorly worded text, but when an NLP program detects ambiguities which cannot be resolved given the surrounding context, the simple solution is to request clarification from the source of the text. 75.46.215.114 (talk) 12:10, 11 August 2010 (UTC)[reply]

I did drop this section. It was repetitive and didn't seem especially useful. The section on tasks gives a fair amount of explanation of what the issues are for the individual tasks. For more examples, refer to the articles on specific tasks. Benwing (talk) 22:18, 3 October 2010 (UTC)[reply]

Parsing weird

"And ALL fruit flies in the same manner - like bananas do;"

I don't think any program would parse "Time flies like an arrow" this way, given that neither "fruit" nor "bananas" appears in the source sentence. I suspect this was copied incorrectly, but the original link is now dead.

Should it read "And ALL time flies in the same manner - like an arrow does"? That's a pretty big change for a typo. —Preceding unsigned comment added by 216.163.72.2 (talk) 00:45, 1 October 2010 (UTC)[reply]

Citations

I don't know if the "Resources" section makes it redundant, but the text doesn't have too many citations. The "NLP using machine learning" section, which is a fairly long piece of text, hasn't got any citations at all. Isn't this needed? 90.233.154.111 (talk) 15:38, 11 November 2010 (UTC)[reply]