One of the top complaints data scientists have is the amount of time it takes to clean and label text data to prepare it for machine learning. In fact, it is the complaint. If you’re in the data cleaning business at all, you’ve seen the statistics – preparing and cleaning data can eat up almost 80 percent of a data scientist’s time, according to a recent CrowdFlower survey.[1]
With so much time going to preparation, less data is being used. One estimate published by PwC maintains that businesses use only 0.5 percent of the data that’s available to them.[2]
Consider, also, the issues caused by data that’s labeled incorrectly. Poor-quality data can proliferate, leading to a greater error rate, higher storage fees, and additional cleaning costs.
And all the while, the demand for data-driven decision-making increases.
What makes for good data?
Data scientists work with a wide range of text data including social media posts, product reviews, call center voice-to-text data, academic libraries, product descriptions…it’s an endless stream of text data that can produce insight and value if analyzed properly.
Normalizing this data presents the first real hurdle for data scientists. Just getting the data into a format where it can be looked at for labeling is a cumbersome task.
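As a rough illustration of what that normalization step can involve, here is a minimal Python sketch using only the standard library. The function name `normalize_text` and the specific cleanup steps (unescaping HTML entities, stripping tags and URLs, unifying Unicode forms, lowercasing, collapsing whitespace) are assumptions chosen for the example, not a description of any particular product's pipeline. Real-world sources usually need more than this.

```python
import html
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Bring raw text from mixed sources into one plain, comparable form.

    Hypothetical helper for illustration only.
    """
    text = html.unescape(raw)                    # "&amp;" -> "&", "&nbsp;" -> non-breaking space
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML tags
    text = unicodedata.normalize("NFKC", text)   # unify Unicode variants (e.g. non-breaking spaces)
    text = re.sub(r"https?://\S+", " ", text)    # drop bare URLs
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace


print(normalize_text("  <p>Great&nbsp;BRAKE pads!</p> http://t.co/x "))
```

Once every source passes through the same function, a tweet, a product review, and a call transcript can all be handed to the same labeling step.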
Once the data is normalized, there are a few approaches and options for labeling it. Depending on the size of the dataset, it could be labeled “by hand” or by matching data to a taxonomy. If data scientists are working with a specific set of data in a specific subject area, there may be a taxonomy designed for that system. Mapping to an auto parts taxonomy is a fantastic way to organize data about auto parts – but a horrible way to map customer reviews about an auto parts store.
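To make the taxonomy-matching idea concrete, here is a small Python sketch. The three-node `TAXONOMY`, its keyword sets, and the `label_text` function are all invented for this example. Simple keyword counting stands in for whatever matching a production system would use, and it is not how eContext classifies text.

```python
import re
from collections import Counter

# Hypothetical miniature taxonomy: each node is a path of tiers
# mapped to a set of keywords that signal that topic.
TAXONOMY = {
    ("Vehicles", "Auto Parts", "Brakes"): {"brake", "brakes", "rotor", "rotors", "caliper"},
    ("Vehicles", "Auto Parts", "Batteries"): {"battery", "batteries", "alternator"},
    ("Retail", "Customer Service"): {"staff", "service", "refund", "rude", "helpful"},
}


def label_text(text, taxonomy=TAXONOMY):
    """Return (taxonomy path, keyword hits) pairs, best match first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = Counter()
    for path, keywords in taxonomy.items():
        hits = sum(1 for token in tokens if token in keywords)
        if hits:
            scores[path] = hits
    return [(" > ".join(path), count) for path, count in scores.most_common()]


review = "The staff were rude and refused a refund on my brake pads"
print(label_text(review))
# [('Retail > Customer Service', 3), ('Vehicles > Auto Parts > Brakes', 1)]
```

With only the two auto parts nodes available, this review would be filed under Brakes. Adding a customer service node lets the review land where it belongs, which is the case for a broader taxonomy.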
Label Text Data with a General Taxonomy
More than ten years ago, our company launched a meta search engine called Info.com. Serving up relevant results – and ads – required a deep and thorough understanding of search terms. So, we set out to map the most-searched-for words on the internet. The result was a huge taxonomy (it took more than 1 million hours of labor to build). Once that was complete, we realized that our nifty tool had value to a lot of other people, so we launched eContext, an API that can take text data from any source and map it – in real time – to a taxonomy that is curated by humans. A general taxonomy, eContext has 500,000 nodes on topics that range from children’s toys to arthritis treatments.
eContext also sets itself apart as being a very deep taxonomy. The IABC provides an industry-standard taxonomic structure for retail, which contains 3 tiers of structure. The eContext taxonomy, which incidentally covers thousands and thousands of retail topics, offers up to 25 tiers.
For data scientists, this level of depth and such a wide range of topics in a general taxonomy means, simply, more accurate text labeling. And because the API can take raw text data from anywhere and map it in real time, data scientists can take back a big chunk of the time they used to spend normalizing. They can focus instead on refining labels and doing the work they love – analyzing data.
Give us a Try
We’re as excited as everyone else about the potential for machine learning, artificial intelligence, and neural networks – we want everyone to have clean data, so we can get on with the business of putting that data to work.
Try us out. You can see a mini-demonstration at http://www.econtext.ai/try. Simply type in a URL, a Twitter handle, or paste a page of text to see how we classify it. We think you’ll be impressed enough to give us a call.
We’re very happy to talk with you about your specific needs and walk you through a demo of eContext.
Additionally, if you’re interested in learning more about how a general taxonomy supports better machine learning initiatives, read our whitepaper, “Contextual Machine Learning – It’s Classified,” by Seth Grimes. The paper outlines five ways that deep text classification can improve machine learning accuracy.
______
[1] CrowdFlower Data Report, 2017, p1, https://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport.pdf
[2] PwC, Data and Analysis in Financial Research, Financial Services Research, https://www.pwc.com/us/en/industries/financial-services/research-institute/top-issues/data-analytics.html