May 2017 - Page 187 of 189 - Cloud Computing Köln

In late 2015, Lorraine Johnson was frustrated with what she felt was the slow progress of research on Lyme disease, the tick-borne illness that strikes thousands of Americans every year, in apparently increasing numbers. So as the head of LymeDisease.org, a leading advocacy group, she launched her own study: MyLymeData.org.

The site has since surveyed 7,000 people, Johnson reported this month at a conference in the San Francisco Bay Area. The site calls itself “the first national large-scale study of chronic Lyme disease.”

It’s one of the newest large online registries built by patients with rare or understudied conditions. Feeling ignored by academics and Big Pharma, they turn to the internet, where they can pool the collective wisdom of people who might otherwise never enroll in a clinical trial.

Scientists are increasingly relying on apps and websites like this one to conduct research. But as this approach becomes more popular, researchers have questions about its inherent lack of rigor and standardization.

Online surveys typically don’t require proof of a diagnosis, and self-diagnosis is often tricky. For example, the varied symptoms of Lyme disease — arthritis, fever, headaches, shooting pain, and memory problems, among others — crop up in many other conditions, making it difficult to identify conclusively, even for doctors. It’s also tough to keep people engaged in a study when they’re scattered across the country.

“The ability to get big data on large numbers of people fairly easily, that’s a strength,” said Dr. John Aucott, director of the Lyme Disease Clinical Research Center at Johns Hopkins University, who is not involved with MyLymeData. “The advantage is you’re getting patient symptoms — but the disadvantage is you don’t truly know they’re due to Lyme disease.”

Despite these potential issues, self-reported patient data could soon help pharma companies bring drugs to market faster. Under the 21st Century Cures Act, passed in December, companies applying for certain kinds of FDA drug approvals can submit data about how patients are reacting to drugs in the real world, rather than data collected from rigorously controlled clinical trials. Critics say this provision weakens regulatory oversight.

“In terms of generating really good evidence that you’re going to make people better off, [real-world data] is a good starting point, but it’s not the finish line,” said Vinay Prasad, a hematologist-oncologist at Oregon Health & Science University.

Around 300,000 people in the United States, almost entirely in the Northeast and upper Midwest, are diagnosed annually with Lyme, according to the Centers for Disease Control and Prevention. The agency has clear criteria for diagnosing Lyme in its early days, including a tick bite, a bull’s-eye rash, and a lab blood test. A few weeks of antibiotics cures most patients. But some experience symptoms like headaches and joint and nerve pain for months, even years, after treatment.

The federal government has funded only a handful of Lyme disease studies, and one of the larger ones, published in 2001, enrolled just 130 people. All of the studies focused on patients who had symptoms even after taking antibiotics. Patients often call this advanced stage “chronic Lyme disease,” as does MyLymeData, though most doctors argue that this term has come to mean many things to different people and has no accepted definition. They and the CDC instead use “post-treatment Lyme disease syndrome.”

Johnson says that she created MyLymeData to make up for this dearth of research on patients with lingering symptoms. “The problem with Lyme disease is essentially there is no data,” she told BuzzFeed News.

She doesn’t know of any cures or treatments in the works for Lyme patients who don’t respond to antibiotics — but hopes her database will inspire pharmaceutical companies to develop some.

The people who have signed up for MyLymeData, she said, “were just very anxious to push forward research because research had really kind of left them behind.”

At a conference hosted last fall by the Lyme Disease Association and Columbia University, Johnson presented findings from an unpublished survey of more than 4,000 MyLymeData participants. The survey responses showed that people who were diagnosed at an early stage were more likely to report being healthy than those who were diagnosed later, and many claimed that their doctors had failed to accurately spot the disease. Johnson said at the time that the findings showed both the need for physicians to make diagnoses early on, and for “more effective treatments to help those patients who remain ill.”

But experts note that MyLymeData’s database may not accurately reflect Lyme patients, particularly those with persistent symptoms, the hardest stage to diagnose.

For Aucott of Johns Hopkins, making a diagnosis of post-treatment Lyme disease syndrome is a painstaking process. There isn’t one clear biomarker, like a genetic mutation or an X-ray reading, that proves someone has the condition. He and a nurse interview a patient, then review all of their medical records and lab reports, to establish both that they were healthy before getting diagnosed and showed Lyme symptoms after their antibiotic treatment. Aucott also tries to figure out if they might have anemia, thyroid disease, chronic fatigue syndrome, or something else with similar symptoms.

“I’ve seen 500 or 1,000 patients with decades of work,” Aucott told BuzzFeed News. “That’s the level of detail you need to really be convinced you have a uniform population.”

He and other experts told BuzzFeed News that they’re worried that MyLymeData allows, but doesn’t require, people to provide lab readouts and doctors’ notes to prove which stage of disease they’ve had, or even that they ever had the disease. To sign up, participants simply need to check boxes saying they live in the United States and have been diagnosed with Lyme by a health care professional. Then they answer a barrage of questions, such as how many times they’ve been infected, when they were first infected, and which symptoms they’ve had. (To test the service, a Lyme-free BuzzFeed News reporter successfully signed up and filled out responses about her nonexistent condition.)

“I’m not sure this kind of registry would be super helpful other than to tell you a little bit about what patients who have been labeled with ‘chronic Lyme’ complain about or have been treated for it,” said Paul Auwaerter, a professor of medicine and Lyme specialist at Johns Hopkins.

Others question how well the database represents actual patients.

“If the data are based on volunteered testimony, how do we know that those who choose to volunteer adequately represent those who don’t?” Paul Lantos, an assistant professor of internal medicine and pediatrics at Duke University School of Medicine, said by email. “If recruitment is promoted by advocacy groups, then how do we know those who volunteer represent Lyme disease patients more broadly?”

Johnson admitted that MyLymeData relies on the honor system, but pointed out, “I don’t think 7,000 people would take the time to go through and complete the surveys and do the follow-ups if they didn’t have a diagnosis.”

And, she said, the beauty of the database is that if a researcher wanted to study something about biomarkers, they could easily contact people through MyLymeData and ask for their lab reports and other medical records. She says that she and the academic researchers she’s working with, whose names she declined to share, will always be careful to point out the limitations of their research.

It is possible to glean accurate insights about diseases from self-reported data, says Ben Heywood, president and co-founder of PatientsLikeMe, a website where patients go to discuss their conditions. But, he added, it takes “a fair amount of manual curation” to make the data useful for research institutions that team up with the company, such as the FDA. PatientsLikeMe members have reported more than 30,000 treatments and symptoms for various illnesses, which staff then translate into medical terms (such as translating “chemo brain” into the more technical term, “cancer treatment-related cognitive impairment”).

“Doing real-world evidence or patient-generated health data in a rigorous scientific and methodological way, the way we do it, is hard,” Heywood said.

It’s understandable that chronic Lyme patients would want to take matters into their own hands after feeling ignored by mainstream doctors.

“I definitely think that many patients feel alienated from the ‘conventional’ medical community,” Lantos wrote. “This project may help us understand where our communication fails patients.”

Online patient registries like MyLymeData are becoming more and more common; by one count, there are 20. The earliest ones were dedicated to patients with rare, lethal diseases like cystic fibrosis and muscular dystrophy, and now there are ones about cardiovascular health and Alzheimer’s disease.

Registries will likely proliferate under the 21st Century Cures Act, says Kim McCleary, managing director of FasterCures, a think tank that was a key supporter of the legislation. And as for concerns that their kind of data threatens scientific standards, McCleary counters that it’s not an either-or question.

“I don’t think anyone expects [patient data] is going to replace double-blind placebo-controlled trials,” she said.

Source: BuzzFeed, “Lyme Patients Are Bending The Old Rules Of Scientific Research, To The Dismay Of Some Scientists”

May 2, 2017

by Agency

HDInsight tools for IntelliJ & Eclipse April updates

We are pleased to announce the April updates of HDInsight Tools for IntelliJ & Eclipse. This is a quality milestone, focused primarily on refactoring components and fixing bugs. We also added Azure Data Lake Store support and Eclipse local emulator support in this release. The HDInsight Tools for IntelliJ & Eclipse serve the open source community and are of interest to HDInsight Spark developers. The tools run smoothly on Linux, Mac, and Windows.

Summary of key updates

Azure Data Lake Store support

The HDInsight Visual Studio, Eclipse, and IntelliJ plugins now support Azure Data Lake Store (ADLS). Users can view ADLS entities in the service explorer, add an ADLS namespace/path when authoring, and submit Hive/Spark jobs that read from or write to ADLS in an HDInsight cluster.

To use Azure Data Lake Store, users first need to create an Azure HDInsight cluster with Data Lake Store as storage. Follow the instructions to Create an HDInsight cluster with Data Lake Store using Azure Portal.

ADLS entities can be viewed in the service explorer.

By clicking “Explorer” in the service explorer, users can explore data stored in ADLS.

Users can read/write ADLS data in their Hive/Spark jobs, using the following paths:

If Data Lake Store is the primary storage for the cluster, use adl:///. This is the root of the cluster storage in Azure Data Lake. This may translate to a path of /clusters/CLUSTERNAME in the Data Lake Store account.
If Data Lake Store is additional storage for the cluster, use adl://DATALAKEACCOUNT.azuredatalakestore.net/. The URI specifies the Data Lake Store account that the data is written to, and data is written starting at the root of the Data Lake Store.
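
The two path rules above can be sketched as a small helper. This is only an illustration, not part of the HDInsight tools: the function name `adls_uri`, the account name `mydatalake`, and the sample file path are all hypothetical.

```python
def adls_uri(path, account=None):
    """Build an adl:// URI for a path in Azure Data Lake Store.

    account=None    -> Data Lake Store is the cluster's primary storage.
    account="name"  -> Data Lake Store is additional storage
                       (the account name here is a hypothetical example).
    """
    clean = path.lstrip("/")
    if account is None:
        # Primary storage: adl:/// is the root of the cluster storage.
        return "adl:///" + clean
    # Additional storage: the URI names the account, and paths start
    # at the root of that Data Lake Store.
    return "adl://{}.azuredatalakestore.net/{}".format(account, clean)


primary = adls_uri("example/data/sample.csv")
additional = adls_uri("/example/data/sample.csv", account="mydatalake")
print(primary)
print(additional)
```

A Hive or Spark job would then use the resulting string wherever it expects an input or output path.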

Learn how to Use HDInsight Spark cluster to analyze data in Data Lake Store.

Learn how to Use Azure Data Lake Store with Apache Storm with HDInsight.

Local emulator for Eclipse plugin

The local emulator was previously supported in the IntelliJ plugin.

It is now also supported in the Eclipse plugin, with functionality and a user experience similar to the local emulator in IntelliJ.

Get more details about local emulator support.

Quality improvement

The major improvements are code refactoring and telemetry enhancements. More than forty bugs in job authoring, submission, and job view were fixed in this release to improve the quality of the tools.

Installation

If you already have HDInsight Tools for Visual Studio/Eclipse/IntelliJ installed, the new bits can be updated in the IDE directly. Otherwise, please refer to the pages below to download the latest bits or share the information with customers:

HDInsight Visual Studio plugin
HDInsight Eclipse plugin
HDInsight IntelliJ plugin

Upcoming releases

The following features are planned for upcoming releases:

Debuggability: Remote debugging support for Spark application
Monitoring: Improve Spark application view, job view and job graph
Usability: Improve installation experience; Integrate into IntelliJ run menu
Enable Mooncake support

Feedback

We look forward to your comments and feedback. If there is any feature request, customer ask, or suggestion, please do email us at hdivstool@microsoft.com. For bug submission, please submit using the template provided.
Source: Azure

May 2, 2017

by Agency

How IBM Cloud Product Insights fuels transformation

The New York Yankees didn’t build a baseball empire overnight.
I remind myself of that constantly as I prepare my team of 7- and 8-year-old aspiring major leaguers to battle for baseball royalty in the smallish hamlet of Wendell, N.C.
Some kids arrived at our inaugural practice with sound fundamentals and an unexpected mastery of baseball lingo. Others, ten practices in, still struggle with differentiating left field from right.
As such, it’s been necessary to teach the game in layers and not overwhelm the attention-challenged youngsters, many of whom would just as soon build sand forts in the infield as field a firmly struck ground ball.
Clearly, I have trouble relating.
When I think about digital transformation, there are some undeniable parallels. Like building a winning baseball club, transformation also doesn’t occur overnight. And there are clear steps to be taken for a successful shift.
The first step towards digital transformation begins with understanding your existing IT environment. You need to know how well you are utilizing the middleware and infrastructure that supports your critical applications. For years, companies have invested heavily in these mostly on-premises assets, but IT staffs often lack visibility into how well they are being leveraged.
How can you plan for the future if you don’t have a good baseline for the present?
To aid with this challenge, we recently released IBM Cloud Product Insights. A new software as a service (SaaS) offering available on IBM Bluemix, IBM Cloud Product Insights provides IT staff with visibility into IBM enterprise software usage as well as cross-product inventory tracking—all from a single dashboard.
This solves a significant challenge for IT administrators and capacity planners at large companies. After years of investing in and deploying enterprise software, administrators are still using spreadsheets as the mechanism to track software instances and version levels. It can be even more difficult to keep pace with nimble, born-on-digital competitors when you’re stuck in a spreadsheet.
The reason: without fully understanding what you have within your environment and how well existing investments are being utilized, it’s difficult to make astute decisions on future investment in innovative cloud capabilities.
IBM Cloud Product Insights more than fills this gap. Not only does it provide visibility into your connected software instances, versions, and insights into usage, it also provides intelligent recommendations on available cloud services. This is step two in your journey towards digital transformation.
Recommendations are tailored to your connected environment and are intended to provide insight into what is available to optimize your existing IT investments. In other words, IBM Cloud Product Insights recommends new opportunities to make what you already have running even better, helping you squeeze the most value out of your hybrid environment.
To put it another way: we recognize that our clients’ journeys into digital transformation aren’t a sudden lift-and-shift into the public cloud. The fastest path to better business outcomes is to embrace existing investments and add new cloud capabilities on top, fueling innovation.
Simply stated, we embrace hybrid cloud around here.
Today, IBM Cloud Product Insights supports most IBM middleware products, including IBM WebSphere, IBM MQ, IBM Integration Bus and IBM Operational Decision Manager. Support for additional IBM products will be added soon.
IBM Cloud Product Insights is available at no cost from IBM Bluemix. Get connected and accelerate your team’s home-run digital transformation journey today.
Source: Thoughts on Cloud

May 1, 2017

by Agency

Optimization tips and tricks on Azure SQL Server for Machine Learning Services

Summary

SQL Server 2016 introduced a new feature called R Services. Microsoft recently announced a preview of the next version of SQL Server, which extends this advanced analytics capability to Python. Running R or Python in-database at scale keeps the analytics close to the data and eliminates the burden of data movement. It also simplifies the development and deployment of intelligent applications. To get the most out of SQL Server, however, fine-tuning the model itself is not enough, and a well-tuned model can still fail to meet performance requirements. Quite a few optimization tips and tricks can boost performance significantly. In this post, we apply a few of them to a resume-matching scenario, which mimics a large-volume prediction workflow, to showcase how those techniques can make data analytics more efficient and powerful. The three main optimization techniques introduced in this post are as follows:

Fully durable memory-optimized tables
CPU affinity and memory allocation
Resource governance and concurrent execution

This blog post is a short summary of how the above optimization tips and tricks work with R Services on Azure SQL Server. Those optimization techniques not only work for R Services, but for any Machine Learning Services integrated with SQL Server. Please refer to the full tutorial for sample code and step-by-step walkthroughs.

Description of the Sample Use Case

The sample use case for both this blog and its associated tutorial is a resume-matching example. Finding the best candidate for a job position has long been a labor-intensive art that requires a lot of manual effort from search agents. Finding candidates with certain technical or specialized qualities in the massive amount of information collected from diverse sources has become a new big challenge. We developed a model to search for good matches among millions of resumes for a given position. Formulated as a binary classification problem, the machine learning model takes both the resume and the job description as inputs and produces the probability of a good match for each resume-job pair. A user-defined probability threshold is then used to select the good matches.

A key challenge in this use case is that for each new job, we will need to match it with millions of resumes within a reasonable time frame. The feature engineering step, which produces thousands of features (2600 in this case), is a significant performance bottleneck during scoring. Hence, achieving a low matching (scoring) latency is the main objective in this use case.

Optimizations

There are many different types of optimization techniques, and we are going to discuss a few of them using the resume-matching scenario. In this blog, we explain at a high level why and how those optimization techniques work. For more detailed explanations and background knowledge, please refer to the included reference links. The results in the tutorial are expected to be reproducible using a similar hardware configuration and the SQL scripts.

Memory-optimized table

On modern machines, memory is rarely a constraint in terms of size or speed, and hardware advances have made large amounts of RAM affordable. Meanwhile, data is being produced faster than ever, and some tasks need to process it with low latency. Memory-optimized tables take advantage of these hardware advances to tackle this problem. They reside primarily in memory, so data is read from and written to memory [1]. For durability, a second copy of the table is maintained on disk, and data is read from disk only during database recovery. Using memory in this way can deliver high scalability and low latency, especially when tables are read from and written to very frequently [2]. You can find a detailed introduction to memory-optimized tables on this blog [1]. You can also watch this video [3] to learn more about the performance benefits of In-Memory OLTP.

In the resume-matching scenario, we need to read all the resume features from the database and match them against a new job opening. With memory-optimized tables, resume features are stored in main memory, which can significantly reduce disk IO. In addition, since predictions are written back to the database concurrently from different batches, memory-optimized tables can provide a further performance gain. With memory-optimized table support in SQL Server, we achieved low latency when reading from and writing to tables, and the development experience was seamless. Fully durable memory-optimized tables were created when the database was created. The rest of the development is exactly the same as before, regardless of where the data is stored.
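
As a rough illustration of what "fully durable" means in practice, the sketch below builds the kind of T-SQL statement that creates such a table. It is not the tutorial's script: the table name `dbo.ResumeFeatures` and its columns are hypothetical, and the statement assumes the database already has a memory-optimized filegroup. The `MEMORY_OPTIMIZED = ON` and `DURABILITY = SCHEMA_AND_DATA` options are the standard T-SQL settings for a memory-optimized table whose data survives a restart.

```python
def memory_optimized_ddl(table, columns, key):
    """Return T-SQL that creates a fully durable memory-optimized table.

    columns: list of (name, sql_type) pairs; key: name of the primary-key column.
    """
    lines = []
    for name, sql_type in columns:
        line = "    {} {} NOT NULL".format(name, sql_type)
        if name == key:
            # Memory-optimized tables need at least one index; a
            # nonclustered primary key satisfies that requirement.
            line += " PRIMARY KEY NONCLUSTERED"
        lines.append(line)
    body = ",\n".join(lines)
    # DURABILITY = SCHEMA_AND_DATA keeps a second copy on disk for recovery.
    return (
        "CREATE TABLE {} (\n{}\n) "
        "WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);"
    ).format(table, body)


ddl = memory_optimized_ddl(
    "dbo.ResumeFeatures",                     # hypothetical table name
    [("ResumeId", "INT"), ("F1", "FLOAT")],  # hypothetical columns
    key="ResumeId",
)
print(ddl)
```

Once the table exists, queries against it look the same as queries against a disk-based table, which matches the point above that the rest of the development does not change.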

CPU affinity and memory allocation

With SQL Server 2014 SP2 and later versions, soft-NUMA is automatically enabled at the database-instance level when the SQL Server service starts [4, 5, 6]. If the database engine detects more than 8 physical cores per NUMA node or socket, it automatically creates soft-NUMA nodes that ideally contain 8 cores each, although the count can go down to 5 or up to 9 logical cores per node. SQL Server writes a message to its log when it detects more than 8 physical cores in a socket.

Figure 1: SQL log of auto Soft-NUMA, 4 soft NUMA nodes were created

As shown in Figure 1, our test machine had 20 physical cores, and 4 soft-NUMA nodes were created automatically so that each node contained 5 cores. Soft-NUMA makes it possible to partition service threads per node, which generally increases scalability and performance by reducing IO and lazy writer bottlenecks. We then created 4 SQL resource pools and 4 external resource pools [7], setting the CPU affinity so that each SQL pool and its external pool use the same set of CPUs in one node. This way, SQL Server and the R processes avoid foreign memory access because both run within the same NUMA node, which can reduce memory access latency. Those resource pools are then assigned to different workload groups to improve hardware resource utilization.

Soft-NUMA and CPU affinity cannot divide the physical memory in each physical NUMA node. All the soft-NUMA nodes in the same physical NUMA node receive memory from the same OS memory block, and there is no memory-to-processor affinity. However, we should pay attention to how memory is allocated between SQL Server and the R processes. By default, only 20% of memory is allocated to R Services, which is not enough for most data analytics tasks. Please see How To: Create a Resource Pool for R [7] for more information. The memory allocation between the two needs to be fine-tuned, and the best configuration varies case by case. In the resume-matching use case, increasing the external memory resource allocation to 70% gave the best results.
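
The pool layout described in this section can be sketched as follows. This is an illustration under several assumptions, not the tutorial's script: the pool names are hypothetical, the CPU IDs of each soft-NUMA node are assumed to be contiguous (0 to 4, 5 to 9, and so on), and the memory cap is passed as a parameter because the post does not say how the 70% figure is applied across the individual pools. The generated statements use the standard `CREATE RESOURCE POOL` and `CREATE EXTERNAL RESOURCE POOL` syntax.

```python
def cpu_ranges(total_cpus, nodes):
    """Split CPU IDs 0..total_cpus-1 into contiguous, equal-sized ranges."""
    if total_cpus % nodes != 0:
        raise ValueError("this sketch assumes equal-sized nodes")
    size = total_cpus // nodes
    return [(i * size, (i + 1) * size - 1) for i in range(nodes)]


def pool_statements(total_cpus, nodes, external_memory_percent):
    """Return T-SQL for one SQL pool and one external pool per CPU range."""
    statements = []
    for n, (first, last) in enumerate(cpu_ranges(total_cpus, nodes)):
        # SQL Server side: pin the pool's schedulers to this CPU range.
        statements.append(
            "CREATE RESOURCE POOL sql_pool_{} "
            "WITH (AFFINITY SCHEDULER = ({} TO {}));".format(n, first, last)
        )
        # R side: pin external processes to the same CPUs and cap their memory.
        statements.append(
            "CREATE EXTERNAL RESOURCE POOL r_pool_{} "
            "WITH (AFFINITY CPU = ({} TO {}), MAX_MEMORY_PERCENT = {});".format(
                n, first, last, external_memory_percent
            )
        )
    # Changes to Resource Governor take effect only after a reconfigure.
    statements.append("ALTER RESOURCE GOVERNOR RECONFIGURE;")
    return statements


ranges = cpu_ranges(20, 4)
statements = pool_statements(20, 4, 70)
for statement in statements:
    print(statement)
```

Pairing each SQL pool with an external pool on the same CPU range is what keeps SQL Server and the R processes for one batch inside a single node.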

Resource governance and concurrent scoring

To scale up the scoring problem, a good practice is to adopt a map-reduce approach: split the millions of resumes into multiple batches, then score the batches concurrently. The parallel processing framework is illustrated in Figure 2.

Figure 2: Illustration of parallel processing in multiple batches

Those batches are processed on different CPU sets, and the results are collected and written back to the database. Resource governance in SQL Server is designed to implement this idea. We can set up resource governance for R Services on SQL Server [8] by routing those scoring batches into different workload groups (Figure 3). More information about Resource Governor can be found on this blog [9].
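
The split, score, and collect pattern can be sketched outside SQL Server as follows. This is a generic illustration, not the tutorial's implementation: `score_batch` is a placeholder that stands in for the real model call, and a thread pool stands in for the separate workload groups.

```python
from concurrent.futures import ThreadPoolExecutor


def split_batches(ids, n_batches):
    """Split ids into n_batches contiguous batches whose sizes differ by at most one."""
    base, extra = divmod(len(ids), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + base + (1 if i < extra else 0)
        batches.append(ids[start:end])
        start = end
    return batches


def score_batch(batch):
    """Placeholder scorer: stands in for the real model call on one batch."""
    return [(resume_id, (resume_id % 10) / 10.0) for resume_id in batch]


def score_all(ids, n_batches):
    batches = split_batches(ids, n_batches)
    # Map: score every batch concurrently. Reduce: concatenate the results.
    with ThreadPoolExecutor(max_workers=n_batches) as pool:
        results = list(pool.map(score_batch, batches))
    return [row for batch_result in results for row in batch_result]


scores = score_all(list(range(1000)), n_batches=8)
print(len(scores))
```

In the setup described here, each batch would additionally be routed to its own workload group so that batches do not compete for the same CPUs.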

Figure 3: Resource governor (from: https://docs.microsoft.com/en-us/sql/relational-databases/resource-governor/resource-governor)

Resource governor can help divide the available resources (CPU and memory) on a SQL Server to minimize the workload competition using a classifier function [10, 11]. It provides multitenancy and resource isolation on SQL Server for different tasks to potentially improve the execution and provide predictable performance.

Other Tricks

One pain point with R is that feature engineering is usually processed on a single CPU. This is a major performance bottleneck for most data analysis tasks. In our resume-matching use case, we need to produce 2,500 cross-product features that are then combined with the original 100 features (Figure 4). This whole process would take a significant amount of time if everything were done on a single CPU.

Figure 4: Feature engineering of our resume-matching use case
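
To make the feature counts concrete, here is a minimal sketch of cross-product feature engineering. It assumes the 100 original features split evenly into 50 resume-side and 50 job-side features, which would give 50 × 50 = 2,500 products; the post states only the totals, so the even split is an assumption, as are the function name and the sample values.

```python
def cross_features(resume_features, job_features):
    """Return the original features followed by all pairwise products."""
    products = [r * j for r in resume_features for j in job_features]
    return list(resume_features) + list(job_features) + products


resume = [float(i) for i in range(50)]  # 50 hypothetical resume features
job = [1.0] * 50                        # 50 hypothetical job features
features = cross_features(resume, job)
print(len(features))
```

Under that assumption, the result has 100 original features plus 2,500 products, for 2,600 features per resume-job pair.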

One trick here is to create an R function for feature engineering and pass it as the rxTransform function during training. The machine learning algorithm is implemented with parallel processing, so the feature engineering, as part of training, is also processed on multiple CPUs. Compared with the regular approach, in which feature engineering is conducted before training and scoring, we observed a 16% improvement in scoring time.

Another trick that can potentially improve performance is to use the SQL compute context within R [12]. Since we have isolated resources for different batch executions, we need to isolate the SQL query for each batch as well. By using the SQL compute context, we can parallelize the SQL queries that extract data from tables and keep each batch's data within the same workload group.

Results and Conclusion

To fully illustrate those tips and tricks, we have published a detailed step-by-step tutorial. We also ran a few benchmark tests scoring 1.1 million rows of data. We used the RevoScaleR and MicrosoftML packages separately to train a prediction model, then compared the scoring time with and without the optimizations. Figures 5 and 6 summarize the best performance results for the RevoScaleR and MicrosoftML packages. The tests were conducted on the same Azure SQL Server VM using the same SQL query and R code. Eight batches per matching job were used in all tests.

Figure 5: RevoScaleR scoring results

Figure 6: MicrosoftML scoring results

The results suggested that the number of features had a significant impact on the scoring time. Also, using those optimization tips and tricks could significantly improve the performance in terms of scoring time. The improvement was even more prominent if more features were used in the prediction model.

Acknowledgement

Lastly, we would like to express our thanks to Umachandar Jayachandran, Amit Banerjee, Ramkumar Chandrasekaran, Wee Hyong Tok, Xinwei Xue, James Ren, Lixin Gong, Ivan Popivanov, Costin Eseanu, Mario Bourgoin, Katherine Lin and Yiyu Chen for the great discussions, proofreading and test-driving the tutorial accompanying this blog post.

References

[1] Introduction to Memory-Optimized Tables

[2] Demonstration: Performance Improvement of In-Memory OLTP

[3] 17-minute video explaining In-Memory OLTP and demonstrating performance benefits

[4] Understanding Non-uniform Memory Access

[5] How SQL Server Supports NUMA

[6] Soft-NUMA (SQL Server)

[7] How To: Create a Resource Pool for R

[8] Resource Governance for R Services

[9] Resource Governor

[10] Introducing Resource Governor

[11] SQL SERVER – Simple Example to Configure Resource Governor – Introduction to Resource Governor

[12] Define and Use Compute Contexts
Source: Azure