Tuesday, November 17, 2009

At long last... labeled data!

By incredibly popular demand, we are providing a labeled data set. These data are very similar to those used in the competition, although we are reserving the actual competition data for future use. This new data set may be used to develop and evaluate novel algorithms.

The data set is 480 MB, and can be downloaded from the public data set archive of the PHM Society:
https://www.phmsociety.org/references/datasets

Thursday, October 8, 2009

And the winners are...

The winners of the 2009 PHM Society challenge are posted here. Congratulations to both teams! And thanks to everyone who participated...

Sunday, July 12, 2009

FAQ #12: Scoring

Q: What score will be used to rank competitors? Best ever or last submitted?

A: Good question! There are arguments to be made either way (and in fact we have made them!). You will be ranked on the basis of your best score ever.

Thursday, July 9, 2009

FAQ #11: Closing Time

Q: When are the final submissions due? Will the closing date be extended?

A: Entries submitted after 13 July 2009 23:59 Eastern Daylight Time (14 July 2009 03:59 Greenwich Mean Time) are not eligible for the competition. This closing date is firm.

Thursday, June 25, 2009

FAQ#10: bad key fault

Q: Is the bad key fault equivalent to a "no load" condition?

A: No, there was partial loading. The shaft and the brake turned at different rates due to slippage between the output shaft and the brake.

Monday, May 25, 2009

FAQ#9: invited papers

Q: Your web site states that the "top scoring teams will be invited to give presentations at the special session, and submit papers to IJPHM". Does this mean that only the first two (one from each category) will be invited?

A: No. We expect several competitors from both categories will be invited to present, depending on how they do.

FAQ#8: releasing the answers

Q: Are you going to post the solutions of the Data Challenge, e.g., after the competition?

A: Probably not all of them.

As with last year, we aren't releasing the full data set. Instead, we are holding on to it for use as a "blind standard" for comparing algorithms.

Thursday, May 21, 2009

FAQ#7: Experimental Conditions II

Q: We would like to ask you to clarify your comment in FAQ#6. We are confused and not sure that we understood it correctly.

What do you mean by "a case"? Did you mean that "a case" is a specific combination of good and bad components, and that overall there are only 14 such combinations?

Are you saying that the same combination of faults is replicated 40 times (4 times in each combination of speed and load)?

If the answer to the last question is yes, the implication would be 40 identical lines in the results file for each of the 14 cases. Is that correct?

A: Yes, the same combination of faults is replicated 40 times.

However, the goal of the competition is to promote the exchange of innovative ideas and to advance PHM as a scientific discipline, and, this year in particular, to advance fault detection and magnitude estimation for a generic gearbox using accelerometer data and information about bearing geometry. Participants are scored on their ability to correctly identify the type, location, and magnitude of damage in a gear system.

Algorithms that rely on knowledge of this specific data set will be disqualified.

Monday, May 11, 2009

FAQ#6: Experimental Conditions

Some explicit guidance on the experimental conditions.

There are six helical cases and eight spur cases; within each case, there are four replications (repeated test conditions), five speeds, and two loads.

So: (6*4*5*2)+(8*4*5*2) = 560 files
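
For those who like to check the arithmetic in code, here is the same count in Matlab (a minimal sketch; the variable names are ours, not part of the data set):

    % Sanity check of the file count implied by the design above
    nHelical = 6;   nSpur = 8;             % helical and spur cases
    nRep = 4;   nSpeed = 5;   nLoad = 2;   % replications, speeds, loads
    nFiles = (nHelical + nSpur) * nRep * nSpeed * nLoad   % = 560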

Friday, May 1, 2009

FAQ#5: Submitting results & scoring

Q: How often can we submit results, how are they scored, and how frequently is the leaderboard updated?

A: You can submit results once every 24 hours. The leaderboard is updated once every day, at approximately noon PST.

Results are scored by calculating the Hamming distance between the submitted results file and the ground truth. For example, if the true state of the system is [1,0,0,1,0] and you submit [1,1,0,0,0], your score is 2.
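
In Matlab, that example works out as follows (a minimal sketch; the vector names are ours):

    % Hamming distance between submitted labels and ground truth
    truth     = [1 0 0 1 0];
    submitted = [1 1 0 0 0];
    score     = sum(truth ~= submitted)   % = 2 (positions 2 and 4 differ)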

FAQ#4: Hey, people might cheat!

Q: Will the leaderboard allow participants to use the evaluation results as "feedback" to tune their algorithms?

A: Please do not do this! It is pointless...

The Data Challenge is about promoting the exchange of innovative ideas and advancing PHM as a scientific discipline. Please compete fairly, and do not try to game the system. You may score well on the leaderboard, but ultimately you will have to publish your approach, and if it is incapable of achieving similar results on a holdout dataset, you will be disqualified.

Note that in the spirit of fair competition, we allow only one account per team. Please do not register multiple times under different user names, under fictitious names, or using anonymous accounts. Competition organizers reserve the right to delete multiple entries from the same person (or team) and will aggressively disqualify those who are trying to game the system or use fictitious identities.

FAQ#3: J'accuse!

Q: The PHM challenge should also allow researchers from the model-free camp to participate. This means it should be possible to use pattern-recognition-based supervised learning techniques. Since the data is not labeled, it is not possible to run experiments related to classification performance. The authors of the challenge should provide two different data sets: one labeled, for learning, and one unlabeled, for testing. Currently only model-based fault diagnosis can be done. This excludes a complete branch of research.

A: Any format will favor some group of researchers. Last year, the competition was exactly what you asked for: essentially a homework problem, with neatly packaged labeled data that gave machine learning researchers a huge advantage. Next year, some other group will assuredly have an advantage.

However, we have made every effort to level the playing field. We have provided background on domain fundamentals; Matlab code for algorithms to extract features from the data; and links to excellent papers on the analysis of this type of data. Moreover, we believe that the problem is difficult enough that very innovative approaches may be required to solve it: researchers who are looking at the problem from a "fresh perspective" may actually have an advantage...

FAQ#2: labeled data

Q: Will some normal operational data or faulty data with ground truth be provided to help calibrate the algorithms?

A: No. The truth is in the data and the system geometry information.

More help getting started...

This paper is an excellent introduction to gearbox data analysis.

This paper is a good introduction to time synchronous averaging.
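
To give a flavor of the technique, here is a minimal sketch of time synchronous averaging in Matlab. It assumes the signal has already been resampled so that every shaft revolution spans exactly the same number of samples (in practice this requires a tachometer signal); the function name and arguments are ours:

    % Minimal sketch of time synchronous averaging (TSA). x is one
    % accelerometer channel (a vector), resampled so that each shaft
    % revolution spans exactly sampsPerRev samples.
    function xTSA = tsa_sketch(x, sampsPerRev)
        nRev = floor(length(x) / sampsPerRev);    % whole revolutions only
        revs = reshape(x(1:nRev*sampsPerRev), sampsPerRev, nRev);
        xTSA = mean(revs, 2);   % content not synchronous with the shaft
                                % averages out; gear-mesh content remains
    end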

Wednesday, April 15, 2009

FAQ#1: "error" and "bad key" faults

Q: On the page which shows how the results should be formatted, there is a fault for gears called "Error" and one for shaft called "Bad Key." What are the physical characteristics of these faults?

A: "Error" is short for a manufacturing error, such as an eccentric gear tooth spacing error.

"Bad Key" is when the key sheared in the keyway, allowing the output shaft to rotate at a different speed than the brake.

Saturday, April 11, 2009

Some help getting started...

Not familiar with analyzing data from rotating machinery? Here are some previously published algorithms coded in Matlab to help get you started extracting features from the data...
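
If you just want a feel for what "features" means here, a few classic condition indicators take only a couple of lines of Matlab (a minimal sketch of standard statistics, not the specific published algorithms provided above):

    % x is one channel of accelerometer data (a vector)
    rmsVal = sqrt(mean(x.^2));                    % overall vibration energy
    kurt   = mean((x - mean(x)).^4) / var(x)^2;   % impulsiveness
    crest  = max(abs(x)) / rmsVal;                % peak-to-RMS ratio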

Friday, April 10, 2009

The challenge data is available via bittorrent!

The data is big - 420 MB - so we have released it via bittorrent. The torrent is here. Please seed! The data will also be made available on the web site.

If you were born before 1984 or so, as I imagine many of you were, you may never have heard of the bittorrent protocol. You can learn more about the protocol here. In general, it is a peer-to-peer file sharing protocol in which each person downloading the data makes it available for other peers to download concurrently. After the file is successfully downloaded, you may continue to make the data available (please do!); this is known as "seeding". This distributed approach makes the distribution of large amounts of data fast, efficient, and reliable.

The µTorrent and Vuze clients are both quite good.

Wednesday, April 1, 2009